I generated a spectrogram of a "seven" utterance using the "egs/tidigits" code from Kaldi, using 23 bins, a 20 kHz sampling rate, a 25 ms window, and a 10 ms shift. The spectrogram appears below, visualized via the MATLAB imagesc function:
I am experimenting with using Librosa as an alternative to Kaldi. I set up the code below using the same number of bins, sampling rate, and window length / shift as above.
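    import librosa
    import numpy as np

    # Load the utterance, resampling to 20 kHz
    time_series, sample_rate = librosa.core.load("7a.wav", sr=20000)
    spectrogram = librosa.feature.melspectrogram(y=time_series, sr=20000, n_mels=23, n_fft=500, hop_length=200)
    # logamplitude was removed in later Librosa releases; librosa.power_to_db replaces it
    log_S = librosa.core.logamplitude(spectrogram)
    np.savetxt("7a.txt", log_S.T)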
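For reference, the n_fft and hop_length values in the call above follow from converting the 25 ms window and 10 ms shift to samples at 20 kHz. A minimal sketch of that arithmetic (the helper name ms_to_samples is made up for this example and is not a Librosa function):

```python
def ms_to_samples(duration_ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sample_rate / 1000))


sample_rate = 20000  # Hz, as in the question

n_fft = ms_to_samples(25, sample_rate)       # 25 ms analysis window
hop_length = ms_to_samples(10, sample_rate)  # 10 ms frame shift

print(n_fft, hop_length)
```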
However, when I visualize the resulting Librosa spectrogram of the same WAV file, it looks different:
Can someone please help me understand why these are different? Across other WAV files I've tried, I notice that with the Librosa script above, fricatives (like the /s/ in "seven" in the above example) are being cut off, and this is affecting my digit classification accuracy. Thank you!
Kaldi applies a lifter by default on the DCT output, which is why the upper coefficients are attenuated. See details here.
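To make the liftering step concrete, here is a minimal sketch of how a cepstral lifter rescales each coefficient. It assumes the HTK-style weights 1 + (Q/2) * sin(pi * i / Q) with Q = 22, which I believe corresponds to Kaldi's default --cepstral-lifter value; check the Kaldi source before relying on it. The function names and the 13-coefficient input are made up for this example and are not part of Kaldi or Librosa.

```python
import numpy as np


def lifter_weights(num_ceps: int, q: float = 22.0) -> np.ndarray:
    """Per-coefficient scaling factors, assuming the HTK-style lifter formula."""
    i = np.arange(num_ceps)
    return 1.0 + 0.5 * q * np.sin(np.pi * i / q)


def apply_lifter(cepstra: np.ndarray, q: float = 22.0) -> np.ndarray:
    """Scale each cepstral coefficient (columns) of a (frames, ceps) array."""
    return cepstra * lifter_weights(cepstra.shape[1], q)


def remove_lifter(liftered: np.ndarray, q: float = 22.0) -> np.ndarray:
    """Undo apply_lifter by dividing by the same weights."""
    return liftered / lifter_weights(liftered.shape[1], q)


# Hypothetical input: 5 frames of 13 cepstral coefficients
rng = np.random.default_rng(0)
cepstra = rng.standard_normal((5, 13))

liftered = apply_lifter(cepstra)
restored = remove_lifter(liftered)

print(lifter_weights(13))
```

When comparing features from two toolkits, the same lifter has to be applied on both sides, or removed from both, before the coefficients can be compared value for value.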