This post introduces the reader to speech signal processing, with a focus on speaker verification. First, the term speaker verification will be explained. Next, we will cover the very basics of the human vocal system as well as the human auditory system; an understanding of both is important in order to follow some of the techniques common in speech processing. After that, we will have a look at the signal representation of speech. Finally, we will talk about the basic elements of a speaker verification system.
The most straightforward example would be a system where the voices of different test persons are used to predict their gender or age. The information needed to determine gender or age is hidden in the speech signal itself and is not part of the speech in a linguistic sense. This example shows the difference between speaker verification and speech recognition very clearly. In the optimal case, a speaker verification system does not depend on the input text at all, since the relevant information is part of the speech signal itself. However, text-independent systems are far more complicated than text-dependent ones. For this reason, text-dependent speaker verification systems exist as well, even though they are much easier to spoof.
Human vocal system
The basic elements of the human vocal system are the lungs, trachea, mouth cavity and nasal cavity. Each of these elements consists of various sub-elements which can change shape and influence the airflow in many different ways. Because of these many elements and influences, the vocal tract is a highly dynamic, non-linear and non-stationary system. In order to model the vocal tract, we have to make assumptions and simplifications. We will talk about modeling techniques and feature extraction in more detail at a later point.
Process of an utterance
The lungs build up a pressure that causes an airflow through the vocal system. In the larynx, the airflow is influenced by the vocal folds: depending on the tension applied to them, they can be completely closed or opened, and during speech they go through many opening and closing cycles. The tension usually determines the pitch of the momentary utterance. Next, in the mouth cavity, there are several elements which can influence the airflow. First, the tongue is a very flexible, agile and strong muscle and is crucial for speech. Second, the lips and teeth also play a major role in articulation. The nasal cavity serves as a resonance chamber and is used for the generation of nasal sounds.
Human auditory system
I will not talk much about the anatomy of the human auditory system. We will focus more on the perception of the auditory system and what one can derive from these facts. The human ear is made up of different organic elements which can very effectively transform pressure waves into a mechanical movement. This mechanical movement can be measured in the inner ear. The sound is filtered by different filterbanks, very much like it is done in signal processing to calculate a spectrogram. The whole process is very complicated, also because there are various feedback loops from the brain which can influence the inner ear and even generate resonant frequencies. Due to this complexity, we will focus on a few simple, scientifically established facts that can improve the quality of our speech signal processing. First, the human ear has a partially logarithmic perception of frequency. This means that the ear has a higher pitch resolution for low-pitched tones than for high-pitched tones. A technique called Mel frequency warping accounts for this partially logarithmic pitch perception. Second, the loudness perception is also logarithmic, once the sound pressure exceeds a certain minimal level; below this threshold, the ear is not capable of detecting any sound. Therefore, it is a good idea to represent loudness on a logarithmic scale and to account for the minimal sound threshold. By mimicking the ear’s behavior, there is a lower chance that we waste precious processing time on data which is actually not important for the performance of our speaker verification system.
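The Mel frequency warping mentioned above can be written down directly. A minimal sketch in Python, using the common textbook approximation (the constants 2595 and 700 are the usual published values):

```python
import math

def hz_to_mel(f_hz):
    """Warp a frequency in Hz onto the (approximately logarithmic) Mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse warping, from the Mel scale back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The warping is roughly linear below ~1 kHz and logarithmic above it:
print(hz_to_mel(1000.0))   # ~1000 Mel
print(hz_to_mel(8000.0))   # ~2840 Mel
```

Note how doubling the frequency from 4 kHz to 8 kHz adds far fewer Mel than the first 1 kHz does, which matches the reduced pitch resolution of the ear at high frequencies.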
The question we will try to answer in this section is: what does a speech sample look like once it is in the signal domain, and what are common methods for inspecting a speech signal? As mentioned before, the speech signal recorded by the microphone shows the pressure fluctuation over time. However, the pressure signal is not stationary, which means that it is not a good idea to simply take the discrete Fourier transform (DFT) of the whole speech sample. One way of analyzing a speech signal is to split it into short, shifted, overlapping frames, short enough that the signal can be considered stationary within each frame. Next, the short-time Fourier transform (STFT) is computed. Finally, the resulting spectrum of each frame can be represented in a spectrogram.
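The frame-and-transform procedure just described can be sketched in a few lines of NumPy. The 25 ms frame length, 10 ms hop and Hamming window below are illustrative choices, not values prescribed by this post:

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split signal x into overlapping frames of frame_len samples, shifted by hop."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def spectrogram(x, fs, frame_ms=25, hop_ms=10):
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    frames = frame_signal(x, frame_len, hop)
    frames = frames * np.hamming(frame_len)          # taper each frame to reduce spectral leakage
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # power spectrum per frame
    return spec                                      # shape: (n_frames, frame_len // 2 + 1)

# Example: 1 s of a 440 Hz tone sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
S = spectrogram(x, fs)
print(S.shape)  # (98, 201)
```

Plotting `10 * np.log10(S).T` with time on the horizontal axis gives the familiar spectrogram image; for the pure tone above, a single horizontal line appears at 440 Hz.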
In the picture below, a sound sample with the utterance “This is a test”, spoken by a male speaker, is shown as a spectrogram. We can see that speech is made up of several harmonic frequencies (horizontal lines), which is best visible during vowels. These harmonics are produced by the oscillating vocal folds. If there is a gradual change in the harmonic frequencies, the sound is called a diphthong (visible between 0.85 s and 0.95 s). During the utterance of a diphthong, the position of the tongue moves, which changes the frequencies. Additionally, we observe noise-like patterns with about the same duration as vowels. These phones are called fricatives; one example is the phone ‘s’. Finally, plosives are phones of very short duration which usually stretch out over the whole frequency band. Examples of plosives are the phones ‘t’ and ‘p’.
Basic elements of a speaker verification system
In this section, we will show the basic building blocks of a speaker verification system. We will group the whole signal processing chain into two stages: the conversion of the physical signal into a digital signal, and feature extraction together with further high-level processing.
Physical to digital signal conversion
First, the sound pressure wave is converted into an analog electrical signal by a microphone. The type of microphone depends greatly on how the speaker verification system is used (e.g. the distance to the speaker). It is important that the transfer curve of the microphone is as flat as possible; in other words, the microphone should not alter the signal in any way. Next, a high-pass filter after the anti-aliasing filter can help to remove a DC bias, which could otherwise falsify calculations in a later stage of the processing chain. The pre-emphasis block alters the frequency spectrum of the recorded sound in order to mimic the behavior of the ear, or rather the collaboration of ear and brain. As mentioned before, the brain can produce certain resonant frequencies in the inner ear which amplify the higher frequencies of the audible spectrum. Achieving a similar behavior for our speaker verification system is the purpose of the pre-emphasis block.
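A pre-emphasis block is commonly realized as a simple first-order filter, y[n] = x[n] − α·x[n−1], which boosts high frequencies and strongly attenuates DC. The coefficient α = 0.97 below is a typical value chosen here for illustration, not one mandated by the post:

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    Boosts the higher frequencies of the spectrum, mimicking the
    amplification of high frequencies observed in the inner ear.
    """
    x = np.asarray(x, dtype=float)
    y = np.copy(x)
    y[1:] = x[1:] - alpha * x[:-1]  # first sample is passed through unchanged
    return y

# A constant (DC-like) signal is almost completely suppressed:
y = pre_emphasis(np.ones(5))  # all samples after the first become 1 - 0.97 = 0.03
```

Note that this filter complements, but does not replace, the dedicated DC-removal high-pass mentioned above; pre-emphasis shapes the whole spectrum rather than just removing the bias.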
Preprocessing, feature extraction, further processing
First, the signal is split into separate frames of about 20-30 ms duration. If a single frame were longer, very short phones (such as plosives) could no longer be detected; vowels, on the other hand, usually last about 80 ms. Consecutive frames overlap by about 10 ms, to ensure that the beginnings and endings of phones are captured correctly. Next, depending on the feature extraction method used, a frequency and/or magnitude warping stage is applied; both techniques try to mimic the behavior of the ear. Examples of frequency warping are the Mel scale and the Bark scale. After that, the basic information for speaker verification is extracted from the speech signal. This process is called feature extraction. Various methods exist; two common ones are Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Cepstral Coefficients (LPCC). Finally, clustering and higher-level algorithms such as Gaussian Mixture Models or Neural Networks process the extracted features of the speech signal. The output of these higher-level algorithms depends on the usage of the speaker verification system. In the case of a system which categorizes speakers by gender, the output might be two output vectors, one for each gender (male, female).
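As a preview of the feature extraction stage just described, here is a condensed, dependency-free sketch of MFCC-style features: a triangular Mel filterbank applied to one frame's power spectrum, followed by a logarithm and a DCT-II. The filterbank size (26) and number of coefficients (13) are common but arbitrary choices; the full implementation follows in the next post.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose center frequencies are spaced uniformly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                 # rising edge of the triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge of the triangle
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc_like(power_frame, fb, n_coeffs=13):
    """Log Mel-filterbank energies followed by a DCT-II -> cepstral coefficients."""
    energies = np.log(fb @ power_frame + 1e-10)   # logarithmic loudness warping
    n = len(energies)
    # DCT-II basis written out explicitly to keep the example dependency-free
    dct = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * np.arange(n_coeffs))
    return energies @ dct

# Example: one 400-sample (25 ms) frame of a 440 Hz tone at 16 kHz
fs, n_fft = 16000, 400
frame = np.sin(2 * np.pi * 440 * np.arange(n_fft) / fs) * np.hamming(n_fft)
power = np.abs(np.fft.rfft(frame)) ** 2
fb = mel_filterbank(26, n_fft, fs)
coeffs = mfcc_like(power, fb, 13)
print(coeffs.shape)  # (13,)
```

The 13 coefficients per frame would then be the input to the higher-level stage (e.g. a Gaussian Mixture Model), one feature vector per 10 ms frame shift.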
In this post, I have given an introduction to the field of speech signal processing, in particular for the application of a speaker verification system. The post shows how speech can be represented in the signal domain, and names several common methods used in speech signal processing. In the next post, the feature extraction method MFCC will be implemented in a Matlab/Octave script.
- Fundamentals of Speaker Recognition, Homayoon Beigi, Springer, ISBN 978-0-387-77591-3
- Academic Press Library in Signal Processing, Volume 4, Rama Chellappa & Sergios Theodoridis (eds.), ISBN 978-0-12-396501-1