Sunday, March 20, 2011

Speech Technology : Phonetics

Last year around this time I was working on a personal side project related to speech recognition, and I really did not give it enough attention. It had also been a long time since I had seen things like the Fast Fourier Transform, so I procrastinated. Well, it is time to pick up where I left off and, true to the original meaning of the word, blog. This is a series of posts where I write whatever is on my mind about speech technology, phonetics and the various options available for implementing it.
I was first introduced to the subject of phonetics back in 1992. I was in college and used to visit my aunt every weekend. My uncle taught Communication Skills in a technical college, and he must have had a couple of dozen books on the English language and on teaching English. While going through his bookshelf I found a book on the English sound system. That was the start of a fascination with phonetics and languages. My dad had also bought all the volumes of the Marathi Vishwakosh (encyclopedia), which I used to read a lot in my free time. (And I had a lot of free time back then.) I then read about the phonetics and grammar of my own language and about human language in general. That fascination still remains...
Language ability has made humans the only species to make the transition to civilization. Every human being has the capacity for language, and it is impossible to think of human society without it.
[Without language we would still be in sub-Saharan Africa, and probably extinct by now. As a side note, please do watch Journey of Man if you get some time. Anthropology is yet another subject I am fascinated by.]
How the human voice is produced
Human speech begins in the larynx. The vibrations created there are modified throughout the vocal tract and ultimately come out of the mouth as a series of sounds that we understand as a sequence of phonemes. The larynx is where the "voice" starts, but that sound is progressively modified in the mouth by the shapes we form with our tongue and lips. I have modified the standard diagram to show a schematic of the vocal tract.
The key concept is this: our tongue is a very flexible muscle that can form various shapes, and that profoundly affects how we perceive the basic sound that begins at the larynx.
[Edited from original here :]

That brings us to the next concept. There are two types of sounds: vowels and consonants.
When the airflow from the larynx is not obstructed, we perceive the sound as a vowel. Anyone who has dealt with waveforms of human speech knows that most of the waveform is basically the sound of vowels. However, the most interesting sounds are generated when the airflow is blocked (or substantially modified) for a brief amount of time. This time frame is tiny compared to the surrounding vowel sound, and we perceive it as a consonant.
In terms of digital signal processing, the waveform of the voice that starts in the larynx is modified by the resonances of the vocal tract. In the simplified model on the right side, this is expressed as a convolution of the source signal with the response of the vocal tract.
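To make the convolution idea concrete, here is a minimal sketch in plain Python. It assumes a crude impulse train as a stand-in for the voicing from the larynx and a made-up three-sample decaying response for the vocal tract. The names `impulse_train` and `convolve` and all the numbers are hypothetical, chosen only for illustration; a real vocal tract response is much longer and changes as the tongue moves.

```python
def convolve(source, impulse_response):
    """Discrete linear convolution: out[n] = sum over k of source[k] * h[n - k]."""
    out = [0.0] * (len(source) + len(impulse_response) - 1)
    for i, s in enumerate(source):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def impulse_train(length, period):
    """Crude stand-in for the glottal source: one pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


# Hypothetical values, chosen only for illustration.
source = impulse_train(length=12, period=4)   # "voicing" that starts in the larynx
vocal_tract = [1.0, 0.5, 0.25]                # toy decaying impulse response

speech = convolve(source, vocal_tract)
print(speech)
```

Each pulse from the source is replaced by a copy of the vocal tract response, so changing the response (the shape of the mouth) changes the output while the pulse rate (the pitch) stays the same.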

Vowels take up most of the time in a speech signal. Consonants are meaningful only against the background of a robust vowel signal. The larynx is actively producing "voice" during the vowels.
The quality of a vowel depends on four attributes.
  1. Height of the tongue - in other words, whether the gap between the tongue and the roof of the mouth is closed or open. Generally four positions are considered: open, open-mid, close-mid and close.
  2. Position of the tongue - front, central or back.
  3. Roundness of the lips - rounding the lips adds another effect. The lips are generally rounded for back vowels.
  4. Nasalization - whether the vowel is nasalized.
Below is the list of known vowels and their IPA symbols.

Edited from original here :

The best way to analyze vowels is to look at their spectrogram and find the peaks, which are called formants. The first three formants are especially interesting. Rounding the lips changes the positions of the formants.
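As a rough illustration of looking for peaks in a spectrum, the sketch below builds a synthetic signal from three sine waves and then recovers their frequencies with a naive discrete Fourier transform. Everything here is an assumption for the sake of the example: the sample rate, the three frequencies standing in for the first three formants, and the helper names. Real formants are broad peaks in the spectral envelope of recorded speech, not pure tones, and in practice one would use a library FFT such as `numpy.fft.rfft` rather than this slow pure-Python loop.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz (assumed)
N = 400              # samples, so each DFT bin is 8000 / 400 = 20 Hz wide
# Hypothetical (frequency in Hz, amplitude) pairs standing in for F1, F2, F3.
COMPONENTS = [(700, 1.0), (1200, 0.6), (2600, 0.3)]


def synthesize(components, n, sample_rate):
    """Sum of sine waves; a toy substitute for a recorded vowel."""
    return [
        sum(a * math.sin(2 * math.pi * f * t / sample_rate) for f, a in components)
        for t in range(n)
    ]


def magnitude_spectrum(signal):
    """Naive O(n^2) DFT magnitudes for bins 0 .. n/2."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def strongest_peaks(spectrum, count, bin_hz):
    """Frequencies (ascending) of the `count` largest local maxima."""
    peaks = [
        (spectrum[k], k * bin_hz)
        for k in range(1, len(spectrum) - 1)
        if spectrum[k] > spectrum[k - 1] and spectrum[k] > spectrum[k + 1]
    ]
    strongest = sorted(peaks, reverse=True)[:count]
    return sorted(freq for _, freq in strongest)


signal = synthesize(COMPONENTS, N, SAMPLE_RATE)
spectrum = magnitude_spectrum(signal)
formants = strongest_peaks(spectrum, count=3, bin_hz=SAMPLE_RATE / N)
print(formants)   # [700.0, 1200.0, 2600.0]
```

The frequencies were picked to fall exactly on DFT bins, which is why the peaks come back clean. With real speech the peaks are smeared, and some smoothing of the spectrum is usually needed before picking them.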
There are also diphthongs and glides, which are combinations of two vowels.

The consonants are interesting because they are numerous, show great variation in how they are produced, and their signal is available for only a short time. They are "negative" sounds because the voicing and the vowel are interrupted or blocked by the position of the tongue. Here I am going to use somewhat non-standard terminology and classification, only because that is how I understand it, and because I was introduced to the subject via Sanskrit grammar.
There are three main things to look for: (A) the properties of the original sound that is obstructed, (B) the manner in which the airflow is obstructed or modified, and (C) the place of obstruction or modification.
Properties of original sound
  1. Voiced vs unvoiced - whether voicing continues or not. [b-p], [g-k] and [d-t] are some examples. (saghosh and aghosh, सघोष - अघोष)
  2. Unaspirated vs aspirated - this difference is important in Indic languages, where it effectively doubles the number of consonants available. (alpa prana and maha prana, अल्पप्राण - महाप्राण)
Manner in which the airflow is obstructed or modified
  1. Stops - the airflow stops completely.
  2. Fricatives - sound produced by friction is introduced into the mix.
  3. Affricates - a combination of a stop followed by friction at the same place.
  4. Liquids (sibilants and approximants) - the airflow is partially blocked.
  5. Nasals - the airflow goes through the nasal channel.
Place of articulation

  • Labial 
    • Bilabial
    • Labio-dental
  • Coronal 
    • Dental
    • Alveolar
  • Palatal 
    • Alveo-palatal
    • Retroflex
    • Palatal
  • Dorsal
    • Velar
    • Uvular 
  • Radical
    • Pharyngeal / Epiglottal
    • Glottal
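
The three things to look for in a consonant (properties of the original sound, manner, and place) map naturally onto a small record type. The sketch below is one possible way to hold them, using the four Devanagari velar stops as data. The `Consonant` class and the `find` helper are hypothetical names made up for this example, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consonant:
    symbol: str
    voiced: bool      # saghosh (True) vs aghosh (False)
    aspirated: bool   # maha prana (True) vs alpa prana (False)
    manner: str       # e.g. "stop", "fricative", "nasal"
    place: str        # e.g. "bilabial", "dental", "velar"


# The four velar stops of Devanagari: voicing x aspiration gives 2 x 2 sounds.
VELAR_STOPS = [
    Consonant("क", voiced=False, aspirated=False, manner="stop", place="velar"),  # ka
    Consonant("ख", voiced=False, aspirated=True, manner="stop", place="velar"),   # kha
    Consonant("ग", voiced=True, aspirated=False, manner="stop", place="velar"),   # ga
    Consonant("घ", voiced=True, aspirated=True, manner="stop", place="velar"),    # gha
]


def find(consonants, **features):
    """Return the consonants whose attributes match every given feature."""
    return [
        c for c in consonants
        if all(getattr(c, name) == value for name, value in features.items())
    ]


voiced = [c.symbol for c in find(VELAR_STOPS, voiced=True)]
aspirated = [c.symbol for c in find(VELAR_STOPS, aspirated=True)]
print(voiced, aspirated)
```

With aspiration as a separate flag, the same place and manner yield four stops instead of two, which is the doubling effect mentioned for Indic languages.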

Edited from original here :

Common Tasks in Speech Recognition
  1. Audio signal capture
  2. Spectral Analysis
  3. Phonetic Feature Identification
  4. Phoneme Identification
  5. Phoneme to word translation
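
As a hedged sketch of what sits between steps 1 and 2, the code below cuts a signal into overlapping frames, applies a Hamming window to each, and computes the energy per frame. It assumes a synthetic "recording" (silence, a tone, silence) in place of captured audio, and the frame length, hop size and function names are arbitrary choices for illustration. A real front end would go on to take a spectrum of each windowed frame.

```python
import math


def frames(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [
        signal[start:start + frame_len]
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def hamming(n):
    """Hamming window, which tapers the frame edges before spectral analysis."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def short_time_energy(signal, frame_len, hop):
    """Energy of each windowed frame; high during voiced stretches."""
    window = hamming(frame_len)
    return [
        sum((x * w) ** 2 for x, w in zip(frame, window))
        for frame in frames(signal, frame_len, hop)
    ]


# Synthetic stand-in for captured audio: silence, a tone, then silence.
recording = [
    math.sin(2 * math.pi * n / 20) if 100 <= n < 300 else 0.0
    for n in range(400)
]

energies = short_time_energy(recording, frame_len=100, hop=50)
print([round(e, 2) for e in energies])
```

Frames that fall entirely in the silent parts have zero energy, frames that straddle the edges have some, and frames inside the tone have the most. The same framing is what a spectrogram is built on.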
