Automatic Speech Recognition (ASR) systems play an important role in human-computer interaction. Although ASR systems that use only audio information attain high recognition rates on clean data, their accuracy degrades significantly in the presence of environmental acoustic noise.
One possible solution to the noise problem is to exploit visual speech information. Since visual information is unaffected by acoustic noise, ASR systems that utilize both audio and visual cues achieve higher recognition accuracy and improved noise robustness. Combining the audio and visual information sources requires efficient information fusion techniques, which are the focus of this research.
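One common baseline for audio-visual fusion is feature-level (early) fusion, in which frame-synchronous audio and visual feature vectors are simply concatenated before recognition. The sketch below illustrates this idea with NumPy; the feature dimensions, the random toy data, and the `early_fusion` helper are illustrative assumptions, not the specific method studied in this work.

```python
import numpy as np

def early_fusion(audio_feats, visual_feats):
    """Concatenate per-frame audio and visual feature vectors.

    audio_feats:  (T, Da) array, e.g. MFCCs per frame
    visual_feats: (T, Dv) array, e.g. lip-region features per frame
    Returns a (T, Da + Dv) fused feature matrix.
    """
    # The two streams must be aligned to the same frame rate beforehand.
    assert audio_feats.shape[0] == visual_feats.shape[0], "streams must be frame-aligned"
    return np.concatenate([audio_feats, visual_feats], axis=1)

# Toy example: 100 frames of 13-dim audio and 6-dim visual features.
T = 100
audio = np.random.randn(T, 13)
visual = np.random.randn(T, 6)
fused = early_fusion(audio, visual)
print(fused.shape)  # (100, 19)
```

Early fusion is attractive for its simplicity, but it assumes the streams are synchronized and equally reliable; decision-level (late) fusion schemes relax these assumptions by weighting each modality's contribution, which is why the choice of fusion technique matters under varying noise conditions.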