Audio-visual speech recognition in vehicular noise using a multi-classifier approach
Karabalkan, Harun and Erdoğan, Hakan
Biennial on DSP for in-Vehicle and Mobile Systems
Speech recognition accuracy and noise robustness can both be improved by exploiting the visual speech information acquired from the lip region. Combining the audio and visual information sources requires efficient information fusion techniques. In this paper, we propose a novel SVM-HMM tandem hybrid feature extraction and combination method for an audio-visual speech recognition system. For each stream, multiple one-versus-rest support vector machine (SVM) binary classifiers are trained, with each word treated as a class in a limited-vocabulary speech recognition scenario. The outputs of the binary classifiers for one stream are treated as a feature vector and combined with the corresponding vector from the other stream, and new combining binary classifiers are built on the combined vector. The outputs of the classifiers are then used as observed features in hidden Markov models (HMMs) representing words. The whole process can be viewed as a nonlinear feature dimension reduction system that extracts highly discriminative features from limited amounts of training data. To simulate real-world operating conditions, we add vehicular noise at different SNRs to the speech data and perform extensive experiments.
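The abstract does not say how the vehicular noise was mixed into the speech data. As an illustration only, the following minimal Python sketch shows one common way to add a noise recording to a clean signal at a target SNR: scale the noise so that the ratio of speech power to noise power matches the requested value in dB. The function names (`mean_power`, `mix_at_snr`, `measured_snr_db`), the looping of short noise clips, and the synthetic sine and Gaussian signals are all assumptions of this sketch and are not taken from the paper.

```python
import math
import random


def mean_power(signal):
    """Average power (mean of squared samples) of a sequence of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so that the mixture has the given SNR in dB.

    Assumption of this sketch: the noise is looped if it is shorter than the
    speech and truncated if it is longer.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Match the noise length to the speech length.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise has zero power")
    # SNR_dB = 10*log10(P_s / (g^2 * P_n))  =>  g = sqrt(P_s / (P_n * 10^(SNR_dB/10)))
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """SNR in dB of a mixture, given the clean speech it was built from."""
    residual = [m - s for s, m in zip(speech, mixture)]
    return 10.0 * math.log10(mean_power(speech) / mean_power(residual))


# Synthetic stand-ins for real recordings (hypothetical data).
rng = random.Random(0)
speech = [math.sin(2.0 * math.pi * 220.0 * t / 16000.0) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]

for target in (0.0, 5.0, 10.0):
    noisy = mix_at_snr(speech, noise, target)
    print(f"target {target:5.1f} dB, measured {measured_snr_db(speech, noisy):.3f} dB")
```

Measuring power over the whole utterance, as done here, is the simplest choice; restricting the power estimate to speech-active frames would be an alternative, and the abstract does not indicate which convention the authors used.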