Tandem Approach for Information Fusion in Audio Visual Speech Recognition
Speech is the medium humans most often prefer for interacting with their environment, which makes it an ideal instrument for human-computer interfaces. However, for speech recognition systems to become more prevalent in real-life applications, they need high recognition accuracy together with speaker independence and robustness to adverse conditions. One of the main concerns for speech recognition systems is acoustic noise. Audio-visual speech recognition systems aim to overcome the noise problem by using visual speech information, generally extracted from the face or, more specifically, the lip region. Visual speech information is known to be a complementary source for speech perception and is not affected by acoustic noise. This advantage introduces two additional issues into the task: visual feature extraction and information fusion. There is extensive research on both issues, but an acceptable level of success has not yet been reached. This work concentrates on information fusion and proposes a novel methodology. The proposed technique deploys a preliminary decision stage at the frame level and feeds the Hidden Markov Model (HMM) with the output posterior probabilities, as in the tandem HMM approach. First, classification is performed for each modality separately. Subsequently, the individual classifiers of each modality are combined to obtain a posterior probability vector for each speech frame. The purpose of the preliminary stage is to integrate acoustic and visual data for maximum class separability. Hidden Markov Models are employed as the second modelling stage because of their ability to handle the temporal evolution of data. The proposed approach is investigated in a speaker-independent digit recognition scenario under various levels of car noise.
The method is compared with Multiple Stream Hidden Markov Models (MSHMM), a principal information fusion framework in audio-visual speech recognition. The results on the M2VTS database show that the novel method achieves comparable performance with less processing time than MSHMM.
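
The frame-level fusion stage described above can be illustrated with a minimal sketch. The abstract does not state how the per-modality classifiers are combined, so the weighted log-linear rule used here, the function names `fuse_posteriors` and `tandem_features`, and the `audio_weight` parameter are all assumptions made for illustration, not the method evaluated in this work. The sketch takes per-frame class posteriors from a hypothetical audio classifier and a hypothetical visual classifier, combines them into one posterior vector per frame, and converts the result to log-posteriors that a second-stage HMM could consume as observations.

```python
import numpy as np


def fuse_posteriors(audio_post, visual_post, audio_weight=0.7, eps=1e-10):
    """Combine per-frame class posteriors from two modalities.

    Assumed rule (not taken from the paper): a weighted log-linear
    combination, renormalised so each frame sums to one.
    Inputs have shape (n_frames, n_classes).
    """
    audio_post = np.asarray(audio_post, dtype=float)
    visual_post = np.asarray(visual_post, dtype=float)
    if audio_post.shape != visual_post.shape:
        raise ValueError("audio and visual posteriors must have the same shape")

    log_fused = (audio_weight * np.log(audio_post + eps)
                 + (1.0 - audio_weight) * np.log(visual_post + eps))
    # Subtract the per-frame maximum before exponentiating for stability.
    log_fused -= log_fused.max(axis=1, keepdims=True)
    fused = np.exp(log_fused)
    return fused / fused.sum(axis=1, keepdims=True)


def tandem_features(fused_post, eps=1e-10):
    """Turn fused posteriors into log-posterior observation vectors."""
    return np.log(np.asarray(fused_post, dtype=float) + eps)


# Toy input: 2 frames, 3 classes (values are made up for illustration).
audio = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])
visual = np.array([[0.5, 0.4, 0.1],
                   [0.1, 0.2, 0.7]])

fused = fuse_posteriors(audio, visual, audio_weight=0.5)
features = tandem_features(fused)  # one observation vector per frame
```

In a noisy setting one would presumably lower `audio_weight` so that the visual posteriors carry more influence; how such a weight should be chosen is outside the scope of this sketch.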