TUBITAK 107E015: Novel Approaches in Audio-Visual Speech Recognition
Erol Ozgur (alumnus)
Speech recognition is a maturing technology, yet its performance degrades substantially in noisy environments. To aid recognition, it has been proposed to use lip information in addition to audio features; such methods are called audio-visual speech recognition. Lip-reading is also used for security purposes, to understand what a remote person is saying, and there are even professional lip-reading experts. This shows that lip movements can be used to recognize speech even when the audio signal itself is unavailable.
In this project, we aim to improve speech recognition performance by using audio and visual information together. We propose novel approaches to the problem, and we expect the proposed system to outperform conventional systems. There are studies on audio-visual speech recognition both internationally and in our country. In this project, we propose a novel multi-stream, multi-classifier approach, and we focus on large-vocabulary, phonetic-model-based audio-visual speech recognition, which has not been tackled in the literature.
The novelty in this proposal is in the following areas:
We will extract both boundary-curve and texture-based features from the lips. We will normalize the features to eliminate variations due to the speaker and the environment (illumination, etc.). We will also make use of dynamic changes in the features.
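As an illustration of the dynamic features mentioned above, delta coefficients are commonly computed by linear regression over a few neighboring frames. The following is a minimal sketch of this standard computation; the window size and function name are illustrative assumptions, not project specifics:

```python
import numpy as np

def delta_features(feats: np.ndarray, window: int = 2) -> np.ndarray:
    """Standard delta (dynamic) coefficients via linear regression over
    +/- `window` neighboring frames. `feats` has shape (frames, dims)."""
    padded = np.pad(feats, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = np.zeros_like(feats, dtype=float)
    n = len(feats)
    for k in range(1, window + 1):
        # weighted difference of frames k steps ahead and k steps behind
        deltas += k * (padded[window + k:window + k + n]
                       - padded[window - k:window - k + n])
    return deltas / denom
```

The same routine can be applied again to the deltas to obtain acceleration (delta-delta) features.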
We will use multiple classifier systems to classify both the speech and lip data and obtain class posterior probabilities, which will in turn be given to multi-stream HMMs as observations. Thus we will follow a tandem approach to the problem, achieving discrimination through the classifiers.
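The tandem idea above can be sketched as follows: a discriminative classifier produces per-frame class posteriors, which are log-transformed and normalized before being passed to the HMM as observation vectors. This is an illustrative sketch only; the softmax stand-in for the classifier and the normalization details are assumptions, not the project's actual front end:

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax: turns per-frame classifier scores into posteriors."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def tandem_features(scores: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Convert per-frame classifier scores (frames, classes) into tandem
    observations: log posteriors with per-dimension mean/variance
    normalization, suitable as HMM observation vectors."""
    logp = np.log(softmax(scores) + eps)
    logp -= logp.mean(axis=0)           # mean-normalize each dimension
    logp /= logp.std(axis=0) + eps      # variance-normalize each dimension
    return logp
```

In practice the log posteriors are often also decorrelated (e.g. with PCA) so that diagonal-covariance Gaussian HMM states model them well.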
We will experiment with one- and two-stream HMMs. In the two-stream HMM, the streams corresponding to the audio and visual features will be synchronous at phonetic boundaries, but asynchronous or semi-synchronous at sub-phonetic boundaries. We will experiment with different topologies.
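In a synchronous multi-stream HMM, the per-state emission score is typically a weighted product of the per-stream emission probabilities, with exponents controlling each modality's influence. A minimal sketch of this combination in the log domain follows; the diagonal-Gaussian stream models and the particular weights are illustrative assumptions:

```python
import numpy as np

def stream_loglik(obs: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    """Diagonal-covariance Gaussian log-likelihood of one stream's observation."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (obs - mean) ** 2 / var))

def multistream_loglik(obs_audio, obs_video, audio_model, video_model,
                       w_audio=0.7, w_video=0.3):
    """Weighted multi-stream emission score:
    log b(o) = w_a * log b_a(o_a) + w_v * log b_v(o_v),
    i.e. a product of stream likelihoods raised to stream exponents."""
    return (w_audio * stream_loglik(obs_audio, *audio_model)
            + w_video * stream_loglik(obs_video, *video_model))
```

Lowering the video weight in noisy-video conditions (or the audio weight in noisy-audio conditions) is the usual way such systems trade off the two modalities.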
The system will be phonetic and context-dependent, and will recognize English or Turkish. We plan to collect audio-visual data for Turkish and to purchase English data. We will also use Turkish and English TV data to train the audio-visual systems.
In the second phase of the project, we will experiment with remote audio-visual speech recognition. This phase will explore security-oriented uses of the technology. We will use stereo cameras to localize a person's face, a pan/tilt/zoom camera to focus on the face, and microphone arrays to focus on the speech.
The project will build expertise in this topic in Turkey. The application areas of the technology can be listed as follows:
Achieving robust speech recognition in noisy environments such as cars or airplanes.
Improving speech recognition performance on TV news and programs.
Remote surveillance applications of lip reading and speech recognition, to improve security.
We believe the project has high publication potential, since it involves an active research area and two different modalities.
In the framework of the "Novel Approaches in Audio-Visual Speech Recognition" project, the following papers have been published: