Building A Lip Reading System To Recognise Visual Speech Using Python
kanika_96
Description:
Automatic visual speech recognition comes in very handy in scenarios with noisy audio signals. A video of a person talking is analyzed, and the shapes made by the lips are examined, then turned into sounds and matched against a dictionary to identify the words being spoken. In this talk, we will use a VGG+GRU network, which is based on CNN+LSTM layers, to predict the text spoken by the speaker from audio-less videos and classify it into 20 classes, consisting of 10 words and 10 phrases. This will be done on the audiovisual MIRACL-VC1 dataset.
The talk will cover how a CNN+LSTM network can be used to recognize the sequence of shapes formed by the mouth and match it to a specific word or sequence of words spoken in the visual feed. It will include data preprocessing, building the CNN and LSTM layers in Python, and applying them to the dataset.
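As a rough illustration of the CNN+LSTM idea described above, here is a minimal Keras sketch, not the network from the talk. It assumes each utterance has already been preprocessed into a fixed number of cropped, grayscale mouth images. The frame count, image size, layer widths, and the names `build_lip_reader` and `frame_cnn` are all hypothetical choices made for this example; only the 20 output classes come from the description.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# All sizes below are illustrative assumptions, not values from the talk.
NUM_FRAMES = 16   # frames kept per utterance (assumed)
FRAME_SIZE = 64   # height and width of each cropped mouth image (assumed)
NUM_CLASSES = 20  # 10 words + 10 phrases, as stated in the description


def build_lip_reader() -> keras.Model:
    """Small CNN applied to every frame, followed by an LSTM over the frames."""
    # Per-frame feature extractor: turns one mouth image into a feature vector.
    frame_cnn = keras.Sequential(
        [
            keras.Input(shape=(FRAME_SIZE, FRAME_SIZE, 1)),
            layers.Conv2D(16, 3, padding="same", activation="relu"),
            layers.MaxPooling2D(2),
            layers.Conv2D(32, 3, padding="same", activation="relu"),
            layers.MaxPooling2D(2),
            layers.GlobalAveragePooling2D(),
        ],
        name="frame_cnn",
    )

    model = keras.Sequential(
        [
            keras.Input(shape=(NUM_FRAMES, FRAME_SIZE, FRAME_SIZE, 1)),
            # Apply the same CNN to each frame of the sequence.
            layers.TimeDistributed(frame_cnn),
            # Summarise the sequence of per-frame features.
            layers.LSTM(64),
            # One probability per word or phrase class.
            layers.Dense(NUM_CLASSES, activation="softmax"),
        ],
        name="cnn_lstm_lip_reader",
    )
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


model = build_lip_reader()

# Random data standing in for a batch of two preprocessed utterances.
dummy_batch = np.random.rand(2, NUM_FRAMES, FRAME_SIZE, FRAME_SIZE, 1).astype("float32")
probabilities = model.predict(dummy_batch, verbose=0)
```

`probabilities` holds one row per utterance and one column per class. Swapping `layers.LSTM(64)` for `layers.GRU(64)` would give a GRU-based variant, and a pretrained VGG-style backbone could replace the small per-frame CNN; both are left out here to keep the sketch short.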
Prerequisites:
Basics of Python syntax, TensorFlow, Keras, and neural networks
Content URLs:
The code, slides, and resources will be shared as a GitHub repository after the talk.
Speaker Info:
Kanika Modi holds a Bachelor's in Computer Engineering from Netaji Subhas Institute of Technology, University of Delhi. Having finished her coursework, she will join Amazon as a Software Development Engineer (SDE). She is an open source enthusiast and has contributed to organizations such as Systers and FOSSASIA. She is also a Google Summer of Code '18 mentor at Systers, a GirlScript Summer of Code '18 mentor, and a mentor at RightApprise. Her interests also extend to artificial intelligence and machine learning. Python is her weapon of choice.