Building A Lip Reading System To Recognise Visual Speech Using Python





Automatic visual speech recognition comes in handy when the audio signal is noisy. A video of a person talking is analyzed, and the shapes made by the lips are examined and then mapped to sounds by comparison with a dictionary to find matches for the words being spoken. In this talk, we will use a VGG+GRU network, which follows the CNN+LSTM design, to predict the text spoken by the speaker from audio-less videos and classify it into 20 classes, consisting of 10 words and 10 phrases. This will be done on the audiovisual MIRACL-VC1 dataset.
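As a rough illustration of the 20-class setup described above, the sketch below maps each of the 10 words and 10 phrases to a single integer label. The names `class_index` and `one_hot`, and the choice to number words before phrases, are assumptions made for this example and are not taken from the talk or the dataset.

```python
NUM_WORDS = 10
NUM_PHRASES = 10
NUM_CLASSES = NUM_WORDS + NUM_PHRASES  # 20 classes in total


def class_index(kind, item_id):
    """Map ("words" | "phrases", 1..10) to a label in 0..19.

    Assumption: words occupy labels 0-9 and phrases occupy labels 10-19.
    """
    if kind == "words" and 1 <= item_id <= NUM_WORDS:
        return item_id - 1
    if kind == "phrases" and 1 <= item_id <= NUM_PHRASES:
        return NUM_WORDS + item_id - 1
    raise ValueError(f"unknown class: {kind!r} {item_id!r}")


def one_hot(index, num_classes=NUM_CLASSES):
    """Return a one-hot list with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(num_classes)]


first_word_label = class_index("words", 1)
last_phrase_label = class_index("phrases", 10)
```

With this numbering, a classifier needs a single output layer of 20 units, one per word or phrase.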

The talk will cover how a CNN+LSTM can be used to recognize the sequence of shapes formed by the mouth and then match it to a specific word or sequence of words spoken in a visual feed. It will include data preprocessing, the creation of CNN and LSTM layers using Python, and applying them to the dataset.
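The CNN+LSTM idea can be sketched in Keras roughly as follows. This is a minimal illustration, not the network presented in the talk: the clip length, frame size, layer widths, and the name `build_model` are all assumptions chosen to keep the example small. The same small CNN is applied to every frame, and the LSTM then reads the resulting sequence of per-frame features.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_FRAMES = 16    # assumed number of frames per clip
FRAME_SIZE = 64    # assumed height and width of each cropped mouth image
NUM_CLASSES = 20   # 10 words + 10 phrases


def build_model(num_frames=NUM_FRAMES, frame_size=FRAME_SIZE,
                num_classes=NUM_CLASSES):
    """Build a small CNN+LSTM classifier over grayscale frame sequences."""
    inputs = keras.Input(shape=(num_frames, frame_size, frame_size, 1))

    # TimeDistributed applies the same convolutional layers to each frame.
    x = layers.TimeDistributed(
        layers.Conv2D(16, 3, padding="same", activation="relu"))(inputs)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
    x = layers.TimeDistributed(
        layers.Conv2D(32, 3, padding="same", activation="relu"))(x)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)

    # Collapse each frame's feature map into one feature vector.
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)

    # The recurrent layer summarizes the sequence of frame features.
    x = layers.LSTM(64)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_model()
```

Replacing `layers.LSTM` with `layers.GRU` gives the recurrent half of a GRU-based variant; a VGG-style front end would use deeper stacks of 3x3 convolutions than the two shown here.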


Basics of Python syntax, TensorFlow, Keras, and neural networks

Content URLs:

The code, slides, and resources will be shared as a GitHub repository after the talk.

Speaker Info:

Kanika Modi holds a Bachelor's in Computer Engineering from Netaji Subhas Institute of Technology, University of Delhi. Having finished her coursework, she will join Amazon as a Software Development Engineer (SDE). She is an open source enthusiast and has contributed to organizations such as Systers and FOSSASIA. She is also a Google Summer of Code '18 mentor at Systers, a GirlScript Summer of Code '18 mentor, and a mentor at RightApprise. Her interests also extend to artificial intelligence and machine learning. Python is her weapon of choice.

Id: 768
Section: Data science
Type: Talks
Target Audience: Intermediate
Last Updated: