Intelligent Vehicles Made Simple: An Analysis to Determine the Best Captioning Models for Self-Driving Vehicles

Abhipsha Das (~chiral-carbon)


Self-driving vehicles are fascinating to most of us, engineers or not. They are one of the pivotal aims of artificial intelligence, and a great deal of research is being carried out to realize them.

But to actually understand how they work, beyond the mechanics and hardware, we need to understand the AI: what really makes them intelligent enough to manoeuvre safely on roads without supervision. The AI behind any intelligent vehicle must encompass two essentials: an understanding of the path being traversed, and the communication of that understanding to the physical vehicle.

This is where computer vision and language understanding come into play. The AI must see, and convert what it sees into natural language. In other words, it must generate a description of real-time video using a captioning model.

An outline of the talk:

  • Images: an example of unstructured data
  • Processing images for extracting information
  • Computer vision: how machines are trained for sight and observing details in images
  • Language modelling: using statistics to predict character and word sequences
  • Natural language processing: an overview
  • What is image captioning?
  • Building an image captioning model with a convolutional neural network (converting images to vector representations) and long short-term memory (converting vectors to character sequences)
  • Analysis of various captioning architectures: encoder-decoder, compositional, similarity-based, policy-network-based
  • Evaluating the model on various metrics to measure the accuracy of its captions against human-written references
  • An extension of image captioning: video captioning
  • Image extraction from videos
  • Object detection and comparison of images over time steps to generate sentences
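To give a flavour of the decoder side outlined above, turning an image feature vector into a word sequence, here is a minimal greedy decoding loop in plain Python. The `score_next_token` function is a hypothetical stand-in for a trained CNN+LSTM stack (not from the talk materials), so the loop itself stays self-contained and runnable:

```python
# Minimal sketch of greedy caption decoding. The trained CNN encoder and
# LSTM decoder are replaced by a toy scoring function; a real model would
# condition its scores on the image vector and the decoded prefix.

VOCAB = ["<start>", "<end>", "a", "car", "on", "the", "road"]

def score_next_token(image_vector, prefix):
    """Hypothetical stand-in for the CNN+LSTM stack: returns one score
    per vocabulary token, here following a canned caption."""
    canned = ["a", "car", "on", "the", "road", "<end>"]
    step = len(prefix) - 1  # prefix always starts with <start>
    target = canned[step] if step < len(canned) else "<end>"
    return [1.0 if tok == target else 0.0 for tok in VOCAB]

def greedy_decode(image_vector, max_len=10):
    """Repeatedly pick the highest-scoring next token until <end>."""
    caption = ["<start>"]
    for _ in range(max_len):
        scores = score_next_token(image_vector, caption)
        best = VOCAB[scores.index(max(scores))]
        if best == "<end>":
            break
        caption.append(best)
    return " ".join(caption[1:])

print(greedy_decode(image_vector=[0.1, 0.9]))  # → a car on the road
```

In a real Keras model, `score_next_token` would be one forward pass of the LSTM decoder conditioned on the CNN's image embedding; the greedy loop itself is unchanged.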

The talk will have an example demonstration of the model in action.
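For the evaluation step in the outline, a common family of metrics compares n-gram overlap between a generated caption and human references, as in BLEU. A minimal clipped unigram-precision sketch (not the full BLEU, which adds a brevity penalty and multiple references) might look like:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: the fraction of candidate words that
    appear in the reference, with repeats capped by the reference count."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return clipped / sum(cand_counts.values())

print(unigram_precision("a car on the road", "a red car on the road"))  # → 1.0
```

Clipping prevents a degenerate caption such as "a a a" from scoring highly just by repeating a common reference word.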


The talk will be intuitive, and detailed knowledge of the topics listed below is absolutely not required. That said, for context and clarity, the audience is encouraged to know the basics of computer vision to follow how we arrive at the conclusions. A decent knowledge of Python and the Keras framework is recommended.

A basic understanding of the following will also help with a better grasp of the talk:

The links point to source material on those topics.

Content URLs:

The talk is an in-depth analysis and application of my preliminary work on image captioning, which can be found at:

  • my Academia report
  • my initial image captioning implementation on GitHub

Speaker Info:

Abhipsha Das is a final year CS undergrad at IIIT Bhubaneswar. She is a core team member of the Programming Society at IIIT Bhubaneswar and an ACM Chapter Official. She has worked on machine learning algorithms in the past for research undertakings:

  1. using unsupervised learning for automatic gait event detection from motion-capture cameras, in the Machine Intelligence and Bio-Motion Lab at NIT Rourkela
  2. using supervised + reinforcement learning in neural image captioning for intelligent vehicles, at the AI Department of IIT Kharagpur.

Her love for Python started two years ago with her enthusiasm for artificial intelligence. It has drawn her to participate as a mentor in her college community as well as in the 4th edition of the Learn IT, Girl program. She has done projects in gold price prediction, quantum computing algorithms, and medical image segmentation in the past. She is an ardent Pythonista, so much so that she even appears for coding interviews in Python.

Speaker Links:

Public profiles on:

Id: 1448
Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: