Taming the transformers - Recipes for making transformers 200x faster for inference on CPU.

Logesh Kumar Umapathi (~logesh_kumar)



Transformers are state-of-the-art deep learning architectures that have shown promising results across different sub-fields of NLP and computer vision. Because of their extensive pre-training and their more than a million learnable parameters, transformer models are sample efficient: they can be fine-tuned with fewer samples than previous architectures.

However, this sample efficiency comes at the cost of high inference/prediction latency and high memory requirements, which makes transformers costly for low-latency use cases. As a result, their potential has not been fully leveraged in low-latency applications.

I would like to share recipes that I found effective in my current work. In this talk, I will show how these recipes could make inference on CPU up to 200x faster while retaining 95-98% of the original accuracy. The talk is inspired by a few research papers (listed below) on this topic. I would like to distill (pun not intended :) ) the techniques discussed in these papers and share them as easy recipes.

What is covered?

Explanation of the techniques below (demos and code snippets will be included as necessary):

  1. Adaptive batching
  2. Distillation
  3. Pruning
  4. Quantization
  5. Runtime optimization
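
As an illustration of the first technique, here is a minimal sketch. It assumes that "adaptive batching" means grouping inputs of similar length so that each batch is padded only to its own longest sequence. The names `make_batches`, `total_slots`, and `PAD_ID`, and the toy token lists, are hypothetical and are not taken from the talk or the referenced papers.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id

# A batch is (original indices, rows padded to a common width).
Batch = Tuple[List[int], List[List[int]]]


def make_batches(seqs: List[List[int]], batch_size: int,
                 sort_by_length: bool) -> List[Batch]:
    """Split token-id sequences into batches, padding each batch to its longest row."""
    order = list(range(len(seqs)))
    if sort_by_length:
        # Put sequences of similar length next to each other to reduce padding.
        order.sort(key=lambda i: len(seqs[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        width = max(len(seqs[i]) for i in idx)
        rows = [seqs[i] + [PAD_ID] * (width - len(seqs[i])) for i in idx]
        batches.append((idx, rows))  # keep idx so outputs can be restored to input order
    return batches


def total_slots(batches: List[Batch]) -> int:
    """Count the token positions (real + padding) the model would have to process."""
    return sum(len(row) for _, rows in batches for row in rows)


# Toy inputs with lengths 2, 9, 3 and 8 (22 real tokens in total).
seqs = [[1] * 2, [1] * 9, [1] * 3, [1] * 8]
fixed = make_batches(seqs, batch_size=2, sort_by_length=False)
adaptive = make_batches(seqs, batch_size=2, sort_by_length=True)
print(total_slots(fixed), total_slots(adaptive))  # prints: 34 24
```

In this toy case, sorting by length cuts the positions processed from 34 to 24 for the same 22 real tokens. The saving in practice depends on how much the input lengths vary.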


The talk is about techniques that improve transformer inference speed by up to 200 times.

Content References:

  1. Fastformers - https://arxiv.org/abs/2010.13382
  2. TinyBERT - https://arxiv.org/abs/1909.10351
  3. ONNX Runtime - https://github.com/microsoft/onnxruntime/tree/61fa5476d55d98f6fb66b5d7b076169073bdb2c8
  4. DynaBERT - https://arxiv.org/abs/2004.04037


Prerequisites:

Exposure to the basics of NLP, deep learning, and transformers

Video URL:


Speaker Info:

Logesh is an ML research engineer at Saama Technologies, building and implementing NLP products for top pharma companies. His work involves bringing deep learning-based NLP solutions from early prototypes to production. He is also a mentor at Springboard, helping future machine learning engineers and data scientists in his free time. When he is not in front of a computer screen, he can be found playing badminton, taking photographs, or reading a book.

LinkedIn: https://in.linkedin.com/in/logeshkumaru
Website: http://logeshumapathi.com/

Speaker Links:

Notable Previous Talks:

  1. Demystifying BERT: How to Interpret NLP Models? at Data Hack Summit conducted by Analytics Vidhya
  2. Building a Question Answering System with State of the art NLP Models
  3. 'Get your feet wet with ML' at GDG CBE DevFest'19

Blog Posts:

  1. What does a fine tuned BERT model look at? - Towards Data Science
  2. A visual intuition to Bayes rule

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Advanced
Last Updated: