Natural Language Toolkit for Indic Languages - iNLTK

Gaurav Arora (~gaurav77)


Description:

Natural Language Toolkit for Indic Languages (iNLTK) is an open-source deep learning library, built on top of PyTorch in Python, that aims to provide out-of-the-box support for the various NLP tasks an application developer might need for Indic languages. iNLTK currently supports 12 Indic languages: Hindi, Punjabi, Sanskrit, Gujarati, Kannada, Malayalam, Nepali, Odia, Marathi, Bengali, Tamil and Urdu.
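
A minimal getting-started sketch (the package name, setup function and language code follow the iNLTK documentation; exact installation steps may differ slightly between versions):

    # pip install inltk   (a CPU-only PyTorch build is sufficient, installed as per the docs)

    from inltk.inltk import setup

    # One-time download of the pretrained models for a language,
    # identified by its code ('hi' = Hindi). Needed once per language.
    setup('hi')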

The presentation will share insights into why and how this library came into existence, what capabilities it gives to developers, and the technical details of working on NLP for low-resource languages - directly from the creator of iNLTK.

iNLTK is built from the ULMFiT language models and classifiers trained in the repositories NLP for Hindi, NLP for Punjabi, NLP for Sanskrit, NLP for Gujarati, NLP for Kannada, NLP for Malayalam, NLP for Nepali, NLP for Odia, NLP for Marathi, NLP for Bengali, NLP for Tamil and NLP for Urdu. These repositories contain all of the code, links to the datasets used to train the models, the scripts used to scrape and clean those datasets, the trained language models and classifiers, and the tokenizer models trained with Google's SentencePiece.
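
To illustrate the tokenizer piece, a SentencePiece model of this kind can be trained on a raw monolingual corpus in a few lines; the file names, vocabulary size and model type below are placeholders, not the exact settings used in those repositories:

    import sentencepiece as spm

    # Train an unsupervised subword tokenizer on raw text
    # (one sentence per line); hyperparameters are illustrative.
    spm.SentencePieceTrainer.Train(
        '--input=hi_wiki_corpus.txt --model_prefix=hi_tokenizer '
        '--vocab_size=30000 --model_type=unigram'
    )

    # Load the trained model and split a sentence into subword pieces.
    sp = spm.SentencePieceProcessor()
    sp.Load('hi_tokenizer.model')
    print(sp.EncodeAsPieces('मुझे किताबें पढ़ना पसंद है'))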

The presentation will paint an end-to-end picture of building deep learning models for low-resource languages, starting from data collection and going through the technical details of building language models, using transfer learning, and so on.
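
For concreteness, the ULMFiT-style workflow discussed in the talk roughly follows the pattern below, shown here as a sketch with the fastai v1 text API; paths, file names and hyperparameters are placeholders, and the actual training scripts live in the repositories listed above:

    from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                             language_model_learner, text_classifier_learner)

    # 1. Fine-tune a language model on the target-language corpus
    #    (in the actual repositories, the base weights come from a model
    #    trained on that language's Wikipedia).
    data_lm = TextLMDataBunch.from_csv('data/', 'texts.csv')
    lm_learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
    lm_learn.fit_one_cycle(1, 1e-2)
    lm_learn.save_encoder('fine_tuned_encoder')

    # 2. Reuse the fine-tuned encoder in a text classifier (transfer learning).
    data_clas = TextClasDataBunch.from_csv('data/', 'texts.csv', vocab=data_lm.vocab)
    clas_learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
    clas_learn.load_encoder('fine_tuned_encoder')
    clas_learn.fit_one_cycle(1, 1e-2)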

iNLTK has been widely appreciated by the community, including by Jeremy Howard on Twitter, by people on LinkedIn (here, here, here, here) and on Reddit. The library has 300+ stars, 50+ forks and 20+ watchers on GitHub, and had 11,000+ downloads from PyPI as of June 2019.

Basic Outline of Talk

  • iNLTK introduction [3-4 minutes]
  • Why do we need such a library for Indic languages? [1-2 minutes]
  • What can you do with iNLTK [6-8 minutes] (a usage sketch follows this outline)
    • Get Embedding Vectors for every token in text
    • Tokenize your text
    • Predict next 'n' words
    • Identify language of text
    • Remove foreign language from text
  • How can this be useful in a real world project [2-4 minutes]
  • Technical details of building iNLTK [8-10 minutes]
    • Data Collection and Cleaning
    • Building a language model over the whole of Wikipedia
    • Fine-tuning the language model and building a classifier on top of it, using transfer learning
    • Using SentencePiece for unsupervised tokenization
  • Q/A [4-5 minutes]
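
A short usage sketch of the tasks listed above; the function names and language codes follow the iNLTK documentation, exact signatures may vary slightly between versions, and the Hindi sentences are just examples (running setup('hi') from the earlier sketch is assumed):

    from inltk.inltk import (tokenize, get_embedding_vectors, predict_next_words,
                             identify_language, remove_foreign_languages)

    text = 'मुझे किताबें पढ़ना बहुत पसंद है'   # "I really like reading books"

    print(tokenize(text, 'hi'))               # subword tokens from the SentencePiece model
    print(get_embedding_vectors(text, 'hi'))  # an embedding vector for every token
    print(predict_next_words(text, 3, 'hi'))  # predict the next 3 words with the language model
    print(identify_language(text))            # identify the language of the text
    print(remove_foreign_languages('मुझे apple बहुत पसंद है', 'hi'))  # mark tokens from other languages/scripts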

Who is this talk for

  • NLP enthusiasts who want to work with low resource languages
  • Application developers who want to build apps in vernacular languages catering to a localized audience
  • Deep Learning practitioners/enthusiasts

Prerequisites:

  • Basics of Machine Learning
  • Basics of NLP

Content URLs:

Speaker Info:

Gaurav is the creator of the Natural Language Toolkit for Indic Languages (iNLTK), an open-source deep learning library built on top of PyTorch in Python that aims to provide out-of-the-box support for the various NLP tasks an application developer might need for Indic languages. Gaurav has been working on the library since last year, training language models and classifiers for low-resource Indic languages which are then used in iNLTK. This work has been widely appreciated by the community, including by Jeremy Howard on Twitter, by people on LinkedIn (here, here, here, here) and on Reddit. The library has 300+ stars, 50+ forks and 20+ watchers on GitHub.

Besides iNLTK, Gaurav has also built Code with AI (120+ stars), a tool for competitive programmers that predicts which techniques one should use to solve a competitive programming problem correctly. The tool has had 3300+ unique users, 4200+ sessions and 5300+ page views since January 2019.

In his day job, Gaurav is currently working as a Software Engineer at Goldman Sachs.

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: