IndicNLP - An open data platform to bring the advancements of NLP to Indian languages.

Adam Shamsudeen (~adamshamsudeen)



Language is not merely a tool for communication; it is a conduit for culture and history. Indian languages are complex and pose unique challenges for NLP practitioners.

India is a country of diversity in geography, culture, and, indeed, language. There are 22 official languages and many more regional languages with numerous dialects spread across the subcontinent. Every language deserves to live through the end of time, as do the cultures and histories of the people who speak them. They deserve to be preserved in all their forms, in sounds and in scripts and in whatever forms we may come up with in the future. The only way to ensure that is to enrich this diversity even more.

Technology has given rise to specific markets, and these markets often use a single language for commerce and communication: English. English is the language supported by the majority of open-source systems, as well as by companies like Google, Facebook, and Microsoft, for speech-to-text, syntactic parsing, stemming, lemmatization, tokenization, part-of-speech tagging, and so on. This popularity is the result of years of data collection, research, and available computing power, which in turn created a big market for these companies to invest in and develop systems for English. Amidst this, technology serves as the only tool to help preserve and maintain the lifespan of colloquial languages.

Our work

We trained a Malayalam language model (Vaaku2Vec) on the Wikipedia article dump from October 2018, which contained 55k+ articles. The main difficulty in training a Malayalam language model is text tokenization, since Malayalam is a highly inflectional and agglutinative language. The language model was then used to train a classifier that sorts news into 5 categories (India, Kerala, Sports, Business, Entertainment), achieving 92% accuracy on the classification task. We trained similar language models for Tamil and Bangla and released text classifiers along with them.
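
Because Malayalam is agglutinative, naive whitespace tokenization explodes the vocabulary, and subword tokenization is one common remedy. Below is a minimal byte-pair-encoding (BPE) sketch in pure Python; it illustrates the idea only and is not the actual Vaaku2Vec pipeline, and the toy English corpus stands in for real Malayalam Wikipedia text:

```python
from collections import Counter

def get_pair_counts(words):
    # words: dict mapping tuple-of-symbols -> frequency
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # replace every adjacent occurrence of `pair` with the merged symbol
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(corpus_words, num_merges):
    # start from individual characters and greedily merge the most frequent pair
    words = dict(Counter(tuple(w) for w in corpus_words))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges

# toy corpus; a real run would use words from the Malayalam Wikipedia dump
merges = learn_bpe(["low", "low", "lower", "lowest"], 4)
```

Learned merges like these let frequent stems and suffixes become single tokens, which is exactly what an agglutinative language needs.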

We also trained word2vec models for Malayalam, Tamil, and Bangla, along with text classifiers built on top of them. The word2vec models are publicly available for experimentation.
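
A typical first experiment with released word vectors is a nearest-neighbour query by cosine similarity. The sketch below uses hypothetical toy 3-dimensional vectors as stand-ins (the real embeddings have far more dimensions and would normally be loaded with a library such as gensim):

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def most_similar(vectors, query, topn=1):
    # rank every other word by similarity to the query word
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:topn]

# illustrative toy vectors, not values from the released model
vectors = {
    "നദി": [0.9, 0.1, 0.0],    # "river"
    "പുഴ": [0.85, 0.15, 0.05], # "river" (synonym)
    "കാർ": [0.0, 0.2, 0.95],   # "car"
}

neighbours = most_similar(vectors, "നദി")
```

With well-trained embeddings, synonyms such as നദി and പുഴ end up close together, which is the property the downstream classifiers exploit.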

We wrote a custom multi-threaded web scraper to crawl Indian news websites and build a news corpus, which is also free to download and use.
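
The scraper itself is not reproduced here, but the multi-threaded pattern can be sketched with the standard library alone. The fetch function below is a stub that returns canned pages so the example runs offline; a real crawler would issue HTTP requests (e.g. with urllib) and respect robots.txt:

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    # collects the text inside <title>...</title>
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def fetch(url):
    # stand-in for a real HTTP request; canned pages keep the sketch offline
    pages = {
        "https://news.example/1": "<html><title>Kerala floods update</title></html>",
        "https://news.example/2": "<html><title>Sports roundup</title></html>",
    }
    return pages[url]

def scrape_title(url):
    parser = TitleParser()
    parser.feed(fetch(url))
    return url, parser.title

def crawl(urls, workers=8):
    # the thread pool overlaps network waits, which dominate scraping time
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(scrape_title, urls))

corpus = crawl(["https://news.example/1", "https://news.example/2"])
```

Threads suit this workload because each worker spends most of its time waiting on the network rather than holding the GIL.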

Our work was praised by Jeremy Howard, Sebastian Ruder and Smerity on Twitter. We also won the FOSS contribution award from ICFOSS for our work on MalayalamNLP.

We presented our work at Chennaipy and Kochi Python, and the talk was well received.

Talk Outline:

  • What is natural language processing? (5 mins)
  • Why is NLP hard? (5 mins)
  • Our work so far (5 mins)
  • What do we need to do next? (5 mins)

What will the audience learn?

  • A brief introduction to NLP
  • How to apply NLP to their mother tongue
  • How to join and contribute to IndicNLP :)

Future Plans

At IndicNLP, we try to build language models, tools, and techniques for NLP on Indian languages and use them to solve NLP tasks. We also try to consolidate and curate all open-source work and data on NLP for Indian languages. We run a discussion forum to engage interested people and a preliminary data-tagging web app for translation tasks. We are a group of language enthusiasts working towards linguistic equality in the digital arena. Our plan is to build a community of people who would like to bring Indian languages into the fields of NLP and deep learning. Let's work together and teach the machines our tongue.

Speaker Info:

Kamal Raj:
He works primarily on Natural Language Processing, mainly building NLP models in the medical domain using transfer learning. He has open-sourced projects on named entity recognition, question answering, and language modeling. He has two publications: "PHI Scrubber: A Deep Learning Approach" and "Pre-trained BioBERT with Attention Visualisation for Medical Natural Language Inference", the latter accepted at ACL 2019.

Adam Shamsudeen:
Interested in solving real-life problems with code. Founder and ex-CEO of MindHack Innovations. Currently working on using the latest deep learning innovations to speed up clinical trials.

Selva Kumar:
He enjoys painting and 3D modeling, and is passionate about linguistics, languages, and culture in general. He is interested in AGI and is a proud free-software evangelist. He has two publications: "An Attentive Sequence Model for Adverse Drug Event Extraction from Biomedical Text" and "Compositional Attention Networks for Interpretability in Natural Language Question Answering".

The three of us work as Research Engineers at Saama Technologies, Chennai.

Id: 1245
Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: