IndicNLP - An open data platform to bring Indian languages to the advancements of NLP.
Adam Shamsudeen (~adamshamsudeen) |
Language is not merely a tool for communication; it is the conduit for culture and history. Indian languages are complex and pose unique challenges for NLP practitioners.
India is a country of diversity in geography, culture and, indeed, language. There are 22 official languages and many more regional languages with numerous dialects spread across the subcontinent. Every language deserves to live through the end of time, as do the culture and history of the people who speak it. They deserve to be preserved in all their forms, in sounds and in scripts and in whatever forms we may come up with in the future. The only way to ensure that is to enrich this diversity even more.
Technology has given rise to specific markets, and these markets largely use a single language for commerce and communication: English. English is the language supported by the majority of open-source systems, as well as by tools from companies like Google, Facebook and Microsoft for speech-to-text, syntactic parsing, stemming, lemmatization, tokenization, part-of-speech tagging and more. The language's dominance grew out of years of data collection, research and available computing power, which in turn created a big market for these companies to invest in and build systems for it. Amidst this, technology serves as the only tool that can help preserve and maintain the lifespan of colloquial languages.
We trained a Malayalam language model (Vaaku2Vec) on the Wikipedia article dump from October 2018, which contained 55k+ articles. The main difficulty in training a Malayalam language model is tokenization, since Malayalam is a highly inflectional and agglutinative language. The language model was then used to train a classifier that sorts news into five categories (India, Kerala, Sports, Business, Entertainment); it achieved 92% accuracy on the classification task. We have since trained language models for Tamil and Bangla and released text classifiers along with them.
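The proposal does not spell out how Vaaku2Vec tokenizes Malayalam, but for highly inflectional, agglutinative languages a common approach is subword tokenization, e.g. byte-pair encoding (BPE) as popularized by tools like SentencePiece. The sketch below is a toy, pure-Python illustration of the BPE idea; the corpus (transliterated inflections of the Malayalam word "veedu", house) and the merge count are made up for illustration, not taken from the actual Vaaku2Vec pipeline.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a list of words: start from characters,
    then repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Apply the learned merges, in order, to split a word into subwords."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Toy corpus: inflections sharing the stem "vee(tt)-"
corpus = ["veedu", "veettil", "veettile"]
merges = learn_bpe(corpus, num_merges=4)
print(segment("veettile", merges))  # shared stem emerges as one subword
```

Because the merges are learned from frequency alone, the shared stem surfaces as a single subword while the inflectional suffixes stay separate, which is exactly what a word-level tokenizer cannot do when almost every inflected form is a distinct "word".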
We presented our work at Chennaipy and Kochi Python, and the talk was well received.
- What is Natural Language Processing? (5 mins)
- Why is NLP hard? (5 mins)
- Our work so far (5 mins)
- What do we need to do? (5 mins)
What will the audience learn?
- A brief introduction to NLP
- How to apply NLP to your mother tongue
- How to join and contribute to IndicNLP :)
At IndicNLP, we build language models, tools and techniques for NLP on Indian languages and use them to solve NLP tasks. We also consolidate and curate open-source work and data in NLP on Indian languages. We have a discussion forum to engage interested people and a preliminary data-tagging web app for a translation task. We are a group of language enthusiasts working towards linguistic equality in the digital arena. Our plan is to build a community of people who would like to bring Indian languages to the field of NLP and deep learning. Let's work together and teach the machines our tongue.
He primarily works on Natural Language Processing, mainly building NLP models in the medical domain using transfer learning. He has open-sourced projects on Named Entity Recognition, Question Answering and Language Modeling. He has two publications: PHI Scrubber: A Deep Learning Approach, and Pre-trained BioBERT with Attention Visualisation for Medical Natural Language Inference, which has been accepted at ACL 2019.
Interested in solving real-life problems using code. Founder and Ex-CEO of MindHack Innovations. Currently working on using the latest deep learning innovations to speed up clinical trials.
He enjoys painting and 3D modeling, is passionate about linguistics, languages and culture in general, and is interested in AGI. He is also a proud free-software evangelist. He has two publications: An Attentive Sequence Model for Adverse Drug Event Extraction from Biomedical Text, and Compositional Attention Networks for Interpretability in Natural Language Question Answering.
The three of us work as Research Engineers at Saama Technologies, Chennai.