Lessons from Open-Tamil Library for Indian Language Applications
Muthiah Annamalai (~arcturusannamalai)
The first 20 years of Indian languages on the Internet were spent debating encoding schemes and editors, leaving little attention for application development. India, with its rich ethno-linguistic history, needs to preserve and grow this heritage in the digital space. We believe this can be done only by writing novel and useful applications specific to each language.
Proposed Content of the Talk
One such effort is the open-tamil library, designed for building high-level applications in Tamil. Open-Tamil is a freely available package for Python 2 (and Python 3) to process Tamil text, and then some. We believe the lessons learnt from Open-Tamil can be reused for other Indian languages on the web.
Users may manipulate Tamil text at the letter level instead of worrying about encodings and code points, leading to a better separation of design and detail.
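To make the letter-versus-code-point distinction concrete, here is a minimal sketch using only the Python 3 standard library. It is not Open-Tamil's implementation; the helper name `split_letters` is hypothetical, and the rule it applies (attach every Unicode combining mark to the preceding base character) is a simplifying assumption that happens to work for ordinary Tamil words.

```python
import unicodedata


def split_letters(text):
    """Group code points into letters: a base character followed by
    any combining marks (vowel signs, pulli) that attach to it."""
    letters = []
    for ch in text:
        # Unicode categories Mn/Mc/Me are combining marks.
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


# The word "Tamil" (தமிழ்), written with escapes so the code points are explicit.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
letters = split_letters(word)

print(len(word))     # 5 code points
print(len(letters))  # 3 letters
print(letters)
```

An application that indexes, truncates, or counts with `len(word)` is working on code points; one that works on the list of letters matches what a Tamil reader sees on the page.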
The API allows for streaming algorithms on canonical Tamil data, independent of encodings. We show data analysis using Tamil Wikipedia, and illustrate various computational linguistics applications.
- convert numbers -> numerals
- reverse words
- build tries
- built-in dictionary
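Two of the applications listed above, reversing words and building tries, can be sketched on top of letter-level splitting. The code below is an illustration in plain Python 3 and does not use Open-Tamil itself; `split_letters`, `reverse_word`, and `LetterTrie` are hypothetical names introduced here, and the splitting rule (combining marks attach to the preceding base character) is a simplifying assumption.

```python
import unicodedata


def split_letters(text):
    """Group code points into letters (base character + combining marks)."""
    letters = []
    for ch in text:
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def reverse_word(word):
    """Reverse a word letter by letter, keeping each vowel sign or pulli
    attached to its consonant instead of reversing raw code points."""
    return "".join(reversed(split_letters(word)))


class LetterTrie:
    """A trie whose edges are Tamil letters rather than code points."""

    _END = None  # key marking the end of a stored word

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for letter in split_letters(word):
            node = node.setdefault(letter, {})
        node[self._END] = True

    def _walk(self, text):
        node = self.root
        for letter in split_letters(text):
            if letter not in node:
                return None
            node = node[letter]
        return node

    def __contains__(self, word):
        node = self._walk(word)
        return node is not None and self._END in node

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


tamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"         # தமிழ்
tamilan = "\u0ba4\u0bae\u0bbf\u0bb4\u0ba9\u0bcd"  # தமிழன்

print(reverse_word(tamil))

trie = LetterTrie()
trie.add(tamil)
trie.add(tamilan)
print(tamil in trie)                    # True
print(trie.has_prefix("\u0ba4\u0bae\u0bbf"))  # True: both words start with த, மி
```

Because the trie is keyed on letters, ழ் (with pulli) and ழ (without) are distinct edges, so the two sample words share only the two-letter prefix even though they share four code points.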
Open-Tamil has the distinction of being available for both Python 2 and Python 3. It is continuously tested and released via the Python Package Index.
As a community-developed effort, and given the proximity of the various Indian languages, we believe Open-Tamil can form a prototype open-source toolbox for other Indian languages.
We have built a framework, and we can extend it to other Indian languages.
Ref: "Open Tamil Text Processing Tools," M. Annamalai, T. Shrinivasan, M. Annamalai, (INFITT-2014), Puducherry, India.
$ pip install open-tamil
$ pip3 install open-tamil (for Py3k)
UTF-8 capable editor
Interest in Indian Languages
- pip url - Open-Tamil v0.4 : https://pypi.python.org/pypi/Open-Tamil
- Github page - https://github.com/arcturusannamalai/open-tamil
- Slide Share - http://www.slideshare.net/ezhillang/opentamil-text-processing-library
Muthu created the open-tamil project out of interest in providing a high-quality foundational piece for Tamil computing. He can be reached on Twitter @ezhillang and on GitHub @arcturusannamalai.
Open-Tamil is maintained in conjunction with development efforts of T. Arulalan, and T. Shrinivasan.
Some of my contributions in Tamil computing can be found under,