Lessons from Open-Tamil Library for Indian Language Applications

Muthiah Annamalai (~arcturusannamalai)



Description:

The first 20 years of Indian languages on the Internet have been spent debating encoding schemes and editors, leaving little attention for application development. India, with its rich ethno-linguistic history, needs to preserve and grow this heritage in the digital space. We believe this can be done only by writing novel and useful applications specific to each language.

Proposed Content of the Talk

  1. One such effort is the open-tamil library, designed for building high-level applications in Tamil. Open-Tamil is a freely available package for Python 2 (and Python 3) to process Tamil text, and then some. We believe the lessons learnt from Open-Tamil can be reused for other Indian languages on the web.

  2. Users can manipulate Tamil text at the letter level instead of worrying about encodings and code points, leading to a better separation of design and detail.

  3. The API allows for streaming algorithms on canonical Tamil data, independent of encodings. We show data analysis using the Tamil Wikipedia, and illustrate various computational linguistics applications:

    1. convert numbers -> numerals
    2. reverse words
    3. build tries
    4. built-in dictionary
    5. spell-checker API
  4. Open-Tamil has the distinction of being available for both Python 2 and Python 3. It is continuously tested and released via the Python Package Index.

  5. As a community-developed effort, and due to the proximity of the various Indian languages, we believe Open-Tamil can serve as a prototype open-source toolbox for other Indian languages.

  6. We have built a framework, and we can extend it to other Indian languages.
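
To make the letter-level idea in point 2 concrete, here is a minimal sketch in Python 3 using only the standard library. It is not Open-Tamil's implementation or API: the helper names `split_letters` and `reverse_word` are hypothetical, and the grouping rule (a base character plus any following Unicode combining marks) is a simplifying assumption that does not treat conjuncts such as க்ஷ as a single letter.

```python
import unicodedata


def split_letters(text):
    """Group code points into letters (hypothetical helper).

    Assumption: a letter is a base character followed by any
    combining marks (Unicode categories Mn/Mc), which covers Tamil
    vowel signs and the pulli but not conjuncts.
    """
    letters = []
    for ch in text:
        if letters and unicodedata.category(ch).startswith("M"):
            # Attach the vowel sign or pulli to the preceding base.
            letters[-1] += ch
        else:
            letters.append(ch)
    return letters


def reverse_word(word):
    """Reverse a word letter by letter, keeping marks with their base."""
    return "".join(reversed(split_letters(word)))


word = "தமிழ்"
print(len(word))            # counts code points, not letters
print(split_letters(word))  # letters as a reader would count them
print(reverse_word(word))   # letter-wise reversal
print(word[::-1])           # naive reversal detaches marks from bases
```

The word தமிழ் has three letters to a reader but five code points, so a naive slice reversal separates vowel signs from their consonants, while the letter-level version keeps each letter intact. The same split is the natural unit for the other applications listed above, such as building tries keyed on letters.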

Ref: "Open Tamil Text Processing Tools," M. Annamalai, T. Shrinivasan, M. Annamalai, (INFITT-2014), Puducherry, India.

Prerequisites:

$ pip install open-tamil

$ pip3 install open-tamil (for Py3k)

UTF-8 capable editor

Interest in Indian Languages

Content URLs:

  1. pip url - Open-Tamil v0.4 : https://pypi.python.org/pypi/Open-Tamil
  2. Github page - https://github.com/arcturusannamalai/open-tamil
  3. Slide Share - http://www.slideshare.net/ezhillang/opentamil-text-processing-library

Speaker Info:

Muthu created the open-tamil project out of an interest in providing a high-quality foundational piece for Tamil computing. He can be reached on Twitter @ezhillang and on Github @arcturusannamalai.

Open-Tamil is maintained in conjunction with the development efforts of T. Arulalan and T. Shrinivasan.

Speaker Links:

Some of my contributions to Tamil computing can be found at:

  1. http://ezhillang.org/#publications
  2. http://ezhillang.org/koodam/play/

Section: Others
Type: Talks
Target Audience: Intermediate