How I ended up maintaining a Python package with 420,000+ downloads

Kurian Benoy (~kurianbenoy)



Description:

This talk is a sequel to Kurian's PyCon India 2023 talk, "OpenAI Whisper and its amazing power to do fine-tuning in my mother tongue". A core part of that talk was how to fine-tune models, together with my Python library for benchmarking ASR for the Malayalam language, malayalam-asr-benchmarking. While building that library, Kurian ended up building another Python library named whisper_normalizer.

The main reason for building whisper_normalizer was that I found the text normalization approach used by Whisper, described in Appendix C (p. 21) of the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by the OpenAI team, to be super useful. At my previous workplace I had written an internal library, and I found myself reusing the same Whisper normalization algorithm again and again across many projects.
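To give a flavour of what this kind of normalization does, here is a simplified standard-library sketch (my own illustration, not the actual code in Whisper or whisper_normalizer): basic-style normalization typically lowercases text, drops bracketed annotations like [noise] or (laughs), strips punctuation and symbols, and collapses whitespace.

```python
import re
import unicodedata

def basic_normalize(text: str) -> str:
    """Sketch of basic-style text normalization for ASR evaluation.

    Lowercases, drops bracketed annotations like [noise] or (laughs),
    removes punctuation/symbol characters, and collapses whitespace.
    """
    text = text.lower()
    text = re.sub(r"[<\[][^>\]]*[>\]]", "", text)  # remove [...] and <...> tags
    text = re.sub(r"\([^)]*\)", "", text)          # remove (...) asides
    # drop punctuation (P*) and symbol (S*) characters, keep letters/digits/marks
    text = "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(("P", "S"))
    )
    return re.sub(r"\s+", " ", text).strip()

print(basic_normalize("Hello, [noise] World! (laughs)"))  # -> hello world
```

Normalizing both the reference transcript and the model output this way keeps metrics like WER from penalizing differences in casing, punctuation, or annotation style rather than actual recognition errors.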

Working on Malayalam speech-to-text benchmarking was the final trigger to stop this nonsense and build a Python package with the nbdev framework. To be honest, my previous talk didn't have even one slide about this package, as I felt I was just solving a trivial problem for myself. It turns out I was solving it not just for myself, but for a lot of other people. Looking back, it now seems like a good problem to have.

Fast forward to December 2023: I noticed the GitHub project had suddenly gathered 30+ stars. That surprised me, and before long I realized some months were seeing 50K+ downloads. Downloads have kept climbing ever since. At the time of writing this proposal, the download count for my package is as shown in the tweet; it is growing fast, and we expect to hit 500K+ downloads by the time this talk is presented.

Maybe the moral of the story is that Kurian doing niche work in Malayalam, which almost no one cared about, ended up with him maintaining this nice Python package with a lot of downloads.

We have made incremental improvements to the project's text normalization for Indian languages, after realizing that Whisper's BasicTextNormalizer is a bad idea for Indian languages. Why this is such an important problem is documented by Dr. Kavya in her blog post on the issues with BasicTextNormalizer. My colleague Abhigyan worked extensively on Indian text normalization during his Master's course, and we will discuss that in this talk as well.
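The core of the problem can be seen with a small standard-library sketch (my own simplified illustration of the diacritic-stripping step, not the exact code in Whisper or whisper_normalizer): basic-style normalizers decompose text and drop Unicode combining marks, which is harmless for Latin-script accents but destructive for Indic scripts, where vowel signs and the virama are themselves combining marks.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Sketch of the diacritic-stripping step in basic-style normalizers:
    NFKD-decompose, then drop combining marks (Unicode category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Harmless for Latin-script accents:
print(strip_diacritics("café"))   # -> cafe

# But in Malayalam, the virama (chandrakkala) is a combining mark too,
# so stripping it changes the word itself:
print(strip_diacritics("അമ്മ"))    # "amma" loses its virama -> "അമമ"
```

Because WER/CER are computed on the normalized strings, this kind of silent corruption skews ASR benchmark numbers for Indian languages, which is why language-aware normalization matters.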

Talk Outline

  1. About whisper_normalizer, which now has 450K+ downloads (by the time of the talk, I hope to cross 500K+)
  2. Origin of whisper_normalizer
  3. What is Text normalization?
  4. What does the Python package do?
  5. Why working on text normalization for Indian languages matters
  6. Complexities of Indian Text normalization
  7. Conclusion

Prerequisites:

  • No pre-requisites.

Speaker Info:

About Kurian Benoy

Kurian works as an ML Engineer on the Speech Team building full-stack GenAI at Sarvam.ai. He has contributed to various open source organizations such as Swathanthra Malayalam Computing, FOSSASIA, Keras, DVC, HuggingFace, fast.ai, and CloudCV.

Kurian is the maintainer of the whisper_normalizer Python package and of various Malayalam ASR models such as MalWhisper and VegamWhisper. He is also a core developer of the indicsubtitler website, an open source subtitling platform for transcribing and translating videos/audio in Indic languages.

More details about his previous talks can be found via the links below.

About Abhigyan Raman

Abhigyan is a founding ML Engineer at Sarvam.ai. He previously worked at AI4Bharat and was a Master's student at IIT Madras before dropping out to pursue his passion.

His previous talk: Role of Language Models - Abhigyan, IIT Madras

Speaker Links:

Kurian Benoy

Abhigyan Raman

Section: Artificial Intelligence and Machine Learning
Type: Talk
Target Audience: Intermediate