Language Model (Text Analysis) using Python from scratch

Divya Choudhary (~divya798)




What is a Language Model?

A language model (LM) is basically a way to determine how likely a certain sentence is in a language. "You are reading my LM write-up now" is more likely to be said than “Now you are my LM reading write up”, even though both sentences contain only correct English words; and the sentence "I had ice-cream with a" is more likely to end with "spoon" than with "banana". An LM helps impart this understanding of a language to machines.

What’s the need?

"Computers are incredibly fast, accurate and stupid; humans are incredibly slow, inaccurate and brilliant; together they are powerful beyond imagination." (attributed to Albert Einstein)

Computers don’t understand our language! All they are programmed to understand are very specific instructions. The languages we speak are much more complex than that. You can say one thing in multiple ways, for example "where do I go for a party tonight?" and "could you give me the name of the best restaurant near me?"; this is called language variability. As if that were not enough of a burden to translate for computers, sometimes you say something that can have several meanings, like "Look at the dog with one eye"; this is called language ambiguity. A human being usually understands the correct meaning from the context of the conversation. A computer... doesn't really.

A lot of amazing work has already been done in the field, with Siri autocompleting what you forget to type and Google responding to your “okay Google” calls. That said, there is still immense room for research on making these models more intelligent, be it in disambiguation, intent understanding etc. It all starts from a language model.


Language models are broadly of two types:

Statistical LM: A language model is formalized as a probability distribution over sequences of strings (words). Traditional methods usually involve making an n-th order Markov assumption and estimating n-gram probabilities via counting and subsequent smoothing (Chen and Goodman 1998). Count-based models are simple to train, but the probabilities of rare n-grams can be poorly estimated due to data sparsity (despite smoothing techniques).
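To make the counting idea concrete, here is a minimal sketch of the usual formulation, written for an n-gram model that conditions each word on the previous n-1 words. The notation (w_i for the i-th word, C(·) for a corpus count, V for the vocabulary) is assumed here for illustration, and add-one smoothing is shown only as the simplest example of smoothing, not as the recommended technique.

```latex
% Chain rule, followed by the n-gram (Markov) approximation
\[
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
\approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
\]

% Count-based (maximum likelihood) estimate of each factor
\[
P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) =
\frac{C(w_{i-n+1}, \dots, w_{i-1}, w_i)}{C(w_{i-n+1}, \dots, w_{i-1})}
\]

% Add-one smoothing: unseen n-grams no longer get probability zero
\[
P_{\text{add-1}}(w_i \mid w_{i-n+1}, \dots, w_{i-1}) =
\frac{C(w_{i-n+1}, \dots, w_{i-1}, w_i) + 1}{C(w_{i-n+1}, \dots, w_{i-1}) + |V|}
\]
```

The second formula shows where data sparsity comes from: an n-gram that never occurs in the corpus has a count of zero, so its unsmoothed probability is zero no matter how plausible the phrase is.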

Neural LM: The use of neural networks in the development of language models has become very popular, to the point that it may now be the preferred approach. Language models built on neural networks are often called Neural Network Language Models, or NNLMs for short. Neural network approaches are achieving better results than classical methods, both as standalone language models and when incorporated into larger models for challenging tasks like speech recognition and machine translation.

What does it take to build a Statistical LM in Python?

More than anything, we need a corpus large enough to contain the many possible variations, a good model :D and all the steps mentioned below.

Steps needed for building a language model (will also be the flow of the talk along with implementation code):

  • Read corpus:

    • from scratch or using the pandas library
  • Tokenize:

    • from scratch, depending on the kind of corpus we are dealing with, or using methods from the NLTK library
  • Generate n-grams from corpus:

    • from scratch (implementing the n-gram generation logic ourselves)
    • using the NLTK library
  • Sense check: check for unwanted/extra characters in words and remove them

  • Probability computations:

    • from scratch, by implementing our own logic for finding and comparing probabilities at each stage of phrase generation
    • using libraries like numpy or scipy
  • Generate phrases using the n-grams and the probability computation logic (can be an implementation based on HMM, MLE etc.)

    • using different libraries like NLTK, sklearn
  • Refinement for edge cases / externally adding more logic to the data for better results: based on the type of corpus and the business logic to be considered
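The "from scratch" route through the steps above can be sketched roughly as follows. This is only an illustrative sketch, not the talk's actual implementation: the function names (`tokenize`, `generate_ngrams`, `train`, `probability`, `generate`), the three-sentence toy corpus, the `<s>`/`</s>` padding markers and the greedy "pick the most frequent next word" rule are all assumptions made for the example. Probabilities are plain maximum-likelihood estimates with no smoothing.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep word tokens (inner hyphens/apostrophes allowed).

    Dropping punctuation here doubles as a very basic "sense check".
    """
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def generate_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train(sentences, n=2):
    """For each (n-1)-word context, count how often each next word follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad so that sentence starts and ends are modelled as well.
        tokens = ["<s>"] * (n - 1) + tokenize(sentence) + ["</s>"]
        for gram in generate_ngrams(tokens, n):
            counts[gram[:-1]][gram[-1]] += 1
    return counts


def probability(counts, context, word):
    """Maximum-likelihood estimate: count(context + word) / count(context)."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    return followers[word] / total if total else 0.0


def generate(counts, n=2, max_words=10):
    """Greedily extend a phrase with the most frequent next word."""
    context = ("<s>",) * (n - 1)
    words = []
    for _ in range(max_words):
        followers = counts.get(context)
        if not followers:
            break
        word = followers.most_common(1)[0][0]
        if word == "</s>":
            break
        words.append(word)
        # Slide the context window forward by one word.
        context = (context + (word,))[1:]
    return " ".join(words)


# Toy corpus (made up for this example); a real corpus would be read from files.
corpus = [
    "I had ice-cream with a spoon",
    "I had soup with a spoon",
    "I had ice-cream with a banana",
]
counts = train(corpus, n=2)

print(probability(counts, ("a",), "spoon") > probability(counts, ("a",), "banana"))  # True
print(generate(counts, n=2))  # i had ice-cream with a spoon
```

Greedy selection always produces the same phrase for a given corpus; sampling the next word in proportion to its probability is a common alternative when more varied output is wanted.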

Sample Use cases

  1. Autocorrect
  2. Automatic summarization
  3. Automated reply to emails
  4. Spell Corrector (Grammarly)


Prerequisites:

  • Basic idea of NLP
  • Concept of tokenization, lemmatization etc.
  • A quick skim-read of n-gram modeling (if possible; else what use will I be :P)
  • Basic Python coding and familiarity with the scikit-learn and NLTK libraries

Speaker Info:

Data Scientist with ~4 years of experience. For more info, please pay a visit to my LinkedIn.

Id: 996
Section: Data science
Type: Talks
Target Audience: Intermediate