Natural Language Generation with Python
Jaidev Deshpande (~jaidev)
A picture is worth a thousand words. Sometimes.
Take a good look at this infographic (original post)
Now try to remember what your thoughts were as you were reading it. To an experienced audience, a few things would have stood out immediately:
- It's a scatterplot.
- Each point represents a GoT character.
- For each character, it shows how many episodes they appeared in and how much screen time they averaged per episode.
By this time, most viewers are already too bored - or are too distracted - to notice the real insight of the infographic, which is that Ned Stark dominates Game of Thrones by screen time, even though he was killed off in the ninth episode!
Even to the trained eye, even to those who do data visualization for a living, charts can be difficult to read. There is an entire industry around data interpretation, which is full of people whose sole job is to look at charts and come up with an explanation of what they see. The primary consumers of visualizations are such professionals. Unsurprisingly, the people who make decisions are almost never the consumers of visualizations - because visualizations are descriptive, and decision makers need information in a format that is prescriptive - something that recommends actions.
Natural language generation (NLG) is a nascent, but very real technology that is filling this gap between raw data and consumable, actionable insights. Numerous studies have demonstrated that people are more likely to pay attention when information is narrated instead of simply displayed. At Gramener, our motto is "Insights as stories" - thus, natural language generation is an obvious technology for us to invest in.
This talk is about how we built an open source natural language generation framework with little more than pandas and spaCy.
This is a strictly shallow-learning problem. I really wish I could tell you that we used deep learning to generate text from raw data, but unfortunately there wasn't enough training data for deep learning to work. Thus, in order to solve it with relatively little data, we broke the problem down into smaller, shallower problems. The NLG pipeline is broken into three major modules:
Autolysis (automatic exploratory data analysis) - most of the EDA that is usually carried out on structured data can be automated. Every insight that comes from a dataset is the result of one of a finite number of exploratory analyses. The NLG framework automates this by running the dataset through a set of statistical models and deciding whether each resulting insight is significant enough to be narrated.
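To make the idea concrete, here is a minimal sketch of automated EDA - not the framework's actual API, just the pattern it describes: run the data through a battery of tests (here, only pairwise Pearson correlations via SciPy) and keep the results that clear a significance threshold. The `autolyse` function and the toy dataset are hypothetical.

```python
import pandas as pd
from scipy import stats

def autolyse(df, alpha=0.05):
    """Run pairwise correlation tests on numeric columns and keep
    only the results significant enough to narrate. (Illustrative sketch.)"""
    numeric = df.select_dtypes("number")
    insights = []
    cols = list(numeric.columns)
    for i, x in enumerate(cols):
        for y in cols[i + 1:]:
            r, p = stats.pearsonr(numeric[x], numeric[y])
            if p < alpha:
                insights.append((x, y, round(r, 2)))
    return insights

# Toy data loosely inspired by the infographic: episode counts
# and screen time per character.
df = pd.DataFrame({
    "episodes": [1, 2, 3, 4, 5, 6, 7, 8],
    "screen_time": [2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 14.2, 15.8],
    "noise": [3, 1, 4, 1, 5, 9, 2, 6],
})
print(autolyse(df))
```

A real system would run many such analyses (group-wise aggregates, outlier tests, trend fits) and rank the surviving insights before narrating them.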
Intent classification - This module deals with answering natural language questions about the data. Every question that can be posed to a dataset can be classified into one of a collection of intents, and each intent in turn dictates the text generation heuristics.
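The classification step can be sketched with a plain sklearn text pipeline. The intent labels, training questions, and model choice below are all hypothetical; the real framework also uses spaCy, but a TF-IDF bag-of-words with logistic regression is enough to show the shape of the problem.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set: questions labelled with made-up intents.
questions = [
    "which character has the most screen time",
    "who appears in the most episodes",
    "is screen time correlated with episode count",
    "does appearing more mean more screen time",
    "what is the average screen time per episode",
    "what is the mean number of episodes",
]
intents = ["extreme", "extreme", "correlation",
           "correlation", "aggregate", "aggregate"]

# TF-IDF features fed into a linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(questions, intents)
print(clf.predict(["who has the highest screen time"]))
```

The predicted intent then selects which analysis to run and which narration heuristics apply.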
Text generation - This is where most of the NLP comes into play. The text generation module accepts a structured object containing the results of the previous stages and renders it in a specified context. The context here can be an intent or the data itself.
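At its simplest, rendering a structured result looks like template selection plus formatting. The sketch below is illustrative only - the template strings, the `render` helper, and the insight object are invented for the example, and the actual framework does more linguistic work (with spaCy) than plain string formatting.

```python
# Templates keyed by intent; placeholders are filled from the
# structured result of the earlier stages.
TEMPLATES = {
    "extreme": "{subject} leads by {metric}, with {value} {unit}.",
    "correlation": "{x} and {y} are {strength} correlated (r = {r}).",
}

def render(insight):
    """Pick a template by intent and fill it from the insight's data."""
    template = TEMPLATES[insight["intent"]]
    return template.format(**insight["data"])

insight = {
    "intent": "extreme",
    "data": {"subject": "Ned Stark", "metric": "screen time",
             "value": 9, "unit": "minutes per episode"},
}
print(render(insight))
# → Ned Stark leads by screen time, with 9 minutes per episode.
```

Handling inflection, agreement, and varied phrasing is where a proper NLP library earns its keep over raw templates.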
Outline of the Talk
- Motivation for NLG (2 minutes)
- The NLG Problem - How do humans interpret data? (5 minutes)
- Components of the NLG Framework
  - Autolysis with pandas and SciPy (5 minutes)
  - Intent classification with spaCy & sklearn (3 minutes)
  - Text generation with spaCy (5 minutes)
- Extending NLG to deep learning
  - Framework for collecting training data (2 minutes)
  - Converting natural language to structured queries and analysis code (2 minutes)
  - Table-to-sequence models for generating narratives on structured data (2 minutes)
Prerequisites
- Basics of exploratory data analysis
- NLP terminology
To be added.
I'm a data scientist based in New Delhi, India. I currently work at Gramener, where I build data science products for other data scientists. My research interests are in signal processing and computational harmonic analysis. I'm obsessed with applications of machine learning in personal productivity and recommendation systems. I blog about these here.