Demystifying Natural Language Processing using Python (Scikit-Learn/ Keras)

Vaibhav Srivastav (~Vaibhavs10)



It can be difficult to figure out how to work with text in scikit-learn, even if you’re already comfortable with the scikit-learn API. Many questions immediately come up: Which vectorizer should I use, and why? What’s the difference between a “fit” and a “transform”? What’s a document-term matrix, and why is it so sparse? Is it okay for my training data to have more features than observations? What’s the appropriate machine learning model to use? And so on…

In this tutorial, we’ll answer all of those questions, and more! We’ll start by walking through the vectorization process in order to understand the input and output formats. Then we’ll read a simple dataset into pandas, and immediately apply what we’ve learned about vectorization. We’ll move on to the model building process, including a discussion of which model is most appropriate for the task. We’ll evaluate our model a few different ways, and then examine the model for greater insight into how the text is influencing its predictions. Finally, we’ll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance.
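To give a flavour of the vectorization step, here is a minimal sketch of "fit" versus "transform" using scikit-learn's `CountVectorizer`. The toy sentences and variable names are assumptions made for illustration and are not taken from the tutorial's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (an assumption for illustration only).
train_docs = ["call you tonight", "call me a cab", "please call me please"]
test_docs = ["please don't call me"]

vect = CountVectorizer()

# "fit" learns the vocabulary from the training text only.
vect.fit(train_docs)

# "transform" uses that fixed vocabulary to build a document-term matrix:
# one row per document, one column per vocabulary term, stored sparsely.
train_dtm = vect.transform(train_docs)
test_dtm = vect.transform(test_docs)

# Vocabulary terms in column order (columns are sorted alphabetically).
vocab = sorted(vect.vocabulary_)

print(vocab)
print(train_dtm.toarray())
print(test_dtm.toarray())
```

Note that words in the test document that were never seen during the fit (here, "don") are simply dropped, and that the default tokenizer ignores single-character tokens such as "a".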


By the end of this tutorial, attendees will be able to confidently build a predictive model from their own text-based data, including feature extraction, model building and model evaluation.

Detailed Outline

  1. Model building in scikit-learn (refresher) - 10 Minutes
  2. Representing text as numerical data - 15 Minutes
  3. Reading a text-based dataset into pandas - 5 Minutes
  4. Vectorizing our dataset - 10 Minutes
  5. Building and evaluating a model - 20 Minutes
  6. Comparing models - 20 Minutes
  7. Examining a model for further insight - 10 Minutes
  8. Practicing this workflow on another dataset - 30 Minutes
  9. Tuning the vectorizer (discussion) - 10 Minutes
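As a rough preview of steps 4 and 5, the sketch below vectorizes a tiny labelled dataset, fits a classifier and scores it on held-out text. The messages and labels are made up for illustration, and the choice of Multinomial Naive Bayes is an assumption; the tutorial itself discusses which model is most appropriate.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Hypothetical toy data (an assumption, not the tutorial's dataset).
X_train = [
    "win free prize now",
    "free cash win",
    "claim your free prize",
    "see you at lunch",
    "lunch meeting at noon",
    "dinner or lunch today",
]
y_train = ["spam", "spam", "spam", "ham", "ham", "ham"]
X_test = ["free prize", "lunch at noon"]
y_test = ["spam", "ham"]

# Learn the vocabulary on the training text, then build both matrices.
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# Fit the model on the training document-term matrix.
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

# Predict classes for the held-out documents and measure accuracy.
y_pred = nb.predict(X_test_dtm)
accuracy = metrics.accuracy_score(y_test, y_pred)

print(y_pred)
print(accuracy)
```

A dataset this small says nothing about real-world performance; it only shows the shape of the workflow that the later sections apply to real data.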


Prerequisite Knowledge

Attendees of this tutorial should be comfortable working in Python, should understand the basic principles of machine learning, and should have at least basic experience with both pandas and scikit-learn. However, no knowledge of advanced mathematics is required.

Required Software

Attendees will need to bring a laptop with scikit-learn and pandas (and their dependencies) already installed. Installing the Anaconda distribution of Python is an easy way to accomplish this. Both Python 2 and 3 are welcome.

I will be leading the tutorial using the IPython/Jupyter notebook, and have added a pre-written notebook to this repository. I have also created a Python script that is identical to the notebook, which you can use in the Python environment of your choice.

Speaker Info:

Hi! I am a Data Scientist at Deloitte Consulting LLP, where I work with Fortune Technology 10 clients to help them make data-driven (read: profitable) decisions. Prior to this, I worked with startups across India to build Social Media Analytics Dashboards, Chatbots, Recommendation Engines and Forecasting Models.

My core interests lie in Natural Language Processing, Machine Learning/Statistics and Product Development.

In my free time I give talks and participate in local PyData/PyUserGroup meetups. I have previously spoken at the Gartner Data and Analytics Summit, PyCon India, PyCon APAC (Philippines), PyCon Korea, PyCon Malaysia and Google Cloud Summit!

If Data is what floats your boat, then coffee is on me! :D

Id: 1288
Section: Data Science, Machine Learning and AI
Type: Workshop
Target Audience: Beginner
Last Updated: