Building an automatic keyphrase extraction system using NLTK in Python

Prastut Kumar (~prastut)




Ever wondered how Google search shows the most relevant results first even though your query also matches the results on page 2 (in a nutshell, how the famous Google PageRank works), how your post gets automatically categorised on Quora, or how Medium groups articles into clusters based on each article's context?

Extraction of important topical words and phrases from documents, commonly known as terminology extraction or automatic keyphrase extraction, is a hot topic in research. It is one of the crucial tasks in natural language processing for automatically extracting structured information from unstructured (text) datasets. Keyphrases provide a concise description of a document's content. They are useful for document search, clustering, categorization, and summarization, and they help in building a content-based recommendation system because they let you quantify semantic similarity with other documents. Since we are producing more raw data now than ever, clustering and contextualizing data is becoming a more difficult task. With the help of Python, and specifically NLTK, it becomes a tad easier.

My talk will cover the methodology, keyphrase selection (unsupervised and supervised methods), and algorithms that help us quantify weights relative to the document corpus, followed by step-by-step guidance on building a decent keyphrase extraction system using NLTK in Python.
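To give a flavour of what "quantifying weights relative to the document corpus" can mean, here is a minimal sketch of one common unsupervised approach, TF-IDF scoring over candidate phrases. It is only an illustration under stated assumptions, not necessarily the method used in the project: the function names (`candidates`, `tfidf_keyphrases`), the tiny stopword list, and the regex tokenizer are all hypothetical stand-ins, and the sketch uses only the standard library so it runs without downloading any NLTK data. In an NLTK-based pipeline, the tokenizer and stopword list would likely be replaced by NLTK's own.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "to",
             "is", "are", "for", "with", "from"}


def candidates(text, max_len=3):
    """Return candidate phrases: every 1..max_len word n-gram taken from
    runs of consecutive non-stopword tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    phrases = []
    run = []
    for tok in tokens + [None]:  # the trailing None flushes the final run
        if tok is not None and tok not in STOPWORDS:
            run.append(tok)
            continue
        for n in range(1, max_len + 1):
            for i in range(len(run) - n + 1):
                phrases.append(" ".join(run[i:i + n]))
        run = []
    return phrases


def tfidf_keyphrases(docs, doc_index, top_k=3):
    """Rank the candidate phrases of docs[doc_index] by TF-IDF.

    tf  = phrase count / total candidate count in the target document
    idf = log(number of documents / number of documents containing the phrase)
    """
    counts_per_doc = [Counter(candidates(d)) for d in docs]
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())  # each document counts once per phrase

    target = counts_per_doc[doc_index]
    total = sum(target.values())
    scores = {
        phrase: (count / total) * math.log(len(docs) / doc_freq[phrase])
        for phrase, count in target.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    corpus = [
        "keyphrase extraction finds keyphrase candidates",
        "graph ranking of the web",
        "web search ranking",
    ]
    for phrase, score in tfidf_keyphrases(corpus, 0):
        print(f"{score:.3f}  {phrase}")
```

A phrase that appears in every document gets an IDF of zero, so it drops to the bottom of the ranking. That is the sense in which the weight is relative to the corpus rather than to a single document.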

This project is part of my GSoC work.


A basic understanding of Python (list comprehensions, classes, functions, loops). You need no prior knowledge of NLTK or natural language processing as a whole, as I will go into depth while explaining the project. An interest in such research fields will help you enjoy the talk.

Content URLs:

Speaker Info:

I am a passionate designer + hacker by nature, currently working as a Google Summer of Code student @ Teem, Berkman Center for Internet and Society, Harvard University. I am in my junior year pursuing CS at VIT, Chennai Campus. Apart from that, I am busy cultivating an open source culture at my university.

Speaker Links:

  • Github:
  • Website:
  • Design portfolio:
  • LinkedIn:
  • Quora:

Section: Data Visualization and Analytics
Type: Talks
Target Audience: Beginner
Last Updated:

You have written "This workshop" in your first line but categorized it as a talk. Is it supposed to be a talk or a workshop?


I was confused between a workshop and a talk, since the way I see it, my proposal lies somewhere between the two. Still, thanks for pointing this out. I will edit it.

Prastut Kumar (~prastut)

It seems more apt for a workshop to me! Btw, nice proposal, looking forward to it :)


@Kajal, Updated the talk. The earlier content was way too generic.

Prastut Kumar (~prastut)

@prastut It's good enough! :)

