Building an automatic keyphrase extraction system using NLTK in Python

Prastut Kumar (~prastut) | 27 Jun, 2016

Description:

Ever wondered how Google Search shows the most relevant results first, even though the terms you searched for also appear in the results on page 2 (in a nutshell, how the famous Google PageRank works), how your post gets automatically categorised on Quora, or how Medium groups articles into clusters based on their context?

Extraction of important topical words and phrases from documents, commonly known as terminology extraction or automatic keyphrase extraction, is a hot research topic. It is one of the crucial tasks in natural language processing for automatically extracting structured information from unstructured (text) datasets. Keyphrases provide a concise description of a document's content. They are useful for document search, clustering, categorization, and summarization, and they help in building a content-based recommendation system because they let you quantify semantic similarity with other documents. Since we are producing more raw data now than ever, clustering and contextualizing that data is becoming a more difficult task. With the help of Python, and specifically NLTK, it becomes a tad easier.

My talk will cover the methodology, keyphrase selection (unsupervised and supervised methods), and algorithms that help us quantify the weight of a phrase relative to the document corpus, followed by step-by-step guidance on building a decent keyphrase extraction system using NLTK in Python.

This work is part of my GSoC project.
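
To give a flavour of the unsupervised approach outlined above, here is a minimal sketch that selects candidate phrases and weights them against a small corpus using TF-IDF. It is an illustration, not the code from the talk or the project: the function names (`tokenize`, `candidates`, `tfidf_keyphrases`), the tiny stopword list, and the regex tokenizer are all assumptions made to keep the example self-contained. In a real system, NLTK's tokenizers, stopword corpus, and part-of-speech tagger would likely take their place.

```python
import math
import re
from collections import Counter

# Tiny hand-written stopword list (an assumption for this sketch;
# NLTK ships a much fuller one in nltk.corpus.stopwords).
STOPWORDS = {"a", "an", "the", "of", "and", "is", "in", "on",
             "to", "for", "with", "are", "from", "by"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidates(text, max_len=2):
    """Return candidate phrases: n-grams (n <= max_len) with no stopwords.

    Note: this simple version ignores punctuation, so a bigram may span
    a sentence boundary.
    """
    tokens = tokenize(text)
    found = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(tok in STOPWORDS for tok in gram):
                found.append(" ".join(gram))
    return found


def tfidf_keyphrases(docs, doc_index, top_k=3):
    """Rank the candidates of docs[doc_index] by TF-IDF over the corpus."""
    per_doc = [Counter(candidates(d)) for d in docs]

    # Document frequency: the number of documents containing each candidate.
    df = Counter()
    for counts in per_doc:
        df.update(counts.keys())

    n_docs = len(docs)
    target = per_doc[doc_index]
    total = sum(target.values())
    if total == 0:
        return []

    # Term frequency (relative count) times inverse document frequency.
    scores = {
        phrase: (count / total) * math.log(n_docs / df[phrase])
        for phrase, count in target.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_k]


if __name__ == "__main__":
    corpus = [
        "Keyphrase extraction finds keyphrase candidates in text.",
        "Python is a programming language.",
        "Text clustering groups text documents.",
    ]
    print(tfidf_keyphrases(corpus, 0))
```

A phrase that is frequent in one document but rare across the corpus scores highest, while a phrase that appears in every document scores zero. Supervised methods would instead learn to classify candidates from labelled examples, using scores like this one as features.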

Prerequisites:

A basic understanding of Python (list comprehensions, classes, functions, loops). You need no prior knowledge of NLTK or natural language processing as a whole, as I will go into depth while explaining the project. An interest in such research fields will help you enjoy the talk.

Content URLs:

https://github.com/P2Pvalue/teem-tag

https://docs.google.com/presentation/d/1VLjU2MQnB3GUw7p4z_Q6olFdYoW2q-LTsE8OUFZqrgI/edit?usp=sharing

Speaker Info:

I am a passionate designer and hacker by nature, currently working as a Google Summer of Code student at Teem, Berkman Center for Internet and Society, Harvard University. I am in my junior year, pursuing CS at VIT, Chennai Campus. Apart from that, I am busy cultivating an open source culture at my university.

Speaker Links:

GitHub: https://github.com/prastut
Website: http://prastut.github.io/
Design portfolio: https://www.behance.net/prastutkumar
LinkedIn: https://in.linkedin.com/in/prastut
Quora: https://www.quora.com/profile/Prastut-Kumar-1

Section:	Data Visualization and Analytics
Type:	Talks
Target Audience:	Beginner
Last Updated:	28 Jul, 2016
