Introduction

NLP is an inter-disciplinary subject: Computer Science, Linguistics, Statistics, and more. It is a rapidly developing field of study. If you use computers and the Internet in your day-to-day life, you are probably already using NLP-based products: spell-checkers, search engines, etc.

Natural Language Toolkit (NLTK)

A collection of Python programs, modules, data sets and tutorials to support research and development in Natural Language Processing (NLP). Written by Steven Bird, Edward Loper and Ewan Klein.

NLTK is
    Free and open source
    Easy to use
    Modular
    Well documented
    Simple and extensible

What You Will Learn
    * How simple programs can help you manipulate and analyze language data, and how to write these programs
    * How key concepts from NLP and linguistics are used to describe and analyze language
    * How data structures and algorithms are used in NLP
    * How language data is stored in standard formats, and how data can be used to evaluate the performance of NLP techniques

I) Installation of NLTK

1) Make sure that Python 2.4, 2.5 or 2.6 is available on your system
2) Install the Python Tkinter package
3) Install NumPy, Matplotlib, Prover9, MaltParser and MegaM
4) Download NLTK and install it
   a) If you are installing NLTK from source:
      Download http://nltk.googlecode.com/files/nltk-2.0b5.zip
      Unzip it; it will create nltk-2.0b5. Open a terminal and cd into this folder.
      Become the super user and run: python setup.py install
   b) To install the data, start the Python interpreter and type

      >>> import nltk
      >>> nltk.download()

      It will open a GUI from which you can select the packages you require. Click the download button.

That is all! Now you are ready to play with NLTK.

II) Let us start the game

1) To access the data for working out the examples in the book, start the Python interpreter.

2) Some basic workouts from the book

a) Concordance

   >>> from nltk.book import *
   >>> text1.concordance("and")
   >>> text1.similar("and")

b) Dispersion plot (positional information)

   >>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

   You can see a graphical plot.

   >>> text4.dispersion_plot(["and", "to", "of", "with", "the"])

   What do you see this time? Why?

c) Counting vocabulary

   >>> len(text3)

   1) List of the distinct words in text3, sorted in dictionary order:

      >>> sorted(set(text3))

   2) Count the occurrences of a particular word in a text:

      >>> text3.count("and")
      >>> 100 * text3.count("and") / len(text3)   # what percentage of the text is taken up by a specific word

   3) Selecting words based on a condition: {w | w is a member of V and P(w)} in set notation becomes [w for w in V if P(w)] in Python.

      >>> V = set(text1)
      >>> my_words = [w for w in V if len(w) > 15]
      >>> sorted(my_words)
      >>> fd = FreqDist(text5)
      >>> sorted([w for w in set(text5) if len(w) > 7 and fd[w] > 15])

d) Finding bigrams

   >>> from nltk import bigrams
   >>> txt = "jaganadh is talking about NLTK."
   >>> wlist = txt.split()
   >>> bigrams(wlist)

   What will happen if we do this instead? (Hint: a string is a sequence of characters.)

   >>> bigrams(txt)

e) Collocations

   >>> text4.collocations()

III) So far we were playing with data available in NLTK. Now let us see how to work with our own data.

a) Populate our own corpus with NLTK and analyse it

   >>> from nltk.corpus import PlaintextCorpusReader as ptr
   >>> cr = '/home/conf/Desktop/pyConf.in/cor'
   >>> wlis = ptr(cr, '.*')
   >>> wlis.fileids()

b) Let us find out how to count the number of characters, words and sentences in the corpus.

   >>> for fid in wlis.fileids():
   ...     print len(wlis.raw(fid))    # characters
   >>> for fid in wlis.fileids():
   ...     print len(wlis.words(fid))  # words
   >>> for fid in wlis.fileids():
   ...     print len(wlis.sents(fid))  # sentences

c) How to extract sentences from the corpus which we populated

   >>> for sent in wlis.sents('1c'):
   ...     print sent

d) Find the longest sentence in the corpus

   >>> sents = wlis.sents('1c')
   >>> long_sen = max([len(s) for s in sents])
   >>> [s for s in sents if len(s) == long_sen]

e) Generating bigrams from the corpus

   >>> nltk.bigrams(wlis.words('1c'))

f) Plotting a conditional frequency distribution

   >>> big = nltk.bigrams(wlis.words('1c'))
   >>> gd = nltk.ConditionalFreqDist(big)
   >>> gd.plot()

g) Tabulate the CFD

   >>> gd.tabulate()

h) Generating a word puzzle: find words of six or more letters built only from the letters 'egivrvonl' and containing the obligatory letter 'r'

   >>> puzzle = nltk.FreqDist('egivrvonl')
   >>> ob = 'r'
   >>> wl = nltk.corpus.words.words()
   >>> [w for w in wl if len(w) >= 6 and ob in w and nltk.FreqDist(w) <= puzzle]

i) Finding the pronunciation of words

   >>> ent = nltk.corpus.cmudict.entries()
   >>> w = 'work'
   >>> for wo, p in ent:
   ...     if wo == w:
   ...         print p

   We can also find stress patterns in text with this module.
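   For instance, here is a minimal sketch of how such a stress search might look (the stress() helper and the 0-1-0 pattern are our own illustration, not part of NLTK): CMU pronunciations mark stress with the digits 0, 1 and 2 attached to the vowel phones, so we can collect those digits and filter words by their stress contour.

   >>> def stress(pron):
   ...     # collect the stress digits (0, 1, 2) attached to the vowel phones
   ...     return [ch for phone in pron for ch in phone if ch.isdigit()]
   >>> # e.g. words pronounced with the stress contour 0-1-0 (first ten only)
   >>> [w for w, pron in ent if stress(pron) == ['0', '1', '0']][:10]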
j) WordNet data accessing

   >>> from nltk.corpus import wordnet as wn
   >>> wn.synsets('car')
   >>> wn.synset('car.n.01').lemma_names
   >>> for synset in wn.synsets('car'):
   ...     print synset.lemma_names

IV) Collecting a corpus from the web with NLTK

   >>> from urllib import urlopen
   >>> url = 'http://jaganadhg.freeflux.net/blog'
   >>> raw = urlopen(url).read()
   >>> txt = nltk.clean_html(raw)

   To tokenize the text:

   >>> tok = nltk.word_tokenize(txt)

a) Stemming

   >>> porter = nltk.PorterStemmer()
   >>> word = 'running'
   >>> porter.stem(word)
   >>> lancaster = nltk.LancasterStemmer()
   >>> lancaster.stem(tok[110])

b) Lemmatization

   >>> wnl = nltk.WordNetLemmatizer()
   >>> wnl.lemmatize(word)

V) POS Tagging

   >>> text = nltk.word_tokenize("We are attending Python Conference")
   >>> nltk.pos_tag(text)

VI) Parsing

   >>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
   ...             ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
   >>> grammar = "NP: {<DT>?<JJ>*<NN>}"
   >>> cp = nltk.RegexpParser(grammar)
   >>> result = cp.parse(sentence)
   >>> print result
   >>> result.draw()
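   The result returned by cp.parse() is an nltk.Tree, so instead of just printing or drawing it we can walk it and pull out the NP chunks programmatically. A minimal sketch, assuming the NLTK 2.x Tree API (where a subtree's label is its .node attribute):

   >>> for subtree in result.subtrees():
   ...     if subtree.node == 'NP':    # keep only the noun-phrase chunks
   ...         print subtree.leaves()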
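   Putting sections V and VI together, nothing stops us from chunking freshly tagged text instead of a hand-tagged sentence. A sketch (the example sentence is our own; whether each noun phrase is captured depends on the tags pos_tag assigns):

   >>> sent = nltk.word_tokenize("The little black dog chased a big red ball")
   >>> print cp.parse(nltk.pos_tag(sent))   # tag, then chunk with the same grammar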
?*}" >>> cp = nltk.RegexpParser(grammar) >>> result = cp.parse(sentence) >>> print result >>> result.draw() VII) What people are doing with NTK. A group of students from Bangalor developed a small application for finding authorship attribution. Analyzing IRC log . (Found in planet python) Create corus from web. Extraction MWE from corpus. Teaching NLP VIII) Whre to find more info on NLTK Visit the NLTK site www.nltk.org Buy a book "Natural Language Processing with Python" from O'Reilly In India it can be purchased from Shroffpublishers Contents in the book C1 Language Processing and Python C2 Accessing Text Corpora and Lexical Resources C3 Processing Raw Text C4 Writing Structured Programs C5 Catagorizing and Tagging Words C6 Learning to Classify C7 Extracting Information from Text C8 Analyzing Sentence Structure C9 Building Feature-Based Grammers C10 Analyzing the Meaning of Sentences C11 Managing Linguistic Data IX) Contribute to NLTK Visit the NLTK site. GSOC Buy a book jaganadhg@gmail.com jaganadhg@au-kbc.org http://jaganadhg.freeflux.net/blog