Natural Language Processing is Fun!
Saurabh Deshmukh (~saurabh15)
Computers are great at working with structured data like spreadsheets and database tables. But we humans usually communicate in words, not in tables, and we don't live in an alternate version of history where all data is structured. That's unfortunate for computers: a lot of the information in the world is unstructured, raw text in English or another human language. How can we get a computer to understand unstructured text and extract data from it?
Natural Language Processing, or NLP, is the sub-field of AI focused on enabling computers to understand and process human languages. For as long as computers have been around, programmers have been trying to write programs that understand languages like English. The reason is pretty obvious: humans have been writing things down for thousands of years, and it would be really helpful if a computer could read and understand all that data.
Computers can't yet truly understand English the way humans do, but they can already do a lot. In certain limited areas, what you can do with NLP already seems like magic, and applying NLP techniques to your own projects might save you a lot of time. Even better, the latest advances in NLP are easily accessible through open source Python libraries like spaCy, textacy, and neuralcoref. What you can do with just a few lines of Python is amazing.
Extracting meaning from text is hard, because the process of reading and understanding English is very complex. Doing anything complicated in machine learning usually means building a pipeline: break your problem into very small pieces, then use machine learning to solve each piece separately. That's exactly the strategy we use for NLP.
Steps for building a pipeline:
- Sentence Segmentation - Break the text apart into separate sentences.
- Word Tokenization - Break the sentence into separate words.
- Predicting Parts of Speech for each token - Here, we look at each token and try to guess its part of speech — whether it is a noun, a verb, an adjective and so on.
- Text Lemmatization - Figuring out the most basic form or lemma of each word in the sentence.
- Identifying Stop Words - Words that you might want to filter out before doing any statistical analysis.
- Dependency Parsing - Figure out how all the words in our sentence relate to each other.
- Named Entity Recognition (NER) - Detect the nouns and noun phrases that name real-world things, such as people, places, and organizations, and label them with the concepts they represent.
- Coreference Resolution - Figure out which pronouns, such as "it" or "she", refer to which entities by tracking them across sentences.
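As a rough illustration of the pipeline idea, the sketch below chains the three steps from the list that can be approximated with simple rules: sentence segmentation, word tokenization, and stop-word filtering. The function names, the regular expressions, and the tiny stop-word list are assumptions made for this example. They are not how spaCy or any other library implements these steps. The remaining steps (part-of-speech tagging, lemmatization, dependency parsing, NER, and coreference resolution) generally need trained models and are not shown.

```python
import re

# Tiny illustrative stop-word list (an assumption; real libraries ship much larger ones).
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}


def split_sentences(text):
    """Naive sentence segmentation: split on whitespace that follows ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Naive word tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def remove_stop_words(tokens):
    """Keep only alphanumeric tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.isalnum() and t.lower() not in STOP_WORDS]


def run_pipeline(text):
    """Chain the steps: each stage consumes the previous stage's output."""
    return [remove_stop_words(tokenize(s)) for s in split_sentences(text)]


text = "London is the capital of England. It is a big city!"
for tokens in run_pipeline(text):
    print(tokens)
```

Each stage is a small function that consumes the output of the one before it, so a rule-based step can later be swapped for a model-based one without touching the rest of the pipeline. Rules this simple break on abbreviations such as "Dr." and on contractions, which is one reason real pipelines rely on trained models.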
Put together, these steps are practical: going through thousands of documents to redact every name by hand could take years, but with NLP it's a breeze. By the time my talk is over, you will have a brief overview of what Natural Language Processing is and how such a pipeline works.
- Basic idea of what Natural Language Processing (NLP) is.
I am Saurabh Sunil Deshmukh, currently pursuing my B.E. (Computer Science and Engineering) at Government College of Engineering, Aurangabad, Maharashtra. I started with Python four months ago, drawn by its scope and popularity in data science and machine learning. I have also studied Big Data analytics using Apache Spark and Apache Hadoop. I would love to share my (just started) journey into data science, and I am eager to hear from everyone else.