Next generation of word embeddings in gensim
Parul Sethi (~parulsethi)
Python has many natural language processing tools, and anyone implementing a recommender or a document classifier faces the problem of choosing among the many open-source word embeddings available. I will highlight the differences between three popular word embedding models, Word2Vec, FastText and WordRank, and show how these differences directly affect downstream NLP tasks, especially those related to similarity. I'll also discuss how to deal with the common issues of rare, frequent and out-of-vocabulary words. As visualizations are a crucial part of data analysis, helping us understand the structure and underlying patterns in the data, I'll also cover visualizing word embeddings using TensorBoard and gensim.
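At the core of the similarity-related tasks mentioned above is a cosine-similarity query over word vectors. A minimal sketch of that idea, using tiny hand-made toy vectors for illustration (real embeddings would come from a trained Word2Vec, FastText or WordRank model):

```python
import math

# Hypothetical 3-dimensional embeddings, for illustration only.
embeddings = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(word, topn=2):
    """Rank all other vocabulary words by cosine similarity to `word`."""
    scores = [(w, cosine(embeddings[word], v))
              for w, v in embeddings.items() if w != word]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

print(most_similar("king"))  # "queen" ranks above "apple"
```

Libraries such as gensim wrap exactly this kind of query behind convenience methods, with the interesting differences lying in how each model learns the vectors in the first place.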
- What word embeddings are and why they are useful
- Examples of some popular word embeddings
- Why you need to choose carefully between these different embeddings
- Examples of their different results on similarity tasks
- Benchmark performance overview on word similarity and analogy datasets (how the different embeddings perform)
- Visualizations: PCA and t-SNE (using TensorBoard)
- Relation between word frequency and embedding performance
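For the visualization point above, a rough sketch of the PCA projection behind 2-D embedding plots (TensorBoard's projector performs this interactively), shown here on a random toy embedding matrix with NumPy only; the matrix shapes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical embedding matrix: 10 "words", each a 5-dimensional vector.
X = rng.normal(size=(10, 5))

# PCA: center the data, then project onto the top-2 right singular vectors,
# i.e. the directions of greatest variance in the embedding space.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
coords_2d = X_centered @ Vt[:2].T  # each word -> an (x, y) point to plot

print(coords_2d.shape)  # (10, 2)
```

t-SNE follows the same "high-dimensional vectors in, 2-D points out" pattern but optimizes a nonlinear objective that better preserves local neighborhoods, which is why it is often preferred for inspecting word clusters.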
Just a basic idea of what word embeddings are.
I'm a pythonista studying Maths and IT at the University of Delhi. For the love of open source and NLP, I regularly contribute to gensim, a widely used Python library, and have also been selected as its GSoC (Google Summer of Code) 2017 student under the NumFOCUS umbrella (my live blog). I've given a similar talk on the proposed topic at PyDelhi 2017.