Getting started with vector representations of language
A computer treats every type of data, whether text or images, as numbers. But how do we retrieve meaning from those numbers? How do we use machine learning and mathematics to analyze a piece of text, its various aspects, and the relationships underlying it? This is where word embeddings come in. Simply put, word embeddings are vector representations of text. They can help capture relationships, entities, roots and contextual information among the words or characters that form a piece of content. This talk discusses how word embeddings are developed, why such representations are needed, and popular word embeddings such as word2vec and GloVe, with mathematical descriptions as well. Application areas of embeddings and the potential for further research will also be discussed.
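As a rough illustration of what "vectors capturing relationships" can mean, here is a minimal sketch in plain Python. The three-dimensional vectors are hand-picked, hypothetical values chosen only for this example; they are not taken from word2vec, GloVe or any trained model, and real embeddings usually have hundreds of dimensions. Cosine similarity is one common way to compare two vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, made up for illustration only.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

# Related words are assigned nearby vectors, so their similarity is higher.
sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])

print("king vs queen:", round(sim_royal, 3))
print("king vs apple:", round(sim_fruit, 3))
```

With these made-up values, "king" and "queen" score much closer to 1 than "king" and "apple" do, which is the kind of geometric signal trained embeddings are meant to provide.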
A (tentative) walk-through for the talk ahead:
- Introduction to word embeddings: what vector representations entail, idea of numbers representing language
- History, notable works in the field: Mikolov et al.
- Application areas: why embeddings? Real-life examples of usage
- Mathematical workings (brief): ways to generate embeddings, prevalent methods
- State-of-the-art models: word2vec, GloVe, fastText; their principles and development
- Resources for getting started with using word embeddings: Gensim, NLTK
- Conclusion: scope for research and current challenges, new representations and way forward
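To give a flavour of the "ways to generate embeddings" item in the outline, the sketch below builds the (center word, context word) pairs that a skip-gram style model, as described by Mikolov et al., is trained to predict. It covers only this data-preparation step, not the neural network training itself. The function name `skipgram_pairs` and the `window` parameter are hypothetical names chosen for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)

for center, context in pairs:
    print(center, "->", context)
```

A larger window pairs each word with more distant neighbours; libraries such as Gensim expose a comparable window setting when training word2vec models.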
A basic background in Python and linear algebra is preferred, although every effort will be made to keep this talk beginner-friendly. If you have experience reading research papers and know the introductory concepts, reading the following papers might be helpful (they will be touched upon in the talk as well):
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Will update soon with presentation slides.
I'm a third-year undergraduate at Delhi Technological University. Over the past year I have been involved in numerous applied machine learning and NLP-based research projects, and I am also a member of PyLadies Delhi, PyData Delhi and various other organizations encouraging development in Python. My research interests include natural language processing, speech emotion recognition and ensemble methods. I also like to read and write poems.
My website: http://anjalibhavan.github.io/