Auto-detection of tags and text classification on unstructured data with Python

Swathi Tatavarthy (~swathi14)


Description:

Since the turn of the millennium, digital assets of all kinds have become an increasingly significant part of our daily experience. Every day, we consume and interact with photos, audiovisual media, text documents, emails and a multitude of other digital formats. Navigating all of these digital assets creates challenges for enterprises and end users alike. Users want to organize their assets and search them more efficiently: to find them, categorize them, and use them when and where they want. As the volume of digital assets grows every day, labeling them manually has become tedious, which has made text analysis an emerging field of study. E-commerce platforms, social media and news agencies already analyze and extract textual information from different types of data.

Text classification is one of the essential parts of text analysis. In general, text categorization is used to generate tags from unstructured data and assign them to predefined categories. This approach can be applied in many contexts, ranging from document filtering to automated metadata generation, word sense disambiguation, video indexing, image classification, processing of Optical Character Recognition (OCR) data, and any application that requires efficient organization of documents. It improves search efficiency and retrieves results in a fraction of a second. The approach serves real-time needs and can be adopted in any domain.

The content and labels from different file types are extracted using Python libraries such as OpenCV, Tesseract, pdfminer3, docx2txt, gensim and nltk. For textual documents, text pre-processing and word vectorization are applied to extract the most frequent keywords/tags. Image and video files are processed using OpenCV and object detection algorithms (YOLO, SSD) to extract labels from the files. All the keywords and labels are then classified into domains using WordNet hypernyms.
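
As a rough illustration of the text branch described above, the following is a minimal sketch, not the poster's actual implementation. It uses only the Python standard library: a plain word count stands in for the nltk/gensim pre-processing and vectorization steps, and a small hand-written dictionary (`TOY_HYPERNYMS`) stands in for WordNet hypernym lookups. The stop-word list, function names and sample text are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real pipeline
# would more likely use the stop words shipped with nltk.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "for", "with", "are"}

# Assumption: a hand-written stand-in for WordNet hypernym lookups.
TOY_HYPERNYMS = {"dog": "animal", "cat": "animal", "car": "vehicle", "truck": "vehicle"}


def extract_keywords(text, top_n=2):
    """Lowercase and tokenize the text, drop stop words, return the most frequent words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


def classify_keywords(keywords):
    """Group keywords by the domain found in the toy hypernym table."""
    domains = {}
    for word in keywords:
        domain = TOY_HYPERNYMS.get(word, "unknown")
        domains.setdefault(domain, []).append(word)
    return domains


sample = "The dog chased the cat. The dog and the cat saw a car. The dog barked."
keywords = extract_keywords(sample, top_n=2)
domains = classify_keywords(keywords)
print(keywords)
print(domains)
```

In a fuller version, the dictionary lookup would presumably be replaced by a walk up the WordNet hypernym hierarchy, and the text passed in would come from the file-type-specific extractors named above.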

Prerequisites:

  • Basics of Python
  • Basic understanding of Natural Language Processing

Content URLs:

Content - https://docs.google.com/document/d/1_4t3BxiVp4MlfShrPCUyUXh6af5CwXjFJrjCN_-uZ6M/edit?usp=sharing
Poster link - https://drive.google.com/file/d/1zpSD2yuEC0kgLGb_PEE_dlzlHX7urrUy/view?usp=sharing

Speaker Info:

I am a Software Engineer and a Python enthusiast. With the advent of machine learning and AI, I became fascinated by the insights that can be generated from data. Python's rich library support for machine learning has deepened my interest in the field. I have been part of building various AI products using Python, NLP and predictive analytics on cloud platforms. This poster focuses on natural language processing of textual data using Python libraries.

Section: Data Science, Machine Learning and AI
Type: Poster
Target Audience: Intermediate
Last Updated: