Information Extraction - The Easier way
sravya yellapragada (~sravya94)
Information was once exchanged almost entirely on paper, and most of it has since moved to digital systems. In a few domains, such as insurance, healthcare, and education, older records still exist only on paper. Extracting and digitizing data from those paper documents and forms has always been a difficult problem, mainly because of the sheer variance in document structure, layout, and content. Handwritten notes in pen or pencil add a further dimension of complexity.
Many companies tackle this problem manually, which is slow and highly expensive. One alternative is to cluster the documents by template and write rules to extract the required content from each template. The downside of this approach is that the rules stop working as soon as a document deviates from the expected template.
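As a minimal illustration of why template-bound rules are brittle, the sketch below applies one positional rule to two layouts of the same form. The rule, the sample text, and the function name `extract_policy_number` are hypothetical and made up for this example. They are not taken from the talk.

```python
import re

# Hypothetical rule written for one known template: the policy number
# is expected on the second line, right after the label "Policy No:".
def extract_policy_number(text):
    lines = text.splitlines()
    if len(lines) < 2:
        return None
    match = re.match(r"Policy No:\s*(\S+)", lines[1])
    return match.group(1) if match else None

# The layout the rule was written for.
expected_layout = "ACME Insurance\nPolicy No: P-1042\nName: Jane Doe"

# Same information, but with an extra heading line and a different label.
shifted_layout = "ACME Insurance\nClaim Form\nPolicy Number - P-1042\nName: Jane Doe"

rule_hit = extract_policy_number(expected_layout)   # value is found
rule_miss = extract_policy_number(shifted_layout)   # rule silently fails
```

The second call returns `None` even though the policy number is present, which is the failure mode described above: every new layout needs its own rule.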
This talk will briefly cover an approach that solves these problems by automatically extracting text and data from documents (scanned PDFs, Word/Excel files) without manual effort from the user. We will also discuss some common problems that arise during information extraction. The talk is relevant to domains such as insurance, healthcare, and education, which hold huge amounts of data and need to keep only the required parts while discarding the rest.
We will demonstrate how this problem can be solved in an easy and scalable manner. Our current approach can understand data, identify tables, and extract information as key-value pairs based on context. It can extract default entities such as Name, Date, Address, Unique Identifier, and Amount, as well as user-defined custom entities. The results can be exported as either CSV or Excel files.
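To make the idea of key-value extraction with CSV output concrete, here is a minimal sketch using only the Python standard library. It is a simplified stand-in, not the approach presented in the talk: it assumes the text has already been extracted from the document, that fields appear as "label: value" or "label - value" lines, and that a small hand-written synonym table (`ENTITY_LABELS`, hypothetical) maps labels to entities.

```python
import csv
import io
import re

# Hypothetical label synonyms for a few of the default entities.
ENTITY_LABELS = {
    "Name": ["name", "insured name", "applicant"],
    "Date": ["date", "date of issue"],
    "Amount": ["amount", "total", "premium"],
}

def extract_entities(text):
    """Return {entity: value} for lines shaped like 'label: value'."""
    found = {}
    for line in text.splitlines():
        # Label, then ':' or '-', then the value.
        match = re.match(r"\s*([^:\-]+?)\s*[:\-]\s*(.+?)\s*$", line)
        if not match:
            continue
        label, value = match.group(1).lower(), match.group(2)
        for entity, synonyms in ENTITY_LABELS.items():
            # Keep the first value seen for each entity.
            if label in synonyms and entity not in found:
                found[entity] = value
    return found

def to_csv(records):
    """Serialize a list of entity dicts to CSV text, one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(ENTITY_LABELS), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

sample = (
    "Applicant: Jane Doe\n"
    "Date of Issue - 12/03/2019\n"
    "Premium: 4500.00\n"
    "Remarks: none"
)
entities = extract_entities(sample)
csv_text = to_csv([entities])
```

A user-defined custom entity would, in this sketch, simply be another entry in `ENTITY_LABELS`. A real system would replace the synonym lookup with something that uses the surrounding context.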
Outcome of this talk: After this talk, the audience will have an idea of how to extract the required information from documents such as contracts, tax documents, sales orders, bills, enrollment forms, benefit applications, insurance claims, policy documents, market slips, and medical documents, irrespective of the domain and template of the document.
Prerequisites: Basic familiarity with the Python programming language.
Slides: https://github.com/sravyaysk/pycon-2019/blob/master/InformationExtraction/Information%20Extraction.pdf
Talk outline: https://github.com/sravyaysk/pycon-2019/blob/master/InformationExtraction/talk%20outline.txt
I am working as a Data Scientist at Pramati Technologies, Hyderabad, with 2 years of experience in solving real-world business problems across different domains. As part of the data science team, we design, code, train, test, deploy, and iterate on enterprise-scale machine learning systems. My primary focus is on building powerful and efficient applications using Natural Language Processing and Deep Learning. I love programming and actively participate in hackathons and online contests conducted by HackerEarth and Kaggle. My hobbies include singing, painting, and making gifts, cards, and crafts.
This is the first time I am giving a talk at PyCon. I chose this topic because it is a major concern across multiple domains and many find it hard to come up with a solution. I hope this talk will be helpful to them and to anyone else who is interested.