Combining NLP with Structured Data to map Clinical Entities to the relevant Section Headers in a Clinical Document

Sagar Dawda (~sagar34)




In this talk we will discuss how to combine unstructured data with structured data and apply Machine Learning techniques to extract Section Headers from a clinical document.


Extracting Section Headers from clinical documents is a challenging task because every doctor or hospital follows a different format, and no standard exists for documenting clinical notes. Although the SOAP (Subjective, Objective, Assessment, Plan) format specifies WHAT should be included in a clinical document, it does not lay down any framework for HOW.

Many research papers have been published on the subject, exploring approaches such as Sentence Segmentation, Bayesian probability, and CRFs. However, just as with a team, working in silos yields decent results only up to a point. Here we'll discuss the different silo approaches to the problem and their respective results, and compare them with the combined approach and its results.

The approach we followed was binary classification: Section Header or Not Section Header.

Here's what a typical NLP classification task looks like:

  1. Tokenizing the text
  2. Converting all the tokens to lower case
  3. Stemming / Lemmatization
  4. Stop words removal
  5. Vectorization
  6. Training and Testing your ML / DL model
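
As a rough illustration of steps 1 to 5 above, here is a minimal standard-library sketch. The stop-word list, the `crude_stem` suffix stripper, and the helper names are all assumptions made for this example, not the tooling used in the talk; a real pipeline would normally rely on an NLP library for stemming and vectorization. Step 6 (training and testing a model) is left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "of", "a", "an", "and", "is", "was", "for", "to", "in"}


def tokenize(text):
    """Step 1: split text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def crude_stem(token):
    """Step 3: naive suffix stripping, a stand-in for a real stemmer."""
    if token.endswith("ss"):
        return token
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    tokens = tokenize(text)                             # 1. tokenize
    tokens = [t.lower() for t in tokens]                # 2. lower-case
    tokens = [crude_stem(t) for t in tokens]            # 3. stem
    return [t for t in tokens if t not in STOP_WORDS]   # 4. drop stop words


def build_vocabulary(docs):
    """Map every distinct preprocessed token to a column index."""
    vocab = sorted({tok for doc in docs for tok in preprocess(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def vectorize(text, vocab):
    """Step 5: bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(preprocess(text))
    return [counts.get(tok, 0) for tok in vocab]


docs = ["History of Present Illness", "The patient was seen for chest pains"]
vocab = build_vocabulary(docs)
print(vectorize("Chest pain history", vocab))
```

The resulting count vectors are what a classifier would consume in step 6.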

Here's a quick overview of the pros and cons of the different silo approaches


While every approach has its own strengths and weaknesses, none of the silo approaches achieved the expected performance. There was clearly a need to combine the best of all the approaches.


  1. Utilizing the feature richness of the BOW approach using a Vectorizer
  2. Extracting metadata about the context from the XML
  3. Merging the features
  4. Identifying Section Headers
  5. Mapping Clinical Entities to the relevant headers
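
The five steps above could be sketched roughly as follows. Everything specific here is an assumption for illustration: the XML layout (one `<line>` element with `bold` and `font_size` attributes), the feature names, the hand-written `looks_like_header` rule standing in for the trained binary classifier, and the "most recent header above the line" mapping rule. The proposal does not describe the actual schema, features, or model.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical XML layout: one <line> element per text line, with
# formatting attributes. The real schema is not given in this proposal.
SAMPLE_XML = """
<document>
  <line bold="true" font_size="14">Chief Complaint</line>
  <line bold="false" font_size="11">Patient reports chest pain since morning.</line>
  <line bold="true" font_size="14">Assessment</line>
  <line bold="false" font_size="11">Likely angina, rule out diabetes complications.</line>
</document>
"""


def bow_features(text):
    """Step 1: bag-of-words counts, prefixed to avoid clashing with metadata keys."""
    return {f"bow={tok}": n for tok, n in Counter(text.lower().split()).items()}


def metadata_features(line):
    """Step 2: layout cues read from the (assumed) XML attributes and text shape."""
    text = line.text or ""
    return {
        "meta=bold": 1.0 if line.get("bold") == "true" else 0.0,
        "meta=font_size": float(line.get("font_size", "0")),
        "meta=num_tokens": float(len(text.split())),
        "meta=ends_with_period": 1.0 if text.rstrip().endswith(".") else 0.0,
    }


def merged_features(line):
    """Step 3: merge both feature dictionaries into one."""
    features = dict(bow_features(line.text or ""))
    features.update(metadata_features(line))
    return features


def looks_like_header(features):
    """Step 4 stand-in: a hand-written rule in place of a trained classifier."""
    return features["meta=bold"] == 1.0 and features["meta=ends_with_period"] == 0.0


def map_entities_to_headers(xml_text, entities):
    """Step 5: assign each entity to the most recent header above its line."""
    current_header = None
    mapping = {}
    for line in ET.fromstring(xml_text.strip()).iter("line"):
        text = line.text or ""
        if looks_like_header(merged_features(line)):
            current_header = text.strip()
            continue
        for entity in entities:
            if entity.lower() in text.lower():
                mapping[entity] = current_header
    return mapping


print(map_entities_to_headers(SAMPLE_XML, ["chest pain", "angina", "diabetes"]))
```

In practice the merged feature dictionaries would be fed to a trained model rather than a fixed rule; the rule is used here only to keep the sketch self-contained.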


With the silo approaches, the maximum mapping accuracy was ~67%. Combining NLP and metadata features increased the accuracy to 86%.


Attendees should preferably have trained an NLP model for text classification.

Video URL:

Content URLs:

Link to PDF -

Speaker Info:

Sagar Dawda works as a Data Scientist III at Episource India Pvt Ltd. Most of his work involves applying cutting-edge NLP techniques to clinical text. He has also mentored many students at GreyAtom, an ed-tech startup. Some of the challenging problems in the field that he and his team have solved are:

  1. Custom NER for clinical text
  2. Service date identifier
  3. Mapping Clinical Entities to Section Headers
  4. Domain specific ontology search for diseases

Speaker Links:

  • Github -
  • Linkedin -

Section: Data Science, Machine Learning and AI
Type: Talks
Target Audience: Intermediate
Last Updated: