Datafication of Indian Court Judgments using Natural Language Processing (NLP)

Dammalapati Sai Krishna (~dammalapati) | 07 Jul, 2023

Description:

Several humanities domains, such as economics and politics, have increasingly embraced data science and other empirical methods for research. Legal studies is a new frontier here: the use of data science to analyse legal data such as judgment documents and case orders has only just begun globally. However, empirical legal research (ELR) faces a challenge: the lack of structured legal-judicial datasets. Legal-judicial data is mostly unstructured text. Currently, researchers performing ELR extract the relevant variables from legal-judicial texts (e.g. case orders, court judgments) through manual annotation. A few governments and judicial institutions also provide metadata to help researchers do ELR. However, the datasets generated by these approaches are not exhaustive and contain both false positives and false negatives. This poor state of data stymies legal research and eventually holds back legal reform in a polity.

In this context, Natural Language Processing (NLP) can help create structured legal-judicial datasets from these unstructured texts. Specifically, I use an open-source NLP library called "OpenNyAI" which is trained on Indian court judgments. OpenNyAI currently has two NLP models - Named Entity Recognition (NER) and Extractive Summarisation.
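To make "structured dataset" concrete, here is a minimal sketch that turns a sentence of judgment text into rows of (case, statute, provision). It deliberately uses a plain regular expression as a stand-in for the NER model and is not OpenNyAI's API; the pattern, the function name `extract_rows`, the case ID and the sample sentence are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for an NER model: matches mentions shaped like
# "Section 376 of the Indian Penal Code". A trained model would cover far
# more phrasings; this pattern only illustrates the output shape.
PROVISION_RE = re.compile(
    r"[Ss]ection\s+(?P<provision>\d+[A-Z]*)\s+of\s+(?:the\s+)?"
    r"(?P<statute>[A-Z]\w*(?:\s+[A-Z]\w*)*)"
)


def extract_rows(case_id, text):
    """Return one dict per statute/provision mention found in the text."""
    return [
        {
            "case_id": case_id,
            "statute": match.group("statute"),
            "provision": match.group("provision"),
        }
        for match in PROVISION_RE.finditer(text)
    ]


# Made-up sentence, not taken from any real judgment.
sample = (
    "Section 6 of the POCSO Act and Section 376 of the "
    "Indian Penal Code were invoked."
)
rows = extract_rows("case-1", sample)
for row in rows:
    print(row)
```

Each row can then be written to a CSV file or a database table, which is the form researchers need for empirical analysis.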

As an experiment, I piloted these models on a sample of 51 POCSO (Protection of Children from Sexual Offences Act) court judgments to build structured datasets of the relevant statutes and provisions. In 45% of the cases, the datasets produced were more exhaustive than the official metadata released. The datasets were also validated by legal researchers. I further plan to use NLP to create structured datasets from all judgments related to child rights in India delivered between 2015 and 2022.
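One way to compare extracted data against official metadata is sketched below. It assumes that "more exhaustive" means the extracted (statute, provision) pairs for a case form a strict superset of the official ones; that definition, the function name `share_more_exhaustive` and the toy data are assumptions for illustration, not the method or the results of the pilot.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (statute, provision)


def share_more_exhaustive(
    extracted: Dict[str, Set[Pair]],
    official: Dict[str, Set[Pair]],
) -> float:
    """Fraction of shared cases where extracted pairs strictly contain
    the official ones (assumed definition of 'more exhaustive')."""
    case_ids = extracted.keys() & official.keys()
    if not case_ids:
        return 0.0
    # For sets, `>` tests for a proper (strict) superset.
    better = sum(1 for cid in case_ids if extracted[cid] > official[cid])
    return better / len(case_ids)


# Toy data, made up for this example.
extracted = {
    "case-1": {("POCSO Act", "6"), ("Indian Penal Code", "376")},
    "case-2": {("POCSO Act", "4")},
}
official = {
    "case-1": {("POCSO Act", "6")},
    "case-2": {("POCSO Act", "4")},
}
share = share_more_exhaustive(extracted, official)
print(share)
```

A strict-superset test is conservative: a case where the model finds extra provisions but misses one listed officially is not counted as more exhaustive, so such cases would need separate review.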

The longitudinal data on child rights will help researchers perform ELR on legislation enacted to protect children and advocate for policy reforms.

In this talk, I will cover:
a) OpenNyAI, the open-source NLP library that helps create structured datasets from Indian court judgments.
b) The challenges of datafying unstructured textual documents, using the POCSO experiment as a case study.
c) How we plan to scale this across states to create a data ecosystem of child rights in the country.

Prerequisites:

Basics of Python programming and NLP

Video URL:

https://www.loom.com/share/801a908ecf434c2b918a4d099db9c6e2?sid=57e82851-d761-4589-b6d8-26b6157d0381

Content URLs:

Slideshow: https://docs.google.com/presentation/d/1CJhc_PsfI4yI5GeBtaK79TmxfzbU4iKHyRoISjxUhAM/edit?usp=sharing

Blog: https://medium.com/civicdatalab/exploring-the-capabilities-of-natural-language-processing-nlp-in-conducting-legal-analysis-88ef2b9dec9c

Speaker Info:

Sai Krishna is a Data Engineer at CivicDataLab and a graduate of the National Institute of Technology Karnataka (NITK 2017). Since graduating, he has been keenly interested in working at the intersection of technology (data science) and public policy. Over the last 5 years, he has worked with NGOs, universities, governments and startups to deploy data science solutions to public policy problems such as disaster management, urbanisation and air pollution.

Speaker Links:

https://www.linkedin.com/in/saikrishnadammalapati/

Section:	Data Science, AI & ML
Type:	Talks
Target Audience:	Intermediate
Last Updated:	07 Jul, 2023
