Building a search engine for code discovery

Bhanu K (~bhanu26)



Description:

Problem

Developers spend a good amount of time searching for code and adapting it to the problem at hand. They often dig through old codebases (with little context) or ask colleagues to help identify a code block. This internal knowledge base can be organized far better for discovery.

This workshop aims to build a simple search engine for your code (in remote repositories or on local disk) using natural language processing techniques.

Specifically, participants will use scikit-learn to build a TF-IDF (term frequency - inverse document frequency) index over a corpus of their own code. The end result: for a given input query, the engine returns a code snippet that can be copy-pasted into the development workflow.
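As a rough sketch of that core loop (a minimal illustration, not the workshop's actual code; the directory name and token pattern here are assumptions):

```python
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

# Each code file becomes one "document" in the corpus.
docs = [p.read_text() for p in Path("my_project").rglob("*.py")]

# Fit a TF-IDF vectorizer over the corpus. This simple token pattern
# keeps identifier-like tokens; the workshop builds a richer tokenizer.
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]\w+")
matrix = vectorizer.fit_transform(docs)  # sparse (n_files, n_terms) matrix
```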

Workshop Startup kit

A startup kit will be shared with all participants a few days before the workshop (via a GitHub repository). It will contain:

  • *.py and *.yaml files as content to be indexed, i.e. the training data.
  • Default estimators and keys (.npz, .pkl, .json files), i.e. the estimators used for search lookups. Participants should fall back to these defaults if they are unable to index the estimator during the workshop.
  • Python code to tokenize code content, index it, and search it.

Part 1 - Introduction, demo - 10 mins

Introduce the objective and relevance of the problem.

Part 2 - Prepare the environment - 10 mins

Give participants time to install the dependencies.

Part 3 - Identify the files, TF-IDF - 30 mins

  • Step 1 - Prepare Python and YAML files. We will use YAML and Python files to build the search engine indices. 10 mins.
  • Step 2 - Introduction to TF-IDF and practice exercises (a short worked example follows this list). 20 mins.
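For intuition, the basic (unsmoothed) weighting is tf-idf(t, d) = tf(t, d) × ln(N / df(t)), where tf(t, d) is how often term t appears in document d, N is the number of documents, and df(t) is the number of documents containing t. For example, if the token read_csv appears 3 times in a file and occurs in 2 of 100 indexed files, its raw weight in that file is 3 × ln(100 / 2) ≈ 11.7. Note that scikit-learn's TfidfVectorizer defaults to a smoothed variant, idf(t) = ln((1 + N) / (1 + df(t))) + 1, followed by L2 normalization.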

Break - 10 mins

Part 4 - Implement the search engine - 75 mins

At its core, we need to build a search-engine index that is saved to disk in the form of .npz, .pkl, and .json files.

  • Step 1 - Identify the correct files, i.e. the necessary *.py and *.yaml files. 10 mins.
  • Step 2 - Tokenize the file content. Each file will be a document in the document matrix. 30 mins.
  • Step 2a - Tokenizing a .yaml file is mostly straightforward.
  • Step 2b - Tokenizing a .py file needs some setup to identify the function name, args, kwargs, docstring, function calls, and method references (a sketch follows this list).
  • Step 3 - Utilities to index and re-index (train) the estimator (model). 20 mins.
  • Step 4 - Search for a given user query. Use cosine similarity to compare the user query against the indexed documents (a second sketch follows as well). 10 mins.
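For Step 2b, the Python tokenizer can be sketched with the standard-library ast module (an illustrative sketch; the startup kit's actual code may differ):

```python
import ast

def tokenize_python(source: str) -> list[str]:
    """Extract searchable tokens from Python source: function names,
    arguments, docstring words, and called/referenced names."""
    tokens = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            tokens.append(node.name)
            tokens.extend(arg.arg for arg in node.args.args + node.args.kwonlyargs)
            doc = ast.get_docstring(node)
            if doc:
                tokens.extend(doc.split())
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                tokens.append(node.func.id)    # plain call: foo(...)
            elif isinstance(node.func, ast.Attribute):
                tokens.append(node.func.attr)  # method reference: obj.bar(...)
    return tokens
```

And for Steps 3 and 4, indexing, persisting, and searching can be sketched as follows (toy corpus; the file names and the top_k parameter are illustrative assumptions):

```python
import pickle
from scipy.sparse import save_npz
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["def load_csv(path): ...", "def plot_chart(data): ..."]  # toy corpus
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]\w+")
matrix = vectorizer.fit_transform(docs)

# Persist both artifacts so later searches skip re-indexing
# (restore with pickle.load and scipy.sparse.load_npz).
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
save_npz("index.npz", matrix)

def search(query: str, top_k: int = 5):
    """Return indices of the top_k documents most similar to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, matrix).ravel()
    return scores.argsort()[::-1][:top_k]

print(search("load_csv"))  # the first document ranks highest
```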

A 5-minute buffer is reserved for spillover.

Part 5 - Extensions, discussion - 15 mins

This approach can be extended to handle code from GitHub, GitLab, or any other remote code hosting/collaboration software your team uses: let people track their repositories and refresh the indices periodically as the code changes.

Prerequisites:

We will be using beginner- to intermediate-level Python in the workshop.

  • Content will be shared in a Google Colab notebook, which participants can copy and follow along with.
  • Python - In the workshop, we will use Python for file traversal, tokenization, identifying modules/functions/docstrings in Python code, and indexing and re-indexing document matrices.
  • scikit-learn - An overview of how scikit-learn (specifically, TfidfVectorizer) works will be good to know.
  • Read the NCS (Neural Code Search) paper to understand the importance of code search and the relevance of code discovery.

Video URL:

https://youtu.be/6wKn2oZJqz0

Speaker Info:

I am Bhanu Kamapantula. I work on Gramener's product team with a focus on tools that improve developer workflows. My recent work includes Gramex Charts and Mobility in Covid.

I organized a workshop at PyConf Hyderabad 2019 - Building an ML classification application with Gramex (material).

My interests lie at the intersection of technology, data and people. I co-organize a Data Storytellers meetup. I used to be an academic researcher. You can read more on my personal blog.

View a subset of my talks and workshops.

Speaker Links:

https://bkamapantula.github.io

https://github.com/bkamapantula

https://twitter.com/thoughtisdead

Section: Data Science, Machine Learning and AI
Type: Workshop
Target Audience: Intermediate
Last Updated: