Building a search engine for code discovery
Bhanu K (~bhanu26)
Description:
Problem
Developers spend a good amount of time searching for code and adapting it to the problem at hand. They often dig through old codebases (without context) or ask colleagues to point them to a relevant code block. This internal knowledge base can be better organized for discovery.
This workshop aims to build a simple search engine for your code (in remote repositories or local disk) using natural language processing techniques.
Specifically, participants will use scikit-learn to build a TF-IDF (term frequency - inverse document frequency) index over a corpus of their code. The result will be a search tool that, for a given query, returns a code snippet that can be copied straight into the development workflow.
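As a rough illustration of the end result (the directory name and vectorizer settings below are illustrative, not the workshop's exact code), the core loop could look like this:

```python
# Minimal sketch: index local code files with TF-IDF and look up a query.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Collect code files to index (the "corpus"); each file is one document.
files = sorted(Path("src").rglob("*.py"))
docs = [f.read_text(encoding="utf-8", errors="ignore") for f in files]

# Build the TF-IDF document matrix.
vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]\w*")
doc_matrix = vectorizer.fit_transform(docs)

# Rank documents against a natural-language query by cosine similarity.
query_vec = vectorizer.transform(["read yaml config"])
scores = cosine_similarity(query_vec, doc_matrix).ravel()
best = scores.argmax()
print(files[best], scores[best])
```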
Workshop Startup kit
A startup kit will be shared with all participants a few days before the workshop (via a GitHub repository). It will contain:
- `*.py` and `*.yaml` files as content to be indexed, i.e. the training data.
- Default estimators and keys (`.npz`, `.pkl`, `.json` files), i.e. the estimators used for search lookups. Participants can fall back to these defaults if they are unable to build an index during the workshop.
- Python code to tokenize the code content, index it and search it.
Part 1 - Introduction, demo - 10 mins
Introduce the objective and relevance of the problem.
Part 2 - Prepare the environment - 10 mins
Give participants time to install the dependencies.
Part 3 - Identify the files, TF-IDF - 30 mins
- Step 1 - Prepare Python and YAML files. We will be using YAML and Python files to build the search engine indices. 10 mins.
- Step 2 - Introduction to TF-IDF and practice exercises (see the toy example below). 20 mins.
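For the TF-IDF introduction, a practice exercise could be as small as the toy example below; the documents are made up for illustration:

```python
# Toy TF-IDF exercise: see which terms distinguish three tiny "documents".
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "def load_config(path): return yaml.safe_load(open(path))",
    "def save_model(model, path): pickle.dump(model, open(path, 'wb'))",
    "def load_model(path): return pickle.load(open(path, 'rb'))",
]

vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]\w*")
matrix = vectorizer.fit_transform(docs)

# Terms shared by many documents (e.g. "path") get low IDF weights;
# rarer terms (e.g. "yaml", "safe_load") score higher in their document.
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{term:12s} idf={idf:.2f}")
```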
Break - 10 mins
Part 4 - Implement the search engine - 75 mins
At its core, we need to build a search engine index which is saved to disk in the form of `.npz`, `.pkl` and `.json` files.
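One possible layout for these on-disk artifacts, assuming the TF-IDF matrix goes into the `.npz` file, the fitted vectorizer into the `.pkl` file and the row-to-file-path mapping into the `.json` file (the helper names and file prefix are illustrative):

```python
# Sketch of persisting and reloading the index; file names are illustrative.
import json
import pickle

from scipy import sparse

def save_index(doc_matrix, vectorizer, file_paths, prefix="code_index"):
    sparse.save_npz(f"{prefix}.npz", doc_matrix)        # TF-IDF document matrix
    with open(f"{prefix}.pkl", "wb") as fh:             # fitted vectorizer
        pickle.dump(vectorizer, fh)
    with open(f"{prefix}.json", "w") as fh:             # row index -> file path
        json.dump([str(p) for p in file_paths], fh)

def load_index(prefix="code_index"):
    doc_matrix = sparse.load_npz(f"{prefix}.npz")
    with open(f"{prefix}.pkl", "rb") as fh:
        vectorizer = pickle.load(fh)
    with open(f"{prefix}.json") as fh:
        file_paths = json.load(fh)
    return doc_matrix, vectorizer, file_paths
```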
- Step 1 - Identify the relevant files, i.e. the necessary `*.py` and `*.yaml` files. 10 mins.
- Step 2 - Tokenize the file content. Each file will be a document in the document matrix. 30 mins.
  - Step 2a) - Tokenizing a `.yaml` file is mostly straightforward.
  - Step 2b) - Tokenizing a `.py` file needs some setup to identify the function name, args, kwargs, docstring, function calls and method references (see the tokenizer sketch after this list).
- Step 3 - Utilities to index and re-index (train) the estimator (model). 20 mins.
- Step 4 - Search for a given user query. Use cosine similarity to compare the user query against the document estimator. 10 mins.
5 mins buffer for spillovers.
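For Step 2b, one way to pull names, arguments, docstrings, calls and method references out of a Python file is the standard-library `ast` module; the sketch below is illustrative rather than the workshop's exact tokenizer:

```python
# Sketch of tokenizing a .py file with the standard-library ast module.
import ast

def python_tokens(source: str) -> list[str]:
    tokens = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            tokens.append(node.name)                        # function/class name
            doc = ast.get_docstring(node)
            if doc:
                tokens.extend(doc.split())                  # docstring words
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            tokens.extend(a.arg for a in node.args.args)        # positional args
            tokens.extend(a.arg for a in node.args.kwonlyargs)  # keyword-only args
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            tokens.append(node.func.id)                     # function calls
        if isinstance(node, ast.Attribute):
            tokens.append(node.attr)                        # method references
    return tokens

# The token list can then be joined into a string and fed to the vectorizer.
print(python_tokens("def load(path):\n    'Read a yaml file'\n    return yaml.safe_load(open(path))"))
```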
Part 5 - Extensions, discussion - 15 mins
This approach can be extended to handle code from GitHub, GitLab or any other remote code hosting/collaboration software your team uses: let people track their repositories and refresh the indices periodically as the code changes.
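One hedged sketch of that extension, assuming tracked repositories are cloned locally with git and re-indexed on a schedule (the repository URL and the re-indexing hook are placeholders):

```python
# Sketch: keep local clones of tracked repositories fresh, then re-index them.
import subprocess
from pathlib import Path

TRACKED_REPOS = ["https://github.com/example/project.git"]  # placeholder URL
CLONE_ROOT = Path("repos")

def refresh_and_reindex():
    CLONE_ROOT.mkdir(exist_ok=True)
    for url in TRACKED_REPOS:
        target = CLONE_ROOT / Path(url).stem
        if target.exists():
            subprocess.run(["git", "-C", str(target), "pull", "--ff-only"], check=True)
        else:
            subprocess.run(["git", "clone", url, str(target)], check=True)
    # Re-run the indexing step from Part 4 over CLONE_ROOT here.
```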
Prerequisites:
We will be using beginner to intermediate level Python in the workshop.
- Content will be shared in a Google Colab notebook which participants can replicate and follow along.
- Python - In the workshop, we will use Python for file traversal, tokenization, module/function/docstring identification for Python functions, and indexing and re-indexing of document matrices.
- scikit-learn - It will be good to have an overview of how scikit-learn (specifically, `TfidfVectorizer`) is useful.
- Read the NCS paper to understand the importance of code search and the relevance of code discovery.
Video URL:
https://youtu.be/6wKn2oZJqz0
Speaker Info:
I am Bhanu Kamapantula. I work in Gramener's product team with a focus on tools to improve developer workflows. My recent work includes Gramex Charts, Mobility in Covid.
I organized a workshop at PyConf Hyderabad 2019 - Building a ML classification application with Gramex (material).
My interests lie at the intersection of technology, data and people. I co-organize a Data Storytellers meetup. I used to be an academic researcher. You can read more on my personal blog.
View a subset of my talks, workshops.
Speaker Links:
https://bkamapantula.github.io
https://github.com/bkamapantula
https://twitter.com/thoughtisdead