Building a search engine for code discovery
Bhanu K (~bhanu26)
Developers spend a significant amount of time searching for code and adapting it to a given problem. They often search old codebases (without context) or ask colleagues to identify a code block. This internal knowledge base can be better organized for discovery.
This workshop aims to build a simple search engine for your code (in remote repositories or local disk) using natural language processing techniques.
Specifically, participants will use scikit-learn to build a TF-IDF (term frequency - inverse document frequency) index over a corpus of their code. The result: for a given input query, the engine returns a code snippet that can be copy-pasted into the development workflow.
Workshop Startup kit
A startup kit will be shared with all participants a few days before the workshop (via a GitHub repository). It will contain:
- `*.yaml` files as content to be indexed, i.e., training data.
- Default estimators and keys (`.json`), i.e., estimators used for search lookups. Participants should fall back to these defaults if they fail to index an estimator during the workshop.
- Python code to tokenize code content, index it, and search.
Part 1 - Introduction, demo - 10 mins
Introduce the objective and relevance of the problem.
Part 2 - Prepare the environment - 10 mins
Give participants time to install the dependencies.
Part 3 - Identify the files, TF-IDF - 30 mins
- Step 1 - Prepare Python and YAML files. We will be using YAML and Python files to build the search engine indices. 10 mins.
- Step 2 - Introduction to TF-IDF and practice exercises. 20 mins.
Break - 10 mins
Part 4 - Implement the search engine - 75 mins
At its core, we need to build a search engine index that is saved to disk in the form of an estimator.
- Step 1 - Identify the correct files: the necessary `*.py` and `*.yaml` files. 10 mins.
- Step 2 - Tokenize the file content. Each file will be a document in the document matrix. 30 mins.
- Step 2a) - Tokenizing a `.yaml` file is mostly straightforward.
- Step 2b) - Tokenizing a `.py` file needs some setup to identify the name, args, kwargs, docstring, function calls, and method references.
- Step 3 - Utilities to index and re-index (train) the estimator (model). 20 mins.
- Step 4 - Search for a given user query. Use cosine similarity to compare the user query relevance against the document estimator. 10 mins.
5 mins buffer for spillovers.
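Steps 2-4 above can be sketched end to end. This is a simplified illustration, not the startup kit's actual code: the name `tokenize_python` is hypothetical, and it extracts only function names, args, and docstrings (the full version would also cover kwargs and method references).

```python
import ast
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tokenize_python(source: str) -> str:
    """Step 2b: pull names, args, docstrings, and calls out of Python source."""
    tokens = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            tokens.append(node.name)
            tokens.extend(arg.arg for arg in node.args.args)
            doc = ast.get_docstring(node)
            if doc:
                tokens.append(doc)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            tokens.append(node.func.id)  # plain function calls
    return " ".join(tokens)

# Step 3: index the documents (one document per file; sources are illustrative).
documents = [
    tokenize_python("def parse_yaml(path):\n    'Load a yaml file'\n    return path"),
    tokenize_python("def send_request(url, timeout):\n    'HTTP GET helper'\n    return url"),
]
vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(documents)

# Step 4: compare the query vector against every document with cosine similarity.
def search(query: str) -> int:
    scores = cosine_similarity(vectorizer.transform([query]), index)[0]
    return int(scores.argmax())

print(search("load yaml file"))  # best match: document 0
```

Persisting the fitted `vectorizer` and `index` to disk (e.g., with `joblib`) gives the re-indexable estimator used in step 3.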
Part 5 - Extensions, discussion - 15 mins
This approach can be extended to handle code from GitHub, GitLab, or any other remote code hosting/collaboration software your team uses: let people track their repositories and refresh indexes periodically as the code changes.
We will be using beginner to intermediate level Python in the workshop.
- Content will be shared in a Google Colab notebook which participants can replicate and follow along.
- Python - we will use Python for file traversal, tokenization, module/function/docstring identification for Python functions, and indexing and re-indexing document matrices.
- scikit-learn - it will be good to know an overview of how `TfidfVectorizer` is useful.
- Read the NCS paper to understand the importance of code search and the relevance of code discovery.
View a subset of my talks and workshops.