Performance optimization with Python-Elasticsearch

Anisha Swain (~anisha24) | 01 Jul, 2019

Description:

Objective:

By the end of this talk, we should have an idea of how to handle extremely large document sets in Elasticsearch queries and how to implement custom filters to analyze query strings.
1. Overview:
Whenever we think of implementing search for our application data, Elasticsearch comes up as one of the top choices. Elasticsearch is a popular, open source search stack used by web, mobile and cloud applications. But when it comes to large numbers of documents, Elasticsearch requires proper analysis of the query terms. Custom filters reduce the post-processing of fetched data and improve the performance of the API at the same time.

Problem: Sometimes when we query data from Elasticsearch, more data is fetched than is required, and it then needs post-processing, which eats up more memory and degrades performance. Deep pagination over a large amount of data is the costliest part of the process. Hence this talk is about returning all documents cheaply, improving performance, and trying to eliminate the post-processing.
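To make the cost of deep pagination concrete, here is a minimal sketch of the arithmetic behind a `from`/`size` request. It assumes the usual behaviour where every shard has to rank its own top `from + size` hits and the coordinating node merges all of them; the function name and the sample numbers are illustrative assumptions, not figures from the talk.

```python
def deep_pagination_cost(from_: int, size: int, shards: int) -> dict:
    """Estimate how many ranked entries a from/size request touches."""
    per_shard = from_ + size          # each shard ranks its own top (from + size) hits
    merged = per_shard * shards       # the coordinating node merges every shard's list
    return {
        "per_shard": per_shard,
        "merged_on_coordinator": merged,
        "returned": size,
        "discarded": merged - size,   # everything ranked but never sent to the client
    }


# Assumed example: results 10,001 to 10,010 on an index with 5 shards.
cost = deep_pagination_cost(from_=10_000, size=10, shards=5)
print(cost)
```

With these assumed numbers, the coordinating node ranks 50,050 entries in order to return 10, which is why paging far into a result set gets expensive.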
1. Approach:
Starting from simple aggregations, this talk will do a deep dive into the design of analyzers using different combinations of Tokenizers and TokenFilters. This will save us from fetching large amounts of data by making the query more precise, and it will enhance the search capability. Analyzers are the special algorithms that determine how a string field in a document is transformed into terms for searching. The talk will also touch on using the scan and scroll API to retrieve huge amounts of data in batches without paying the penalty for deep pagination. A scrolled search allows us to do an initial search and keep pulling batches of results (the batch size is configurable) from Elasticsearch until there are no more results left. The scanning part disables sorting of results, making the process a lot cheaper.
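The scroll loop described above can be sketched roughly as follows. This is not the demo from the talk: it assumes a client exposing `search`, `scroll` and `clear_scroll` in the style of the 7.x `elasticsearch` Python client, and it uses a hypothetical in-memory `FakeClient` so the example runs without a cluster. In recent Elasticsearch versions the separate "scan" search type no longer exists, and sorting by `_doc` is the usual way to skip the sorting work, so that is what the sketch assumes.

```python
from typing import Any, Dict, Iterator


def scroll_all(client: Any, index: str, query: Dict[str, Any],
               page_size: int = 1000, keep_alive: str = "2m") -> Iterator[Dict[str, Any]]:
    """Yield every hit for `query`, pulling one batch at a time."""
    # Sorting by _doc avoids scoring and sorting costs (assumed stand-in for "scan").
    body = {"query": query, "sort": ["_doc"]}
    response = client.search(index=index, body=body, scroll=keep_alive, size=page_size)
    scroll_id = response["_scroll_id"]
    try:
        while response["hits"]["hits"]:           # an empty batch means we are done
            for hit in response["hits"]["hits"]:
                yield hit
            response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = response["_scroll_id"]
    finally:
        client.clear_scroll(scroll_id=scroll_id)  # free the server-side search context


class FakeClient:
    """Hypothetical in-memory stand-in for a real Elasticsearch client."""

    def __init__(self, docs):
        self._docs = list(docs)
        self._pos = 0
        self._size = 0
        self.cleared = False

    def _page(self):
        batch = self._docs[self._pos:self._pos + self._size]
        self._pos += self._size
        return {"_scroll_id": "fake-scroll-id",
                "hits": {"hits": [{"_source": doc} for doc in batch]}}

    def search(self, index, body, scroll, size):
        self._pos, self._size = 0, size
        return self._page()

    def scroll(self, scroll_id, scroll):
        return self._page()

    def clear_scroll(self, scroll_id):
        self.cleared = True


client = FakeClient({"n": i} for i in range(25))
results = [hit["_source"]["n"]
           for hit in scroll_all(client, "my-index", {"match_all": {}}, page_size=10)]
print(len(results))
```

Against a real cluster, `FakeClient` would be replaced by an `Elasticsearch` client instance; the `elasticsearch.helpers.scan` helper wraps a similar loop.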
1. Outline of the talk (~35 mins)
- Intro to Elasticsearch (2-3 min)
  - Documents and Indices.
  - Search and Analyze.

- Aggregations (3 min)
- All about Analyzer (10 min)
  - The Components of an Analyzer
- Normalizers (3 min)
- Tokenizers (3 min)
- Token Filters (3 min)
- The Components of an Analyzer (5 min)
- Scan and Scroll API (with Demo) (5 min)
- Conclusion and Q/A (2 min)
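As a rough illustration of the analyzer components listed in the outline, the sketch below defines index settings for a custom analyzer built from one tokenizer and a chain of token filters. The index name `talks`, the analyzer name `talk_analyzer` and the filter name `english_stop` are hypothetical; `standard`, `lowercase`, `asciifolding` and the `stop` filter type are built-in Elasticsearch components. The `toy_analyze` function is only a local approximation of what such a pipeline does, not Elasticsearch's actual implementation.

```python
import re

# Settings body for a custom analyzer: one tokenizer, then token filters in order.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {"type": "stop", "stopwords": "_english_"},
            },
            "analyzer": {
                "talk_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding", "english_stop"],
                },
            },
        },
    },
}

# Small illustrative subset of stop words (an assumption, not the full English list).
STOPWORDS = {"a", "an", "and", "of", "the"}


def toy_analyze(text: str) -> list:
    """Approximate the pipeline locally: tokenize, lowercase, drop stop words."""
    tokens = re.findall(r"\w+", text)                  # tokenizer: split into words
    tokens = [token.lower() for token in tokens]       # token filter: lowercase
    return [token for token in tokens if token not in STOPWORDS]  # token filter: stop


terms = toy_analyze("The Quick Brown Fox of Bhubaneswar")
print(terms)

# With a real client (assumed 7.x-style calls), the settings could be applied with:
#   client.indices.create(index="talks", body=index_settings)
#   client.indices.analyze(index="talks",
#                          body={"analyzer": "talk_analyzer", "text": "..."})
```

Checking the produced terms this way before indexing makes it easier to see whether a query string will match what is actually stored.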
1. Key takeaways
Some of the key takeaways from this talk should be:
- Understanding simple data aggregation.
- Querying a large number of documents with pagination/scrolling.
- Choosing the right analyzer for an Elasticsearch query, which can be as much an art as a science.
1. Who is this talk for?
- Python/JavaScript developers who deal with performance issues on a daily basis.
- The curious souls who want to know about Elasticsearch.
- Anyone who deals with Elasticsearch on a daily basis.

Prerequisites:

Recommended:
- Fundamentals of Elasticsearch.
- Basic Knowledge of documents.
1. Preferable:
- Understanding of query and aggregation.
- Understanding of performance analysis.

Content URLs:

Note: This talk is inspired by the official Elasticsearch Reference Guide and the video course "Get Started with Elasticsearch" by Will Button.

Audience might find these links useful (not prerequisites)

The presentation, which is still being prepared, can be viewed here

Speaker Info:

Anisha Swain

Anisha has been a part of the open source world for over 3 years now. As a student developer, she started contributing to HospitalRun as a part of Rails Girls Summer of Code. She is a former Summer Research Fellow under the Indian Academy of Sciences and has been associated with research in Image Processing. She also contributed to AIMA-Javascript as a participant of Google Summer of Code-2017. She has been a part of Google Developers Group and Women Techmakers as a speaker and volunteer.

For the past 6 months, she has been working at Red Hat, Bangalore as an intern and continues to contribute to Red Hat open source insiders products. She is a technology enthusiast aiming to develop cutting-edge technologies for the future. Her fields of interest include Web Development, Performance and Scale, Image Processing, Machine Learning, and Data Science.
1. Manaswini Das
Manaswini Das is an open source lover from Bhubaneswar, India. She is a former Outreachy intern at Open Humans Foundation and a Processing Foundation fellow. She contributes to open source software and is ambitious in developing futuristic technologies. Her fields of interest include open source and artificial intelligence. Her hobbies include poetry, blogging and basketball. Being a pensive person, she likes diving into the depth of everything that she comes across.

Speaker Links:

- Anisha Swain
  - Github - Codes and Open Source Contributions
  - Linkedin - Professional Career
  - Twitter - Social
  - Personal Website - Visualization Work
  - Blog - Technical Articles
  - Article Publication - Image Processing
  - Research Paper - Real time biometric surveillance with gait recognition
- Manaswini Das
  - Github
  - Blog

Section:	Developer tools and automation
Type:	Talks
Target Audience:	Beginner
Last Updated:	01 Sep, 2019
