Performance optimization with Python-Elasticsearch
Anisha Swain (~anisha24)
By the end of this talk, you will have an idea of how to handle extremely large sets of documents in Elasticsearch queries and how to implement custom filters to analyze query strings.
Whenever we think of implementing search for our application data, Elasticsearch bubbles up as one of the top choices. Elasticsearch is a popular, open-source search stack used by web, mobile and cloud applications. But when a large number of documents is involved, Elasticsearch requires careful analysis of the query terms. It calls for custom filters that reduce the post-processing of fetched data and, at the same time, enhance the performance of the API.

Problem: Sometimes when we query data from Elasticsearch, more data is fetched than is actually required, and the surplus then needs post-processing, which eats up memory and degrades performance. Deep pagination over a large result set is the costliest part of the process. Hence this talk is about returning all documents cheaply, increasing performance, and eliminating the post-processing.
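To make the deep-pagination cost concrete, here is a small sketch (the field name, page number, and page size are our own illustrative choices, not from the talk). For page N of a classic from/size query, every shard must collect and sort all hits up to that offset before the coordinating node can return the requested slice:

```python
def paginated_query(query_string, page, page_size=50):
    """Build a classic from/size pagination body.

    To serve page N, each shard has to score and sort
    page * page_size + page_size candidate hits, which is
    why deep pages get progressively more expensive.
    """
    return {
        "from": page * page_size,
        "size": page_size,
        "query": {"match": {"body": query_string}},
    }

# Page 1000 forces each shard to rank 50,050 hits
# just to hand back these 50 documents.
body = paginated_query("python elasticsearch", page=1000)
```

This is the pattern the scan and scroll API (covered later) is designed to avoid.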
Starting from simple aggregations, this talk will take a deep dive into the design of analyzers that use different combinations of Tokenizers and TokenFilters. Analyzers are the algorithms that determine how a string field in a document is transformed into terms for searching; being precise about the query in this way saves us from fetching excess data and enhances search capability. The talk will also touch on using the scan and scroll API to retrieve huge amounts of data in batches without paying the penalty of deep pagination. A scrolled search performs an initial search and keeps pulling batches of results from Elasticsearch (the batch size is configurable) until none are left, while the scanning part disables sorting of results, making the process a lot cheaper.
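As a sketch of what combining Tokenizers and TokenFilters looks like in practice (the index name, analyzer name, and field are illustrative assumptions, not from the talk), a custom analyzer is declared in the index settings and wired to a field through the mapping:

```python
# Index settings declaring a custom analyzer: the standard
# tokenizer followed by lowercase and stop-word token filters.
# Sent as the body of a create-index call, e.g.
#   es.indices.create(index="articles", body=INDEX_BODY)
INDEX_BODY = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "stop"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                "analyzer": "my_english_analyzer",
            }
        }
    },
}
```

With this in place, text indexed into `body` is lowercased and stripped of stop words before it ever becomes searchable terms, so queries match fewer, better candidates instead of being cleaned up after the fetch.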
Outline of the talk (~35 mins)
- Intro to Elasticsearch (2-3 min)
  - Documents and Indices.
  - Search and Analyze.
- Aggregations (3 min)
- All about Analyzers (10 min)
  - The Components of an Analyzer
    - Normalizers (3 min)
    - Tokenizers (3 min)
    - Token Filters (3 min)
- Scan and Scroll API (with demo) (5 min)
- Conclusion and Q/A (2 min)
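The scan-and-scroll portion of the outline can be sketched in Python. The elasticsearch-py client already ships a `helpers.scan` wrapper that implements this protocol; the generator below only illustrates what happens underneath, and the client, index, and query shown are assumptions, not from the talk:

```python
def scroll_all(client, index, query, batch_size=1000, keep_alive="2m"):
    """Yield every matching document in batches via the scroll API.

    Sorting by _doc disables relevance scoring (the 'scan' part),
    which is what makes sweeping the full result set cheap.
    """
    resp = client.search(
        index=index,
        scroll=keep_alive,
        size=batch_size,
        body={"query": query, "sort": ["_doc"]},
    )
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]
    while hits:
        for hit in hits:
            yield hit
        resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
        scroll_id = resp["_scroll_id"]
        hits = resp["hits"]["hits"]
    client.clear_scroll(scroll_id=scroll_id)  # free server-side state
```

Each `scroll` call pays only for the next batch, so the cost stays flat no matter how deep into the result set we are, unlike from/size pagination.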
Some of the key takeaways from this talk:
- Understanding simple data aggregation
- Dealing with queries over a large number of documents using pagination/scrolling
- Choosing the right analyzer for an Elasticsearch query can be as much an art as a science
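The first takeaway, simple data aggregation, can be sketched as a terms aggregation body (the field and bucket names here are our own assumptions). Instead of fetching raw documents and counting them in Python, Elasticsearch buckets them server-side:

```python
# "size": 0 suppresses the hit payload entirely, so only the
# bucket counts travel over the wire. Sent as a search body, e.g.
#   es.search(index="articles", body=AGG_BODY)
AGG_BODY = {
    "size": 0,
    "aggs": {
        "posts_per_author": {
            "terms": {"field": "author.keyword", "size": 10}
        }
    },
}
```

This is the simplest case of pushing post-processing into the cluster rather than doing it after the fetch.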
Who is this talk for?
- The curious souls who want to know about Elasticsearch.
- Those who deal with Elasticsearch on a daily basis.

Prerequisites

- Fundamentals of Elasticsearch.
- Basic Knowledge of documents.
- Understanding of query and aggregation.
- Understanding of performance analysis.
Audience might find these links useful (not prerequisites)
The presentation, which is still being prepared, can be viewed here
- Anisha Swain
For the past six months, she has been working as an intern at Red Hat, Bangalore, and continues to contribute to Red Hat open source products. She is a technology enthusiast aiming to develop cutting-edge technologies for the future. Her fields of interest include web development, performance and scale, image processing, machine learning, and data science.
- Manaswini Das
Manaswini Das is an open source lover from Bhubaneswar, India. She is a former Outreachy intern at Open Humans Foundation and a Processing Foundation fellow. She contributes to open source software and is ambitious in developing futuristic technologies. Her fields of interest include open source and artificial intelligence. Her hobbies include poetry, blogging and basketball. Being a pensive person, she likes diving into the depth of everything that she comes across.