Realistic Synthetic Data at scale: Influenced by, but not production data

Mehul_Sheth



Description:

To have high confidence in a product, testing it against a data set that resembles production data is a must. The challenge lies in generating test data that represents production. Data in production is _not predictable_: it does not follow a simple formula, and many variables characterize it. Broadly, test data can be divided into two categories: Arbitrary, which is random and unstructured, and Realistic, which follows patterns and is predictable and controlled. To generate Realistic test data, the right patterns need to be captured by analyzing existing production data. Access to production data can be regulated and is not easy to obtain. However, it is possible to implement code that reads the relevant data from production without exposing the actual data and that updates, when required, the models used to generate test data. The generated test data then represents production data in selected dimensions, as directed by the business of the product under test.
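As a rough illustration of reading production data without exposing it, the sketch below walks a directory tree and keeps only aggregate counts. It is not Druva's implementation; the function names `profile_tree` and `size_bucket`, the power-of-two size buckets and the choice of statistics are all assumptions made for this example.

```python
import os
from collections import Counter


def size_bucket(num_bytes: int) -> int:
    """Map a file size to a power-of-two bucket index (0 for empty files)."""
    return num_bytes.bit_length()


def profile_tree(root: str) -> dict:
    """Walk `root` and collect aggregate counts only.

    Hypothetical profiler: no file names, paths or contents are stored,
    only how often each size bucket, extension and folder shape occurs.
    """
    sizes = Counter()       # size bucket -> number of files
    extensions = Counter()  # lowercase extension -> number of files
    fanout = Counter()      # (depth, files in folder, subfolders in folder) -> count
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        fanout[(depth, len(filenames), len(dirnames))] += 1
        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            sizes[size_bucket(size)] += 1
            extensions[os.path.splitext(name)[1].lower()] += 1
    return {"sizes": sizes, "extensions": extensions, "fanout": fanout}
```

Only the returned counters would need to leave the production environment, which is what keeps the actual data unexposed in this sketch.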

In this session, Mehul Sheth will talk about Druva's journey in generating test data at scale that is highly influenced by production data and has the "genes" of production data, yet takes not a single byte "as-is" from production. Although Druva's journey and the decisions taken may be unique and not directly applicable in all scenarios, the session will highlight the _thought process, algorithms and decisions_ in a generic fashion. It will also cover how to focus on the ability to assess the model and tweak it so that it can include _edge conditions, remain realistic, applicable at all times, versatile, repeatable_ and, most importantly, easily controllable.

Specifically, the session describes a process for _modeling a directory tree_ of files and folders using variables that may be important for the application under test, such as file size, the number of files and folders in each folder at each depth, patterns in file and folder names, and the ratio of different file types. It then shows how to _apply this model to generate file-sets_ of different sizes with completely random data while maintaining the relations between the modeled variables. Datasets generated this way are random in raw form but retain the characteristics of the model, and can be used for performance and stress testing of anti-virus software, legal discovery software or backup software. Extending the concept further, it can be used to model any data and metadata, such as mailboxes or transactional databases.
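The generation step might look something like the following minimal sketch. The `MODEL` values, the `generate` function and its `scale` parameter are hypothetical and chosen only for illustration; the sketch also samples most variables independently, which is a simplification of maintaining relations between them. It assumes Python 3.9 or later for `Random.randbytes`.

```python
import os
import random

# Hypothetical model: every number here is illustrative, not measured data.
MODEL = {
    "max_depth": 2,
    "folders_per_folder": {0: [2, 3], 1: [0, 1, 2]},           # depth -> observed subfolder counts
    "files_per_folder": {0: [1, 2], 1: [2, 3, 4], 2: [1, 5]},  # depth -> observed file counts
    "extensions": {".txt": 0.5, ".jpg": 0.3, ".pdf": 0.2},     # file type -> weight
    "size_buckets": {8: 0.6, 10: 0.3, 12: 0.1},                # size bit length -> weight
}


def weighted_choice(rng, weights):
    """Pick one key of `weights` with probability proportional to its value."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


def generate(root, model, seed, scale=1.0):
    """Create a random file-set under `root`; return the number of files written.

    The same seed reproduces the same tree; `scale` multiplies files per folder.
    """
    rng = random.Random(seed)
    created = 0

    def fill(folder, depth):
        nonlocal created
        os.makedirs(folder, exist_ok=True)
        n_files = round(rng.choice(model["files_per_folder"][depth]) * scale)
        for i in range(n_files):
            ext = weighted_choice(rng, model["extensions"])
            bucket = weighted_choice(rng, model["size_buckets"])
            size = rng.randrange(2 ** (bucket - 1), 2 ** bucket)
            with open(os.path.join(folder, f"file_{i:04d}{ext}"), "wb") as f:
                f.write(rng.randbytes(size))  # random content, nothing copied
            created += 1
        if depth < model["max_depth"]:
            for j in range(rng.choice(model["folders_per_folder"][depth])):
                fill(os.path.join(folder, f"dir_{j:03d}"), depth + 1)

    fill(root, 0)
    return created
```

Calling `generate("out", MODEL, seed=7)` twice into empty directories should yield the same layout and file sizes, which is one way to keep a dataset repeatable; changing `scale` or the model weights is where edge conditions could be dialed in.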

Prerequisites:

Nothing specific. However, if you have faced issues generating production-like data for testing, this session will help you!

Video URL:

https://youtu.be/W_Lp5QXhY1Q

Content URLs:

Will be shared soon

Speaker Info:

Mehul Sheth is a Senior Performance Engineer in Performance Labs at Druva, where he is responsible for the performance of the CloudApps product of Druva InSync. He has more than 13 years of experience in development and performance engineering, during which he has ensured the production performance of thousands of applications. Mehul loves to tackle unsolved problems and strives to bring a simple solution to the table rather than trying complex things.

Speaker Links:

LinkedIn

Section: Others
Type: Talks
Target Audience: Beginner