Finding Usage Patterns for Bots & Suspicious Users From Raw Logs Using Map Reduce

Dhruv Kalaan (~dhruv)


12 Votes

Description:

Anomaly detection and profiling the usage patterns of specific users are major concerns for an e-commerce website, and with scraping degrading live-traffic performance by almost 90%, this has become a prime focus area in the world of security. Standard web application firewalls can detect these bots, but only on the basis of predefined rules/rates. Using map/reduce to analyze access logs, we instead analyze the behavioral patterns of a user, classify them as a bot or a suspicious user, and throw alerts for the same.

The framework is divided into 2 parts:

1. Data Acquisition and Log Parsing
2. Machine learning and training a model for anomaly finding

Part 1 covers plugging in and hooking up logs from various systems into our system, and parsing them to extract the required enrichment fields.
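As a minimal sketch of the log-parsing step, the snippet below pulls enrichment fields out of one line of a standard Apache/nginx "combined" access log. The regex, field names, and sample line are illustrative assumptions, not the framework's actual parser:

```python
import re

# Hypothetical parser for one line of an Apache/nginx "combined" access log,
# extracting the kind of enrichment fields (IP, method, path, status, agent)
# the framework's Part 1 might produce.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of enrichment fields, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('203.0.113.7 - - [10/Oct/2017:13:55:36 +0530] '
          '"GET /flights/search HTTP/1.1" 200 2326 '
          '"-" "Mozilla/5.0 (compatible; SomeBot/1.0)"')
fields = parse_log_line(sample)
print(fields["ip"], fields["status"], fields["agent"])
```

Records parsed this way can then be shipped downstream (e.g. via Fluentd/Kafka) as structured events rather than raw text.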

Part 2 covers feeding these logs into Spark for machine learning, providing training sets of good/bad data, and building use cases based on user requirements.
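The map/reduce aggregation the title refers to can be sketched in plain Python as follows; in production this would run as a distributed Spark job, and the record fields here are hypothetical. The map phase emits `(ip, 1)` per parsed log record and the reduce phase sums the counts, yielding a per-IP request volume that the ML stage can consume as a behavioral feature:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (key, 1) pair for every log record, keyed by client IP.
    for record in records:
        yield record["ip"], 1

def reduce_phase(pairs):
    # Reduce: sum the counts for each key, giving requests-per-IP.
    counts = defaultdict(int)
    for ip, n in pairs:
        counts[ip] += n
    return dict(counts)

records = [
    {"ip": "203.0.113.7", "path": "/flights/search"},
    {"ip": "203.0.113.7", "path": "/flights/search"},
    {"ip": "198.51.100.2", "path": "/hotels"},
]
counts = reduce_phase(map_phase(records))
print(counts)  # a scraper hammering one endpoint stands out immediately
```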

We would briefly talk about the architecture, then move on to showcasing some use cases and demonstrating the pluggability of new ones. Our idea is to have this completely framework driven, where logs can be ingested simply by setting up a listener via Fluentd. Even the machine learning use cases have been made completely configurable: any use case can be plugged in via config files and driver-driven machine learning code, so designing a new use case won't take more than 10-15 minutes.
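To make the config-driven idea concrete, a pluggable use case might be declared in a file like the one below. Every key and value here is invented for illustration and is not the framework's actual schema:

```yaml
# Hypothetical use-case definition; keys and values are illustrative only.
use_case:
  name: deal-code-scraper
  source:
    kafka_topic: access-logs
  features:
    - requests_per_minute
    - distinct_paths
    - error_ratio
  model:
    type: kmeans
    clusters: 5
    training_window: 7d
  alert:
    channel: elastalert
    threshold: 3.0   # deviation from the learned baseline that triggers an alert
```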

Since any web application firewall can do rule-based detection and alerting, we have moved away from rule-based anomaly detection and instead let the system learn on its own using machine learning algorithms. The anomalies it produces therefore represent any deviation from normal behavior, rather than only the triggering of certain predefined rules.
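The "deviation from normal behavior" idea can be illustrated with a toy baseline-and-score sketch: fit a baseline on known-good traffic, then flag observations that sit far outside it. The real system trains clustering models in PySpark; a z-score is used here only to keep the example self-contained, and the numbers are made up:

```python
from statistics import mean, stdev

def train_baseline(normal_rates):
    """Learn the mean/std of requests-per-minute from known-good traffic."""
    return mean(normal_rates), stdev(normal_rates)

def is_anomalous(rate, mu, sigma, threshold=3.0):
    """True if the observed rate deviates more than `threshold` std devs."""
    return abs(rate - mu) / sigma > threshold

# Baseline learned from a hypothetical "good" training set.
mu, sigma = train_baseline([12, 9, 11, 10, 13, 8])

print(is_anomalous(950, mu, sigma))  # scraper-like rate
print(is_anomalous(12, mu, sigma))   # normal user rate
```

Because the threshold is relative to learned behavior rather than a fixed rule, the same code flags whatever is unusual for that traffic profile.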

We would end by covering the challenges faced during our journey and how we overcame them, the throughput we have seen in our system, and our ideal response time for handling any anomaly.

Our goal: helping a tech-driven company like MakeMyTrip quickly identify any anomalies that have occurred, in the least amount of time, thereby boosting business.

Prerequisites:

- Basic Python knowledge
- Understanding of Logstash/Syslog-NG
- Spark/PySpark
- Kafka
- ELK
- Basic machine learning understanding

Content URLs:

- Link to ElastAlert: https://github.com/yelp/elastalert
- Link to Elasticsearch: https://github.com/elastic/elasticsearch
- Link to Kibana: https://github.com/elastic/kibana
- Link to Spark: http://spark.apache.org/docs/latest/programming-guide.html
- Link to Kafka: http://kafka.apache.org/documentation.html
- Link to Fluentd: http://www.fluentd.org/

Python libraries used:

1. PySpark StreamingContext, PySpark SparkContext, PySpark clustering
2. Elasticsearch
3. Streaming libraries: Kafka
4. ElastAlert
5. Map Reduce

Speaker Info:

Dhruv Kalaan - I am a data science + security expert, currently working at MakeMyTrip India Pvt. Ltd. on ETL automation: using the ELK stack to accumulate security logs/access logs, using SIEM to understand and correlate these logs, and pushing them as alerts or events to databases or alerting tools. In my free time, I love reading about new technologies out in the market, to drive complex solutions to closure with simpler tools and solutions.

Kunal Aggarwal - I am a DevOps + security expert, currently working at MakeMyTrip India Pvt. Ltd. With over 2 years of DevOps experience, I develop automation tools and carry out security tasks such as VAPTs, bug bounty, and vulnerability assessments. In my free time, I love participating in coding challenges, and looking for new vulnerabilities on the web and trying to exploit them.

Speaker Links:

Kunal Aggarwal - https://in.linkedin.com/in/kunalaggarwal92
Dhruv Kalaan - https://in.linkedin.com/in/dhruv-kalaan-a05a1154

We spoke about the same topic before, but on a smaller, down-sized scale; it is now much larger, framework driven, almost dev-complete, and expanded much further.

We have extended the idea we presented at PyDelhi by making it framework driven, configurable, and easy to hook into any system. It can now detect any anomaly, as compared to only detecting certain rule-based types of anomalies.

Link to the PyDelhi talk: https://www.youtube.com/watch?v=RI86aYIRlog

Section: Security
Type: Talks
Target Audience: Intermediate
Last Updated:

Now that's something interesting. Looking forward to this talk.

Shubham Mittal (~shubham)

Thank you for the proposal; it looks like an interesting topic. I looked at the talk you shared, and it would be great if you could provide some details on the following questions:

  1. How different are you planning this talk to be from the one given at PyDelhi? Please share a concrete plan.
  2. Until the example you shared about the deal codes and their users, the talk looked like fairly simple how-tos of various tools, which I would advise refraining from. The example looked more like a problem for grep, so why this elaborate setup?
  3. Since you are one of the largest e-commerce companies, it would be great to get some insights on numbers, throughput, and scale challenges.
  4. Please list the libraries you will be using to talk to the various components like Elasticsearch, Kafka, ElastAlert, map-reduce, etc.

If not structured well, this talk will be too much in too little time.

konark modi (~konark)

@konark - We have modified our talk based on the inputs you provided :) let us know what you think of it now.

Dhruv Kalaan (~dhruv)

