How to detect Phishing URLs using PySpark Decision Trees

Hitesh Dharmdasani (~hiteshd)




Credential theft through phishing pages accounts for the bulk of incidents affecting ordinary users in today's information security landscape. This talk looks at ways to detect phishing pages by leveraging large quantities of previously seen phishing pages and using machine learning algorithms to predict future phishing pages with a high degree of accuracy.

Current systems that detect malicious web pages rely on other factors, such as the reputation of the domain name, the flux of the domain name, the reputation of the IP address, whitelists, and blacklists.

While the above approaches have proved successful in the past, the content of the web page is the deciding factor in determining whether a page is a phishing page or the original page for the activity. The model being designed has proved 99.97% accurate with just over 10 features, which is a promising result.
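To make the idea of content-based features concrete, here is a minimal sketch of a feature extractor written with the Python standard library only. The five features it computes (form count, password-field count, link count, external-link count, and whether any form posts to another host) are hypothetical illustrations and are not the feature set used in the talk; the names `PageFeatureParser` and `extract_features` are likewise assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageFeatureParser(HTMLParser):
    """Collects simple counts from an HTML page (hypothetical feature set)."""

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.num_forms = 0
        self.num_password_inputs = 0
        self.num_links = 0
        self.num_external_links = 0
        self.form_actions = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.num_forms += 1
            self.form_actions.append(attrs.get("action") or "")
        elif tag == "input":
            if (attrs.get("type") or "").lower() == "password":
                self.num_password_inputs += 1
        elif tag == "a" and attrs.get("href"):
            self.num_links += 1
            host = urlparse(attrs["href"]).netloc
            # A link is external when it names a host other than the page's own.
            if host and host != self.page_host:
                self.num_external_links += 1


def extract_features(url, html):
    """Return a fixed-length numeric feature vector for one page."""
    page_host = urlparse(url).netloc
    parser = PageFeatureParser(page_host)
    parser.feed(html)
    # Assumption: a form that submits to a different host is a useful signal.
    posts_elsewhere = any(
        urlparse(action).netloc not in ("", page_host)
        for action in parser.form_actions
    )
    return [
        float(parser.num_forms),
        float(parser.num_password_inputs),
        float(parser.num_links),
        float(parser.num_external_links),
        1.0 if posts_elsewhere else 0.0,
    ]
```

Vectors like these, paired with phishing or legitimate labels from previously seen pages, are the kind of input a decision tree learner such as the ones in Spark's MLlib could be trained on.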


Willingness to ask questions. Attendees must have a preliminary understanding of what phishing pages are and what they do. An understanding of machine learning primitives is helpful but not essential.

Content URLs:

Talk on Spark at PyBelgaum

Content for PyBelgaum workshop

The code for the classifier and feature extractor will be open-sourced after the talk at PyCon India.

Speaker Info:

I am an independent security researcher. My interests lie at the intersection of network security, data science, and big data. During my graduate studies I was part of CESR (Center for Evidence-based Security Research), where I worked on understanding botnets and developing ways to combat cybercrime. I am currently building a series of data-driven network security products and services.

Id: 4
Section: Security
Type: Talks
Target Audience: Intermediate
Last Updated: