Clustering in PySpark
Hariharan C (~hariharan)
With Big Data the flavour of the decade and machine learning on distributed systems gaining prominence, this talk focuses on our experiments with the clustering algorithms currently available in PySpark. We will discuss the challenges we faced in data munging, scaling and optimization, and feature encoding with mixed variables, along with the many Spark gotchas that we tripped on. Finally, we will walk through the code of our soon-to-be open-sourced implementation of a clustering algorithm, which we developed to address the limitations of the existing algorithms.
Prerequisites: A basic understanding of clustering, MapReduce and Spark.
Hariharan C is a data scientist at Mad Street Den. He builds artificially intelligent data products for e-commerce. Data hacker. ML wielder. Wannabe quiz-geek. Man-U fanboy.
Twitter : https://twitter.com/harc007