How I scaled up data science tasks using Dask and Numba instead of PySpark
Sundara Raman Narayanan (~sundara_raman) |
Data scientists prefer R and Python as tools for understanding data and building models. However, one major drawback of these tools is that they do not scale up automatically, so the same effort must be duplicated to rewrite the entire pipeline in another tool such as PySpark or Scala Spark. This also limits the kind of model that can be built with the data at hand, because not all models are available in Spark. For example, a simple KMeans clustering is readily available in Spark, but if I want to build an agglomerative clustering model, there is no Spark package that implements this algorithm. Thus data scientists face three challenges while using Spark:

1. The selection of algorithms is limited.
2. Time is spent rewriting the entire pipeline in Spark from the default tool of preference.
3. Spark itself has a learning curve.
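To illustrate the point about algorithm availability: agglomerative clustering, which has no ready-made Spark implementation, is a one-liner in scikit-learn. A minimal sketch (the toy two-blob dataset is a hypothetical example, not from the talk):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

# Agglomerative (hierarchical) clustering, readily available in scikit-learn.
model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(X)
```

The catch, of course, is that plain scikit-learn runs on a single machine — which is exactly the scalability gap the tools below address.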
Here I discuss natively available tools such as Numba and Dask that can help a data scientist solve the three issues mentioned above without compromising on scalability.
Basic knowledge of data science and Python
Sundara Raman Narayanan works as a Data Scientist with Crayon Data. He has 14 years of industry experience, and his research area is natural language processing. He has developed various statistical models, such as Random Forest, Gradient Boosting and XGBoost, to solve real-world business problems. His domain focus is mainly marketing and finance, and he is currently working on the consumer choice problem in the travel and tourism domain.