Revealing Uncommon Tricks: 25 Obscure Pandas & NumPy Hacks learnt over 5 years of being a Data Scientist

Nitin Kishore Sai Samala (~snknitin)



Description:

Pandas and NumPy are indispensable tools in the arsenal of any data scientist, providing a solid foundation for data manipulation, analysis, and scientific computing. Over the course of five years in the data science realm, I have delved deep into these libraries, unearthing lesser-known, yet incredibly powerful hacks that have transformed my data workflows. In this enlightening talk, I will share 25 obscure Pandas & NumPy hacks that will revolutionize the way you approach data challenges and elevate your data science prowess to new heights.

The talk will commence with an introduction to Pandas & NumPy, covering their key functionalities and popular use cases. Building upon this foundation, we will embark on a captivating exploration of lesser-known hacks that can save you valuable time and streamline your data science pipelines. Ideally, they should become an indispensable part of your future workflows.

  1. Efficient Groupby Aggregations: Harness the hidden power of custom aggregation functions and optimized groupby operations to compute complex aggregations swiftly, achieving remarkable performance gains.
  2. Transforming Data with Custom Functions: Unravel the art of using NumPy's vectorized functions and Pandas' apply method to efficiently transform data based on your unique requirements.
  3. Merging DataFrames with Fuzzy Matching: Explore approximate joins using Pandas' merge_asof (nearest-key matching) and merge_ordered (ordered merges with optional filling), allowing you to join data without requiring exact key matches.
  4. Pivot Tables with Multi-Index and Multi-Level Columns: Elevate your data summarization game by creating advanced pivot tables with multiple levels of indexing and columns, unleashing powerful insights.
  5. Fastest Way to Drop Duplicates: Discover a lesser-known approach to dropping duplicates with lightning speed and better selection criteria across columns, perfect for handling large datasets efficiently.
  6. Pandas DataFrame Styling and describe settings: Dive into Pandas' styling capabilities, enabling you to present your data with striking visualizations and aesthetic improvements.
  7. NumPy Broadcasting Tricks: Unveil NumPy's broadcasting capabilities and clever tricks that enable you to perform operations on arrays of different shapes seamlessly.
  8. Handling Missing Data with NumPy: Master the art of dealing with missing data effectively using NumPy's masked arrays and advanced indexing techniques.
  9. Filtering Rows with NumPy's np.where: Learn how to leverage np.where to conditionally filter rows in NumPy arrays, enabling you to simplify complex logic.
  10. Smart Memory Management with Chunked Processing: Harness the power of chunked processing to optimize memory usage when working with large datasets, without sacrificing performance.
  11. Creating Interactive Visualizations with Seaborn and Plotly: Combine the power of Seaborn and Plotly to create interactive visualizations that tell compelling data stories.
  12. Powerful Time Series Analysis with Pandas: Uncover lesser-known Pandas time series functionalities, such as rolling windows, resampling, and time zone conversions, to perform advanced statistical analyses.
  13. Elegant Handling of Categorical Data: Discover Pandas' categorical data type and its potential to optimize memory usage and enhance performance for categorical variables.
  14. The Right Way to fillna: Deal with your pesky NaNs and <NA> values before they come back to bite you during training.
  15. NumPy's Fast Fourier Transform (FFT): Unlock the potential of NumPy's FFT for extracting frequency-domain features from time series data, enabling advanced signal processing.
  16. Speed Up Data Reading with Dask and Vaex: Dive into the world of parallel computing libraries that let you read and process large datasets efficiently in parallel.
  17. Memory Mapping with NumPy: Uncover the lesser-known memory mapping functionality of NumPy to efficiently read and write large arrays from disk.
  18. Simplify Missing Data Imputation with NumPy Masked Arrays: Learn how to handle missing data using NumPy masked arrays, offering a clean and intuitive approach to imputation.
  19. Effortless Broadcasting in Pandas: Master the broadcasting capabilities of Pandas to perform element-wise operations on DataFrames with different shapes.
  20. Inplace application with Pandas' agg: Exploit the versatility of Pandas' agg function to compute multiple aggregations efficiently in a single step.
  21. Savepoint: Ditch the CSV: Explore different ways of saving DataFrames to disk and loading them back into Pandas.
  22. Handling Complex Data Structures with Pandas and NumPy: Combine the strengths of Pandas and NumPy to handle complex data structures and multi-dimensional arrays with ease.
  23. Optimize Categorical Data Conversion with pd.factorize: Discover pd.factorize, an efficient method for converting categorical data to numerical representation.
  24. Unpacking Timestamps in Pandas using fast.ai helper: Explore Pandas' feature engineering capabilities for efficient time-based analysis, down to the second.
  25. Automate Your EDA with pandas-profiling: Explore advanced Pandas and NumPy techniques for efficiently processing and analyzing massive datasets with ease.

*Some of these may change depending on the duration of the talk and how easily the hacks can be clustered.*
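
A few of the hacks listed above can be sketched in a short, self-contained snippet. This example is illustrative only and is not taken from the talk's slides: the `trades`, `quotes`, and `colors` objects, their column names, and their values are hypothetical. It shows nearest-key joining with `pd.merge_asof` (hack 3), integer encoding with `pd.factorize` (hack 23), and the categorical dtype (hack 13).

```python
import pandas as pd

# Hypothetical data: trades and quotes whose timestamps do not line up exactly.
trades = pd.DataFrame({
    "time": pd.to_datetime(["2023-01-01 09:00:03", "2023-01-01 09:00:07"]),
    "ticker": ["AAA", "AAA"],
    "qty": [10, 20],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2023-01-01 09:00:01", "2023-01-01 09:00:05"]),
    "ticker": ["AAA", "AAA"],
    "bid": [100.0, 101.0],
})

# Hack 3: merge_asof pairs each trade with the most recent quote at or before
# its timestamp. Both frames must already be sorted by the "on" column.
joined = pd.merge_asof(trades, quotes, on="time", by="ticker")

# Hack 23: pd.factorize maps each distinct value to an integer code,
# numbering values in order of first appearance.
colors = pd.Series(["red", "blue", "red", "green"])
codes, uniques = pd.factorize(colors)

# Hack 13: the categorical dtype stores small integer codes plus a single
# copy of each label, which usually saves memory on repetitive columns.
as_category = colors.astype("category")

print(joined)
print(codes, list(uniques))
print(as_category.dtype)
```

By default `merge_asof` searches backward in time; its `direction` and `tolerance` parameters control how far and in which direction a match may be found.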

Key Takeaways:

  1. Acquire 25 lesser-known Pandas & NumPy hacks to optimize data manipulation and analysis workflows.
  2. Enhance data cleaning, transformation, and summarization techniques with powerful and efficient methods.
  3. Elevate your data visualization capabilities using Pandas and Plotly for interactive and appealing visualizations.
  4. Boost your efficiency in handling big data with smart memory management and chunked processing techniques.
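
The chunked-processing idea in the last takeaway can be sketched as follows. This is a minimal illustration, assuming a CSV source. The in-memory `csv_text`, the column name `value`, and the chunk size of 4 are hypothetical stand-ins for a large file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file on disk.
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
n_rows = 0

# Passing chunksize makes read_csv return an iterator of DataFrames, so only
# one chunk of rows is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()
    n_rows += len(chunk)

# Combine the per-chunk partial results into the final statistic.
mean_value = total / n_rows
print(total, n_rows, mean_value)
```

This pattern works for statistics that can be built from partial results (sums, counts, minima, maxima). Statistics such as an exact median need a different approach.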

Outline:

  1. Introduction to Pandas & NumPy (5 minutes)
    • Overview of key functionalities and applications
    • Popular use cases in data science
  2. 25 Obscure Pandas & NumPy Hacks (20 minutes)
    • Each hack will be presented with a code snippet and practical use case
    • Hacks will cover data manipulation, optimization, visualization, and more
  3. Key Takeaways and Closing Remarks (5 minutes)
    • Recap of the 25 hacks and their potential impact on data science workflows
    • Encouragement to apply these hacks creatively in real-world projects
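
As an illustration of the snippet-plus-use-case format described in the outline, here is a hedged sketch of NumPy broadcasting, `np.where`, and masked arrays (hacks 7, 8, and 9). The array values and the use of -1 as a missing-value sentinel are assumptions made for this example, not material from the slides.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Broadcasting: the length-3 vector of column means is stretched across
# both rows, so no explicit loop or tiling is needed.
centered = data - data.mean(axis=0)

# np.where with three arguments chooses element-wise between two alternatives.
clipped = np.where(data > 4.0, 4.0, data)

# np.where with one argument returns the indices where the condition holds;
# here it selects the rows whose first column exceeds 2.
rows = data[np.where(data[:, 0] > 2.0)[0]]

# Masked arrays: treat -1 as "missing" (an assumed sentinel), compute the
# mean over the valid entries only, then impute the gap with that mean.
readings = np.array([10.0, -1.0, 30.0])
masked = np.ma.masked_equal(readings, -1.0)
filled = masked.filled(masked.mean())

print(centered)
print(clipped)
print(rows)
print(filled)
```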

Join me for this data science talk as we unveil 25 lesser-known hacks using Pandas & NumPy that can help you become a more efficient, effective, and resourceful data scientist. Whether you're a seasoned practitioner or a budding data enthusiast, these hacks are meant to take your data science skills to the next level. Let's learn better data manipulation techniques together!

Prerequisites:

Basic understanding of Python, data manipulation for ML/DL models, and experience using packages like Pandas and NumPy in data science projects.

Content URLs:

You can find the slides uploaded here:

https://github.com/snknitin/exploring-pandas-and-numpy

Please open in slide show view to allow transitions

Speaker Info:

A jack-of-all-trades with a Master's in Computer Science and a minor in Data Science. I graduated from UMass Amherst in 2018 and am a Staff Data Scientist. I did my undergraduate degree at BITS Pilani. I love solving puzzles/ciphers, deductive reasoning, and tackling real-world challenges that require learning and combining different concepts.

My name is a palindrome, and I’m a polymath and a polyglot (first language is Python). I aim to be an expert generalist across all the subfields and domains of AI, and I am perpetually working towards it. Life is a constant struggle between being a member of the community and standing out as an individual. I find a balance between both. I’m not big on introductions because actions speak louder, and I believe people should grind until they no longer have to introduce themselves.

Speaker Links:

I have delivered two talks prior to this:

1) RE-WORK Applied AI Summit, San Francisco, Jan 2020.

2) Walmart AI Summit, Bangalore, April 2022.

You can find the links to these here: https://snknitin.github.io/talks/

Some additional redirects for the intrigued:

🌐 Website
📜 Blogs
🤖 Open Source Contribution
🐱‍💻 GitHub
⛓️ LinkedIn

Section: Data Science, AI & ML
Type: Talks
Target Audience: Intermediate