Revealing Uncommon Tricks: 25 Obscure Pandas & NumPy Hacks learnt over 5 years of being a Data Scientist
Nitin Kishore Sai Samala (~snknitin)
Description:
Pandas and NumPy are indispensable tools in the arsenal of any data scientist, providing a solid foundation for data manipulation, analysis, and scientific computing. Over the course of five years in the data science realm, I have delved deep into these libraries, unearthing lesser-known, yet incredibly powerful hacks that have transformed my data workflows. In this enlightening talk, I will share 25 obscure Pandas & NumPy hacks that will revolutionize the way you approach data challenges and elevate your data science prowess to new heights.
The talk will commence with an introduction to Pandas & NumPy, covering their key functionalities and popular use cases. Building upon this foundation, we will embark on a captivating exploration of lesser-known hacks that can save you valuable time and streamline your data science pipelines. Ideally, they should become an indispensable part of your future workflows.
- Efficient Groupby Aggregations: Harness the hidden power of custom aggregation functions and optimized groupby operations to compute complex aggregations swiftly, achieving remarkable performance gains.
- Transforming Data with Custom Functions: Unravel the art of using NumPy's vectorized functions and Pandas' `apply` method to efficiently transform data based on your unique requirements.
- Merging DataFrames with Fuzzy Matching: Explore the world of fuzzy matching using Pandas' `merge_asof` and `merge_ordered` functions, allowing you to join data with less strict matching criteria.
- Pivot Tables with Multi-Index and Multi-Level Columns: Elevate your data summarization game by creating advanced pivot tables with multiple levels of indexing and columns, unleashing powerful insights.
- Fastest Way to Drop Duplicates: Discover a lesser-known approach to dropping duplicates with lightning speed and better selection criteria across columns, perfect for handling large datasets efficiently.
- Pandas DataFrame Styling and `describe` settings: Dive into Pandas' styling capabilities, enabling you to present your data with striking visualizations and aesthetic improvements.
- NumPy Broadcasting Tricks: Unveil NumPy's broadcasting capabilities and clever tricks that enable you to perform operations on arrays of different shapes seamlessly.
- Handling Missing Data with NumPy: Master the art of dealing with missing data effectively using NumPy's masked arrays and advanced indexing techniques.
- Filtering Rows with NumPy's `np.where`: Learn how to leverage `np.where` to conditionally filter rows in NumPy arrays, enabling you to simplify complex logic.
- Smart Memory Management with Chunked Processing: Harness the power of chunked processing to optimize memory usage when working with large datasets, without sacrificing performance.
- Creating Interactive Visualizations with Seaborn and Plotly: Combine the power of Seaborn and Plotly to create interactive visualizations that tell compelling data stories.
- Powerful Time Series Analysis with Pandas: Uncover lesser-known Pandas time series functionalities, such as rolling windows, resampling, and time zone conversions, to perform advanced stats analyses.
- Elegant Handling of Categorical Data: Discover Pandas' categorical data type and its potential to optimize memory usage and enhance performance for categorical variables.
- The Right Way to `fillna`: Deal with your pesky NaNs and <NA> types before they come back to bite you during training.
- NumPy's Fast Fourier Transform (FFT): Unlock the potential of NumPy's FFT for extracting frequency-domain features from time series data, enabling advanced signal processing.
- Speed Up Data Reading with Dask and Vaex: Dive into the world of powerful parallel computing libraries, to efficiently read and process large datasets in parallel.
- Memory Mapping with NumPy: Uncover the lesser-known memory mapping functionality of NumPy to efficiently read and write large arrays from disk.
- Simplify Missing Data Imputation with NumPy Masked Arrays: Learn how to handle missing data using NumPy masked arrays, offering a clean and intuitive approach to imputation.
- Effortless Broadcasting in Pandas: Master the broadcasting capabilities of Pandas to perform element-wise operations on DataFrames with different shapes.
- Inplace application with Pandas' `agg`: Exploit the versatility of Pandas' `agg` function to compute multiple aggregations efficiently in a single step.
- Savepoint: Ditch the CSV: Different ways of saving and loading dataframes into Pandas
- Handling Complex Data Structures with Pandas and NumPy: Combine the strengths of Pandas and NumPy to handle complex data structures and multi-dimensional arrays with ease.
- Optimize Categorical Data Conversion with `pd.factorize`: Discover `pd.factorize`, an efficient method for converting categorical data to numerical representation.
- Unpacking Timestamps in Pandas using `fast.ai` helper: Explore Pandas' feature engineering capabilities, to do efficient time-based analysis to the second.
- Automate Your EDA with pandas-profiling: Explore advanced Pandas and NumPy techniques for efficiently processing and analyzing massive datasets with ease.
*Some of these may be subject to change depending on the duration of the talk and ease of clustering.*
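To give a flavour of the kind of snippet to expect, here is a minimal sketch of three of the items listed above: a nearest-key join with `merge_asof`, integer encoding with `pd.factorize`, and conditional selection with `np.where`. The data, column names, and variable names are made up for illustration and are not taken from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical trades and quotes; all names and values are illustrative only.
trades = pd.DataFrame({
    "time": pd.to_datetime(["2023-01-01 09:00:03", "2023-01-01 09:00:07"]),
    "ticker": ["AAA", "BBB"],
    "qty": [10, 20],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime([
        "2023-01-01 09:00:01",
        "2023-01-01 09:00:02",
        "2023-01-01 09:00:06",
    ]),
    "ticker": ["AAA", "BBB", "BBB"],
    "bid": [100.0, 200.0, 201.0],
})

# merge_asof matches each trade to the most recent quote at or before it,
# per ticker. Both frames must already be sorted by the "on" key.
matched = pd.merge_asof(trades, quotes, on="time", by="ticker", direction="backward")

# pd.factorize maps each distinct value to an integer code,
# numbered in order of first appearance.
codes, uniques = pd.factorize(pd.Series(["red", "blue", "red", "green"]))

# np.where with one argument returns the indices where the condition holds;
# with three arguments it picks element-wise between two alternatives.
values = np.array([3, 8, 1, 9])
high_values = values[np.where(values > 5)[0]]
labels = np.where(values > 5, "high", "low")

print(matched)
print(codes, list(uniques))
print(high_values, labels)
```

Note that `merge_asof` is a nearest-key join rather than a string-similarity match, so "fuzzy" here means tolerant matching on an ordered key such as a timestamp.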
Key Takeaways:
- Acquire 25 lesser-known Pandas & NumPy hacks to optimize data manipulation and analysis workflows.
- Enhance data cleaning, transformation, and summarization techniques with powerful and efficient methods.
- Elevate your data visualization capabilities using Pandas and Plotly for interactive and appealing visualizations.
- Boost your efficiency in handling big data with smart memory management and chunked processing techniques.
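As an illustration of the chunked-processing takeaway, the sketch below aggregates a CSV in fixed-size chunks so that only one chunk is in memory at a time. It assumes a small in-memory CSV as a stand-in for a large file; the `store` and `sales` columns are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV held in memory, standing in for a file too large to load at once.
rows = [f"{'A' if i % 2 == 0 else 'B'},{i}" for i in range(10)]
csv_text = "store,sales\n" + "\n".join(rows)

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of one large DataFrame.
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Reduce each chunk to a small per-store summary before moving on.
    partials.append(chunk.groupby("store")["sales"].sum())

# Combine the per-chunk summaries into the final totals.
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This pattern works for aggregations that can be combined from partial results, such as sums and counts; a median, for example, would need a different approach.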
Outline:
- Introduction to Pandas & NumPy (5 minutes)
  - Overview of key functionalities and applications
  - Popular use cases in data science
- 25 Obscure Pandas & NumPy Hacks (20 minutes)
  - Each hack will be presented with a code snippet and practical use case
  - Hacks will cover data manipulation, optimization, visualization, and more
- Key Takeaways and Closing Remarks (5 minutes)
  - Recap of the 25 hacks and their potential impact on data science workflows
  - Encouragement to apply these hacks creatively in real-world projects
Join me for this data science talk as we unveil 25 lesser-known hacks using Pandas & NumPy that will help you become a more efficient, effective, and resourceful data scientist. Whether you're a seasoned practitioner or a budding data enthusiast, these hacks are meant to take your data science skills to the next level. Let's learn better data manipulation techniques together!
Prerequisites:
Basic understanding of Python and of data manipulation for ML/DL models, plus experience using packages like Pandas and NumPy on data science projects.
Content URLs:
You can find the slides uploaded here
https://github.com/snknitin/exploring-pandas-and-numpy
Please open in slide show view to allow transitions
Speaker Info:
A Jack-of-all-trades with a Master's in Computer Science and a minor in Data Science. I graduated from UMass Amherst in 2018 and am a Staff Data Scientist. I did my undergrad at BITS Pilani. I love solving puzzles/ciphers, deductive reasoning and tackling real-world challenges that require learning and combining different concepts.
My name is a palindrome, and I'm a polymath and a polyglot (first language is Python). I aim to be an expert generalist across all the subfields and domains of AI, and I am perpetually working towards it. Life is a constant struggle between being a member of the community and standing out as an individual. I find a balance between both. I'm not big on introductions because actions speak louder, and I believe people should grind until they no longer have to introduce themselves.
Speaker Links:
I have delivered 2 talks prior to this -
1) RE-WORK Applied AI Summit, San Francisco, Jan 2020.
2) Walmart AI Summit, Bangalore, April 2022.
You can find the links to these here- https://snknitin.github.io/talks/
Some additional redirects for the intrigued:
Website
Blogs
Open Source Contribution
GitHub
LinkedIn