Feature or Preprocessing Step? How to Correctly Set a Baseline in NLP Classification Tasks

Lisa A S (~lisa_a)




Abstract: In the realm of NLP classification, establishing a robust and fair baseline model is crucial for evaluating the performance of sophisticated models such as neural networks and transformers. Despite the growing body of research, many papers suffer from poorly defined baselines and inappropriate preprocessing steps. This talk will address these common pitfalls and provide a structured approach to correctly setting a baseline in NLP classification tasks. By distinguishing between preprocessing steps and features, and demonstrating proper preprocessing techniques for traditional and advanced models, this presentation aims to elevate the standard practices in the field.


  • Identify Common Issues in Baseline Models: Highlight the prevalent issues found in numerous NLP papers, such as sloppy baseline models and the misuse of preprocessing techniques across different model types.
  • Distinguish Between Preprocessing and Feature Engineering: Define what constitutes a preprocessing step versus a feature, and explain their roles in setting a baseline.
  • Establish a Robust Baseline: Provide a step-by-step guide on how to correctly set a baseline for traditional ML models and compare it to more sophisticated models.


  1. Introduction
  • Importance of a robust baseline in NLP classification tasks.
  • Overview of common pitfalls in current research.
  2. Identifying Sloppy Baselines
  • Examples of inadequate preprocessing for traditional ML models.
  • Typical mistakes and their consequences: deleting or generating features during preprocessing, inappropriate tokenization, and failing to leverage scikit-learn's built-in capabilities.
  3. Distinguishing Preprocessing Steps from Features
  • Acceptable preprocessing steps for a baseline model: lower-casing, appropriate tokenization, deleting single-appearance tokens.
  • The necessity of experimenting with preprocessing techniques.
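To make the acceptable preprocessing steps concrete, here is a minimal sketch (with an invented toy corpus) of how lower-casing and dropping single-appearance tokens can be expressed through scikit-learn's CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented purely for illustration.
corpus = [
    "The movie was great",
    "The movie was terrible",
    "Great acting, terrible plot",
]

# lowercase=True lower-cases the input; min_df=2 drops tokens that
# appear in fewer than two documents, a common proxy for removing
# single-appearance tokens.
vectorizer = CountVectorizer(lowercase=True, min_df=2)
X = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.vocabulary_))
# -> ['great', 'movie', 'terrible', 'the', 'was']
```

Note that `min_df` counts document frequency rather than raw token frequency; which of the two better matches "deleting single-appearance tokens" is itself one of the preprocessing choices worth experimenting with.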
  4. Setting a Baseline: A Detailed Approach
  • Data splitting: training, validation, and test sets.
  • Baseline with traditional ML models: using Naive Bayes with basic preprocessing.
  • The importance of hyperparameter tuning for vectorizers.
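The three bullets above can be sketched as a single scikit-learn workflow. The data here is invented toy data, and GridSearchCV's internal cross-validation stands in for an explicit validation split; a real experiment would substitute a labelled corpus and its own split strategy:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data for illustration only.
texts = ["good film", "great plot", "awful film", "terrible plot"] * 10
labels = [1, 1, 0, 0] * 10

# Hold out a test set first; tuning must never see it.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

pipeline = Pipeline([
    ("vec", TfidfVectorizer(lowercase=True)),
    ("clf", MultinomialNB()),
])

# Tune the vectorizer's hyperparameters alongside the classifier's,
# so the baseline is not handicapped by an untuned representation.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "vec__min_df": [1, 2],
    "clf__alpha": [0.1, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```

Wrapping the vectorizer inside the pipeline is the key design choice: it keeps tuning honest by refitting the vocabulary within each cross-validation fold.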
  5. Algorithm Selection and Optimization
  • Pipeline for algorithm selection and feature engineering.
  • Examples of preprocessing and feature engineering techniques: normalization, handling noise, replacing slang, stemming, etc.
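The normalization techniques listed above might be sketched as follows; the slang map and the repeat-squashing rule are invented for illustration, and real work would use a curated lexicon and a proper stemmer:

```python
import re

# Hypothetical slang lexicon, invented for illustration.
SLANG = {"u": "you", "gr8": "great", "pls": "please"}

def normalize(text: str) -> str:
    text = text.lower()
    # Squash noisy character repeats: "soooo" -> "soo".
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Replace slang tokens with their standard forms.
    tokens = [SLANG.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

print(normalize("U r soooo gr8"))  # -> "you r soo great"
```

Each such step is a hypothesis to be tested against the validation set, not a default to be applied blindly.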
  6. Comparing Traditional and Advanced Models
  • Fair comparison to advanced models (neural networks, transformers).
  • Appropriate preprocessing for advanced models: keep it minimal (lower-casing and tokenization only), avoiding lemmatization and stemming.
  7. Tips and Best Practices for Advanced Models
  • Handling unknown words and dummy tokens.
  • Dealing with non-generalizable information (emails, paths, names).
  • Enhancing embeddings with additional data.
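One possible sketch of the first two tips, masking non-generalizable strings and replacing rare words with a dummy token; the `<email>` and `<unk>` placeholders and the helper name are assumptions for illustration, not from the talk:

```python
import re
from collections import Counter

# Crude email pattern, sufficient for illustration.
EMAIL_RE = re.compile(r"\S+@\S+")

def mask_and_unk(texts, min_count=2):
    """Mask emails, then replace rare words with a dummy <unk> token."""
    masked = [EMAIL_RE.sub("<email>", t).lower().split() for t in texts]
    counts = Counter(tok for toks in masked for tok in toks)
    return [[tok if counts[tok] >= min_count else "<unk>" for tok in toks]
            for toks in masked]

docs = ["Contact bob@example.com today", "Contact alice@example.com tomorrow"]
print(mask_and_unk(docs))
# -> [['contact', '<email>', '<unk>'], ['contact', '<email>', '<unk>']]
```

Collapsing emails, paths, and names into shared placeholder tokens prevents the model from memorizing identifiers that will never recur at test time.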
  8. Conclusion
  • Recap of best practices for setting a robust baseline.
  • Encouragement to adopt these practices for more reliable and comparable NLP research outcomes.

Note: the examples shown during the talk will all be drawn from Indian researchers working with Hindi datasets, making the material more relevant and relatable to the audience, given that the majority of NLP work focuses on English.


Attendees will gain a comprehensive understanding of the importance of setting a robust baseline in NLP classification tasks. They will learn practical techniques to improve their preprocessing steps and distinguish between preprocessing and feature engineering. This knowledge will help them conduct more reliable experiments and produce more credible results in their research.


Researchers, data scientists, and practitioners in the field of NLP and machine learning who are involved in model development and evaluation. This talk is especially relevant for those looking to improve their methodological rigor in establishing baselines and preprocessing practices.


No prerequisites are needed apart from a basic understanding of machine learning model evaluation and NLP.

Speaker Info:


With a unique blend of legal and technical expertise, Lisa began her academic journey studying law at the London School of Economics. Transitioning to computer science, she earned a Master’s degree from Imperial College London and a PhD from University College London. During her PhD, she discovered a passion for teaching, earning several awards for her innovative approach. Today, Lisa shares her coding expertise with corporate clients around the world, balancing professional commitments with a vibrant lifestyle that includes traveling between Asia and Europe.

Fluent in Russian, German, and English, Lisa is currently developing multilingual teaching content, enhancing accessibility and understanding across different linguistic backgrounds. She spends half the year in Thailand, indulging in long-distance swimming and Muay Thai, and is also a certified yoga trainer and massage therapist (both trainings completed in India). Combining intellectual pursuits with physical wellness, Lisa embodies a holistic approach to life and learning.

Speaker Links:


https://www.youtube.com/watch?v=Hh944jtZoHE (start at 2:40:00)




Section: Python in Education and Research
Type: Talk
Target Audience: Intermediate
Last Updated: