Advertisement

Why Data Cleaning Is 80% of the Job

Every data science tutorial jumps straight to the exciting part building models, creating visualisations, running predictions. But in reality, 80% of a data project is cleaning and preparing the data. If your data is dirty, your results will be wrong. Period.

When I built my Crypto Analytics Dashboard, the raw API data had missing values, inconsistent date formats, duplicate entries, and columns with mixed data types. Before I could visualise anything, I had to clean it all and Pandas made that possible.

Step 1: First Look at Your Data

Always start by understanding what you're working with before making any changes:

import pandas as pd

df = pd.read_csv("sales_data.csv")

# Quick overview
print(df.shape)          # (rows, columns)
print(df.dtypes)         # Data types per column
print(df.info())         # Non-null counts + types
print(df.describe())     # Statistical summary
print(df.head(10))       # First 10 rows

This 30-second check tells you: how big the dataset is, which columns exist, what types they are, and where the problems might be.

Step 2: Handle Missing Values

Missing data is the most common issue. Here's how I decide what to do:

# Find missing values per column
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # % missing

# Strategy 1: Drop rows (if few are missing)
df = df.dropna(subset=["email", "order_id"])

# Strategy 2: Fill with a sensible default
df["city"].fillna("Unknown", inplace=True)

# Strategy 3: Fill with median (for numeric columns)
df["price"].fillna(df["price"].median(), inplace=True)

# Strategy 4: Forward fill (for time series)
df["stock_price"].fillna(method="ffill", inplace=True)

My rule of thumb: if a column has more than 40% missing values, I question whether it's useful at all. If it's under 5%, dropping those rows is usually safe.

Step 3: Remove Duplicates

Duplicate rows silently corrupt your analysis averages get skewed, counts inflate, and charts lie:

# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# Remove exact duplicates
df = df.drop_duplicates()

# Remove duplicates based on specific columns
df = df.drop_duplicates(subset=["order_id"], keep="first")
Advertisement

Step 4: Fix Data Types

Pandas often guesses types wrong dates come in as strings, numbers as objects, booleans as integers:

# Convert string dates to datetime
df["order_date"] = pd.to_datetime(df["order_date"],
                                   format="%d/%m/%Y",
                                   errors="coerce")

# Convert string numbers to numeric
df["revenue"] = pd.to_numeric(df["revenue"]
                  .str.replace(",", "")
                  .str.replace("$", ""),
                  errors="coerce")

# Convert to category (saves memory for repeated values)
df["status"] = df["status"].astype("category")

The errors="coerce" parameter is essential it turns unparseable values into NaN instead of crashing your script.

Step 5: Standardise Text Data

Text inconsistency is sneaky "Mumbai", "mumbai", "MUMBAI", and " Mumbai " are all different to Pandas:

# Standardise text columns
df["city"] = df["city"].str.strip().str.title()
df["email"] = df["email"].str.strip().str.lower()
df["status"] = df["status"].str.strip().str.upper()

# Replace inconsistent values
df["country"] = df["country"].replace({
    "IN": "India",
    "IND": "India",
    "india": "India",
    "US": "United States",
    "USA": "United States"
})

Step 6: Handle Outliers

Outliers can wreck statistical analysis. I use the IQR method to identify them:

# IQR method for outlier detection
Q1 = df["price"].quantile(0.25)
Q3 = df["price"].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Flag outliers (don't always remove them!)
df["is_outlier"] = (df["price"] < lower) | (df["price"] > upper)
print(f"Outliers found: {df['is_outlier'].sum()}")

My Complete Cleaning Template

Here's the workflow I follow for every new dataset:

The Bottom Line

Data cleaning isn't glamorous, but it's the foundation of every trustworthy analysis. If you skip it, your dashboards will show wrong numbers, your models will make wrong predictions, and your decisions will be based on lies. Master Pandas data cleaning first everything else gets easier after that.

Need a Data Dashboard?

I build custom Streamlit data apps and analytics pipelines.

Get in Touch '
Advertisement