Introduction
In this beginner-friendly article, we'll explore the importance of handling missing data in data science projects. We'll cover various techniques for dealing with missing values, ensuring your data is clean and ready for analysis. So, let's get started!
Importance Of Data Cleaning
Data Cleaning is the most fundamental step in a data science project. It is the process of identifying errors, inconsistencies, and inaccuracies in the data. Identifying these anomalies and treating them in the best possible way is essential to get the most out of your data.
There are several ways to clean data. We'll be looking at them one by one.
🔃Loading the Data
In this article, we'll be using the planets dataset that ships with Seaborn.
import pandas as pd
import seaborn as sb
import numpy as np
planets = sb.load_dataset('planets')
planets.head(5)
🔍Checking Missing Values
Before treating the missing data, we first need to check whether any values are actually missing. Run the code below to find out.
planets.isnull().sum()
Let's get the same data in a structured format.
def get_missing_info(dataframe):
    # Build a summary table: one row per column with its missing count and percentage
    info = pd.DataFrame()
    info['Columns'] = dataframe.columns.values
    info['Missing Values'] = dataframe.isnull().sum().values
    info['Percentage Missing'] = np.round(100 * dataframe.isnull().sum().values / len(dataframe), 2)
    return info
get_missing_info(planets)
This function gives a much clearer view of the missing values in the data.
🛠️Handling Missing Data
Now that we've identified the missing data, we can handle it using two broad approaches:
1. Dropping the data: drop the columns, or the rows, that contain missing values.
2. Imputing the data: fill the missing values with descriptive statistics of the respective column, or replace them with nearby records.
However, some points should be kept in mind while going with any of these methods.
If we drop columns or rows, we'll be losing valuable information, and if the missing data is not random, this will introduce bias into the data.
Before imputing the data with any of these methods, it is necessary to first check the distribution of the data, as shown in the sketch below. If the data doesn't follow a normal distribution, mean imputation may produce biased results.
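For example, here's a minimal sketch of such a check on the planets data loaded above: comparing the mean and median, and looking at the skewness, tells you whether the mean is being pulled by extreme values.
col = planets['orbital_period']
# A big gap between these two hints at a skewed distribution
print("Mean:  ", col.mean())
print("Median:", col.median())
# Skewness near 0 means roughly symmetric; large values mean skewed
print("Skew:  ", col.skew())
If the column turns out to be heavily skewed, the median is usually a safer fill value than the mean.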
💡Dropping Columns
In our data, the "mass" column contains more than 50% missing values. So let's drop this feature.
planets.drop(columns = ['mass'], inplace = True)
Note: We pass inplace = True to permanently drop the column from the DataFrame.
✅Pros:
Removing unnecessary columns simplifies the dataset, making it easier to work with.
When building machine learning models, fewer columns can result in faster model training and prediction times.
⚠️Cons:
The most significant drawback of dropping columns is the potential loss of valuable information.
If columns are dropped without careful consideration, it can introduce bias into the dataset.
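If your dataset has many sparse columns, you can generalize this with a small helper that drops every column above a chosen missing-value threshold. This is a hypothetical helper (not a pandas built-in), shown here as a sketch:
def drop_sparse_columns(df, threshold=0.5):
    # isnull().mean() gives the per-column fraction of missing values
    missing_fraction = df.isnull().mean()
    sparse = missing_fraction[missing_fraction > threshold].index
    return df.drop(columns=sparse)

# On the original planets data this would drop 'mass' (>50% missing)
planets = drop_sparse_columns(planets, threshold=0.5)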
💡Dropping Rows
You can also drop rows containing missing values.
planets.dropna(inplace = True)
✅Pros:
Dropping rows with missing values can result in a cleaner dataset, reducing the need for complex imputation methods.
In some cases, removing rows with missing values can lead to more accurate analysis or modeling results.
⚠️Cons:
Dropping rows removes entire observations, which may contain valuable insights.
If the missing data is not completely random, dropping rows can introduce bias into your analysis or model.
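Note that dropna() doesn't have to be all-or-nothing: its subset and thresh arguments let you drop rows more selectively. A quick sketch of the API (the exact rows affected depend on your data):
# Drop a row only if 'orbital_period' is missing
cleaned = planets.dropna(subset=['orbital_period'])

# Keep only rows that have at least 4 non-missing values
cleaned = planets.dropna(thresh=4)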
💡Imputing Missing Data
There are many ways to impute the missing data. Let's discuss them one by one.
💡Using Pandas fillna() - Using Descriptive Statistics (mean, median and mode)
This is the most common method used to fill in missing values. Let's fill the "orbital_period" column, as it has only about 4% missing values.
# Finding the mean of the column
missing_values_impute = planets['orbital_period'].mean()

# Replacing the missing values with the mean (assigning the result avoids the
# chained-assignment warning that fillna(inplace = True) can trigger on a column)
planets['orbital_period'] = planets['orbital_period'].fillna(missing_values_impute)
Note: Check whether the column contains outliers or not before deciding to replace missing values with the mean. You need to choose the best imputation method accordingly.
💡Pro Tip:
Use median() instead of mean() if your data contains outliers.
Use mode() in case you are handling categorical data.
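Following the Pro Tip, the same pattern with median() and mode() looks like this. It's done on a copy here purely for illustration, so later examples still see the original missing values (and note that "method" actually has no missing values in this dataset, so that line only shows the pattern):
df = planets.copy()

# Median for a skewed numeric column (robust to outliers)
df['distance'] = df['distance'].fillna(df['distance'].median())

# mode() returns a Series (there can be ties), so take its first value
df['method'] = df['method'].fillna(df['method'].mode()[0])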
✅Pros:
Preserves Central Tendency.
Simple and Computationally Efficient.
⚠️Cons:
Sensitive to Outliers.
May Distort Data Distribution.
💡Replacing Missing Values - Using Forward and Backward Fill
To perform a forward or backward fill, use the fillna() method with the method argument. ffill fills missing values with the previous non-missing value in the column, whereas bfill fills missing values with the next non-missing value in the column.
# Forward Fill -- returns a new DataFrame, so assign it to keep the result
planets_ffill = planets.fillna(method = 'ffill')   # newer pandas prefers planets.ffill()

# Backward Fill
planets_bfill = planets.fillna(method = 'bfill')   # newer pandas prefers planets.bfill()
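To see how the two fills differ, here's a tiny standalone example:
s = pd.Series([1.0, None, None, 4.0])
print(s.ffill().tolist())   # [1.0, 1.0, 1.0, 4.0] -- gaps filled from above
print(s.bfill().tolist())   # [1.0, 4.0, 4.0, 4.0] -- gaps filled from below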
✅Pros:
Preserves Temporal Order.
Applicable to Time-Series Data.
⚠️Cons:
May Not Reflect True Values.
Inappropriate for Non-Temporal Data.
💡Bonus: Using Sklearn Imputer
The SimpleImputer class from the sklearn.impute module can be used to fill in missing values using a descriptive statistic (mean, median, or most frequent) or a constant value.
from sklearn.impute import SimpleImputer
# Creating an object of SimpleImputer
imputer = SimpleImputer(strategy = 'mean')
# Fitting and Transforming the imputer
planets['distance'] = imputer.fit_transform(planets[['distance']])
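It's worth re-checking the column afterwards to confirm the imputation worked:
# Should now print 0
print(planets['distance'].isnull().sum())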
✅Pros:
It can handle both numeric and categorical data, making it versatile for a wide range of datasets.
It maintains the shape and structure of your original dataset, ensuring that the imputed values fit appropriately into the existing data.
⚠️Cons:
Simple strategies like mean or median imputation make strong assumptions about the data that may not hold for all datasets, potentially introducing bias.
It may not handle situations where imputation requires information from multiple columns.
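As the first pro notes, SimpleImputer also works on categorical data. Here's a sketch using strategy = 'most_frequent' on the "method" column (which has no missing values in this dataset, so it's purely illustrative):
cat_imputer = SimpleImputer(strategy = 'most_frequent')
# fit_transform returns a 2-D array, so ravel() flattens it back into a column
planets['method'] = cat_imputer.fit_transform(planets[['method']]).ravel()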
💭Final Thoughts
In data science, data cleaning is similar to cleaning up your room before starting a project. Handling missing values, duplicates, and outliers ensures that you get accurate results.
Whether you're a beginner or an experienced data professional, keeping your data clean is key to making informed decisions. With the right techniques, you can turn messy data into valuable insights.
🧠Check Your Understanding
Question 1: What is a common method for handling missing data in a numerical column?
A) Replacing missing values with the column's mean
B) Deleting the entire row with missing data
C) Ignoring missing values during analysis
D) Replacing missing values with zeros
Question 2: What is the potential drawback of imputing missing data with the mean or median?
A) It increases the dataset's size
B) It preserves the original data distribution
C) It can introduce bias into the data
D) It works well for all types of missing data
Question 3: Which data cleaning technique involves filling missing values with the most frequently occurring value in a column?
A) Mean imputation
B) Median imputation
C) Mode imputation
D) Constant imputation
Remember to drop your answers in the comments below! We're curious to see how many you got right. It's like a mini Python coding quiz.
Check out my most viewed articles on Fake Data Generation and Pandas Techniques.
Connect with me on LinkedIn.
Subscribe to my newsletter and get such hidden gems straight into your inbox! Happy Data Exploring ^_^