Handling Missing Data like a Pro!

Missing Data? No Problem!

Introduction

In this beginner-friendly article, we'll explore why handling missing data matters in data science projects. We'll walk through several techniques for dealing with missing values, ensuring your data is clean and ready for analysis. Let's get started!

Importance Of Data Cleaning

Data cleaning is one of the most fundamental steps in a data science project. It is the process of identifying errors, inconsistencies, and inaccuracies in the data. Identifying these anomalies and treating them appropriately is essential to getting the most out of your data.

There are several ways to clean data. We'll be looking at them one by one.

🔃Loading the Data

In this article, we'll be using the planets dataset by Seaborn.

import pandas as pd
import seaborn as sb
import numpy as np
planets = sb.load_dataset('planets')
planets.head(5)

💻Code Output:

🔍Checking Missing Values

Before treating missing data, we first need to check whether any values are actually missing. To do so, run the code below.

planets.isnull().sum()

💻Code Output:

Let's get the same data in a structured format.

def get_missing_info(dataframe):
    # Build a small summary table of missing values per column
    info = pd.DataFrame()
    info['Columns'] = dataframe.columns.values
    info['Missing Values'] = dataframe.isnull().sum().values
    info['Percentage Missing'] = np.round(
        100 * dataframe.isnull().sum().values / len(dataframe), 2)
    return info

get_missing_info(planets)

💻Code Output:

This function gives a clearer view of the missing values in the data.


🛠️Handling Missing Data

Now that we've identified the missing data, we can handle it in one of four ways:

  1. Dropping the columns that contain missing values.

  2. Dropping the rows that contain missing values.

  3. Imputing the missing values with descriptive statistics of the respective column.

  4. Replacing the missing values, for example with a forward or backward fill.

However, some points should be kept in mind while going with any of these methods.

  1. If we drop columns or rows, we lose potentially valuable information. If the data is not missing at random, dropping it will also introduce bias.

  2. Before imputing the data with any of these methods, first check the distribution of the column. If it is skewed or doesn't follow a normal distribution, imputing with the mean can produce biased values; a quick way to check is sketched below.
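
Here is a minimal sketch of such a distribution check using pandas' built-in summary statistics. The 0.5 skewness cutoff is only an illustrative rule of thumb, not a hard rule.

# A quick, informal check of a column's distribution before choosing an imputation strategy
col = planets['orbital_period']

print(col.describe())           # compare the mean with the median (50%) for a first hint of skew
print('Skewness:', col.skew())  # values far from 0 suggest a skewed distribution

# Illustrative rule of thumb only: prefer the median when the column looks skewed
if abs(col.skew()) > 0.5:
    print('Column looks skewed; the median may be a safer imputation value.')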


💡Dropping Columns

In our data, the "mass" column contains more than 50% missing values. So let's drop this feature.

planets.drop(columns = ['mass'], inplace = True)

💻Code Output:

💡
Remember to pass inplace = True to drop the column permanently; without it, drop() returns a new DataFrame and leaves the original unchanged.
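
If you prefer not to hard-code column names, you can combine the get_missing_info() helper defined earlier with a missing-percentage threshold. This is a minimal sketch; the 50% cutoff is only an example and should be chosen per project.

# Drop every column whose share of missing values exceeds a chosen threshold (here: 50%)
missing_info = get_missing_info(planets)
cols_to_drop = missing_info.loc[missing_info['Percentage Missing'] > 50, 'Columns'].tolist()

planets.drop(columns = cols_to_drop, inplace = True)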

✅Pros:

  • Removing unnecessary columns simplifies the dataset, making it easier to work with.

  • When building machine learning models, fewer columns can result in faster model training and prediction times.

⚠️Cons:

  • The most significant drawback of dropping columns is the potential loss of valuable information.

  • If columns are dropped without careful consideration, it can introduce bias into the dataset.


💡Dropping Rows

You can also drop rows containing missing values.

planets.dropna(inplace = True)

💻Code Output:
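
If you don't want to drop every row that has any missing value, dropna() also accepts a subset argument so you can target specific columns. A minimal sketch (the column choice here is just an example):

# Drop only the rows that are missing a value in the 'orbital_period' column
planets.dropna(subset = ['orbital_period'], inplace = True)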

✅Pros:

  • Dropping rows with missing values can result in a cleaner dataset, reducing the need for complex imputation methods.

  • In some cases, removing rows with missing values can lead to more accurate analysis or modeling results.

⚠️Cons:

  • Dropping rows removes entire observations, which may contain valuable insights.

  • If the missing data is not completely random, dropping rows can introduce bias into your analysis or model.


💡Imputing Missing Data

There are many ways to impute the missing data. Let's discuss them one by one.

💡Using Pandas fillna()

  • Using Descriptive Statistics (mean, median and mode)

This is the most common way to fill in missing values. Let's fill the "orbital_period" column, as it has only about 4% missing values.

# Finding the mean of the column
missing_values_impute = planets['orbital_period'].mean()

# Replacing the missing values with the mean (assigning back avoids chained-assignment issues)
planets['orbital_period'] = planets['orbital_period'].fillna(missing_values_impute)
💡
Note that you should first investigate whether the column contains outliers before deciding to replace missing values with the mean, and choose the imputation method accordingly.

💡Pro Tip:

Use median() instead of mean() if your data contains outliers.

Use mode() when you are handling categorical data.
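
Here is a minimal sketch of both tips using the same fillna() pattern as above. The "method" column is used purely to illustrate mode imputation for a categorical feature; in this dataset it doesn't actually contain missing values.

# Median imputation: more robust than the mean when the column contains outliers
median_value = planets['orbital_period'].median()
planets['orbital_period'] = planets['orbital_period'].fillna(median_value)

# Mode imputation for a categorical column (illustration only)
mode_value = planets['method'].mode()[0]
planets['method'] = planets['method'].fillna(mode_value)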

✅Pros:

  • Preserves Central Tendency.

  • Simple and Computationally Efficient.

⚠️Cons:

  • Sensitive to Outliers.

  • May Distort Data Distribution.


💡Replacing Missing Values

  • Using Forward and Backward Fill

To perform a forward or backward fill, use the fillna() method with the method argument (in recent pandas versions, the dedicated ffill() and bfill() methods are preferred).

ffill fills missing values with the previous non-missing value in the column, whereas bfill fills missing values with the next non-missing value in the column.

# Forward Fill (fillna returns a new DataFrame, so assign the result)
planets_ffill = planets.fillna(method = 'ffill')

# Backward Fill
planets_bfill = planets.fillna(method = 'bfill')

✅Pros:

  • Preserves Temporal Order.

  • Applicable to Time-Series Data.

⚠️Cons:

  • May Not Reflect True Values.

  • Inappropriate for Non-Temporal Data.


💡Bonus: Using Sklearn Imputer

The SimpleImputer class from the sklearn.impute module can be used to fill in missing values with a descriptive statistic (mean, median, or most frequent value) or with a constant value.

from sklearn.impute import SimpleImputer

# Creating an object of SimpleImputer
imputer = SimpleImputer(strategy = 'mean')

# Fitting the imputer and transforming the 'distance' column
planets['distance'] = imputer.fit_transform(planets[['distance']])
💡
Note that in this case we chose the "mean" strategy; you can also choose others such as "median". For categorical features, the "most_frequent" strategy replaces missing values with the mode, as sketched below.
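
Here is a minimal sketch of the "most_frequent" strategy applied to a categorical column (again, the "method" column is used purely as an illustration):

# Imputing a categorical column with its most frequent value
cat_imputer = SimpleImputer(strategy = 'most_frequent')
planets['method'] = cat_imputer.fit_transform(planets[['method']]).ravel()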

✅Pros:

  • It can handle both numeric and categorical data, making it versatile for a wide range of datasets.

  • It maintains the shape and structure of your original dataset, ensuring that the imputed values fit appropriately into the existing data.

⚠️Cons:

  • Simple strategies like mean or median imputation ignore relationships between features and can distort the column's distribution, potentially introducing bias.

  • It imputes each column independently, so it cannot handle situations where imputation requires information from multiple columns (one alternative is sketched below).
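
For the multi-column case, scikit-learn also provides KNNImputer in the same sklearn.impute module; it estimates each missing value from the rows that are most similar across the other numeric columns. This is only a brief sketch to point you in that direction, assuming the "mass" column has already been dropped as shown earlier:

from sklearn.impute import KNNImputer

# KNNImputer estimates each missing value from the most similar rows in the other numeric columns
numeric_cols = ['number', 'orbital_period', 'distance', 'year']
knn_imputer = KNNImputer(n_neighbors = 5)
planets[numeric_cols] = knn_imputer.fit_transform(planets[numeric_cols])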


💭Final Thoughts

In data science, data cleaning is similar to cleaning up your room before starting a project. Handling missing values, duplicates, and outliers ensures that you get accurate results.

Whether you're a beginner or an experienced data professional, keeping your data clean is key to making informed decisions. With the right techniques, you can turn messy data into valuable insights.


🧠Check Your Understanding

Question 1: What is a common method for handling missing data in a numerical column?

A) Replacing missing values with the column's mean

B) Deleting the entire row with missing data

C) Ignoring missing values during analysis

D) Replacing missing values with zeros


Question 2: What is the potential drawback of imputing missing data with the mean or median?

A) It increases the dataset's size

B) It preserves the original data distribution

C) It can introduce bias into the data

D) It works well for all types of missing data


Question 3: Which data cleaning technique involves filling missing values with the most frequently occurring value in a column?

A) Mean imputation

B) Median imputation

C) Mode imputation

D) Constant imputation


Remember to drop your answers in the comments below! We're curious to see how many you got right. It's like a mini Python coding quiz.

Check out my most viewed articles on Fake Data Generation and Pandas Techniques.

Connect with me on LinkedIn.
Subscribe to my newsletter and get such hidden gems straight into your inbox! Happy Data Exploring ^_^