Get Ahead of 90% of Data Nerds with These Pandas Techniques
5 Lesser-Known Pandas Tricks
Introduction
Data analysis is a powerful tool in today's world, but to use it to its full potential, we need to master some essential techniques. From binning techniques to query operations, and from reshaping data to making sense of it with aggregation, this article will cover some of them.
Data Filtering With query()
The query() function in Pandas allows us to filter the data
using a query expression. It queries the columns of the data using a boolean expression
.
In this case, we'll be filtering the data based on age >= 25
and fare >= 50
.
import pandas as pd
import seaborn as sns
# Load the Titanic dataset
titanic_data = sns.load_dataset('titanic')
# Showing only age and fare column
titanic_data = titanic_data[['age', 'fare']]
# Filtering based on age and fare
filtered_passengers = titanic_data.query('age >= 25 and fare >= 50')
filtered_passengers.head(5)
Code Output:
In this filtered data, all the records contain the age and fare values in the range given in the condition.
Binning Techniques
Using cut() for Custom Binning
We use cut when we need to segment data values into bins
using the bins parameter. This function is mainly used to create categorical features from continuous features
. For example, we can convert the age column to a group of age ranges.
import pandas as pd
import seaborn as sb
# Loading the titanic dataset using seaborn
titanic = sb.load_dataset('titanic')
# Defining Age Bins
bins = [0, 18, 35, 50, 100]
# Defining Age Labels
labels = ['Child', 'Young Adult', 'Middle Aged Adults', 'Senior']
titanic['age_group_cut'] = pd.cut(titanic['age'], bins = bins, labels = labels)
pd.DataFrame(titanic[['age', 'age_group_cut']]).sample(5)
Code Output:
qcut() for Quantile-Based Binning
In the case of qcut()
we specify the quantiles
(quartiles and this case) using the q
parameter. Labels are assigned automatically based on quantiles. This is useful when we want to divide the data into quantiles
, letting Pandas decide the bin size for us.
import pandas as pd
import seaborn as sb
# Loading the titanic dataset using seaborn
titanic = sb.load_dataset('titanic')
labels = ['Q1', 'Q2', 'Q3', 'Q4']
titanic['age_group_qcut'] = pd.qcut(titanic['age'], q = 4, labels = labels)
pd.DataFrame(titanic[['age', 'age_group_qcut']]).sample(5)
Code Output:
Reshaping Data
Using Pandas .melt() Function
The pandas.melt()
function is used to transform the data from a wide format to a long format
, making it easier to work and analyze the data. It is useful when we want some columns to act as identifiers and unpivot other columns
.
Let's first select a subset of the data:
import pandas as pd
import seaborn as sns
# Load the Titanic dataset
titanic_data = sns.load_dataset('titanic')
# Select a subset of the dataset
subset = titanic_data[['class', 'fare', 'age']].head(5)
print("Original DataFrame:")
pd.DataFrame(subset)
Now let's use .melt()
function to transform the data:
melted_data = pd.melt(subset, id_vars=['class'], value_vars=['fare', 'age'], var_name='variable', value_name='value')
print("\nDataFrame after using pd.melt():")
melted_data.sample(5)
Code Output:
Data Aggregation with .agg()
The .agg()
function is mainly used to perform multiple aggregation operations
based on one or more columns. Let's understand this with an example.
import pandas as pd
import seaborn as sns
# Loading the Titanic dataset
titanic_data = sns.load_dataset('titanic')
# Grouping the data by passenger class and apply multiple aggregation
agg_result = titanic_data.groupby('class').agg({
'age': ['mean', 'median', 'std']}).reset_index()
# Rename the columns for a better understanding
agg_result.columns = ['Passenger_Class', 'Age_Mean', 'Age_Median', 'Age_Std']
# Display the aggregated result
print(agg_result)
Code Output:
One-Hot Encoding with get_dummies()
The pd.get_dummies()
function in Pandas is used for one-hot encoding
categorical variables. It converts categorical (nominal) variables into a set of binary columns (0 or 1) called dummy variables.
Let's first have a look at the class
column from the Titanic data.
import pandas as pd
import seaborn as sns
# Load the Titanic dataset
titanic_data = sns.load_dataset('titanic')
pd.DataFrame(titanic_data['class'])
Let's now one-hot encode the class column.
import pandas as pd
import seaborn as sns
# Load the Titanic dataset
titanic_data = sns.load_dataset('titanic')
# Select the 'class' column as an example
class_dummies = pd.get_dummies(titanic_data['class'], prefix='Class')
# Display the DataFrame with dummy variables and prefixes
class_dummies.head()
Code Output:
drop_first = True
to escape the issue of multicollinearity when one-hot encoding categorical variables.Final Thoughts
In conclusion, we've explored several useful techniques in Pandas that can help you efficiently work with data. Whether you're new to data or already on your journey, learning and practicing these techniques can make your analysis more insightful. Start using them today, and you'll find that data analysis becomes a bit easier and more enjoyable.
Check Your Understanding
What does the agg()
function in Pandas stand for?
a. Aggregate
b. Analysis
c. Argument
d. Accumulate
What is the primary purpose of using drop_first=True
in pd.get_dummies()
?
a. It drops the last category in the data.
b. It drops the first category in the data.
c. It drops all categories except the first one.
d. It drops missing values.
What does the cut()
function in Pandas allow you to do?
a. Split a data frame into multiple smaller data frames
b. Perform statistical tests on data
c. Create custom bins for data binning
d. Merge two DataFrames
Remember to drop your answers in the comments below! We're curious to see how many you got right. It's like a mini Python coding quiz.
Check out my most viewed articles on Fake Data Generation and Replacing Loops.
Connect with me on LinkedIn.
Subscribe to my newsletter and get such hidden gems straight into your inbox! Happy Data Exploring ^_^