Data Reshaping#

Data reshaping is a crucial aspect of data preparation that involves transforming data between wide format (with more columns) and long format (with more rows), depending on the needs of your analysis. This chapter demonstrates how to reshape data from wide to long formats and vice versa using Pandas.

Wide to Long Format#

The wide_to_long function in Pandas is a powerful tool for transforming data from wide format to long format, which is often more amenable to analysis in Pandas:

import pandas as pd

# Sample DataFrame in wide format
data = {
    'id': [1, 2],
    'A_2020': [100, 200],
    'A_2021': [150, 250],
    'B_2020': [300, 400],
    'B_2021': [350, 450]
}
df = pd.DataFrame(data)

# Transforming from wide to long format
long_df = pd.wide_to_long(df, stubnames = ['A', 'B'], sep = '_', i = 'id', j = 'year')
print(long_df)

Result:

           A    B
id year
1  2020  100  300
   2021  150  350
2  2020  200  400
   2021  250  450

This output represents a DataFrame in long format where each row corresponds to a single year for each variable (A and B) and each id.

Long to Wide Format#

Converting data from long to wide format involves creating a pivot table, which can simplify certain types of data analysis by displaying data with one variable per column and combinations of other variables per row:

# Assuming long_df is the DataFrame in long format from the previous example
# We will use a slight modification for clarity
long_data = {
    'id': [1, 1, 2, 2],
    'year': [2020, 2021, 2020, 2021],
    'A': [100, 150, 200, 250],
    'B': [300, 350, 400, 450]
}
long_df = pd.DataFrame(long_data)

# Transforming from long to wide format
wide_df = long_df.pivot(index = 'id', columns = 'year')
print(wide_df)

Result:

       A         B
year 2020 2021 2020 2021
id
1    100  150  300  350
2    200  250  400  450

This result demonstrates a DataFrame in wide format where each id has associated values of A and B for each year spread across multiple columns.

Reshaping data effectively allows for easier analysis, particularly when dealing with panel data or time series that require operations across different dimensions.