Data Cleaning

This section walks through the data cleaning process step by step. We start by creating a DataFrame that contains missing (NA or null) values, then apply several cleaning operations, showing each command together with its output.

First, we create a sample DataFrame that includes some missing values:

import pandas as pd

# Sample DataFrame with missing values
data = {
    'old_name': [1, 2, None, 4, 5],
    'B': [10, None, 12, None, 14],
    'C': ['A', 'B', 'C', 'D', 'E'],
    'D': pd.date_range(start = '2023-01-01', periods = 5, freq = 'D'),
    'E': [20, 21, 22, 23, 24]
}
df = pd.DataFrame(data)

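This DataFrame contains one missing value in column 'old_name' and two in column 'B'. pandas stores each None in these numeric columns as NaN and converts both columns to float64, which is why their values appear as 1.0, 10.0, and so on in the outputs below.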

Checking for Missing Values

To count the missing values in each column, we use:

missing_values = df.isnull().sum()

Result:

old_name    1
B           2
C           0
D           0
E           0
dtype: int64
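
The per-column counts show which columns are affected but not which rows. As a minimal sketch (the variable names here are illustrative and not part of the original example), the same boolean mask can be reduced along the other axis to locate the affected rows, or averaged to get the share of missing values per column:

```python
import pandas as pd

# Reduced copy of the sample DataFrame, rebuilt so the sketch runs on its own
df = pd.DataFrame({
    'old_name': [1, 2, None, 4, 5],
    'B': [10, None, 12, None, 14],
    'C': ['A', 'B', 'C', 'D', 'E'],
})

# Boolean mask: True for rows that contain at least one missing value
rows_with_na = df.isna().any(axis=1)

# Index labels of the affected rows
na_row_labels = df.index[rows_with_na].tolist()
print(na_row_labels)  # [1, 2, 3]

# Fraction of missing values in each column (mean of the boolean mask)
na_share = df.isna().mean()
print(na_share)
```

Here isna() is used instead of isnull(); the two methods are aliases and return the same result.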

Filling Missing Values

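We can fill missing values with a fixed value or with a computed value such as the column mean. Here 'old_name' is filled with 0 and 'B' with the mean of its non-missing values, (10 + 12 + 14) / 3 = 12.0: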

filled_df = df.fillna({'old_name': 0, 'B': df['B'].mean()})

Result:

   old_name     B  C          D   E
0       1.0  10.0  A 2023-01-01  20
1       2.0  12.0  B 2023-01-02  21
2       0.0  12.0  C 2023-01-03  22
3       4.0  12.0  D 2023-01-04  23
4       5.0  14.0  E 2023-01-05  24
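
The mean is only one possible fill value. The sketch below shows two common alternatives under the same sample data: forward filling, which copies the last observed value into each gap, and filling with the column median. Which strategy is appropriate depends on the data, so treat this as an illustration rather than a recommendation.

```python
import pandas as pd

# Numeric columns of the sample DataFrame, rebuilt so the sketch runs on its own
df = pd.DataFrame({
    'old_name': [1, 2, None, 4, 5],
    'B': [10, None, 12, None, 14],
})

# Forward fill: carry the last observed value down into each gap
ffilled_df = df.ffill()
print(ffilled_df['B'].tolist())  # [10.0, 10.0, 12.0, 12.0, 14.0]

# Median fill for a single column; the median of 10, 12, 14 is 12
median_filled_b = df['B'].fillna(df['B'].median())
print(median_filled_b.tolist())  # [10.0, 12.0, 12.0, 12.0, 14.0]
```

Forward filling leaves a gap in place when the first row is missing, because there is no earlier value to copy.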

Dropping Missing Values

Alternatively, we can drop every row that contains at least one missing value:

dropped_df = df.dropna(axis = 'index')

Result:

   old_name     B  C          D   E
0       1.0  10.0  A 2023-01-01  20
4       5.0  14.0  E 2023-01-05  24

We can also drop every column that contains at least one missing value:

dropped_df = df.dropna(axis = 'columns')

Result:

   C          D   E
0  A 2023-01-01  20
1  B 2023-01-02  21
2  C 2023-01-03  22
3  D 2023-01-04  23
4  E 2023-01-05  24
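
Dropping on any missing value removed three of the five rows above. When that is too aggressive, dropna accepts parameters that narrow the rule. The following sketch, using illustrative variable names, shows subset, thresh, and how on the same sample data:

```python
import pandas as pd

# Sample DataFrame, rebuilt so the sketch runs on its own
df = pd.DataFrame({
    'old_name': [1, 2, None, 4, 5],
    'B': [10, None, 12, None, 14],
    'C': ['A', 'B', 'C', 'D', 'E'],
    'D': pd.date_range(start='2023-01-01', periods=5, freq='D'),
    'E': [20, 21, 22, 23, 24],
})

# Drop a row only when 'B' is missing; gaps in other columns are tolerated
rows_with_b = df.dropna(subset=['B'])
print(rows_with_b.index.tolist())  # [0, 2, 4]

# Keep rows that have at least 4 non-missing values out of 5
mostly_complete = df.dropna(thresh=4)
print(len(mostly_complete))  # 5

# Drop a row only when every value in it is missing
not_all_empty = df.dropna(how='all')
print(len(not_all_empty))  # 5
```

Every row in the sample is missing at most one value, so the thresh=4 and how='all' variants keep all five rows.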

Renaming Columns

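To rename columns for clarity or standardization, we pass a mapping from old names to new names. The earlier operations returned new DataFrames and left df unchanged, so the result still shows the original NaN values: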

renamed_df = df.rename(columns = {'old_name': 'A'})

Result:

     A     B  C          D   E
0  1.0  10.0  A 2023-01-01  20
1  2.0   NaN  B 2023-01-02  21
2  NaN  12.0  C 2023-01-03  22
3  4.0   NaN  D 2023-01-04  23
4  5.0  14.0  E 2023-01-05  24
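
One detail worth knowing is that rename ignores mapping keys that match no column, so a typo in the old name produces no error and no change. The sketch below (the misspelled key is deliberate and purely illustrative) shows that behavior, the errors='raise' option that makes it visible, and renaming with a function instead of a mapping:

```python
import pandas as pd

# Reduced copy of the sample DataFrame, rebuilt so the sketch runs on its own
df = pd.DataFrame({'old_name': [1, 2, None], 'B': [10, None, 12]})

# A misspelled key matches nothing, so by default the columns stay unchanged
unchanged = df.rename(columns={'old_nmae': 'A'})
print(unchanged.columns.tolist())  # ['old_name', 'B']

# errors='raise' turns the silent no-op into a KeyError
try:
    df.rename(columns={'old_nmae': 'A'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# A callable is applied to every column label
lowered = df.rename(columns=str.lower)
print(lowered.columns.tolist())  # ['old_name', 'b']
```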

Dropping Columns

To remove unnecessary columns:

dropped_columns_df = df.drop(columns = ['E'])

Result:

   old_name     B  C          D
0       1.0  10.0  A 2023-01-01
1       2.0   NaN  B 2023-01-02
2       NaN  12.0  C 2023-01-03
3       4.0   NaN  D 2023-01-04
4       5.0  14.0  E 2023-01-05
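
By default, drop raises a KeyError when a listed column does not exist. As a small sketch, assuming a hypothetical column 'Z' that is not in the sample data, errors='ignore' skips missing labels, and selecting the columns to keep is an alternative that gives the same result:

```python
import pandas as pd

# Reduced copy of the sample DataFrame, rebuilt so the sketch runs on its own
df = pd.DataFrame({
    'old_name': [1, 2, None],
    'B': [10, None, 12],
    'E': [20, 21, 22],
})

# 'Z' is a hypothetical, non-existent column; errors='ignore' skips it
trimmed = df.drop(columns=['E', 'Z'], errors='ignore')
print(trimmed.columns.tolist())  # ['old_name', 'B']

# Same result, expressed as the list of columns to keep
kept = df[['old_name', 'B']]
print(kept.columns.tolist())  # ['old_name', 'B']
```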

Each of these steps demonstrates a fundamental data cleaning operation in pandas and helps prepare a dataset for further analysis.
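
Because each method returns a new DataFrame, the steps can also be combined into a single chain. The following is one possible ordering, not the only sensible one; cleaned_df and b_mean are illustrative names:

```python
import pandas as pd

# Sample DataFrame, rebuilt so the sketch runs on its own
df = pd.DataFrame({
    'old_name': [1, 2, None, 4, 5],
    'B': [10, None, 12, None, 14],
    'C': ['A', 'B', 'C', 'D', 'E'],
    'D': pd.date_range(start='2023-01-01', periods=5, freq='D'),
    'E': [20, 21, 22, 23, 24],
})

# Compute the fill value once, from the original column
b_mean = df['B'].mean()

# Rename, drop, and fill in one chain; each call returns a new DataFrame
cleaned_df = (
    df.rename(columns={'old_name': 'A'})
      .drop(columns=['E'])
      .fillna({'A': 0, 'B': b_mean})
)

print(cleaned_df.columns.tolist())      # ['A', 'B', 'C', 'D']
print(cleaned_df.isna().sum().sum())    # 0
print(df.isna().sum().sum())            # 3, the original is untouched
```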