Statistical Analysis#

Statistical analysis is a key component of data analysis, helping to understand trends, relationships, and distributions in data. Pandas offers a range of functions for performing statistical analyses, which can be incredibly insightful when exploring your data. This chapter will cover the basics, including correlation, covariance, and various ways of summarizing data distributions.

Correlation Matrix#

A correlation matrix displays the correlation coefficients between variables. Each cell in the table shows the correlation between two variables. Here’s how to generate a correlation matrix:

import pandas as pd

# Sample DataFrame
data = {'age': [25, 30, 35, 40, 45],
        'salary': [50000, 44000, 58000, 62000, 66000]}
df = pd.DataFrame(data)

# Creating a correlation matrix
corr_matrix = df.corr()
print(corr_matrix)

Result:

             age    salary
age     1.000000  0.883883
salary  0.883883  1.000000

Covariance Matrix#

The covariance matrix is similar to a correlation matrix but shows the covariance between variables. Here’s how to generate a covariance matrix:

# Creating a covariance matrix
cov_matrix = df.cov()
print(cov_matrix)

Result:

           age       salary
age       62.5       6250.0
salary  6250.0   80000000.0

Value Counts#

This function is used to count the number of unique entries in a column, which can be particularly useful for categorical data:

# Sample DataFrame
data = {'department': ['HR', 'Finance', 'IT', 'HR', 'Finance']}
df = pd.DataFrame(data)

# Using value counts
value_counts = df['department'].value_counts()
print(value_counts)

Result:

Finance    2
HR         2
IT         1

Unique Values in Column#

To find unique values in a column, use the unique function. This can help identify the diversity of entries in a column:

# Getting unique values from the column
unique_values = df['department'].unique()
print(unique_values)

Result:

['HR' 'Finance' 'IT']

Number of Unique Values#

If you need to know how many unique values are in a column, use nunique:

# Counting unique values
num_unique_values = df['department'].nunique()
print(num_unique_values)

Result:

3

These tools provide a fundamental insight into the statistical characteristics of your data, essential for both preliminary data exploration and advanced analyses.