Descriptive Statistics

Descriptive statistics is a branch of statistics that involves the analysis and summary of data to gain insights into its main characteristics. It provides a set of quantitative measures that describe the central tendency, dispersion, and shape of a dataset. These statistics help in understanding the data distribution, identifying patterns, and making data-driven decisions.

There are several key descriptive statistics commonly used to summarize data:

Mean: The mean, or average, is calculated by summing all values in a dataset and dividing by the total number of observations. It represents the central tendency of the data.
Median: The median is the middle value in a dataset when it is arranged in ascending or descending order; with an even number of observations, it is the average of the two middle values. It is less affected by outliers than the mean and provides a robust measure of central tendency.
Mode: The mode is the most frequently occurring value in a dataset. It represents the value or values with the highest frequency.
Variance: Variance measures the spread or dispersion of data points around the mean. It quantifies the average squared difference between each data point and the mean.
Standard Deviation: Standard deviation is the square root of the variance. It measures the typical distance between a data point and the mean, expressed in the same units as the data, indicating the amount of variation in the dataset.
Range: The range is the difference between the maximum and minimum values in a dataset. It provides an indication of the data's spread.
Percentiles: Percentiles divide a dataset into hundredths, representing the relative position of a value in comparison to the entire dataset. For example, the 25th percentile (also known as the first quartile) represents the value below which 25% of the data falls.
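To make the definitions above concrete, here is a minimal sketch that implements them directly with the Python standard library. The function names and the `scores` data are hypothetical, chosen only for illustration; the `ddof` parameter (0 for the population formula, 1 for the sample formula) is an assumption borrowed from NumPy's naming.

```python
import math
from collections import Counter


def mean(values):
    # Sum of all values divided by the number of observations
    return sum(values) / len(values)


def median(values):
    # Middle value of the sorted data; average of the two middle
    # values when the number of observations is even
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    # Every value that occurs with the highest frequency
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]


def variance(values, ddof=0):
    # Average squared difference from the mean
    # (ddof=0: divide by n, ddof=1: divide by n - 1)
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - ddof)


def std_dev(values, ddof=0):
    # Square root of the variance
    return math.sqrt(variance(values, ddof))


def value_range(values):
    # Difference between the maximum and minimum values
    return max(values) - min(values)


scores = [2, 4, 4, 5, 7, 8]      # hypothetical sample data
with_outlier = scores + [100]    # same data plus one extreme value

print("Mean:", mean(scores), "Median:", median(scores))
print("With outlier -> Mean:", mean(with_outlier), "Median:", median(with_outlier))
print("Modes:", modes(scores))
print("Variance:", variance(scores))
print("Standard Deviation:", std_dev(scores))
print("Range:", value_range(scores))
```

Adding the single value 100 pulls the mean from 5.0 to roughly 18.6, while the median only moves from 4.5 to 5, which illustrates why the median is described as robust to outliers.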

Now, let's see some examples of how to calculate these descriptive statistics using Python:

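import statistics

import numpy as np

data = [10, 12, 14, 16, 18, 20]

mean = np.mean(data)
median = np.median(data)
# NumPy has no mode function, so use the standard library instead.
# Every value here occurs exactly once, so multimode returns all of them.
modes = statistics.multimode(data)
# np.var and np.std use the population formulas by default (ddof=0);
# pass ddof=1 for the sample versions.
variance = np.var(data)
std_deviation = np.std(data)
data_range = np.ptp(data)  # "peak to peak": maximum minus minimum
percentile_25 = np.percentile(data, 25)
percentile_75 = np.percentile(data, 75)

print("Mean:", mean)
print("Median:", median)
print("Mode(s):", modes)
print("Variance:", variance)
print("Standard Deviation:", std_deviation)
print("Range:", data_range)
print("25th Percentile:", percentile_25)
print("75th Percentile:", percentile_75)

In the above example, we use the NumPy library to calculate every statistic except the mode. NumPy does not provide a mode function, so the mode comes from Python's built-in statistics module (multimode requires Python 3.8 or later). The mean, median, modes, variance, std_deviation, data_range, percentile_25, and percentile_75 variables hold the respective descriptive statistics for the given dataset. Note that np.var and np.std compute the population variance and standard deviation unless ddof=1 is passed.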
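One detail worth checking when using NumPy for dispersion measures is which divisor is applied. The following sketch, reusing the dataset from the example above, compares the default population formulas (dividing by n) with the sample formulas (dividing by n - 1) selected through the `ddof` argument of `np.var` and `np.std`. The variable names are illustrative only.

```python
import numpy as np

data = [10, 12, 14, 16, 18, 20]  # same values as the example above

# Default ddof=0: population variance, sum of squared deviations / n
population_var = np.var(data)
population_std = np.std(data)

# ddof=1: sample variance, sum of squared deviations / (n - 1)
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

print("Population variance:", population_var)
print("Sample variance:", sample_var)
print("Population standard deviation:", population_std)
print("Sample standard deviation:", sample_std)
```

For these six values the squared deviations sum to 70, so the population variance is 70 / 6 (about 11.67) and the sample variance is 70 / 5 = 14. Which one is appropriate depends on whether the data is treated as the whole population or as a sample from it.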

Descriptive statistics provide a concise summary of data, allowing data scientists to understand its central tendencies, variability, and distribution characteristics. These statistics serve as a foundation for further data analysis and decision-making in various fields, including data science, finance, social sciences, and more.

With the pandas library, it is even easier:

import pandas as pd

# Create a dictionary with sample data
data = {
    'Name': ['John', 'Maria', 'Carlos', 'Anna', 'Luis'],
    'Age': [28, 24, 32, 22, 30],
    'Height (cm)': [175, 162, 180, 158, 172],
    'Weight (kg)': [75, 60, 85, 55, 70]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display the DataFrame
print("DataFrame:")
print(df)

# Get basic descriptive statistics
descriptive_stats = df.describe()

# Display the descriptive statistics
print("\nDescriptive Statistics:")
print(descriptive_stats)

The expected output is:

DataFrame:
     Name  Age  Height (cm)  Weight (kg)
0    John   28          175           75
1   Maria   24          162           60
2  Carlos   32          180           85
3    Anna   22          158           55
4    Luis   30          172           70

Descriptive Statistics:
             Age  Height (cm)  Weight (kg)
count   5.000000     5.000000     5.000000
mean   27.200000   169.400000    69.000000
std     4.147288     9.154234    11.937336
min    22.000000   158.000000    55.000000
25%    24.000000   162.000000    60.000000
50%    28.000000   172.000000    70.000000
75%    30.000000   175.000000    75.000000
max    32.000000   180.000000    85.000000

The code creates a DataFrame with sample data about names, ages, heights, and weights and then uses describe() to obtain basic descriptive statistics such as count, mean, standard deviation, minimum, maximum, and quartiles for the numeric columns in the DataFrame.
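Beyond the default summary, individual statistics can be requested per column, and describe() accepts a percentiles argument. The sketch below reuses the numeric columns from the example above; the choice of the 10th and 90th percentiles is arbitrary and only for illustration. It also shows that the pandas std() method uses the sample formula (ddof=1) by default, unlike NumPy's default.

```python
import pandas as pd

# Numeric columns from the example above
df = pd.DataFrame({
    'Age': [28, 24, 32, 22, 30],
    'Height (cm)': [175, 162, 180, 158, 172],
    'Weight (kg)': [75, 60, 85, 55, 70],
})

# Individual statistics for a single column
age_mean = df['Age'].mean()
age_median = df['Age'].median()
age_std = df['Age'].std()            # sample standard deviation (ddof=1)
age_std_pop = df['Age'].std(ddof=0)  # population standard deviation

# Ask describe() for specific percentiles (values between 0 and 1)
custom_stats = df.describe(percentiles=[0.1, 0.9])

print("Mean age:", age_mean)
print("Median age:", age_median)
print("Sample std of age:", age_std)
print("Population std of age:", age_std_pop)
print(custom_stats)
```

The sample standard deviation of Age printed here is the same value that appears in the std row of the describe() output.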