Statistical Functions

35 min

NumPy provides vectorized statistical functions for computing descriptive statistics on arrays. Because the work runs in compiled code rather than in Python loops, these functions are usually much faster than equivalent hand-written loops, which makes them a standard building block for data analysis, scientific computing, and machine learning.

Common functions include np.mean() (average), np.median() (middle value), np.std() (standard deviation), np.var() (variance), np.min()/np.max() (minimum/maximum), and np.percentile() (percentiles). Together they describe the center, spread, and range of a dataset.
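As a rough illustration of how these functions describe the same data differently, the sketch below uses a made-up array (the values are an assumption, not course data) containing one outlier. The mean is pulled toward the outlier, while the median and the interquartile range barely react to it.

```python
import numpy as np

# Hypothetical sample: nine typical values plus one outlier (100).
values = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 100])

mean = np.mean(values)                     # sensitive to the outlier
median = np.median(values)                 # middle of the sorted values
q1, q3 = np.percentile(values, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # interquartile range

print("Mean:", mean)
print("Median:", median)
print("IQR:", iqr)
print("Min/Max:", np.min(values), np.max(values))
```

Comparing the mean with the median is a quick way to spot skewed data or outliers before choosing which statistic to report.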

Statistical functions operate on the whole array by default and can be applied along a specific axis with the axis parameter. For a 2D array, axis=0 collapses the rows and returns one value per column, while axis=1 collapses the columns and returns one value per row. With no axis given (axis=None), the array is flattened and a single value is returned.
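A minimal sketch of the axis parameter on a small 3x4 array. The array name scores is a placeholder chosen for this example; the shape comments show which dimension each call removes.

```python
import numpy as np

# Hypothetical 3x4 array: 3 rows, 4 columns.
scores = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

overall = np.mean(scores)            # no axis: one value for the whole array
col_means = np.mean(scores, axis=0)  # collapses rows    -> shape (4,)
row_means = np.mean(scores, axis=1)  # collapses columns -> shape (3,)

# keepdims=True keeps the reduced axis with length 1, which is handy
# when the result must broadcast against the original array.
row_means_2d = np.mean(scores, axis=1, keepdims=True)  # shape (3, 1)
centered = scores - row_means_2d     # each row now has mean 0

print("Overall mean:", overall)
print("Column means:", col_means)    # [5. 6. 7. 8.]
print("Row means:", row_means)       # [ 2.5  6.5 10.5]
```

One way to remember the rule: the axis you pass is the axis that disappears from the result's shape.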

NumPy also provides cumulative functions (np.cumsum(), np.cumprod()) that compute running totals and running products, and np.corrcoef() and np.cov() for measuring relationships between variables. Cumulative functions are useful for time series analysis, such as running totals. np.corrcoef() returns a matrix of Pearson correlation coefficients, and np.cov() returns a covariance matrix.
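The sketch below shows cumulative and relationship functions on small made-up arrays (daily_sales, x, and y are illustrative names and values, not course data). Both np.corrcoef() and np.cov() return 2x2 matrices for two input variables, so the off-diagonal entry [0, 1] holds the value relating x to y.

```python
import numpy as np

# Hypothetical daily figures for a running total.
daily_sales = np.array([3, 1, 4, 1, 5])
running_total = np.cumsum(daily_sales)              # [ 3  4  8  9 14]
running_product = np.cumprod(np.array([1, 2, 3, 4]))  # [ 1  2  6 24]

# y is an exact linear function of x, so the correlation should be 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

r = np.corrcoef(x, y)[0, 1]    # Pearson correlation between x and y
cov_xy = np.cov(x, y)[0, 1]    # covariance; np.cov divides by N - 1 by default

print("Running total:", running_total)
print("Running product:", running_product)
print("Correlation:", r)
print("Covariance:", cov_xy)
```

Correlation is unitless and bounded between -1 and 1, while covariance depends on the scale of the inputs, so correlation is usually the easier of the two to interpret.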

Statistical functions differ in how they handle NaN values. The standard functions (np.mean(), np.std(), and so on) propagate NaN, so a single NaN in the input makes the result NaN. The nan-prefixed variants (np.nanmean(), np.nanstd(), np.nanmedian()) ignore NaN values instead. Use them when your data contains missing values, which is common in real-world data.
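To make the difference concrete, here is a small sketch with one missing reading. The readings array is an assumed example; counting the NaN values first is optional, but it shows how much data the nan-prefixed functions are skipping.

```python
import numpy as np

# Hypothetical sensor readings with one missing value.
readings = np.array([2.0, np.nan, 4.0, 6.0])

plain_mean = np.mean(readings)       # NaN propagates into the result
safe_mean = np.nanmean(readings)     # mean of [2.0, 4.0, 6.0]
safe_median = np.nanmedian(readings) # median of [2.0, 4.0, 6.0]
n_missing = np.isnan(readings).sum() # how many values were skipped

print("np.mean:", plain_mean)
print("np.nanmean:", safe_mean)
print("np.nanmedian:", safe_median)
print("Missing values:", n_missing)
```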

Best practices include choosing the statistic that fits the question, being explicit about the axis parameter for multi-dimensional arrays, handling NaN values deliberately, using the median and percentiles when outliers are present, and knowing the difference between population and sample statistics. np.std() and np.var() default to ddof=0 (population); pass ddof=1 for the sample estimate.
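The population versus sample distinction can be sketched as follows, using an assumed eight-value array. With ddof=0 the sum of squared deviations is divided by N; with ddof=1 it is divided by N - 1, which gives a slightly larger result.

```python
import numpy as np

# Hypothetical data: mean is 5, squared deviations sum to 32.
sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_var = np.var(sample)             # ddof=0 (default): 32 / 8
sample_var = np.var(sample, ddof=1)  # ddof=1:           32 / 7
pop_std = np.std(sample)             # square root of pop_var
sample_std = np.std(sample, ddof=1)  # square root of sample_var

print("Population variance:", pop_var)
print("Sample variance:", sample_var)
print("Population std:", pop_std)
print("Sample std:", sample_std)
```

As a rule of thumb, use ddof=1 when the array is a sample drawn from a larger population and ddof=0 when it contains the entire population of interest.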

Key Concepts

NumPy provides comprehensive statistical functions for data analysis.
Common functions: mean, median, std, var, min, max, percentiles.
Statistical functions can be applied along specific axes.
Cumulative and correlation functions extend statistical capabilities.
NaN-handling functions are available for missing data.

Learning Objectives

Master

Computing basic statistics (mean, median, std, var)
Using axis parameter for multi-dimensional statistics
Computing percentiles and cumulative statistics
Handling NaN values in statistical computations

Develop

Understanding statistical analysis concepts
Designing effective data analysis workflows
Appreciating NumPy's statistical capabilities

Tips

Use axis parameter to compute statistics along specific dimensions.
Use nan-prefixed functions (np.nanmean()) when data contains NaN values.
Use percentiles for robust statistics (less sensitive to outliers).
Set ddof=1 in np.std() and np.var() for sample statistics; the default ddof=0 gives population statistics.

Common Pitfalls

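Omitting the axis parameter on a 2D array and getting a single value for the whole array instead of per-row or per-column results.
Using np.mean() or np.std() on data with NaN values and getting NaN as the result.
Confusing axis=0 (one result per column) with axis=1 (one result per row).
Forgetting that np.std() and np.var() default to ddof=0 (population), while np.cov() defaults to the sample estimate (dividing by N - 1).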

Summary

NumPy provides comprehensive statistical functions.
Statistical functions work on whole arrays or along an axis chosen with the axis parameter.
Cumulative and correlation functions extend capabilities.
NaN-handling functions are available for missing data.
The ddof parameter selects population (ddof=0) or sample (ddof=1) statistics.

Exercise

Use NumPy statistical functions for data analysis.

import numpy as np

# Create sample data
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
data_2d = np.array([[1, 2, 3, 4],
                    [5, 6, 7, 8],
                    [9, 10, 11, 12]])

print("Data:", data)
print("2D Data:")
print(data_2d)

# Basic statistics (np.std and np.var default to ddof=0, the population form)
print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Standard deviation:", np.std(data))
print("Variance:", np.var(data))
print("Minimum:", np.min(data))
print("Maximum:", np.max(data))

# Percentiles
print("25th percentile:", np.percentile(data, 25))
print("75th percentile:", np.percentile(data, 75))

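# Statistics along axes (for 2D arrays)
# axis=0 collapses rows (one value per column); axis=1 collapses columns (one value per row)
print("Mean of each column (axis=0):", np.mean(data_2d, axis=0))
print("Mean of each row (axis=1):", np.mean(data_2d, axis=1))
print("Standard deviation of each column (axis=0):", np.std(data_2d, axis=0))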

# Cumulative statistics
print("Cumulative sum:", np.cumsum(data))
print("Cumulative product:", np.cumprod(data))

# Correlation and covariance (both return 2x2 matrices; np.cov divides by N - 1 by default)
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

correlation = np.corrcoef(x, y)
print("Correlation matrix:")
print(correlation)

covariance = np.cov(x, y)
print("Covariance matrix:")
print(covariance)
