Lib/statistics.py (part 2)

Source:

cpython 3.14 @ ab2d84fe1023/Lib/statistics.py

This annotation covers spread and correlation measures. See lib_statistics_detail for mean, median, mode, fmean, geometric_mean, harmonic_mean, multimode, quantiles.

Map

Lines     Symbol               Role
1-80      variance / stdev     Sample variance and standard deviation
81-180    pvariance / pstdev   Population variance and standard deviation
181-320   _ss                  Helper: sum of squares about the mean
321-460   covariance           Measure of how two variables vary together
461-580   correlation          Pearson correlation coefficient
581-700   linear_regression    Fit a line y = slope*x + intercept

Reading

variance

# CPython: Lib/statistics.py:620 variance
def variance(data, xbar=None):
    """Return the sample variance of data.

    Raises StatisticsError if fewer than 2 data points.
    xbar -- pre-computed mean (pass to avoid recomputing).
    """
    if iter(data) is data:
        data = list(data)
    n = len(data)
    if n < 2:
        raise StatisticsError('variance requires at least two data points')
    T, total, count = _ss(data, xbar)
    return _convert(total / (count - 1), T)  # Bessel's correction: n-1

Sample variance uses n-1 (Bessel's correction) to give an unbiased estimator. Population variance uses n (the full count).
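
A quick sanity check of the n-1 vs n split, using the module's public API (the printed values follow from the arithmetic on this small dataset):

# example (not from statistics.py)
from statistics import variance, pvariance, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5; sum of squared deviations is 32
print(pvariance(data))            # 4.0       = 32 / 8  (divide by n)
print(variance(data))             # 4.571...  = 32 / 7  (divide by n - 1)
print(stdev(data))                # 2.138...  square root of the sample variance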

_ss

# CPython: Lib/statistics.py:420 _ss
def _ss(data, c=None):
    """Return (T, sum_of_squares, count) for data."""
    if c is None:
        c = fmean(data)
    T = type(c)
    # Use compensated summation to reduce floating-point error
    total = 0
    ss = 0
    count = 0
    for x in data:
        d = x - c
        total += d
        ss += d * d
        count += 1
    # Compensate for rounding in (x - c): total would be 0 if c were the exact mean
    ss -= total * total / count
    return T, ss, count

The correction term guards against catastrophic cancellation when the data values are large relative to their spread, so that each d = x - c loses precision. Because c = fmean(data) is itself rounded, the deviations do not sum exactly to zero; subtracting total*total/count removes the resulting bias from the sum of squares.
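
Why the correction works (a sketch, not code from the module; ss_about is an illustrative helper): in exact arithmetic, subtracting total*total/count makes the result equal to the sum of squares about the true mean, whatever center c was used.

# example (not from statistics.py)
def ss_about(data, c):
    # Same shape as _ss: deviations from an arbitrary center c,
    # then subtract the correction term total**2 / count.
    total = sum(x - c for x in data)
    ss = sum((x - c) ** 2 for x in data)
    return ss - total * total / len(data)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(ss_about(data, 5.0))   # 32.0 -- 5.0 is the exact mean
print(ss_about(data, 3.0))   # 32.0 -- off-center c, same sum of squares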

correlation

# CPython: Lib/statistics.py:720 correlation
def correlation(x, y, /, *, method='linear'):
    """Return the Pearson correlation coefficient for x and y.

    method='ranked': Spearman rank correlation.
    """
    n = len(x)
    if n != len(y):
        raise StatisticsError('correlation requires that both inputs have same number of data points')
    if n < 2:
        raise StatisticsError('correlation requires at least two data points')
    xbar = fmean(x)
    ybar = fmean(y)
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - xbar)**2 for xi in x) * sum((yi - ybar)**2 for yi in y))
    if not den:
        raise StatisticsError('at least one of the inputs is constant')
    return num / den

Pearson r is 1.0 for perfect positive linear correlation, -1.0 for perfect negative, 0 for no linear relationship. method='ranked' first converts values to ranks for Spearman's rho.
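
A short illustration with the public API (a sketch; the approximate 0.98 comes from applying the formula above to x and its squares):

# example (not from statistics.py)
from statistics import correlation

x = [1, 2, 3, 4]
print(correlation(x, [2, 4, 6, 8]))        # 1.0   perfect positive linear
print(correlation(x, [8, 6, 4, 2]))        # -1.0  perfect negative linear

# Monotone but nonlinear: Pearson r is below 1, Spearman's rho is exactly 1.
y = [1, 4, 9, 16]
print(correlation(x, y))                   # ~0.98
print(correlation(x, y, method='ranked'))  # 1.0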

linear_regression

# CPython: Lib/statistics.py:780 linear_regression
def linear_regression(x, y, /, *, proportional=False):
    """Return the slope and intercept of a linear regression y = slope*x + intercept.

    proportional=True: force intercept=0, fit y = slope*x.
    """
    n = len(x)
    xbar = fmean(x)
    ybar = fmean(y)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
            sum((xi - xbar)**2 for xi in x)
    intercept = ybar - slope * xbar
    return LinearRegression(slope=slope, intercept=intercept)

Returns a LinearRegression named tuple with slope and intercept. Ordinary least squares minimizes the sum of squared residuals.
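
A minimal usage sketch (the expected values follow from the least-squares formula above):

# example (not from statistics.py)
from statistics import linear_regression

x = [1, 2, 3, 4]
slope, intercept = linear_regression(x, [3, 5, 7, 9])   # data lies exactly on y = 2*x + 1
print(slope, intercept)                                  # 2.0 1.0

# proportional=True forces the fit through the origin: y = slope*x
slope, intercept = linear_regression(x, [2, 4, 6, 8], proportional=True)
print(slope, intercept)                                  # 2.0 0.0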

gopy notes

In CPython, statistics is pure Python, and _ss uses compensated summation. In gopy, variance/stdev are implemented in module/statistics/module.go; correlation and linear_regression use Go's math.Sqrt. Fraction and Decimal support (via _convert) comes from module/fractions and module/decimal.