Lib/statistics.py

Source:

cpython 3.14 @ ab2d84fe1023/Lib/statistics.py

Map

| Lines | Symbol | Purpose |
|---|---|---|
| 1–70 | module header, _sum() | imports, exact rational accumulation helper |
| 71–180 | mean, fmean, geometric_mean | central tendency functions |
| 181–280 | median, median_low, median_high, median_grouped | order statistics |
| 281–350 | mode, multimode | frequency-based statistics |
| 351–470 | variance, pvariance, stdev, pstdev | dispersion, two-pass algorithm |
| 471–560 | covariance, correlation, linear_regression | bivariate statistics |
| 561–780 | NormalDist class | normal distribution object with full protocol |
| 781–900 | NormalDist.pdf, .cdf, .inv_cdf | density and quantile functions |
| 901–980 | NormalDist.overlap, .quantiles, .samples | distribution utilities |
| 981–1100 | internal helpers, __all__ | _erf_inf, _convert, _coerce, exports |

Reading

Exact rational accumulation in _sum and mean

mean delegates to the internal _sum helper, which converts every input to an exact integer ratio and accumulates the total as a fractions.Fraction, so the sum is exact before dividing. _sum returns a (type, total, count) triple in which total is a Fraction. _convert then coerces the quotient back to the dominant numeric type, so callers see int, float, Fraction, or Decimal as appropriate.

# CPython: Lib/statistics.py:151 mean
def mean(data):
    T, total, count = _sum(data)
    if count < 1:
        raise StatisticsError('mean requires at least one data point')
    return _convert(total / count, T)
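
To make the exactness claim concrete, here is a minimal sketch of the same idea, not the CPython implementation. exact_mean is a hypothetical helper that accumulates in Fraction (converting an int or float to Fraction is lossless) and divides only at the end. The plain float loop beside it shows what is lost without the exact path.

```python
from fractions import Fraction


def exact_mean(data):
    """Hypothetical helper: mean accumulated exactly as a Fraction."""
    total = Fraction(0)
    count = 0
    for x in data:
        total += Fraction(x)  # int and float both convert without rounding
        count += 1
    if count < 1:
        raise ValueError("exact_mean requires at least one data point")
    return total / count


values = [1e100, 1.0, -1e100]

# Plain left-to-right float addition: the 1.0 is absorbed by 1e100 and lost.
naive_total = 0.0
for v in values:
    naive_total += v

exact = exact_mean(values)  # the 1.0 survives, so the mean is exactly 1/3
```

The result here stays a Fraction; in the real module, _convert is the step that maps it back to the caller's numeric type.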

fmean skips the exact path and uses math.fsum for a fast but still numerically careful floating-point mean. geometric_mean takes the logarithm of each input, averages the logarithms with math.fsum, then exponentiates; negative inputs are rejected with StatisticsError.
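
The two float-only paths can be sketched as follows. Both function names are hypothetical, the error type is simplified to ValueError, and the sketch rejects zero as well as negative inputs, which is an assumption made for brevity rather than a statement about CPython's behaviour.

```python
import math


def fmean_sketch(data):
    """Hypothetical helper: float mean using the accurate math.fsum."""
    values = [float(x) for x in data]
    if not values:
        raise ValueError("fmean_sketch requires at least one data point")
    return math.fsum(values) / len(values)


def geometric_mean_sketch(data):
    """Hypothetical helper: exp of the mean of the logarithms."""
    values = [float(x) for x in data]
    if not values:
        raise ValueError("geometric_mean_sketch requires at least one data point")
    if any(x <= 0.0 for x in values):
        # Assumption of this sketch: only strictly positive inputs are accepted.
        raise ValueError("geometric_mean_sketch requires positive inputs")
    return math.exp(math.fsum(math.log(x) for x in values) / len(values))
```

Working in log space keeps the intermediate values small, so a long product of large numbers does not overflow before the root is taken.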

Median variants and mode

median sorts a copy of the data and picks the middle element (or averages the two middle elements for even-length sequences). median_low and median_high always return an actual data value rather than an interpolated one. median_grouped treats data as a histogram and interpolates within the bin that contains the median, using the classic L + (n/2 - cf) / f * interval formula.

# CPython: Lib/statistics.py:247 median_grouped
def median_grouped(data, interval=1):
    data = sorted(data)
    n = len(data)
    if n == 0:
        raise StatisticsError("no median for empty data")
    x = data[n // 2]
    L = x - interval / 2
    cf = data.index(x)
    f = data.count(x)
    return L + interval * (n / 2 - cf) / f
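
A worked example helps check the formula. The sketch below is a hypothetical standalone version that locates the run of values equal to the midpoint with bisect instead of index and count; on a sorted list the two approaches give the same cf and f. For [1, 3, 3, 5, 7] the midpoint value is 3, so L = 2.5, cf = 1, f = 2, and the result is 2.5 + (2.5 - 1) / 2 = 3.25.

```python
from bisect import bisect_left, bisect_right


def median_grouped_sketch(data, interval=1.0):
    """Hypothetical helper: grouped median via L + interval * (n/2 - cf) / f."""
    data = sorted(data)
    n = len(data)
    if n == 0:
        raise ValueError("no median for empty data")
    x = data[n // 2]                  # midpoint of the median class
    cf = bisect_left(data, x)         # count of values below the median class
    f = bisect_right(data, x) - cf    # count of values inside the median class
    L = x - interval / 2              # lower boundary of the median class
    return L + interval * (n / 2 - cf) / f
```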

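mode returns the single most common value, raising StatisticsError if the data is empty; when several values tie, it returns the first one encountered. multimode returns a list of all values that share the maximum frequency, in order of first appearance, and returns an empty list for empty data.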
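
The relationship between the two functions can be sketched with collections.Counter, which preserves first-seen order. The names mode_sketch and multimode_sketch are hypothetical and the error type is simplified to ValueError.

```python
from collections import Counter


def multimode_sketch(data):
    """Hypothetical helper: every value that reaches the maximum frequency."""
    counts = Counter(data)  # keys keep the order in which values first appear
    if not counts:
        return []
    maxcount = max(counts.values())
    return [value for value, count in counts.items() if count == maxcount]


def mode_sketch(data):
    """Hypothetical helper: the first of the most common values."""
    modes = multimode_sketch(data)
    if not modes:
        raise ValueError("no mode for empty data")
    return modes[0]
```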

Two-pass variance and stdev

variance computes the mean in a first pass, then accumulates squared deviations in a second pass. This avoids catastrophic cancellation in the naive one-pass formula. When the caller already has the mean it can pass it as xbar to skip the internal mean calculation.

# CPython: Lib/statistics.py:367 variance
def variance(data, xbar=None):
    if iter(data) is data:
        data = list(data)
    n = len(data)
    if n < 2:
        raise StatisticsError('variance requires at least two data points')
    if xbar is None:
        T, xbar, _ = _sum(data)
        xbar = _convert(xbar / n, T)
    T, total, count = _sum((x - xbar) ** 2 for x in data)
    return _convert(total / (count - 1), T)
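
The cancellation problem is easiest to see with data that has a large offset and a small spread. The float-only sketch below compares a two-pass computation with the textbook one-pass shortcut (sum of squares minus n times the squared mean). Both function names are hypothetical, and unlike the excerpt above the sketch uses math.fsum rather than exact Fraction accumulation.

```python
import math


def variance_two_pass(data):
    """Hypothetical helper: sample variance from deviations about the mean."""
    data = list(data)
    n = len(data)
    if n < 2:
        raise ValueError("variance requires at least two data points")
    xbar = math.fsum(data) / n                                   # first pass
    return math.fsum((x - xbar) ** 2 for x in data) / (n - 1)    # second pass


def variance_one_pass(data):
    """Hypothetical helper: one-pass shortcut, prone to cancellation."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    if n < 2:
        raise ValueError("variance requires at least two data points")
    mean = total / n
    # Subtracts two nearly equal, very large numbers when the offset is big.
    return (total_sq - n * mean * mean) / (n - 1)


data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
stable = variance_two_pass(data)    # deviations are -6, -3, 3, 6
shortcut = variance_one_pass(data)  # squares near 1e18 exceed float precision
```

The true sample variance of this data is 30. The two-pass version works with small deviations, while the shortcut depends on the low-order bits of values near 1e18, which a double cannot represent.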

pvariance divides by n instead of n - 1. stdev and pstdev are thin wrappers that apply math.sqrt or the type-appropriate square-root to the variance result.

NormalDist: the distribution object

NormalDist stores _mu (mean) and _sigma (standard deviation) and exposes the full suite of normal distribution operations. Arithmetic operators are defined so that NormalDist(mu1, sigma1) + NormalDist(mu2, sigma2) produces a new NormalDist with mean mu1 + mu2 and standard deviation hypot(sigma1, sigma2). For a sum of independent normals the means add and the variances add, so the standard deviations combine in quadrature rather than summing directly.
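
A quick check with the public NormalDist API shows how the parameters combine under addition: the means add, and the standard deviations combine as the square root of the summed variances. The particular values are chosen only so the arithmetic is easy to follow.

```python
import math
from statistics import NormalDist

a = NormalDist(mu=3.0, sigma=4.0)
b = NormalDist(mu=1.0, sigma=3.0)
total = a + b  # distribution of the sum of two independent normals

expected_mean = a.mean + b.mean                  # 3.0 + 1.0
expected_sigma = math.hypot(a.stdev, b.stdev)    # sqrt(4**2 + 3**2)
```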

# CPython: Lib/statistics.py:743 NormalDist.cdf
def cdf(self, x):
    "Cumulative distribution function. P(X <= x)"
    return 0.5 * (1.0 + erf((x - self._mu) / (self._sigma * _SQRT2)))
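
The same formula can be exercised as a free function. normal_cdf is a hypothetical name, and the guard against a non-positive sigma is an assumption of this sketch rather than part of the excerpt above.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Hypothetical helper: P(X <= x) for a normal distribution via math.erf."""
    if sigma <= 0.0:
        # Assumption of this sketch: reject degenerate distributions up front.
        raise ValueError("sigma must be positive")
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


half = normal_cdf(0.0)           # the mean splits the distribution in half
upper = normal_cdf(1.96)         # close to 0.975 for the standard normal
shifted = normal_cdf(110.0, mu=100.0, sigma=15.0)
```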

inv_cdf (the quantile function) uses one rational approximation for the central region and separate approximations for the tails, following Wichura's Algorithm AS241 (PPND16), which is accurate to about sixteen significant digits. pdf is the standard Gaussian density formula expressed in terms of math.exp.
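
The density formula and the inverse relationship between cdf and inv_cdf can be checked with a short sketch. normal_pdf is a hypothetical helper; the round trip uses the real statistics.NormalDist methods.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Hypothetical helper: Gaussian density written with math.exp."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


std = NormalDist()
peak = normal_pdf(0.0)                     # 1 / sqrt(2 * pi)
round_trip = std.inv_cdf(std.cdf(1.25))    # should recover 1.25 closely
q975 = std.inv_cdf(0.975)                  # the familiar two-sided 95% point
```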

gopy notes

Status: not yet ported.

Planned package path: module/statistics/.

Key porting notes:

  • _sum relies on fractions.Fraction for exact integer paths. module/fractions/ must be in place before mean over integer lists returns the correct type.
  • NormalDist.cdf and pdf depend on math.erf and math.exp. Both are available in gopy's math module port.
  • NormalDist.inv_cdf uses a multi-branch rational approximation coded inline. Port it as a direct translation with the same coefficient literals, not a library call, to preserve the exact numerical behaviour.
  • median_grouped uses list.index and list.count, which are O(n) scans. The semantics are correct as long as floating-point equality works; document this as a known limitation matching CPython.
  • covariance and linear_regression are straightforward numeric loops with no exotic dependencies and can be ported independently of the exact-arithmetic path.
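
As a reference point for the last note, the bivariate functions reduce to a few sums over paired data. The Python sketch below is an assumption about one reasonable float-only formulation, with hypothetical names and ValueError in place of StatisticsError; a port could use something like it to generate expected values for tests.

```python
import math


def _paired_sums(x, y):
    """Hypothetical helper: means, Sxy and Sxx for equal-length sequences."""
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same number of data points")
    if n < 2:
        raise ValueError("at least two data points are required")
    xbar = math.fsum(x) / n
    ybar = math.fsum(y) / n
    sxy = math.fsum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = math.fsum((xi - xbar) ** 2 for xi in x)
    return n, xbar, ybar, sxy, sxx


def covariance_sketch(x, y):
    """Hypothetical helper: sample covariance, dividing by n - 1."""
    n, _, _, sxy, _ = _paired_sums(x, y)
    return sxy / (n - 1)


def linear_regression_sketch(x, y):
    """Hypothetical helper: least-squares (slope, intercept) for y on x."""
    _, xbar, ybar, sxy, sxx = _paired_sums(x, y)
    if sxx == 0.0:
        raise ValueError("x is constant")
    slope = sxy / sxx
    return slope, ybar - slope * xbar


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]          # exactly y = 2x + 1
slope, intercept = linear_regression_sketch(xs, ys)
```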