statistics.py: Descriptive Statistics

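statistics.py provides descriptive statistics for int, float, Fraction, and Decimal inputs. The exact functions, such as mean and variance, accumulate in exact rational arithmetic and convert the result to a type derived from the inputs. The fast paths, such as fmean, work in float throughout. The module is almost entirely pure Python: its only C accelerator is the optional _statistics extension, which supplies a faster kernel for NormalDist.inv_cdf.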

Map

| Lines | Symbol | Purpose |
|---|---|---|
| 1–60 | imports, __all__ | math, numbers, fractions, decimal, itertools |
| 61–130 | _sum, _coerce, _convert | Type-unification helpers for mixed numeric inputs |
| 131–220 | mean, fmean | Arithmetic mean via Fraction sum and fast float mean |
| 221–290 | geometric_mean, harmonic_mean | Log-space and reciprocal-sum variants |
| 291–420 | median, median_low, median_high, median_grouped | Sort-based and interpolated medians |
| 421–500 | mode, multimode | Counter-based most-frequent-value search |
| 501–620 | _ss, variance, pvariance, stdev, pstdev | Two-pass sum-of-squares and standard deviation |
| 621–720 | covariance, correlation | Pairwise two-pass algorithms, Pearson r |
| 721–820 | linear_regression | Slope and intercept via covariance and variance |
| 821–1000 | NormalDist | Gaussian distribution object, pdf, cdf, inv_cdf |
| 1001–1100 | quantiles | Inclusive and exclusive interpolation methods |
| 1101–1200 | StatisticsError, helper guards | Input validation and empty-sequence errors |

Reading

mean and fmean accumulation strategies

mean converts each element to an exact rational before summing, then converts the exact total back to the dominant input type via _convert. This avoids accumulated rounding error, so an integer sequence whose mean is a whole number yields an exact int. The result type usually matches the input type, with one exception: an int sequence whose mean is not a whole number yields a float. fmean instead converts every element to float and sums with math.fsum, which is faster for large float sequences but gives up the exact-integer guarantee.
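A minimal sketch of the two strategies, assuming plain int or float inputs. Both exact_mean and float_mean are hypothetical helpers written for illustration; they are not CPython's _sum, which also tracks the result type and groups partial sums by denominator.

```python
import math
import statistics
from fractions import Fraction


def exact_mean(data):
    """Hypothetical helper: average via exact rational accumulation."""
    total = Fraction(0)
    count = 0
    for x in data:
        total += Fraction(x)  # Fraction(int) and Fraction(float) are both exact
        count += 1
    if count == 0:
        raise statistics.StatisticsError("exact_mean requires at least one data point")
    return total / count


def float_mean(data):
    """Hypothetical helper: convert to float first, then sum with math.fsum."""
    values = [float(x) for x in data]
    if not values:
        raise statistics.StatisticsError("float_mean requires at least one data point")
    return math.fsum(values) / len(values)


# 2**53 + 1 is the smallest positive int that a float cannot represent.
big = [2**53 + 1, 2**53 + 1]
exact = exact_mean(big)   # Fraction(9007199254740993, 1)
approx = float_mean(big)  # 9007199254740992.0, off by one
```

The difference only shows once values exceed float precision; for typical float data the two agree to within rounding, and the float path does far less work per element.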

Two-pass variance algorithm

_ss first computes the mean, then makes a second pass over the data to accumulate (x - mean)**2. This corrected two-pass form is used instead of the one-pass E[x^2] - E[x]^2 formula, which suffers catastrophic cancellation when the values are nearly equal. variance and stdev call _ss and divide by n - 1 (Bessel's correction). pvariance and pstdev divide by n for the population case.
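The contrast between the two formulas can be sketched in plain floats. The helper names below are hypothetical, and this is a simplified illustration of the two-pass idea rather than a copy of _ss, which works in exact rationals.

```python
import math
import statistics


def two_pass_ss(data):
    """Hypothetical helper: sum of squared deviations from the mean."""
    n = len(data)
    mean = math.fsum(data) / n                       # first pass
    return math.fsum((x - mean) ** 2 for x in data)  # second pass


def sample_variance(data):
    """Divide by n - 1 (Bessel's correction)."""
    if len(data) < 2:
        raise statistics.StatisticsError("need at least two data points")
    return two_pass_ss(data) / (len(data) - 1)


def population_variance(data):
    """Divide by n for the population case."""
    if len(data) < 1:
        raise statistics.StatisticsError("need at least one data point")
    return two_pass_ss(data) / len(data)


def one_pass_variance(data):
    """Unstable textbook form E[x^2] - E[x]^2, shown only for contrast."""
    n = len(data)
    sx = math.fsum(data)
    sxx = math.fsum(x * x for x in data)
    return (sxx - sx * sx / n) / (n - 1)


# Large offset, small spread: the squares are near 1e18, where adjacent
# floats are more than 100 apart, so the one-pass difference can be
# swamped by rounding error.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
stable = sample_variance(data)        # 30.0
population = population_variance(data)  # 22.5
unstable = one_pass_variance(data)    # not reliable for this input
```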

NormalDist and quantiles

NormalDist stores mu and sigma and provides pdf, cdf (via math.erfc), and inv_cdf (a rational approximation to the probit function). quantiles supports two methods. For len(data) sorted data points, inclusive uses len(data) - 1 intervals anchored at the data endpoints, and exclusive (the default) uses len(data) + 1 intervals with notional points beyond the data range. Note that the n parameter of quantiles is the number of groups to cut into, not the number of data points. The method changes how each cut point is mapped to a position in the sorted data; both methods then interpolate linearly between the two neighbouring values.
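A sketch of both pieces, assuming numeric data with at least two points. normal_cdf and cut_points are hypothetical names; the index arithmetic follows the interval counts described above and can be checked against statistics.quantiles.

```python
import math
import statistics


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Gaussian CDF expressed through the complementary error function."""
    return 0.5 * math.erfc((mu - x) / (sigma * math.sqrt(2.0)))


def cut_points(data, n=4, method="exclusive"):
    """Hypothetical sketch of the two interpolation schemes."""
    data = sorted(data)
    ld = len(data)
    if n < 1 or ld < 2:
        raise statistics.StatisticsError("need n >= 1 and at least two data points")
    result = []
    if method == "inclusive":
        m = ld - 1  # intervals anchored at the first and last data point
        for i in range(1, n):
            j, delta = divmod(i * m, n)
            result.append((data[j] * (n - delta) + data[j + 1] * delta) / n)
    elif method == "exclusive":
        m = ld + 1  # one notional interval beyond each end of the data
        for i in range(1, n):
            j = min(max(i * m // n, 1), ld - 1)  # clamp to a valid neighbour pair
            delta = i * m - j * n
            result.append((data[j - 1] * (n - delta) + data[j] * delta) / n)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return result


data = [10, 20, 30, 40, 50]
inclusive = cut_points(data, n=4, method="inclusive")  # [20.0, 30.0, 40.0]
exclusive = cut_points(data, n=4, method="exclusive")  # [15.0, 30.0, 45.0]
```

Keeping delta as an integer and dividing once at the end means the interpolation weights stay exact until the final division.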

gopy notes

  • _sum returns a (type, Fraction, count) triple: the dominant numeric type, the exact total, and the number of items. The port must track the type and total through accumulation and call _convert at the end, matching CPython's type-promotion rules exactly.
  • covariance, correlation, and linear_regression were added in 3.10. linear_regression gained a proportional keyword argument in 3.11, and correlation gained a method keyword argument for rank correlation in 3.12. The port should gate each addition behind a feature constant rather than backporting all changes into one flat implementation.
  • NormalDist.inv_cdf uses a minimax rational approximation with hardcoded coefficients. Port the coefficients verbatim (see statistics.py around line 950) and add a comment citing the reference that CPython cites: Wichura, "Algorithm AS241: The Percentage Points of the Normal Distribution", Applied Statistics, 1988.
  • StatisticsError is a subclass of ValueError. The gopy port must register it as a proper exception subtype so except ValueError catches it.
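The observable behaviour behind these notes can be pinned down with a small smoke test. The sketch below is written against CPython's statistics module; check_port_behaviour is a hypothetical name, and the stats parameter is an assumption about how a port might be substituted in.

```python
import math
import statistics
from decimal import Decimal
from fractions import Fraction


def check_port_behaviour(stats=statistics):
    """Hypothetical smoke test; `stats` is the module under test."""
    # StatisticsError must be catchable as ValueError.
    assert issubclass(stats.StatisticsError, ValueError)
    try:
        stats.mean([])
    except ValueError as exc:
        assert isinstance(exc, stats.StatisticsError)
    else:
        raise AssertionError("mean([]) should raise StatisticsError")

    # Result types follow the promotion rules applied by _sum and _convert.
    assert type(stats.mean([1, 2, 3])) is int        # whole-number mean stays int
    assert type(stats.mean([1, 2])) is float         # 1.5 cannot be an int
    assert type(stats.mean([Fraction(1, 2), 1])) is Fraction
    assert type(stats.mean([Decimal("0.5"), 1])) is Decimal

    # inv_cdf should invert cdf to within floating-point tolerance.
    nd = stats.NormalDist()
    assert math.isclose(nd.inv_cdf(nd.cdf(1.0)), 1.0, rel_tol=1e-9)
    return True


ok = check_port_behaviour()
```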