v0.4.0 - Numbers, strings, hashing, and format

Released May 4, 2026.

The fun parts of a Python runtime are the VM and the compiler. The parts that decide whether your runtime is actually usable are the small ones nobody thinks about: does float('1.0e-308') parse to the same uint64 bits CPython produces? Does format(1234567, ',d') emit 1,234,567 with the comma in the right place? Does hash(b"hello") produce the same int CPython does under the same PYTHONHASHSEED?

None of these questions are interesting until your runtime gets one of them wrong, at which point they are the only thing your users want to talk about. A pandas DataFrame round-trip through your runtime that produces slightly different floats from CPython's is a bug report you'll spend a week tracking down. A hash that doesn't match is half your dict-based code path quietly broken.

v0.4.0 ships the answers to these questions. Number parsing goes through pystrconv and is bit-for-bit identical to CPython. Hashing goes through SipHash-1-3 keyed off the runtime secret, and the test panel pins the output against captured CPython values under PYTHONHASHSEED=0. The format-spec mini-language has its own parser and its own renderers for int, float, and string, all of them ported one-to-one from formatter_unicode.c.

The release is small in line count and broad in coverage. After v0.4, every leaf operation the v0.5 compile pipeline and the v0.6 VM need from the bottom of the value stack is in place.

Highlights

Three themes pull through this release.

Bit-perfect float parity

Python's floats are IEEE-754 doubles. So are Go's float64. But "both are IEEE-754" does not mean "parsing the same string gives the same bits". CPython's float parser does its own preprocessing before handing off to David Gay's dtoa.c. PEP 515 underscores get stripped. nan(payload) literals get rejected. inf, infinity, and nan are accepted case-insensitively, so Inf and INFINITY parse too. The list of edge cases is longer than it has any right to be.

v, err := pystrconv.ParseFloat("1_000.5e-3")
// v == 1.0005, err == nil

bits := math.Float64bits(v)
// matches CPython's struct.unpack('<Q', struct.pack('<d', float('1_000.5e-3')))[0]

_, err = pystrconv.ParseFloat("nan(deadbeef)")
// err != nil; CPython raises ValueError: could not convert string to float: 'nan(deadbeef)'

The implementation wraps Go's strconv.ParseFloat, but only after the CPython preprocessor runs. We did not write a faithful dtoa.c port; instead we verified bit-for-bit parity through a checked-in panel of inputs (subnormals, edge round-half-to-even cases, the gradual-underflow boundary). A faithful dtoa.c port is tracked as a follow-up, but the practical behavior is already nailed.
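To make the "preprocess, then delegate" shape concrete, here is a minimal sketch of that idea. It is not gopy's ParseFloat: parseFloatSketch, stripUnderscores, and errFloatSyntax are hypothetical names, and the sketch leaves out cases a full port has to handle (signed nan, non-ASCII digits). It assumes the only things to filter before strconv are underscores, hex floats, and nan(payload).

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"strconv"
	"strings"
)

// errFloatSyntax is a hypothetical error value for this sketch.
var errFloatSyntax = errors.New("could not convert string to float")

func isDigit(c byte) bool { return '0' <= c && c <= '9' }

// stripUnderscores drops PEP 515 underscores, accepting one only when it
// sits directly between two ASCII digits.
func stripUnderscores(s string) (string, bool) {
	var b strings.Builder
	for i := 0; i < len(s); i++ {
		if s[i] != '_' {
			b.WriteByte(s[i])
			continue
		}
		if i == 0 || i == len(s)-1 || !isDigit(s[i-1]) || !isDigit(s[i+1]) {
			return "", false
		}
	}
	return b.String(), true
}

// parseFloatSketch filters the input first, then lets strconv.ParseFloat
// do the correctly rounded conversion.
func parseFloatSketch(s string) (float64, error) {
	s, ok := stripUnderscores(strings.TrimSpace(s))
	if !ok || s == "" {
		return 0, errFloatSyntax
	}
	// strconv accepts hex floats such as "0x1p-2"; Python's float() does not.
	// Rejecting parentheses covers nan(payload) explicitly.
	if strings.ContainsAny(s, "xXpP()") {
		return 0, errFloatSyntax
	}
	v, err := strconv.ParseFloat(s, 64)
	if errors.Is(err, strconv.ErrRange) {
		// Out-of-range input: strconv still returns the saturated value,
		// which is also what float("1e999") evaluates to (inf).
		return v, nil
	}
	if err != nil {
		return 0, errFloatSyntax
	}
	return v, nil
}

func main() {
	v, err := parseFloatSketch("1_000.5e-3")
	fmt.Println(v, math.Float64bits(v), err)

	_, err = parseFloatSketch("nan(deadbeef)")
	fmt.Println(err)
}
```

The design point is that the filter only ever rejects or rewrites text; every numeric decision stays inside strconv, so correctness of the rounding does not depend on the preprocessing code.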

Formatting walks the other direction: FormatFloat reproduces CPython's output for every code ('r', 's', 'g', 'G', 'e', 'E', 'f', 'F', '%') and every flag (alternate form, always-sign, space-sign, no-negative-zero, add-dot-zero). The output mirrors what format_float_short produces in Python/pystrtod.c.
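One visible difference a formatter has to paper over is where shortest output switches from fixed to exponent form: strconv's 'g' verb uses different thresholds than Python's repr. The sketch below shows one way to get repr-shaped output for the 'r' code from strconv alone. reprSketch is a hypothetical helper, not gopy's FormatFloat, and it assumes repr uses fixed notation for decimal exponents from -4 through 15 and always shows a fractional part.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// reprSketch renders v the way Python's repr(float) is laid out, using
// strconv for the shortest round-trip digits.
func reprSketch(v float64) string {
	switch {
	case math.IsNaN(v):
		return "nan"
	case math.IsInf(v, 1):
		return "inf"
	case math.IsInf(v, -1):
		return "-inf"
	}
	// Shortest digits in d.ddde±XX form; the text after 'e' is the
	// decimal exponent.
	sci := strconv.FormatFloat(v, 'e', -1, 64)
	exp, err := strconv.Atoi(sci[strings.IndexByte(sci, 'e')+1:])
	if err != nil {
		return sci
	}
	if exp < -4 || exp >= 16 {
		return sci // exponent form, e.g. 1e+16 or 1e-05
	}
	fixed := strconv.FormatFloat(v, 'f', -1, 64)
	if !strings.Contains(fixed, ".") {
		fixed += ".0" // repr always shows a fractional part
	}
	return fixed
}

func main() {
	for _, v := range []float64{1, 0.1, 1e15, 1e16, 0.0001, 0.00001} {
		fmt.Println(reprSketch(v))
	}
}
```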

SipHash-1-3, keyed off the runtime secret

CPython 3.14 hashes bytes through SipHash-1-3 under a process-wide secret. The secret is initialized from PYTHONHASHSEED (or randomly if unset), and the same secret is used for hash() on str, bytes, and the various other hashable types that ultimately delegate to byte hashing.

import "gopy/hash"

h := hash.Buffer([]byte("hello"))
// Under PYTHONHASHSEED=0, this matches CPython's hash(b"hello").

h2 := hash.KeyedHash(key, []byte("payload"))
// _Py_KeyedHash for short keyed payloads.

The full panel:

  • Buffer is SipHash-1-3. This is the production path.
  • BufferFNV is the FNV variant the C source includes for embedded builds that want a smaller code footprint. We ship both; Buffer is the default.
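  • KeyedHash ports _Py_KeyedHash, which hashes a buffer under a caller-supplied 64-bit key. CPython uses it to compute the source hash stored in hash-based .pyc files.
  • Pointer ports Py_HashPointer, used by the hash of bound method objects.
  • Double ports _Py_HashDouble, used by float.__hash__. It reduces the value modulo HashModulus, so a float that equals an int hashes to the same value as that int.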
  • GetFuncDef ports PyHash_GetFuncDef, the introspection entry point that returns the hash algorithm name and key size.

Reference vectors against CPython under PYTHONHASHSEED=0 are pinned in hash/hash_test.go. The vectors cover empty bytes, the short-string boundary (below and above Py_HASH_CUTOFF, even though 3.14 sets that cutoff to 0), and a long string that exercises the rolling-state inner loop. Any deviation from CPython's output fails the test.
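For readers who have not seen the round structure, here is a compact sketch of the SipHash family with the round counts as parameters, so that c=1, d=3 gives SipHash-1-3. It is not gopy's hash.Buffer: sipHash and sipState are hypothetical names, and the sketch has only been checked against the published SipHash-2-4 reference vectors, not against the CPython values pinned in hash/hash_test.go.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

// sipState holds the four 64-bit lanes of the SipHash state.
type sipState struct{ v0, v1, v2, v3 uint64 }

// round is one SipRound: add, rotate, xor across the four lanes.
func (s *sipState) round() {
	s.v0 += s.v1
	s.v1 = bits.RotateLeft64(s.v1, 13)
	s.v1 ^= s.v0
	s.v0 = bits.RotateLeft64(s.v0, 32)
	s.v2 += s.v3
	s.v3 = bits.RotateLeft64(s.v3, 16)
	s.v3 ^= s.v2
	s.v0 += s.v3
	s.v3 = bits.RotateLeft64(s.v3, 21)
	s.v3 ^= s.v0
	s.v2 += s.v1
	s.v1 = bits.RotateLeft64(s.v1, 17)
	s.v1 ^= s.v2
	s.v2 = bits.RotateLeft64(s.v2, 32)
}

// sipHash runs c compression rounds per 8-byte word and d finalization
// rounds, keyed by (k0, k1).
func sipHash(c, d int, k0, k1 uint64, msg []byte) uint64 {
	s := sipState{
		v0: k0 ^ 0x736f6d6570736575,
		v1: k1 ^ 0x646f72616e646f6d,
		v2: k0 ^ 0x6c7967656e657261,
		v3: k1 ^ 0x7465646279746573,
	}
	compress := func(m uint64) {
		s.v3 ^= m
		for i := 0; i < c; i++ {
			s.round()
		}
		s.v0 ^= m
	}
	// The final word carries the message length in its top byte.
	last := uint64(len(msg)) << 56
	for ; len(msg) >= 8; msg = msg[8:] {
		compress(binary.LittleEndian.Uint64(msg))
	}
	for i, b := range msg {
		last |= uint64(b) << (8 * uint(i))
	}
	compress(last)
	s.v2 ^= 0xff
	for i := 0; i < d; i++ {
		s.round()
	}
	return s.v0 ^ s.v1 ^ s.v2 ^ s.v3
}

func main() {
	// SipHash-1-3 with an all-zero key over "hello".
	fmt.Printf("%016x\n", sipHash(1, 3, 0, 0, []byte("hello")))
}
```

Parameterizing the round counts keeps the 1-3 and 2-4 variants on one code path, which is what lets the 2-4 reference vectors vouch for the shared round function.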

Format-spec mini-language

format(value, spec) is its own programming language. The grammar is:

[[fill]align][sign][z][#][0][width][,_][.precision][type]

Every bracketed piece is optional, the order is fixed, and the interpretation of each piece depends on the type of value. # means alternate form for ints and floats and is an error for strings. , means thousands separator for ints and floats, illegal for strings. 0 means sign-aware zero padding for numbers; for strings it only sets the fill character to 0. The list goes on.

import (
	"math/big"

	"gopy/format"
)

spec, _ := format.ParseSpec(",d")
out, _ := format.FormatInt(big.NewInt(1234567), spec)
// "1,234,567"

spec, _ = format.ParseSpec(".3f")
out, _ = format.FormatFloat(3.14159, spec)
// "3.142"

spec, _ = format.ParseSpec(">10")
out, _ = format.FormatString("hi", spec)
// "        hi" (right-aligned in a field of width 10)

The implementation is a one-to-one port of formatter_unicode.c. ParseSpec is the spec parser (the format spec itself is a tiny context-free grammar; we hand-rolled a recursive descent matcher that walks the runes). FormatString, FormatInt, and FormatFloat are the three renderers, each consuming a parsed Spec and producing the output bytes.
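As an illustration of how little machinery the grammar needs, here is a minimal single-pass sketch of a spec parser. It is not gopy's ParseSpec: specSketch and parseSpecSketch are hypothetical names, the field layout is only loosely modeled on the Spec fields listed in this post, and the sketch does no per-type validation (for example, it will happily accept , for a string type).

```go
package main

import (
	"errors"
	"fmt"
)

// specSketch is a hypothetical stand-in for a parsed format spec.
type specSketch struct {
	Fill      rune
	Align     rune // '<', '>', '=', '^', or 0 when absent
	Sign      rune // '+', '-', ' ', or 0 when absent
	NoNegZero bool // the z flag
	Alt       bool // the # flag
	ZeroPad   bool // the 0 flag
	Width     int  // -1 when absent
	Group     rune // ',', '_', or 0 when absent
	Precision int  // -1 when absent
	Type      rune // 0 when absent
}

func isAlign(r rune) bool { return r == '<' || r == '>' || r == '=' || r == '^' }

func parseSpecSketch(s string) (specSketch, error) {
	r := []rune(s)
	sp := specSketch{Fill: ' ', Width: -1, Precision: -1}
	i := 0
	// A fill character is only present when an align character follows it.
	if len(r) >= 2 && isAlign(r[1]) {
		sp.Fill, sp.Align, i = r[0], r[1], 2
	} else if len(r) >= 1 && isAlign(r[0]) {
		sp.Align, i = r[0], 1
	}
	// flag consumes c if it is the next rune.
	flag := func(c rune) bool {
		if i < len(r) && r[i] == c {
			i++
			return true
		}
		return false
	}
	tooBig := false
	// number consumes a run of ASCII digits, returning -1 if there is none.
	number := func() int {
		n, seen := 0, false
		for ; i < len(r) && r[i] >= '0' && r[i] <= '9'; i++ {
			seen = true
			if n > 1<<24 {
				tooBig = true
				continue
			}
			n = n*10 + int(r[i]-'0')
		}
		if !seen {
			return -1
		}
		return n
	}
	for _, c := range []rune{'+', '-', ' '} {
		if flag(c) {
			sp.Sign = c
			break
		}
	}
	sp.NoNegZero = flag('z')
	sp.Alt = flag('#')
	sp.ZeroPad = flag('0')
	sp.Width = number()
	for _, c := range []rune{',', '_'} {
		if flag(c) {
			sp.Group = c
			break
		}
	}
	if flag('.') {
		if sp.Precision = number(); sp.Precision < 0 {
			return sp, errors.New("format specifier missing precision")
		}
	}
	if i < len(r) {
		sp.Type = r[i]
		i++
	}
	if i != len(r) || tooBig {
		return sp, errors.New("invalid format specifier")
	}
	return sp, nil
}

func main() {
	sp, err := parseSpecSketch("*^+#012_.3f")
	fmt.Printf("%+v %v\n", sp, err)
}
```

Because the order of the pieces is fixed, each piece can be tried once, left to right, with one rune of lookahead for the fill character.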

For the digit-generation arm of FormatFloat, we delegate to pystrconv.FormatFloat, which is where the bit-perfect float work lives. This means a format like format(0.1, '.20f') produces the same output as CPython without any per-precision special casing.

What's new

The full package breakdown.

pystrconv/

Locale-independent string and number conversion. Ports a stack of small CPython files that together cover the bottom of the value stack:

  • Python/pyctype.c for character classification. Flags, IsLower, IsUpper, IsAlpha, IsDigit, IsXDigit, IsAlnum, IsSpace, ToLower, ToUpper. Computed, not table-driven. The 0..255 range is round-trip verified against CPython's _Py_ctype_table.
  • Python/pystrcmp.c for ASCII case-insensitive compare. CompareInsensitive, CompareInsensitiveN.
  • Python/mystrtoul.c for integer parsing. ParseUint, ParseInt with base 0 autodetect (0x, 0o, 0b), bases 2 through 36, leading whitespace, sign handling, and overflow detection. The overflow rule is the same one CPython uses: if accumulating the next digit would push the value past the type's maximum, report overflow instead of wrapping.
  • Python/pystrhex.c for hex rendering. Hex, HexBytes, HexWithSep, HexBytesWithSep. These back bytes.hex() and friends. The separator variant takes a byte and an offset (every Nth byte gets a separator), which is what bytes.hex(':', 2) calls into.
  • Python/pystrtod.c plus the wrapper for Python/dtoa.c for float parsing and formatting. ParseFloat over Go's strconv with CPython preprocessing (PEP 515 underscores, case-insensitive inf and infinity, rejection of nan(payload)). FormatFloat for every code and every flag.
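The separator behavior from the hex bullet is easy to get subtly wrong, so here is a small sketch of the rule as Python documents it for bytes.hex(sep, bytes_per_sep): a positive count groups bytes starting from the right, a negative count groups from the left. hexWithSepSketch is a hypothetical helper, not gopy's HexWithSep, and its signature is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// hexWithSepSketch renders data as lowercase hex, inserting sep between
// groups of |bytesPerSep| bytes. Positive counts anchor the groups at the
// right end, negative counts at the left end, and zero disables the
// separator in this sketch.
func hexWithSepSketch(data []byte, sep byte, bytesPerSep int) string {
	const digits = "0123456789abcdef"
	group := bytesPerSep
	if group < 0 {
		group = -group
	}
	var b strings.Builder
	for i, c := range data {
		if i > 0 && group > 0 {
			pos := i // bytes already written, for left-anchored groups
			if bytesPerSep > 0 {
				pos = len(data) - i // bytes remaining, for right-anchored groups
			}
			if pos%group == 0 {
				b.WriteByte(sep)
			}
		}
		b.WriteByte(digits[c>>4])
		b.WriteByte(digits[c&0x0f])
	}
	return b.String()
}

func main() {
	fmt.Println(hexWithSepSketch([]byte{0xf0, 0xf1, 0xf2}, '_', 2))
	fmt.Println(hexWithSepSketch([]byte("UUDDLRLRAB"), ' ', -4))
}
```

The two calls in main mirror the examples in the Python documentation for bytes.hex, which give f0_f1f2 and 55554444 4c524c52 4142.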

The CPython files this package ports from share a theme: they're the bottom of the bottom of the stack, called from everywhere, and they're locale-independent on purpose. CPython took a hit years ago when locale-dependent strtod parsed "3,14" as 3.14 in German locales. The fix was to hand-roll a parser that ignores the locale, and that's what we ship here too.

pymath/

The thin float math file that bridges between CPython's math helpers and Go's math package.

  • NaN, Inf, NegInf as float64 constants.
  • CopySign, plus the predicates IsNaN, IsInf, IsFinite, all routed to Go's math package.
  • Log1p, Hypot as the two transcendental helpers CPython exposes through pymath.c.
  • FPECounter, FPEDummy as stable-ABI sentinels from pyfpe.c. These are legacy hooks for embedders that want to install their own floating-point exception handlers. Nobody we know uses them, but the stable ABI promises they exist, so we publish them.

Ports Python/pymath.c and Python/pyfpe.c directly.

hash/

The full hash machinery. Ports Python/pyhash.c.

The internals are simple in structure but unforgiving in detail. SipHash-1-3 is a stateful round function; our implementation matches CPython's reference round by round. The runtime secret initialization reads PYTHONHASHSEED exactly the way CPython does: unset (or "random") means a random secret, "0" turns randomization off and leaves the secret as all zero bytes (this is the secret the test panel runs under), and any other integer in [0, 4294967295] seeds a deterministic generator that fills the secret.
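The three PYTHONHASHSEED cases can be sketched as below. This is an illustration, not gopy's initialization code: secretFromSeed and lcgFill are hypothetical names, the 24-byte secret size and the linear congruential constants are recalled from CPython's Python/bootstrap_hash.c and should be treated as assumptions, and the sketch treats an empty variable like an unset one.

```go
package main

import (
	"crypto/rand"
	"errors"
	"fmt"
	"os"
	"strconv"
)

// secretSize is assumed to match the size of CPython's hash secret.
const secretSize = 24

// lcgFill expands a 32-bit seed into buf with a small linear congruential
// generator. The constants are an assumption (see the lead-in).
func lcgFill(seed uint32, buf []byte) {
	x := seed
	for i := range buf {
		x = x*214013 + 2531011 // wraps modulo 2^32
		buf[i] = byte(x >> 16)
	}
}

// secretFromSeed maps a PYTHONHASHSEED value to a secret:
// unset, empty, or "random" gives random bytes, "0" gives all zeros,
// and any other integer in [0, 4294967295] gives a seeded secret.
func secretFromSeed(value string, isSet bool) ([secretSize]byte, error) {
	var secret [secretSize]byte
	if !isSet || value == "" || value == "random" {
		_, err := rand.Read(secret[:])
		return secret, err
	}
	seed, err := strconv.ParseUint(value, 10, 32)
	if err != nil {
		return secret, errors.New(`PYTHONHASHSEED must be "random" or an integer in range [0; 4294967295]`)
	}
	if seed != 0 {
		lcgFill(uint32(seed), secret[:])
	}
	return secret, nil
}

func main() {
	value, isSet := os.LookupEnv("PYTHONHASHSEED")
	secret, err := secretFromSeed(value, isSet)
	fmt.Printf("%x %v\n", secret, err)
}
```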

The constants:

  • HashBits. The number of bits in a hash, 64 on amd64 and arm64.
  • HashModulus. The Mersenne prime 2^61 - 1 that numeric hashes are reduced modulo.
  • HashInf, HashImag. The hash of positive infinity, and the multiplier applied to the imaginary part when hashing a complex.

All of these are visible because Python-level sys.hash_info exposes them.

format/

The format-spec mini-language. Ports Python/formatter_unicode.c.

  • Spec. The parsed spec struct. Carries Fill, Align, Sign, AltForm, ZeroPad, Width, GroupChar, Precision, Type.
  • ParseSpec(s) (Spec, error). Parses the mini-language. The parser walks the runes and tries to match each optional piece in turn; on a mismatch it backtracks and tries the next piece.
  • FormatString(s, spec) string. String formatter. Handles alignment, padding, fill, and precision (precision means "truncate to N characters" for strings).
  • FormatInt(n, spec) string. Int formatter. Handles every type code (b, o, d, x, X, c), grouping with , or _, sign handling, and alternate form (the 0b, 0o, 0x prefix for non-decimal bases).
  • FormatFloat(f, spec) string. Float formatter. Delegates digit generation to pystrconv.FormatFloat and adds grouping / alignment / sign / fill on top.

The grouping rule is subtle and worth a callout: ,d means group every three digits with commas and _d means group every three digits with underscores, but for hex, octal, and binary output (x, X, o, b) only _ is allowed, and it groups every four digits, not three. We get this right because we ported the logic from formatter_unicode.c rather than rolling our own.
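The digit-grouping step on its own can be sketched as follows. groupDigitsSketch is a hypothetical helper, not the function gopy uses, and it covers only the insertion of separators into an already rendered digit string; the interaction with zero padding and width that formatter_unicode.c handles is left out.

```go
package main

import (
	"fmt"
	"strings"
)

// groupDigitsSketch inserts sep between groups of size digits, counting
// groups from the right so that any short group ends up at the front.
func groupDigitsSketch(digits string, sep byte, size int) string {
	if size <= 0 || len(digits) <= size {
		return digits
	}
	head := len(digits) % size
	if head == 0 {
		head = size
	}
	var b strings.Builder
	b.WriteString(digits[:head])
	for i := head; i < len(digits); i += size {
		b.WriteByte(sep)
		b.WriteString(digits[i : i+size])
	}
	return b.String()
}

func main() {
	fmt.Println(groupDigitsSketch("1234567", ',', 3))  // 1,234,567
	fmt.Println(groupDigitsSketch("ffffffff", '_', 4)) // ffff_ffff
}
```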

Why we built it this way

A few notes on the shape of this release.

Why a wrapper over strconv instead of porting dtoa.c

Python/dtoa.c is David Gay's reference dtoa from 1996. It's 2500 lines of dense numeric code that solves the shortest round-trip decimal representation problem optimally. Go's strconv.ParseFloat and strconv.FormatFloat solve the same problems with different algorithms (Eisel-Lemire on the parsing side and a Ryu-derived shortest formatter on the output side, each with a slower exact fallback), but both are correctly rounded, so they produce the same output as dtoa.c for every IEEE-754 double.

Bit-for-bit parity is the contract we want. We get that parity through strconv plus the CPython preprocessor (the part that strips underscores and rejects nan(payload)). A faithful dtoa.c port would not improve the behavior. It would just expand the code base by 2500 lines for no observable difference. We pinned that decision through the gate panel.

Why SipHash-1-3 and not SipHash-2-4

Python 3.11 switched from SipHash-2-4 to SipHash-1-3 as the default hash. The reason is performance: 1-3 is enough rounds to resist the algorithmic complexity attacks SipHash was designed for, and the savings on short strings are real. We're a 3.14 port, so we ship 1-3 and not 2-4. If you need 2-4 for an embedded build that pins to 3.10 behavior, the BufferFNV and Buffer panel is shaped to accept additional algorithms; we just haven't ported them because nobody is asking.

Why grouping codes live in the format package, not pystrconv

pystrconv.FormatFloat produces digits. It doesn't insert commas, it doesn't pad, it doesn't align. The grouping logic lives one layer up in the format package, because grouping is a property of the format spec, not of the digit-generation algorithm. CPython makes the same split (formatter_unicode.c adds the commas; pystrtod.c produces the digits), and following the C structure made the ports easier to read against the originals.

Why a stable-ABI sentinel for FPECounter

FPECounter is dead code. Nobody calls it. We ship it because the stable ABI says it exists, and a stable-ABI consumer that links against gopy expecting to find the symbol should find it. Tearing it out would save fifty lines and break (theoretical) consumers. We kept it.

Where it lives

  • pystrconv/ for the character classification, number parsing, hex rendering, and float parsing / formatting.
  • pymath/ for the float math helpers.
  • hash/ for SipHash-1-3 and the keyed-hash family.
  • format/ for the format-spec mini-language.

The CPython sources we ported from:

  • Python/pyctype.c for character classification.
  • Python/pystrcmp.c for ASCII case-insensitive compare.
  • Python/mystrtoul.c for integer parsing.
  • Python/pystrhex.c for hex rendering.
  • Python/pystrtod.c plus a wrapper for Python/dtoa.c for float parsing and formatting.
  • Python/pymath.c for the math helpers.
  • Python/pyfpe.c for the stable-ABI floating-point hooks.
  • Python/pyhash.c for SipHash-1-3 and the hash machinery.
  • Python/formatter_unicode.c for the format-spec mini-language.

Compatibility

  • Go: 1.26 or newer.
  • CPython behavioral target: 3.14.0+.

The gate test panel pins these cross-cuts:

  • hash.Buffer([]byte("hello")) matches CPython's hash(b"hello") under PYTHONHASHSEED=0. The exact bit pattern is captured in the test.
  • pystrconv.ParseFloat round-trips a panel of inputs to the same uint64 bit pattern as CPython. The panel covers subnormals, the gradual-underflow boundary, every power of two near the range edges, and a handful of "interesting" doubles (0.1, 1.0/3.0, math.pi).
  • pystrconv.FormatFloat reproduces CPython's repr for the thresholds where shortest round-trip switches from f form to exponent form.
  • format.FormatInt on 1234567 with the parsed spec ,d matches format(1234567, ',d') from CPython.

Anywhere one of these fails, the gate test fails. No silent divergence.

Out of scope

A few things this release intentionally does not ship.

  • Full 1:1 port of Python/dtoa.c. The wrapper over Go's strconv is bit-correct (the gate pins it). A faithful source-shape port is tracked as a follow-up; we'll do it when somebody needs it.
  • SipHash-2-4. CPython's default before 3.11. 3.14 ships SipHash-1-3 and that's what we ship.
  • DJBX33A short-string fast path. Py_HASH_CUTOFF is 0 in default 3.14 builds, which means the short-string fast path is dead code in 3.14. We didn't port the dead path.
  • Complex formatter. The complex type lands in a later release; the format-spec handling for it lands then.

What's next

v0.5 builds the compile pipeline on top of what landed here. The AST validator, the symtable resolver, the codegen visitor panel, the flowgraph optimizer, and the assembler all show up. Numbers become real LOAD_CONST operands. Strings become real LOAD_CONST operands. The format-spec mini-language gets exercised by FormattedValue AST nodes.

v0.5.5 adds the lexer and the parser scaffolding. v0.6 turns on the VM. From v0.6 onward, every number you parse, every string you format, every hash you compute walks through what we built today.