Skip to main content

1660. gopy strings and numbers

What we are porting

Eight files from cpython/Python/:

C sourceLinesGo target
pyctype.c215pystrconv/ctype.go
pystrcmp.c31pystrconv/cmp.go
mystrtoul.c297pystrconv/strtoul.go
pystrtod.c1286pystrconv/strtod.go
dtoa.c2843pystrconv/dtoa.go (see below)
pystrhex.c171pystrconv/hex.go
pymath.c19pymath/pymath.go
pyfpe.c~5pymath/fpe.go
formatter_unicode.c1726format/format.go

Together they form the bedrock that int.__repr__, float.__repr__, int.__str__, float.__str__, format(), f"{x:.2f}", int(s, base), float(s), and bytes.hex() all dispatch through. Nothing here depends on the runtime, the parser, or the VM. Everything here is callable from Go directly and tested with table-driven inputs.

pyctype: locale-independent ctype

cpython/Python/pyctype.c ships a 256-entry bitmask table plus toupper/tolower lookup tables. The behaviour is fixed by cpython/Include/cpython/pyctype.h:

PY_CTF_LOWER 0x01
PY_CTF_UPPER 0x02
PY_CTF_ALPHA PY_CTF_LOWER | PY_CTF_UPPER
PY_CTF_DIGIT 0x04
PY_CTF_ALNUM PY_CTF_ALPHA | PY_CTF_DIGIT
PY_CTF_SPACE 0x08
PY_CTF_XDIGIT 0x10

Py_ISSPACE matches ' ', \t, \n, \v, \f, \r. Py_ISDIGIT is '0'..'9'. Py_ISXDIGIT is digits plus 'a'..'f' and 'A'..'F'. Py_TOLOWER/Py_TOUPPER are ASCII-only and locale-independent.

Go translation skips the literal 256-entry table because computing the flags from the byte is faster than a memory load and trivially auditable. The exhaustive 0..255 round-trip test guarantees the byte-for-byte equivalence with the original tables.

func Flags(c byte) uint32
func IsLower(c byte) bool
func IsUpper(c byte) bool
func IsAlpha(c byte) bool
func IsDigit(c byte) bool
func IsXDigit(c byte) bool
func IsAlnum(c byte) bool
func IsSpace(c byte) bool
func ToLower(c byte) byte
func ToUpper(c byte) byte

pystrcmp: case-insensitive ASCII compare

cpython/Python/pystrcmp.c has two functions, PyOS_mystricmp and PyOS_mystrnicmp, both ASCII case-folding. They are used by the config parser, the float repr, and a handful of other locale-sensitive spots that explicitly want the locale-free behaviour.

func CompareInsensitive(a, b string) int
func CompareInsensitiveN(a, b string, size int) int

mystrtoul: integer parsing

cpython/Python/mystrtoul.c is the fast path for int(s, base). It handles base 0 (auto-detect from 0x, 0o, 0b prefixes), bases 2..36, leading whitespace, optional +/- sign, underscores between digits (PEP 515), and overflow detection. Returns the parsed value plus the number of bytes consumed.

type IntParseError int
const (
IntOK IntParseError = iota
IntInvalid
IntOverflow
)

// ParseUint mirrors PyOS_strtoul.
func ParseUint(s string, base int) (val uint64, n int, err IntParseError)

// ParseInt mirrors PyOS_strtol.
func ParseInt(s string, base int) (val int64, n int, err IntParseError)

pystrtod and dtoa: float parsing and formatting

This is the heaviest file in the v0.4 batch. dtoa.c is David Gay's 2843-line bignum-arithmetic implementation of strtod and dtoa. pystrtod.c wraps it with the Python-specific surface (PyOS_string_to_double, PyOS_double_to_string, the 'r'/'g'/'e' format codes, the locale-independent decimal point, the inf/nan/hex literal recognition).

dtoa decision

CPython links David Gay's dtoa.c directly. Go's standard library contains its own pure-Go float parser/formatter (strconv) using Ryu/Grisu plus a bignum fallback for the round-trip-tight cases. Both are bit-correct under IEEE 754 round-to-nearest-even. They produce the same uint64 bit pattern for any input string CPython accepts because correct round-to-nearest is unique.

v0.4 implements pystrconv/strtod.go as a thin wrapper over Go's strconv.ParseFloat with CPython-specific input pre-processing:

  • Reject nan(123) parenthesised payload form (Go accepts it).
  • Reject leading whitespace inside the number (Go strconv rejects it too; the CPython entry already strips outer whitespace).
  • Honour the '_' digit separator the same way as int parsing.
  • Treat inf / infinity case-insensitively.

pystrconv/dtoa.go is a wrapper over strconv.AppendFloat('r', -1) for shortest round-trip, with explicit format codes:

  • 'r' (repr), 's' (str), 'g' (general), 'e' (exponent), 'f' (fixed), '%' (percent).
  • 'r' calls AppendFloat(buf, f, 'g', -1, 64).
  • The output sign for negative zero matches CPython's 0/-0/0.0/-0.0 rules.

A faithful 1:1 dtoa.c port is out of scope for v0.4. It is tracked as a follow-up issue: the wrapper is bit-correct (the gate test pins that), but the source-shape parity is not as direct as the rest of the port. The wrapper carries a doc comment naming each dtoa.c function it stands in for so the eventual replacement is mechanical.

Surface

// ParseFloat mirrors PyOS_string_to_double.
func ParseFloat(s string) (float64, error)

// FormatFloat mirrors PyOS_double_to_string. code is one of
// 'r', 's', 'g', 'e', 'f', '%'. precision is -1 for the
// format-default.
func FormatFloat(f float64, code byte, precision int, flags FormatFlag) string

type FormatFlag uint32
const (
FlagAddDotZero FormatFlag = 1 << iota
FlagAlternate
FlagAlwaysSign
FlagSpaceSign
)

pystrhex: bytes-to-hex

Direct port. Three entry points (Hex, HexBytes, HexWithSep) feed bytes.hex(), binascii.hexlify, memoryview.hex(). Right-grouped default; negative bytesPerGroup switches to left-grouped.

pymath and pyfpe

pymath.c is a 19-line file: it defines _Py_dg_stdnan, _Py_dg_infinity, and _Py_log1p if the C math library lacks one. The Go target uses math.NaN(), math.Inf(1), and math.Log1p directly, so the port is a thin file with the sentinels exposed at the package level for downstream code that wants the same names as CPython.

pyfpe.c is essentially empty in modern CPython (the FP exception machinery was removed in 3.10). The Go target preserves the file so the source-shape mapping stays one-to-one.

formatter_unicode: format mini-language

cpython/Python/formatter_unicode.c implements the format-spec mini-language for format(value, spec) and f-strings:

[[fill]align][sign][#][0][width][,][.precision][type]

This is a sizable port: 1726 lines of C with several mode-specific sub-formatters (int, long, float, complex, bytes, str). v0.4 ships:

  • The format-spec parser (parse_internal_render_format_spec).
  • String formatter (format_string_internal).
  • Int formatter (format_long_internal) including grouping (, / _) and the b/o/d/x/X/c type codes.
  • Float formatter (format_float_internal) including e/E/f/F/g/G/% type codes, leveraging pystrconv.FormatFloat underneath.

format_complex_internal rides along when complex lands (no v0.4 caller forces it; defer until needed).

type Spec struct {
Fill rune
Align byte // '<', '>', '=', '^', or 0
Sign byte // '+', '-', ' ', or 0
Alt bool
Zero bool
Width int
Thousands byte // ',', '_', or 0
Precision int // -1 == unset
Type byte
}

// ParseSpec parses a format spec mini-language string.
func ParseSpec(s string) (Spec, error)

// FormatString renders a string under spec.
func FormatString(s string, spec Spec) (string, error)

// FormatInt renders an integer (any signed/unsigned width via *big.Int)
// under spec.
func FormatInt(v *big.Int, spec Spec) (string, error)

// FormatFloat renders a float under spec.
func FormatFloat(v float64, spec Spec) (string, error)

Test strategy

Each function gets a table-driven test rooted in CPython output. For the round-trip-critical pieces (ParseFloat, FormatFloat, Hash over a known seed) the gate runs the bit-pattern comparison against pre-computed CPython references to catch any divergence.