1677. gopy unicode

What we are porting

Two files, ~17000 lines total. The single largest object port:

Objects/unicodeobject.c: PEP 393 kind-tagged string. Three storage kinds (1-byte, 2-byte, 4-byte) chosen at construction to fit the largest code point. Method surface (split, join, find, replace, format, encode, decode, casefold, etc.). String interning. Codecs registry hooks.
Objects/unicodectype.c: Unicode 16.0 classification tables. is_alnum, is_alpha, is_decimal, is_digit, is_lower, is_upper, is_title, is_space, is_xid_start, is_xid_continue, plus to_lower, to_upper, to_title, to_decimal, to_digit. Tables generated from the Unicode Character Database.

Go shape

// Str mirrors PyUnicodeObject. Kind picks the storage width.
type Str struct {
    VarHeader
    kind     Kind  // 1, 2, 4 bytes per code point
    data     []byte  // length = size * kind
    hash     int64
    ascii    bool   // shortcut: kind==1 && all bytes < 0x80
    interned uint8  // 0 = not, 1 = mortal, 2 = immortal
}

type Kind uint8
const (
    Kind1 Kind = 1
    Kind2 Kind = 2
    Kind4 Kind = 4
)

CPython 3.12+ also has a "compact" inline-storage variant. We do not optimise for that yet; the spill cost is one extra slice header.

ASCII fast path

Str.ascii lets every method short-circuit when both operands are ASCII. CPython does the same via the state.ascii bit on the header.

String interning

Python interns short identifier-shaped strings (__init__, self, etc.) automatically. We interns the same set CPython interns: any string passed to _Py_intern from the parser plus the contents of Lib/keyword.py. Identity is cheap inside the interned set; equality stays correct outside it.

Hashing

SipHash-1-3 over the byte payload (not the code-point sequence; identical to CPython since the kind layout determines the bytes). Cached in the hash field.

Codecs

The codecs registry lives in objects/codecs.go. v0.4 ships UTF-8, UTF-16-le/be, UTF-32-le/be, ASCII, latin-1, and the error handlers (strict, replace, backslashreplace, xmlcharrefreplace, namereplace, surrogateescape, surrogatepass). The charset-table codecs (cp1252, gb18030, etc.) are stdlib in CPython too; they ride along once the import system lands (v0.8).

Unicode tables

tools/unicode_gen/main.go reads the Unicode Character Database and emits objects/unicode_tables_gen.go. Same data CPython's Tools/unicode/makeunicodedata.py emits, just typed for Go.

File mapping

C source	Go target
`Objects/unicodeobject.c`	`objects/str.go`
kind layout / construction	`objects/str_kind.go`
methods	`objects/str_methods.go`
format / `__format__`	`objects/str_format.go`
codecs registry	`objects/codecs.go`
utf-8 codec	`objects/codec_utf8.go`
utf-16/32, ascii, latin-1	`objects/codec_*.go`
error handlers	`objects/codec_errors.go`
interning	`objects/str_intern.go`
`Objects/unicodectype.c`	`objects/unicode_ctype.go`
generated tables	`objects/unicode_tables_gen.go`
generator	`tools/unicode_gen/`

Checklist

Status legend: [x] shipped, [ ] pending, [~] partial / scaffold, [n] deferred / not in scope this phase.

Files

Surface guarantees

hash(str) matches CPython under PYTHONHASHSEED=0 across every kind (1/2/4 byte).
repr(str) chooses the same quote style and escape style CPython does, including \x, \u, \U, \N{...}.
'a' in 'banana' performs the same Boyer-Moore-Horspool fallback CPython uses for medium needles.
str.casefold() matches Unicode 16.0 (German ß -> 'ss', etc.).
'é' == 'é' is False (no auto-NFC).
'a'.encode('utf-8') == b'a', surrogate pairs round-trip with surrogatepass.
str.isidentifier() follows the XID_Start / XID_Continue rules from PEP 3131.
Interned identity: every keyword in keyword.kwlist is interned and returns the same object.

Cross-references

Hash key: 1661.
Bytes <-> str round-trip: 1676.
Format mini-language: 1660.
\N{...} lookup table also used by 1644.

Out of scope

Locale-aware case mapping (full Turkish, etc.). Stdlib.
normalize() (NFD / NFC / NFKD / NFKC). Lives in unicodedata stdlib module.
The _string extension module. Stdlib bridge.

What we are porting​

Go shape​

ASCII fast path​

String interning​

Hashing​

Codecs​

Unicode tables​

File mapping​

Checklist​

Files​

Surface guarantees​

Cross-references​

Out of scope​