1677. gopy unicode
What we are porting
Two files, ~17000 lines total. The single largest object port:
- Objects/unicodeobject.c: PEP 393 kind-tagged string. Three storage kinds (1-byte, 2-byte, 4-byte) chosen at construction to fit the largest code point. Method surface (split, join, find, replace, format, encode, decode, casefold, etc.). String interning. Codecs registry hooks.
- Objects/unicodectype.c: Unicode 16.0 classification tables. is_alnum, is_alpha, is_decimal, is_digit, is_lower, is_upper, is_title, is_space, is_xid_start, is_xid_continue, plus to_lower, to_upper, to_title, to_decimal, to_digit. Tables generated from the Unicode Character Database.
Go shape
// Str mirrors PyUnicodeObject. Kind picks the storage width.
type Str struct {
	VarHeader
	kind     Kind   // 1, 2, 4 bytes per code point
	data     []byte // length = size * kind
	hash     int64
	ascii    bool  // shortcut: kind==1 && all bytes < 0x80
	interned uint8 // 0 = not, 1 = mortal, 2 = immortal
}

type Kind uint8

const (
	Kind1 Kind = 1
	Kind2 Kind = 2
	Kind4 Kind = 4
)
CPython also has a "compact" inline-storage variant, part of PEP 393 since 3.3, that allocates the character data in the same block as the header. We do not optimise for that yet; the spill cost is one extra slice header.
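Kind selection can be sketched as a single scan for the largest code point, followed by packing at that width. This is only an illustration: the `pack` helper and the little-endian byte order are assumptions (CPython stores code units in native byte order), and `pickKind` here takes a Go string rather than whatever the real objects/str_kind.go signature is.

```go
package main

// Kind is the storage width in bytes per code point.
type Kind uint8

const (
	Kind1 Kind = 1
	Kind2 Kind = 2
	Kind4 Kind = 4
)

// pickKind returns the narrowest kind that can hold every code point in s.
// Code points up to U+00FF fit in 1 byte, up to U+FFFF in 2 bytes.
func pickKind(s string) Kind {
	k := Kind1
	for _, r := range s {
		switch {
		case r > 0xFFFF:
			return Kind4 // cannot get any wider, stop scanning
		case r > 0xFF:
			k = Kind2
		}
	}
	return k
}

// pack (hypothetical helper) stores each code point of s in kind bytes,
// little-endian, so that len(data) == size * kind.
func pack(s string) (Kind, []byte) {
	kind := pickKind(s)
	data := make([]byte, 0, len(s)*int(kind))
	for _, r := range s {
		for i := 0; i < int(kind); i++ {
			data = append(data, byte(uint32(r)>>(8*uint(i))))
		}
	}
	return kind, data
}
```

With this sketch, `pack("a\u20ac")` picks Kind2 and yields the four bytes 61 00 AC 20.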
ASCII fast path
Str.ascii lets every method short-circuit when both operands are
ASCII. CPython does the same via the state.ascii bit on the
header.
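As a rough illustration of the short-circuit, the sketch below gives a substring search two paths: when both operands are ASCII the byte offset equals the code-point index, so it can defer to `bytes.Index`; otherwise it compares code points one by one. The cut-down `Str`, `newStr`, `at`, `size`, and `find` are all hypothetical stand-ins, and little-endian code units are assumed.

```go
package main

import "bytes"

// Str is a cut-down stand-in for the real struct (no header, hash, interning).
type Str struct {
	kind  uint8  // 1, 2, or 4 bytes per code point
	data  []byte // little-endian code units, length = size * kind
	ascii bool   // kind == 1 and every byte < 0x80
}

// newStr builds a Str from Go text, picking the narrowest kind.
func newStr(text string) *Str {
	s := &Str{kind: 1, ascii: true}
	for _, r := range text {
		if r >= 0x80 {
			s.ascii = false
		}
		if r > 0xFFFF {
			s.kind = 4
		} else if r > 0xFF && s.kind < 2 {
			s.kind = 2
		}
	}
	for _, r := range text {
		for b := 0; b < int(s.kind); b++ {
			s.data = append(s.data, byte(uint32(r)>>(8*uint(b))))
		}
	}
	return s
}

// size is the length in code points.
func (s *Str) size() int { return len(s.data) / int(s.kind) }

// at reads the i-th code point regardless of kind.
func (s *Str) at(i int) uint32 {
	var cp uint32
	w := int(s.kind)
	for b := 0; b < w; b++ {
		cp |= uint32(s.data[i*w+b]) << (8 * uint(b))
	}
	return cp
}

// find returns the code-point index of needle in s, or -1.
func find(s, needle *Str) int {
	if s.ascii && needle.ascii {
		// Fast path: one byte per code point on both sides.
		return bytes.Index(s.data, needle.data)
	}
	// General path: naive code-point comparison across kinds.
	n, m := s.size(), needle.size()
	for i := 0; i+m <= n; i++ {
		j := 0
		for j < m && s.at(i+j) == needle.at(j) {
			j++
		}
		if j == m {
			return i
		}
	}
	return -1
}
```

The general path is deliberately naive; the point is only that the ASCII check is a single flag test per operand before any per-character work.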
String interning
Python interns short identifier-shaped strings (__init__,
self, etc.) automatically. We intern the same set CPython
interns: any string passed to _Py_intern from the parser plus
the contents of Lib/keyword.py. Identity is cheap inside the
interned set; equality stays correct outside it.
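A minimal sketch of the idea, assuming a simplified `Str` that wraps a Go string and a mutex-guarded map as the intern set. The names `internSet` and `Equal` are hypothetical, and the real objects/str_intern.go may be organised differently.

```go
package main

import "sync"

// Str is simplified here to a Go string plus the interned flag.
type Str struct {
	value    string
	interned uint8 // 0 = not, 1 = mortal, 2 = immortal
}

var (
	internMu  sync.Mutex
	internSet = map[string]*Str{} // hypothetical intern table keyed by value
)

// Intern returns the canonical *Str for s's value, registering s itself
// if no string with that value has been interned yet.
func Intern(s *Str) *Str {
	internMu.Lock()
	defer internMu.Unlock()
	if canon, ok := internSet[s.value]; ok {
		return canon
	}
	s.interned = 1
	internSet[s.value] = s
	return s
}

// Equal uses pointer identity when it can and falls back to comparing
// values, so non-interned strings still compare correctly.
func Equal(a, b *Str) bool {
	if a == b {
		return true
	}
	if a.interned != 0 && b.interned != 0 {
		// Two distinct interned objects always hold different values.
		return false
	}
	return a.value == b.value
}
```

Two calls to `Intern` with the value "__init__" return the same pointer, while a fresh, non-interned "__init__" is a different object that still compares equal through the value fallback.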
Hashing
SipHash-1-3 over the byte payload (not the code-point sequence;
identical to CPython since the kind layout determines the bytes).
Cached in the hash field.
Codecs
The codecs registry lives in objects/codecs.go. v0.4 ships UTF-8,
UTF-16-le/be, UTF-32-le/be, ASCII, latin-1, and the error handlers
(strict, replace, ignore, backslashreplace, xmlcharrefreplace,
namereplace, surrogateescape, surrogatepass). The
charset-table codecs (cp1252, gb18030, etc.) are stdlib in CPython
too; they ride along once the import system lands (v0.8).
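One way the registry and error handlers might fit together is sketched below, with an ASCII encoder supporting three of the handlers. The `Codec` struct, `Register`, `normalize`, and the `[]rune` input are hypothetical simplifications; only the `LookupCodec` name comes from the checklist, and its real signature may differ. The name normalisation is a reduced version of what CPython does.

```go
package main

import (
	"errors"
	"strings"
)

// Codec is a hypothetical registry entry; only encoding is shown.
type Codec struct {
	Encode func(s []rune, errorsMode string) ([]byte, error)
}

var registry = map[string]Codec{}

// normalize lowercases the name and folds hyphens to underscores, so
// "UTF-8" and "utf_8" resolve to the same entry (simplified).
func normalize(name string) string {
	return strings.ReplaceAll(strings.ToLower(name), "-", "_")
}

// Register adds a codec under its normalized name.
func Register(name string, c Codec) { registry[normalize(name)] = c }

// LookupCodec finds a codec by name or reports that it is unknown.
func LookupCodec(name string) (Codec, error) {
	c, ok := registry[normalize(name)]
	if !ok {
		return Codec{}, errors.New("unknown encoding: " + name)
	}
	return c, nil
}

// encodeASCII handles the strict, replace, and ignore error modes.
// Any unrecognised mode is treated as strict in this sketch.
func encodeASCII(s []rune, mode string) ([]byte, error) {
	out := make([]byte, 0, len(s))
	for _, r := range s {
		switch {
		case r < 0x80:
			out = append(out, byte(r))
		case mode == "replace":
			out = append(out, '?')
		case mode == "ignore":
			// drop the unencodable character
		default:
			return nil, errors.New("'ascii' codec can't encode character")
		}
	}
	return out, nil
}

func init() {
	Register("ascii", Codec{Encode: encodeASCII})
}
```

Looking up "ASCII" and encoding "café" gives "caf?" with replace, "caf" with ignore, and an error with strict. A codec that is not registered, such as cp1252 before v0.8, fails at lookup rather than at encode time.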
Unicode tables
tools/unicode_gen/main.go reads the Unicode Character Database
and emits objects/unicode_tables_gen.go. Same data CPython's
Tools/unicode/makeunicodedata.py emits, just typed for Go.
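The shape of the generated output is not specified here. One plausible layout, shown purely as an assumption, is a sorted slice of inclusive code-point ranges per property, queried by binary search. The `runeRange` type, `inTable`, and the hand-copied whitespace list below are illustrative; the real table comes from the generator.

```go
package main

import "sort"

// runeRange is an inclusive code-point range (hypothetical table row).
type runeRange struct{ lo, hi rune }

// spaceRanges is a hand-written stand-in for a generated table: the
// code points Python's str.isspace() accepts, sorted by range.
var spaceRanges = []runeRange{
	{0x0009, 0x000D},
	{0x001C, 0x001F},
	{0x0020, 0x0020},
	{0x0085, 0x0085},
	{0x00A0, 0x00A0},
	{0x1680, 0x1680},
	{0x2000, 0x200A},
	{0x2028, 0x2029},
	{0x202F, 0x202F},
	{0x205F, 0x205F},
	{0x3000, 0x3000},
}

// inTable reports whether r falls inside any range of a sorted table.
func inTable(table []runeRange, r rune) bool {
	// Find the first range whose upper bound is not below r.
	i := sort.Search(len(table), func(i int) bool { return table[i].hi >= r })
	return i < len(table) && table[i].lo <= r
}

// isSpace is the predicate a classification layer would expose.
func isSpace(r rune) bool { return inTable(spaceRanges, r) }
```

Lookup cost is logarithmic in the number of ranges. A two-level index like the one CPython's generator emits would trade more table space for constant-time lookup.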
File mapping
| C source | Go target |
|---|---|
| Objects/unicodeobject.c | objects/str.go |
| kind layout / construction | objects/str_kind.go |
| methods | objects/str_methods.go |
| format / __format__ | objects/str_format.go |
| codecs registry | objects/codecs.go |
| utf-8 codec | objects/codec_utf8.go |
| utf-16/32, ascii, latin-1 | objects/codec_*.go |
| error handlers | objects/codec_errors.go |
| interning | objects/str_intern.go |
| Objects/unicodectype.c | objects/unicode_ctype.go |
| generated tables | objects/unicode_tables_gen.go |
| generator | tools/unicode_gen/ |
Checklist
Status legend: [x] shipped, [ ] pending, [~] partial / scaffold,
[n] deferred / not in scope this phase.
Files
- objects/str.go: Str struct, FromString, FromBytes (decode UTF-8), AsString, length, getitem, iter.
- objects/str_kind.go: pickKind, repack, widen. Kind selection on construction matches CPython.
- objects/str_methods.go: split, rsplit, splitlines, partition, rpartition, join, find, rfind, index, rindex, count, replace, strip, lstrip, rstrip, translate, startswith, endswith, expandtabs, center, ljust, rjust, zfill, encode, format, format_map, casefold, lower, upper, title, swapcase, capitalize, isalnum, isalpha, isascii, isdecimal, isdigit, isidentifier, islower, isupper, isnumeric, isprintable, isspace, istitle, removeprefix, removesuffix.
- objects/str_format.go: __format__ over the [[fill]align][sign][#][0][width][,_][.precision][type] mini-language.
- objects/codecs.go: registry, Encode, Decode, LookupCodec, error-handler registry.
- objects/codec_utf8.go: encode + decode + surrogate handling.
- objects/codec_utf16.go, codec_utf32.go, codec_ascii.go, codec_latin1.go.
- objects/codec_errors.go: strict, replace, ignore, backslashreplace, xmlcharrefreplace, namereplace, surrogateescape, surrogatepass.
- objects/str_intern.go: intern set, Intern, identity guarantees for compile-time-known strings.
- objects/unicode_ctype.go: classification + case-mapping predicates.
- objects/unicode_tables_gen.go: generated.
- tools/unicode_gen/main.go: generator from UCD.
Surface guarantees
- hash(str) matches CPython under PYTHONHASHSEED=0 across every kind (1/2/4 byte).
- repr(str) chooses the same quote style and escape style CPython does, including \x, \u, \U, \N{...}.
- 'a' in 'banana' performs the same Boyer-Moore-Horspool fallback CPython uses for medium needles.
- str.casefold() matches Unicode 16.0 (German ß -> 'ss', etc.).
- '\u00e9' == 'e\u0301' is False: precomposed and decomposed 'é' stay distinct (no auto-NFC).
- 'a'.encode('utf-8') == b'a', surrogate pairs round-trip with surrogatepass.
- str.isidentifier() follows the XID_Start / XID_Continue rules from PEP 3131.
- Interned identity: every keyword in keyword.kwlist is interned and returns the same object.
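For reference on the substring guarantee, a textbook Boyer-Moore-Horspool search over a 1-byte payload looks roughly like this. It is a plain Horspool sketch, not a transcription of CPython's search routine, which layers further heuristics on top. The function name `horspool` is hypothetical.

```go
package main

// horspool returns the index of the first occurrence of needle in
// haystack, or -1. Both are 1-byte-kind payloads.
func horspool(haystack, needle []byte) int {
	n, m := len(haystack), len(needle)
	if m == 0 {
		return 0
	}
	if m > n {
		return -1
	}

	// Bad-character table: how far the window may slide when the byte
	// under the window's last position is c. Bytes absent from the
	// needle (ignoring its last byte) allow a full shift of m.
	var shift [256]int
	for i := range shift {
		shift[i] = m
	}
	for i := 0; i < m-1; i++ {
		shift[needle[i]] = m - 1 - i
	}

	for pos := 0; pos+m <= n; {
		// Compare the window right to left.
		j := m - 1
		for j >= 0 && haystack[pos+j] == needle[j] {
			j--
		}
		if j < 0 {
			return pos
		}
		pos += shift[haystack[pos+m-1]]
	}
	return -1
}
```

Searching "banana" for "nan" inspects the window at offset 0, slides by 2 because the window ends in 'n', and matches at offset 2.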
Cross-references
- Hash key: 1661.
- Bytes <-> str round-trip: 1676.
- Format mini-language: 1660.
- \N{...} lookup table also used by 1644.
Out of scope
- Locale-aware case mapping (full Turkish, etc.). Stdlib.
- normalize() (NFD / NFC / NFKD / NFKC). Lives in the unicodedata stdlib module.
- The _string extension module. Stdlib bridge.