Skip to main content

1677. gopy unicode

What we are porting

Two files, ~17000 lines total. The single largest object port:

  • Objects/unicodeobject.c: PEP 393 kind-tagged string. Three storage kinds (1-byte, 2-byte, 4-byte) chosen at construction to fit the largest code point. Method surface (split, join, find, replace, format, encode, decode, casefold, etc.). String interning. Codecs registry hooks.
  • Objects/unicodectype.c: Unicode 16.0 classification tables. is_alnum, is_alpha, is_decimal, is_digit, is_lower, is_upper, is_title, is_space, is_xid_start, is_xid_continue, plus to_lower, to_upper, to_title, to_decimal, to_digit. Tables generated from the Unicode Character Database.

Go shape

// Str mirrors PyUnicodeObject. Kind picks the storage width.
type Str struct {
VarHeader
kind Kind // 1, 2, 4 bytes per code point
data []byte // length = size * kind
hash int64
ascii bool // shortcut: kind==1 && all bytes < 0x80
interned uint8 // 0 = not, 1 = mortal, 2 = immortal
}

type Kind uint8
const (
Kind1 Kind = 1
Kind2 Kind = 2
Kind4 Kind = 4
)

CPython 3.12+ also has a "compact" inline-storage variant. We do not optimise for that yet; the spill cost is one extra slice header.

ASCII fast path

Str.ascii lets every method short-circuit when both operands are ASCII. CPython does the same via the state.ascii bit on the header.

String interning

Python interns short identifier-shaped strings (__init__, self, etc.) automatically. We interns the same set CPython interns: any string passed to _Py_intern from the parser plus the contents of Lib/keyword.py. Identity is cheap inside the interned set; equality stays correct outside it.

Hashing

SipHash-1-3 over the byte payload (not the code-point sequence; identical to CPython since the kind layout determines the bytes). Cached in the hash field.

Codecs

The codecs registry lives in objects/codecs.go. v0.4 ships UTF-8, UTF-16-le/be, UTF-32-le/be, ASCII, latin-1, and the error handlers (strict, replace, backslashreplace, xmlcharrefreplace, namereplace, surrogateescape, surrogatepass). The charset-table codecs (cp1252, gb18030, etc.) are stdlib in CPython too; they ride along once the import system lands (v0.8).

Unicode tables

tools/unicode_gen/main.go reads the Unicode Character Database and emits objects/unicode_tables_gen.go. Same data CPython's Tools/unicode/makeunicodedata.py emits, just typed for Go.

File mapping

C sourceGo target
Objects/unicodeobject.cobjects/str.go
kind layout / constructionobjects/str_kind.go
methodsobjects/str_methods.go
format / __format__objects/str_format.go
codecs registryobjects/codecs.go
utf-8 codecobjects/codec_utf8.go
utf-16/32, ascii, latin-1objects/codec_*.go
error handlersobjects/codec_errors.go
interningobjects/str_intern.go
Objects/unicodectype.cobjects/unicode_ctype.go
generated tablesobjects/unicode_tables_gen.go
generatortools/unicode_gen/

Checklist

Status legend: [x] shipped, [ ] pending, [~] partial / scaffold, [n] deferred / not in scope this phase.

Files

  • objects/str.go: Str struct, FromString, FromBytes (decode UTF-8), AsString, length, getitem, iter.
  • objects/str_kind.go: pickKind, repack, widen. Kind selection on construction matches CPython.
  • objects/str_methods.go: split, rsplit, splitlines, partition, rpartition, join, find, rfind, index, rindex, count, replace, strip, lstrip, rstrip, translate, startswith, endswith, expandtabs, center, ljust, rjust, zfill, encode, format, format_map, casefold, lower, upper, title, swapcase, capitalize, isalnum, isalpha, isascii, isdecimal, isdigit, isidentifier, islower, isupper, isnumeric, isprintable, isspace, istitle, removeprefix, removesuffix.
  • objects/str_format.go: __format__ over the [[fill]align] [sign][#][0][width][,_][.precision][type] mini-language.
  • objects/codecs.go: registry, Encode, Decode, LookupCodec, error-handler registry.
  • objects/codec_utf8.go: encode + decode + surrogate handling.
  • objects/codec_utf16.go, codec_utf32.go, codec_ascii.go, codec_latin1.go.
  • objects/codec_errors.go: strict, replace, ignore, backslashreplace, xmlcharrefreplace, namereplace, surrogateescape, surrogatepass.
  • objects/str_intern.go: intern set, Intern, identity guarantees for compile-time-known strings.
  • objects/unicode_ctype.go: classification + case-mapping predicates.
  • objects/unicode_tables_gen.go: generated.
  • tools/unicode_gen/main.go: generator from UCD.

Surface guarantees

  • hash(str) matches CPython under PYTHONHASHSEED=0 across every kind (1/2/4 byte).
  • repr(str) chooses the same quote style and escape style CPython does, including \x, \u, \U, \N{...}.
  • 'a' in 'banana' performs the same Boyer-Moore-Horspool fallback CPython uses for medium needles.
  • str.casefold() matches Unicode 16.0 (German ß -> 'ss', etc.).
  • 'é' == 'é' is False (no auto-NFC).
  • 'a'.encode('utf-8') == b'a', surrogate pairs round-trip with surrogatepass.
  • str.isidentifier() follows the XID_Start / XID_Continue rules from PEP 3131.
  • Interned identity: every keyword in keyword.kwlist is interned and returns the same object.

Cross-references

  • Hash key: 1661.
  • Bytes <-> str round-trip: 1676.
  • Format mini-language: 1660.
  • \N{...} lookup table also used by 1644.

Out of scope

  • Locale-aware case mapping (full Turkish, etc.). Stdlib.
  • normalize() (NFD / NFC / NFKD / NFKC). Lives in unicodedata stdlib module.
  • The _string extension module. Stdlib bridge.