Include/cpython/unicodeobject.h: Unicode Object Layout

CPython's Unicode implementation uses three nested struct layouts so that most strings keep their character data in the same allocation as the object header instead of in a separate buffer. The string kind (1, 2, or 4 bytes per code point), compactness, and ASCII-only status are encoded in a single state bitfield inside PyASCIIObject, the struct at the base of every str object.

Map

Lines     Symbol                                            Kind
72–110    PyASCIIObject                                     struct
111–145   PyCompactUnicodeObject                            struct
146–185   PyUnicodeObject                                   struct
210–230   PyUnicode_KIND                                    macro
231–245   PyUnicode_DATA                                    macro
260–275   PyUnicode_GET_LENGTH                              macro
300–340   PyUnicode_AsUTF8AndSize                           function
380–420   PyUnicodeWriter                                   struct (3.14)
430–460   PyUnicodeWriter_Create / PyUnicodeWriter_Finish   functions (3.14)

Reading

Three-level struct hierarchy

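The three structs form a containment chain. PyASCIIObject is the base; PyCompactUnicodeObject embeds it; PyUnicodeObject embeds that. Every str object can be read through a PyASCIIObject pointer. Which larger struct is valid depends on the state bits: the PyCompactUnicodeObject fields exist unless the string is compact ASCII, and the data union of PyUnicodeObject exists only when compact is clear.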

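typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* number of code points */
    Py_hash_t hash;             /* cached hash, -1 if not yet computed */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;    /* 1, 2, or 4 bytes per code point */
        unsigned int compact:1; /* data follows the header in memory */
        unsigned int ascii:1;   /* all code points are below 128 */
    } state;
    wchar_t *wstr;              /* removed in 3.12 */
} PyASCIIObject;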

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;     /* byte length of utf8, excluding the NUL */
    char *utf8;                 /* cached UTF-8 representation, or NULL */
    Py_ssize_t wstr_length;     /* removed in 3.12 */
} PyCompactUnicodeObject;

typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
} PyUnicodeObject;

kind encodes the per-code-point width: PyUnicode_1BYTE_KIND (1), PyUnicode_2BYTE_KIND (2), PyUnicode_4BYTE_KIND (4). When compact and ascii are both set, the character buffer follows the PyASCIIObject header immediately in memory, so no heap pointer is stored at all.
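The relationship between the widest code point, the kind, and the ascii flag can be sketched in standalone C. This is a simplified model rather than CPython's code: the names model_kind, model_kind_for, model_is_ascii, and model_buffer_bytes are hypothetical, and the extra terminating code unit in the buffer size is an assumption based on how CPython sizes its character buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the kind constants; the numeric values equal the
   per-code-point widths listed above. */
enum model_kind {
    MODEL_1BYTE_KIND = 1,
    MODEL_2BYTE_KIND = 2,
    MODEL_4BYTE_KIND = 4
};

/* Smallest kind whose code units can hold max_char. */
static enum model_kind model_kind_for(uint32_t max_char)
{
    if (max_char < 0x100) {
        return MODEL_1BYTE_KIND;   /* Latin-1 range, includes ASCII */
    }
    if (max_char < 0x10000) {
        return MODEL_2BYTE_KIND;   /* Basic Multilingual Plane */
    }
    return MODEL_4BYTE_KIND;       /* everything else */
}

/* The ascii flag is stricter than "1-byte kind": U+00E9 is 1-byte
   but not ASCII. */
static int model_is_ascii(uint32_t max_char)
{
    return max_char < 0x80;
}

/* Bytes needed for the character buffer, assuming one extra
   terminating code unit after the last code point. */
static size_t model_buffer_bytes(size_t length, uint32_t max_char)
{
    return (length + 1) * (size_t)model_kind_for(max_char);
}
```

Under this model a three-character string containing U+20AC needs (3 + 1) * 2 = 8 bytes of character data, while the same length in pure ASCII needs 4.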

PyUnicode_KIND and PyUnicode_DATA

#define PyUnicode_KIND(op) \
    (assert(PyUnicode_Check(op)), \
     ((PyASCIIObject *)(op))->state.kind)

#define PyUnicode_DATA(op) \
    (assert(PyUnicode_Check(op)), \
     PyUnicode_IS_COMPACT(op) \
         ? (PyUnicode_IS_ASCII(op) \
                ? (void *)((PyASCIIObject *)(op) + 1) \
                : (void *)((PyCompactUnicodeObject *)(op) + 1)) \
         : ((PyUnicodeObject *)(op))->data.any)

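PyUnicode_DATA performs pointer arithmetic relative to the struct header when the string is compact: the data starts right after PyASCIIObject for compact ASCII strings and right after PyCompactUnicodeObject for other compact strings. Non-compact (legacy) strings store the buffer pointer in data.any. Strings created through PyUnicode_New and the constructors built on it, including PyUnicode_FromKindAndData, are always compact. Non-compact strings arise for instances of str subclasses and, before 3.12, from the deprecated PyUnicode_FromUnicode(NULL, size) API.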
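The three-way decision can be tried out without Python headers using stand-in structs. The sketch below is a simplified model, not the CPython definitions: ModelASCII, ModelCompact, ModelLegacy, model_data, and model_new_ascii are hypothetical names, and PyObject_HEAD, hash, and interned are left out to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for PyASCIIObject (object head, hash, interned omitted). */
typedef struct {
    ptrdiff_t length;
    struct {
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
    } state;
} ModelASCII;

/* Stand-in for PyCompactUnicodeObject. */
typedef struct {
    ModelASCII base;
    ptrdiff_t utf8_length;
    char *utf8;
} ModelCompact;

/* Stand-in for PyUnicodeObject; the union is reduced to one pointer. */
typedef struct {
    ModelCompact base;
    void *data;
} ModelLegacy;

/* Same decision tree as PyUnicode_DATA: compact strings find their data
   just past the header, non-compact strings follow a stored pointer. */
static void *model_data(void *op)
{
    ModelASCII *a = (ModelASCII *)op;
    if (a->state.compact) {
        return a->state.ascii ? (void *)(a + 1)
                              : (void *)((ModelCompact *)op + 1);
    }
    return ((ModelLegacy *)op)->data;
}

/* Allocate a compact ASCII string: header and characters share one block.
   The caller releases it with free(). */
static ModelASCII *model_new_ascii(const char *text)
{
    size_t n = strlen(text);
    ModelASCII *a = malloc(sizeof(ModelASCII) + n + 1);
    if (a == NULL) {
        return NULL;
    }
    a->length = (ptrdiff_t)n;
    a->state.kind = 1;
    a->state.compact = 1;
    a->state.ascii = 1;
    memcpy(a + 1, text, n + 1);   /* characters plus the trailing NUL */
    return a;
}
```

For a string built with model_new_ascii, model_data returns the address exactly sizeof(ModelASCII) bytes past the header, which is the "no heap pointer" property of compact ASCII strings. For a ModelLegacy value with compact cleared, it returns whatever pointer was stored in data.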

PyUnicodeWriter (3.14)

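Python 3.14 introduced PyUnicodeWriter as a public replacement for the internal _PyUnicodeWriter type, exposing an append-then-finish builder pattern. Because it is declared under Include/cpython/, it is part of the public C API but not of the limited API.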

typedef struct PyUnicodeWriter PyUnicodeWriter;

PyUnicodeWriter *PyUnicodeWriter_Create(Py_ssize_t length);
int PyUnicodeWriter_WriteStr(PyUnicodeWriter *writer, PyObject *str);
int PyUnicodeWriter_WriteChar(PyUnicodeWriter *writer, Py_UCS4 ch);
int PyUnicodeWriter_WriteUTF8(PyUnicodeWriter *writer,
                              const char *str, Py_ssize_t size);
PyObject *PyUnicodeWriter_Finish(PyUnicodeWriter *writer);
void PyUnicodeWriter_Discard(PyUnicodeWriter *writer);

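Create accepts a size hint; the writer over-allocates internally and upgrades the internal kind when a wide character is appended. The Write functions return 0 on success and -1 with an exception set on failure. Finish returns a ready-to-use str object and frees the writer. Discard frees the writer without returning an object and is meant for error paths. After either call the writer pointer is no longer valid.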
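The over-allocate and widen-on-demand behaviour can be illustrated with a small standalone model. This is a sketch of the idea only, not how CPython implements PyUnicodeWriter: ModelWriter and the model_* functions are hypothetical, the doubling growth policy is an assumption, and overflow checks on the capacity are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int kind;     /* 1, 2, or 4 bytes per code point */
    size_t len;   /* code points written so far */
    size_t cap;   /* code points the buffer can hold */
    void *buf;
} ModelWriter;

static int model_kind_of(uint32_t ch)
{
    return ch < 0x100 ? 1 : (ch < 0x10000 ? 2 : 4);
}

/* Read code point i using the writer's current kind. */
static uint32_t model_get(const ModelWriter *w, size_t i)
{
    switch (w->kind) {
    case 1:  return ((const uint8_t *)w->buf)[i];
    case 2:  return ((const uint16_t *)w->buf)[i];
    default: return ((const uint32_t *)w->buf)[i];
    }
}

/* Store code point ch at index i of a buffer with the given kind. */
static void model_put(void *buf, int kind, size_t i, uint32_t ch)
{
    switch (kind) {
    case 1:  ((uint8_t *)buf)[i] = (uint8_t)ch;   break;
    case 2:  ((uint16_t *)buf)[i] = (uint16_t)ch; break;
    default: ((uint32_t *)buf)[i] = ch;           break;
    }
}

/* hint is a size hint in code points; 0 selects a small default. */
static ModelWriter *model_create(size_t hint)
{
    ModelWriter *w = malloc(sizeof *w);
    if (w == NULL) {
        return NULL;
    }
    w->kind = 1;
    w->len = 0;
    w->cap = hint ? hint : 8;
    w->buf = malloc(w->cap);      /* kind 1: one byte per code point */
    if (w->buf == NULL) {
        free(w);
        return NULL;
    }
    return w;
}

/* Returns 0 on success, -1 on allocation failure. */
static int model_write_char(ModelWriter *w, uint32_t ch)
{
    int need = model_kind_of(ch);
    int kind = need > w->kind ? need : w->kind;
    size_t cap = w->len < w->cap ? w->cap : w->cap * 2;   /* over-allocate */

    if (kind != w->kind || cap != w->cap) {
        void *wider = malloc(cap * (size_t)kind);
        if (wider == NULL) {
            return -1;
        }
        for (size_t i = 0; i < w->len; i++) {
            model_put(wider, kind, i, model_get(w, i));   /* widen in place */
        }
        free(w->buf);
        w->buf = wider;
        w->kind = kind;
        w->cap = cap;
    }
    model_put(w->buf, w->kind, w->len++, ch);
    return 0;
}

/* Error-path cleanup; accepts NULL. */
static void model_discard(ModelWriter *w)
{
    if (w != NULL) {
        free(w->buf);
        free(w);
    }
}
```

Writing 'h', U+00E9, U+20AC, and U+1F600 in that order to a writer created with a hint of 2 keeps kind 1 for the first two characters, moves to kind 2 (and doubles the capacity) on the third, and moves to kind 4 on the fourth. The characters written earlier are preserved across each widening.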

gopy notes

  • objects/str.go mirrors the three-level layout using Go struct embedding. The kind field is stored as a uint8 matching CPython's 3-bit encoding.
  • Compact ASCII strings are stored as a plain Go string value appended after the header struct; pointer arithmetic is avoided by keeping a []byte slice header pointing into the same allocation.
  • PyUnicode_DATA pointer logic is replaced by a data() method that switches on kind and returns the backing slice header.
  • The wstr and wstr_length fields are not ported; they were removed in CPython 3.12 and gopy targets 3.14 semantics only.
  • PyUnicodeWriter is tracked but not yet ported. Builder-pattern string construction currently uses Go's strings.Builder internally.