Objects/unicodeobject.c (search and format)

Source:

cpython 3.14 @ ab2d84fe1023/Objects/unicodeobject.c

Map

Symbol	Approx. lines	Purpose
`unicode_find`	9250-9310	Implements `str.find` and `str.index` via FASTSEARCH
`unicode_count`	9180-9245	Implements `str.count` via FASTSEARCH
`FASTSEARCH`	(Objects/stringlib/fastsearch.h)	Combined Bloom filter plus Boyer-Moore-Horspool
`unicode_split`	10200-10280	Implements `str.split(sep, maxsplit)`
`unicode_rsplit`	10285-10350	Implements `str.rsplit(sep, maxsplit)`
`unicode_splitlines`	10355-10420	Implements `str.splitlines(keepends)`
`unicode_format`	13800-14050	Implements the `%` operator for str
`_PyUnicodeWriter`	1800-1920	Incremental output buffer used by `unicode_format`
`unicode_encode_utf8`	5100-5250	Fast path for UTF-8 encoding of ASCII-compact strings
`PyUnicode_Join`	10800-10950	Implements `sep.join(iterable)`
`unicode_subscript`	3400-3500	Implements `str[i]` and `str[i:j:k]`

Reading

FASTSEARCH: Bloom filter plus Boyer-Moore-Horspool

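unicode_find and unicode_count both end up in FASTSEARCH, the search routine defined in Objects/stringlib/fastsearch.h and instantiated once per character width. Its default path records the pattern's characters in a one-word Bloom filter (an unsigned long, so 64 bits on typical 64-bit builds) and precomputes a single Horspool-style skip distance for the pattern's last character rather than a full bad-character table. When a window fails to match, an O(1) Bloom test on the character just past the window decides whether the search can jump ahead by the full pattern length. Single-character patterns bypass this machinery and are handled by a plain linear scan, and recent versions switch to a two-way matching algorithm for sufficiently long inputs.

unicode_find reaches FASTSEARCH in FAST_SEARCH mode through any_find_slice and returns the index of the first match, or -1 if there is none. unicode_count runs it in FAST_COUNT mode and counts non-overlapping matches within the requested start:end range.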

// CPython: Objects/unicodeobject.c:9250 unicode_find
static Py_ssize_t
unicode_find(PyObject *self, PyObject *args)
{
    ...
    result = any_find_slice(self, substring, start, end, 1);
    ...
}

// CPython: Objects/unicodeobject.c:9180 unicode_count
static PyObject *
unicode_count(PyObject *self, PyObject *args)
{
    ...
    /* Dispatch on the string kind to the stringlib count helper,
       which runs FASTSEARCH in FAST_COUNT mode. */
    ...
}
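
The Bloom-mask-plus-skip idea described above can be sketched in a few lines of standalone C. This is a simplified illustration for 1-byte characters only, not the stringlib code: the names sketch_find and sketch_find_str are hypothetical, empty patterns are left to the caller, and the bounds check before each Bloom test is an addition needed because a plain C buffer has no guaranteed terminator.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* One machine word used as a tiny Bloom filter: bit (ch mod width). */
#define BLOOM_WIDTH (sizeof(unsigned long) * CHAR_BIT)
#define BLOOM_ADD(mask, ch) ((mask) |= 1UL << ((ch) & (BLOOM_WIDTH - 1)))
#define BLOOM_HAS(mask, ch) ((mask) & (1UL << ((ch) & (BLOOM_WIDTH - 1))))

/* Return the index of the first occurrence of p[0..m) in s[0..n), or -1.
   Hypothetical helper; empty patterns return -1 and are left to the caller. */
static ptrdiff_t
sketch_find(const unsigned char *s, ptrdiff_t n,
            const unsigned char *p, ptrdiff_t m)
{
    assert(n >= 0 && m >= 0);
    if (m == 0 || m > n)
        return -1;
    if (m == 1) {                       /* single character: linear scan */
        const unsigned char *hit = memchr(s, p[0], (size_t)n);
        return hit ? hit - s : -1;
    }

    ptrdiff_t mlast = m - 1;            /* index of the last pattern char */
    ptrdiff_t skip = mlast;             /* shift after a failed candidate */
    unsigned long mask = 0;

    for (ptrdiff_t i = 0; i < mlast; i++) {
        BLOOM_ADD(mask, p[i]);
        if (p[i] == p[mlast])
            skip = mlast - i - 1;
    }
    BLOOM_ADD(mask, p[mlast]);

    ptrdiff_t w = n - m;                /* last valid window start */
    for (ptrdiff_t i = 0; i <= w; i++) {
        if (s[i + mlast] == p[mlast]) {
            if (memcmp(s + i, p, (size_t)mlast) == 0)
                return i;
            /* Miss: if the char after the window is not in the pattern,
               no window covering it can match, so jump past it. */
            if (i < w && !BLOOM_HAS(mask, s[i + m]))
                i += m;
            else
                i += skip;
        }
        else if (i < w && !BLOOM_HAS(mask, s[i + m])) {
            i += m;
        }
    }
    return -1;
}

/* Convenience wrapper for NUL-terminated strings. */
static ptrdiff_t
sketch_find_str(const char *s, const char *p)
{
    return sketch_find((const unsigned char *)s, (ptrdiff_t)strlen(s),
                       (const unsigned char *)p, (ptrdiff_t)strlen(p));
}
```

A Bloom false positive only costs a missed skip, never a missed match, which is why a single word with a cheap modulo hash is sufficient here.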

unicode_split, unicode_rsplit, and unicode_splitlines

unicode_split and unicode_rsplit share a family of stringlib helpers: split_whitespace when no separator is given, split_char for a one-character separator, and a general split for longer separators, each with an rsplit counterpart. The maxsplit parameter is threaded through each helper and decremented on every successful split; when it reaches zero, the remainder of the string becomes the final element of the result list. unicode_splitlines has no maxsplit parameter and uses a separate helper.

unicode_rsplit mirrors unicode_split but scans from the right, building the result list in reverse and then calling PyList_Reverse before returning. unicode_splitlines uses a separate character-class table (LINEBREAK) to identify line boundaries and optionally retains the terminator in each segment when keepends=True.

// CPython: Objects/unicodeobject.c:10200 unicode_split
static PyObject *
unicode_split(PyObject *self, PyObject *args, PyObject *kwds)
{
    ...
    if (substring == Py_None)
        return split(self, NULL, maxcount);   /* NULL separator selects whitespace splitting */
    ...
    return split(self, substring, maxcount);
}
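
The maxsplit bookkeeping and the scan-from-the-right-then-reverse strategy can be illustrated with a minimal standalone sketch. It assumes a one-character separator and returns (start, length) spans instead of Python objects; sketch_span, sketch_split_char, and sketch_rsplit_char are hypothetical names, and a negative maxsplit is taken to mean "no limit".

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t start, len; } sketch_span;

/* Split s[0..n) on sep at most maxsplit times (negative = unlimited).
   The caller provides room for n + 1 spans; returns the field count. */
static size_t
sketch_split_char(const char *s, size_t n, char sep, long maxsplit,
                  sketch_span *out, size_t cap)
{
    assert(cap >= n + 1);
    size_t count = 0, start = 0;
    for (size_t i = 0; i < n && maxsplit != 0; i++) {
        if (s[i] == sep) {
            out[count++] = (sketch_span){start, i - start};
            start = i + 1;
            if (maxsplit > 0)
                maxsplit--;             /* one split consumed */
        }
    }
    out[count++] = (sketch_span){start, n - start};   /* remainder */
    return count;
}

/* Same contract, but splits are taken from the right. Fields are
   collected right to left and the array is reversed at the end. */
static size_t
sketch_rsplit_char(const char *s, size_t n, char sep, long maxsplit,
                   sketch_span *out, size_t cap)
{
    assert(cap >= n + 1);
    size_t count = 0, end = n;
    for (size_t i = n; i > 0 && maxsplit != 0; i--) {
        if (s[i - 1] == sep) {
            out[count++] = (sketch_span){i, end - i};
            end = i - 1;
            if (maxsplit > 0)
                maxsplit--;
        }
    }
    out[count++] = (sketch_span){0, end};             /* remainder */
    for (size_t a = 0, b = count; a + 1 < b; a++, b--) {
        sketch_span tmp = out[a];
        out[a] = out[b - 1];
        out[b - 1] = tmp;
    }
    return count;
}
```

With the input "a,b,c,d" and maxsplit 2, the forward version yields the fields a, b, and c,d, while the reverse version yields a,b, c, and d.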

// CPython: Objects/unicodeobject.c:10355 unicode_splitlines
static PyObject *
unicode_splitlines(PyObject *self, PyObject *args, PyObject *kwds)
{
    int keepends = 0;
    ...
    return PyUnicode_Splitlines(self, keepends);
}

unicode_format, _PyUnicodeWriter, unicode_encode_utf8, and PyUnicode_Join

unicode_format implements the % operator. It walks the format string looking for % conversion specifiers, dispatches each to a per-type formatter (e.g. %s calls PyObject_Str, %d calls formatlong), and appends each fragment to a _PyUnicodeWriter buffer.

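_PyUnicodeWriter is an incremental string builder. The writer struct itself usually lives on the caller's stack, but its buffer is an ordinary heap-allocated string object that is created on the first write and can be over-allocated to amortize later growth. The writer records the maximum code point its buffer can hold; when a write needs a wider character, the buffer is converted to the next representation (Latin-1, UCS-2, or UCS-4) at that point, so the buffer only widens when the data demands it. _PyUnicodeWriter_Finish then trims the buffer to its final length and returns it.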

// CPython: Objects/unicodeobject.c:1800 _PyUnicodeWriter_Init
void
_PyUnicodeWriter_Init(_PyUnicodeWriter *writer)
{
    memset(writer, 0, sizeof(*writer));
    writer->min_char = 127;
    ...
}
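
To make the widening behavior concrete, here is a small standalone sketch of a growable buffer that stores code points at 1, 2, or 4 bytes each and converts its storage only when a wider character arrives. It is an illustration under simplifying assumptions, not the CPython writer: sketch_writer and its helpers are hypothetical names, the buffer is a raw malloc block rather than a string object, and the growth policy is plain doubling.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    void *buf;        /* heap storage, NULL until the first write */
    size_t len, cap;  /* measured in characters */
    int kind;         /* bytes per character: 0 (empty), 1, 2 or 4 */
} sketch_writer;

static uint32_t
sketch_read(const sketch_writer *w, size_t i)
{
    switch (w->kind) {
    case 1:  return ((const uint8_t *)w->buf)[i];
    case 2:  return ((const uint16_t *)w->buf)[i];
    default: return ((const uint32_t *)w->buf)[i];
    }
}

static void
sketch_store(void *buf, int kind, size_t i, uint32_t ch)
{
    if (kind == 1)      ((uint8_t *)buf)[i] = (uint8_t)ch;
    else if (kind == 2) ((uint16_t *)buf)[i] = (uint16_t)ch;
    else                ((uint32_t *)buf)[i] = ch;
}

/* Append one code point. Returns 0 on success, -1 if allocation fails. */
static int
sketch_write_char(sketch_writer *w, uint32_t ch)
{
    int need = ch < 0x100 ? 1 : ch < 0x10000 ? 2 : 4;
    if (need > w->kind || w->len == w->cap) {
        /* Reallocate when the buffer is full or too narrow for ch,
           re-encoding existing characters at the new width. */
        int kind = need > w->kind ? need : w->kind;
        size_t cap = (w->len == w->cap) ? (w->cap ? w->cap * 2 : 8) : w->cap;
        void *nbuf = malloc(cap * (size_t)kind);
        if (nbuf == NULL)
            return -1;
        for (size_t i = 0; i < w->len; i++)
            sketch_store(nbuf, kind, i, sketch_read(w, i));
        free(w->buf);
        w->buf = nbuf;
        w->cap = cap;
        w->kind = kind;
    }
    sketch_store(w->buf, w->kind, w->len++, ch);
    return 0;
}
```

A zero-initialized sketch_writer is a valid empty writer, which parallels the memset in _PyUnicodeWriter_Init; the caller frees buf when done.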

unicode_encode_utf8 takes a fast path for compact ASCII strings (kind == PyUnicode_1BYTE_KIND with the ascii flag set): their internal buffer is already valid UTF-8, so it is copied directly without per-character transcoding. Other strings fall through to a loop that encodes each code point as a 1- to 4-byte UTF-8 sequence.
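
The two paths can be sketched in standalone C as follows. This is a minimal illustration, not the CPython encoder: sketch_utf8_encode and sketch_encode_1byte are hypothetical names, the is_ascii argument stands in for the string's ascii flag, and lone surrogates are simply rejected here, whereas a real encoder would hand them to an error handler.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one code point as UTF-8. Returns the number of bytes written
   (1 to 4), or 0 for surrogates and values above U+10FFFF. */
static size_t
sketch_utf8_encode(uint32_t ch, unsigned char out[4])
{
    if (ch < 0x80) {
        out[0] = (unsigned char)ch;
        return 1;
    }
    if (ch < 0x800) {
        out[0] = (unsigned char)(0xC0 | (ch >> 6));
        out[1] = (unsigned char)(0x80 | (ch & 0x3F));
        return 2;
    }
    if (ch < 0x10000) {
        if (ch >= 0xD800 && ch <= 0xDFFF)
            return 0;                   /* lone surrogate: rejected here */
        out[0] = (unsigned char)(0xE0 | (ch >> 12));
        out[1] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ch & 0x3F));
        return 3;
    }
    if (ch <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (ch >> 18));
        out[1] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (ch & 0x3F));
        return 4;
    }
    return 0;
}

/* Encode n one-byte (Latin-1) code points into out, which must have
   room for 2 * n bytes. ASCII input is copied without transcoding. */
static size_t
sketch_encode_1byte(const unsigned char *s, size_t n, int is_ascii,
                    unsigned char *out)
{
    if (is_ascii) {
        memcpy(out, s, n);              /* fast path: already UTF-8 */
        return n;
    }
    size_t pos = 0;
    for (size_t i = 0; i < n; i++)
        pos += sketch_utf8_encode(s[i], out + pos);
    return pos;
}
```

For example, U+20AC encodes to the three bytes E2 82 AC, and the Latin-1 string "caf\xE9" grows from four code points to five bytes.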

PyUnicode_Join pre-computes the total character count and maximum code point across all items and the separator in one pass, allocates a result buffer of exactly the right length and width, then fills it in a second pass. This avoids any reallocation during the copy phase.

// CPython: Objects/unicodeobject.c:10800 PyUnicode_Join
PyObject *
PyUnicode_Join(PyObject *separator, PyObject *seq)
{
    ...
    /* Calculate result length and maximum character */
    for (i = 0; i < seqlen; i++) {
        ...
        sz += PyUnicode_GET_LENGTH(item);
        maxchar = Py_MAX(maxchar, PyUnicode_MAX_CHAR_VALUE(item));
    }
    ...
}
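
The two-pass structure can be shown in isolation with a standalone sketch that joins arrays of code points. The types sketch_str and sketch_joined and the function sketch_join are hypothetical, and overflow checking on the total length, which a production implementation needs, is omitted for brevity.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { const uint32_t *cp; size_t len; } sketch_str;  /* code points */
typedef struct { void *data; size_t len; int kind; } sketch_joined;

static uint32_t
sketch_max(sketch_str s, uint32_t maxchar)
{
    for (size_t i = 0; i < s.len; i++)
        if (s.cp[i] > maxchar)
            maxchar = s.cp[i];
    return maxchar;
}

/* Copy s into data at character offset pos; returns the new offset. */
static size_t
sketch_copy(void *data, int kind, size_t pos, sketch_str s)
{
    for (size_t i = 0; i < s.len; i++, pos++) {
        if (kind == 1)      ((uint8_t *)data)[pos] = (uint8_t)s.cp[i];
        else if (kind == 2) ((uint16_t *)data)[pos] = (uint16_t)s.cp[i];
        else                ((uint32_t *)data)[pos] = s.cp[i];
    }
    return pos;
}

/* Join n items with sep. Returns 0 on success, -1 if allocation fails.
   The caller frees out->data. */
static int
sketch_join(sketch_str sep, const sketch_str *items, size_t n,
            sketch_joined *out)
{
    size_t sz = 0;
    uint32_t maxchar = 0;

    /* Pass 1: total length and widest character (separator included
       only when it will actually be emitted). */
    if (n > 1)
        maxchar = sketch_max(sep, maxchar);
    for (size_t i = 0; i < n; i++) {
        sz += items[i].len + (i != 0 ? sep.len : 0);
        maxchar = sketch_max(items[i], maxchar);
    }

    /* Single allocation of exactly the right width and length. */
    int kind = maxchar < 0x100 ? 1 : maxchar < 0x10000 ? 2 : 4;
    void *data = malloc(sz ? sz * (size_t)kind : 1);
    if (data == NULL)
        return -1;

    /* Pass 2: fill the buffer; no reallocation can occur. */
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        if (i != 0)
            pos = sketch_copy(data, kind, pos, sep);
        pos = sketch_copy(data, kind, pos, items[i]);
    }
    out->data = data;
    out->len = sz;
    out->kind = kind;
    return 0;
}
```

Joining "a" and "€" with ", " produces four characters stored at two bytes each, because the euro sign forces the whole result out of the one-byte representation.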

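unicode_subscript dispatches on the index type. An integer index is normalized (negative values count from the end), the character at that position is read, and the result is wrapped in a one-character string via unicode_char. A slice with step 1 calls PyUnicode_Substring, which copies the selected range into a new string of the appropriate kind (1, 2, or 4 bytes per code point); any other step gathers the selected characters one at a time into a freshly allocated string.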

// CPython: Objects/unicodeobject.c:3400 unicode_subscript
static PyObject *
unicode_subscript(PyObject *self, PyObject *item)
{
    if (_PyIndex_Check(item)) {
        ...
        return unicode_char(ch);
    }
    else if (PySlice_Check(item)) {
        ...
        return PyUnicode_Substring(self, start, stop);
    }
    ...
}
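
For the extended-slice case, the length arithmetic and the strided copy can be sketched as below. The sketch works on a plain char buffer and assumes start and stop have already been clamped the way Python's slice adjustment does (so a full reverse slice of a 7-character string arrives as start 6, stop -1, step -1); sketch_slice_len and sketch_slice are hypothetical names.

```c
#include <stddef.h>

/* Number of elements selected by an already-adjusted slice,
   or -1 if step is zero (invalid). */
static ptrdiff_t
sketch_slice_len(ptrdiff_t start, ptrdiff_t stop, ptrdiff_t step)
{
    if (step > 0)
        return start < stop ? (stop - start - 1) / step + 1 : 0;
    if (step < 0)
        return stop < start ? (start - stop - 1) / (-step) + 1 : 0;
    return -1;
}

/* Copy the selected characters of s into out (NUL-terminated).
   out must have room for the slice length plus one byte.
   Returns the slice length, or -1 if step is zero. */
static ptrdiff_t
sketch_slice(const char *s, ptrdiff_t start, ptrdiff_t stop,
             ptrdiff_t step, char *out)
{
    ptrdiff_t n = sketch_slice_len(start, stop, step);
    ptrdiff_t cur = start;
    for (ptrdiff_t i = 0; i < n; i++, cur += step)
        out[i] = s[cur];
    if (n >= 0)
        out[n] = '\0';
    return n;
}
```

Applied to "abcdefg", the slice 1:6:2 selects bdf, and 5:1:-2 selects fd.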

gopy notes

Status: not yet ported.

Planned package path: objects/ (as objects.Str, alongside objects/str.go).

The FASTSEARCH algorithm translates cleanly to a Go generic function parameterized on element width (uint8, uint16, uint32). The _PyUnicodeWriter builder maps to a strings.Builder wrapper that also tracks maxChar for deferred kind selection. UTF-8 encoding is partially handled by Go's native string representation, but the kind-selection logic (Latin-1 vs UCS-2 vs UCS-4 compact storage) still needs a full port before unicode_format and PyUnicode_Join can be translated correctly.
