Parser/tokenizer/string_tokenizer.c
cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/string_tokenizer.c
The simplest of the three tokenizer back-ends. The entire source is already in
memory, so there is no I/O, no line-read callback into Python, and no need for a
dynamically grown buffer. str_get_line never allocates or copies: it advances
tok->cur past the next \n and publishes the line's bounds through pointers into
the original buffer. The file is 148 lines and exposes two public factories:
_PyTokenizer_FromString (takes bytes already decoded by the caller) and
_PyTokenizer_FromUTF8 (performs UTF-8 validation before proceeding).
Both are called from builtin_compile_impl and the ceval fast-paths for
eval() and exec().
Map
| Lines | Symbol | Role | gopy |
|---|---|---|---|
| 1-22 | buf_get_line / str_get_line | underflow callback: advances tok->cur to the next \n within the fixed buffer. | (*StringTokenizer).readline |
| 24-78 | _PyTokenizer_FromString | Factory: sets tok->str, tok->input, tok->end; installs str_get_line. | NewStringTokenizer |
| 80-124 | _PyTokenizer_FromUTF8 | Validates UTF-8 then delegates to _PyTokenizer_FromString. | NewUTF8Tokenizer |
| 126-148 | encoding header detection | Scans the first two lines of a byte-string for a coding: comment; rejects non-UTF-8 encodings. | detectEncoding |
Reading
_PyTokenizer_FromString (lines 24 to 78)
cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/string_tokenizer.c#L24-78
```c
struct tok_state *
_PyTokenizer_FromString(const char *str, int exec_input, int preserve_crlf)
{
    struct tok_state *tok = _PyTokenizer_New();
    if (tok == NULL)
        return NULL;
    tok->buf = tok->cur = tok->input = str;
    tok->end = str + strlen(str);
    tok->inp = tok->end;            /* all input is already present */
    tok->underflow = str_get_line;
    tok->encoding = new_string("utf-8", 5, tok);
    if (!tok->encoding)
        goto error;
    ...
    return tok;
error:
    _PyTokenizer_Free(tok);
    return NULL;
}
```
Three pointers into the same buffer: tok->buf marks the start of the current
line (str_get_line re-points it as it goes), tok->cur advances as the scanner
consumes characters, and tok->end marks one past the last byte and never
changes. Setting tok->inp = tok->end initially advertises the whole buffer as
loaded; from then on str_get_line re-points tok->inp at the end of each line it
delivers, so the underflow callback fires whenever tok->cur catches up with
tok->inp: at every line boundary, and finally at end of string.
exec_input controls whether a trailing newline is synthesised. CPython
requires all source to end with \n; when the string does not, the factory
copies the input into an internal buffer one byte longer and appends the
newline, so the lexer always sees a clean end-of-file.
preserve_crlf suppresses the \r\n normalisation that the lexer would
otherwise apply. The flag is set when the caller is the tokenize module
operating in round-trip mode.
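The combined effect of the two flags can be sketched in Go; normalizeSource is a hypothetical helper (gopy's real factory may fold this into construction):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeSource sketches the input translation implied by the two flags:
// collapse \r\n to \n unless preserveCRLF is set, and append the trailing
// newline the lexer requires when execInput asks for it.
func normalizeSource(src string, execInput, preserveCRLF bool) string {
	if !preserveCRLF {
		src = strings.ReplaceAll(src, "\r\n", "\n")
	}
	if execInput && !strings.HasSuffix(src, "\n") {
		src += "\n" // the C factory copies the buffer once to synthesise this byte
	}
	return src
}

func main() {
	fmt.Printf("%q\n", normalizeSource("a = 1\r\nb = 2", true, false)) // "a = 1\nb = 2\n"
}
```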
_PyTokenizer_FromUTF8 (lines 80 to 124)
cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/string_tokenizer.c#L80-124
```c
struct tok_state *
_PyTokenizer_FromUTF8(const char *str, int exec_input, int preserve_crlf)
{
    PyObject *decoded = PyUnicode_DecodeUTF8(str, strlen(str), NULL);
    if (decoded == NULL) {
        /* Let PyUnicode_DecodeUTF8 set the exception */
        return NULL;
    }
    Py_DECREF(decoded);  /* only the validation side effect is wanted */
    return _PyTokenizer_FromString(str, exec_input, preserve_crlf);
}
```
The validation is a full decode-then-discard: PyUnicode_DecodeUTF8 is called
with errors=NULL (strict mode). Any invalid byte sequence raises
UnicodeDecodeError immediately, before the tokenizer is even allocated.
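The Go port does not need the decode-then-discard round trip: unicode/utf8 validates in place without allocating. A sketch of NewUTF8Tokenizer under that assumption (struct layout and error text are invented for illustration):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// StringTokenizer is a stand-in for the port's tokenizer state (illustrative).
type StringTokenizer struct {
	src []byte
	cur int
	inp int
}

// NewUTF8Tokenizer mirrors _PyTokenizer_FromUTF8: validate first, construct
// only on success. utf8.Valid rejects any malformed sequence, matching
// CPython's strict-mode decode.
func NewUTF8Tokenizer(src []byte) (*StringTokenizer, error) {
	if !utf8.Valid(src) {
		return nil, fmt.Errorf("source is not valid UTF-8")
	}
	return &StringTokenizer{src: src, inp: len(src)}, nil
}

func main() {
	_, err := NewUTF8Tokenizer([]byte{0xff, 0xfe})
	fmt.Println(err != nil) // true: invalid bytes are rejected before construction
}
```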
This is the path taken by compile(source, ...) when source is a str.
CPython's str objects are internally stored as Latin-1, UCS-2, or UCS-4
depending on the maximum code point, but the source passed to compile has
already been re-encoded to UTF-8 by builtin_compile_impl before reaching
here.
str_get_line (lines 1 to 22)
cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/string_tokenizer.c#L1-22
```c
static int
str_get_line(struct tok_state *tok)
{
    const char *end = tok->end;   /* true end of the source buffer */
    const char *start = tok->cur;
    /* find the next newline or end of buffer */
    while (tok->cur < end && *tok->cur != '\n')
        tok->cur++;
    if (tok->cur < end)
        tok->cur++;               /* consume the '\n' */
    tok->buf = start;
    tok->inp = tok->cur;
    tok->line_start = start;
    return 1;
}
```
No allocation, no copy. The function updates tok->buf and tok->inp to
delimit the line just found within the original string, then returns 1 to
indicate a line is available. tok_nextc will now walk from tok->buf to
tok->inp one byte at a time.
When tok->cur is already at the end of the buffer on entry, the function
returns 1 with a zero-length line, which causes tok_nextc to return EOF on the
very next call. This is the clean end-of-string signal.
Encoding header detection (lines 126 to 148)
cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/string_tokenizer.c#L126-148
Even in-memory sources may carry a # coding: declaration in the first two
lines. _PyTokenizer_FromString scans those lines using the same
get_coding_spec helper as the file back-end. If a coding comment is found
and the declared encoding is not utf-8 (or a UTF-8 alias), a SyntaxError
is raised with the message "encoding declaration in Unicode string". The
string back-end only accepts UTF-8 source; if the caller has non-UTF-8 bytes
it must use the file tokenizer with explicit encoding.
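A sketch of detectEncoding built on the PEP 263 pattern, restricted to the first two lines; the regexp and signature are assumptions, not gopy's actual code:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// codingRe follows the pattern PEP 263 gives for coding declarations.
var codingRe = regexp.MustCompile(`^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)`)

// detectEncoding scans only the first two lines, like get_coding_spec's
// callers, and returns the declared encoding or "" if none is found.
func detectEncoding(src string) string {
	lines := strings.SplitN(src, "\n", 3)
	for i := 0; i < len(lines) && i < 2; i++ {
		if m := codingRe.FindStringSubmatch(lines[i]); m != nil {
			return m[1]
		}
	}
	return ""
}

func main() {
	fmt.Println(detectEncoding("#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n")) // latin-1
}
```

Mirroring the SyntaxError described above, the string back-end would then reject any result other than utf-8 or one of its aliases.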