
Parser/tokenizer/string_tokenizer.c

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/string_tokenizer.c

The simplest of the three tokenizer back-ends. The entire source is already in memory, so there is no I/O, no per-line read from an external source, and no need for a dynamically grown buffer: str_get_line delimits each line in place, advancing tok->inp past the next \n inside the original buffer. The file is 148 lines and exposes two public factories: _PyTokenizer_FromString (the bytes were already decoded by the caller) and _PyTokenizer_FromUTF8 (performs UTF-8 validation before proceeding).

Both are called from builtin_compile_impl and the ceval fast-paths for eval() and exec().
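All three call sites hand the back-end a complete in-memory string, which is easy to see from the Python level:

```python
# Each call below tokenizes a complete in-memory string: the whole
# source exists before tokenization starts, so no buffering is needed.
assert eval("1 + 1") == 2

ns = {}
exec("x = 2 * 21", ns)
assert ns["x"] == 42

code = compile("y = 'ok'", "<string>", "exec")
exec(code, ns)
assert ns["y"] == "ok"
```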

Map

| Lines | Symbol | Role | gopy |
| --- | --- | --- | --- |
| 1-22 | buf_get_line / str_get_line | underflow callback: delimits the next line within the fixed buffer. | (*StringTokenizer).readline |
| 24-78 | _PyTokenizer_FromString | Factory: sets tok->str, tok->input, tok->end; installs str_get_line. | NewStringTokenizer |
| 80-124 | _PyTokenizer_FromUTF8 | Validates UTF-8 then delegates to _PyTokenizer_FromString. | NewUTF8Tokenizer |
| 126-148 | encoding header detection | Scans the first two lines of a byte-string for a coding: comment; rejects non-UTF-8 encodings. | detectEncoding |

Reading

_PyTokenizer_FromString (lines 24 to 78)

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/string_tokenizer.c#L24-78

struct tok_state *
_PyTokenizer_FromString(const char *str, int exec_input, int preserve_crlf)
{
    struct tok_state *tok = _PyTokenizer_New();
    if (tok == NULL)
        return NULL;

    tok->buf = tok->cur = tok->inp = (char *)str;
    tok->str = tok->input = (char *)str;
    tok->end = (char *)str + strlen(str);  /* one past the last byte */
    tok->underflow = str_get_line;         /* delimits one line per call */
    tok->encoding = new_string("utf-8", 5, tok);
    if (!tok->encoding)
        goto error;
    ...
    return tok;
error:
    _PyTokenizer_Free(tok);
    return NULL;
}

Four pointers into the same buffer: tok->buf marks the start of the current line, tok->cur advances as the scanner consumes characters, tok->inp marks one past the last byte that has been delimited as a line, and tok->end marks one past the last byte of the whole string. Because tok->inp starts out equal to tok->cur, the very first character request falls through to the underflow callback, which delimits the first line; from then on the callback runs once per line, whenever tok->cur catches up with tok->inp.

exec_input controls whether a trailing newline is synthesised. CPython requires exec-mode source to end with \n; when the string does not, the factory appends one (by making an internal copy one byte longer) so the lexer always sees a clean end of line before end-of-file.
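The normalisation is observable from Python: exec-mode source compiles identically with or without the final newline.

```python
# Both forms compile and run the same way; the tokenizer factory
# supplies the missing trailing '\n' for the first one.
without_nl = compile("x = 1", "<s>", "exec")
with_nl = compile("x = 1\n", "<s>", "exec")

ns1, ns2 = {}, {}
exec(without_nl, ns1)
exec(with_nl, ns2)
assert ns1["x"] == ns2["x"] == 1
```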

preserve_crlf suppresses the \r\n normalisation that the lexer would otherwise apply. The flag is set when the caller is the tokenize module operating in round-trip mode.

_PyTokenizer_FromUTF8 (lines 80 to 124)

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/string_tokenizer.c#L80-124

struct tok_state *
_PyTokenizer_FromUTF8(const char *str, int exec_input, int preserve_crlf)
{
    PyObject *decoded = PyUnicode_DecodeUTF8(str, strlen(str), NULL);
    if (decoded == NULL) {
        /* Let PyUnicode_DecodeUTF8 set the exception */
        return NULL;
    }
    Py_DECREF(decoded);  /* validation only: the decoded object is discarded */
    return _PyTokenizer_FromString(str, exec_input, preserve_crlf);
}

The validation is a full decode-then-discard: PyUnicode_DecodeUTF8 is called with errors=NULL (strict mode). Any invalid byte sequence raises UnicodeDecodeError immediately, before the tokenizer is even allocated.
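A Python-level sketch of the same strict decode-then-discard check (bytes.decode defaults to errors='strict', matching errors=NULL here; validate_utf8 is an illustrative name, not a CPython API):

```python
def validate_utf8(raw: bytes) -> None:
    # Decode-then-discard: we only care whether decoding succeeds.
    raw.decode("utf-8")  # strict mode raises UnicodeDecodeError on failure

validate_utf8(b"x = 1\n")            # valid ASCII is valid UTF-8
try:
    validate_utf8(b"x = '\xff'\n")   # 0xff can never appear in UTF-8
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised
```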

This is the path taken by compile(source, ...) when source is a str. CPython's str objects are internally stored as Latin-1, UCS-2, or UCS-4 depending on the maximum code point, but the source passed to compile has already been re-encoded to UTF-8 by builtin_compile_impl before reaching here.
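Because everything is re-encoded to UTF-8 before tokenization, code points from all three internal representations survive the trip; for example:

```python
# Latin-1-representable, UCS-2, and UCS-4 code points all round-trip
# through the UTF-8 re-encode performed before tokenization.
src = "a = 'é'; b = '\u2603'; c = '\U0001F40D'"
ns = {}
exec(compile(src, "<s>", "exec"), ns)
assert ns["a"] == "é"
assert ns["b"] == "\u2603"
assert ns["c"] == "\U0001F40D"
```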

str_get_line (lines 1 to 22)

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/string_tokenizer.c#L1-22

static int
str_get_line(struct tok_state *tok)
{
    const char *p = tok->cur;    /* first byte of the next line */
    if (p == tok->end) {
        tok->done = E_EOF;       /* nothing left: signal end of input */
        return 0;
    }
    /* find the next newline or end of buffer */
    while (p < tok->end && *p != '\n')
        p++;
    if (p < tok->end)
        p++;                     /* include the '\n' in the line */
    tok->inp = (char *)p;        /* one past the line just delimited */
    tok->line_start = tok->cur;
    return 1;
}

No allocation, no copy. The function leaves tok->cur pointing at the first byte of the line and advances tok->inp to one past its terminating \n, delimiting the line in place within the original string; it returns 1 to indicate a line is available. tok_nextc will now walk from tok->cur to tok->inp one byte at a time.

When tok->cur == tok->end at entry there is nothing left to delimit: the function sets tok->done = E_EOF and returns 0, which tok_nextc reports as EOF. This is the clean end-of-string signal.
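The whole pointer dance can be sketched with integer offsets standing in for the C pointers (a hypothetical model, not CPython code; StringTok, underflow, and next_char are illustrative names):

```python
class StringTok:
    """Index-based sketch of the string back-end's pointer model."""

    def __init__(self, src: str):
        self.src = src
        self.cur = 0          # next byte the scanner will consume
        self.inp = 0          # one past the last delimited line (starts == cur)
        self.end = len(src)   # one past the last byte
        self.done = False

    def underflow(self) -> bool:
        """Mirror of str_get_line: delimit the next line, no copying."""
        if self.cur == self.end:
            self.done = True  # E_EOF
            return False
        p = self.cur
        while p < self.end and self.src[p] != "\n":
            p += 1
        if p < self.end:
            p += 1            # include the '\n' in the line
        self.inp = p
        return True

    def next_char(self) -> str:
        """Mirror of tok_nextc: refill via underflow when cur reaches inp."""
        if self.cur == self.inp and not self.underflow():
            return ""         # EOF
        ch = self.src[self.cur]
        self.cur += 1
        return ch

tok = StringTok("a\nb")
assert "".join(iter(tok.next_char, "")) == "a\nb"
```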

Encoding header detection (lines 126 to 148)

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/string_tokenizer.c#L126-148

Even in-memory sources may carry a # coding: declaration in the first two lines. _PyTokenizer_FromString scans those lines using the same get_coding_spec helper as the file back-end. If a coding comment is found and the declared encoding is not utf-8 (or a UTF-8 alias), a SyntaxError is raised with the message "encoding declaration in Unicode string". The string back-end only accepts UTF-8 source; if the caller has non-UTF-8 bytes it must use the file tokenizer with explicit encoding.