Parser/tokenizer/helpers.c
cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/helpers.c
The utility layer shared by all three tokenizer back-ends (file, string,
readline). Back-end-specific code lives in file_tokenizer.c,
string_tokenizer.c, and readline_tokenizer.c; everything they have in
common lives here.
The two most-referenced pieces are _PyTokenizer_New / _PyTokenizer_Free
(lifecycle) and the _PyToken_OneChar / _PyToken_TwoChars /
_PyToken_ThreeChars operator-dispatch table. A third notable function,
_PyTokenizer_FindEncodingFilename, provides the encoding-detection entry
point for the -c flag and import machinery when no tok_state is yet
alive.
Map
| Lines | Symbol | Role | gopy |
|---|---|---|---|
| 1-48 | _PyTokenizer_New | callocs a tok_state, sets safe defaults (tabsize=8, done=E_OK). | newTokenizer in parser/tokenizer/helpers.go |
| 50-118 | _PyTokenizer_Free | Releases every heap field: line buffer, encoding string, prompt copies, mode stack. | (*Tokenizer).Free |
| 120-188 | _PyToken_OneChar | Single-character operator dispatch (switch on c1). | parser/tokenizer/op.go:OneChar |
| 190-328 | _PyToken_TwoChars | Two-character operator dispatch (nested switch c1, c2). | parser/tokenizer/op.go:TwoChars |
| 330-400 | _PyToken_ThreeChars | Three-character operator dispatch (triple-nested switch). | parser/tokenizer/op.go:ThreeChars |
| 402-530 | _PyTokenizer_FindEncodingFilename | Opens a file, reads two lines, returns any coding: declaration. | FindEncodingFilename |
| 532-581 | _PyTokenizer_Get | Public entry: delegates to tok->underflow-backed tok_get; coerces decode errors to ERRORTOKEN. | (*Tokenizer).Get |
Reading
_PyTokenizer_New (lines 1 to 48)
cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/helpers.c#L1-48
struct tok_state *
_PyTokenizer_New(void)
{
    struct tok_state *tok = (struct tok_state *)PyMem_Calloc(1, sizeof(*tok));
    if (tok == NULL) {
        PyErr_NoMemory();
        return NULL;
    }
    tok->filename = NULL;
    tok->decoding_state = STATE_INIT;
    tok->decoding_erred = 0;
    tok->indent = 0;
    tok->indstack[0] = 0;
    tok->atbol = 1;             /* very first token is at beginning of line */
    tok->pendin = 0;
    tok->tabsize = 8;           /* default tab width for indentation */
    tok->done = E_OK;
    tok->async_def = 0;
    tok->async_def_indent = 0;
    tok->async_def_nl = 0;
    tok->tok_mode_stack_index = 0;
    tok->tok_mode_stack[0].kind = TOK_REGULAR_MODE;
    return tok;
}
PyMem_Calloc zeroes the struct, so most fields need no explicit
initialisation; several of the stores above (decoding_erred, indent, pendin,
and friends) merely rewrite zero to document intent. The ones that matter are
those where zero is not the right default: atbol=1 means "we are at the
beginning of the very first line", so indentation checking fires immediately;
tabsize=8 is the tab width the tokenizer has always assumed when comparing
indentation; and tok_mode_stack[0].kind = TOK_REGULAR_MODE puts the f-string
mode stack in its ground state.
In the gopy mirror, newTokenizer() allocates a Tokenizer value on the Go
heap and sets the same fields. The mode stack becomes a []mode slice with
one pre-pushed entry.
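A minimal sketch of what that mirror can look like. The names newTokenizer, Tokenizer, and mode come from the map above; the individual field names and types here are illustrative assumptions, not the port's actual declarations. Go's zero values play the role of PyMem_Calloc, so only the non-zero defaults are spelled out:

```go
package main

import "fmt"

// modeKind mirrors the C tok_mode.kind discriminator: regular source
// text vs. the inside of an f-string.
type modeKind int

const tokRegularMode modeKind = 0

type mode struct{ kind modeKind }

// Tokenizer is a sketch of the gopy tok_state mirror (field names are
// hypothetical).
type Tokenizer struct {
	atbol     bool   // at beginning of line; true for the very first token
	tabsize   int    // tab width used for indentation bookkeeping
	indstack  []int  // indentation stack; starts with a single 0 entry
	modeStack []mode // f-string mode stack with one pre-pushed entry
}

// newTokenizer mirrors _PyTokenizer_New: the zero value stands in for
// the calloc'd struct, so only non-zero defaults are set explicitly.
func newTokenizer() *Tokenizer {
	return &Tokenizer{
		atbol:     true,
		tabsize:   8,
		indstack:  []int{0},
		modeStack: []mode{{kind: tokRegularMode}},
	}
}

func main() {
	tok := newTokenizer()
	fmt.Println(tok.tabsize, tok.atbol, len(tok.modeStack))
}
```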
_PyTokenizer_Free (lines 50 to 118)
cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/helpers.c#L50-118
void
_PyTokenizer_Free(struct tok_state *tok)
{
    if (tok->encoding != NULL)
        PyMem_Free(tok->encoding);
    Py_XDECREF(tok->filename);
    if (tok->fp != NULL && tok->fp != stdin)
        fclose(tok->fp);
    if (tok->input)
        PyMem_Free((char *)tok->input);
    PyMem_Free(tok);
}
The FILE* is closed here unless it is stdin, the one stream the tokenizer
never owns; a file opened by _PyTokenizer_FromFile is therefore released
with the state that wraps it. The input field is the copy of the source
string made for string tokenizers that needed a trailing newline appended;
it is NULL when the original buffer was used directly.
_PyToken_OneChar / TwoChars / ThreeChars (lines 120 to 400)
cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/helpers.c#L120-400
The three operator-dispatch functions implement a longest-match probe. The
lexer in tok_get_normal_mode calls them in sequence, keeping the last
non-OP result:
/* inside tok_get_normal_mode */
int type = _PyToken_OneChar(c);
if (/* peek succeeds */) {
    int type2 = _PyToken_TwoChars(c, c2);
    if (type2 != OP) {
        type = type2;
        if (/* peek succeeds */) {
            int type3 = _PyToken_ThreeChars(c, c2, c3);
            if (type3 != OP)
                type = type3;
            else
                tok_backup(tok, c3);
        }
    }
    else {
        tok_backup(tok, c2);
    }
}
_PyToken_OneChar covers the 23 single-character operators: ( ) [ ] { } ,
; : @ . + - * / | & ^ ~ < > = %. Its fall-through result is OP, which
doubles as the "no dedicated token" sentinel the caller checks when
probing for a longer operator.
_PyToken_TwoChars handles 32 two-character sequences, including the
walrus := (PEP 572), floor division //, the augmented matrix-multiply
assignment @=, the return-annotation arrow ->, and the Python-2-era
not-equal spelling <>.
_PyToken_ThreeChars covers five sequences: **=, ..., //=, <<=,
>>=.
The Go port in parser/tokenizer/op.go preserves the switch shape verbatim.
Token-kind constants become a type Kind int in parser/token/.
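The shape of the dispatch and the longest-match probe can be sketched together in Go. Only a handful of operators are shown, and the constant names here are illustrative rather than the port's actual parser/token identifiers; op doubles as the "no longer match" sentinel exactly as OP does in C:

```go
package main

import "fmt"

// A tiny subset of the token kinds; op is both a real kind and the
// dispatch functions' "no match" fall-through, mirroring C's OP.
const (
	op = iota
	plus
	plusEqual
	star
	doubleStar
	doubleStarEqual
)

// oneChar / twoChars / threeChars keep the C switch shape; the real
// tables cover 23, 32 and 5 cases respectively.
func oneChar(c1 byte) int {
	switch c1 {
	case '+':
		return plus
	case '*':
		return star
	}
	return op
}

func twoChars(c1, c2 byte) int {
	switch c1 {
	case '+':
		if c2 == '=' {
			return plusEqual
		}
	case '*':
		if c2 == '*' {
			return doubleStar
		}
	}
	return op
}

func threeChars(c1, c2, c3 byte) int {
	switch c1 {
	case '*':
		if c2 == '*' && c3 == '=' {
			return doubleStarEqual
		}
	}
	return op
}

// longestMatch replays the probe from tok_get_normal_mode: take the
// one-char result, upgrade to two chars if that matches, and only then
// try three. It returns the kind and how many bytes were consumed
// (the C code instead backs up unconsumed characters).
func longestMatch(src []byte) (kind, width int) {
	kind, width = oneChar(src[0]), 1
	if len(src) < 2 {
		return
	}
	if k2 := twoChars(src[0], src[1]); k2 != op {
		kind, width = k2, 2
		if len(src) >= 3 {
			if k3 := threeChars(src[0], src[1], src[2]); k3 != op {
				kind, width = k3, 3
			}
		}
	}
	return
}

func main() {
	k, w := longestMatch([]byte("**="))
	fmt.Println(k == doubleStarEqual, w)
}
```

Note the probe never tries three characters unless two already matched, which is why a token like ... is handled by the period branch of the lexer before this dispatch is reached.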
_PyTokenizer_FindEncodingFilename (lines 402 to 530)
cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/helpers.c#L402-530
const char *
_PyTokenizer_FindEncodingFilename(int fd, PyObject *filename)
{
    struct tok_state *tok;
    FILE *fp;
    const char *encoding = NULL;

    fd = dup(fd);
    if (fd < 0)
        return NULL;
    fp = fdopen(fd, "r");
    if (fp == NULL) {
        close(fd);
        return NULL;
    }
    tok = _PyTokenizer_FromFile(fp, NULL, NULL, NULL);
    if (tok == NULL) {
        fclose(fp);
        return NULL;
    }
    ...
    /* read at most two tokens; stop at NEWLINE or ENCODING */
    while (tok->done == E_OK) {
        struct token token;
        (void) _PyTokenizer_Get(tok, &token);
        if (token.type == ENCODING) {
            encoding = tok->encoding;
            tok->encoding = NULL;   /* steal the string */
            break;
        }
        if (token.type == NEWLINE)
            break;
    }
    _PyTokenizer_Free(tok);
    fclose(fp);
    return encoding;                /* caller must PyMem_Free */
}
Used by importlib._bootstrap_external (via _imp.source_hash) and the -c
command-line path to determine the encoding of a file before creating a full
tokenizer or parser. It duplicates the file descriptor so the caller's
position is not disturbed, then creates a throw-away tok_state just to
consume the ENCODING pseudo-token.
The caller receives a heap-allocated string and must call PyMem_Free on it.
The gopy equivalent returns a plain string (no explicit free needed).
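What the throw-away tokenizer ultimately extracts is the PEP 263 coding cookie. A simplified Go sketch of just that extraction (the function name findEncoding is hypothetical; the real C path also honours a UTF-8 BOM and falls back to utf-8 rather than returning nothing):

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// cookieRe is the PEP 263 pattern: a comment containing
// "coding[:=] <name>" on one of the first two lines.
var cookieRe = regexp.MustCompile(`^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)`)

// findEncoding scans at most two lines for a coding declaration and
// returns "" when there is none. The C function instead runs a
// disposable tokenizer and steals its ENCODING result.
func findEncoding(src string) string {
	sc := bufio.NewScanner(strings.NewReader(src))
	for i := 0; i < 2 && sc.Scan(); i++ {
		if m := cookieRe.FindStringSubmatch(sc.Text()); m != nil {
			return m[1]
		}
	}
	return ""
}

func main() {
	src := "#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nx = 1\n"
	fmt.Println(findEncoding(src))
}
```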
_PyTokenizer_Get (lines 532 to 581)
cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/helpers.c#L532-581
int
_PyTokenizer_Get(struct tok_state *tok, struct token *token)
{
    int result = tok_get(tok, token);
    if (tok->decoding_erred) {
        result = ERRORTOKEN;
        tok->done = E_DECODE;
    }
    return result;
}
The sole public entry point for advancing the tokenizer. All callers (the PEG
parser, tokenize.py's C accelerator, the encoding scanner above) go through
here. The decode-error coercion centralises a concern that would otherwise
have to be handled by every caller: if the codec machinery failed to decode a
non-UTF-8 file, the next token is unconditionally ERRORTOKEN and tok->done is
set to E_DECODE so no further scanning happens. The internal tok_get that
does the real work is defined in Parser/lexer/lexer.c:1616.