Parser/token.c

cpython 3.14 @ ab2d84fe1023/Parser/token.c

Auto-generated by Tools/build/generate_token.py from Grammar/Tokens. The file holds the token-kind string table and three operator-lookup helpers. Tiny at 250 lines, but it is the canonical list of token kinds the rest of the front end depends on.

Map

Lines      Symbol                    Role                                  gopy
8-79       _PyParser_TokenNames[]    String table indexed by token kind.   parser/token/names.go:Names
83-113     _PyToken_OneChar          Single-char operator dispatch.        parser/token/lookup.go:OneChar
115-197    _PyToken_TwoChars         Two-char operator dispatch.           parser/token/lookup.go:TwoChars
199-250    _PyToken_ThreeChars       Three-char operator dispatch.         parser/token/lookup.go:ThreeChars

Reading

_PyParser_TokenNames (lines 8 to 79)

cpython 3.14 @ ab2d84fe1023/Parser/token.c#L8-79

A flat const char * array indexed by the token-kind enum from Include/internal/pycore_token.h. The order matters: the lexer returns indices into this array, and tokenize.py uses the same ordering when round-tripping source.

const char * const _PyParser_TokenNames[] = {
"ENDMARKER",
"NAME",
"NUMBER",
"STRING",
"NEWLINE",
"INDENT",
"DEDENT",
...
"EXCLAMATION", /* line 63: f-string conversion suffix (3.12+) */
...
"FSTRING_START", /* line 68 */
"FSTRING_MIDDLE",
"FSTRING_END",
"TSTRING_START", /* line 71: PEP 750 template strings */
"TSTRING_MIDDLE",
"TSTRING_END",
"COMMENT",
"NL",
"<ERRORTOKEN>",
"<ENCODING>",
"<N_TOKENS>",
};

The three sentinels at the tail are not real tokens. <ERRORTOKEN> is what the lexer returns when it cannot recognise the next character. <ENCODING> is emitted at most once, as the first token from the file tokenizer, and carries the source encoding name. <N_TOKENS> is the count and never appears in token streams.

_PyToken_OneChar (lines 83 to 113)

cpython 3.14 @ ab2d84fe1023/Parser/token.c#L83-113

int
_PyToken_OneChar(int c1)
{
switch (c1) {
case '!': return EXCLAMATION;
case '%': return PERCENT;
case '&': return AMPER;
case '(': return LPAR;
case ')': return RPAR;
...
case '~': return TILDE;
}
return OP;
}

The single-character operators map straight through. Anything not listed (digits, letters, whitespace, quote characters) falls into return OP, which the caller treats as "no one-char match". Note that multi-char lead characters such as = and < appear here too, because each is also a complete operator on its own.

_PyToken_TwoChars (lines 115 to 197)

cpython 3.14 @ ab2d84fe1023/Parser/token.c#L115-197

Nested switch on c1 then c2. Three cases worth singling out:

case '<':
switch (c2) {
case '<': return LEFTSHIFT;
case '=': return LESSEQUAL;
case '>': return NOTEQUAL; /* line 166: Py2 <> kept for embedders */
}
break;
case '-':
switch (c2) {
case '=': return MINEQUAL;
case '>': return RARROW; /* line 148: return-annotation arrow */
}
break;
case ':':
switch (c2) {
case '=': return COLONEQUAL; /* line 159: walrus, PEP 572 */
}
break;

<> returns NOTEQUAL for Python 2 compatibility. The lexer rejects <> upstream, but the helper keeps the case for embedders that call it directly.

_PyToken_ThreeChars (lines 199 to 250)

cpython 3.14 @ ab2d84fe1023/Parser/token.c#L199-250

Five three-character operators in Python 3.14: **=, ..., //=, <<=, >>=. The function is structured as a triple-nested switch.

case '*':
switch (c2) {
case '*':
switch (c3) {
case '=': return DOUBLESTAREQUAL;
}
break;
}
break;
case '.':
switch (c2) {
case '.':
switch (c3) {
case '.': return ELLIPSIS;
}
break;
}
break;

The lexer probes optimistically: it classifies one character, then tries to extend to two, then three, keeping the longest non-OP match.

gopy mirror

The Go port preserves the table layout and switch shape. Token kinds become a type Kind int with Stringer; the string array is generated at package init from pycore_token.h. The escalating one/two/three-char probes live in parser/lexer/op.go, not the token package, so the token package stays purely tabular.

Regeneration

Edit Grammar/Tokens in CPython and rerun Tools/build/generate_token.py. gopy mirrors the same generator output. Do not hand-edit token.c.