Parser/lexer/lexer.c
cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c
The byte-level scanner. It consumes from tok->cur, produces struct token records, drives INDENT/DEDENT emission, and handles the
f-string / t-string mode stack (f-string tokenization arrived in 3.12, t-strings in 3.14). Everything is keyed off
a single struct tok_state (defined in Parser/lexer/state.h).
Three top-level entry points (tok_get_normal_mode,
tok_get_fstring_mode, and the public _PyTokenizer_Get) plus a
collection of helpers.
Map
| Lines | Symbol | Role | gopy |
|---|---|---|---|
| 9-23 | is_potential_identifier_* | ASCII + utf-8-lead classification. | parser/lexer/ident.go |
| 25-39 | TOK_GET_MODE / TOK_NEXT_MODE | F-string mode stack accessors. | (*Tokenizer).mode |
| 41-46 | FTSTRING_*, MAKE_TOKEN | Common token-emit macros. | emit* helpers |
| 59-96 | tok_nextc | Pull next char; refills the line buffer. | (*Tokenizer).nextc |
| 99-225 | tok_backup | Push back one char. | (*Tokenizer).backup |
| 227-280 | _PyLexer_update_ftstring_expr | Track current f-string expression. | updateFTStringExpr |
| 282-362 | lookahead | Multi-char lookahead without consuming. | lookahead |
| 364-411 | verify_identifier | UCD check via unicodedata. | verifyIdentifier |
| 413-499 | tok_decimal_tail | Scan integer/float tail. | decimalTail |
| 501-1391 | tok_get_normal_mode | The main scanner. ~890 lines. | (*Tokenizer).normalMode |
| 1393-1614 | tok_get_fstring_mode | Scan inside an f-string / t-string body. | (*Tokenizer).fstringMode |
| 1616-1624 | tok_get | Dispatch on mode stack top. | (*Tokenizer).Next |
| 1626-1635 | _PyTokenizer_Get | Public entry; coerces decode errors to ERRORTOKEN. | (*Tokenizer).Get |
Reading
identifier-character macros (lines 9-23)
cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c#L9-23
#define is_potential_identifier_start(c) (\
(c >= 'a' && c <= 'z')\
|| (c >= 'A' && c <= 'Z')\
|| c == '_'\
|| (c >= 128))
#define is_potential_identifier_char(c) (\
(c >= 'a' && c <= 'z')\
|| (c >= 'A' && c <= 'Z')\
|| (c >= '0' && c <= '9')\
|| c == '_'\
|| (c >= 128))
The >= 128 clause matches any byte of a UTF-8 sequence, lead or
continuation. The lexer admits every non-ASCII byte as a possible
identifier character at this stage; whether the decoded code point is
actually allowed is checked afterwards in verify_identifier (364-411)
using the Unicode character database.
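A minimal sketch of the same byte-level test on the gopy side (function names here are hypothetical, not necessarily what parser/lexer/ident.go uses):

```go
package lexer

// isPotentialIdentifierStart mirrors is_potential_identifier_start: ASCII
// letters, underscore, or any non-ASCII byte (>= 128), which is accepted
// provisionally and validated later against the Unicode tables.
func isPotentialIdentifierStart(c byte) bool {
	return (c >= 'a' && c <= 'z') ||
		(c >= 'A' && c <= 'Z') ||
		c == '_' ||
		c >= 128
}

// isPotentialIdentifierChar additionally admits ASCII digits.
func isPotentialIdentifierChar(c byte) bool {
	return (c >= '0' && c <= '9') || isPotentialIdentifierStart(c)
}
```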
tok_nextc (lines 59-96)
cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c#L59-96
static int
tok_nextc(struct tok_state *tok)
{
int rc;
for (;;) {
if (tok->cur != tok->inp) {
if ((unsigned int) tok->col_offset >= (unsigned int) INT_MAX) {
tok->done = E_COLUMNOVERFLOW;
return EOF;
}
tok->col_offset++;
return Py_CHARMASK(*tok->cur++); /* Fast path */
}
if (tok->done != E_OK) {
return EOF;
}
rc = tok->underflow(tok);
...
if (!rc) {
tok->cur = tok->inp;
return EOF;
}
tok->line_start = tok->cur;
if (contains_null_bytes(tok->line_start, tok->inp - tok->line_start)) {
_PyTokenizer_syntaxerror(tok, "source code cannot contain null bytes");
tok->cur = tok->inp;
return EOF;
}
}
Py_UNREACHABLE();
}
The fast path is one increment of tok->cur plus a Py_CHARMASK to
coerce the byte to an unsigned value. The slow path calls tok->underflow,
the polymorphic refill function installed by the tokenizer factory
(file, string, readline, utf-8). Two failure modes:
- col_offset reaches INT_MAX, which sets E_COLUMNOVERFLOW. The check runs before the increment, so absurdly long lines are rejected cleanly instead of overflowing the column counter.
- The refilled buffer contains a NUL byte. Python source may not contain NUL; the lexer emits a SyntaxError and stops.
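A minimal Go sketch of the same loop, assuming a hypothetical Tokenizer whose underflow field plays the role of tok->underflow (field and method names are illustrative, not the actual gopy code):

```go
package lexer

import (
	"bytes"
	"errors"
	"math"
)

const charEOF = -1 // sentinel return, analogous to EOF in the C code

type Tokenizer struct {
	buf       []byte                // current line buffer
	cur, inp  int                   // read cursor and end of valid data
	colOffset int                   // column within the current line
	err       error                 // sticky error, analogous to tok->done
	underflow func(*Tokenizer) bool // pluggable refill; false on EOF/error
}

func (t *Tokenizer) nextc() int {
	for {
		if t.cur != t.inp { // fast path: one byte from the buffer
			if t.colOffset >= math.MaxInt32 {
				t.err = errors.New("line too long (column overflow)")
				return charEOF
			}
			t.colOffset++
			c := t.buf[t.cur]
			t.cur++
			return int(c)
		}
		if t.err != nil {
			return charEOF
		}
		if !t.underflow(t) { // slow path: refill the line buffer
			t.cur = t.inp
			return charEOF
		}
		// Reject NUL bytes in the freshly read line, like contains_null_bytes.
		if bytes.IndexByte(t.buf[t.cur:t.inp], 0) >= 0 {
			t.err = errors.New("source code cannot contain null bytes")
			t.cur = t.inp
			return charEOF
		}
	}
}
```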
verify_identifier (lines 364-411)
cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c#L364-411
Once the lexer has collected what might be a NAME, verify_identifier
decodes it as UTF-8 and checks every code point against the UCD's
XID_Start / XID_Continue properties. A bad code point produces a
SyntaxError pointing at the offending character. This is what makes
café legal and ¬x not.
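On the Go side one could approximate the same check with the standard unicode tables; a sketch (L/Nl/Other_ID_Start and friends stand in for XID_Start / XID_Continue, so this is close to but not exactly the properties CPython consults, and it skips NFKC normalization):

```go
package lexer

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

func isIDStart(r rune) bool {
	return r == '_' ||
		unicode.In(r, unicode.L, unicode.Nl, unicode.Other_ID_Start)
}

func isIDContinue(r rune) bool {
	return isIDStart(r) ||
		unicode.In(r, unicode.Mn, unicode.Mc, unicode.Nd, unicode.Pc, unicode.Other_ID_Continue)
}

// verifyIdentifier reports the first offending code point, mirroring the
// SyntaxError-with-offset behaviour of the C function.
func verifyIdentifier(name string) error {
	for i, r := range name {
		if r == utf8.RuneError {
			return fmt.Errorf("invalid UTF-8 at byte %d", i)
		}
		ok := isIDContinue(r)
		if i == 0 {
			ok = isIDStart(r)
		}
		if !ok {
			return fmt.Errorf("invalid character %q (U+%04X) in identifier at byte %d", r, r, i)
		}
	}
	return nil
}
```

With this sketch, verifyIdentifier("café") returns nil while verifyIdentifier("¬x") rejects the ¬ (a math symbol, not an identifier character).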
tok_get_normal_mode (lines 501-1391)
cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c#L501-1391
The bulk of the file. A single function with a nextline: label at
the top and a colossal switch (c) over the next character. Each case
consumes the longest legal token and returns it through the MAKE_TOKEN macro.
The indentation block at the head of the function (514-580):
/* Get indentation level */
if (tok->atbol) {
int col = 0;
int altcol = 0;
tok->atbol = 0;
int cont_line_col = 0;
for (;;) {
c = tok_nextc(tok);
if (c == ' ') {
col++, altcol++;
}
else if (c == '\t') {
col = (col / tok->tabsize + 1) * tok->tabsize;
altcol = (altcol / ALTTABSIZE + 1) * ALTTABSIZE;
}
else if (c == '\014') {/* Control-L (formfeed) */
col = altcol = 0;
}
...
}
...
if (col == tok->indstack[tok->indent]) {
/* No change */
if (altcol != tok->altindstack[tok->indent]) {
return MAKE_TOKEN(_PyTokenizer_indenterror(tok));
}
}
else if (col > tok->indstack[tok->indent]) {
...
Two parallel columns are tracked: col uses the configured tab size,
altcol uses ALTTABSIZE = 1. They must agree at every indent level;
disagreement is the classic "inconsistent use of tabs and spaces"
TabError.
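In Go the same double bookkeeping could look like this (a fragment sketch; constant and parameter names are made up for illustration):

```go
package lexer

const altTabSize = 1 // mirrors ALTTABSIZE

// advanceIndent consumes one indentation character and returns the updated
// (col, altcol) pair: col honours the configured tab size, altcol treats a
// tab as a single column, and a formfeed resets both.
func advanceIndent(c byte, col, altcol, tabsize int) (int, int) {
	switch c {
	case ' ':
		col++
		altcol++
	case '\t':
		col = (col/tabsize + 1) * tabsize
		altcol = (altcol/altTabSize + 1) * altTabSize
	case '\f': // Control-L (formfeed)
		col, altcol = 0, 0
	}
	return col, altcol
}
```

A line indented with one tab and a line indented with eight spaces both reach col == 8 under the default tab size, but altcol is 1 versus 8; that disagreement across indent levels is exactly what triggers the TabError.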
Order of operations inside the main switch:
- Leading whitespace and indentation. Tabs and spaces are counted; when the first non-whitespace character of a logical line is reached, the column is compared against the indent stack. A deeper column pushes one INDENT token; a shallower column pops and emits one DEDENT per level.
- Comment handling. A # starts a comment; the lexer either skips to the end of the line or emits a TYPE_COMMENT if the comment begins with # type:.
- Newlines. A real \n outside brackets emits NEWLINE; inside brackets it is silently consumed (implicit line joining).
- String prefixes. r, b, u, f, t, plus all the case-insensitive combos. After parsing the prefix the lexer peeks for ' or "; if neither, it backs up and falls through to NAME.
- f-strings and t-strings. When the prefix is f or t the lexer emits FSTRING_START / TSTRING_START and pushes a new tokenizer_mode onto the stack. The next call to _PyTokenizer_Get dispatches to tok_get_fstring_mode.
- Numbers. Hex/oct/bin prefixes are handled inline; otherwise tok_decimal_tail consumes digits and underscores, plus an optional ., exponent, and j suffix.
- Operators. The lexer probes one, then two, then three characters via _PyToken_OneChar / _PyToken_TwoChars / _PyToken_ThreeChars (see token.c), keeping the longest match; a sketch of the probe follows this list.
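A sketch of the longest-match idea in Go. Instead of the three generated lookup functions, this illustrative version keeps a set of operator spellings and tries the 3-, then 2-, then 1-character prefix (table abbreviated):

```go
package lexer

var operators = map[string]bool{
	"**=": true, "//=": true, ">>=": true, "<<=": true, "...": true,
	"**": true, "//": true, ">>": true, "<<": true, "->": true, ":=": true,
	"==": true, "!=": true, "<=": true, ">=": true, "+=": true, "-=": true,
	"+": true, "-": true, "*": true, "/": true, "%": true, "@": true,
	"<": true, ">": true, "=": true, "(": true, ")": true, "[": true, "]": true,
	"{": true, "}": true, ",": true, ":": true, ".": true, ";": true,
	"&": true, "|": true, "^": true, "~": true, "!": true,
}

// matchOperator returns the longest operator that is a prefix of src,
// or "" if the next character does not start an operator.
func matchOperator(src string) string {
	for n := 3; n >= 1; n-- {
		if n <= len(src) && operators[src[:n]] {
			return src[:n]
		}
	}
	return ""
}
```

matchOperator("**=1") yields "**="; the real lexer reaches the same answer with the generated lookup functions, backing up any characters it over-read.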
tok_get_fstring_mode (lines 1393-1614)
cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c#L1393-1614
Scans inside an f-string body. The tricky part is the alternation
between FSTRING_MIDDLE chunks (raw text plus the {{ / }} escapes) and
the embedded expressions inside {...}. A { that opens a replacement
field switches the lexer back to normal mode (the f-string entry stays
on the mode stack); the matching } switches it back to f-string mode
and scanning of the template resumes. The function also handles the
!r/!s/!a conversions and the format spec introduced by :.
PEP 750 t-strings reuse the same function via the string_kind field
on tokenizer_mode; the only difference is which token kinds get
emitted (FTSTRING_MIDDLE / FTSTRING_END macros at 41-42).
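A sketch of what the mode stack might look like on the gopy side (type and field names are hypothetical): each FSTRING_START / TSTRING_START pushes an entry, a replacement-field { switches the top entry to regular scanning, and the matching } switches it back, so nested f-strings simply nest entries.

```go
package lexer

type modeKind int

const (
	regularMode  modeKind = iota // scan ordinary tokens
	ftstringMode                 // scan f-string / t-string template text
)

type stringKind int

const (
	fString stringKind = iota // f"..." -> FSTRING_* tokens
	tString                   // t"..." -> TSTRING_* tokens (PEP 750)
)

type mode struct {
	kind      modeKind
	strKind   stringKind // selects which token kinds get emitted
	raw       bool       // rf"..." and friends
	quote     byte       // ' or "
	quoteSize int        // 1 or 3
}

type modeStack []mode

func (s modeStack) top() *mode   { return &s[len(s)-1] }
func (s *modeStack) push(m mode) { *s = append(*s, m) }
func (s *modeStack) pop()        { *s = (*s)[:len(*s)-1] }

// enterExpr is called on a replacement-field '{': template scanning pauses
// and the expression is tokenized in regular mode.
func (s modeStack) enterExpr() { s.top().kind = regularMode }

// leaveExpr is called on the matching '}': resume scanning template text.
func (s modeStack) leaveExpr() { s.top().kind = ftstringMode }
```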
tok_get and _PyTokenizer_Get (lines 1616-1635)
cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c#L1616-1635
static int
tok_get(struct tok_state *tok, struct token *token)
{
tokenizer_mode *current_tok = TOK_GET_MODE(tok);
if (current_tok->kind == TOK_REGULAR_MODE) {
return tok_get_normal_mode(tok, current_tok, token);
} else {
return tok_get_fstring_mode(tok, current_tok, token);
}
}
int
_PyTokenizer_Get(struct tok_state *tok, struct token *token)
{
int result = tok_get(tok, token);
if (tok->decoding_erred) {
result = ERRORTOKEN;
tok->done = E_DECODE;
}
return result;
}
The public entry coerces a decode error to ERRORTOKEN with
E_DECODE. Keeping the coercion here means every internal caller can
ignore the decode-error flag.
Notes for the gopy mirror
- parser/lexer/lexer.go follows the same structure: one normalMode method and one fstringMode method.
- The tok_state -> Tokenizer rename is the only structural change.
- gopy uses rune for character classification but keeps the byte cursor; the >= 128 lead-byte trick still applies.
- The indent stack is a []int; the f-string mode stack is a slice of mode structs (see the struct sketch after this list).
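Putting those notes together, a hypothetical Tokenizer skeleton might look like this (a sketch, not the actual gopy declaration):

```go
package lexer

// mode is the Go counterpart of tokenizer_mode (see the sketch in the
// f-string section above); redeclared here only to keep this fragment
// self-contained.
type mode struct {
	kind    int // regular vs f-string/t-string scanning
	strKind int // f-string vs t-string
}

type Tokenizer struct {
	buf        []byte // current line buffer
	cur, inp   int    // byte cursor and end of valid data
	atBOL      bool   // at the beginning of a logical line
	tabSize    int
	indents    []int  // indent stack: one column per level
	altIndents []int  // parallel stack computed with ALTTABSIZE = 1
	pendingDed int    // DEDENT tokens still to emit
	modes      []mode // modes[0] is always a regular mode
	parenDepth int    // open (, [, { for implicit line joining
}
```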
CPython 3.14 changes worth noting
- EXCLAMATION token added for f-string conversion syntax.
- TSTRING_* triplet added for PEP 750.
- The # type: ignore[code] form is recognised; the bracketed code is captured in token.type_ignores.