Parser/lexer/lexer.c

cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c

The byte-level scanner. Consumes from tok->cur, produces struct token records, drives indent/dedent emission, and handles the f-string / t-string mode stack (introduced for PEP 701 f-strings in 3.12, extended to PEP 750 t-strings in 3.14). Everything is keyed off a single struct tok_state (defined in Parser/lexer/state.h).

Three top-level entry points (tok_get_normal_mode, tok_get_fstring_mode, and the public _PyTokenizer_Get) plus a collection of helpers.

Map

Lines        Symbol                           Role                                                  gopy
9-23         is_potential_identifier_*        ASCII + UTF-8 lead-byte classification.               parser/lexer/ident.go
25-39        TOK_GET_MODE / TOK_NEXT_MODE     F-string mode stack accessors.                        (*Tokenizer).mode
41-46        FTSTRING_*, MAKE_TOKEN           Common token-emit macros.                             emit* helpers
59-96        tok_nextc                        Pull next char; refills the line buffer.              (*Tokenizer).nextc
99-225       tok_backup                       Push back one char.                                   (*Tokenizer).backup
227-280      _PyLexer_update_ftstring_expr    Track current f-string expression.                    updateFTStringExpr
282-362      lookahead                        Multi-char lookahead without consuming.               lookahead
364-411      verify_identifier                UCD check via unicodedata.                            verifyIdentifier
413-499      tok_decimal_tail                 Scan integer/float tail.                              decimalTail
501-1391     tok_get_normal_mode              The main scanner. ~890 lines.                         (*Tokenizer).normalMode
1393-1614    tok_get_fstring_mode             Scan inside an f-string / t-string body.              (*Tokenizer).fstringMode
1616-1624    tok_get                          Dispatch on mode stack top.                           (*Tokenizer).Next
1626-1635    _PyTokenizer_Get                 Public entry; coerces decode errors to ERRORTOKEN.    (*Tokenizer).Get

Reading

identifier-character macros (lines 9-23)

cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c#L9-23

#define is_potential_identifier_start(c) (\
              (c >= 'a' && c <= 'z')\
               || (c >= 'A' && c <= 'Z')\
               || c == '_'\
               || (c >= 128))

#define is_potential_identifier_char(c) (\
              (c >= 'a' && c <= 'z')\
               || (c >= 'A' && c <= 'Z')\
               || (c >= '0' && c <= '9')\
               || c == '_'\
               || (c >= 128))

The >= 128 clause matches any non-ASCII byte, lead or continuation. The lexer admits every such byte as a possible identifier character at this stage; whether the decoded code point is actually allowed is checked afterwards in verify_identifier (364-411) against the Unicode character database.
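
The mirror's classifiers translate almost mechanically. A minimal sketch, assuming the function names behind the Map's parser/lexer/ident.go entry (the exact signatures are not confirmed by the source):

// Byte-level classification, mirroring the C macros: any byte >= 128
// (i.e. any byte of a multi-byte UTF-8 sequence) is provisionally
// accepted; the real XID_Start/XID_Continue check happens later in
// verifyIdentifier.
func isPotentialIdentifierStart(c byte) bool {
    return (c >= 'a' && c <= 'z') ||
        (c >= 'A' && c <= 'Z') ||
        c == '_' ||
        c >= 128
}

func isPotentialIdentifierChar(c byte) bool {
    return isPotentialIdentifierStart(c) || (c >= '0' && c <= '9')
}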

tok_nextc (lines 59-96)

cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c#L59-96

static int
tok_nextc(struct tok_state *tok)
{
    int rc;
    for (;;) {
        if (tok->cur != tok->inp) {
            if ((unsigned int) tok->col_offset >= (unsigned int) INT_MAX) {
                tok->done = E_COLUMNOVERFLOW;
                return EOF;
            }
            tok->col_offset++;
            return Py_CHARMASK(*tok->cur++); /* Fast path */
        }
        if (tok->done != E_OK) {
            return EOF;
        }
        rc = tok->underflow(tok);
        ...
        if (!rc) {
            tok->cur = tok->inp;
            return EOF;
        }
        tok->line_start = tok->cur;

        if (contains_null_bytes(tok->line_start, tok->inp - tok->line_start)) {
            _PyTokenizer_syntaxerror(tok, "source code cannot contain null bytes");
            tok->cur = tok->inp;
            return EOF;
        }
    }
    Py_UNREACHABLE();
}

Fast path is one increment of tok->cur plus a Py_CHARMASK to coerce the byte into the unsigned 0-255 range. Slow path calls tok->underflow, the polymorphic refill function installed by the tokenizer factory (file, string, readline, utf-8). Two failure modes:

  • col_offset reaches INT_MAX, which sets E_COLUMNOVERFLOW. The counter is checked before the increment can overflow, so absurdly long lines are rejected cleanly.
  • The refilled buffer contains a NUL byte. Python source may not contain NUL; the lexer emits a SyntaxError and stops.
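
A sketch of the same fast/slow split on the Go side. The method name comes from the Map; the struct fields are a trimmed, assumed view of the state, with underflow standing in for tok->underflow:

// Trimmed view of the state nextc needs. cur and inp are indices into
// buf, playing the role of the C pointers tok->cur and tok->inp.
type Tokenizer struct {
    buf       []byte
    cur, inp  int
    colOffset int
    done      error
    underflow func(*Tokenizer) bool // refill buf; false on EOF or error
}

const eof = -1

func (t *Tokenizer) nextc() int {
    for {
        if t.cur != t.inp { // fast path: byte already buffered
            // Go's int is 64-bit on common targets, so the INT_MAX
            // column guard matters less than in C; mirror it if needed.
            t.colOffset++
            c := t.buf[t.cur]
            t.cur++
            return int(c)
        }
        if t.done != nil {
            return eof
        }
        if !t.underflow(t) { // slow path: pull in the next line
            t.cur = t.inp
            return eof
        }
        // The NUL-byte scan of the refilled line (contains_null_bytes
        // in C) would go here.
    }
}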

verify_identifier (lines 364-411)

cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c#L364-411

After we've collected what we thought might be a NAME, decode it as UTF-8 and check every code point against the UCD's XID_Start / XID_Continue properties. Bad code points produce a SyntaxError with a pointer at the offending character. This is what makes café legal and ¬x not.
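
Go's standard unicode package ships no XID_Start / XID_Continue tables, so a faithful mirror of verify_identifier needs a generated table. As a rough sketch of the shape only (the category mix below approximates, but does not equal, the XID properties):

import "unicode"

// Approximate XID check per rune. Letters, Nl and Other_ID_Start stand
// in for XID_Start; digits, marks and connector punctuation are added
// for XID_Continue. Close to, but not exactly, the UCD properties
// CPython consults via unicodedata.
func verifyIdentifier(name string) bool {
    for i, r := range name { // i is the byte offset; 0 means first rune
        start := unicode.IsLetter(r) || r == '_' ||
            unicode.Is(unicode.Nl, r) ||
            unicode.Is(unicode.Other_ID_Start, r)
        if i == 0 {
            if !start {
                return false
            }
            continue
        }
        if !start && !(unicode.IsDigit(r) ||
            unicode.Is(unicode.Mn, r) || unicode.Is(unicode.Mc, r) ||
            unicode.Is(unicode.Pc, r) ||
            unicode.Is(unicode.Other_ID_Continue, r)) {
            return false
        }
    }
    return true
}

Under this approximation verifyIdentifier("café") is true and verifyIdentifier("¬x") is false (¬ is U+00AC, category Sm), matching the C behaviour for those two examples.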

tok_get_normal_mode (lines 501-1391)

cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c#L501-1391

The bulk of the file. A single function with a nextline: label at the top and a colossal switch (c) over the next character. Inside each case it consumes the longest legal token and returns it through the MAKE_TOKEN macro.

The indentation block at the head of the function (514-580):

    /* Get indentation level */
    if (tok->atbol) {
        int col = 0;
        int altcol = 0;
        tok->atbol = 0;
        int cont_line_col = 0;
        for (;;) {
            c = tok_nextc(tok);
            if (c == ' ') {
                col++, altcol++;
            }
            else if (c == '\t') {
                col = (col / tok->tabsize + 1) * tok->tabsize;
                altcol = (altcol / ALTTABSIZE + 1) * ALTTABSIZE;
            }
            else if (c == '\014') { /* Control-L (formfeed) */
                col = altcol = 0;
            }
            ...
        }
        ...
        if (col == tok->indstack[tok->indent]) {
            /* No change */
            if (altcol != tok->altindstack[tok->indent]) {
                return MAKE_TOKEN(_PyTokenizer_indenterror(tok));
            }
        }
        else if (col > tok->indstack[tok->indent]) {
            ...

Two parallel columns are tracked: col uses the configured tab size, altcol uses ALTTABSIZE = 1. They must agree at every indent level; disagreement is the classic "inconsistent use of tabs and spaces" TabError.
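
A self-contained sketch of the double ruler (constants assumed; CPython defaults to a tab size of 8, and ALTTABSIZE is 1):

const (
    tabSize    = 8 // tok->tabsize default
    altTabSize = 1 // ALTTABSIZE
)

// measureIndent walks the leading whitespace of a line and returns both
// column counts. A tab advances col to the next multiple of tabSize but
// advances altcol by only one.
func measureIndent(line []byte) (col, altcol int) {
    for _, c := range line {
        switch c {
        case ' ':
            col++
            altcol++
        case '\t':
            col = (col/tabSize + 1) * tabSize
            altcol = (altcol/altTabSize + 1) * altTabSize
        case '\f': // formfeed resets both counts
            col, altcol = 0, 0
        default:
            return col, altcol
        }
    }
    return col, altcol
}

For example, "\tx" and "        x" (eight spaces) both give col = 8, but altcol is 1 versus 8; indenting one under the other trips the TabError check even though the lines look identically indented at tab size 8.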

Order of operations inside the main switch:

  1. Leading whitespace and indentation. Tabs and spaces are counted; when the first non-whitespace character of a logical line is reached, the column is compared against the indent stack. A deeper column pushes one INDENT token; a shallower column pops and emits one DEDENT per level.
  2. Comment handling. A # starts a comment; the lexer either skips to end of line or emits a TYPE_COMMENT if the comment begins with # type:.
  3. Newlines. A real \n outside brackets emits NEWLINE; inside brackets it is silently consumed (implicit line joining).
  4. String prefixes. r, b, u, f, t, plus all the case-insensitive combos. After parsing the prefix the lexer peeks for ' or "; if neither, it backs up and falls through to NAME.
  5. f-strings and t-strings. When the prefix is f or t the lexer emits FSTRING_START / TSTRING_START and pushes a new tokenizer_mode onto the stack. The next call to _PyTokenizer_Get dispatches to tok_get_fstring_mode.
  6. Numbers. Hex/oct/bin prefixes are handled inline; otherwise tok_decimal_tail consumes digits and underscores, plus optional ., exponent, and j suffix.
  7. Operators. The lexer probes one, then two, then three characters via _PyToken_OneChar / _PyToken_TwoChars / _PyToken_ThreeChars (see token.c), keeping the longest match (a minimal sketch follows this list).
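
For step 7, a minimal sketch of the longest-match probe, reusing the Tokenizer sketch from tok_nextc above. backup, oneChar, twoChars and threeChars are assumed helpers mirroring tok_backup and the token.c tables, returning a token kind or -1 on no match:

// operator resolves the longest operator starting at c: read ahead up
// to two more characters, then back out of whatever the longer tables
// reject.
func (t *Tokenizer) operator(c int) int {
    c2 := t.nextc()
    if kind := twoChars(c, c2); kind >= 0 {
        c3 := t.nextc()
        if kind3 := threeChars(c, c2, c3); kind3 >= 0 {
            return kind3 // e.g. **= or >>=
        }
        t.backup(c3)
        return kind // e.g. ** or ->
    }
    t.backup(c2)
    return oneChar(c) // single-character operator
}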

tok_get_fstring_mode (lines 1393-1614)

cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c#L1393-1614

Scans inside an f-string body. The tricky part is the alternation between FSTRING_MIDDLE chunks (raw text and the {{ / }} escapes) and the embedded expressions inside {...}. A { flips the top-of-stack mode to regular scanning (the f-string frame itself stays on the stack); the matching } flips it back, and scanning of the template resumes. The function also handles the !r / !s / !a conversion markers and the format spec introduced by :.

PEP 750 t-strings reuse the same function via the string_kind field on tokenizer_mode; the only difference is which token kinds get emitted (FTSTRING_MIDDLE / FTSTRING_END macros at 41-42).
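
A sketch of the mode stack itself, with the kind flip described above. All names beyond those in the Map are assumptions, including a modeStack []tokenizerMode field on Tokenizer:

type modeKind int

const (
    regularMode modeKind = iota // scanning ordinary tokens
    ftstringMode                // scanning an f-/t-string template
)

// tokenizerMode mirrors the C struct of the same name: one frame per
// open f-string or t-string. stringKind selects FSTRING_* versus
// TSTRING_* token kinds on emission.
type tokenizerMode struct {
    kind       modeKind
    stringKind byte // 'f' or 't'
    quote      byte // opening quote character
    raw        bool // r-prefix set
}

// mode returns the top of the stack. FSTRING_START/TSTRING_START push a
// frame, FSTRING_END/TSTRING_END pop it, and { / } only flip kind.
func (t *Tokenizer) mode() *tokenizerMode {
    return &t.modeStack[len(t.modeStack)-1]
}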

tok_get and _PyTokenizer_Get (lines 1616-1635)

cpython 3.14 @ ab2d84fe1023/Parser/lexer/lexer.c#L1616-1635

static int
tok_get(struct tok_state *tok, struct token *token)
{
    tokenizer_mode *current_tok = TOK_GET_MODE(tok);
    if (current_tok->kind == TOK_REGULAR_MODE) {
        return tok_get_normal_mode(tok, current_tok, token);
    } else {
        return tok_get_fstring_mode(tok, current_tok, token);
    }
}

int
_PyTokenizer_Get(struct tok_state *tok, struct token *token)
{
    int result = tok_get(tok, token);
    if (tok->decoding_erred) {
        result = ERRORTOKEN;
        tok->done = E_DECODE;
    }
    return result;
}

The public entry coerces a decode error to ERRORTOKEN with E_DECODE. Keeping the coercion here means every internal caller can ignore the decode-error flag.
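
The mirror can keep the same boundary. A sketch, assuming errorToken / errDecode counterparts for ERRORTOKEN and E_DECODE, plus a decodingErred flag and Token type:

// Get is the public entry point; Next dispatches on the mode stack
// exactly like tok_get. Only here is the decode-error flag folded into
// the result, so internal callers never have to check it.
func (t *Tokenizer) Get(tok *Token) int {
    result := t.Next(tok)
    if t.decodingErred {
        result = errorToken
        t.done = errDecode
    }
    return result
}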

Notes for the gopy mirror

  • parser/lexer/lexer.go follows the same structure: one normalMode method and one fstringMode method.
  • The tok_state -> Tokenizer rename is the only structural change.
  • gopy uses rune for character classification but keeps the byte cursor; the >= 128 non-ASCII-byte trick still applies.
  • The indent stack is a []int; the f-string mode stack is a slice of mode structs.
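
Assembling those bullets, the state struct might plausibly look like this. It expands the trimmed Tokenizer sketch under tok_nextc; every field name is an assumption except those implied by the Map:

type Tokenizer struct {
    buf         []byte          // current line buffer
    cur, inp    int             // byte cursor and end of valid data
    atBOL       bool            // at the start of a logical line
    indent      int             // depth into the indent stacks
    indstack    []int           // indent columns, tabSize ruler
    altindstack []int           // indent columns, altTabSize ruler
    pending     int             // +n INDENTs / -n DEDENTs still to emit
    modeStack   []tokenizerMode // f-string / t-string frames
    done        error
}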

CPython 3.14 changes worth noting

  • EXCLAMATION token added for f-string conversion syntax.
  • TSTRING_* triplet added for PEP 750.
  • # type: ignore[code] form is recognised; the bracketed code is captured in token.type_ignores.