Parser/tokenizer/file_tokenizer.c

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/file_tokenizer.c

One of three tokenizer back-ends (file, string, readline). This one wraps a C FILE* and feeds the core lexer one logical line at a time. It handles all encoding detection that must happen before the first byte is handed to the scanner: the UTF-8 BOM, the UTF-16 BOM, and the PEP 263 # -*- coding: … comment in the first two lines.

The detected encoding is stored in tok->encoding and propagated to compile() and the io layer. If detection fails, a SyntaxError is raised before any token is emitted.

Map

| Lines | Symbol | Role | gopy |
| --- | --- | --- | --- |
| 1-42 | static helpers / macros | BOM_UTF8, BOM_UTF16_LE, column-safe fgetc wrapper | internal constants in parser/tokenizer/file.go |
| 44-118 | get_coding_spec | Scans one text line for coding[=:]\s*[-\w.]+ and returns the encoding name | getCodingSpec |
| 120-219 | check_coding_spec | Reads up to two lines and calls get_coding_spec; stores the result in tok->encoding | checkCodingSpec |
| 221-298 | fp_readline | The underflow callback: reads one line from tok->fp, expanding the line buffer if needed | (*FileTokenizer).readline |
| 300-410 | _PyTokenizer_FromFile | Factory: opens or adopts a FILE*, detects encoding, installs fp_readline as underflow | NewFileTokenizer |
| 412-493 | _PyTokenizer_Free | Releases the line buffer, closes the file if owned, frees the encoding string | (*FileTokenizer).Free |

Reading

_PyTokenizer_FromFile (lines 300 to 410)

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/file_tokenizer.c#L300-410

struct tok_state *
_PyTokenizer_FromFile(FILE *fp, const char *ps1, const char *ps2,
                      const char *enc)
{
    struct tok_state *tok = _PyTokenizer_New();
    if (tok == NULL)
        return NULL;
    if (enc != NULL) {
        /* encoding forced by the caller (e.g. from a BOM already read) */
        tok->encoding = new_string(enc, strlen(enc), tok);
        if (!tok->encoding)
            goto error;
        tok->decoding_state = STATE_NORMAL;
    }
    tok->fp = fp ? fp : stdin;
    tok->prompt = ps1;
    tok->nextprompt = ps2;
    tok->underflow = fp_readline;

    if (fp != NULL && enc == NULL) {
        if (!check_coding_spec(tok))
            goto error;
    }
    return tok;

error:
    _PyTokenizer_Free(tok);
    return NULL;
}

When fp is NULL the tokenizer reads from stdin (the interactive interpreter path). When enc is already known (the caller stripped a BOM before handing the file over) encoding detection is skipped and the provided name is adopted directly.

ps1 / ps2 are the interactive prompts (>>> and ...). They are written to stdout by fp_readline (via PyOS_Readline) whenever it needs another line from the user.

Encoding detection: BOM and coding comment (lines 1 to 219)

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/file_tokenizer.c#L1-219

Three cases are handled in order:

UTF-8 BOM (\xEF\xBB\xBF). Detected at the start of check_coding_spec by peeking three bytes with fgetc/ungetc. If found, tok->encoding is set to "utf-8-sig" and the three bytes are consumed so the scanner never sees them.

/* line 140-155 */
c1 = fgetc(fp);
if (c1 == EOF)
    goto done;
c2 = fgetc(fp);
if (c2 == EOF) {
    ungetc(c1, fp);
    goto done;
}
c3 = fgetc(fp);
if (c1 == 0xEF && c2 == 0xBB && c3 == 0xBF) {
    tok->encoding = new_string("utf-8-sig", 9, tok);
    tok->decoding_state = STATE_NORMAL;
    goto done;
}
ungetc(c3, fp); ungetc(c2, fp); ungetc(c1, fp);

UTF-16 BOM (\xFF\xFE or \xFE\xFF). Detected by a similar peek, this time at the first two bytes. When found, tok->encoding is set to "utf-16" and the iconv decoder is initialized. UTF-16 source files are rare in practice, but the tokenizer must handle them for compatibility.

PEP 263 coding: comment. get_coding_spec is called on the first line, and if it returns nothing, on the second line. The regex is:

coding[=:]\s*([-\w.]+)

The match is case-insensitive and the prefix must be preceded by a space, tab, or the start of the comment body to avoid false matches in ordinary strings.

If none of the three detection paths fires, tok->encoding defaults to "utf-8" (PEP 3120).

fp_readline (lines 221 to 298)

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/file_tokenizer.c#L221-298

static int
fp_readline(struct tok_state *tok)
{
    ...
    if (tok->prompt != NULL) {
        char *new_line = PyOS_Readline(stdin, stdout, tok->prompt);
        tok->prompt = tok->nextprompt;
        if (new_line == NULL) {
            tok->done = E_INTR;
            return 0;
        }
        ...
    }
    /* Non-interactive: read directly from fp */
    if (tok->decoding_fgets != NULL) {
        /* decoding path: iconv-wrapped fgets */
        ...
    }
    else {
        bytes = fgets_impl(tok->buf, tok->buf_end - tok->buf, tok->fp);
    }
    ...
}

The interactive path calls PyOS_Readline so the host application can supply its own line-editing library (e.g. readline). The non-interactive path uses either a plain fgets or the iconv-wrapped variant depending on whether a non-UTF-8 encoding was detected.

The line buffer starts at 1000 bytes and doubles on each reallocation. The tok->inp pointer is advanced to the end of the new data, and tok->cur is left pointing at the start so tok_nextc can begin scanning.