Parser/tokenizer/file_tokenizer.c

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/file_tokenizer.c

One of three tokenizer back-ends (file, string, readline). This one wraps a C FILE* and feeds the core lexer one logical line at a time. It handles all encoding detection that must happen before the first byte is handed to the scanner: the UTF-8 BOM, the UTF-16 BOM, and the PEP 263 # -*- coding: … comment in the first two lines.

The detected encoding is stored in tok->encoding and propagated to compile() and the io layer. If detection fails, a SyntaxError is raised before any token is emitted.

Map

| Lines | Symbol | Role | gopy |
| --- | --- | --- | --- |
| 1-42 | static helpers / macros | BOM_UTF8, BOM_UTF16_LE, column-safe fgetc wrapper | internal constants in parser/tokenizer/file.go |
| 44-118 | get_coding_spec | Scans one text line for coding[=:]\s*[-\w.]+ and returns the encoding name | getCodingSpec |
| 120-219 | check_coding_spec | Reads up to two lines and calls get_coding_spec; stores the result in tok->encoding | checkCodingSpec |
| 221-298 | fp_readline | The underflow callback: reads one line from tok->fp, expanding the line buffer if needed | (*FileTokenizer).readline |
| 300-410 | _PyTokenizer_FromFile | Factory: opens or adopts a FILE*, detects encoding, installs fp_readline as underflow | NewFileTokenizer |
| 412-493 | _PyTokenizer_Free | Releases the line buffer, closes the file if owned, frees the encoding string | (*FileTokenizer).Free |

Reading

_PyTokenizer_FromFile (lines 300 to 410)

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/file_tokenizer.c#L300-410

struct tok_state *
_PyTokenizer_FromFile(FILE *fp, const char *ps1, const char *ps2,
                      const char *enc)
{
    struct tok_state *tok = _PyTokenizer_New();
    if (tok == NULL)
        return NULL;
    if (enc != NULL) {
        /* encoding forced by the caller (e.g. from a BOM already read) */
        tok->encoding = new_string(enc, strlen(enc), tok);
        if (!tok->encoding)
            goto error;
        tok->decoding_state = STATE_NORMAL;
    }
    tok->fp = fp ? fp : stdin;
    tok->prompt = ps1;
    tok->nextprompt = ps2;
    tok->underflow = fp_readline;

    if (fp != NULL && enc == NULL) {
        if (!check_coding_spec(tok))
            goto error;
    }
    return tok;

error:
    _PyTokenizer_Free(tok);
    return NULL;
}

When fp is NULL the tokenizer reads from stdin (the interactive interpreter path). When enc is already known (the caller stripped a BOM before handing the file over) encoding detection is skipped and the provided name is adopted directly.

ps1 / ps2 are the interactive prompts (>>> and ...). They are written to stdout by fp_readline (via PyOS_Readline) whenever it needs another line from the user.

Encoding detection: BOM and coding comment (lines 1 to 219)

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/file_tokenizer.c#L1-219

Three cases are handled in order:

UTF-8 BOM (\xEF\xBB\xBF). Detected at the start of check_coding_spec by peeking three bytes with fgetc/ungetc. If found, tok->encoding is set to "utf-8-sig" and the three bytes are consumed so the scanner never sees them.

/* line 140-155 */
c1 = fgetc(fp);
if (c1 == EOF)
    goto done;
c2 = fgetc(fp);
if (c2 == EOF) {
    ungetc(c1, fp);
    goto done;
}
c3 = fgetc(fp);
if (c1 == 0xEF && c2 == 0xBB && c3 == 0xBF) {
    tok->encoding = new_string("utf-8-sig", 9, tok);
    tok->decoding_state = STATE_NORMAL;
    goto done;
}
ungetc(c3, fp); ungetc(c2, fp); ungetc(c1, fp);

UTF-16 BOM (\xFF\xFE or \xFE\xFF). Detected by a similar peek, this time at the first two bytes. When found, tok->encoding is set to "utf-16" and the iconv decoder is initialized. UTF-16 source files are rare in practice, but the tokenizer must handle them for compatibility.

PEP 263 coding: comment. get_coding_spec is called on the first line, and if it returns nothing, on the second line. The regex is:

coding[=:]\s*([-\w.]+)

The match is case-insensitive and the prefix must be preceded by a space, tab, or the start of the comment body to avoid false matches in ordinary strings.

If none of the three detection paths fires, tok->encoding defaults to "utf-8" (PEP 3120).

fp_readline (lines 221 to 298)

cpython 3.14 @ ab2d84fe1023/Parser/tokenizer/file_tokenizer.c#L221-298

static int
fp_readline(struct tok_state *tok)
{
    ...
    if (tok->prompt != NULL) {
        char *new_line = PyOS_Readline(stdin, stdout, tok->prompt);
        tok->prompt = tok->nextprompt;
        if (new_line == NULL) {
            tok->done = E_INTR;
            return 0;
        }
        ...
    }
    /* Non-interactive: read directly from fp */
    if (tok->decoding_fgets != NULL) {
        /* decoding path: iconv-wrapped fgets */
        ...
    }
    else {
        bytes = fgets_impl(tok->buf, tok->buf_end - tok->buf, tok->fp);
    }
    ...
}

The interactive path calls PyOS_Readline so the host application can supply its own line-editing library (e.g. readline). The non-interactive path uses either a plain fgets or the iconv-wrapped variant depending on whether a non-UTF-8 encoding was detected.

The line buffer starts at 1000 bytes and doubles on each reallocation. The tok->inp pointer is advanced to the end of the new data, and tok->cur is left pointing at the start so tok_nextc can begin scanning.