Python/Python-tokenize.c

Overview

Python/Python-tokenize.c is a roughly 150-line C file that implements the _tokenize built-in extension module. The only function it exposes to Python is _tokenize.detect_encoding, which backs the tokenize.open() convenience function in Lib/tokenize.py.

The file exists because encoding detection has to read raw bytes before the file is decoded, and the logic (BOM recognition plus a two-line scan for a coding: cookie) is already implemented in the C tokenizer as PyTokenizer_FindEncodingFilename. Wrapping that function in a small C extension avoids duplicating the logic in pure Python and keeps tokenize.detect_encoding consistent with the encoding rules used by the parser itself.
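
For orientation, the detection rules described above reduce to roughly the following pure-Python sketch. This is illustrative only (the name sketch_detect is hypothetical); the authoritative logic is the C implementation behind PyTokenizer_FindEncodingFilename.

# Illustrative sketch of the detection rules; not CPython code.
import re

_COOKIE_RE = re.compile(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

def sketch_detect(first_two_lines):
    # A UTF-8 BOM on the first line decides the encoding outright.
    if first_two_lines and first_two_lines[0].startswith(b'\xef\xbb\xbf'):
        return 'utf-8-sig'
    # Otherwise scan at most the first two lines for a coding cookie.
    for line in first_two_lines[:2]:
        match = _COOKIE_RE.match(line)
        if match:
            return match.group(1).decode('ascii')
    return 'utf-8'  # the default source encoding since PEP 3120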

The module is compiled unconditionally (unlike the optional extension modules in Modules/) because the tokenize module is part of the standard library and encoding detection is required to read .py source files reliably.

Reading Subsections

1. detect_encoding: the sole exported function

// Python/Python-tokenize.c (CPython 3.14)

/*[clinic input]
_tokenize.detect_encoding
filename: object
/
Detect the encoding of a Python source file.

Scan the first two lines of `filename` for a BOM or coding cookie
and return (encoding, lines) where `lines` is a list containing the
raw bytes of the two lines read.
[clinic start generated code]*/

static PyObject *
_tokenize_detect_encoding_impl(PyObject *module, PyObject *filename)
{
    PyObject *result = NULL;
    PyObject *lines = NULL;     /* filled in by the elided code below */
    FILE *fp = NULL;
    char *encoding = NULL;

    const char *fname = PyUnicode_AsUTF8(filename);
    if (fname == NULL) {
        return NULL;
    }
    fp = fopen(fname, "rb");
    if (fp == NULL) {
        PyErr_SetFromErrnoWithFilenameObject(PyExc_OSError, filename);
        return NULL;
    }
    encoding = PyTokenizer_FindEncodingFilename(fileno(fp), filename);
    ...  /* elided: error check on `encoding`, re-read of the first two lines into `lines` */
    result = Py_BuildValue("(sO)", encoding ? encoding : "utf-8", lines);
    ...  /* elided: close `fp` and release `encoding` and `lines` */
    return result;
}
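
Assuming the module builds as shown, it can be exercised directly from Python. In normal use it is reached through tokenize.detect_encoding and tokenize.open() instead; the file name below is only for illustration.

# Hypothetical direct use of the private module.
import _tokenize

encoding, lines = _tokenize.detect_encoding("example.py")
print(encoding)   # e.g. 'utf-8', or the cookie value for files with a coding: line
print(lines)      # raw bytes of the (up to) two lines read during detection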

2. tokenize.open() integration in pure Python

# Lib/tokenize.py

import io

try:
    from _tokenize import detect_encoding as _detect_encoding_c
    def detect_encoding(readline):
        ...
except ImportError:
    # Fallback pure-Python implementation used when running from source.
    def detect_encoding(readline):
        ...

def open(filename):
    """Open a file in read only mode using the encoding detected by
    detect_encoding()."""
    buffered = io.open(filename, "rb")
    encoding, lines = detect_encoding(buffered.readline)
    buffered.seek(0)  # rewind past the lines consumed during detection
    return io.TextIOWrapper(buffered, encoding=encoding,
                            line_buffering=True)
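
A round trip shows the cookie being honoured. The file name and the latin-1 cookie below are chosen purely for illustration.

# Write a file with a coding cookie, then reopen it via tokenize.open().
import tokenize

with open('cookie_demo.py', 'wb') as f:
    f.write(b'# -*- coding: latin-1 -*-\n')
    f.write("nom = 'café'\n".encode('latin-1'))

with tokenize.open('cookie_demo.py') as f:
    print(f.encoding)   # the (normalized) encoding detected from the cookie
    print(f.read())     # decoded text, with the non-ASCII character intact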

3. Module initialisation and method table

// Python/Python-tokenize.c (CPython 3.14)

static PyMethodDef tokenize_methods[] = {
    _TOKENIZE_DETECT_ENCODING_METHODDEF    /* macro from clinic output */
    {NULL, NULL}                           /* sentinel */
};

static struct PyModuleDef tokenize_module = {
    PyModuleDef_HEAD_INIT,
    "_tokenize",
    NULL,               /* docstring */
    -1,                 /* m_size: module keeps no per-interpreter state */
    tokenize_methods,
};

PyMODINIT_FUNC
PyInit__tokenize(void)
{
    return PyModule_Create(&tokenize_module);
}
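
A quick interactive smoke test confirms that PyInit__tokenize wired the method table up; this check is not part of the file itself.

# Smoke test for the module initialisation shown above.
import _tokenize

print(_tokenize.__name__)                    # _tokenize
print(callable(_tokenize.detect_encoding))   # True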

Port status

Not yet ported to gopy. The dependency is PyTokenizer_FindEncodingFilename from Parser/tokenizer.c. Once the gopy tokenizer port reaches the encoding-detection layer, this module becomes a thin wrapper around it and can be shipped as module/tokenize/.