Skip to main content

Lib/tokenize.py (part 2)

Source:

cpython 3.14 @ ab2d84fe1023/Lib/tokenize.py

This annotation covers encoding detection. See lib_tokenize_detail for tokenize, generate_tokens, token types, and the FSM.

Map

Lines | Symbol | Role
1-60 | detect_encoding | Detect file encoding from BOM or # -*- coding: -*- comment
61-140 | open | Open a Python source file with the correct encoding
141-250 | _get_normal_name | Normalize encoding name (e.g. UTF-8 → utf-8)
251-400 | _detect_encoding | Read first two lines; look for BOM or encoding cookie
401-550 | cookie_re | Regex for PEP 263 encoding declaration
551-700 | TokenError / StopTokenizing | Exception types used by the tokenizer

Reading

detect_encoding

# CPython: Lib/tokenize.py:294 detect_encoding
def detect_encoding(readline):
    """Return (encoding, lines_consumed) for a Python source file.

    Detection order follows PEP 263: a UTF-8 BOM first, then a
    ``# -*- coding: ... -*-`` cookie in the first or second line.

    readline: callable returning the next source line as bytes;
        EOF is either b'' or a raised StopIteration.

    Returns (encoding_name, lines_read) where lines_read holds the
    0, 1 or 2 lines actually consumed, so the caller can replay them.

    Raises SyntaxError if a BOM is present but the coding cookie names
    an encoding other than UTF-8 -- the two declarations must agree.
    """
    bom_found = False
    default = 'utf-8'

    def read_or_stop():
        # Treat an exhausted readline iterator the same as b'' at EOF,
        # so generators work as line sources too.
        try:
            return readline()
        except StopIteration:
            return b''

    def find_cookie(line):
        # latin-1 maps every byte value, so this decode cannot fail.
        line_string = line.decode('latin-1')
        match = cookie_re.match(line_string)
        if not match:
            return None
        encoding = _get_normal_name(match.group(1))
        if bom_found:
            if encoding != 'utf-8':
                # BOM and cookie must agree (both UTF-8) -- PEP 263.
                raise SyntaxError(
                    'encoding problem: utf-8 BOM with %r cookie' % encoding)
            # The BOM upgrades a plain utf-8 cookie to utf-8-sig.
            encoding = 'utf-8-sig'
        return encoding

    first = read_or_stop()
    if first.startswith(BOM_UTF8):
        bom_found = True
        first = first[3:]          # strip the 3-byte BOM
        default = 'utf-8-sig'
    if not first:
        # Empty file (or BOM-only file): nothing consumed to replay.
        return default, []
    encoding = find_cookie(first)
    if encoding:
        return encoding, [first]

    second = read_or_stop()
    if not second:
        return default, [first]
    encoding = find_cookie(second)
    if encoding:
        return encoding, [first, second]

    return default, [first, second]
# CPython: Lib/tokenize.py:62 cookie_re
#
# PEP 263 encoding declaration.  Matched (with .match, i.e. anchored at
# the start) against each of a file's first two lines; the line must be
# a comment, and group(1) captures the declared encoding name.
# Accepted forms include:
#   # -*- coding: utf-8 -*-
#   # vim: set fileencoding=utf-8 :
#   # coding: latin-1
cookie_re = re.compile(
    r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)',
    re.ASCII | re.MULTILINE,
)

BOM detection

# CPython: Lib/tokenize.py:310 BOM handling
#
# UTF-8 byte-order mark.  Its presence at the very start of a file
# declares UTF-8 by itself; no coding cookie is required.  If a cookie
# is also present, the two must agree (i.e. the cookie must say UTF-8).
BOM_UTF8 = bytes((0xEF, 0xBB, 0xBF))

tokenize.open

# CPython: Lib/tokenize.py:400 open
def open(filename):
    """Open *filename* for reading text, using the encoding that
    detect_encoding() finds in its first two lines.
    """
    raw = builtins.open(filename, 'rb')
    try:
        encoding, _ = detect_encoding(raw.readline)
        # detect_encoding consumed up to two lines; rewind so the text
        # layer decodes the file from the very start.
        raw.seek(0)
        wrapper = io.TextIOWrapper(raw, encoding, line_buffering=True)
        wrapper.mode = 'r'
        return wrapper
    except:
        # Bare except on purpose: the raw handle must be closed on ANY
        # failure (including KeyboardInterrupt) before re-raising.
        raw.close()
        raise

gopy notes

tokenize.open uses builtins.open (gopy's vm.BuiltinOpen), io.TextIOWrapper (gopy objects/textiowrapper.go), and detect_encoding which is pure Python string manipulation. The BOM bytes constant is a Python bytes literal.