Lib/tokenize.py
cpython 3.14 @ ab2d84fe1023/Lib/tokenize.py
tokenize is a pure-Python lexer that operates on a readline callable
rather than a raw string. Each call to readline must return one line of
input, normally including its trailing newline (the final line may lack
one), or an empty bytes/str to signal EOF. The module builds a set of
compiled regular expressions at import time and drives them through a
single _tokenize generator.
The public API has two entry points. tokenize(readline) expects a
bytes-returning callable and auto-detects the source encoding via
detect_encoding. generate_tokens(readline) expects a
str-returning callable and skips encoding detection. Both yield
TokenInfo namedtuples. A companion untokenize function reconstructs
source text from a token sequence.
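A minimal round trip through the str-based entry point, using io.StringIO to supply the readline callable (plain stdlib usage, nothing project-specific):

```python
import io
import tokenize

# generate_tokens accepts any callable that returns one str line per
# call; io.StringIO.readline fits directly.
source = "x = 1\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
# NAME 'x', OP '=', NUMBER '1', NEWLINE '\n', ENDMARKER ''
```

The bytes-based tokenize(readline) would additionally emit a leading ENCODING token before the same stream.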
Map
| Lines | Symbol | Role | gopy |
|---|---|---|---|
| 1-100 | TokenInfo, token constants | TokenInfo namedtuple (type, string, start, end, line); re-export of token module constants (COMMENT, NL, NEWLINE, INDENT, DEDENT, STRING, NUMBER, NAME, OP, ENCODING, ENDMARKER). | (stdlib pending) |
| 100-250 | detect_encoding, _get_normal_name, _get_encoding_from_cookie | BOM detection for UTF-8/UTF-16/UTF-32; two-line cookie scan using the coding[=:]\s*([-\w.]+) regex; returns (encoding, bom_lines). | (stdlib pending) |
| 250-450 | _tokenize | Inner generator: drives compiled regexes line-by-line, emits STRING, NUMBER, NAME, OP, COMMENT, NL, NEWLINE, and ERRORTOKEN tokens; handles string prefix detection (f, b, r, u, rb, br, etc.) and multi-line string continuation. | (stdlib pending) |
| 450-550 | INDENT/DEDENT stack inside _tokenize | Maintains an indents list (stack of column widths); emits INDENT when the current line's leading whitespace exceeds the top of the stack, and one DEDENT per popped level otherwise; raises IndentationError on mismatched dedents. | (stdlib pending) |
| 550-650 | untokenize, open, tokenize, generate_tokens | untokenize rebuilds source from (type, string) or full TokenInfo pairs, inserting whitespace from (row, col) positions when available; open wraps detect_encoding and returns an io.TextIOWrapper; public entry points. | (stdlib pending) |
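The untokenize row above can be exercised directly: with full five-field TokenInfo tuples, positions let untokenize restore the original text exactly, while bare (type, string) pairs fall back to synthesized spacing (stdlib-only sketch):

```python
import io
import tokenize

source = "x = 1 + 2\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Full TokenInfo tuples carry (row, col) positions, so untokenize can
# reproduce the original spacing byte-for-byte.
rebuilt = tokenize.untokenize(tokens)
assert rebuilt == source

# With only (type, string) pairs the positions are lost; untokenize
# inserts its own whitespace, giving equivalent but not necessarily
# identical source text.
pairs = [(t.type, t.string) for t in tokens]
rebuilt_loose = tokenize.untokenize(pairs)
```

Both outputs compile to the same code object semantics; only the full-tuple form is a faithful textual round trip.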
Reading
detect_encoding cookie regex (lines 100 to 250)
cpython 3.14 @ ab2d84fe1023/Lib/tokenize.py#L100-250
# coding[=:]\s*([-\w.]+)
_CODING_SPEC = re.compile(r'coding[=:]\s*([-\w.]+)')

def _get_encoding_from_cookie(first_two_lines):
    for line in first_two_lines:
        # Only look at the first two lines
        m = _CODING_SPEC.search(line)
        if m:
            return _get_normal_name(m.group(1))
    return None

def detect_encoding(readline):
    bom_enc = None
    bom = readline()
    # Check for BOM
    if bom.startswith(BOM_UTF8):
        bom_enc = 'utf-8-sig'
        bom = bom[3:]
    elif bom.startswith((BOM_UTF32_BE, BOM_UTF32_LE)):
        bom_enc = 'utf-32'
    elif bom.startswith((BOM_UTF16_BE, BOM_UTF16_LE)):
        bom_enc = 'utf-16'
    # Read second line for cookie
    second = readline()
    cookie_enc = _get_encoding_from_cookie(
        [bom.decode('latin-1'), second.decode('latin-1')]
    )
    if bom_enc and cookie_enc:
        if cookie_enc != bom_enc.replace('-sig', ''):
            raise SyntaxError('encoding mismatch in coding cookie')
    return (bom_enc or cookie_enc or 'utf-8'), [bom, second]
detect_encoding consumes at most two lines from the readline
callable. It first checks for byte-order marks (BOM) at the start of the
first line; a UTF-8 BOM is b'\xef\xbb\xbf'. It then searches each line
for the coding[=:]\s*([-\w.]+) pattern as required by PEP 263. If both
a BOM and a cookie are present the encodings must agree (after stripping
the -sig suffix from utf-8-sig). The consumed lines are returned
alongside the encoding name so the caller can feed them back through the
lexer; tokenize() itself emits a single ENCODING token carrying the
encoding name, then tokenizes those lines normally.
_get_normal_name normalises aliases: latin-1, latin1, and
iso-8859-1 all map to iso-8859-1; utf-8 and utf8 both map to
utf-8.
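The stdlib function can be exercised directly with io.BytesIO; this sketch shows both the cookie path and the alias normalisation described above:

```python
import io
import tokenize

# A PEP 263 cookie on the first line; 'latin-1' is normalised by
# _get_normal_name to the canonical 'iso-8859-1'.
src = b"# -*- coding: latin-1 -*-\nprint('hi')\n"
enc, consumed = tokenize.detect_encoding(io.BytesIO(src).readline)
assert enc == 'iso-8859-1'
# Only the first line was consumed, since the cookie was found there.
assert consumed == [b"# -*- coding: latin-1 -*-\n"]

# A UTF-8 BOM with no cookie yields 'utf-8-sig'.
enc_bom, _ = tokenize.detect_encoding(
    io.BytesIO(b'\xef\xbb\xbfx = 1\n').readline)
assert enc_bom == 'utf-8-sig'
```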
INDENT/DEDENT stack algorithm (lines 450 to 550)
cpython 3.14 @ ab2d84fe1023/Lib/tokenize.py#L450-550
indents = [0]  # stack of indent levels (column widths)
...
# At the start of each logical line:
column = 0
while pos < max:
    if line[pos] == ' ':
        column += 1
    elif line[pos] == '\t':
        column = (column // tabsize + 1) * tabsize
    elif line[pos] == '\x0c':  # form feed
        column = 0
    else:
        break
    pos += 1
if column > indents[-1]:
    indents.append(column)
    yield TokenInfo(INDENT, line[:pos], (lnum, 0), (lnum, pos), line)
elif column < indents[-1]:
    while indents[-1] != column:
        if column > indents[-1]:
            raise IndentationError(
                "unindent does not match any outer indentation level",
                ("<tokenize>", lnum, pos, line),
            )
        indents.pop()
        yield TokenInfo(DEDENT, '', (lnum, pos), (lnum, pos), line)
The indents list acts as a stack of column widths for each active
indentation level. At the start of each non-blank, non-continuation line
the tokenizer counts leading spaces and tabs (expanding tabs to the next
multiple of 8 by default). If the resulting column exceeds the stack top,
one INDENT token is emitted and the new level is pushed. If the column
is smaller, DEDENT tokens are emitted for each level popped until the
stack top matches. A column that is smaller than the current top but not
present anywhere in the stack raises IndentationError.
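The stack behaviour is observable from the public API: a return to column 0 past two nested levels emits two DEDENT tokens, and a dedent to a column not on the stack is rejected (plain stdlib usage):

```python
import io
import tokenize

src = (
    "if a:\n"
    "    if b:\n"
    "        c\n"
    "d\n"
)
names = [tokenize.tok_name[t.type]
         for t in tokenize.generate_tokens(io.StringIO(src).readline)]

# Two pushes on the way in, two pops when 'd' returns to column 0.
assert names.count('INDENT') == 2
assert names.count('DEDENT') == 2

# Column 4 is between stack levels 0 and 8, so tokenizing fails.
bad = "if a:\n        b\n    c\n"
try:
    list(tokenize.generate_tokens(io.StringIO(bad).readline))
    raised = False
except SyntaxError:  # IndentationError in the pure-Python tokenizer
    raised = True
assert raised
```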
_tokenize string detection (lines 250 to 450)
cpython 3.14 @ ab2d84fe1023/Lib/tokenize.py#L250-450
PseudoExtras = r'(\\\r?\n|\Z|\#[^\r\n]*)'
# The alternatives must be grouped so that Whitespace is consumed once,
# then exactly one alternative matches; a bare top-level '|' would let
# Whitespace alone match the empty string.
PseudoToken = Whitespace + '(' + '|'.join([
    PseudoExtras, Number, Funny, ContStr, Name]) + ')'
# ContStr matches the opening of a string literal including its prefix:
#   r'(?i)(?:r|u|rb|br|b|f|fr|rf)?'
# followed by either a triple-quoted opener or a single-quoted opener.
# If the closing quote is not found on the same line, _tokenize enters
# multi-line string continuation mode:
endprog = _compile(endpats.get(initial) or endpats.get(token[:2]))
...
if endmatch:
    pos = end = endmatch.end(0)
    yield TokenInfo(STRING, token, spos, (lnum, end), line)
else:
    strstart = (lnum, start)
    contstr = line[start:]
    contline = line
    break  # continue reading lines until closing quote found
_tokenize applies PseudoToken to each line in a loop. When the match
group corresponds to a string prefix (r, u, b, f, rb, br,
fr, rf, case-insensitive) the tokenizer looks up the correct end
pattern in endpats. For triple-quoted strings (""" or ''') the end
pattern crosses line boundaries, so _tokenize accumulates subsequent
lines in contstr until the closing triple quote is matched, then emits
the full multi-line STRING token.
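The continuation mode is visible in the emitted token: a triple-quoted literal spanning physical lines arrives as one STRING token whose start and end rows differ (stdlib-only sketch):

```python
import io
import tokenize

src = 's = """line one\nline two"""\n'
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
string_tok = next(t for t in tokens if t.type == tokenize.STRING)

# One token covers both physical lines, including the embedded newline.
assert string_tok.start == (1, 4)   # opening quote on line 1, column 4
assert string_tok.end[0] == 2       # closing quote on line 2
assert '\n' in string_tok.string
```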
gopy mirror
Lib/tokenize.py is not yet bundled in stdlib/MANIFEST.txt. The Go
side will need to expose a readline-compatible interface so that
compile.Tokenize can be driven by both bytes- and str-producing
callables. The detect_encoding logic needs to run before the lexer
proper, matching the two-step approach in CPython.