
1665. gopy tokenize

What we are porting

One file from cpython/Python/:

C source             Lines   Go target
Python-tokenize.c    445     tokenize/tokenize.go

Python-tokenize.c is the Python-visible wrapper around the lexer. It exposes the _tokenize.TokenizerIter class that tokenize.tokenize() in the stdlib delegates to. The actual lexer (state machine, indent tracking, f-string handling, encoding detection) lives in cpython/Parser/tokenizer/ and is in the parser spec series.

This spec is small on purpose. It covers only the iterator surface and its mapping to a Go iterator. The lexer port underneath is tracked elsewhere; v0.9 wires this module through the import system.

What the wrapper does

tokenizeriter_new constructs a tokenizeriterobject from:

  • a readline callable (io.TextIO.readline-shaped), or
  • a source string (the source keyword), with an optional extra_tokens flag.

tokenizeriter_next advances the underlying struct tok_state* one token at a time and returns a 5-tuple (type, string, start, end, line). The type value is one of the constants exported from token.h (NAME=1, NUMBER=2, STRING=3, OP=54, etc.). When extra_tokens=False the wrapper filters out COMMENT, NL, ENCODING, and NEWLINE-at-EOF.

The wrapper also handles:

  • Encoding detection on byte input. The start of the input is checked for a BOM and the first two physical lines for a PEP 263 coding: cookie (see the sketch after this list).
  • SyntaxError lifting. A lexer error becomes a Python SyntaxError with the right offset, end-offset, and source line.
  • Source-line slicing (from the line start up to tok->cur) to fill the line field of every token.
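
As a rough illustration of the cookie check only: the sketch below is not part of the spec surface; the helper name and package placement are assumptions, and only the regular expression (from PEP 263) and the UTF-8 BOM bytes are fixed facts.

package tokenize

import "regexp"

// codingRE is the PEP 263 pattern: a comment containing "coding[:=] <name>".
var codingRE = regexp.MustCompile(`^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)`)

// detectEncoding is a hypothetical helper: it inspects the raw bytes for a
// UTF-8 BOM and the first two physical lines for a coding cookie, defaulting
// to UTF-8 as CPython does.
func detectEncoding(firstTwoLines []string, raw []byte) string {
    if len(raw) >= 3 && raw[0] == 0xEF && raw[1] == 0xBB && raw[2] == 0xBF {
        return "utf-8" // a BOM wins over any cookie
    }
    for _, line := range firstTwoLines {
        if m := codingRE.FindStringSubmatch(line); m != nil {
            return m[1]
        }
    }
    return "utf-8"
}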

Go shape

package tokenize

type Type int

const (
    ENDMARKER Type = iota
    NAME
    NUMBER
    STRING
    NEWLINE
    INDENT
    DEDENT
    LPAR
    // ... full set generated from cpython/Grammar/Tokens.
)

type Pos struct{ Line, Col int }

type Token struct {
    Type  Type
    Value string
    Start Pos
    End   Pos
    Line  string
}

// Iter is the Go-side TokenizerIter equivalent. Next returns
// io.EOF when the stream is exhausted.
type Iter struct {
    // unexported handle into the lexer state
}

func New(src string, extraTokens bool) *Iter
func NewReadline(rl func() (string, error), extraTokens bool) *Iter

func (it *Iter) Next() (Token, error)
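
For orientation, a consumer of this surface would look roughly like the following. The import path and the formatting are illustrative assumptions, not part of the contract; only New, Next, and the io.EOF termination come from the shape above.

package main

import (
    "fmt"
    "io"
    "log"

    "gopy/tokenize" // import path is an assumption
)

func main() {
    it := tokenize.New("x = 1\n", false)
    for {
        tok, err := it.Next()
        if err == io.EOF {
            break // stream exhausted after ENDMARKER
        }
        if err != nil {
            log.Fatal(err) // lexer errors surface here
        }
        fmt.Printf("%v %q %d:%d-%d:%d\n",
            tok.Type, tok.Value,
            tok.Start.Line, tok.Start.Col, tok.End.Line, tok.End.Col)
    }
}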

The lexer state behind Iter is the Go port of cpython/Parser/tokenizer/. v0.5 stubs this with a parser-side lexer skeleton sufficient for the pipeline gate; the public surface in this package is stable across that change.

Token type constants are generated from cpython/Grammar/Tokens by tools/tokens_go and emitted into tokenize/types_gen.go. The generator is shared with the parser spec.
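
A minimal form of that generator might look like the sketch below. It follows the pycore_token.h input and token/ output layout described in the v0.5 checklist further down; the parsing details are assumptions and the real tools/tokens_go may differ (it also does not emit tokenNames or NTokens).

package main

import (
    "fmt"
    "os"
    "regexp"
    "strings"
)

var defineRe = regexp.MustCompile(`^#define\s+([A-Z_]+)\s+(\d+)`)

func main() {
    src, err := os.ReadFile("cpython/Include/internal/pycore_token.h")
    if err != nil {
        panic(err)
    }
    var b strings.Builder
    b.WriteString("// Code generated by tools/tokens_go. DO NOT EDIT.\n\npackage token\n\nconst (\n")
    for _, line := range strings.Split(string(src), "\n") {
        m := defineRe.FindStringSubmatch(line)
        if m == nil {
            continue
        }
        // Skip the bookkeeping macros that are not token kinds.
        switch m[1] {
        case "N_TOKENS", "NT_OFFSET", "ENCODING_OFFSET":
            continue
        }
        fmt.Fprintf(&b, "\t%s Type = %s\n", m[1], m[2])
    }
    b.WriteString(")\n")
    if err := os.WriteFile("token/types_gen.go", []byte(b.String()), 0o644); err != nil {
        panic(err)
    }
}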

Stdlib bridge

Python's tokenize.tokenize(readline) is implemented in Lib/tokenize.py and calls _tokenize.TokenizerIter. Once v0.9 wires the import system through frozen modules, our tokenize/ package backs that import via a Go module exposing TokenizerIter through the objects.Module protocol.

Errors

The wrapper preserves CPython's exact SyntaxError strings. Cases:

  • Unterminated string literal.
  • Inconsistent indentation.
  • Unmatched bracket.
  • Encoding declaration mismatch.
  • Tab/space mixing under -tt.

Each is reproduced verbatim from Python-tokenize.c plus the lexer error path.
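
One plausible Go-side shape for the lifted error is sketched below. The field names are assumptions; the spec only fixes the Error() text, which must be the CPython-verbatim message.

package tokenize

// SyntaxError is a sketch of the error lifted from a lexer failure.
// The fields mirror what Python-tokenize.c attaches to the exception.
type SyntaxError struct {
    Msg       string // verbatim CPython message
    Filename  string
    Line      int // 1-based line of the error
    Offset    int // 1-based column where the error starts
    EndOffset int
    Text      string // the offending source line
}

// Error returns the message unchanged, keeping the verbatim-string guarantee.
func (e *SyntaxError) Error() string { return e.Msg }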

Gate

Round-trip tokenize a panel of source files (test_tokenize.py fixtures from CPython) and assert the token stream matches what CPython's tokenize.tokenize produces, byte-for-byte on the value field and exact on type plus position. The gate lives in tokenize/tokenize_test.go and ships with v0.9 once the import plumbing exists; v0.5 adds the package skeleton plus a Go-only unit test of the iterator contract.
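
The parity check could take roughly this form. readGolden and the golden-file layout are hypothetical (the expected stream would be dumped offline by CPython's tokenize.tokenize); only the fixture source and the field-for-field comparison are fixed by the gate.

package tokenize_test

import (
    "io"
    "os"
    "testing"

    "gopy/tokenize" // import path is an assumption
)

// readGolden is a hypothetical helper that parses a token stream dumped by
// CPython's tokenize.tokenize into []tokenize.Token (parsing not shown).
func readGolden(t *testing.T, path string) []tokenize.Token { t.Helper(); return nil }

func TestCPythonParity(t *testing.T) {
    src, err := os.ReadFile("testdata/fixture.py")
    if err != nil {
        t.Fatal(err)
    }
    want := readGolden(t, "testdata/fixture.tokens")
    it := tokenize.New(string(src), true)
    for i := 0; ; i++ {
        tok, err := it.Next()
        if err == io.EOF {
            if i != len(want) {
                t.Fatalf("stream length: got %d tokens, want %d", i, len(want))
            }
            return
        }
        if err != nil {
            t.Fatal(err)
        }
        if i >= len(want) {
            t.Fatalf("extra token %d: %+v", i, tok)
        }
        if tok != want[i] {
            t.Fatalf("token %d: got %+v, want %+v", i, tok, want[i])
        }
    }
}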

Out of scope

  • The lexer state machine itself (Parser/tokenizer/). Separate spec.
  • untokenize. Lives in Lib/tokenize.py, ported as part of the stdlib effort.
  • Async tokenization. Not a CPython surface.

v0.5 checklist

This is the implementation contract. Each box maps to a test in 1625 §16.

Files

  • token/token.go: hand-written Type declaration plus String() method that delegates to the generated tokenNames table. Lives in its own package to mirror CPython's pycore_token.h / Lib/token.py split and avoid an import cycle with the parser's lexer.
  • tools/tokens_go/main.go: generator that reads Include/internal/pycore_token.h (the #define NAME N lines) and Lib/token.py (the NAME = N lines), filters N_TOKENS/NT_OFFSET/ENCODING_OFFSET, and emits token/types_gen.go.
  • token/types_gen.go: 69 token kinds (ENDMARKER=0..ENCODING=68) plus NTokens=69 and the tokenNames table.
  • tokenize/tokenize.go:
    • Pos struct{ Line, Col int }.
    • Token struct{ Type token.Type; Value string; Start, End Pos; Line string }. (Line is left empty; full source-line slicing tracked under the v0.9 surface guarantee below.)
    • Iter opaque struct holding the underlying *lexer.State.
    • New(src string, extraTokens bool) *Iter: wires lexer.FromString plus SetExtraTokens.
    • NewReadline(rl func() (string, error), extraTokens bool) *Iter: wires lexer.FromReadline with a []byte/string adapter.
    • (it *Iter) Next() (Token, error) returning io.EOF after the lexer's ENDMARKER lands.
  • tokenize/tokenize_test.go: ENDMARKER-then-EOF (sketched after this list), simple assignment kinds, readline driver smoke. The numeric/Type.String panel moved to token/token_test.go along with the constants.
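
The iterator-contract test might be sketched as follows. It assumes the token/tokenize package split described above and illustrative import paths; the exact assertions are not fixed beyond "ENDMARKER, then io.EOF".

package tokenize_test

import (
    "io"
    "testing"

    "gopy/token"    // import paths are assumptions
    "gopy/tokenize"
)

// TestEndmarkerThenEOF checks the Next contract: the stream ends with an
// ENDMARKER token, and calls after exhaustion keep returning io.EOF.
func TestEndmarkerThenEOF(t *testing.T) {
    it := tokenize.New("x = 1\n", false)
    var last tokenize.Token
    for {
        tok, err := it.Next()
        if err == io.EOF {
            break
        }
        if err != nil {
            t.Fatal(err)
        }
        last = tok
    }
    if last.Type != token.ENDMARKER {
        t.Fatalf("last token = %v, want ENDMARKER", last.Type)
    }
    if _, err := it.Next(); err != io.EOF {
        t.Fatalf("Next after exhaustion: err = %v, want io.EOF", err)
    }
}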

Surface guarantees

  • Numeric values of Type constants match cpython/Include/internal/pycore_token.h and cpython/Lib/token.py one-to-one. Pinned by TestTypeNumericValues.
  • extraTokens=false filters COMMENT, NL, ENCODING, and the trailing NEWLINE-at-EOF (CPython parity). (v0.9.)
  • Encoding detection on byte input recognises BOM and PEP 263 cookies. (v0.9.)
  • Lexer errors lift to a Go *SyntaxError whose Error() string is verbatim from Python-tokenize.c plus the lexer error path. (v0.9.)
  • Source-line slicing based on tok->cur fills the Line field of every token. (v0.9.)

Stdlib bridge (v0.9, tracked here)

  • objects.Module shim that exposes TokenizerIter so Lib/tokenize.py works once the import system is wired.

Out of scope for v0.5

  • The actual lexer state machine. Tracked under Parser/tokenizer/.
  • untokenize. Stdlib effort.