
TextIOWrapper detail

Overview

textio.c is the largest file in the _io module. It implements TextIOWrapper, the layer that converts between bytes (read from or written to a buffered binary stream such as BufferedReader/BufferedWriter) and Python str objects. The key moving parts are: the incremental encoder/decoder pair, the write accumulator (pending_bytes), the universal-newlines state machine, and the snapshot used by tell() to reconstruct a position from a byte offset plus decoder state.

Map

| Region | Lines (approx) | What lives there |
| --- | --- | --- |
| Struct + init helpers | 1-350 | textio C struct, textiowrapper_init, codec lookup |
| _textiowrapper_writeflush | 350-500 | drain pending_bytes, call codec encoder |
| textiowrapper_write | 500-700 | accumulate to pending_bytes, write_through path |
| _textiowrapper_read_chunk | 700-950 | read raw bytes, run incremental decoder |
| textiowrapper_readline | 950-1350 | universal newlines state machine |
| textiowrapper_read | 1350-1550 | bulk read, merge decoded chunks |
| textiowrapper_tell | 1550-1800 | snapshot: bytes-offset + codec state pickle |
| textiowrapper_seek | 1800-2100 | restore snapshot, re-decode lead bytes |
| Properties | 2100-2400 | line_buffering, write_through, encoding, errors, newlines |
| textiowrapper_iternext | 2400-2600 | __next__ delegates to readline |
| Utility / class def | 2600-3500 | repr, closed check, type slot wiring |

Reading

pending_bytes and write_through

textio.c, lines 500-700

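textiowrapper_write encodes its str argument immediately but does not hand the result to the underlying buffer right away. Instead it appends the encoded chunk to self->pending_bytes (a single object or a list of chunks; in recent CPython versions, ASCII-only text written through an ASCII-compatible codec is stored as the str itself and copied out at flush time) and adds its length to self->pending_bytes_count. The accumulated chunks are written to the underlying buffer only when:

  • self->write_through is set (every write is handed to the underlying buffer at once),
  • the accumulated total reaches self->chunk_size (8192 bytes by default),
  • self->line_buffering is set and the text contains a newline or carriage return, or
  • an explicit flush() is called.

The flush path (_textiowrapper_writeflush) concatenates the pending chunks into a single bytes object and passes it to the underlying buffer's write() in one call. BOM handling does not depend on this batching: codecs such as UTF-16 emit a BOM only at the start of the stream because the encoder keeps that state across calls.

write_through does not skip the encode step; it only forces _textiowrapper_writeflush after every write, so nothing lingers in pending_bytes. It does not by itself call the underlying buffer's flush(); that happens only on the line_buffering path or on an explicit flush(). This is the mode used by sys.stdout in unbuffered mode.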
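The buffering behaviour described above can be observed from Python. The following is a minimal sketch, assuming the C implementation of io shipped with CPython (the pure-Python _pyio and other interpreters may buffer differently); the variable names are illustrative only.

```python
import io

# Default mode: a small write stays inside the wrapper until flush().
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="utf-8")
text.write("hello")
before_flush = raw.getvalue()   # nothing has reached the buffer yet
text.flush()
after_flush = raw.getvalue()

# write_through=True: every write is passed on to the buffer at once.
raw_wt = io.BytesIO()
text_wt = io.TextIOWrapper(raw_wt, encoding="utf-8", write_through=True)
text_wt.write("hello")
through = raw_wt.getvalue()

# UTF-16: the BOM is emitted once even though the text arrives in pieces.
raw16 = io.BytesIO()
text16 = io.TextIOWrapper(raw16, encoding="utf-16")
text16.write("a")
text16.write("b")
text16.flush()
bom_once = raw16.getvalue() == "ab".encode("utf-16")

print(before_flush, after_flush, through, bom_once)
```

Using io.BytesIO as the underlying buffer keeps the example self-contained; with a real file the same pattern applies, except that the BufferedWriter adds its own layer of buffering below the text layer.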

_read_chunk and the incremental decoder

textio.c, lines 700-950

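_textiowrapper_read_chunk requests self->chunk_size bytes from the underlying buffer (or more, when the caller passes a larger size hint) and feeds them to self->decoder (an incremental decoder). The decoder is called as decoder.decode(input_chunk, eof), where eof is true only when the read returned no bytes. The decoded str replaces self->decoded_chars, and self->decoded_chars_used, which counts how many of those characters have already been handed to the caller, is reset to 0; tell() uses that count for its accounting.

When telling is enabled, each chunk read also saves a snapshot: the pair (dec_flags, next_input), where dec_flags is the integer half of the (buffered_bytes, flags) tuple returned by decoder.getstate() before the read, and next_input is the decoder's buffered bytes plus the chunk just read. tell() does not return the snapshot itself; it combines it with the buffer position and decoded_chars_used to compute a cookie (a single integer, packed as described below).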
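To make the decoder state concrete, here is a small sketch using the standard codecs module rather than the C code itself. It shows an incremental UTF-8 decoder receiving a character split across two chunks, and the (buffered bytes, flags) tuple that getstate() returns in between; the variable names are illustrative.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()
encoded = "é".encode("utf-8")            # two bytes: b'\xc3\xa9'

# First chunk ends mid-character, so nothing can be emitted yet.
first = decoder.decode(encoded[:1], final=False)
state = decoder.getstate()               # (buffered bytes, flags)

# Second chunk completes the character.
second = decoder.decode(encoded[1:], final=False)

# Restoring the saved state lets a fresh decoder resume mid-character.
resumed = codecs.getincrementaldecoder("utf-8")()
resumed.setstate(state)
again = resumed.decode(encoded[1:], final=False)

print(repr(first), state, second, again)
```

The buffered-bytes half of the state is why a snapshot has to remember input bytes as well as flags: a position in the middle of a multi-byte sequence cannot be described by a byte offset alone.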

readline, universal newlines, and tell/seek

textio.c, lines 950-1200

textiowrapper_readline scans self->decoded_chars for a line ending. In universal newlines mode (self->readuniversal, set when newline is None or '') it accepts \n, \r, and \r\n. The \r case is tricky: a \r may or may not be followed by \n, and that \n could arrive in the next chunk. The incremental newline decoder resolves this by holding back a trailing \r until more input (or EOF) arrives, so readline never sees a \r\n pair split across two chunks. The newline styles seen so far are recorded in the newline decoder's seennl bitmask, which is exposed as the newlines property.
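The \r lookahead can be observed through io.IncrementalNewlineDecoder, the helper TextIOWrapper installs in universal newlines mode. This is a short sketch, not a trace of the C code: passing None as the wrapped decoder makes it accept str input directly, which keeps the example free of encoding concerns.

```python
import io

# translate=False corresponds to newline='': line endings are detected
# but returned untranslated.
nl = io.IncrementalNewlineDecoder(None, translate=False)
part1 = nl.decode("ab\r")      # the trailing \r is held back
part2 = nl.decode("\ncd")      # ...and released together with the \n
seen = nl.newlines             # styles observed so far

# End to end: the default newline=None translates every style to "\n".
stream = io.TextIOWrapper(io.BytesIO(b"a\r\nb\rc\n"), encoding="ascii")
lines = [stream.readline() for _ in range(3)]
styles = stream.newlines

print(repr(part1), repr(part2), repr(seen), lines, styles)
```

Because the decoder withholds the \r, the caller never has to guess whether a chunk boundary hides the second half of a \r\n pair.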

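textiowrapper_tell returns a cookie integer. The cookie packs five fields into a single arbitrary-precision int: the byte offset of a safe starting point (start_pos, lowest bits), the decoder flags to restore there (dec_flags), the number of bytes to feed the decoder from that point (bytes_to_feed), the number of characters to skip after decoding (chars_to_skip), and a need_eof flag. Because of the extra fields a cookie can exceed 64 bits, although for simple decoders it is often just the byte offset. This scheme allows textiowrapper_seek to reconstruct exact mid-stream positions without storing state externally.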
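As an illustration of cookie packing, the sketch below mirrors the layout used by CPython's pure-Python implementation (_pyio), which gives each of five fields its own 64-bit slot. The C implementation uses narrower slots for some fields, so treat the exact bit positions here as an assumption of this sketch; the function names pack_cookie and unpack_cookie are hypothetical.

```python
def pack_cookie(position, dec_flags=0, bytes_to_feed=0,
                need_eof=False, chars_to_skip=0):
    """Pack the five tell() fields into one int (64-bit slots, as in _pyio)."""
    return (position
            | (dec_flags << 64)
            | (bytes_to_feed << 128)
            | (chars_to_skip << 192)
            | (bool(need_eof) << 256))


def unpack_cookie(cookie):
    """Inverse of pack_cookie: peel off one 64-bit slot at a time."""
    rest, position = divmod(cookie, 1 << 64)
    rest, dec_flags = divmod(rest, 1 << 64)
    rest, bytes_to_feed = divmod(rest, 1 << 64)
    need_eof, chars_to_skip = divmod(rest, 1 << 64)
    return position, dec_flags, bytes_to_feed, bool(need_eof), chars_to_skip


simple = pack_cookie(42)                       # no decoder state: plain offset
full = pack_cookie(10, dec_flags=1, bytes_to_feed=3,
                   need_eof=True, chars_to_skip=2)
print(simple, unpack_cookie(full))
```

When every field except the position is zero, the cookie equals the byte offset, which is why tell() on a simply encoded file often looks like an ordinary file position.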

textiowrapper_seek unpacks the cookie, calls buffer.seek(start_pos), and restores the decoder with decoder.setstate((b'', dec_flags)). If chars_to_skip is non-zero it then reads bytes_to_feed bytes, decodes them with need_eof as the final flag, and marks the first chars_to_skip characters of the output as already consumed. A cookie of 0 takes a shortcut: the decoder is simply reset with decoder.reset().
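The repositioning steps can be sketched in a few lines of Python. This is an approximation under stated assumptions, not the C routine: restore_position is a hypothetical helper, it takes the cookie fields already unpacked, and it uses a plain codecs incremental decoder without the newline layer. A round trip through the real TextIOWrapper follows for comparison.

```python
import codecs
import io


def restore_position(buffer, start_pos, dec_flags, bytes_to_feed,
                     need_eof, chars_to_skip, encoding="utf-8"):
    """Seek to a safe point, rebuild decoder state, skip leading chars."""
    buffer.seek(start_pos)
    decoder = codecs.getincrementaldecoder(encoding)()
    if start_pos == 0 and dec_flags == 0:
        decoder.reset()                       # start-of-stream shortcut
    else:
        decoder.setstate((b"", dec_flags))
    decoded = decoder.decode(buffer.read(bytes_to_feed), need_eof)
    if len(decoded) < chars_to_skip:
        raise OSError("can't restore logical file position")
    return decoder, decoded[chars_to_skip:]


# Bytes 0..2 of the data decode to "hé"; skipping one character leaves "é".
data = "héllo wörld".encode("utf-8")
decoder, leftover = restore_position(io.BytesIO(data), 0, 0, 3, False, 1)

# Round trip through the real implementation: tell() then seek().
f = io.TextIOWrapper(io.BytesIO("héllo\nwörld\n".encode("utf-8")),
                     encoding="utf-8")
f.readline()
cookie = f.tell()
rest = f.read()
f.seek(cookie)
round_trip_ok = f.read() == rest

print(repr(leftover), round_trip_ok)
```

The cookie value itself is treated as opaque here; only the property that seek(tell()) returns to the same logical position is relied on.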

The 3.14 change adds a newline_decoder fast path for \n-only files where \r translation is entirely skipped, improving readline throughput by roughly 8% on large log files.

gopy notes

  • pending_bytes is modelled as []string in module/io/textiowrapper.go. The join-before-encode invariant is preserved by writeFlush.
  • The incremental decoder interface maps to golang.org/x/text/transform.Transformer. decoder.getstate() has no direct Go equivalent; the port saves the transformer's internal byte buffer (pendingBytes []byte) as the snapshot state.
  • The tell cookie packing uses encoding/binary with a fixed layout struct. The field widths match CPython's COOKIE_BUF_LEN and OFF_* offset constants so cookies produced by gopy are interchangeable with CPython's for the same file.
  • Universal newlines scanning is a hand-ported loop in readLine; the seennl bitmask is kept as a uint8 with the same bit positions as CPython's SEEN_CR, SEEN_LF, SEEN_CRLF constants (1, 2, 4).
  • write_through is a bool field; when set, Write calls b.Flush() on the underlying buffered writer after every encode, matching CPython behaviour.
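The seennl bit positions mentioned in the list above can be checked against CPython from Python. The sketch below is illustrative: newlines_from_mask and mask_from_text are hypothetical helpers, and the mapping from mask to property value is inferred from what io.IncrementalNewlineDecoder reports rather than copied from the C source.

```python
import io

SEEN_CR, SEEN_LF, SEEN_CRLF = 1, 2, 4


def newlines_from_mask(mask):
    """Map a seennl bitmask to the value of the newlines property."""
    styles = tuple(s for bit, s in ((SEEN_CR, "\r"),
                                    (SEEN_LF, "\n"),
                                    (SEEN_CRLF, "\r\n")) if mask & bit)
    if not styles:
        return None
    return styles[0] if len(styles) == 1 else styles


def mask_from_text(text):
    """Compute the bitmask by scanning text, treating \\r\\n as one unit."""
    mask, i = 0, 0
    while i < len(text):
        if text.startswith("\r\n", i):
            mask |= SEEN_CRLF
            i += 2
            continue
        if text[i] == "\r":
            mask |= SEEN_CR
        elif text[i] == "\n":
            mask |= SEEN_LF
        i += 1
    return mask


samples = ["abc", "a\rb", "a\nb", "a\r\nb", "a\rb\n", "a\r\nb\rc\n"]
results = []
for sample in samples:
    nl = io.IncrementalNewlineDecoder(None, translate=True)
    nl.decode(sample, final=True)
    results.append((newlines_from_mask(mask_from_text(sample)), nl.newlines))

print(results)
```

Keeping the same three bit values in a port means the newlines property can be derived with one small lookup, whatever integer width the mask is stored in.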