textwrap.py
textwrap reformats plain text: wrapping long paragraphs to a target width, stripping common indentation, and adding a uniform prefix. The module-level wrap, fill, and shorten functions delegate to a TextWrapper instance; dedent and indent are standalone functions that do not use TextWrapper.
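A quick tour of the public API; the shorten outputs below are the examples from the library documentation:

```python
import textwrap

# Wrap a paragraph to a narrow width (fill returns one newline-joined string).
print(textwrap.fill("The quick brown fox", width=10))

# Strip the indentation common to every line.
print(textwrap.dedent("    a\n    b"))

# Add a prefix to every non-blank line.
print(textwrap.indent("a\nb", "> "))

# Collapse internal whitespace and truncate to one line.
print(textwrap.shorten("Hello  world!", width=12))
```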
Map
| Lines | Symbol | Role |
|---|---|---|
| 1–40 | imports, __all__ | module setup |
| 41–120 | TextWrapper.__init__ | option attributes and compiled regex |
| 121–200 | TextWrapper._split, _split_chunks | tokenizer |
| 201–300 | TextWrapper._wrap_chunks, _handle_long_word | greedy line packer |
| 301–370 | TextWrapper.wrap, fill | public entry points |
| 371–400 | dedent | common-prefix stripper |
| 401–430 | indent | prefix adder |
| 431–450 | shorten, module-level wrappers | convenience API |
Reading
Tokenization with _split
TextWrapper._split_chunks first normalizes whitespace with _munge_whitespace, then calls _split, which applies a compiled regex to the text and returns a flat list of tokens. Each token is either a whitespace run, a word, or a hyphen-terminated fragment. The patterns (wordsep_re and a simpler wordsep_simple_re) are compiled once as class attributes; _split chooses between them at call time based on the break_on_hyphens flag.
# CPython: Lib/textwrap.py:76 TextWrapper (excerpt)
wordsep_re = re.compile(
    r'(\s+|'                                  # any whitespace
    r'(?<=[\w\!\"\'\&\.\,\?])-{2,}(?=\w))')   # em-dash between words
Tokens that are pure whitespace may be dropped at line boundaries unless drop_whitespace is False. This keeps the default behaviour of not emitting trailing spaces.
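The tokenizer can be observed directly; this is a sketch that pokes at the private _split method, so the exact token list is an implementation detail rather than a stable API:

```python
import textwrap

w = textwrap.TextWrapper()  # break_on_hyphens defaults to True

# A hyphenated word splits into a hyphen-terminated fragment plus the rest;
# the single space between words survives as its own whitespace token.
print(w._split("foo bar-baz"))
```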
Greedy packing in _wrap_chunks
_wrap_chunks receives the token list reversed, so that chunks.pop() removes the next token from the logical front of the text in O(1). It maintains a cur_line list and a cur_len counter. Each iteration pops a chunk and checks whether adding it would exceed width. When it would, cur_line is flushed as a completed line and a new one starts.
# CPython: Lib/textwrap.py:234 TextWrapper._wrap_chunks
def _wrap_chunks(self, chunks):
    lines = []
    ...
    while chunks:
        l = len(chunks[-1])
        if cur_len + l <= width:
            cur_len += l
            cur_line.append(chunks.pop())
        else:
            ...
    return lines
Words too long to fit even on a line of their own go to _handle_long_word. When break_long_words is True (the default) it slices off as much of the word as fits in the remaining space, preferring the last hyphen before the limit when break_on_hyphens is also set; when it is False, the oversized word is emitted on a line by itself, exceeding width.
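Both the greedy packer and the long-word fallback are visible from the public wrap function:

```python
import textwrap

# Greedy packing: words fill each line up to width before a new line starts.
print(textwrap.wrap("aa bb cc dd", width=8))   # ['aa bb cc', 'dd']

# A word longer than width is sliced to fit...
print(textwrap.wrap("aaaaaaaaaa", width=4))    # ['aaaa', 'aaaa', 'aa']

# ...unless break_long_words is disabled, in which case it overflows.
print(textwrap.wrap("aaaaaaaaaa", width=4, break_long_words=False))
```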
dedent: stripping common indentation
dedent finds the longest leading whitespace string shared by every non-empty line. It compiles a regex from that margin and substitutes it away from every line. The function is careful to ignore lines that consist entirely of whitespace when computing the margin.
# CPython: Lib/textwrap.py:379 dedent (regexes are module-level constants)
_whitespace_only_re = re.compile('^[ \t]+$', re.MULTILINE)
_leading_whitespace_re = re.compile('(^[ \t]*)(?:[^ \t\n])', re.MULTILINE)

def dedent(text):
    text = _whitespace_only_re.sub('', text)
    indents = _leading_whitespace_re.findall(text)
    ...
    if margin:
        text = re.sub(r'(?m)^' + margin, '', text)
    return text
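Because whitespace-only lines are blanked before the margin is computed, a blank line inside the block never shrinks the common prefix:

```python
import textwrap

text = "    def f():\n\n        return 1\n"
# The empty line is ignored; the margin is the four-space prefix
# shared by the two non-blank lines.
print(textwrap.dedent(text))
```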
shorten: one-line summary
shorten collapses all internal whitespace to single spaces, then calls fill with max_lines=1 and the caller-supplied placeholder (default ' [...]'). If the collapsed text fits in width it is returned unchanged; otherwise as many trailing words as necessary are dropped and the placeholder is appended.
# CPython: Lib/textwrap.py:443 shorten
def shorten(text, width, **kwargs):
    w = TextWrapper(width=width, max_lines=1, **kwargs)
    return w.fill(' '.join(text.split()))
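Both behaviours, collapsing and truncating, match the examples in the library documentation:

```python
import textwrap

# Internal runs of whitespace collapse to single spaces.
print(textwrap.shorten("Hello  world!", width=12))                   # 'Hello world!'

# When the text does not fit, trailing words are dropped and the
# placeholder is appended in their place.
print(textwrap.shorten("Hello world", width=10, placeholder="..."))  # 'Hello...'
```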
gopy notes
The compiled regexes are class attributes of TextWrapper and are Python-only; they do not need porting as-is. A Go implementation can use regexp with equivalent patterns, but must match CPython's tokenization output exactly or wrap results will differ at hyphenation boundaries.
Chunk length is measured per-token with plain len(), i.e. Unicode code points, not display columns; CPython's textwrap never consults wcswidth. A Go port should use utf8.RuneCountInString per chunk; anything fancier (east-Asian wide-character widths) would diverge from CPython's output.
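A small sketch of the code-point counting: wide CJK characters count as one each, so a wrapped line can render wider than width terminal columns:

```python
import textwrap

# Six CJK characters with no spaces form one long "word"; it is sliced
# at width code points, even though each renders two columns wide.
print(textwrap.wrap("ああああああ", width=3))
```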
TextWrapper attributes (initial_indent, subsequent_indent, placeholder, max_lines) must all be mutable between calls. The Go struct should expose them as exported fields, not constructor-only options.
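A sketch of the mutability the Go struct needs to preserve (pure Python, one wrapper reused with different settings between calls):

```python
import textwrap

w = textwrap.TextWrapper(width=10)
print(w.fill("aa bb cc dd"))   # 'aa bb cc\ndd'

# Options are plain instance attributes; mutating them affects later calls.
w.width = 20
w.initial_indent = "> "
print(w.fill("aa bb cc dd"))   # '> aa bb cc dd'
```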
CPython 3.14 changes
No new public symbols were added in 3.14. The wordsep_re pattern received a minor fix for sequences of hyphens adjacent to punctuation that previously could generate empty tokens, causing an off-by-one in line length accounting. The dedent regex for whitespace-only lines was also updated to handle mixed tabs and spaces more consistently with expandtabs(8) semantics.