Lib/html/ (part 2)

Source:

cpython 3.14 @ ab2d84fe1023/Lib/html/parser.py

This annotation covers the push-parser internals. See lib_html_detail for html.escape, entity tables, and HTMLParser.__init__.

Map

| Lines | Symbol | Role |
| --- | --- | --- |
| 1-80 | `HTMLParser.feed` | Append data to the buffer and call `goahead` |
| 81-200 | `goahead` | Main dispatch loop: tags, data, comments, declarations |
| 201-320 | `handle_starttag` / `handle_endtag` | Default no-op hooks for subclassing |
| 321-420 | `unescape` | Convert references such as `&amp;`, `&nbsp;`, `&#8217;` to Unicode |
| 421-500 | `handle_data` / `handle_comment` | Default hooks; `handle_entityref` is never called when `convert_charrefs` is true |

Reading

`HTMLParser.feed`

# CPython: Lib/html/parser.py:108 HTMLParser.feed
def feed(self, data):
    """Feed some text to the parser.

    Call this as many times as you want, with new text each time.
    """
    self.rawdata = self.rawdata + data
    self.goahead(0)

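feed is incremental, not idempotent: each call appends to the buffer, and an incomplete tag at the end of the input is left in self.rawdata and re-processed on the next call. close() calls goahead(1) (end=1), which flushes whatever is still buffered instead of waiting for more input.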
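
The buffering can be observed from a subclass. The sketch below is illustrative only: EventLog is a hypothetical helper, not part of CPython. It splits a start tag across two feed calls and records what the hooks receive.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Hypothetical helper: record each callback in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


parser = EventLog()
parser.feed("<p cla")             # incomplete start tag stays in the buffer
after_first = list(parser.events)
parser.feed('ss="x">hi</p>')      # the buffered "<p cla" is completed here
parser.close()                    # flush anything still pending

print(after_first)
print(parser.events)
```

No hook fires after the first call because the start tag is not yet complete. The second call sees the joined text and reports the start tag, the data, and the end tag.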

`goahead`

# CPython: Lib/html/parser.py:130 HTMLParser.goahead
def goahead(self, end):
    rawdata = self.rawdata
    i = 0
    n = len(rawdata)
    while i < n:
        if self.convert_charrefs and not self.cdata_elem:
            j = rawdata.find('<', i)
            if j < 0:
                if end:
                    self.handle_data(rawdata[i:n])
                    i = n
                break
            if i < j:
                self.handle_data(rawdata[i:j])
            i = self.updatepos(i, j)
        match = self.interesting.search(rawdata, i)
        if not match:
            ...
            break
        j = match.start()
        if i < j:
            self.handle_data(rawdata[i:j])
        i = j
        startswith = rawdata.startswith
        if startswith('<', i):
            if starttagopen.match(rawdata, i):
                k = self.parse_starttag(i)
            ...

goahead is the hot path. self.interesting is a precompiled regex that finds the next < or &, so everything before the match is plain text. The convert_charrefs fast path (added in 3.4, on by default since 3.5) searches only for < and converts character references over a whole text run at once.
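
The effect of the two paths is visible in which hooks fire. A minimal sketch, assuming a hypothetical RefLog subclass and run helper (neither is in CPython):

```python
from html.parser import HTMLParser


class RefLog(HTMLParser):
    """Hypothetical helper: record data and reference callbacks."""

    def __init__(self, *, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def run(convert_charrefs):
    """Parse a fixed snippet and return the recorded events."""
    parser = RefLog(convert_charrefs=convert_charrefs)
    parser.feed("<p>a &amp; b &#8217;</p>")
    parser.close()
    return parser.events


fast = run(True)    # one data event with the references already converted
slow = run(False)   # text is split at each '&' and the ref hooks are called
print(fast)
print(slow)
```

With convert_charrefs=True the subclass sees a single text run. With False it sees the text in pieces, interleaved with handle_entityref and handle_charref calls.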

`unescape`

# CPython: Lib/html/parser.py:431 HTMLParser.unescape
@staticmethod
def unescape(s):
    """Deprecated — use html.unescape() instead."""
    import warnings
    warnings.warn(...)
    return html.unescape(s)

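html.unescape in Lib/html/__init__.py uses re.sub with a callback that looks up named references in html.entities.html5 and converts numeric ones with chr(int(...)). For example, &#8217; → ’ (right single quotation mark). HTMLParser.unescape itself was deprecated in 3.4 and removed in 3.9; current versions of the parser call html.unescape directly.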
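
The re.sub-with-callback shape can be sketched as follows. This is a simplified illustration, not the stdlib implementation: simple_unescape, _REF, and _replace are hypothetical names, the pattern only accepts references terminated by ;, and the special cases that html.unescape handles (unterminated references, invalid code points) are left out.

```python
import html
import re
from html.entities import html5

# Simplified pattern: decimal, hex, or named reference, terminated by ';'.
_REF = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def _replace(match):
    """Return the replacement text for one matched reference."""
    ref = match.group(1)
    if ref[0] == "#":
        # Numeric reference: hex if it starts with '#x' or '#X'.
        code = int(ref[2:], 16) if ref[1] in "xX" else int(ref[1:])
        try:
            return chr(code)
        except (ValueError, OverflowError):
            return match.group(0)  # out of range: leave the text alone
    # Keys in html5 for terminated references include the trailing ';'.
    return html5.get(ref + ";", match.group(0))


def simple_unescape(s):
    """Convert ';'-terminated character references in s."""
    if "&" not in s:
        return s  # nothing to do
    return _REF.sub(_replace, s)


sample = "&amp; &#8217; &#x2019; &rsquo;"
print(simple_unescape(sample))
print(simple_unescape(sample) == html.unescape(sample))
```

For real input, html.unescape is the function to call; the sketch only shows how one regex plus one callback covers both named and numeric references.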

`handle_starttag` attrs parsing

# CPython: Lib/html/parser.py:210 handle_starttag
def handle_starttag(self, tag, attrs):
    """Overrideable hook called for <tag attr="val" ...>.
    attrs is a list of (name, value) pairs; value is None for bare attrs.
    """
    pass

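Attribute parsing happens in parse_starttag, which stores the raw tag text in self.__starttag_text (name-mangled to _HTMLParser__starttag_text and returned by get_starttag_text()) and matches each attribute with the attrfind_tolerant regex. Tag and attribute names are lowercased, and values are unquoted and unescaped before being passed to the hook.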
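
What the hook receives can be checked with a small subclass. A minimal sketch, where AttrLog is a hypothetical name used only for illustration:

```python
from html.parser import HTMLParser


class AttrLog(HTMLParser):
    """Hypothetical helper: record (tag, attrs) for each start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


parser = AttrLog()
# Mixed-case names, a bare attribute, and escaped characters in values.
parser.feed('<A HREF="?a=1&amp;b=2" disabled Title=\'it&#39;s\'>')
parser.close()

for tag, attrs in parser.seen:
    print(tag, attrs)
```

The bare attribute arrives with a value of None, names arrive lowercased, and both quoted values arrive with their references already converted.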

gopy notes

HTMLParser is module/html/parser.HTMLParser in module/html/parser/module.go. feed appends to a strings.Builder. goahead uses Go's strings.Index and a compiled regexp.Regexp for interesting. unescape delegates to module/html.Unescape.
