Lib/html/parser.py (part 3)

Source:

cpython 3.14 @ ab2d84fe1023/Lib/html/parser.py

This annotation covers the incremental HTML parsing loop. See lib_html2_detail for HTMLParser.__init__, entity reference handling, and the convert_charrefs option.

Map

Lines	Symbol	Role
1-80	`HTMLParser.feed`	Incrementally feed data to the parser
81-180	`HTMLParser.goahead`	Main parsing loop
181-280	`handle_starttag` / `handle_endtag`	Override-able tag callbacks
281-380	`handle_data`	Character data callback
381-500	Attribute parsing	`attrfind_tolerant` regex

Reading

`HTMLParser.feed`

# CPython: Lib/html/parser.py:108 feed
def feed(self, data):
    self.rawdata = self.rawdata + data
    self.goahead(0)

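feed appends new data to self.rawdata, the buffer of input not yet consumed (including any incomplete trailing token), and then calls goahead(0). The 0 is the end flag: a false value means more input may follow, so goahead must not force end-of-file handling. close() calls goahead with a true value instead. Incremental feeds allow parsing HTML as it streams in.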
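As a rough illustration of incremental feeding, the sketch below pushes one small document through feed in three pieces. TextCollector and the split points are made up for this example. The parser may report a single run of text through several handle_data calls, so the pieces are joined afterwards rather than compared one by one.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records every text chunk the parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # May be called more than once for a single run of text.
        self.chunks.append(data)


parser = TextCollector()
for piece in ("<p>Hel", "lo, ", "world</p>"):  # arbitrary split points
    parser.feed(piece)
parser.close()  # process anything still buffered

text = "".join(parser.chunks)
print(text)
```

Joining the chunks gives "Hello, world" regardless of where the input was split.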

`HTMLParser.goahead`

# CPython: Lib/html/parser.py:130 goahead
def goahead(self, end):
    rawdata = self.rawdata
    i = 0
    n = len(rawdata)
    while i < n:
        if self.convert_charrefs and not self.cdata_elem:
            j = rawdata.find('<', i)
            if j < 0: j = n
        else:
            match = self.interesting.search(rawdata, i)  # '<' or '&'
            j = match.start() if match else n
        if i < j:
            self.handle_data(rawdata[i:j])
        i = self.updatepos(i, j)
        if i == n: break
        startswith = rawdata.startswith
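        if startswith('<', i):
            if starttagopen.match(rawdata, i):  # '<' followed by a letter
                k = self.parse_starttag(i)      # calls handle_starttag
            elif startswith("</", i):
                k = self.parse_endtag(i)        # calls handle_endtag
            elif startswith("<!--", i):
                k = self.parse_comment(i)       # calls handle_comment
            ...
        elif startswith('&', i):
            # condensed: the reference is matched by regex, passed by name to
            # handle_charref(name) or handle_entityref(name), and this branch
            # advances i itself
            ...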
        if k < 0:
            break  # incomplete token: wait for more data
        i = self.updatepos(i, k)
    self.rawdata = rawdata[i:]  # save unconsumed remainder

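goahead scans forward for the next markup character. With convert_charrefs enabled and outside a CDATA content element it looks only for <; otherwise it searches with the self.interesting regex. Text before that position is passed to handle_data, and the markup at that position is dispatched by its prefix (start tag, end tag, comment, reference) to the code that eventually calls the matching handler. If a token is incomplete (e.g., < without a matching >), k < 0, the loop breaks, and the unconsumed remainder is kept in self.rawdata for the next feed call.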
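The buffering of an incomplete token can be observed from outside. The sketch below is only an illustration: TagLog is a hypothetical helper, and reading parser.rawdata peeks at an internal attribute purely to show what goahead left behind.

```python
from html.parser import HTMLParser


class TagLog(HTMLParser):
    """Hypothetical helper: records start tags with their attributes."""

    def __init__(self):
        super().__init__()
        self.starts = []

    def handle_starttag(self, tag, attrs):
        self.starts.append((tag, attrs))


parser = TagLog()

parser.feed("<p cla")             # start tag is cut off mid-attribute
pending = parser.rawdata          # unconsumed remainder kept by goahead
seen_early = list(parser.starts)  # snapshot: no callback has fired yet

parser.feed('ss="x">')            # the rest of the tag arrives
print(repr(pending), seen_early, parser.starts)
```

After the first feed nothing has been reported and the buffer still holds "<p cla". The second feed completes the tag, and handle_starttag fires once with the full attribute list.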

Attribute parsing

# CPython: Lib/html/parser.py:260 handle_starttag (attr parsing excerpt)
attrfind_tolerant = re.compile(
    r'((?<=[\'"\s/])[^\s/>][^\s/=>]*)(\s*=+\s*'
    r'(\'[^\']*\'|"[^"]*"|(?![\'"])[^>\s]*))?(?:\s|/(?!>))*',
    re.ASCII)

HTML attributes are parsed by attrfind_tolerant, which the start-tag parser applies repeatedly to the text after the tag name. It handles quoted values ('val', "val"), unquoted values (<img width=100>), and boolean attributes (<input disabled>), whose value is reported as None. The regex is "tolerant": it does not reject malformed attributes, mirroring browser behavior.
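The three attribute forms can be checked through the public callback rather than by calling the regex directly. This is a minimal sketch: AttrLog is a hypothetical helper and the input tag is invented for the example.

```python
from html.parser import HTMLParser


class AttrLog(HTMLParser):
    """Hypothetical helper: keeps the attrs list of each start tag."""

    def __init__(self):
        super().__init__()
        self.attrs_seen = []

    def handle_starttag(self, tag, attrs):
        self.attrs_seen.append(attrs)


parser = AttrLog()
# Double-quoted, single-quoted, unquoted, and valueless attributes.
parser.feed('<input type="text" name=\'q\' size=10 disabled>')

attrs = parser.attrs_seen[0]
print(attrs)
# [('type', 'text'), ('name', 'q'), ('size', '10'), ('disabled', None)]
```

Quotes are stripped from quoted values, the unquoted value arrives as the string '10', and the valueless attribute is paired with None.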

Callback model

# CPython: Lib/html/parser.py:320 handle_* stubs
def handle_starttag(self, tag, attrs): pass
def handle_endtag(self, tag): pass
def handle_data(self, data): pass
def handle_comment(self, data): pass
def handle_entityref(self, name): pass
def handle_charref(self, name): pass

HTMLParser uses the callback pattern: subclasses override the handle_* methods, which do nothing in the base class. For a parser instance, parser.feed('<p>Hello</p>') calls handle_starttag('p', []), handle_data('Hello'), then handle_endtag('p'). Attributes are passed as a list of (name, value) tuples.
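The call sequence described above can be recorded with a small subclass. EventLog and its event tuples are hypothetical names chosen for this sketch.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Hypothetical helper: appends one tuple per callback invocation."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


parser = EventLog()
parser.feed("<p>Hello</p>")

events = parser.events
print(events)
# [('start', 'p', []), ('data', 'Hello'), ('end', 'p')]
```

Handlers that are not overridden, such as handle_comment here, keep their no-op behavior, so a subclass only needs to implement the events it cares about.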

gopy notes

html.parser is a pure-Python module; gopy runs it directly. goahead uses re.compile and str.find, routing through module/re and objects.StrFind. The callback methods are virtual: subclasses override them as regular Python methods, invoked via the standard __call__ dispatch.
