Lib/html/parser.py
Source:
cpython 3.14 @ ab2d84fe1023/Lib/html/parser.py
HTMLParser is a lenient, event-driven HTML/XHTML parser. Unlike the strict SAX-style XML parsers, it tolerates malformed markup. Users subclass it and override the handle_* methods.
Map
| Lines | Symbol | Role |
|---|---|---|
| 1-50 | Imports | _markupbase, re patterns for tags |
| 51-150 | HTMLParser.__init__ | Parser state: rawdata buffer, CDATA_CONTENT_ELEMENTS |
| 151-280 | HTMLParser.feed | Append to buffer, call goahead(end=0) |
| 281-400 | goahead | Core scan loop: dispatch on <, &, or text |
| 401-500 | handle_* stubs | Default no-op callbacks for user override |
Reading
State machine
# CPython: Lib/html/parser.py:160 HTMLParser.feed
def feed(self, data):
    """Feed some text to the parser. It is processed insofar as it consists
    of complete elements; incomplete data is buffered until more data is fed
    or close() is called."""
    self.rawdata = self.rawdata + data
    self.goahead(0)
Data is accumulated in rawdata. goahead(0) processes as much as possible; goahead(1) is called by close() to force parsing incomplete data.
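The buffering can be observed by splitting a tag across two feed() calls. This is a minimal sketch; the EventRecorder class and its events list are illustrative names, not part of html.parser.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that records callbacks instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


parser = EventRecorder()

# The first chunk ends in the middle of a start tag, so nothing is reported yet.
parser.feed('<p cl')
events_after_first_chunk = list(parser.events)

# The second chunk completes the tag; the buffered text is parsed together with it.
parser.feed('ass="x">Hi</p>')
parser.close()
```

After the first call no callback has fired; after the second, the start tag, its text, and the end tag are reported in order.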
Tag scanning
# CPython: Lib/html/parser.py:295 goahead
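def goahead(self, end):
    # Abridged: elided branches are marked with "..."
    rawdata = self.rawdata
    i = 0
    n = len(rawdata)
    while i < n:
        if self.convert_charrefs and not self.cdata_elem:
            j = rawdata.find('<', i)
            if j < 0:
                ...  # no more tags: wait if a charref may be cut off, else j = n
        else:
            ...  # search for the next '<' or '&' instead
        if i < j:
            self.handle_data(rawdata[i:j])  # text between tags (unescaped first when convert_charrefs is on)
        i = self.updatepos(i, j)
        if i == n:
            break
        startswith = rawdata.startswith
        if startswith('<', i):
            if starttagopen.match(rawdata, i):  # '<' followed by a letter
                k = self.parse_starttag(i)
            elif startswith("</", i):
                k = self.parse_endtag(i)
            elif startswith("<!--", i):
                k = self.parse_comment(i)
            elif startswith("<?", i):
                k = self.parse_pi(i)
            elif startswith("<!", i):
                k = self.parse_html_declaration(i)
            ...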
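The dispatch on the characters following `<` can be seen from the outside by overriding the corresponding callbacks. The sketch below assumes a hypothetical DispatchRecorder subclass; only the handle_* method names come from html.parser.

```python
from html.parser import HTMLParser


class DispatchRecorder(HTMLParser):
    """Hypothetical helper: records which handler each '<...' construct reaches."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_decl(self, decl):
        # Reached for '<!DOCTYPE ...>' declarations.
        self.events.append(("decl", decl))

    def handle_comment(self, data):
        # Reached for '<!-- ... -->'.
        self.events.append(("comment", data))

    def handle_pi(self, data):
        # Reached for '<? ... >' processing instructions.
        self.events.append(("pi", data))


recorder = DispatchRecorder()
recorder.feed("<!DOCTYPE html><!-- note --><?proc x>")
recorder.close()
```

Each construct arrives at its handler with the surrounding delimiters removed.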
handle_starttag
# CPython: Lib/html/parser.py:420 handle_starttag
def handle_starttag(self, tag, attrs):
    """Called for each opening tag.
    tag is the tag name lowercased, attrs is a list of (name, value) pairs.
    """
    pass  # override in subclass
Usage:
# CPython: Lib/html/parser.py:1 example
from html.parser import HTMLParser

class MyParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print(f"Start: {tag}, attrs={attrs}")

    def handle_data(self, data):
        print(f"Data: {data!r}")

MyParser().feed('<p class="x">Hello</p>')
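A slightly larger sketch of the same pattern collects link targets from the attrs list. LinkCollector, links, and last_attrs are illustrative names. Tag and attribute names arrive lowercased, and an attribute written without a value is reported with None.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that gathers href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.last_attrs = None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        self.last_attrs = attrs
        # attrs is a list of (name, value) pairs; dict() is enough for a lookup.
        href = dict(attrs).get("href")
        if href is not None:
            self.links.append(href)


collector = LinkCollector()
collector.feed('<a href="https://example.com/a">A</a><A HREF="/b" download>B</A>')
collector.close()
```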
Entity and character reference handling
# CPython: Lib/html/parser.py:375 handle_charref
def handle_charref(self, name):
    """Called for numeric character references (&#NNN; or &#xNNN;)."""
    pass

def handle_entityref(self, name):
    """Called for named character references (&amp;, &lt;, etc.)."""
    pass
With convert_charrefs=True (the default since 3.5), character and entity references outside script/style elements are converted automatically before handle_data is called, and handle_charref and handle_entityref are never invoked.
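The difference between the two modes can be sketched by parsing the same markup twice. RefRecorder and record are hypothetical helper names; convert_charrefs is the real keyword-only constructor argument.

```python
from html.parser import HTMLParser


class RefRecorder(HTMLParser):
    """Hypothetical helper: records text and reference callbacks."""

    def __init__(self, *, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def record(markup, *, convert_charrefs):
    parser = RefRecorder(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


markup = "<p>a &amp; b &#62; c</p>"
converted = record(markup, convert_charrefs=True)   # one decoded text chunk
raw = record(markup, convert_charrefs=False)        # text split around each reference
```

With conversion on, the text arrives as a single decoded string; with it off, the parser reports each reference by name and leaves decoding to the subclass.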
unescape
# CPython: Lib/html/parser.py:490 unescape
# HTMLParser.unescape was deprecated in 3.4 and removed in 3.9; use the html module instead:
# html.unescape('&lt;br&gt;') == '<br>'
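For code that still relies on the removed method, the module-level functions are a direct replacement. A short sketch using only html.escape and html.unescape from the standard library; the variable names are illustrative.

```python
import html

# html.escape also escapes quotes by default (quote=True).
escaped = html.escape('<a href="x">')

# html.unescape reverses the escaping.
restored = html.unescape(escaped)

# Named, decimal, and hexadecimal references are all decoded.
mixed = html.unescape("&lt;br&gt; &amp; &#x41;")
```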
gopy notes
html.parser is pure Python. It is importable in gopy once re and _markupbase work. The goahead method and the parse_* helpers it calls rely on regular expressions (tagfind_tolerant, attrfind_tolerant, etc.) that are compiled at import time.
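A possible smoke check for a port, assuming the module-level pattern names match CPython's; check_html_parser_prerequisites is a hypothetical helper, not part of gopy or CPython.

```python
import re
import _markupbase
import html.parser


def check_html_parser_prerequisites():
    """Report whether the pieces html.parser depends on are in place."""
    return {
        # HTMLParser inherits its declaration parsing from _markupbase.ParserBase.
        "subclasses_parserbase": issubclass(
            html.parser.HTMLParser, _markupbase.ParserBase
        ),
        # The module-level patterns exist as soon as the import succeeds.
        "tagfind_compiled": isinstance(html.parser.tagfind_tolerant, re.Pattern),
        "attrfind_compiled": isinstance(html.parser.attrfind_tolerant, re.Pattern),
    }


report = check_html_parser_prerequisites()
```

If the import itself fails, the traceback usually points at the missing dependency before any of these checks run.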