Lib/html/parser.py
Source: cpython 3.14 @ ab2d84fe1023/Lib/html/parser.py
Map
| Lines | Symbol | Purpose |
|---|---|---|
| 1–50 | imports, constants | Compiled regexes: tagfind_tolerant, attrfind_tolerant, entity patterns |
| 51–100 | HTMLParser.__init__ | Buffer init, convert_charrefs flag (default True) |
| 101–160 | HTMLParser.feed, close | Incremental buffering; close flushes end-of-stream |
| 161–310 | HTMLParser.goahead | Central state machine: DATA, TAG, ATTR, COMMENT, CDATA dispatch |
| 311–390 | handle_starttag, handle_endtag | Callback stubs; subclasses override these |
| 391–440 | handle_data, handle_comment | Data and comment callback stubs |
| 441–470 | handle_entityref, handle_charref | Entity and character reference callback stubs |
| 471–500 | getpos, reset | (lineno, offset) position tracking (1-based line, 0-based column); buffer reset |
Reading
feed buffering and the goahead loop
feed appends incoming text to self.rawdata and immediately calls goahead(0). The 0 argument signals "not end of file", so the method leaves incomplete tokens in the buffer rather than reporting an error. close calls goahead(1) to flush any remaining content.
# CPython: Lib/html/parser.py:101 HTMLParser.feed
def feed(self, data):
    self.rawdata = self.rawdata + data
    self.goahead(0)

def close(self):
    HTMLParser.goahead(self, 1)
    self.rawdata = ''
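The deferral of incomplete tokens can be observed from the outside. The following is a minimal sketch, not part of the CPython source: EventLog and parse_in_chunks are hypothetical helper names, and the sketch only assumes the public feed/close API described above.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Hypothetical helper: records every callback as a tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_in_chunks(chunks):
    """Feed each chunk separately; record the event count after each feed."""
    parser = EventLog()
    counts = []
    for chunk in chunks:
        parser.feed(chunk)
        counts.append(len(parser.events))
    parser.close()
    return parser.events, counts


# The start tag is split in the middle of its attribute name.
events, counts = parse_in_chunks(['<a hre', 'f="x">hi</a>'])
print(counts)  # [0, 3]: nothing is reported until the tag is complete
print(events)
```

The first feed produces no callbacks because the start tag is still open; the second feed completes it and all three events arrive together.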
goahead scans self.rawdata with a while loop. Each iteration locates the next < with rawdata.find('<', i). Text before that position is dispatched to handle_data, after character references are resolved when convert_charrefs is set. When no < is found and end is false, a tail that may end in an incomplete character reference (an & with no terminating ; or whitespace yet) stays in the buffer for the next feed call.
# CPython: Lib/html/parser.py:161 HTMLParser.goahead
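def goahead(self, end):
    rawdata = self.rawdata
    i = 0
    n = len(rawdata)
    while i < n:
        # Text path: only taken outside <script>/<style> (cdata_elem unset)
        if self.convert_charrefs and not self.cdata_elem:
            j = rawdata.find('<', i)
            if j < 0:
                if end:
                    # End of stream: flush the tail as data
                    self.handle_data(rawdata[i:n])
                    i = n
                break
        # ... tag dispatch continues
    # Keep whatever was not consumed for the next feed call
    self.rawdata = rawdata[i:]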
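One observable consequence of the buffering is that a character reference split across two feed calls is still resolved as a whole. A minimal sketch, assuming the default convert_charrefs=True; TextLog is a hypothetical helper name, not something defined in the module:

```python
from html.parser import HTMLParser


class TextLog(HTMLParser):
    """Hypothetical helper: collects the strings passed to handle_data."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = TextLog()
parser.feed("AT&am")                # the reference is cut in half
after_first = list(parser.chunks)
parser.feed("p;T")                  # the remainder arrives
parser.close()
after_second = list(parser.chunks)

print(after_first)   # []
print(after_second)  # ['AT&T']
```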
Attribute parsing and handle_starttag
When goahead identifies a start tag it calls handle_starttag(tag, attrs). The attrs argument is a list of (name, value) tuples. Bare attributes such as disabled produce (name, None). The tagfind_tolerant and attrfind_tolerant regexes accept spaces around = and unquoted values to handle real-world malformed HTML.
# CPython: Lib/html/parser.py:311 HTMLParser.handle_starttag
def handle_starttag(self, tag, attrs):
    pass  # override in subclass

def handle_endtag(self, tag):
    pass
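A short sketch of what a subclass receives in attrs. AttrLog and start_tags are hypothetical names used only for illustration; the markup deliberately mixes spaces around =, an unquoted value, and a bare attribute.

```python
from html.parser import HTMLParser


class AttrLog(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


def start_tags(markup):
    parser = AttrLog()
    parser.feed(markup)
    parser.close()
    return parser.tags


tags = start_tags('<INPUT type = text disabled value="a b">')
print(tags)
# [('input', [('type', 'text'), ('disabled', None), ('value', 'a b')])]
```

Tag and attribute names arrive lowercased, quotes are stripped from values, and the bare attribute carries None rather than an empty string.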
In CPython 3.14, tagfind_tolerant was tightened to reject tag names beginning with a digit. Markup such as <3foo> is now passed to handle_data as raw text instead of being treated as a tag, aligning behavior more closely with the HTML5 specification.
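The effect on callers can be checked with a small probe. This is a sketch under two assumptions: Collector and classify are hypothetical names, and the text may arrive in more than one handle_data call, so the probe joins the pieces instead of relying on how they are split.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Hypothetical helper: separates start tags from text."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def classify(markup):
    """Return (start tag names, all text joined into one string)."""
    parser = Collector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags, "".join(parser.text)


digit_first = classify("<3foo> ok")   # name starts with a digit
letter_first = classify("<b3> ok")    # digit later in the name
print(digit_first)    # ([], '<3foo> ok')
print(letter_first)   # (['b3'], ' ok')
```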
CDATA mode for script and style
Inside <script> and <style> elements the parser enters CDATA mode via set_cdata_mode. In this mode goahead scans for the literal closing tag string instead of treating every < as a potential new tag. This prevents content like if (a < b) from being misread as markup.
# CPython: Lib/html/parser.py:270 set_cdata_mode
def set_cdata_mode(self, elem):
    self.cdata_elem = elem.lower()

def clear_cdata_mode(self):
    self.cdata_elem = None

# Inside goahead, when self.cdata_elem is set:
#     j = rawdata.lower().find('</' + self.cdata_elem, i)
#     if j < 0:
#         break  # wait for more data
#     self.handle_data(rawdata[i:j])
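A minimal sketch of CDATA mode as seen by a subclass. ScriptProbe is a hypothetical name; the script body contains both a comparison operator and a tag-like string literal, and the probe joins the data pieces in case they are delivered in more than one call.

```python
from html.parser import HTMLParser


class ScriptProbe(HTMLParser):
    """Hypothetical helper: records start tags and all text."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


probe = ScriptProbe()
probe.feed('<script>if (a < b) { show("<p>"); }</script><p>done</p>')
probe.close()

seen_tags = probe.start_tags
seen_text = "".join(probe.text)
print(seen_tags)  # ['script', 'p']: the "<p>" inside the script is not a tag
print(seen_text)
```

Only the real <p> after the closing script tag is reported as a start tag; everything inside the script element reaches handle_data unchanged.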
Comment parsing follows the same incomplete-token deferral: if --> is not yet present in the buffer, the comment fragment stays in rawdata until the next feed call delivers the rest, and only then is handle_comment called.
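The comment case can be sketched the same way as the split start tag. CommentLog is a hypothetical helper name used only for this illustration.

```python
from html.parser import HTMLParser


class CommentLog(HTMLParser):
    """Hypothetical helper: collects comment bodies."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


parser = CommentLog()
parser.feed("<!-- par")             # no closing --> yet
before = list(parser.comments)
parser.feed("tial -->")             # the terminator arrives
parser.close()
after = list(parser.comments)

print(before)  # []
print(after)   # [' partial ']
```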
gopy notes
Status: not yet ported.
Planned package path: module/html/ (public name html.parser).
Key porting decisions:
- convert_charrefs (default True since Python 3.5) causes goahead to batch all text between tags and resolve &amp;, &#39;, and similar references before calling handle_data. A port must apply the same batching or callers will see raw entity text.
- CDATA_CONTENT_ELEMENTS is a class-level tuple containing 'script' and 'style'. It is consulted during goahead's start-tag parsing, not in handle_starttag. A port must preserve this placement.
- getpos returns (lineno, offset), where lineno is 1-based and offset is 0-based. Line counts are maintained by counting \n in consumed data. A port must update position counters at the same points CPython does or getpos will drift.
- The 3.14 digit-in-tagname restriction is baked into tagfind_tolerant. A port should translate the updated regex pattern unconditionally.
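Two of these decisions lend themselves to a conformance check that a port could replay against its own implementation. The sketch below runs against CPython's html.parser; Probe and run are hypothetical names, and the inputs are illustrative rather than taken from the CPython test suite.

```python
from html.parser import HTMLParser


class Probe(HTMLParser):
    """Hypothetical helper: logs callbacks, with getpos() for start tags."""

    def __init__(self, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, self.getpos()))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


def run(markup, convert_charrefs=True):
    parser = Probe(convert_charrefs=convert_charrefs)
    parser.feed(markup)
    parser.close()
    return parser.events


# Reference handling: one resolved string vs. separate callbacks.
batched = run("a &amp; b")
unbatched = run("a &amp; b", convert_charrefs=False)
print(batched)    # [('data', 'a & b')]
print(unbatched)  # [('data', 'a '), ('entityref', 'amp'), ('data', ' b')]

# Position tracking: getpos() as seen at the start of each tag.
positions = [e for e in run("<p>\n  <b>") if e[0] == "start"]
print(positions)  # [('start', 'p', (1, 0)), ('start', 'b', (2, 2))]
```

The line number printed here starts at 1 while the column offset starts at 0, which is the convention a port would need to reproduce for getpos to agree with CPython.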