urllib/parse.py: URL parsing internals
Lib/urllib/parse.py is the reference implementation for all URL parsing,
joining, and encoding utilities in the standard library. At roughly 1600 lines
it is one of the larger pure-Python stdlib files. CPython 3.14 added stricter
scheme validation and deprecated several previously silent coercions.
Map
| Lines | Symbol | Role |
|---|---|---|
| 1-60 | module constants | uses_netloc, uses_relative, scheme_chars, MAX_CACHE_SIZE |
| 62-100 | _coerce_args / _noop | Byte-vs-str unification helpers |
| 102-160 | SplitResult / ParseResult | Named tuples with geturl() reconstruction |
| 162-260 | urlsplit | Splits URL into scheme, netloc, path, query, fragment |
| 262-320 | urlparse | Calls urlsplit then splits path into path+params |
| 322-400 | urljoin | RFC 3986 base-resolution algorithm |
| 402-460 | urlunparse / urlunsplit | Reconstructs URL from component tuple |
| 462-560 | urlencode | Encodes query string; delegates to quote_plus |
| 562-640 | quote / quote_plus / quote_from_bytes | Percent-encoding |
| 642-720 | unquote / unquote_plus / unquote_to_bytes | Percent-decoding |
| 722-820 | parse_qs / parse_qsl | Query-string parsing with duplicate-key handling |
| 822-900 | splitport / splituser etc. | Legacy helpers (deprecated since 3.8) |
| 902-1000 | _splitnetloc / _checknetloc | Internal netloc extraction |
| 1000-1600 | _encode_result, _decode_args | Bytes/str coercion round-trips |
Reading
urlsplit and SplitResult
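urlsplit is the canonical low-level split. It does not decompose the path
further (that is urlparse's job). The result is a SplitResult named tuple
whose geturl() reassembles the components. For a URL with a lowercase scheme
and no redundant delimiters this round-trips the original string; otherwise
the result may be a slightly different but equivalent URL: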

    from collections import namedtuple

    SplitResult = namedtuple('SplitResult', 'scheme netloc path query fragment')

    def urlsplit(urlstring, scheme='', allow_fragments=True):
        # Normalise scheme.
        netloc = query = fragment = ''
        i = urlstring.find(':')
        if i > 0 and urlstring[:i].isidentifier():  # 3.14: stricter scheme check
            scheme, urlstring = urlstring[:i].lower(), urlstring[i+1:]
        # Extract the authority; _splitnetloc is the module-internal helper.
        if urlstring[:2] == '//':
            netloc, urlstring = _splitnetloc(urlstring, 2)
            if '[' in netloc and ']' not in netloc:
                raise ValueError("Invalid IPv6 URL")
        # Peel off the fragment first, then the query.
        if allow_fragments and '#' in urlstring:
            urlstring, fragment = urlstring.split('#', 1)
        if '?' in urlstring:
            urlstring, query = urlstring.split('?', 1)
        return SplitResult(scheme, netloc, urlstring, query, fragment)

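The 3.14 change (isidentifier() check) rejects schemes that start with a
digit. It only approximates RFC 3986 section 3.1, though: str.isidentifier()
also accepts underscores and non-ASCII letters, which the RFC forbids, and it
rejects the +, - and . characters that the RFC allows after the first letter.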
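To make the gap between an identifier check and the RFC grammar concrete, here is a minimal sketch. The helper name is_rfc3986_scheme and the regular expression are illustrative assumptions, not code from parse.py; the pattern is transcribed from the scheme production in RFC 3986 section 3.1.

```python
import re

# RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_rfc3986_scheme(candidate: str) -> bool:
    """Return True if candidate matches the RFC 3986 scheme grammar.

    Hypothetical helper for illustration; not part of urllib.parse.
    """
    return _SCHEME_RE.fullmatch(candidate) is not None


if __name__ == "__main__":
    # Print both verdicts side by side so the disagreements are visible.
    for candidate in ("http", "svn+ssh", "coap.tcp", "my_scheme", "1http", "héllo"):
        print(f"{candidate!r}: rfc={is_rfc3986_scheme(candidate)} "
              f"isidentifier={candidate.isidentifier()}")
```

The two checks agree on http and 1http but disagree on svn+ssh, coap.tcp, my_scheme and héllo, which is why a port may prefer an explicit character-class test.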
urljoin base-resolution algorithm
urljoin implements the reference-resolution algorithm of RFC 3986 section
5.2.2. The excerpt follows the RFC pseudocode, with posixpath.normpath
standing in for the RFC's remove_dot_segments step:

    import posixpath

    def urljoin(base, url, allow_fragments=True):
        if not base:
            return url
        if not url:
            return base
        base, url, _coerce_result = _coerce_args(base, url)
        bscheme, bnetloc, bpath, bparams, bquery, bfragment = \
            urlparse(base, '', allow_fragments)
        scheme, netloc, path, params, query, fragment = \
            urlparse(url, bscheme, allow_fragments)
        if scheme != bscheme or scheme not in uses_relative:
            # Different or non-relative scheme: url stands on its own.
            return _coerce_result(url)
        if scheme in uses_netloc:
            if netloc:
                # url carries its own authority; use url path directly.
                path = posixpath.normpath(path) if path[:1] == '/' else path
                return _coerce_result(urlunparse((scheme, netloc, path, params, query, fragment)))
            netloc = bnetloc
        if not path and not params:
            # Empty reference path: inherit path, params and (if absent) query.
            path = bpath
            params = bparams
            if not query:
                query = bquery
        elif path[:1] == '/':
            path = posixpath.normpath(path)
        else:
            # Merge relative path into base.
            base_parts = bpath.split('/')
            base_parts[-1] = ''
            path = '/'.join(base_parts) + path
            path = posixpath.normpath(path)
        return _coerce_result(urlunparse((scheme, netloc, path, params, query, fragment)))

The posixpath.normpath call resolves .. and . segments. urljoin must
handle both str and bytes arguments; _coerce_args detects which is in use
and returns a _coerce_result wrapper for the output.
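The resolution rules and the str/bytes handling can be checked against the public API. The sketch below uses the base URL and a few of the "normal examples" from RFC 3986 section 5.4.1; the names BASE, RFC_EXAMPLES, results, bytes_result and mixed_raises are illustrative only.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# Reference -> expected target, from the normal examples in RFC 3986 section 5.4.1.
RFC_EXAMPLES = {
    "g": "http://a/b/c/g",
    "./g": "http://a/b/c/g",
    "g/": "http://a/b/c/g/",
    "/g": "http://a/g",
    "//g": "http://g",
    "?y": "http://a/b/c/d;p?y",
    "../g": "http://a/b/g",
    "../..": "http://a/",
}

# Resolve every reference against the base with the installed urljoin.
results = {ref: urljoin(BASE, ref) for ref in RFC_EXAMPLES}

# bytes in, bytes out: the coercion layer decodes and re-encodes transparently.
bytes_result = urljoin(b"http://a/b/c/d;p?q", b"../g")

# Mixing str and bytes is rejected rather than silently coerced.
try:
    urljoin("http://a/b/", b"g")
    mixed_raises = False
except TypeError:
    mixed_raises = True

if __name__ == "__main__":
    for ref, joined in results.items():
        print(f"{ref!r:8} -> {joined}")
    print(bytes_result, mixed_raises)
```

Note that the installed module keeps the trailing slash for the g/ reference, whereas posixpath.normpath on its own would strip it; a port that relies on a path-cleaning helper has to restore that slash itself.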
parse_qs, parse_qsl, and urlencode
parse_qsl is the workhorse. parse_qs calls it and groups values by key:

    def parse_qsl(qs, keep_blank_values=False, strict_parsing=False,
                  encoding='utf-8', errors='replace', max_num_fields=None, separator='&'):
        qs, _coerce_result = _coerce_args(qs)
        # Since 3.10 only `separator` splits fields (default '&');
        # ';' is no longer treated as a second separator.
        pairs = qs.split(separator)
        r = []
        for name_value in pairs:
            if not name_value and not strict_parsing:
                continue
            nv = name_value.split('=', 1)
            if len(nv) != 2:
                if strict_parsing:
                    raise ValueError("bad query field: %r" % (name_value,))
                # Treat a bare name as a field with an empty value.
                nv.append('')
            name = unquote_plus(nv[0], encoding=encoding, errors=errors)
            value = unquote_plus(nv[1], encoding=encoding, errors=errors)
            if keep_blank_values or value:
                r.append((name, value))
        return r

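The grouping that parse_qs layers on top of parse_qsl can be sketched as follows. group_pairs is a hypothetical helper written for this note, not the stdlib implementation, and the expected behaviour assumes Python 3.10 or later (or a release carrying the separator backport).

```python
from urllib.parse import parse_qs, parse_qsl


def group_pairs(pairs):
    """Group (name, value) pairs into {name: [values...]}, preserving order.

    Hypothetical helper mirroring what parse_qs does with parse_qsl output.
    """
    grouped = {}
    for name, value in pairs:
        grouped.setdefault(name, []).append(value)
    return grouped


query = "tag=a&tag=b&empty=&x=1;y=2"

# Default: split on '&' only, drop blank values; ';' stays inside the value.
pairs = parse_qsl(query)

# keep_blank_values=True retains the ('empty', '') pair.
with_blanks = parse_qsl(query, keep_blank_values=True)

# Semicolons split fields only when requested explicitly.
semicolon_pairs = parse_qsl("x=1;y=2", separator=";")

if __name__ == "__main__":
    print(pairs)
    print(group_pairs(pairs))
    print(with_blanks)
    print(semicolon_pairs)
```

Duplicate keys therefore end up as multi-element lists, and the ; in x=1;y=2 is treated as data unless the caller opts in.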
urlencode accepts a sequence of pairs or a dict and percent-encodes each
key/value using quote_plus (spaces become +):

    def urlencode(query, doseq=False, safe='', encoding=None, errors=None,
                  quote_via=quote_plus):
        ...
        l = []
        for k, v in (query.items() if isinstance(query, dict) else query):
            k = quote_via(str(k), safe)
            if doseq and not isinstance(v, str):
                for elt in v:
                    l.append(k + '=' + quote_via(str(elt), safe))
            else:
                l.append(k + '=' + quote_via(str(v), safe))
        return '&'.join(l)

The quote_via parameter (added in 3.5) lets callers substitute quote for
quote_plus when spaces should be encoded as %20 rather than +.
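A short usage sketch of the two knobs discussed above, doseq and quote_via. The sample parameters and the variable names plus_encoded, percent_encoded and without_doseq are made up for illustration.

```python
from urllib.parse import quote, urlencode

params = {"q": "a b&c", "tag": ["x", "y z"]}

# Default quote_plus: spaces become '+', and doseq expands the list per element.
plus_encoded = urlencode(params, doseq=True)

# quote_via=quote: spaces become '%20' instead.
percent_encoded = urlencode(params, doseq=True, quote_via=quote)

# Without doseq the list is passed through str() and encoded as one value.
without_doseq = urlencode({"tag": ["x", "y z"]})

if __name__ == "__main__":
    print(plus_encoded)      # q=a+b%26c&tag=x&tag=y+z
    print(percent_encoded)   # q=a%20b%26c&tag=x&tag=y%20z
    print(without_doseq)     # tag=%5B%27x%27%2C+%27y+z%27%5D
```

The last line shows why doseq matters: forgetting it produces the percent-encoded repr of the list rather than repeated keys.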
gopy notes
- _coerce_args/_coerce_result create a bytes/str unification layer. In Go the equivalent is a separate bytes variant of each exported function rather than runtime coercion.
- SplitResult.geturl() must round-trip exactly; the test suite checks that urlunsplit(urlsplit(u)) == u for all fixtures.
- The 3.14 scheme-validation change (isidentifier()) should be gated on the version constant so that gopy's 3.14 mode enforces it while a compatibility shim can relax it.
- parse_qsl semicolon splitting was removed as a default in 3.10, and backported to earlier maintained branches as the fix for CVE-2021-23336, with the separator parameter made explicit. The Go port must not split on ';' unless separator=';' is passed explicitly.
- posixpath.normpath inside urljoin removes trailing slashes and collapses // (except exactly two leading slashes, which it preserves). The Go port should use path.Clean with equivalent edge-case handling. Both path.Clean("") and posixpath.normpath("") return ".", so an empty path has to be special-cased if it is to stay "".
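The normpath edge cases and the round-trip expectation from these notes can be pinned down with a few checks on the Python side, which a port could then mirror as table-driven tests. NORMPATH_CASES and the sample URLs are assumptions chosen for illustration, not fixtures from the CPython test suite.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

# Input -> posixpath.normpath output for the edge cases a port must reproduce.
NORMPATH_CASES = {
    "": ".",                 # empty input becomes "."
    "/a/b/": "/a/b",         # trailing slash is stripped
    "/a//b": "/a/b",         # interior double slash collapses
    "//a/b": "//a/b",        # exactly two leading slashes are preserved
    "///a/b": "/a/b",        # three or more leading slashes collapse to one
    "/a/./b/../c": "/a/c",   # dot segments are resolved
    "/..": "/",              # ".." cannot climb above the root
}

normalized = {path: posixpath.normpath(path) for path in NORMPATH_CASES}

# Round trip: an already-normalised URL survives split + unsplit unchanged.
canonical = "http://example.com/a/b?x=1#frag"
round_tripped = urlunsplit(urlsplit(canonical))

# The scheme is lowercased on the way through, so this one does not round-trip.
shouting = "HTTP://example.com/a"
lowered = urlunsplit(urlsplit(shouting))

if __name__ == "__main__":
    for path, result in normalized.items():
        print(f"{path!r:16} -> {result!r}")
    print(round_tripped == canonical, lowered)
```

Go's path.Clean is documented to replace multiple slashes with a single one, so the //a/b row is the case most likely to differ and may need explicit handling alongside the empty-string case.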