urllib/parse.py: URL parsing internals

Lib/urllib/parse.py is the reference implementation for all URL parsing, joining, and encoding utilities in the standard library. At roughly 1600 lines it is one of the larger pure-Python stdlib files. CPython 3.14 added stricter scheme validation and deprecated several previously silent coercions.

Map

| Lines | Symbol | Role |
|---|---|---|
| 1-60 | module constants | uses_netloc, uses_relative, scheme_chars, MAX_CACHE_SIZE |
| 62-100 | _coerce_args / _noop | Bytes-vs-str unification helpers |
| 102-160 | SplitResult / ParseResult | Named tuples with geturl() reconstruction |
| 162-260 | urlsplit | Splits URL into scheme, netloc, path, query, fragment |
| 262-320 | urlparse | Calls urlsplit, then splits path into path + params |
| 322-400 | urljoin | RFC 3986 base-resolution algorithm |
| 402-460 | urlunparse / urlunsplit | Reconstructs URL from component tuple |
| 462-560 | urlencode | Encodes query string; delegates to quote_plus |
| 562-640 | quote / quote_plus / quote_from_bytes | Percent-encoding |
| 642-720 | unquote / unquote_plus / unquote_to_bytes | Percent-decoding |
| 722-820 | parse_qs / parse_qsl | Query-string parsing with duplicate-key handling |
| 822-900 | splitport / splituser etc. | Legacy helpers (deprecated since 3.8) |
| 902-1000 | _splitnetloc / _checknetloc | Internal netloc extraction |
| 1000-1600 | _encode_result, _decode_args | Bytes/str coercion round-trips |

Reading

urlsplit and SplitResult

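urlsplit is the canonical low-level split. It does not decompose the path further (that is urlparse's job). The result is a SplitResult named tuple whose geturl() reassembles an equivalent URL; the output can differ from the input because the scheme is lowercased and empty components are dropped: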

SplitResult = namedtuple('SplitResult', 'scheme netloc path query fragment')

def urlsplit(urlstring, scheme='', allow_fragments=True):
    # Normalise scheme.
    netloc = query = fragment = ''
    i = urlstring.find(':')
    if i > 0 and urlstring[:i].isidentifier():  # 3.14: stricter scheme check
        scheme, urlstring = urlstring[:i].lower(), urlstring[i+1:]
    if urlstring[:2] == '//':
        netloc, urlstring = _splitnetloc(urlstring, 2)
        if '[' in netloc and ']' not in netloc:
            raise ValueError("Invalid IPv6 URL")
    if allow_fragments and '#' in urlstring:
        urlstring, fragment = urlstring.split('#', 1)
    if '?' in urlstring:
        urlstring, query = urlstring.split('?', 1)
    return SplitResult(scheme, netloc, urlstring, query, fragment)

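The 3.14 change (the isidentifier() check) rejects schemes that start with a digit. It is only an approximation of RFC 3986 section 3.1, whose grammar is ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ): str.isidentifier() also accepts underscores and non-ASCII letters, and it rejects +, - and ., so a scheme such as svn+ssh would fail the check. A port that needs the RFC grammar should test each character against scheme_chars, with an ASCII letter required in the first position.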
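
As a quick check of the points above, the following sketch runs the standard-library urlsplit and urlparse on one URL and then compares str.isidentifier() with the RFC 3986 scheme grammar. The example URL and the RFC_SCHEME pattern are illustrative assumptions, not part of urllib.parse.

```python
import re
from urllib.parse import urlparse, urlsplit

url = "HTTP://user@example.com:8080/a/b;c?x=1#frag"  # illustrative URL

split = urlsplit(url)
# Five components; the scheme is lowercased and the path keeps ";c".
print(split.scheme, split.netloc, split.path, split.query, split.fragment)
# hostname and port are derived from netloc on attribute access.
print(split.hostname, split.port)

parsed = urlparse(url)
# Six components: urlparse peels ";c" off the last path segment as params.
print(parsed.path, parsed.params)

# geturl() rebuilds an equivalent URL, not necessarily the identical string.
print(split.geturl())

# RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# RFC_SCHEME is a hypothetical helper written for this comparison.
RFC_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

for candidate in ("http", "svn+ssh", "1http", "_x", "h\u00e9llo"):
    matches_rfc = bool(RFC_SCHEME.fullmatch(candidate))
    print(f"{candidate!r}: rfc={matches_rfc} isidentifier={candidate.isidentifier()}")
```

The loop shows the two checks disagreeing in both directions: svn+ssh is a valid RFC scheme but not an identifier, while _x and a name with a non-ASCII letter are identifiers but not valid schemes.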

urljoin base-resolution algorithm

urljoin implements the reference-resolution algorithm of RFC 3986 section 5.2.2. The simplified excerpt below follows the structure of the RFC pseudocode:

def urljoin(base, url, allow_fragments=True):
    if not base:
        return url
    if not url:
        return base
    base, url, _coerce_result = _coerce_args(base, url)
    bscheme, bnetloc, bpath, bparams, bquery, bfragment = urlparse(base, '', allow_fragments)
    scheme, netloc, path, params, query, fragment = urlparse(url, bscheme, allow_fragments)

    # A different or non-relative scheme means url is already absolute.
    if scheme != bscheme or scheme not in uses_relative:
        return _coerce_result(url)
    if scheme in uses_netloc:
        if netloc:
            # url carries its own authority; use its path directly.
            path = posixpath.normpath(path) if path[:1] == '/' else path
            return _coerce_result(urlunparse((scheme, netloc, path, params, query, fragment)))
        netloc = bnetloc

    if not path and not params:
        # Empty reference path: inherit the base path (and query, if absent).
        path = bpath
        params = bparams
        if not query:
            query = bquery
    elif path[:1] == '/':
        path = posixpath.normpath(path)
    else:
        # Merge relative path into base.
        base_parts = bpath.split('/')
        base_parts[-1] = ''
        path = '/'.join(base_parts) + path
        path = posixpath.normpath(path)
    return _coerce_result(urlunparse((scheme, netloc, path, params, query, fragment)))

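In this simplified excerpt, posixpath.normpath stands in for the dot-segment removal of RFC 3986 section 5.2.4 (resolving .. and .). The two are not equivalent: normpath also strips a trailing slash and returns "." for an empty path, whereas urljoin('http://a/b/c/d', 'g/') yields 'http://a/b/c/g/'. urljoin must handle both str and bytes arguments; _coerce_args detects which is in use and returns a _coerce_result wrapper that converts the output back to the input type.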
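
The resolution rules can be exercised against the standard-library urljoin using the base URI and a few of the reference examples from RFC 3986 section 5.4. This is a small sketch; the selection of examples and the variable names are choices made here for illustration.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"  # base URI used by the RFC 3986 section 5.4 examples

# Relative reference -> expected target, taken from RFC 3986 section 5.4.1.
examples = {
    "g": "http://a/b/c/g",
    "./g": "http://a/b/c/g",
    "g/": "http://a/b/c/g/",      # trailing slash is preserved
    "/g": "http://a/g",
    "//g": "http://g",            # network-path reference replaces the authority
    "?y": "http://a/b/c/d;p?y",   # empty path: base path is inherited
    "../g": "http://a/b/g",
    "../../g": "http://a/g",
}

for ref, expected in examples.items():
    result = urljoin(BASE, ref)
    status = "ok" if result == expected else "MISMATCH"
    print(f"{ref!r:10} -> {result} [{status}]")

# bytes in, bytes out: the coercion layer restores the input type.
bytes_result = urljoin(b"http://a/b/c/d", b"../g")
print(bytes_result)

# Mixing str and bytes is rejected rather than silently coerced.
mixed_error = None
try:
    urljoin("http://a/b/c/d", b"../g")
except TypeError as exc:
    mixed_error = exc
print(type(mixed_error).__name__)
```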

parse_qs, parse_qsl, and urlencode

parse_qsl is the workhorse. parse_qs calls it and groups values by key:

def parse_qsl(qs, keep_blank_values=False, strict_parsing=False,
              encoding='utf-8', errors='replace', max_num_fields=None, separator='&'):
    qs, _coerce_result = _coerce_args(qs)
    # Since 3.10 only `separator` splits fields; ';' is no longer an implicit separator.
    pairs = qs.split(separator)
    r = []
    for name_value in pairs:
        if not name_value and not strict_parsing:
            continue
        nv = name_value.split('=', 1)
        if len(nv) != 2:
            if strict_parsing:
                raise ValueError("bad query field: %r" % (name_value,))
            # A bare name with no '=' is treated as having an empty value.
            nv.append('')
        name = unquote_plus(nv[0], encoding=encoding, errors=errors)
        value = unquote_plus(nv[1], encoding=encoding, errors=errors)
        if keep_blank_values or value:
            r.append((name, value))
    return r
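
The relationship between parse_qsl and parse_qs can be checked with a short sketch against the standard library. group_pairs is a hypothetical helper written here to mirror the grouping step, and the query strings are made up for illustration; the separator behaviour shown assumes Python 3.10 or later.

```python
from urllib.parse import parse_qs, parse_qsl

qs = "a=1&a=2&b=&c"  # duplicate key, blank value, and a bare name

# Blank values (and bare names) are dropped unless keep_blank_values is set.
pairs = parse_qsl(qs)
pairs_blank = parse_qsl(qs, keep_blank_values=True)
print(pairs)
print(pairs_blank)


def group_pairs(pairs):
    """Hypothetical helper: group (name, value) pairs into name -> [values]."""
    grouped = {}
    for name, value in pairs:
        grouped.setdefault(name, []).append(value)
    return grouped


# Grouping the pair list gives the same mapping as parse_qs.
print(group_pairs(pairs_blank) == parse_qs(qs, keep_blank_values=True))

# ';' is ordinary data unless it is passed as the separator.
semi_default = parse_qsl("a=1;b=2")
semi_explicit = parse_qsl("a=1;b=2", separator=";")
print(semi_default)
print(semi_explicit)
```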

urlencode accepts a sequence of pairs or a dict and percent-encodes each key/value using quote_plus (spaces become +):

def urlencode(query, doseq=False, safe='', encoding=None, errors=None,
              quote_via=quote_plus):
    ...
    l = []
    for k, v in (query.items() if isinstance(query, dict) else query):
        k = quote_via(str(k), safe)
        if doseq and not isinstance(v, str):
            for elt in v:
                l.append(k + '=' + quote_via(str(elt), safe))
        else:
            l.append(k + '=' + quote_via(str(v), safe))
    return '&'.join(l)

The quote_via parameter (added in 3.5) lets callers substitute quote for quote_plus when spaces should be encoded as %20 rather than +.
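
A brief sketch of how doseq and quote_via change the output of the standard-library urlencode; the parameter names and values in params are invented for illustration.

```python
from urllib.parse import quote, urlencode

params = {"q": "a b", "tags": ["x", "y"]}  # illustrative data

# doseq=True expands a sequence value into one key=value pair per element.
default = urlencode(params, doseq=True)
print(default)    # q=a+b&tags=x&tags=y

# quote_via=quote encodes the space as %20 instead of +.
pct = urlencode(params, doseq=True, quote_via=quote)
print(pct)        # q=a%20b&tags=x&tags=y

# Without doseq the list is stringified first, then encoded as one value.
no_doseq = urlencode({"tags": ["x", "y"]})
print(no_doseq)   # tags=%5B%27x%27%2C+%27y%27%5D

# A sequence of pairs keeps its order and allows repeated keys.
from_pairs = urlencode([("k", "1"), ("k", "2")])
print(from_pairs)  # k=1&k=2
```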

gopy notes

  • _coerce_args / _coerce_result create a bytes/str unification layer. In Go the equivalent is a separate bytes-variant of each exported function rather than runtime coercion.
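  • SplitResult.geturl() must match CPython's output exactly; the test suite checks that urlunsplit(urlsplit(u)) == u for all fixtures. This identity only holds for inputs that are already normalised, because CPython lowercases the scheme and drops empty components such as a bare trailing ?.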
  • The 3.14 scheme-validation change (isidentifier()) should be gated on the version constant so that gopy's 3.14 mode enforces it while a compatibility shim can relax it.
  • parse_qsl stopped splitting on ; by default in 3.10, when the separator parameter (default '&') was added; the change was backported to the 3.6 through 3.9 security releases as the fix for CVE-2021-23336. The Go port must not split on ; unless separator=';' is passed explicitly.
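  • posixpath.normpath, as used in the urljoin excerpt, removes trailing slashes and collapses repeated slashes. Go's path.Clean does the same, and both return "." for the empty string, but urljoin itself keeps a trailing slash (urljoin('http://a/b/c/d', 'g/') returns 'http://a/b/c/g/'). The Go port should therefore wrap path.Clean with handling for the trailing-slash and empty-string cases, or remove dot segments by hand.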
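
To make the last note concrete, here is a minimal Python sketch of a dot-segment remover that keeps trailing slashes, set beside posixpath.normpath. remove_dot_segments is a hypothetical helper written for this comparison; it is not a function in urllib.parse and only approximates RFC 3986 section 5.2.4 for simple paths.

```python
import posixpath
from urllib.parse import urljoin


def remove_dot_segments(path):
    """Hypothetical sketch: resolve '.' and '..' while keeping a trailing slash."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == "..":
            # Pop the previous segment, but never the leading "" that marks
            # an absolute path.
            if len(out) > 1 or (out and out[0] != ""):
                out.pop()
        elif seg != ".":
            out.append(seg)
    # A path ending in "." or ".." names a directory, so end with a slash.
    if segments[-1] in (".", ".."):
        out.append("")
    return "/".join(out)


for p in ("/b/c/g/", "/b/c/../g", "/b/c/..", "/../g", ""):
    print(f"{p!r:12} sketch={remove_dot_segments(p)!r:10} "
          f"normpath={posixpath.normpath(p)!r}")

# The standard-library urljoin keeps the trailing slash as well.
print(urljoin("http://a/b/c/d", "g/"))
```

The two functions agree on /b/c/../g but diverge on a trailing slash and on the empty string, which are the cases a path.Clean-based port would need to special-case.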