
Lib/urllib/parse.py

cpython 3.14 @ ab2d84fe1023/Lib/urllib/parse.py

urllib.parse is a pure-Python module that implements RFC 3986-style URL parsing (the module deliberately keeps some pre-3986 behaviour for backward compatibility; it does not implement RFC 3987 IRIs). It provides two layers of decomposition: urlsplit (five components: scheme, netloc, path, query, fragment) and urlparse (six components: the five from urlsplit plus params, split off the last path segment at ;). Named-tuple result types (SplitResult, ParseResult) expose components both positionally and as named attributes, and add geturl() to reassemble the URL.
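The difference between the two layers is easiest to see side by side; this uses only the documented stdlib API:

```python
from urllib.parse import urlparse, urlsplit

u = 'http://example.com/path;v=2?q=1#top'

# urlsplit: five components, the ;params stay inside the path
s = urlsplit(u)
print(s.path)             # '/path;v=2'

# urlparse: six components, params peeled off the last path segment
p = urlparse(u)
print(p.path, p.params)   # '/path' 'v=2'

# both results are named tuples and can reassemble the original URL
print(s.geturl() == u and p.geturl() == u)   # True
```

Note that attribute access (`s.path`) and index access (`s[2]`) refer to the same component, since the result types are namedtuple subclasses.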

The module also handles percent-encoding (quote/unquote) and query-string serialisation and parsing (urlencode/parse_qs/parse_qsl). A _coerce_args helper unifies bytes and str inputs by converting bytes to str internally and converting the result back, so every function works symmetrically on both types.
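The bytes/str symmetry provided by _coerce_args is observable from the outside: bytes in means bytes out, and mixing the two types in one call is rejected.

```python
from urllib.parse import urlsplit

# str in -> str components out
s_str = urlsplit('http://example.com/a')
print(s_str.scheme)    # 'http'

# bytes in -> bytes components out, via the _coerce_args round-trip
s_bytes = urlsplit(b'http://example.com/a')
print(s_bytes.scheme)  # b'http'

# mixing str and bytes arguments in a single call raises TypeError
try:
    urlsplit(b'http://example.com', scheme='https')
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)        # False
```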

Map

Lines | Symbol | Role | gopy
1-100 | uses_netloc, uses_params, uses_query, uses_fragment, uses_relative | Registry sets that control which schemes get netloc/params/query/fragment treatment in urlsplit/urlparse. | (stdlib pending)
100-300 | SplitResult, ParseResult, SplitResultBytes, ParseResultBytes, urlsplit, urlparse | Named-tuple result types plus the two main decomposition functions; urlparse calls urlsplit and then peels off the ;params suffix from the path. | (stdlib pending)
300-450 | urlunparse, urlunsplit, urldefrag, urljoin | Reassembly and joining. urljoin implements the RFC 3986 reference-resolution algorithm using the base URL's components as fallbacks. | (stdlib pending)
450-600 | quote, quote_plus, quote_from_bytes, _ALWAYS_SAFE | Percent-encoding. _ALWAYS_SAFE is a frozenset of unreserved characters. quote_from_bytes encodes each byte not in safe as %XX. | (stdlib pending)
600-750 | unquote, unquote_plus, unquote_to_bytes | Percent-decoding. unquote_to_bytes splits on % and decodes each two-hex-digit sequence; unquote then decodes the resulting bytes with an error handler that preserves malformed sequences. | (stdlib pending)
750-900 | urlencode, parse_qs, parse_qsl | Query-string encoding and parsing. parse_qsl returns a list of (name, value) pairs; parse_qs folds duplicates into lists. urlencode handles dicts, sequences, and the doseq flag. | (stdlib pending)
900-1100 | _coerce_args, _noop, _encode_result, _decode_args | Bytes/str unification layer. _coerce_args detects bytes input, converts to str, and returns a result encoder that converts back. Every public function wraps its str implementation with this pair. | (stdlib pending)
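The percent-decoding behaviour summarised in the table, including the preservation of malformed sequences, can be checked directly:

```python
from urllib.parse import unquote, unquote_plus, unquote_to_bytes

raw = unquote_to_bytes('a%20b')
print(raw)                       # b'a b'

euro = unquote('%E2%82%AC')
print(euro)                      # the decoded bytes are UTF-8 decoded by default

kept = unquote('100%zz')
print(kept)                      # '100%zz': a malformed %-sequence passes through

plus = unquote_plus('a+b%20c')
print(plus)                      # 'a b c': '+' is mapped to a space first
```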

Reading

urlsplit netloc extraction (lines 100 to 300)

cpython 3.14 @ ab2d84fe1023/Lib/urllib/parse.py#L100-300

def urlsplit(url, scheme='', allow_fragments=True):
    url, scheme, _coerce_result = _coerce_args(url, scheme)
    allow_fragments = bool(allow_fragments)
    key = url, scheme, allow_fragments, type(url), type(scheme)
    cached = _parse_cache.get(key, None)
    if cached:
        return _coerce_result(cached)
    ...
    netloc = query = fragment = ''
    i = url.find(':')
    if i > 0:
        ...
        rest = url[i+1:]
        if not rest or any(c not in '0123456789' for c in rest):
            scheme, url = url[:i].lower(), rest
    if url[:2] == '//':
        netloc, url = _splitnetloc(url, 2)
        if (('[' in netloc and ']' not in netloc) or
                (']' in netloc and '[' not in netloc)):
            raise ValueError("Invalid IPv6 URL")
    if allow_fragments and '#' in url:
        url, fragment = url.split('#', 1)
    if '?' in url:
        url, query = url.split('?', 1)
    v = SplitResult(scheme, netloc, url, query, fragment)
    _parse_cache[key] = v
    return _coerce_result(v)

The scheme is extracted by finding the first :. The excerpt's heuristic then accepts it as a scheme only when the remainder is empty or not purely digits, so that host:80/path is not misread as a URL with scheme host. The netloc is extracted only if the remaining string starts with //; _splitnetloc scans for the first /, ?, or # that ends the authority. The fragment is split off before the query, so a # inside what looks like a query starts the fragment (RFC 3986 section 3.4 permits # in a query only when percent-encoded). Results are cached in _parse_cache (a plain dict, not an LRU) keyed by the full argument tuple including types, so bytes and str calls never collide.
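Two consequences of the fragment-before-query ordering, plus the IPv6 bracket check, in a short sketch:

```python
from urllib.parse import urlsplit

# '#' is split off before '?', so everything after '#' is fragment,
# even if it contains a '?'
s1 = urlsplit('http://host/p#frag?notquery')
print(s1.query, s1.fragment)    # '' 'frag?notquery'

# in the usual order, the query ends at the '#'
s2 = urlsplit('http://host/p?q=1#frag')
print(s2.query, s2.fragment)    # 'q=1' 'frag'

# unbalanced brackets in the authority are rejected
try:
    urlsplit('http://[::1/')
    ipv6_ok = True
except ValueError:
    ipv6_ok = False
print(ipv6_ok)                  # False
```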

urljoin RFC 3986 resolution algorithm (lines 300 to 450)

cpython 3.14 @ ab2d84fe1023/Lib/urllib/parse.py#L300-450

def urljoin(base, url, allow_fragments=True):
    if not base:
        return url
    if not url:
        return base
    base, url, _coerce_result = _coerce_args(base, url)
    bscheme, bnetloc, bpath, bparams, bquery, bfragment = \
        urlparse(base, '', allow_fragments)
    scheme, netloc, path, params, query, fragment = \
        urlparse(url, bscheme, allow_fragments)
    if scheme != bscheme or scheme not in uses_relative:
        return _coerce_result(url)
    if scheme in uses_netloc:
        netloc = netloc or bnetloc
    if not path and not params:
        path = bpath
        params = bparams
        if not query:
            query = bquery
        return _coerce_result(urlunparse((scheme, netloc, path,
                                          params, query, fragment)))
    base_parts = bpath.split('/')
    if base_parts[-1] != '':
        del base_parts[-1]
    if path[:1] == '/':
        segments = path.split('/')
    else:
        segments = base_parts + path.split('/')
    resolved = []
    for seg in segments:
        if seg == '..':
            if resolved[1:]:
                resolved.pop()
        elif seg != '.':
            resolved.append(seg)
    ...

The algorithm follows RFC 3986 section 5.2 literally. If the reference has a different scheme it is returned verbatim. Otherwise netloc, path, and query are resolved in order: a non-empty reference netloc overrides the base netloc; an absolute reference path overrides the base path entirely; a relative path is appended to the base path after removing the last segment. Dot segments (. and ..) are removed in a single pass over the merged segment list.
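The behaviour described above matches the worked examples in RFC 3986 section 5.4, using the RFC's own reference base:

```python
from urllib.parse import urljoin

base = 'http://a/b/c/d;p?q'           # RFC 3986 section 5.4 reference base

merged     = urljoin(base, 'g')          # relative merge with base directory
absolute   = urljoin(base, '/g')         # absolute path replaces base path
authority  = urljoin(base, '//g')        # reference netloc wins
query_only = urljoin(base, '?y')         # keep base path, swap the query
dotted     = urljoin(base, '../../g')    # dot-segment removal
other      = urljoin(base, 'ftp://x/y')  # different scheme: returned verbatim

print(merged)      # 'http://a/b/c/g'
print(absolute)    # 'http://a/g'
print(authority)   # 'http://g'
print(query_only)  # 'http://a/b/c/d;p?y'
print(dotted)      # 'http://a/g'
print(other)       # 'ftp://x/y'
```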

quote percent-encoding (lines 450 to 600)

cpython 3.14 @ ab2d84fe1023/Lib/urllib/parse.py#L450-600

_ALWAYS_SAFE = frozenset(
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    'abcdefghijklmnopqrstuvwxyz'
    '0123456789'
    '_.-~')

_safe_quoters = {}

def quote_from_bytes(bs, safe='/'):
    if not isinstance(bs, (bytes, bytearray)):
        raise TypeError('quote_from_bytes() expected bytes or bytearray, '
                        'not %s' % bs.__class__.__name__)
    if not bs:
        return ''
    if isinstance(safe, str):
        safe = safe.encode('ascii', 'ignore')
    else:
        safe = bytes([c for c in safe if c < 128])
    if not bs.rstrip(_ALWAYS_SAFE_BYTES + safe):
        return bs.decode()
    try:
        quoter = _safe_quoters[safe]
    except KeyError:
        _safe_quoters[safe] = quoter = Quoter(safe).__getitem__
    return ''.join([quoter(char) for char in bs])

quote_from_bytes converts each byte outside _ALWAYS_SAFE | safe to %XX using a Quoter object (a dict subclass that generates entries on demand via __missing__). Results for each safe set are cached in _safe_quoters to avoid rebuilding the table on every call. The fast-path rstrip check skips encoding entirely when every byte in the input is already safe.

quote(string, safe, encoding, errors) encodes the string to bytes first using the specified encoding (default utf-8) and then calls quote_from_bytes. quote_plus maps spaces to +: it adds the space to the safe set, calls quote, then replaces each ' ' with '+', matching the application/x-www-form-urlencoded format used in HTML forms.
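The effect of the safe set, the encode-then-quote pipeline, and the quote/quote_plus split can be seen in a few calls:

```python
from urllib.parse import quote, quote_plus, quote_from_bytes, unquote

default   = quote('a b/c')             # '/' is in the default safe set
strict    = quote('a b/c', safe='')    # nothing spared
formstyle = quote_plus('a b/c')        # space -> '+', '/' -> '%2F'
utf8      = quote('\u00fc')            # str is UTF-8 encoded before quoting
rawbyte   = quote_from_bytes(b'\xff')  # bytes input skips the encode step

print(default)     # 'a%20b/c'
print(strict)      # 'a%20b%2Fc'
print(formstyle)   # 'a+b%2Fc'
print(utf8)        # '%C3%BC'
print(rawbyte)     # '%FF'
print(unquote(default))   # round-trips back to 'a b/c'
```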

gopy mirror

urllib.parse has no C accelerator; it depends only on other pure-Python stdlib modules. Note that since the bpo-42967 security fix, parse_qsl splits the query string on a single separator argument (default &) using str.split, and ; is no longer treated as a separator by default. A gopy port needs a working str.encode/bytes.decode, str.split, and the _ALWAYS_SAFE frozenset. The _parse_cache dict is module-global and shared across calls, so a port must replicate it to achieve comparable performance on repeated calls to urlsplit.
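The query-string round-trip a port has to reproduce, including the doseq fan-out, is small enough to pin down with a few calls:

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

qs = 'a=1&a=2&b=x+y'
pairs  = parse_qsl(qs)   # ordered (name, value) pairs
folded = parse_qs(qs)    # duplicates folded into lists
print(pairs)             # [('a', '1'), ('a', '2'), ('b', 'x y')]
print(folded)            # {'a': ['1', '2'], 'b': ['x y']}

# doseq=True fans a sequence value out into repeated keys;
# without it the whole list is stringified as one value
flat = urlencode({'a': [1, 2]}, doseq=True)
print(flat)              # 'a=1&a=2'
print(parse_qsl(flat))   # round-trips to [('a', '1'), ('a', '2')]
```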