Lib/urllib/parse.py
cpython 3.14 @ ab2d84fe1023/Lib/urllib/parse.py
urllib.parse is a pure-Python module that implements RFC 3986-style URL
parsing, with compatibility behaviour retained from the earlier RFCs it
cites (RFC 1808, RFC 2396). It provides two layers of decomposition:
urlsplit (five components: scheme, netloc, path, query, fragment) and
urlparse (six components: the five from urlsplit plus params separated
by ; in the path). Named-tuple result types (SplitResult, ParseResult)
expose components as both positional elements and named attributes, and add
geturl() to reassemble the URL.
The module also handles percent-encoding (quote/unquote) and query-string
serialisation (urlencode/parse_qs/parse_qsl). A _coerce_args helper
unifies bytes and str inputs by converting bytes to str internally and
converting the result back, so every function works symmetrically on both types.
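A quick illustration of the two decomposition layers and the bytes/str symmetry:

```python
from urllib.parse import urlparse, urlsplit

u = 'http://example.com/path;v=1?q=2#frag'
s = urlsplit(u)    # five parts: params stay attached to the path
p = urlparse(u)    # six parts: ';params' peeled off the last path segment
print(s.path)      # '/path;v=1'
print(p.path, p.params)           # '/path' 'v=1'
print(p.geturl() == u)            # True: named tuples reassemble losslessly
print(urlsplit(b'http://example.com/').netloc)  # b'example.com' (bytes in, bytes out)
```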
Map
| Lines | Symbol | Role | gopy |
|---|---|---|---|
| 1-100 | uses_netloc, uses_params, uses_query, uses_fragment, uses_relative | Registry sets that control which schemes get netloc/params/query/fragment treatment in urlsplit/urlparse. | (stdlib pending) |
| 100-300 | SplitResult, ParseResult, SplitResultBytes, ParseResultBytes, urlsplit, urlparse | Named-tuple result types plus the two main decomposition functions; urlparse calls urlsplit and then peels off the ;params suffix from the path. | (stdlib pending) |
| 300-450 | urlunparse, urlunsplit, urldefrag, urljoin | Reassembly and joining. urljoin implements the RFC 3986 reference-resolution algorithm using the base URL's components as fallbacks. | (stdlib pending) |
| 450-600 | quote, quote_plus, quote_from_bytes, _ALWAYS_SAFE | Percent-encoding. _ALWAYS_SAFE is a frozenset of unreserved characters. quote_from_bytes encodes each byte not in safe as %XX. | (stdlib pending) |
| 600-750 | unquote, unquote_plus, unquote_to_bytes | Percent-decoding. unquote_to_bytes splits on % and decodes each two-hex-digit sequence; unquote then decodes the resulting bytes with an error handler that preserves malformed sequences. | (stdlib pending) |
| 750-900 | urlencode, parse_qs, parse_qsl | Query-string encoding and parsing. parse_qsl returns a list of (name, value) pairs; parse_qs folds duplicates into lists. urlencode handles dicts, sequences, and the doseq flag. | (stdlib pending) |
| 900-1100 | _coerce_args, _noop, _encode_result, _decode_args | Bytes/str unification layer. _coerce_args detects bytes input, converts to str, and returns a result encoder that converts back. Every public function wraps its str implementation with this pair. | (stdlib pending) |
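The query-string layer summarized in the table can be exercised with a short round trip:

```python
from urllib.parse import urlencode, parse_qs, parse_qsl, quote, unquote

# urlencode accepts a sequence of pairs (or a dict); duplicates survive.
qs = urlencode([('k', 'a b'), ('k', 'c&d'), ('x', '1')])
print(qs)                       # 'k=a+b&k=c%26d&x=1'
print(parse_qsl(qs))            # [('k', 'a b'), ('k', 'c&d'), ('x', '1')]
print(parse_qs(qs))             # {'k': ['a b', 'c&d'], 'x': ['1']}  (duplicates folded)
print(unquote(quote('naïve')))  # 'naïve'  (UTF-8 percent-encoding round trip)
```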
Reading
urlsplit netloc extraction (lines 100 to 300)
cpython 3.14 @ ab2d84fe1023/Lib/urllib/parse.py#L100-300
def urlsplit(url, scheme='', allow_fragments=True):
    url, scheme, _coerce_result = _coerce_args(url, scheme)
    allow_fragments = bool(allow_fragments)
    key = url, scheme, allow_fragments, type(url), type(scheme)
    cached = _parse_cache.get(key, None)
    if cached:
        return _coerce_result(cached)
    ...
    netloc = query = fragment = ''
    i = url.find(':')
    if i > 0:
        ...
        rest = url[i+1:]
        if not rest or any(c not in '0123456789' for c in rest):
            scheme, url = url[:i].lower(), rest
    if url[:2] == '//':
        netloc, url = _splitnetloc(url, 2)
        if (('[' in netloc and ']' not in netloc) or
                (']' in netloc and '[' not in netloc)):
            raise ValueError("Invalid IPv6 URL")
    if allow_fragments and '#' in url:
        url, fragment = url.split('#', 1)
    if '?' in url:
        url, query = url.split('?', 1)
    v = SplitResult(scheme, netloc, url, query, fragment)
    _parse_cache[key] = v
    return _coerce_result(v)
The scheme is extracted by finding the first : and validating that everything
before it consists of valid scheme characters. The netloc is extracted only if
the remaining string starts with //; _splitnetloc scans for the first /,
?, or # that ends the authority. The fragment is split off before the query,
so a # inside the query part starts the fragment (RFC 3986 section 3.4 permits
# in a query only when percent-encoded). Results are cached in _parse_cache
(a plain dict, not an LRU cache) keyed by the full argument tuple, types
included, so bytes and str calls never collide.
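The split order and the IPv6 bracket check can be observed directly:

```python
from urllib.parse import urlsplit

# '#' is split off before '?', so a '#' anywhere after the authority starts
# the fragment; the query ends at the first '#'.
r = urlsplit('http://h/p?q=1#f?x=2')
print(r.query, r.fragment)    # 'q=1' 'f?x=2'

# Bracket validation on the authority:
print(urlsplit('http://[::1]:8080/').netloc)   # '[::1]:8080'
try:
    urlsplit('http://[::1/')
except ValueError as e:
    print(e)                  # Invalid IPv6 URL
```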
urljoin RFC 3986 resolution algorithm (lines 300 to 450)
cpython 3.14 @ ab2d84fe1023/Lib/urllib/parse.py#L300-450
def urljoin(base, url, allow_fragments=True):
    if not base:
        return url
    if not url:
        return base
    base, url, _coerce_result = _coerce_args(base, url)
    bscheme, bnetloc, bpath, bparams, bquery, bfragment = \
        urlparse(base, '', allow_fragments)
    scheme, netloc, path, params, query, fragment = \
        urlparse(url, bscheme, allow_fragments)
    if scheme != bscheme or scheme not in uses_relative:
        return _coerce_result(url)
    if scheme in uses_netloc:
        if netloc:
            return _coerce_result(urlunparse((scheme, netloc, path,
                                              params, query, fragment)))
        netloc = bnetloc
    if not path and not params:
        path = bpath
        params = bparams
        if not query:
            query = bquery
        return _coerce_result(urlunparse((scheme, netloc, path,
                                          params, query, fragment)))
    base_parts = bpath.split('/')
    if base_parts[-1] != '':
        del base_parts[-1]
    if path[:1] == '/':
        segments = path.split('/')
    else:
        segments = base_parts + path.split('/')
    resolved = []
    for seg in segments:
        if seg == '..':
            if resolved[1:]:
                resolved.pop()
        elif seg != '.':
            resolved.append(seg)
    ...
The algorithm follows RFC 3986 section 5.2. If the reference has a different
scheme (or one not registered in uses_relative) it is returned as-is.
Otherwise netloc, path, and query are resolved in order: a non-empty reference
netloc overrides the base netloc; an absolute reference path overrides the
base path entirely; a relative path is appended to the base path after
removing the base's last segment. Dot segments (. and ..) are removed in a
single pass over the merged segment list.
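The fallback rules above map to these observable results (the base is modelled on the RFC 3986 section 5.4 examples):

```python
from urllib.parse import urljoin

base = 'http://a/b/c/d?q#f'
print(urljoin(base, 'g'))        # 'http://a/b/c/g'   relative: last segment replaced
print(urljoin(base, '/g'))       # 'http://a/g'       absolute path overrides
print(urljoin(base, '//h/g'))    # 'http://h/g'       netloc overrides
print(urljoin(base, '../../g'))  # 'http://a/g'       dot segments removed
print(urljoin(base, '?y'))       # 'http://a/b/c/d?y' empty path keeps base path
print(urljoin(base, 'ftp://x/')) # 'ftp://x/'         different scheme: verbatim
```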
quote percent-encoding (lines 450 to 600)
cpython 3.14 @ ab2d84fe1023/Lib/urllib/parse.py#L450-600
_ALWAYS_SAFE = frozenset(
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    'abcdefghijklmnopqrstuvwxyz'
    '0123456789'
    '_.-~')
_safe_quoters = {}

def quote_from_bytes(bs, safe='/'):
    if not isinstance(bs, (bytes, bytearray)):
        raise TypeError('quote_from_bytes() expected bytes or bytearray, '
                        'not %s' % bs.__class__.__name__)
    if not bs:
        return ''
    if isinstance(safe, str):
        safe = safe.encode('ascii', 'ignore')
    else:
        safe = bytes([c for c in safe if c < 128])
    if not bs.rstrip(_ALWAYS_SAFE_BYTES + safe):
        return bs.decode()
    try:
        quoter = _safe_quoters[safe]
    except KeyError:
        _safe_quoters[safe] = quoter = Quoter(safe).__getitem__
    return ''.join([quoter(char) for char in bs])
quote_from_bytes converts each byte outside _ALWAYS_SAFE | safe to
%XX using a Quoter object (a dict subclass that generates entries on
demand via __missing__). Results for each safe set are cached in
_safe_quoters to avoid rebuilding the table on every call. The fast-path
rstrip check skips encoding entirely when every byte in the input is
already safe.
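The on-demand table pattern can be sketched in a few lines; DemoQuoter and _UNRESERVED below are hypothetical names for illustration, not the module's actual Quoter class:

```python
_UNRESERVED = frozenset(
    b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    b'abcdefghijklmnopqrstuvwxyz'
    b'0123456789_.-~')

class DemoQuoter(dict):
    """Maps an int byte value to the character itself or to '%XX'.

    Entries are generated lazily in __missing__, so only byte values that
    actually occur in inputs ever get table slots.
    """
    def __init__(self, safe):
        # iterating a bytes object yields ints, matching dict keys below
        self.safe = _UNRESERVED | frozenset(safe)

    def __missing__(self, b):
        res = chr(b) if b in self.safe else '%{:02X}'.format(b)
        self[b] = res          # cache the entry for subsequent lookups
        return res

quoter = DemoQuoter(b'/')
print(''.join(quoter[b] for b in b'/a b'))   # '/a%20b'
```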
quote(string, safe, encoding, errors) encodes the string to bytes first
using the specified encoding (default utf-8) and then calls
quote_from_bytes. quote_plus calls quote with the space character added to
the safe set and then replaces each space with +, matching the
application/x-www-form-urlencoded format used in HTML forms.
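The division of labour between quote, quote_plus, and quote_from_bytes shows up in their outputs:

```python
from urllib.parse import quote, quote_plus, quote_from_bytes

print(quote('a b/c'))            # 'a%20b/c'  ('/' is safe by default)
print(quote('a b/c', safe=''))   # 'a%20b%2Fc'
print(quote_plus('a b/c'))       # 'a+b%2Fc'  (form encoding: space -> '+', '/' unsafe)
print(quote_from_bytes(b'\xe9')) # '%E9'      (raw byte, no encoding step)
print(quote('é'))                # '%C3%A9'   (str is UTF-8-encoded first)
```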
gopy mirror
urllib.parse has no C accelerator and depends only on other stdlib modules.
Note that since Python 3.10, parse_qsl splits the query string on a single
separator argument (default &) rather than on both & and ;. A gopy port
needs a working str.encode/bytes.decode, str.split, and the _ALWAYS_SAFE
frozenset. The _parse_cache dict is module-global and shared across calls,
so a port must replicate it to achieve comparable performance on repeated
calls to urlsplit.