
Lib/urllib/parse.py

cpython 3.14 @ ab2d84fe1023/Lib/urllib/parse.py

urllib.parse is a pure-Python module that implements RFC 3986-style URL parsing (the module deliberately keeps some pre-3986 behaviour for backward compatibility; it does not implement RFC 3987 IRIs). It provides two layers of decomposition: urlsplit (five components: scheme, netloc, path, query, fragment) and urlparse (six components: the five from urlsplit plus params, split off the last path segment at ;). Named-tuple result types (SplitResult, ParseResult) expose components both positionally and as named attributes, and add geturl() to reassemble the URL.
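The difference between the two layers is easiest to see side by side; this uses only the documented stdlib API:

```python
from urllib.parse import urlparse, urlsplit

u = 'http://example.com/path;v=2?q=1#top'

# urlsplit: five components, the ;params stay inside the path
s = urlsplit(u)
print(s.path)             # '/path;v=2'

# urlparse: six components, params peeled off the last path segment
p = urlparse(u)
print(p.path, p.params)   # '/path' 'v=2'

# both results are named tuples and can reassemble the original URL
print(s.geturl() == u and p.geturl() == u)   # True
```

Note that attribute access (`s.path`) and index access (`s[2]`) refer to the same component, since the result types are namedtuple subclasses.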

The module also handles percent-encoding (quote/unquote) and query-string serialisation and parsing (urlencode/parse_qs/parse_qsl). A _coerce_args helper unifies bytes and str inputs by converting bytes to str internally and converting the result back, so every function works symmetrically on both types.
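The bytes/str symmetry provided by _coerce_args is observable from the outside: bytes in means bytes out, and mixing the two types in one call is rejected.

```python
from urllib.parse import urlsplit

# str in -> str components out
s_str = urlsplit('http://example.com/a')
print(s_str.scheme)    # 'http'

# bytes in -> bytes components out, via the _coerce_args round-trip
s_bytes = urlsplit(b'http://example.com/a')
print(s_bytes.scheme)  # b'http'

# mixing str and bytes arguments in a single call raises TypeError
try:
    urlsplit(b'http://example.com', scheme='https')
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)        # False
```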

Map

Lines | Symbol | Role | gopy
1-100 | uses_netloc, uses_params, uses_query, uses_fragment, uses_relative | Registry sets that control which schemes get netloc/params/query/fragment treatment in urlsplit/urlparse. | (stdlib pending)
100-300 | SplitResult, ParseResult, SplitResultBytes, ParseResultBytes, urlsplit, urlparse | Named-tuple result types plus the two main decomposition functions; urlparse calls urlsplit and then peels off the ;params suffix from the path. | (stdlib pending)
300-450 | urlunparse, urlunsplit, urldefrag, urljoin | Reassembly and joining. urljoin implements the RFC 3986 reference-resolution algorithm using the base URL's components as fallbacks. | (stdlib pending)
450-600 | quote, quote_plus, quote_from_bytes, _ALWAYS_SAFE | Percent-encoding. _ALWAYS_SAFE is a frozenset of unreserved characters. quote_from_bytes encodes each byte not in safe as %XX. | (stdlib pending)
600-750 | unquote, unquote_plus, unquote_to_bytes | Percent-decoding. unquote_to_bytes splits on % and decodes each two-hex-digit sequence; unquote then decodes the resulting bytes with an error handler that preserves malformed sequences. | (stdlib pending)
750-900 | urlencode, parse_qs, parse_qsl | Query-string encoding and parsing. parse_qsl returns a list of (name, value) pairs; parse_qs folds duplicates into lists. urlencode handles dicts, sequences, and the doseq flag. | (stdlib pending)
900-1100 | _coerce_args, _noop, _encode_result, _decode_args | Bytes/str unification layer. _coerce_args detects bytes input, converts to str, and returns a result encoder that converts back. Every public function wraps its str implementation with this pair. | (stdlib pending)
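The percent-decoding behaviour summarised in the table, including the preservation of malformed sequences, can be checked directly:

```python
from urllib.parse import unquote, unquote_plus, unquote_to_bytes

raw = unquote_to_bytes('a%20b')
print(raw)                       # b'a b'

euro = unquote('%E2%82%AC')
print(euro)                      # the decoded bytes are UTF-8 decoded by default

kept = unquote('100%zz')
print(kept)                      # '100%zz': a malformed %-sequence passes through

plus = unquote_plus('a+b%20c')
print(plus)                      # 'a b c': '+' is mapped to a space first
```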

Reading

urlsplit netloc extraction (lines 100 to 300)

cpython 3.14 @ ab2d84fe1023/Lib/urllib/parse.py#L100-300

def urlsplit(url, scheme='', allow_fragments=True):
    url, scheme, _coerce_result = _coerce_args(url, scheme)
    allow_fragments = bool(allow_fragments)
    key = url, scheme, allow_fragments, type(url), type(scheme)
    cached = _parse_cache.get(key, None)
    if cached:
        return _coerce_result(cached)
    ...
    netloc = query = fragment = ''
    i = url.find(':')
    if i > 0:
        ...
        rest = url[i+1:]
        if not rest or any(c not in '0123456789' for c in rest):
            scheme, url = url[:i].lower(), rest
    if url[:2] == '//':
        netloc, url = _splitnetloc(url, 2)
        if (('[' in netloc and ']' not in netloc) or
                (']' in netloc and '[' not in netloc)):
            raise ValueError("Invalid IPv6 URL")
    if allow_fragments and '#' in url:
        url, fragment = url.split('#', 1)
    if '?' in url:
        url, query = url.split('?', 1)
    v = SplitResult(scheme, netloc, url, query, fragment)
    _parse_cache[key] = v
    return _coerce_result(v)

The scheme is extracted by finding the first :. The excerpt's heuristic then accepts it as a scheme only when the remainder is empty or not purely digits, so that host:80/path is not misread as a URL with scheme host. The netloc is extracted only if the remaining string starts with //; _splitnetloc scans for the first /, ?, or # that ends the authority. The fragment is split off before the query, so a # inside what looks like a query starts the fragment (RFC 3986 section 3.4 permits # in a query only when percent-encoded). Results are cached in _parse_cache (a plain dict, not an LRU) keyed by the full argument tuple including types, so bytes and str calls never collide.
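Two consequences of the fragment-before-query ordering, plus the IPv6 bracket check, in a short sketch:

```python
from urllib.parse import urlsplit

# '#' is split off before '?', so everything after '#' is fragment,
# even if it contains a '?'
s1 = urlsplit('http://host/p#frag?notquery')
print(s1.query, s1.fragment)    # '' 'frag?notquery'

# in the usual order, the query ends at the '#'
s2 = urlsplit('http://host/p?q=1#frag')
print(s2.query, s2.fragment)    # 'q=1' 'frag'

# unbalanced brackets in the authority are rejected
try:
    urlsplit('http://[::1/')
    ipv6_ok = True
except ValueError:
    ipv6_ok = False
print(ipv6_ok)                  # False
```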

urljoin RFC 3986 resolution algorithm (lines 300 to 450)

cpython 3.14 @ ab2d84fe1023/Lib/urllib/parse.py#L300-450

def urljoin(base, url, allow_fragments=True):
    if not base:
        return url
    if not url:
        return base
    base, url, _coerce_result = _coerce_args(base, url)
    bscheme, bnetloc, bpath, bparams, bquery, bfragment = \
        urlparse(base, '', allow_fragments)
    scheme, netloc, path, params, query, fragment = \
        urlparse(url, bscheme, allow_fragments)
    if scheme != bscheme or scheme not in uses_relative:
        return _coerce_result(url)
    if scheme in uses_netloc:
        netloc = netloc or bnetloc
    if not path and not params:
        path = bpath
        params = bparams
        if not query:
            query = bquery
        return _coerce_result(urlunparse((scheme, netloc, path,
                                          params, query, fragment)))
    base_parts = bpath.split('/')
    if base_parts[-1] != '':
        del base_parts[-1]
    if path[:1] == '/':
        segments = path.split('/')
    else:
        segments = base_parts + path.split('/')
    resolved = []
    for seg in segments:
        if seg == '..':
            if resolved[1:]:
                resolved.pop()
        elif seg != '.':
            resolved.append(seg)
    ...

The algorithm follows RFC 3986 section 5.2 literally. If the reference has a different scheme it is returned verbatim. Otherwise netloc, path, and query are resolved in order: a non-empty reference netloc overrides the base netloc; an absolute reference path overrides the base path entirely; a relative path is appended to the base path after removing the last segment. Dot segments (. and ..) are removed in a single pass over the merged segment list.
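The behaviour described above matches the worked examples in RFC 3986 section 5.4, using the RFC's own reference base:

```python
from urllib.parse import urljoin

base = 'http://a/b/c/d;p?q'           # RFC 3986 section 5.4 reference base

merged     = urljoin(base, 'g')          # relative merge with base directory
absolute   = urljoin(base, '/g')         # absolute path replaces base path
authority  = urljoin(base, '//g')        # reference netloc wins
query_only = urljoin(base, '?y')         # keep base path, swap the query
dotted     = urljoin(base, '../../g')    # dot-segment removal
other      = urljoin(base, 'ftp://x/y')  # different scheme: returned verbatim

print(merged)      # 'http://a/b/c/g'
print(absolute)    # 'http://a/g'
print(authority)   # 'http://g'
print(query_only)  # 'http://a/b/c/d;p?y'
print(dotted)      # 'http://a/g'
print(other)       # 'ftp://x/y'
```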

quote percent-encoding (lines 450 to 600)

cpython 3.14 @ ab2d84fe1023/Lib/urllib/parse.py#L450-600

_ALWAYS_SAFE = frozenset(
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    'abcdefghijklmnopqrstuvwxyz'
    '0123456789'
    '_.-~')

_safe_quoters = {}

def quote_from_bytes(bs, safe='/'):
    if not isinstance(bs, (bytes, bytearray)):
        raise TypeError('quote_from_bytes() expected bytes or bytearray, '
                        'not %s' % bs.__class__.__name__)
    if not bs:
        return ''
    if isinstance(safe, str):
        safe = safe.encode('ascii', 'ignore')
    else:
        safe = bytes([c for c in safe if c < 128])
    if not bs.rstrip(_ALWAYS_SAFE_BYTES + safe):
        return bs.decode()
    try:
        quoter = _safe_quoters[safe]
    except KeyError:
        _safe_quoters[safe] = quoter = Quoter(safe).__getitem__
    return ''.join([quoter(char) for char in bs])

quote_from_bytes converts each byte outside _ALWAYS_SAFE | safe to %XX using a Quoter object (a dict subclass that generates entries on demand via __missing__). Results for each safe set are cached in _safe_quoters to avoid rebuilding the table on every call. The fast-path rstrip check skips encoding entirely when every byte in the input is already safe.

quote(string, safe, encoding, errors) encodes the string to bytes first using the specified encoding (default utf-8) and then calls quote_from_bytes. quote_plus maps spaces to +: it adds the space to the safe set, calls quote, then replaces each ' ' with '+', matching the application/x-www-form-urlencoded format used in HTML forms.
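The effect of the safe set, the encode-then-quote pipeline, and the quote/quote_plus split can be seen in a few calls:

```python
from urllib.parse import quote, quote_plus, quote_from_bytes, unquote

default   = quote('a b/c')             # '/' is in the default safe set
strict    = quote('a b/c', safe='')    # nothing spared
formstyle = quote_plus('a b/c')        # space -> '+', '/' -> '%2F'
utf8      = quote('\u00fc')            # str is UTF-8 encoded before quoting
rawbyte   = quote_from_bytes(b'\xff')  # bytes input skips the encode step

print(default)     # 'a%20b/c'
print(strict)      # 'a%20b%2Fc'
print(formstyle)   # 'a+b%2Fc'
print(utf8)        # '%C3%BC'
print(rawbyte)     # '%FF'
print(unquote(default))   # round-trips back to 'a b/c'
```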

gopy mirror

urllib.parse has no C accelerator; it depends only on other pure-Python stdlib modules. Note that since the bpo-42967 security fix, parse_qsl splits the query string on a single separator argument (default &) using str.split, and ; is no longer treated as a separator by default. A gopy port needs a working str.encode/bytes.decode, str.split, and the _ALWAYS_SAFE frozenset. The _parse_cache dict is module-global and shared across calls, so a port must replicate it to achieve comparable performance on repeated calls to urlsplit.
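The query-string round-trip a port has to reproduce, including the doseq fan-out, is small enough to pin down with a few calls:

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

qs = 'a=1&a=2&b=x+y'
pairs  = parse_qsl(qs)   # ordered (name, value) pairs
folded = parse_qs(qs)    # duplicates folded into lists
print(pairs)             # [('a', '1'), ('a', '2'), ('b', 'x y')]
print(folded)            # {'a': ['1', '2'], 'b': ['x y']}

# doseq=True fans a sequence value out into repeated keys;
# without it the whole list is stringified as one value
flat = urlencode({'a': [1, 2]}, doseq=True)
print(flat)              # 'a=1&a=2'
print(parse_qsl(flat))   # round-trips to [('a', '1'), ('a', '2')]
```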