Modules/_json.c
cpython 3.14 @ ab2d84fe1023/Modules/_json.c
_json.c is the C accelerator for json.encoder and json.decoder. The
pure-Python module in Lib/json/ imports _json and replaces its slow
Python implementations with the C versions when available.
The file has two main halves:
- The scanner side: scanstring_str/scanstring_unicode, which parse a JSON string literal from a Python str object, handling all \uXXXX escapes including surrogate pairs.
- The encoder side: py_encode_basestring, py_encode_basestring_ascii, encoder_listencode_obj, encoder_listencode_dict, and encoder_listencode_list, which recursively serialize Python objects to JSON text and accumulate chunks in a Python list.
Neither half touches files. Both operate on Python str objects (or list
buffers for encoder output) and are called from the Python layer in
Lib/json/decoder.py and Lib/json/encoder.py.
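Whether the accelerator is actually in use can be checked from Python. The following is a minimal sketch, assuming the module-level names currently used by Lib/json/decoder.py and Lib/json/encoder.py (c_scanstring, py_scanstring, c_make_encoder). These are implementation details rather than documented API, and accelerator_status is a hypothetical helper introduced only for illustration.

```python
import json.decoder
import json.encoder


def accelerator_status():
    """Report which implementation each hook currently resolves to."""
    return {
        # c_scanstring is None when the _json import failed.
        "scanstring": "C" if json.decoder.c_scanstring is not None else "Python",
        # c_make_encoder is None under the same condition.
        "make_encoder": "C" if json.encoder.c_make_encoder is not None else "Python",
    }


status = accelerator_status()
print(status)
```

On a build without _json both entries should read "Python", and the module keeps working through the pure-Python fallbacks.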
Map
| Lines | Symbol | Role | gopy |
|---|---|---|---|
| 1-300 | scanstring_unicode, scanstring_str | JSON string scanner: fast chunk copy, backslash dispatch, \uXXXX decode, surrogate-pair join. | module/json/ |
| 300-700 | py_encode_basestring, py_encode_basestring_ascii | Encode a Python str as a JSON string literal; ASCII mode escapes all code points above U+007F. | module/json/ |
| 700-1200 | encoder_listencode_obj, encoder_listencode_dict, encoder_listencode_list | Recursive object encoder; appends string chunks to a list accumulator. | module/json/ |
| 1200-1500 | Scanner type, Encoder type, _jsonmodule, PyInit__json | Python-level scanner and encoder objects, module definition, and entry point. | module/json/ |
Reading
scanstring_unicode inner loop (lines 1 to 300)
cpython 3.14 @ ab2d84fe1023/Modules/_json.c#L1-300
scanstring_unicode(pystr, end, strict) scans a JSON string starting at
position end (the character after the opening "). It returns a
(str, new_end) pair.
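The index convention is easiest to see from the Python side. This sketch assumes json.decoder.scanstring, the module-level name that resolves to the C scanner when it is available and to the pure-Python one otherwise; the sample text is made up for illustration.

```python
from json.decoder import scanstring

text = '{"key": "a\\nb"}'

# Index 2 is the first character after the opening quote of "key".
key, end_key = scanstring(text, 2)

# The value's opening quote sits at index 8, so scanning starts at 9.
value, end_value = scanstring(text, 9)

print(key, end_key)        # decoded key and the index just past its closing quote
print(repr(value), end_value)
```

The returned index points one past the closing quote, which is where the caller resumes parsing.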
The fast path copies contiguous non-escape characters into a
_PyUnicodeWriter. A \ causes a character-dispatch switch:
switch (c) {
    case '"':  c = '"';  break;
    case '\\': c = '\\'; break;
    case '/':  c = '/';  break;
    case 'b':  c = '\b'; break;
    case 'f':  c = '\f'; break;
    case 'n':  c = '\n'; break;
    case 'r':  c = '\r'; break;
    case 't':  c = '\t'; break;
    case 'u':
        /* Read four hex digits. */
        c  = (Py_UCS4)digit[0] << 12;
        c |= (Py_UCS4)digit[1] << 8;
        c |= (Py_UCS4)digit[2] << 4;
        c |= (Py_UCS4)digit[3];
        if (Py_UNICODE_IS_HIGH_SURROGATE(c)) {
            /* Peek for a following \uDC00-\uDFFF low surrogate. */
            if (s[0] == '\\' && s[1] == 'u') {
                Py_UCS4 c2 = /* next \uXXXX */;
                if (Py_UNICODE_IS_LOW_SURROGATE(c2)) {
                    c = Py_UNICODE_JOIN_SURROGATES(c, c2);
                    s += 6;
                }
            }
        }
        break;
    default:
        /* Unknown escape letter: rejected whether or not strict is set. */
        raise_errmsg("Invalid \\escape", pystr, end - 2);
        goto bail;
}
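The strict flag (True by default) governs only literal control characters:
U+0000 through U+001F appearing unescaped inside a string raise
JSONDecodeError in strict mode and are accepted when strict=False. An
unknown escape letter raises JSONDecodeError in both modes. A high
surrogate that is not followed by a valid low surrogate is not an error;
it is kept in the result as a lone surrogate code point.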
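The pair-joining and strict-mode behaviour can be observed through json.loads. A small sketch; loads_or_error is a hypothetical helper added here so the failure can be inspected rather than propagated.

```python
import json

# An escaped surrogate pair is joined into one supplementary-plane character.
joined = json.loads('"\\ud83d\\ude00"')


def loads_or_error(doc, **kwargs):
    """Return the decoded value, or the JSONDecodeError that was raised."""
    try:
        return json.loads(doc, **kwargs)
    except json.JSONDecodeError as exc:
        return exc


raw_tab = '"a\tb"'  # a literal TAB (U+0009) inside the JSON string
strict_result = loads_or_error(raw_tab)                 # strict=True is the default
lenient_result = loads_or_error(raw_tab, strict=False)

print(len(joined), type(strict_result).__name__, repr(lenient_result))
```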
py_encode_basestring_ascii (lines 300 to 700)
cpython 3.14 @ ab2d84fe1023/Modules/_json.c#L300-700
py_encode_basestring_ascii(s) is the ensure_ascii=True path. It scans
the input str and emits \uXXXX for every BMP code point above U+007F, and
a \uXXXX\uYYYY surrogate pair for code points above U+FFFF. In
simplified form:
static PyObject *
encoder_encode_string(PyEncoderObject *s, PyObject *obj)
{
    /* ... */
    if (s->ensure_ascii) {
        while (/* chars remain */) {
            Py_UCS4 c = PyUnicode_READ(kind, data, i);
            if (c >= 0x10000) {
                /* Supplementary plane: emit surrogate pair. */
                Py_UCS4 v = c - 0x10000;
                *p++ = '\\'; *p++ = 'u';
                emit_hex4(p, 0xD800 | (v >> 10)); p += 4;
                *p++ = '\\'; *p++ = 'u';
                emit_hex4(p, 0xDC00 | (v & 0x3FF)); p += 4;
            } else if (c >= 0x80) {
                /* BMP non-ASCII: emit \uXXXX. */
                *p++ = '\\'; *p++ = 'u';
                emit_hex4(p, c); p += 4;
            } else {
                /* ASCII: emitted directly here; the escapes for ", \,
                   and control characters are elided from this sketch. */
                *p++ = (char)c;
            }
        }
    }
}
Control characters U+0000 through U+001F are always escaped, even when
ensure_ascii=False: \b, \f, \n, \r, and \t use their two-character forms
and the rest are written as \u00XX. The structural characters " and \ are
escaped as \" and \\ respectively.
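The surrogate arithmetic and the two escaping modes can be checked from Python. In this sketch surrogate_pair is a hypothetical helper that repeats the shift-and-mask steps from the C excerpt; the json.dumps calls exercise the real encoder.

```python
import json


def surrogate_pair(cp):
    """Split a supplementary-plane code point into (high, low) surrogates."""
    v = cp - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


high, low = surrogate_pair(0x1F600)  # U+1F600

sample = "\u00e9\U0001F600"          # one BMP non-ASCII char, one astral char
ascii_out = json.dumps(sample)                     # ensure_ascii=True by default
raw_out = json.dumps(sample, ensure_ascii=False)   # non-ASCII passes through
ctrl_out = json.dumps("\x01\n", ensure_ascii=False)  # control chars still escaped

print(hex(high), hex(low))
print(ascii_out, raw_out, ctrl_out)
```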
encoder_listencode_dict recursive encoding (lines 700 to 1200)
cpython 3.14 @ ab2d84fe1023/Modules/_json.c#L700-1200
The dict encoder is the recursive core for JSON object serialization. It
appends string chunks to a Python list accumulator (chunks) rather than
building one large string. The caller joins the list with ''.join(chunks).
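The chunk-then-join shape is visible through the public JSONEncoder.iterencode method. This is only an illustration of the pattern; it makes no claim about whether a given chunk was produced by the C or the pure-Python code path.

```python
import json

obj = {"a": [1, 2], "b": "x"}

# iterencode yields the output piece by piece instead of as one string.
chunks = list(json.JSONEncoder().iterencode(obj))

# Joining the pieces gives the same text as a one-shot dumps call.
joined = "".join(chunks)

print(chunks)
print(joined)
```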
Keys of type str are written as-is; int, float, bool, and None keys are
converted to their JSON text and quoted. Keys of any other type are
silently skipped when skipkeys=True, otherwise they raise TypeError:
static int
encoder_listencode_dict(PyEncoderObject *s, _PyUnicodeWriter *writer,
                        PyObject *dct, Py_ssize_t indent_level)
{
    /* Abridged: declarations, separators, and error cleanup are elided. */
    PyObject *it = PyObject_GetIter(dct);  /* iterate over keys */
    while ((key = PyIter_Next(it)) != NULL) {
        if (PyUnicode_Check(key)) {
            kstr = Py_NewRef(key);
        }
        else if (/* key is int, float, bool, or None */) {
            kstr = /* key converted to its JSON text */;
        }
        else if (s->skipkeys) {
            Py_DECREF(key);
            continue;
        }
        else {
            PyErr_Format(PyExc_TypeError,
                         "keys must be str, int, float, bool or None, "
                         "not %.100s", Py_TYPE(key)->tp_name);
            goto bail;
        }
        if (encoder_encode_string(s, writer, kstr) < 0) goto bail;
        /* separator */
        value = PyObject_GetItem(dct, key);
        if (encoder_listencode_obj(s, writer, value, indent_level) < 0)
            goto bail;
        Py_DECREF(value);
        Py_DECREF(kstr);
        Py_DECREF(key);
    }
    ...
}
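The skipkeys switch can be exercised from Python with a key type that has no JSON representation. A short sketch; the tuple key and variable names are arbitrary choices for illustration.

```python
import json

data = {"a": 1, (1, 2): 2}  # the tuple key cannot be written as a JSON key

# With skipkeys=True the offending entry is dropped without complaint.
skipped = json.dumps(data, skipkeys=True)

# With the default skipkeys=False the same input raises TypeError.
try:
    json.dumps(data)
    error = None
except TypeError as exc:
    error = exc

print(skipped)
print(type(error).__name__)
```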
Circular reference detection uses a markers dict keyed on id(obj) and is
active when check_circular=True (the default). Before encoding any
container, its id is inserted; after encoding, the id is removed. Meeting
an id that is already present raises ValueError: Circular reference detected.
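The insert-then-remove discipline can be sketched in a few lines of Python. The walk function below is a hypothetical stand-in for the recursive encoder, not a transcription of the C code; it only tracks markers and produces no JSON. Because each id is removed on the way out, an object that appears twice without containing itself is accepted.

```python
import json


def walk(obj, markers=None):
    """Visit nested lists/dicts, raising on a container that contains itself."""
    if markers is None:
        markers = {}
    if isinstance(obj, (list, dict)):
        key = id(obj)
        if key in markers:
            raise ValueError("Circular reference detected")
        markers[key] = obj  # entering the container
        children = obj.values() if isinstance(obj, dict) else obj
        for child in children:
            walk(child, markers)
        del markers[key]    # leaving the container
    return True


def raises_value_error(fn, arg):
    """Return True if fn(arg) raises ValueError."""
    try:
        fn(arg)
    except ValueError:
        return True
    return False


shared = [1]
cycle = []
cycle.append(cycle)  # a list that contains itself

print(walk([shared, shared]), raises_value_error(walk, cycle))
print(json.dumps([shared, shared]), raises_value_error(json.dumps, cycle))
```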
NaN and infinite float values are encoded as NaN, Infinity, and
-Infinity when allow_nan=True (the default). When allow_nan=False
they raise ValueError, because those literals are not valid JSON.
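Both settings of allow_nan can be checked directly with json.dumps. A minimal sketch; the variable names are arbitrary.

```python
import json

# Default allow_nan=True: the non-standard literals are emitted.
nan_text = json.dumps(float("nan"))
inf_text = json.dumps([float("inf"), float("-inf")])

# allow_nan=False: the same values are rejected.
try:
    json.dumps(float("nan"), allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = exc

print(nan_text, inf_text, type(strict_error).__name__)
```

Output produced with the default setting may be rejected by parsers that follow the JSON grammar strictly, so allow_nan=False is the safer choice when the consumer is unknown.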
gopy mirror
module/json/ (pending). ScanString mirrors scanstring_unicode
character-by-character with the same surrogate-pair join logic.
EncodeString mirrors the ensure_ascii branch. EncodeDictChunks and
EncodeListChunks mirror the list-accumulator pattern with the same
skipkeys / sort_keys knobs. Circular-reference detection uses a Go
map[uintptr]struct{} keyed on object pointer identity.
CPython 3.14 changes
_json.c has been largely stable since 3.1. The _PyUnicodeWriter fast
path replaced the older PyObject_CallMethod approach in 3.3. The
JSONDecodeError exception (a subclass of ValueError) was introduced in
3.5. Per-interpreter state cleanup arrived in 3.12. No structural changes
between 3.12 and 3.14.