Modules/_bz2module.c
cpython 3.14 @ ab2d84fe1023/Modules/_bz2module.c
The C backend for the bz2 module. The pure-Python Lib/bz2.py imports this
module as _bz2 and wraps BZ2Compressor and BZ2Decompressor in
higher-level BZ2File and open() helpers.
The file is self-contained at 641 lines. It registers two types:
BZ2Compressor, a write-only object that feeds data through
BZ2_bzCompress in BZ_RUN mode and finalises the stream with BZ_FINISH
on a call to flush(); and BZ2Decompressor, a read-only object that drives
BZ2_bzDecompress with an internal input buffer, tracking eof,
unused_data, and needs_input in the same pattern as zlib._ZlibDecompressor
and lzma.LZMADecompressor.
Both types use PyMutex for thread safety and release the GIL around the
compression/decompression calls via Py_BEGIN_ALLOW_THREADS.
Map
| Lines | Symbol | Role | gopy |
|---|---|---|---|
| 1-107 | includes, _bz2_state, BZ2Compressor, BZ2Decompressor structs, output-buffer helpers | Module state struct, both object structs, and _BlocksOutputBuffer wrappers. | module/bz2/ |
| 108-212 | compress, catch_bz2_error, BZ2_Malloc, BZ2_Free | Core compress loop (used by both compress() and flush()); error-code mapper; custom allocator wrappers. | module/bz2/ |
| 213-314 | _bz2_BZ2Compressor_compress_impl, _bz2_BZ2Compressor_flush_impl, _bz2_BZ2Compressor_impl, BZ2Compressor_dealloc | BZ2Compressor methods and constructor. compress() calls the shared compress loop with BZ_RUN; flush() calls it with BZ_FINISH and marks the object flushed. | module/bz2/ |
| 315-496 | decompress_buf, decompress | Inner decompression engine. decompress_buf drives BZ2_bzDecompress until input is exhausted or BZ_STREAM_END is reached. decompress manages the internal input buffer and sets needs_input/eof. | module/bz2/ |
| 497-544 | _bz2_BZ2Decompressor_decompress_impl, _bz2_BZ2Decompressor_impl, BZ2Decompressor_dealloc, BZ2Decompressor_unused_data_get | BZ2Decompressor methods, constructor, destructor, and the unused_data property getter. | module/bz2/ |
| 545-641 | _bz2_exec, _bz2_traverse, _bz2_clear, _bz2_free, PyInit__bz2 | Module init. Registers both types in per-interpreter state; no module-level constants (bz2 has no user-visible integer flags). | module/bz2/ |
Reading
BZ2Compressor struct layout (lines 1 to 107)
cpython 3.14 @ ab2d84fe1023/Modules/_bz2module.c#L1-107
typedef struct {
PyObject_HEAD
bz_stream bzs; /* libbzip2 stream state */
int flushed; /* 1 after flush() is called; compress() checks this */
PyMutex mutex;
} BZ2Compressor;
bz_stream is the libbzip2 analogue of zlib's z_stream. It holds
next_in/avail_in/next_out/avail_out pointer-and-length pairs that
BZ2_bzCompress advances on each call. flushed is a one-way latch: once
flush() has sent BZ_FINISH, any further call to compress() raises
ValueError("Compressor has been flushed").
typedef struct {
PyObject_HEAD
bz_stream bzs;
char eof; /* set when BZ_STREAM_END is seen */
PyObject *unused_data; /* bytes after the compressed stream */
char needs_input; /* True if internal buffer is empty */
char *input_buffer; /* heap-allocated unconsumed input */
size_t input_buffer_size;
size_t bzs_avail_in_real; /* full size_t count of pending input bytes */
PyMutex mutex;
} BZ2Decompressor;
bzs_avail_in_real is the authoritative count of bytes waiting to be
consumed. bzs.avail_in is only an unsigned int, so before each
BZ2_bzDecompress call the code clips a chunk of at most UINT_MAX bytes into
it and subtracts that chunk from the real count. After the call, whatever
libbzip2 left unconsumed is added back via
d->bzs_avail_in_real += bzs->avail_in.
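The clip, subtract, and add-back cycle can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not the C code: `drain`, `limit` (a small stand-in for UINT_MAX), and the `consume` callback are hypothetical names introduced for illustration.

```python
def drain(avail_in_real, limit, consume):
    """Model the bookkeeping around each codec call.

    avail_in_real: authoritative byte count (bzs_avail_in_real in the C code).
    limit:         stand-in for UINT_MAX, kept small so the demo is readable.
    consume:       hypothetical codec; given avail_in, returns bytes it ate.
    """
    remaining_after_each_call = []
    while avail_in_real > 0:
        avail_in = min(avail_in_real, limit)  # clip to the narrow field
        avail_in_real -= avail_in             # hand the chunk to the codec
        avail_in -= consume(avail_in)         # codec advances avail_in
        avail_in_real += avail_in             # add back what it left behind
        remaining_after_each_call.append(avail_in_real)
    return remaining_after_each_call


# A codec that eats at most 3 bytes per call, with a 4-byte "UINT_MAX".
trace = drain(10, limit=4, consume=lambda n: min(n, 3))
```

With these inputs the authoritative count drops by exactly what the codec consumed on every call, regardless of how the chunk was clipped.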
BZ2Compressor.compress() GIL-free loop (lines 108 to 212)
cpython 3.14 @ ab2d84fe1023/Modules/_bz2module.c#L108-212
Both compress() and flush() delegate to the static compress helper:
static PyObject *
compress(BZ2Compressor *c, char *data, size_t len, int action)
{
_BlocksOutputBuffer buffer = {.writer = NULL};
/* Allocation can fail; bail out before touching next_out. */
if (OutputBuffer_InitAndGrow(&buffer, -1, &c->bzs.next_out, &c->bzs.avail_out) < 0)
goto error;
c->bzs.next_in = data;
c->bzs.avail_in = 0;
for (;;) {
int bzerror;
/* Clip to UINT_MAX because avail_in is unsigned int. */
if (c->bzs.avail_in == 0 && len > 0) {
c->bzs.avail_in = (unsigned int)Py_MIN(len, UINT_MAX);
len -= c->bzs.avail_in;
}
/* In BZ_RUN mode, stop when all input is consumed. */
if (action == BZ_RUN && c->bzs.avail_in == 0)
break;
if (c->bzs.avail_out == 0) {
if (OutputBuffer_Grow(&buffer, &c->bzs.next_out, &c->bzs.avail_out) < 0)
goto error;
}
Py_BEGIN_ALLOW_THREADS
bzerror = BZ2_bzCompress(&c->bzs, action);
Py_END_ALLOW_THREADS
if (catch_bz2_error(bzerror))
goto error;
/* In BZ_FINISH mode, stop when the stream end marker is written. */
if (action == BZ_FINISH && bzerror == BZ_STREAM_END)
break;
}
return OutputBuffer_Finish(&buffer, c->bzs.avail_out);
error:
OutputBuffer_OnError(&buffer);
return NULL;
}
The single loop handles both modes. When action == BZ_RUN, the break
condition is avail_in == 0 after all input has been fed in. When
action == BZ_FINISH, the break condition is BZ_STREAM_END from
BZ2_bzCompress, which means the compressor has written the end-of-stream
block; len is 0 in that case because flush() passes NULL/0 as the
input.
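Seen from Python, the two modes look like this. A minimal sketch using only the public bz2 API; `compress_chunks` is a hypothetical helper name. compress() may return empty bytes while libbzip2 is still buffering a block, so the pieces are only meaningful once joined with the output of flush().

```python
import bz2


def compress_chunks(chunks, compresslevel=9):
    """Feed chunks through compress() (BZ_RUN), then flush() (BZ_FINISH)."""
    comp = bz2.BZ2Compressor(compresslevel)
    # Each piece may be b"" while the compressor is still filling a block.
    pieces = [comp.compress(chunk) for chunk in chunks]
    # flush() emits whatever is buffered plus the end-of-stream block.
    pieces.append(comp.flush())
    return pieces


pieces = compress_chunks([b"spam" * 100, b"eggs" * 100])
stream = b"".join(pieces)
```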
_bz2_BZ2Compressor_compress_impl and _bz2_BZ2Compressor_flush_impl are
thin wrappers that acquire the mutex, check flushed, then call compress:
static PyObject *
_bz2_BZ2Compressor_compress_impl(BZ2Compressor *self, Py_buffer *data)
{
PyObject *result = NULL;
PyMutex_Lock(&self->mutex);
if (self->flushed)
PyErr_SetString(PyExc_ValueError, "Compressor has been flushed");
else
result = compress(self, data->buf, data->len, BZ_RUN);
PyMutex_Unlock(&self->mutex);
return result;
}
static PyObject *
_bz2_BZ2Compressor_flush_impl(BZ2Compressor *self)
{
PyObject *result = NULL;
PyMutex_Lock(&self->mutex);
if (self->flushed)
PyErr_SetString(PyExc_ValueError, "Repeated call to flush()");
else {
self->flushed = 1;
result = compress(self, NULL, 0, BZ_FINISH);
}
PyMutex_Unlock(&self->mutex);
return result;
}
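The one-way flushed latch is easy to observe from Python. The sketch below uses only the public bz2 API; `flushed_latch_errors` is a hypothetical helper that collects the two error messages set by the wrappers above.

```python
import bz2


def flushed_latch_errors():
    """Return the messages raised by compress() and flush() after a flush."""
    comp = bz2.BZ2Compressor()
    comp.compress(b"payload")
    comp.flush()  # sets the latch
    messages = []
    for call in (lambda: comp.compress(b"more"), comp.flush):
        try:
            call()
        except ValueError as exc:
            messages.append(str(exc))
    return messages


latch_messages = flushed_latch_errors()
```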
compresslevel (1-9, default 9) is passed to BZ2_bzCompressInit in the
constructor. Level 9 gives the best compression ratio; level 1 is the fastest.
There are no strategy flags in bzip2 (unlike zlib's Z_HUFFMAN_ONLY etc.).
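The level is visible in the output: a bzip2 stream starts with the magic bytes "BZh" followed by the level digit. The sketch below reads that digit back and checks that out-of-range levels are rejected. `level_from_header` and `rejects_level` are hypothetical helper names.

```python
import bz2


def level_from_header(stream):
    """Return the level digit stored after the b"BZh" magic bytes."""
    if stream[:3] != b"BZh":
        raise ValueError("not a bzip2 stream")
    return int(stream[3:4])


def rejects_level(level):
    """True if the constructor refuses this compresslevel."""
    try:
        bz2.BZ2Compressor(level)
    except ValueError:
        return True
    return False


levels = [level_from_header(bz2.compress(b"x" * 1000, lvl)) for lvl in (1, 5, 9)]
```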
BZ2Decompressor.decompress() eof and needs_input (lines 315 to 496)
cpython 3.14 @ ab2d84fe1023/Modules/_bz2module.c#L315-496
decompress_buf is the inner engine. It drives BZ2_bzDecompress in a loop,
growing the output buffer as needed, and sets d->eof with a relaxed atomic
store (FT_ATOMIC_STORE_CHAR_RELAXED) when it sees BZ_STREAM_END:
static PyObject*
decompress_buf(BZ2Decompressor *d, Py_ssize_t max_length)
{
_BlocksOutputBuffer buffer = {.writer = NULL};
bz_stream *bzs = &d->bzs;
/* Allocation can fail; bail out before touching next_out. */
if (OutputBuffer_InitAndGrow(&buffer, max_length,
&bzs->next_out, &bzs->avail_out) < 0)
goto error;
for (;;) {
int bzret;
/* Clip to UINT_MAX for the same reason as the compressor. */
bzs->avail_in = (unsigned int)Py_MIN(d->bzs_avail_in_real, UINT_MAX);
d->bzs_avail_in_real -= bzs->avail_in;
Py_BEGIN_ALLOW_THREADS
bzret = BZ2_bzDecompress(bzs);
Py_END_ALLOW_THREADS
d->bzs_avail_in_real += bzs->avail_in; /* add back unconsumed */
if (catch_bz2_error(bzret))
goto error;
if (bzret == BZ_STREAM_END) {
FT_ATOMIC_STORE_CHAR_RELAXED(d->eof, 1);
break;
} else if (d->bzs_avail_in_real == 0) {
break; /* all input consumed, not yet at end */
} else if (bzs->avail_out == 0) {
if (OutputBuffer_GetDataSize(&buffer, bzs->avail_out) == max_length)
break; /* output cap reached */
if (OutputBuffer_Grow(&buffer, &bzs->next_out, &bzs->avail_out) < 0)
goto error;
}
}
return OutputBuffer_Finish(&buffer, bzs->avail_out);
error:
OutputBuffer_OnError(&buffer);
return NULL;
}
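The effect of BZ_STREAM_END is observable through the public attributes. A minimal sketch, assuming a stream split at an arbitrary byte offset with a made-up trailer appended: eof stays False until the end marker is seen, trailing bytes land in unused_data, and further calls raise EOFError.

```python
import bz2

payload = b"hello world" * 50
stream = bz2.compress(payload)

dec = bz2.BZ2Decompressor()
head = dec.decompress(stream[:10])               # stream not finished yet
eof_midway = dec.eof
tail = dec.decompress(stream[10:] + b"TRAILER")  # end marker plus extra bytes

# Once eof is set, the object refuses further input.
try:
    dec.decompress(b"more")
    after_eof = None
except EOFError as exc:
    after_eof = str(exc)
```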
The outer decompress function handles the input buffer logic: if
bzs->next_in is non-NULL (there was leftover input from a previous call),
it appends the new data to the internal buffer, possibly reallocating it. If
there was no leftover input, it sets bzs->next_in directly to the caller's
buffer without copying. After decompress_buf returns, decompress updates
needs_input and unused_data:
if (d->eof) {
/* Stream finished. Save any trailing bytes as unused_data. */
d->needs_input = 0;
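if (d->bzs_avail_in_real > 0) {
/* Py_XSETREF releases the previous (empty) bytes object. */
Py_XSETREF(d->unused_data, PyBytes_FromStringAndSize(
bzs->next_in, d->bzs_avail_in_real));
if (d->unused_data == NULL)
goto error;
}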
}
else if (d->bzs_avail_in_real == 0) {
bzs->next_in = NULL;
d->needs_input = 1; /* caller must provide more data */
}
else {
d->needs_input = 0; /* internal buffer still has bytes */
/* Copy tail from caller's buffer into internal buffer
so we own it across the call boundary. */
if (!input_buffer_in_use) {
/* allocate / resize input_buffer, then memcpy */
}
}
needs_input = True means the decompressor has consumed everything it was
given and cannot make progress until the caller passes new data.
needs_input = False after a non-EOF call means unconsumed bytes remain in
the internal buffer, which happens when max_length capped the output;
calling decompress(b"") will drain them.
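A capped read loop built on these two flags might look like the following. This is a sketch under the assumption that the whole stream is passed in the first call; `decompress_capped` is a hypothetical helper, and the ValueError for a truncated stream is this example's own choice.

```python
import bz2


def decompress_capped(stream, cap):
    """Decompress stream in output slices of at most cap bytes."""
    dec = bz2.BZ2Decompressor()
    slices = []
    data = stream
    while not dec.eof:
        if dec.needs_input and not data:
            # Nothing buffered and nothing new to feed: the stream is short.
            raise ValueError("truncated stream")
        slices.append(dec.decompress(data, max_length=cap))
        data = b""  # later calls drain the decompressor's own buffer
    return slices


payload = b"abc" * 1000
slices = decompress_capped(bz2.compress(payload), cap=256)

# After a capped call with output still pending, needs_input is False.
probe = bz2.BZ2Decompressor()
first = probe.decompress(bz2.compress(payload), max_length=16)
needs_more = probe.needs_input
```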
Error handling: catch_bz2_error (lines 131 to 171)
cpython 3.14 @ ab2d84fe1023/Modules/_bz2module.c#L131-171
libbzip2 returns integer status codes. catch_bz2_error maps them to Python
exceptions:
static int
catch_bz2_error(int bzerror)
{
switch (bzerror) {
case BZ_OK:
case BZ_RUN_OK:
case BZ_FLUSH_OK:
case BZ_FINISH_OK:
case BZ_STREAM_END:
return 0; /* not an error */
case BZ_CONFIG_ERROR:
PyErr_SetString(PyExc_SystemError,
"libbzip2 was not compiled correctly");
return 1;
case BZ_PARAM_ERROR:
PyErr_SetString(PyExc_ValueError,
"Internal error - "
"invalid parameters passed to libbzip2");
return 1;
case BZ_DATA_ERROR:
case BZ_DATA_ERROR_MAGIC:
/* BZ_DATA_ERROR_MAGIC fires when the stream does not start
with the "BZh" magic bytes. */
PyErr_SetString(PyExc_OSError, "Invalid data stream");
return 1;
case BZ_MEM_ERROR:
PyErr_NoMemory();
return 1;
case BZ_SEQUENCE_ERROR:
/* Raised when BZ2_bzCompress is called after BZ_FINISH. */
PyErr_SetString(PyExc_RuntimeError,
"Internal error - "
"Invalid sequence of commands sent to libbzip2");
return 1;
default:
PyErr_Format(PyExc_OSError,
"Unrecognized error from libbzip2: %d", bzerror);
return 1;
}
}
BZ_DATA_ERROR_MAGIC is worth calling out: it is the code libbzip2 returns
when the first bytes of the stream are not "BZh". This surfaces as
OSError("Invalid data stream") at the Python level, whereas zlib and lzma
report format errors through their own zlib.error and lzma.LZMAError
exception types. BZ_SEQUENCE_ERROR can only occur if the C code calls
BZ2_bzCompress with the wrong action after BZ_FINISH; it guards against a
bug in the wrapper (or in its gopy port), not a user error.
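The bad-magic path is the easiest one to trigger from Python. A minimal sketch; `error_for` is a hypothetical helper that reports the exception type name and message, or None when the input decodes cleanly.

```python
import bz2


def error_for(data):
    """Return (exception type name, message) raised for data, else None."""
    try:
        bz2.BZ2Decompressor().decompress(data)
    except Exception as exc:
        return type(exc).__name__, str(exc)
    return None


bad_magic = error_for(b"this is not a bzip2 stream")
valid = error_for(bz2.compress(b"ok"))
```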
gopy mirror
gopy ports Modules/_bz2module.c into module/bz2/. Go's compress/bzip2
package provides a decompressor (bzip2.NewReader), but it has no compressor;
compression requires the third-party dsnet/compress/bzip2 package or CGo
bindings to libbzip2. The porting strategy is:
- BZ2Compressor uses CGo to call BZ2_bzCompressInit, BZ2_bzCompress, and BZ2_bzCompressEnd directly, preserving the GIL-release pattern by calling into C outside the Go runtime lock.
- BZ2Decompressor wraps bzip2.NewReader for the pure-Go path, adding the input_buffer, needs_input, and eof bookkeeping in Go, mirroring CPython's decompress function.
Thread safety uses sync.Mutex in both types. The compresslevel parameter
maps to libbzip2's blockSize100k argument (1-9), with 9 as the default.
catch_bz2_error is mirrored as a Go helper that maps libbzip2 return codes
to OSError, ValueError, MemoryError, SystemError, or RuntimeError,
preserving the exact exception types CPython raises.