Modules/_lzmamodule.c

Source:

cpython 3.14 @ ab2d84fe1023/Modules/_lzmamodule.c

The _lzma extension module wraps liblzma (part of the XZ Utils package) to expose the LZMACompressor and LZMADecompressor types to Python. Both types hold a live lzma_stream inside their C struct. The compress() / decompress() convenience functions live in Lib/lzma.py and are thin Python wrappers over those types.

Map

| Symbol | Kind | Lines (approx) | Purpose |
| --- | --- | --- | --- |
| Compressor | struct | 40-60 | Holds lzma_stream, the flushed flag, and lock |
| Decompressor | struct | 65-90 | Holds lzma_stream, check, eof, unused_data, and lock |
| Compressor_init | function | 200-290 | Initialises stream with lzma_easy_encoder, lzma_stream_encoder, lzma_alone_encoder, or lzma_raw_encoder |
| Decompressor_init | function | 540-620 | Initialises stream with lzma_auto_decoder, lzma_stream_decoder, lzma_alone_decoder, or lzma_raw_decoder |
| Compressor_compress | function | 295-360 | Drives the compress loop, collecting partial output |
| Compressor_flush | function | 365-410 | Finalises stream with LZMA_FINISH action |
| Decompressor_decompress | function | 625-730 | Drives the decompress loop, recording eof and unused_data at end of stream |
| FORMAT_AUTO / FORMAT_XZ / FORMAT_ALONE / FORMAT_RAW | constants | 850-870 | Format selector exposed to Python |
| CHECK_CRC32 / CHECK_CRC64 / CHECK_SHA256 / CHECK_NONE | constants | 870-900 | Integrity check selector |

Reading

Stream initialisation and format selection

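Compressor_init (line 200) inspects the format argument to choose the liblzma encoder. FORMAT_XZ uses lzma_easy_encoder when only a preset level is given and lzma_stream_encoder when the caller supplies an explicit filter chain. FORMAT_ALONE uses lzma_alone_encoder, and FORMAT_RAW uses lzma_raw_encoder, which always requires a filter chain. The check argument applies only to FORMAT_XZ and selects the integrity check appended to each XZ block.

// CPython: Modules/_lzmamodule.c:200 Compressor_init
// (simplified; argument parsing and filter conversion elided)
static int
Compressor_init(Compressor *self, PyObject *args, PyObject *kwargs)
{
    lzma_ret ret;
    switch (format) {
    case FORMAT_XZ:
        if (filterspecs == Py_None)
            ret = lzma_easy_encoder(&self->lzs, preset, check);
        else
            ret = lzma_stream_encoder(&self->lzs, filters, check);
        break;
    case FORMAT_ALONE:
        /* preset or single LZMA1 filter, converted to lzma_options_lzma */
        ret = lzma_alone_encoder(&self->lzs, &options);
        break;
    case FORMAT_RAW:
        ret = lzma_raw_encoder(&self->lzs, filters);
        break;
    }
    if (ret != LZMA_OK) {
        /* raise LZMAError */
    }
}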
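
The same format branches can be exercised from Python through the lzma module. The following is a minimal sketch, not code from CPython: the helper name roundtrip and the sample payload are illustrative, and it assumes a liblzma build with the LZMA2 filter available.

```python
import lzma

payload = b"stream initialisation " * 64

# Explicit filter chain with a single LZMA2 filter.
# Required for FORMAT_RAW, optional for FORMAT_XZ.
filters = [{"id": lzma.FILTER_LZMA2, "preset": 6}]


def roundtrip(fmt, **kwargs):
    """Compress and decompress payload in the given container format (illustrative helper)."""
    comp = lzma.LZMACompressor(format=fmt, **kwargs)
    blob = comp.compress(payload) + comp.flush()
    # A raw stream carries no headers, so the decoder needs the same filter chain.
    dec_kwargs = {"filters": kwargs["filters"]} if fmt == lzma.FORMAT_RAW else {}
    return lzma.LZMADecompressor(format=fmt, **dec_kwargs).decompress(blob)


results = {
    "xz_preset": roundtrip(lzma.FORMAT_XZ, preset=6, check=lzma.CHECK_CRC32),
    "xz_filters": roundtrip(lzma.FORMAT_XZ, filters=filters),
    "alone": roundtrip(lzma.FORMAT_ALONE, preset=6),
    "raw": roundtrip(lzma.FORMAT_RAW, filters=filters),
}
```

Each entry in results should equal payload. The preset and filters arguments are mutually exclusive, which is why the two FORMAT_XZ calls pass one or the other.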

Decompressor_init (line 540) maps the four user-facing format constants to the appropriate liblzma decoder. FORMAT_AUTO becomes lzma_auto_decoder, which auto-detects XZ and legacy LZMA streams. FORMAT_XZ uses lzma_stream_decoder, and FORMAT_ALONE uses lzma_alone_decoder. FORMAT_RAW uses lzma_raw_decoder with a caller-supplied filter chain.

// CPython: Modules/_lzmamodule.c:540 Decompressor_init
// (simplified; argument parsing and error handling elided)
static int
Decompressor_init(Decompressor *self, PyObject *args, PyObject *kwargs)
{
    lzma_ret ret;
    switch (format) {
    case FORMAT_AUTO:
        ret = lzma_auto_decoder(&self->lzs, memlimit, decoder_flags);
        break;
    case FORMAT_XZ:
        ret = lzma_stream_decoder(&self->lzs, memlimit, decoder_flags);
        break;
    case FORMAT_ALONE:
        ret = lzma_alone_decoder(&self->lzs, memlimit);
        break;
    case FORMAT_RAW:
        /* no container headers: the filter chain comes from the caller */
        ret = lzma_raw_decoder(&self->lzs, filters);
        break;
    }
}
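
The difference between the auto-detecting decoder and the strict XZ decoder is observable from Python. This is a small sketch using only the public lzma API; the variable names are illustrative.

```python
import lzma

data = b"decoder selection " * 32

xz_blob = lzma.compress(data, format=lzma.FORMAT_XZ)
alone_blob = lzma.compress(data, format=lzma.FORMAT_ALONE)

# FORMAT_AUTO (the default) accepts both the .xz and the legacy .lzma container.
auto_xz = lzma.LZMADecompressor(format=lzma.FORMAT_AUTO).decompress(xz_blob)
auto_alone = lzma.LZMADecompressor(format=lzma.FORMAT_AUTO).decompress(alone_blob)

# FORMAT_XZ is strict: a legacy .lzma stream is rejected with LZMAError.
try:
    lzma.LZMADecompressor(format=lzma.FORMAT_XZ).decompress(alone_blob)
    strict_rejected = False
except lzma.LZMAError:
    strict_rejected = True
```

Choosing FORMAT_XZ or FORMAT_ALONE explicitly is useful when the caller wants unexpected containers to fail instead of being decoded silently.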

Compression partial-output loop

Compressor_compress (line 295) feeds input to liblzma in a do/while loop, accumulating output in a growable buffer that is returned to Python as a bytes object. It sets next_in and avail_in once before the loop; each iteration then makes room in the output buffer and calls lzma_code with LZMA_RUN. The loop exits when avail_in reaches zero, meaning all input has been consumed. Consumed does not mean emitted: liblzma may still hold data internally until a later call or a flush.

// CPython: Modules/_lzmamodule.c:295 Compressor_compress
static PyObject *
Compressor_compress(Compressor *self, PyObject *args)
{
    self->lzs.next_in = (uint8_t *)data.buf;
    self->lzs.avail_in = data.len;
    do {
        arrange_output_buffer(&self->lzs, &result, &output_size);
        lzma_ret ret = lzma_code(&self->lzs, LZMA_RUN);
        if (ret != LZMA_OK && ret != LZMA_STREAM_END)
            goto error;
    } while (self->lzs.avail_in != 0);
    /* trim result to actual output length */
}

Compressor_flush (line 365) runs the same loop but passes LZMA_FINISH to signal end-of-stream. It keeps looping until liblzma returns LZMA_STREAM_END, at which point, for FORMAT_XZ, the last block's integrity check, the index, and the stream footer have been written.
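
From Python, the split between compress() and flush() looks like the sketch below. The helper name compress_chunks is illustrative and is not part of the lzma module.

```python
import lzma


def compress_chunks(chunks):
    """Feed chunks to one LZMACompressor and return (blob, pieces) (illustrative helper)."""
    comp = lzma.LZMACompressor()
    pieces = []
    for chunk in chunks:
        # compress() may return b"" while liblzma is still buffering input.
        pieces.append(comp.compress(chunk))
    # flush() drives the encoder with LZMA_FINISH and returns the remaining output.
    pieces.append(comp.flush())
    return b"".join(pieces), pieces


chunks = [b"abc" * 100, b"def" * 100, b"ghi" * 100]
blob, pieces = compress_chunks(chunks)
restored = lzma.decompress(blob)
```

Only the concatenation of every piece, including the one returned by flush(), is a complete stream; dropping the flush output leaves a truncated file.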

Decompression and multi-stream input

Decompressor_decompress (line 625) decodes exactly one stream per decompressor object. When lzma_code returns LZMA_STREAM_END, the function sets self->eof and stops, even if input bytes remain; it does not reset the decoder. Any bytes left after the end of the stream are stored in self->unused_data. Buffers that contain several concatenated XZ streams are handled one level up: decompress() and LZMAFile (Lib/lzma.py) create a fresh LZMADecompressor for each stream and feed it the data left over from the previous one.

// CPython: Modules/_lzmamodule.c:670 Decompressor_decompress (end of stream)
if (ret == LZMA_STREAM_END) {
    /* one stream per decompressor: stop here even if input remains */
    self->eof = 1;
    break;
}
/* after the loop: leftover input, if any, becomes unused_data */
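
The end-of-stream behaviour of a single LZMADecompressor can be observed from Python. The sketch below also includes a simplified multi-stream loop in the spirit of lzma.decompress(); the helper name decompress_all is illustrative, and unlike the standard library function it makes no attempt to tolerate trailing garbage after the last stream.

```python
import lzma

first = lzma.compress(b"first stream")
second = lzma.compress(b"second stream")
blob = first + second

# A single LZMADecompressor stops at the end of the first stream.
single = lzma.LZMADecompressor()
head = single.decompress(blob)
leftover = single.unused_data  # the bytes of the second stream, still compressed


def decompress_all(data):
    """Decode every concatenated stream with a fresh decompressor each (illustrative helper)."""
    out = []
    while data:
        dec = lzma.LZMADecompressor()
        out.append(dec.decompress(data))
        if not dec.eof:
            raise lzma.LZMAError("compressed data ended before the end-of-stream marker")
        data = dec.unused_data
    return b"".join(out)


everything = decompress_all(blob)
```

Here head holds only the first payload and single.eof is true, while decompress_all(blob) and lzma.decompress(blob) both return the two payloads joined together.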

The CHECK_CRC64 and CHECK_NONE constants (lines 870-900) are passed as the check argument to lzma_easy_encoder. CHECK_NONE disables integrity verification entirely, useful for in-memory pipelines where external framing already covers integrity. CHECK_CRC64 is the XZ default and provides strong detection of corruption.
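
The effect of the check argument can be read back through the decompressor's check attribute. This is a brief sketch; it sticks to CHECK_NONE, CHECK_CRC32, and CHECK_CRC64 on the assumption that the local liblzma build supports them (CHECK_SHA256 may be missing from reduced builds).

```python
import lzma

data = b"integrity check " * 16

observed = {}
restored = {}
for name, check in [
    ("none", lzma.CHECK_NONE),
    ("crc32", lzma.CHECK_CRC32),
    ("crc64", lzma.CHECK_CRC64),
]:
    blob = lzma.compress(data, format=lzma.FORMAT_XZ, check=check)
    dec = lzma.LZMADecompressor()
    restored[name] = dec.decompress(blob)
    # After decoding, check reports the ID found in the stream header.
    observed[name] = dec.check
```

The check ID is stored in the XZ stream header, so the decoder needs no hint from the caller about which check was used.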

gopy notes

Status: not yet ported.

Planned package path: module/lzma/.

The Go standard library does not include an LZMA decoder. The most practical path is a cgo bridge over the system liblzma, mirroring how _lzmamodule.c works. A pure-Go alternative exists (github.com/ulikunitz/xz), but it would need to be evaluated for streaming API compatibility before adoption.

LZMACompressor and LZMADecompressor both require a per-object mutex because lzma_stream is not thread-safe. The Go port should use sync.Mutex in the same position.

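Multi-stream handling must be ported faithfully. In CPython it lives in the Python-level decompress() and LZMAFile, which create a fresh LZMADecompressor for each stream, and not in Decompressor_decompress itself. Dropping it would silently truncate concatenated XZ files, which are commonly produced by parallel compression tools.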