Lib/tarfile.py

Source:

cpython 3.14 @ ab2d84fe1023/Lib/tarfile.py

Map

| Lines | Symbol | Role |
|---|---|---|
| 1–120 | module header, constants | ENCODING, BLOCKSIZE, RECORDSIZE, GNU_MAGIC, POSIX_MAGIC, REGTYPE/DIRTYPE/SYMTYPE/etc. |
| 121–300 | _Stream | transparent gzip/bzip2/xz wrapper around a raw file object |
| 301–480 | ExFileObject | read-only file-like view of a single tar entry, with sparse-map support |
| 481–900 | TarInfo | header packing/unpacking (ustar POSIX format, GNU extensions) |
| 901–1100 | TarInfo.frombuf, TarInfo.tobuf | decode/encode 512-byte header block |
| 1101–1400 | TarFile.__init__, TarFile.open | mode dispatch, compression detection, _Stream setup |
| 1401–1700 | TarFile.getmember, TarFile.getmembers, TarFile._load | index building, lazy-load |
| 1701–2100 | TarFile.extractall, TarFile.extract, TarFile._extract_member | extraction loop, filter parameter, path safety |
| 2101–2500 | TarFile.add, TarFile.addfile, TarFile.gettarinfo | archive creation |
| 2501–3200 | TarFile._proc_* methods | GNU/POSIX pax extension block processing |
| 3201–3900 | TarFile.close, utilities, open shortcut | finalisation, padding, tarfile.open alias |

Reading

TarFile.open() mode dispatch

TarFile.open() is a class method that inspects the mode string and routes to the right subclass or compression wrapper. The first character selects the access mode (read, write, exclusive-create, append). An optional suffix after the colon names the compression format.

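# CPython: Lib/tarfile.py:1101 TarFile.open
@classmethod
def open(cls, name=None, mode="r", fileobj=None, bufsize=RECORDSIZE, **kwargs):
    ...
    if mode in ("r", "r:*"):
        # Auto-detect: try each registered opener in turn.
        ...
    elif ":" in mode:
        filemode, comptype = mode.split(":", 1)
        filemode = filemode or "r"
        comptype = comptype or "tar"
        # OPEN_METH maps a compression name to taropen, gzopen, bz2open, xzopen, ...
        if comptype in cls.OPEN_METH:
            func = getattr(cls, cls.OPEN_METH[comptype])
        else:
            raise CompressionError("unknown compression type %r" % comptype)
        return func(name, filemode, fileobj, **kwargs)
    elif "|" in mode:
        # Stream modes ("r|gz", "w|xz", ...) wrap the raw file in _Stream.
        filemode, comptype = mode.split("|", 1)
        ...
        fileobj = _Stream(name, filemode, comptype, fileobj, bufsize)
        ...
    elif mode in ("a", "w", "x"):
        return cls.taropen(name, mode, fileobj, **kwargs)
    ...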

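When mode is "r" (without a suffix) or "r:*", open() tries each registered opener in turn, compressed formats first and plain tar last. If an opener raises ReadError or CompressionError, open() seeks fileobj back to the position it saved beforehand and tries the next one; if every opener fails, it raises ReadError.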
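The mode strings can be exercised through the public API. The following is a minimal sketch, not taken from the CPython source: it builds a gzip-compressed archive in memory and reopens it with several modes. The helper names build_gzip_archive and list_names are hypothetical and exist only for this illustration.

```python
import io
import tarfile


def build_gzip_archive() -> bytes:
    """Create a one-member gzip-compressed tar archive in memory."""
    payload = b"hello tar\n"
    buf = io.BytesIO()
    # "w:gz" splits into filemode "w" and compression type "gz".
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_names(data: bytes, mode: str) -> list[str]:
    """Open an in-memory archive with the given mode and return member names."""
    with tarfile.open(fileobj=io.BytesIO(data), mode=mode) as tar:
        return tar.getnames()


archive = build_gzip_archive()
names_auto = list_names(archive, "r")         # compression auto-detected
names_explicit = list_names(archive, "r:gz")  # compression named explicitly
names_stream = list_names(archive, "r|*")     # stream mode, auto-detected

# "r:" forces plain tar, so gzip-compressed bytes are rejected.
try:
    list_names(archive, "r:")
    plain_rejected = False
except tarfile.ReadError:
    plain_rejected = True
```

All three read modes should list the same single member, while forcing the uncompressed reader onto gzip data fails with ReadError.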

TarInfo: POSIX ustar header packing

Each file entry in a tar archive is preceded by a 512-byte header block in POSIX ustar format. TarInfo.frombuf() decodes one such block into Python attributes; TarInfo.tobuf() encodes them back.

Numeric field values are stored as fixed-width octal ASCII strings in the block. TarInfo reads them with nti() (number field to integer) and writes them with itn() (integer to number field); string fields such as the name go through nts() and stn(). Values that do not fit the fixed width fall back to GNU extensions (base-256 numbers, long-name blocks) or PAX extended headers, depending on the archive format.
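The fixed-width octal encoding can be illustrated with a small sketch. The helpers encode_octal_field and decode_octal_field below are hypothetical stand-ins written for this page, not CPython's itn() and nti(); in particular they raise on overflow instead of switching to a GNU or PAX extension.

```python
def encode_octal_field(value: int, width: int) -> bytes:
    """Encode a non-negative int as zero-padded octal ASCII plus a NUL byte."""
    digits = width - 1  # the last byte of the field holds the NUL terminator
    if not 0 <= value < 8 ** digits:
        raise ValueError("value does not fit the field; an extension is needed")
    return format(value, "o").zfill(digits).encode("ascii") + b"\0"


def decode_octal_field(field: bytes) -> int:
    """Decode an octal ASCII field, ignoring NUL and space padding."""
    text = field.decode("ascii").strip(" \0")
    return int(text or "0", 8)


# The size field of a ustar header is 12 bytes wide (offsets 124 to 135).
size_field = encode_octal_field(1234, 12)
size_value = decode_octal_field(size_field)

# 8**11 is the first size that no longer fits in 11 octal digits.
try:
    encode_octal_field(8 ** 11, 12)
    overflow_rejected = False
except ValueError:
    overflow_rejected = True
```

With a 12-byte field the largest representable size is 8**11 - 1 bytes, just under 8 GiB, which is where the extension mechanisms come in.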

# CPython: Lib/tarfile.py:901 TarInfo.frombuf
@classmethod
def frombuf(cls, buf, encoding, errors):
    ...
    tarinfo.name = nts(buf[0:100], encoding, errors)
    tarinfo.mode = nti(buf[100:108])
    tarinfo.uid = nti(buf[108:116])
    tarinfo.gid = nti(buf[116:124])
    tarinfo.size = nti(buf[124:136])
    tarinfo.mtime = nti(buf[136:148])
    ...
    tarinfo.type = buf[156:157]
    tarinfo.linkname = nts(buf[157:257], encoding, errors)
    ...
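The field offsets in the excerpt can be checked against a header produced by the public API. This is a minimal sketch: the member name, size, mode, and mtime are arbitrary illustration values, and the ustar format is chosen explicitly so that a short name yields exactly one header block.

```python
import tarfile

info = tarfile.TarInfo(name="docs/readme.txt")
info.size = 1234
info.mode = 0o644
info.mtime = 1_700_000_000

# Encode the member as a single ustar header block.
header = info.tobuf(format=tarfile.USTAR_FORMAT, encoding="utf-8")

# Slice the raw block at the same offsets frombuf() uses.
raw_name = header[0:100].rstrip(b"\0")
raw_size = header[124:136]
raw_type = header[156:157]

# Decode the block back into a TarInfo; frombuf() also verifies the checksum.
decoded = tarfile.TarInfo.frombuf(header, "utf-8", "surrogateescape")
```

The raw slices show the on-disk representation (NUL-padded name, octal size, one-byte type flag), while decoded carries the same values as Python attributes.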

extractall() filter parameter

CPython 3.12 introduced the filter parameter on extractall() and extract() to address path-traversal and permission-escalation vulnerabilities. Three named policies are provided: "data" (safest), "tar", and "fully_trusted".

# CPython: Lib/tarfile.py:1701 TarFile.extractall
def extractall(self, path=".", members=None, *, numeric_owner=False,
               filter=None):
    ...
    for tarinfo in members:
        ...
        tarinfo = self._check_filter(tarinfo, path, filter)
        if tarinfo is None:
            continue
        self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
                             numeric_owner=numeric_owner)

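The "data" filter strips leading slashes from member names, rejects members whose resolved path or link target falls outside the destination, rejects special files such as devices and pipes, and clears high-permission bits (setuid, setgid, sticky, and group/other write) instead of honouring them. It is the recommended choice for archives from untrusted sources and is the default from Python 3.14 onward. "fully_trusted" skips all checks and matches pre-3.12 behaviour.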

gopy notes

Status: not yet ported.

Planned package path: module/tarfile/.

The port will need TarInfo, TarFile, ExFileObject, and _Stream. Go already provides archive/tar for basic read/write, but the CPython implementation adds GNU and PAX extensions, sparse-file support, and the three-tier extraction filter, which has no direct equivalent in Go's standard library. The filter logic is security-critical and must be ported function by function with citations rather than delegated to archive/tar hooks. The _Stream gzip/bzip2/xz wrapper maps to chained io.Reader/io.Writer decorators using compress/gzip, compress/bzip2, and github.com/ulikunitz/xz. Note that compress/bzip2 implements decompression only, so writing bzip2 streams will need a third-party encoder.