Lib/difflib.py
cpython 3.14 @ ab2d84fe1023/Lib/difflib.py
difflib is a pure-Python module with no C extension. It provides
flexible sequence comparison through SequenceMatcher, a line-oriented
differencing class Differ, and several output formatters:
unified_diff, context_diff, ndiff, and HtmlDiff.
SequenceMatcher works on any sequences with hashable elements. It uses
a heuristic junk filter (autojunk) to ignore elements that appear in
more than 1% of lines and are present in at least two lines, which
prevents common boilerplate from anchoring the diff at the wrong
positions. The matching algorithm is a recursive divide-and-conquer
approach built on find_longest_match, which is O(n) per call using a
hash-indexed occurrence map.
Differ wraps SequenceMatcher to produce a delta format (lines
prefixed with ' ', '- ', '+ ', or '? ') and handles intra-line
differences through _fancy_replace, which calls SequenceMatcher again
on the character level.
Map
| Lines | Symbol | Role | gopy |
|---|---|---|---|
| 1-80 | module docstring, __all__, imports | Sets up the public API list and imports heapq, re. | (pending) |
| 80-200 | SequenceMatcher.__init__, set_seqs, set_seq1, set_seq2 | Accepts two sequences plus an optional isjunk predicate; builds the element-to-indices map b2j on set_seq2. | (pending) |
| 200-400 | _chain_b, find_longest_match | Constructs the occurrence map with junk and popular-item filtering; finds the longest common substring in a rectangle. | (pending) |
| 400-600 | get_matching_blocks, get_opcodes | Recursive matching block computation; converts to opcode 5-tuples (tag, i1, i2, j1, j2). | (pending) |
| 600-800 | get_grouped_opcodes, ratio, quick_ratio, real_quick_ratio | Context-grouped opcodes for diff formatters; three ratio approximations in O(1)/O(n) time. | (pending) |
| 800-1000 | Differ.__init__, compare, _dump, _plain_replace | Top-level driver; emits '+ '/'- ' lines for replacements without intra-line detail. | (pending) |
| 1000-1200 | _fancy_replace, _qformat | Character-level diff for replaced lines; the '? ' hint line showing changed positions. | (pending) |
| 1200-1500 | unified_diff, context_diff | Standard diff output generators; yield header and hunk lines for sequence pairs. | (pending) |
| 1500-1800 | ndiff, restore, HtmlDiff | ndiff is Differ.compare with a nicer header; restore extracts one side; HtmlDiff generates a two-column HTML table. | (pending) |
Reading
find_longest_match algorithm (lines 200 to 400)
cpython 3.14 @ ab2d84fe1023/Lib/difflib.py#L200-400
def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None):
a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.bjunk.__contains__
besti, bestj, bestsize = alo, blo, 0
j2len = {}
for i in range(alo, ahi):
newj2len = {}
for j in b2j.get(a[i], []):
if j < blo:
continue
if j >= bhi:
break
k = newj2len[j] = j2len.get(j - 1, 0) + 1
if k > bestsize:
besti, bestj, bestsize = i - k + 1, j - k + 1, k
j2len = newj2len
# Extend the match at the edges if possible.
while (besti > alo and bestj > blo
and not isbjunk(b[bestj - 1])
and a[besti - 1] == b[bestj - 1]):
besti -= 1; bestj -= 1; bestsize += 1
while (besti + bestsize < ahi and bestj + bestsize < bhi
and not isbjunk(b[bestj + bestsize])
and a[besti + bestsize] == b[bestj + bestsize]):
bestsize += 1
# Now, the same for junk.
while (besti > alo and bestj > blo
and isbjunk(b[bestj - 1])
and a[besti - 1] == b[bestj - 1]):
besti -= 1; bestj -= 1; bestsize += 1
...
return Match(besti, bestj, bestsize)
b2j maps each element of sequence b to a sorted list of positions
where it occurs. For each position i in the a slice, the inner loop
walks only the positions in b where a[i] occurs, which keeps the
loop proportional to the number of matching pairs rather than the product
of the two slice lengths.
j2len is a dictionary from j (b-position) to the length of the
longest common run ending at (i, j). On each outer iteration j2len
from the previous row is used to extend runs diagonally: if a[i] == b[j] then the run ending at (i, j) is one longer than the run ending
at (i-1, j-1). The dictionary is rebuilt from scratch each row rather
than updated in place, which avoids stale entries.
After the main loop, the match is extended left and right by elements
that compare equal but were not in b2j (junk items that are not
in bjunk are handled in a second extension pass).
get_matching_blocks recursion (lines 400 to 600)
cpython 3.14 @ ab2d84fe1023/Lib/difflib.py#L400-600
def get_matching_blocks(self):
if self.matching_blocks is not None:
return self.matching_blocks
la, lb = len(self.a), len(self.b)
queue = [(0, la, 0, lb)]
matching_blocks = []
while queue:
alo, ahi, blo, bhi = queue.pop()
i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
if k:
matching_blocks.append(x)
if alo < i and blo < j:
queue.append((alo, i, blo, j))
if i + k < ahi and j + k < bhi:
queue.append((i + k, ahi, j + k, bhi))
matching_blocks.sort()
matching_blocks.append(Match(la, lb, 0)) # sentinel
self.matching_blocks = matching_blocks
return matching_blocks
The recursion is simulated with an explicit stack (queue). For each
(alo, ahi, blo, bhi) rectangle, find_longest_match finds the best
block inside it. If the block is non-empty it is recorded, and the two
sub-rectangles on either side of the block are pushed onto the stack.
This gives the same blocks as a recursive descent but without Python
stack depth concerns.
Results are sorted by (i, j, k) and a zero-length sentinel
Match(la, lb, 0) is appended so that loop-over-blocks code can always
reference the next block without bounds checks.
unified_diff format (lines 1200 to 1500)
cpython 3.14 @ ab2d84fe1023/Lib/difflib.py#L1200-1500
def unified_diff(a, b, fromfile='', tofile='',
fromfiledate='', tofiledate='',
n=3, lineterm='\n'):
started = False
for group in SequenceMatcher(None, a, b).get_grouped_opcodes(n):
if not started:
started = True
fromdate = '\t' + fromfiledate if fromfiledate else ''
todate = '\t' + tofiledate if tofiledate else ''
yield '--- {}{}{}'.format(fromfile, fromdate, lineterm)
yield '+++ {}{}{}'.format(tofile, todate, lineterm)
first, last = group[0], group[-1]
file1_range = _format_range_unified(first[1], last[2])
file2_range = _format_range_unified(first[3], last[4])
yield '@@ -{} +{} @@{}'.format(file1_range, file2_range, lineterm)
for tag, i1, i2, j1, j2 in group:
if tag == 'equal':
for line in a[i1:i2]:
yield ' ' + line
if tag in {'replace', 'delete'}:
for line in a[i1:i2]:
yield '-' + line
if tag in {'replace', 'insert'}:
for line in b[j1:j2]:
yield '+' + line
unified_diff is a generator. The header lines (---/+++) are
deferred until the first hunk, so sequences with no differences produce
no output at all.
get_grouped_opcodes(n) yields lists of opcodes grouped into hunks with
n lines of context on each side. Each group starts and ends with an
equal tag. The @@ -f,l +f,l @@ hunk header is built from the first
and last opcode in the group using _format_range_unified, which omits
the ,1 suffix when the hunk covers exactly one line (matching the
standard unified diff format).
gopy mirror
difflib is not yet ported. The module has no C extension and no
platform dependencies, making it a straightforward port. The main
complexity is SequenceMatcher's b2j hash map and the j2len
row-by-row DP table. The HtmlDiff class depends on string.Template
and textwrap, both of which would need to be available first.