Skip to main content

Lib/difflib.py

cpython 3.14 @ ab2d84fe1023/Lib/difflib.py

difflib is a pure-Python module with no C extension. It provides flexible sequence comparison through SequenceMatcher, a line-oriented differencing class Differ, and several output formatters: unified_diff, context_diff, ndiff, and HtmlDiff.

SequenceMatcher works on any sequences with hashable elements. It uses a heuristic junk filter (autojunk) to ignore elements that appear in more than 1% of lines and are present in at least two lines, which prevents common boilerplate from anchoring the diff at the wrong positions. The matching algorithm is a recursive divide-and-conquer approach built on find_longest_match, which is O(n) per call using a hash-indexed occurrence map.

Differ wraps SequenceMatcher to produce a delta format (lines prefixed with ' ', '- ', '+ ', or '? ') and handles intra-line differences through _fancy_replace, which calls SequenceMatcher again on the character level.

Map

LinesSymbolRolegopy
1-80module docstring, __all__, importsSets up the public API list and imports heapq, re.(pending)
80-200SequenceMatcher.__init__, set_seqs, set_seq1, set_seq2Accepts two sequences plus an optional isjunk predicate; builds the element-to-indices map b2j on set_seq2.(pending)
200-400_chain_b, find_longest_matchConstructs the occurrence map with junk and popular-item filtering; finds the longest common substring in a rectangle.(pending)
400-600get_matching_blocks, get_opcodesRecursive matching block computation; converts to opcode 5-tuples (tag, i1, i2, j1, j2).(pending)
600-800get_grouped_opcodes, ratio, quick_ratio, real_quick_ratioContext-grouped opcodes for diff formatters; three ratio approximations in O(1)/O(n) time.(pending)
800-1000Differ.__init__, compare, _dump, _plain_replaceTop-level driver; emits '+ '/'- ' lines for replacements without intra-line detail.(pending)
1000-1200_fancy_replace, _qformatCharacter-level diff for replaced lines; the '? ' hint line showing changed positions.(pending)
1200-1500unified_diff, context_diffStandard diff output generators; yield header and hunk lines for sequence pairs.(pending)
1500-1800ndiff, restore, HtmlDiffndiff is Differ.compare with a nicer header; restore extracts one side; HtmlDiff generates a two-column HTML table.(pending)

Reading

find_longest_match algorithm (lines 200 to 400)

cpython 3.14 @ ab2d84fe1023/Lib/difflib.py#L200-400

def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None):
a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.bjunk.__contains__
besti, bestj, bestsize = alo, blo, 0
j2len = {}
for i in range(alo, ahi):
newj2len = {}
for j in b2j.get(a[i], []):
if j < blo:
continue
if j >= bhi:
break
k = newj2len[j] = j2len.get(j - 1, 0) + 1
if k > bestsize:
besti, bestj, bestsize = i - k + 1, j - k + 1, k
j2len = newj2len
# Extend the match at the edges if possible.
while (besti > alo and bestj > blo
and not isbjunk(b[bestj - 1])
and a[besti - 1] == b[bestj - 1]):
besti -= 1; bestj -= 1; bestsize += 1
while (besti + bestsize < ahi and bestj + bestsize < bhi
and not isbjunk(b[bestj + bestsize])
and a[besti + bestsize] == b[bestj + bestsize]):
bestsize += 1
# Now, the same for junk.
while (besti > alo and bestj > blo
and isbjunk(b[bestj - 1])
and a[besti - 1] == b[bestj - 1]):
besti -= 1; bestj -= 1; bestsize += 1
...
return Match(besti, bestj, bestsize)

b2j maps each element of sequence b to a sorted list of positions where it occurs. For each position i in the a slice, the inner loop walks only the positions in b where a[i] occurs, which keeps the loop proportional to the number of matching pairs rather than the product of the two slice lengths.

j2len is a dictionary from j (b-position) to the length of the longest common run ending at (i, j). On each outer iteration j2len from the previous row is used to extend runs diagonally: if a[i] == b[j] then the run ending at (i, j) is one longer than the run ending at (i-1, j-1). The dictionary is rebuilt from scratch each row rather than updated in place, which avoids stale entries.

After the main loop, the match is extended left and right by elements that compare equal but were not in b2j (junk items that are not in bjunk are handled in a second extension pass).

get_matching_blocks recursion (lines 400 to 600)

cpython 3.14 @ ab2d84fe1023/Lib/difflib.py#L400-600

def get_matching_blocks(self):
if self.matching_blocks is not None:
return self.matching_blocks
la, lb = len(self.a), len(self.b)
queue = [(0, la, 0, lb)]
matching_blocks = []
while queue:
alo, ahi, blo, bhi = queue.pop()
i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
if k:
matching_blocks.append(x)
if alo < i and blo < j:
queue.append((alo, i, blo, j))
if i + k < ahi and j + k < bhi:
queue.append((i + k, ahi, j + k, bhi))
matching_blocks.sort()
matching_blocks.append(Match(la, lb, 0)) # sentinel
self.matching_blocks = matching_blocks
return matching_blocks

The recursion is simulated with an explicit stack (queue). For each (alo, ahi, blo, bhi) rectangle, find_longest_match finds the best block inside it. If the block is non-empty it is recorded, and the two sub-rectangles on either side of the block are pushed onto the stack. This gives the same blocks as a recursive descent but without Python stack depth concerns.

Results are sorted by (i, j, k) and a zero-length sentinel Match(la, lb, 0) is appended so that loop-over-blocks code can always reference the next block without bounds checks.

unified_diff format (lines 1200 to 1500)

cpython 3.14 @ ab2d84fe1023/Lib/difflib.py#L1200-1500

def unified_diff(a, b, fromfile='', tofile='',
fromfiledate='', tofiledate='',
n=3, lineterm='\n'):
started = False
for group in SequenceMatcher(None, a, b).get_grouped_opcodes(n):
if not started:
started = True
fromdate = '\t' + fromfiledate if fromfiledate else ''
todate = '\t' + tofiledate if tofiledate else ''
yield '--- {}{}{}'.format(fromfile, fromdate, lineterm)
yield '+++ {}{}{}'.format(tofile, todate, lineterm)
first, last = group[0], group[-1]
file1_range = _format_range_unified(first[1], last[2])
file2_range = _format_range_unified(first[3], last[4])
yield '@@ -{} +{} @@{}'.format(file1_range, file2_range, lineterm)
for tag, i1, i2, j1, j2 in group:
if tag == 'equal':
for line in a[i1:i2]:
yield ' ' + line
if tag in {'replace', 'delete'}:
for line in a[i1:i2]:
yield '-' + line
if tag in {'replace', 'insert'}:
for line in b[j1:j2]:
yield '+' + line

unified_diff is a generator. The header lines (---/+++) are deferred until the first hunk, so sequences with no differences produce no output at all.

get_grouped_opcodes(n) yields lists of opcodes grouped into hunks with n lines of context on each side. Each group starts and ends with an equal tag. The @@ -f,l +f,l @@ hunk header is built from the first and last opcode in the group using _format_range_unified, which omits the ,1 suffix when the hunk covers exactly one line (matching the standard unified diff format).

gopy mirror

difflib is not yet ported. The module has no C extension and no platform dependencies, making it a straightforward port. The main complexity is SequenceMatcher's b2j hash map and the j2len row-by-row DP table. The HtmlDiff class depends on string.Template and textwrap, both of which would need to be available first.