Lib/difflib.py
Source:
cpython 3.14 @ ab2d84fe1023/Lib/difflib.py
difflib computes differences between sequences. SequenceMatcher is the core; it finds matching blocks using a hash-based approach inspired by Ratcliff/Obershelp. The diff output functions (unified_diff, ndiff) are built on top of it.
Map
| Lines | Symbol | Role |
|---|---|---|
| 1-60 | __all__ | Public API |
| 61-700 | SequenceMatcher | Matching blocks, ratio, opcodes |
| 701-800 | Differ | Human-readable line-by-line diff with intraline marking |
| 801-950 | unified_diff, context_diff | Standard diff output formats |
| 951-1050 | ndiff | Differ-based output |
| 1051-1150 | get_close_matches | Fuzzy string matching |
| 1151-1250 | HtmlDiff | Side-by-side HTML diff |
Reading
SequenceMatcher
SequenceMatcher(a, b) computes similarity between sequences a and b. The key methods are:
- `get_matching_blocks()`: returns a list of `Match(a, b, size)` triples for non-overlapping matching regions, plus a sentinel `Match(len(a), len(b), 0)`.
- `get_opcodes()`: returns a list of `(tag, i1, i2, j1, j2)` tuples where `tag` is one of `'equal'`, `'replace'`, `'delete'`, `'insert'`.
- `ratio()`: `2.0 * M / T`, where `M` is the number of matching elements and `T` is the total number of elements in both sequences.
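A quick sketch of these three methods on a small pair of strings (my example, not from the source):

```python
from difflib import SequenceMatcher

# "abxcd" vs "abcd": two matching blocks ("ab" and "cd") plus the sentinel.
sm = SequenceMatcher(None, "abxcd", "abcd")
print(sm.get_matching_blocks())
# [Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)]
print(sm.get_opcodes())
# [('equal', 0, 2, 0, 2), ('delete', 2, 3, 2, 2), ('equal', 3, 5, 2, 4)]
print(sm.ratio())  # 2.0 * 4 / 9, about 0.889
```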
The algorithm builds a hash table (b2j) mapping each element of b to the list of indices where it appears, so find_longest_match only visits positions where a[i] actually occurs in b. On typical inputs this makes matching-block detection close to linear; the worst case (many repeated elements) is still O(len(a) * len(b)).
# CPython: Lib/difflib.py:191 SequenceMatcher.set_seq2
def set_seq2(self, b):
    self.b = b
    self.matching_blocks = self.opcodes = None
    self.fullbcount = None
    self.__chain_b()

def __chain_b(self):
    b = self.b
    self.b2j = b2j = {}
    for i, elt in enumerate(b):
        indices = b2j.setdefault(elt, [])
        indices.append(i)
    # (junk and autojunk filtering of b2j elided)
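To see the index table this builds, inspect b2j directly (it is an internal attribute, poked at here only for illustration):

```python
from difflib import SequenceMatcher

# Each element of b maps to the sorted list of positions where it occurs.
sm = SequenceMatcher(None, "", "abab")
print(sm.b2j)  # {'a': [0, 2], 'b': [1, 3]}
```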
find_longest_match
The core primitive finds the longest matching block in a[alo:ahi] and b[blo:bhi]. It scans a left-to-right and uses b2j to find where each element appears in b, tracking the longest run.
# CPython: Lib/difflib.py:298 SequenceMatcher.find_longest_match
def find_longest_match(self, alo=0, ahi=None, blo=0, bhi=None):
    ...
    best_i = alo
    best_j = blo
    best_size = 0
    j2len = {}
    for i in range(alo, ahi):
        newj2len = {}
        for j in b2j.get(a[i], []):
            if j < blo: continue
            if j >= bhi: break
            k = newj2len[j] = j2len.get(j-1, 0) + 1
            if k > best_size:
                best_i, best_j, best_size = i-k+1, j-k+1, k
        j2len = newj2len
    ...
get_matching_blocks applies find_longest_match repeatedly, divide-and-conquer style: it keeps an explicit queue of (alo, ahi, blo, bhi) regions and re-splits on the pieces before and after each match (the CPython implementation is iterative, not recursive).
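Calling the primitive directly (this is the example from the official difflib docs):

```python
from difflib import SequenceMatcher

sm = SequenceMatcher(None, " abcd", "abcd abcd")
m = sm.find_longest_match(0, 5, 0, 9)
print(m)  # Match(a=0, b=4, size=5): " abcd" matches b[4:9]
```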
unified_diff
# CPython: Lib/difflib.py:1073 unified_diff
def unified_diff(a, b, fromfile='', tofile='', fromfiledate='',
                 tofiledate='', n=3, lineterm='\n'):
    ...
    for group in SequenceMatcher(None, a, b).get_grouped_opcodes(n):
        ...
Outputs the standard ---/+++/@@ format. The n parameter controls context lines.
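A minimal usage sketch; unified_diff takes lists of lines (normally keeping their newlines) and returns a generator:

```python
import difflib

a = ["one\n", "two\n", "three\n"]
b = ["one\n", "2\n", "three\n"]
# Prints the ---/+++ header, an @@ hunk header, and context/-/+ lines.
print(''.join(difflib.unified_diff(a, b, fromfile='a.txt', tofile='b.txt')), end='')
```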
get_close_matches
# CPython: Lib/difflib.py:738 get_close_matches
def get_close_matches(word, possibilities, n=3, cutoff=0.6):
    ...
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for x in possibilities:
        s.set_seq1(x)
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), x))
    result = _nlargest(n, result)    # _nlargest is heapq.nlargest
    return [x for score, x in result]
Uses three progressively more expensive ratio checks (real_quick_ratio, quick_ratio, ratio) to filter candidates efficiently.
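End-to-end, this is the example from the difflib docs; results come back best-match first:

```python
import difflib

# Misspelled "apple": the two closest candidates clear the 0.6 cutoff.
print(difflib.get_close_matches("appel", ["ape", "apple", "peach", "puppy"]))
# ['apple', 'ape']
```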
gopy notes
Status: not yet ported. The algorithm is pure Python and translates directly to Go. The b2j hash table maps to map[interface{}][]int. get_close_matches is the primary entry point for suggestion features in the gopy REPL.