Compare two line-based sets and compute the four classic similarity coefficients — Jaccard, Sørensen–Dice, Overlap (Szymkiewicz–Simpson), and Tversky — plus cosine on the multiset (counted) representation. Useful for fuzzy duplicate detection, deciding whether two tag lists are "close enough," tuning thresholds for retrieval, and scoring how much one keyword list covers another.
| Coefficient | Formula | Range | Good for |
|---|---|---|---|
| Jaccard J(A,B) | \|A∩B\| / \|A∪B\| | 0 → 1 | General similarity; every item found in only one set lowers the score. The default if you don't know which to pick. |
| Sørensen–Dice D(A,B) | 2·\|A∩B\| / (\|A\| + \|B\|) | 0 → 1 | Like Jaccard but more generous with small intersections. D = 2J / (1+J). |
| Overlap O(A,B) | \|A∩B\| / min(\|A\|,\|B\|) | 0 → 1 | "How much of the smaller set is in the bigger one?" Reaches 1 when A ⊆ B or B ⊆ A. |
| Tversky T(A,B) | \|A∩B\| / (\|A∩B\| + α·\|A\B\| + β·\|B\A\|) | 0 → 1 | Asymmetric: weights false positives (β on B-only) and false negatives (α on A-only) differently. α=β=½ ⇒ Dice; α=β=1 ⇒ Jaccard. |
| Cosine (multiset) | Σ min(a,b) / √(\|A\|·\|B\|) | 0 → 1 | Treats inputs as bag-of-words. Useful when items repeat and frequency matters. |
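
The four set coefficients in the table can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: the function names are hypothetical, and returning 0.0 for empty inputs is an assumed convention. The example also checks the identities quoted in the table (Tversky reducing to Dice and Jaccard, and D = 2J / (1+J)).

```python
def jaccard(a: set, b: set) -> float:
    """|A∩B| / |A∪B|; 0.0 for two empty sets (assumed convention)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """2·|A∩B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a: set, b: set) -> float:
    """|A∩B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def tversky(a: set, b: set, alpha: float = 0.5, beta: float = 0.5) -> float:
    """|A∩B| / (|A∩B| + α·|A\\B| + β·|B\\A|)."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


a = {"red", "green", "blue", "cyan"}
b = {"green", "blue", "black"}

j = jaccard(a, b)   # 2 shared items out of 5 distinct items
d = dice(a, b)      # 2·2 / (4 + 3)
o = overlap(a, b)   # 2 / min(4, 3)

# Identities from the table, checked numerically on this example.
assert abs(tversky(a, b, 0.5, 0.5) - d) < 1e-12
assert abs(tversky(a, b, 1.0, 1.0) - j) < 1e-12
assert abs(d - 2 * j / (1 + j)) < 1e-12
```

Raising `alpha` relative to `beta` makes items that appear only in A cost more than items that appear only in B, which is the asymmetry the Tversky row describes.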
When Dedupe within each side is on, the four classic set coefficients are computed on the de-duplicated sets — that's the textbook definition. Cosine still uses the original (with-duplicate) counts so frequency information isn't lost.
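
As a rough sketch of the dedupe-on behavior, the snippet below computes Jaccard on the de-duplicated sets while cosine keeps the raw counts, using the formula from the table. The helper names (`read_items`, `cosine_multiset`) and the parsing rule (one item per line, blank lines skipped, whitespace trimmed) are assumptions for illustration, not the tool's documented behavior.

```python
from collections import Counter
from math import sqrt


def read_items(text: str) -> list[str]:
    """Split text into trimmed, non-empty lines (assumed parsing rule)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def cosine_multiset(items_a: list[str], items_b: list[str]) -> float:
    """Σ min(a, b) / √(|A|·|B|), where |A| and |B| count duplicates."""
    counts_a, counts_b = Counter(items_a), Counter(items_b)
    shared = sum((counts_a & counts_b).values())  # Counter & keeps min counts
    size_a, size_b = len(items_a), len(items_b)
    if size_a == 0 or size_b == 0:
        return 0.0  # assumed convention for empty input
    return shared / sqrt(size_a * size_b)


list_a = read_items("apple\napple\nbanana\ncherry\n")
list_b = read_items("apple\nbanana\nbanana\nbanana\n")

# Dedupe on: the set coefficients see each item once per side.
set_a, set_b = set(list_a), set(list_b)
jaccard_dedup = len(set_a & set_b) / len(set_a | set_b)

# Cosine still sees the duplicates.
cosine = cosine_multiset(list_a, list_b)
```

Here the de-duplicated Jaccard is 2/3 (two shared items out of three distinct), while the count-based cosine is 2 / √(4·4) = 0.5, because the repeated `apple` and `banana` lines only match once each.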
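When dedupe is off, Jaccard and Dice are computed on multisets: the intersection takes the element-wise min(count_A, count_B) and the union takes the max. For Jaccard this is the weighted (generalized) Jaccard, also known as the Ružička similarity, which is the quantity that weighted MinHash variants estimate on shingle counts.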
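
The multiset variants can be sketched with `collections.Counter`, whose `&` and `|` operators take element-wise min and max of counts. The function names are hypothetical, and the Dice denominator (total item counts including duplicates) is an assumption, since the text above only specifies how the intersection and union are formed.

```python
from collections import Counter


def multiset_jaccard(items_a: list[str], items_b: list[str]) -> float:
    """Σ min(count_A, count_B) / Σ max(count_A, count_B)."""
    counts_a, counts_b = Counter(items_a), Counter(items_b)
    inter = sum((counts_a & counts_b).values())  # element-wise min
    union = sum((counts_a | counts_b).values())  # element-wise max
    return inter / union if union else 0.0


def multiset_dice(items_a: list[str], items_b: list[str]) -> float:
    """2·Σ min(count_A, count_B) / (|A| + |B|), sizes counted with duplicates."""
    counts_a, counts_b = Counter(items_a), Counter(items_b)
    inter = sum((counts_a & counts_b).values())
    total = len(items_a) + len(items_b)  # assumed denominator
    return 2 * inter / total if total else 0.0


items_a = ["x", "x", "y"]
items_b = ["x", "y", "y", "z"]

mj = multiset_jaccard(items_a, items_b)  # min sum 2, max sum 5
md = multiset_dice(items_a, items_b)     # 2·2 / (3 + 4)

# For comparison: the de-duplicated set Jaccard on the same input.
sj = len(set(items_a) & set(items_b)) / len(set(items_a) | set(items_b))
```

On this input the multiset Jaccard is 0.4 while the de-duplicated Jaccard is 2/3: the extra `x` and `y` copies enlarge the union without enlarging the intersection.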