Set Similarity Calculator

Compare two line-based sets and compute the four classic similarity coefficients — Jaccard, Sørensen–Dice, Overlap (Szymkiewicz–Simpson), and Tversky — plus cosine on the multiset (counted) representation. Useful for fuzzy duplicate detection, deciding whether two tag lists are "close enough," tuning thresholds for retrieval, and scoring how much one keyword list covers another.

Reference

What each coefficient measures

Coefficient	Formula	Range	Good for
Jaccard `J(A,B)`	`|A∩B| / |A∪B|`	0 → 1	General similarity; penalizes union growth. The default if you don't know which to pick.
Sørensen–Dice `D(A,B)`	`2·|A∩B| / (|A| + |B|)`	0 → 1	Like Jaccard but more generous with small intersections. `D = 2J / (1+J)`.
Overlap `O(A,B)`	`|A∩B| / min(|A|,|B|)`	0 → 1	"How much of the smaller set is in the bigger one?" Reaches 1 when A ⊆ B or B ⊆ A.
Tversky `T(A,B)`	`|A∩B| / (|A∩B| + α·|A\B| + β·|B\A|)`	0 → 1	Asymmetric: weights false positives (β on B-only) and false negatives (α on A-only) differently. α=β=½ ⇒ Dice; α=β=1 ⇒ Jaccard.
Cosine (multiset)	`Σ min(a,b) / √(|A|·|B|)`	0 → 1	Treats inputs as bag-of-words. Useful when items repeat and frequency matters.
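The four set formulas in the table can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption, since the page does not say how empty inputs are handled.

```python
def _ratio(numerator: float, denominator: float) -> float:
    # Assumption: an undefined ratio (empty inputs) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def jaccard(a: set, b: set) -> float:
    """|A∩B| / |A∪B|"""
    return _ratio(len(a & b), len(a | b))


def dice(a: set, b: set) -> float:
    """2·|A∩B| / (|A| + |B|)"""
    return _ratio(2 * len(a & b), len(a) + len(b))


def overlap(a: set, b: set) -> float:
    """|A∩B| / min(|A|, |B|)"""
    return _ratio(len(a & b), min(len(a), len(b)))


def tversky(a: set, b: set, alpha: float = 0.5, beta: float = 0.5) -> float:
    """|A∩B| / (|A∩B| + α·|A\\B| + β·|B\\A|)"""
    shared = len(a & b)
    return _ratio(shared, shared + alpha * len(a - b) + beta * len(b - a))


if __name__ == "__main__":
    a = {"red", "green", "blue", "cyan"}
    b = {"green", "blue", "black"}
    j = jaccard(a, b)
    print("Jaccard:", j)
    print("Dice:", dice(a, b), "vs 2J/(1+J):", 2 * j / (1 + j))
    print("Overlap:", overlap(a, b))
    print("Tversky(1, 1):", tversky(a, b, 1.0, 1.0))
```

With these two sets the intersection has 2 items and the union 5, so Jaccard is 2/5 and Dice is 4/7, which agrees with `D = 2J / (1+J)`. Tversky with α=β=½ reproduces Dice and with α=β=1 reproduces Jaccard, as the table states.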

Choosing a threshold

Dedupe / near-duplicate detection on short strings (titles, emails): Jaccard ≥ 0.7 or Dice ≥ 0.8 is a common starting point — tune on a held-out set.
Retrieval / search: Overlap is forgiving when the query is much shorter than the document, because it normalizes by the smaller side only.
Recommendation (e.g. tag-based item similarity): Jaccard handles long-tail tags well. The union in its denominator grows with an item's tag count, so one item with a very large tag list does not look similar to everything.
Asymmetric coverage ("how much of my keyword list does this doc cover?"): Tversky with A as the keyword list, α high, and β low. With α=1 and β=0 it reduces to `|A∩B| / |A|`.
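Two of the cases above can be sketched as follows. This is a hedged example: the helper names and the whitespace tokenizer are assumptions made for illustration, and the 0.7 cutoff is only the starting point suggested above, not a tuned value.

```python
def tokens(text: str) -> set:
    # Assumption: lowercase, whitespace-separated tokens stand in for the items.
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def is_near_duplicate(text_a: str, text_b: str, threshold: float = 0.7) -> bool:
    """Flag two short strings as near-duplicates when Jaccard >= threshold."""
    return jaccard(tokens(text_a), tokens(text_b)) >= threshold


def coverage(keywords: set, doc_terms: set) -> float:
    """Tversky with alpha=1, beta=0: the share of the keyword list found in the doc."""
    shared = len(keywords & doc_terms)
    missing = len(keywords - doc_terms)  # A-only items, weighted by alpha = 1
    return shared / (shared + missing) if keywords else 0.0


if __name__ == "__main__":
    print(is_near_duplicate("Intro to Set Similarity Measures",
                            "Intro to set similarity measures (draft)"))
    print(coverage({"jaccard", "dice", "tversky"},
                   tokens("Jaccard and Dice compared on tag lists")))
```

In the first call the two titles share 5 of 6 distinct tokens, so the pair clears the 0.7 cutoff. In the second, the document contains 2 of the 3 keywords, and its extra terms do not lower the score because β is 0.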

About the multiset modes

When "Dedupe within each side" is on, the four classic set coefficients are computed on the de-duplicated sets, which is the textbook definition. Cosine still uses the original (with-duplicate) counts so frequency information isn't lost.
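A small sketch of that split, assuming one item per line. The function names are hypothetical, and reading `|A|` and `|B|` in the cosine formula as total item counts including duplicates is an assumption about the reference table, not something the page states.

```python
from collections import Counter
from math import sqrt


def read_lines(text: str) -> list:
    # Assumption: one item per line, whitespace trimmed, blank lines skipped.
    return [line.strip() for line in text.splitlines() if line.strip()]


def set_jaccard(items_a: list, items_b: list) -> float:
    """Jaccard on the de-duplicated sets."""
    a, b = set(items_a), set(items_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine_counted(items_a: list, items_b: list) -> float:
    """Σ min(a, b) / √(|A|·|B|), computed on the original counts."""
    counts_a, counts_b = Counter(items_a), Counter(items_b)
    shared = sum((counts_a & counts_b).values())  # element-wise min of counts
    denominator = sqrt(sum(counts_a.values()) * sum(counts_b.values()))
    return shared / denominator if denominator else 0.0


if __name__ == "__main__":
    side_a = read_lines("apple\napple\nbanana\ncherry")
    side_b = read_lines("apple\napple\nbanana")
    print("Jaccard (deduped):", set_jaccard(side_a, side_b))
    print("Cosine (counted):", cosine_counted(side_a, side_b))
```

Here the repeated `apple` line is ignored by Jaccard (2 shared items out of 3 distinct) but counted by cosine (3 shared occurrences over √(4·3)).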

When dedupe is off, Jaccard and Dice are computed on multisets: intersection uses element-wise min(count_A, count_B), union uses max. This is the weighted generalization of Jaccard, the quantity that weighted MinHash variants estimate; plain MinHash on shingles estimates the set version.
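The min/max rule maps directly onto Python's `collections.Counter`, whose `&` and `|` operators take the element-wise minimum and maximum of counts. The sketch below is illustrative only: the function names are hypothetical, and using the total item counts as `|A| + |B|` in the Dice denominator is an assumption, since the text only specifies the intersection and union.

```python
from collections import Counter


def multiset_jaccard(items_a: list, items_b: list) -> float:
    counts_a, counts_b = Counter(items_a), Counter(items_b)
    intersection = sum((counts_a & counts_b).values())  # Σ min(count_A, count_B)
    union = sum((counts_a | counts_b).values())         # Σ max(count_A, count_B)
    return intersection / union if union else 0.0


def multiset_dice(items_a: list, items_b: list) -> float:
    counts_a, counts_b = Counter(items_a), Counter(items_b)
    intersection = sum((counts_a & counts_b).values())
    # Assumption: |A| and |B| are total item counts, duplicates included.
    total = sum(counts_a.values()) + sum(counts_b.values())
    return 2 * intersection / total if total else 0.0


if __name__ == "__main__":
    side_a = ["x", "x", "y"]
    side_b = ["x", "y", "y", "z"]
    print("Multiset Jaccard:", multiset_jaccard(side_a, side_b))
    print("Multiset Dice:", multiset_dice(side_a, side_b))
```

For these inputs the min counts sum to 2 and the max counts sum to 5, so Jaccard is 2/5 and Dice is 4/7. The relation `D = 2J / (1+J)` still holds under this reading, because the min and max sums together equal the combined item count.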
