tabibu-dupes — duplicate-file funnel
Finds true (byte-identical) duplicate sets via a three-stage funnel that runs
the cheapest filter first and only pays for full hashing on files that are
still plausibly duplicates. Confirmed groups stream to the UI via on_group
as they are finalised and are also returned as a sorted list.
The funnel
flowchart TD
A["Candidates: ~50,000 files >= min_size (default 4 KiB)"]
B["Stage 1: group by exact size (one stat each)\ndrop singleton groups"]
C["~6,000 files share a size with another file"]
D["Stage 2: blake3 of first + last 16 KiB\n(< 32 KiB: whole content once) — rayon parallel\ndrop singleton subgroups"]
E["~1,500 files survive sampling"]
F["Stage 3: full streaming blake3, 1 MiB buffer\nrayon parallel — drop singleton subgroups"]
G["~600 files in TRUE duplicate groups\npaths sorted newest-mtime first ([0] = keeper)"]
A --> B --> C --> D --> E --> F --> GWhy head + tail sampling
Most same-size non-duplicates (logs, exports, media, VM images) diverge near the start or end: headers, timestamps, trailing indexes. Reading 32 KiB instead of whole files turns stage 2 into two seeks + two small reads per file, eliminating almost all false candidates before the expensive stage 3. Sampling can collide (files differing only in the middle), which is exactly what stage 3 exists to catch — correctness never depends on the sample.
Complexity
- Stage 1:
O(n)stats, zero file reads. - Stage 2:
O(s)wheres= files sharing a size; <= 32 KiB read each. - Stage 3:
O(total bytes of survivors)— the only full-content pass. - Hash subgrouping is hash-map based; final sort is
O(g log g)on groups.
Safety & robustness
- Files that vanish or error mid-hash are dropped silently; one bad file never fails the run.
to_cleanup_itemsspares the newest copy and emits the rest asCategory::DuplicateatSafetyTier::Review— never auto-selected, never hard-deleted; reason text names the kept copy.- Cancellation is checked between stages and inside every rayon loop.
Benchmark
benches/dupes.rs (criterion, harness = false): 2,000 x 8 KiB files with
~30% duplicates (300 pairs), fixture built once in setup; benches
find_duplicates end-to-end. Target: well under 100 ms warm on Apple
silicon. Run with cargo bench -p tabibu-dupes.