Claude-skill-registry-data mithril-dedup-agent
Build mithril-dedup for ML dataset deduplication. Use when implementing MinHash, LSH, clustering, or document I/O.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry-data
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/mithril-dedup-agent" ~/.claude/skills/majiayu000-claude-skill-registry-data-mithril-dedup-agent && rm -rf "$T"
manifest: data/mithril-dedup-agent/SKILL.md
Mithril Dedup Agent
Build data deduplication for ML training datasets at 100K+ docs/sec.
Status
Read crates/mithril-dedup/STATUS.md for current progress.
Reference Documentation
- dedup/SPEC.md: Full product specification
- RESEARCH.md: Papers (Google 2021 dedup paper, LSHBloom, text-dedup)
Module Responsibilities
minhash
MinHash signature generation:
pub struct MinHasher {
    num_permutations: usize, // 128 default
    seeds: Vec<u64>,
}

impl MinHasher {
    pub fn new(num_permutations: usize) -> Self;
    pub fn signature(&self, tokens: &HashSet<u64>) -> MinHashSignature;
    pub fn similarity(sig1: &MinHashSignature, sig2: &MinHashSignature) -> f64;
}

pub struct MinHashSignature {
    pub values: Vec<u64>,
}

Use mithril_core::hashing::hash_with_seed() for hashing.
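A minimal sketch of the MinHasher interface above, using only the standard library. The local `hash_with_seed` is a stand-in with an assumed signature (the real `mithril_core` function may differ), and the seed derivation and the `u64::MAX` value for empty token sets are illustrative choices, not part of the specification. The `num_permutations` field is omitted here because it equals `seeds.len()`.

```rust
use std::collections::HashSet;

/// Stand-in for `mithril_core::hashing::hash_with_seed` (signature assumed).
/// Mixes the seed into the value, then applies a splitmix64-style finalizer.
fn hash_with_seed(value: u64, seed: u64) -> u64 {
    let mut z = value ^ seed.wrapping_mul(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

pub struct MinHashSignature {
    pub values: Vec<u64>,
}

pub struct MinHasher {
    seeds: Vec<u64>,
}

impl MinHasher {
    pub fn new(num_permutations: usize) -> Self {
        // Deterministic seeds keep signatures comparable across runs.
        let seeds = (0..num_permutations as u64)
            .map(|i| hash_with_seed(i, 0x5EED))
            .collect();
        Self { seeds }
    }

    /// One minimum per seed: the smallest seeded hash over all tokens.
    pub fn signature(&self, tokens: &HashSet<u64>) -> MinHashSignature {
        let values = self
            .seeds
            .iter()
            .map(|&seed| {
                tokens
                    .iter()
                    .map(|&t| hash_with_seed(t, seed))
                    .min()
                    .unwrap_or(u64::MAX) // empty token set
            })
            .collect();
        MinHashSignature { values }
    }

    /// Fraction of matching positions, an estimate of Jaccard similarity.
    pub fn similarity(sig1: &MinHashSignature, sig2: &MinHashSignature) -> f64 {
        assert_eq!(sig1.values.len(), sig2.values.len());
        if sig1.values.is_empty() {
            return 0.0;
        }
        let matches = sig1
            .values
            .iter()
            .zip(&sig2.values)
            .filter(|(a, b)| a == b)
            .count();
        matches as f64 / sig1.values.len() as f64
    }
}
```

With 128 permutations the estimate has a standard error of a few percent near a similarity of 0.85, so borderline pairs may need an exact check.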
lsh
Locality-Sensitive Hashing for candidate pair generation:
pub struct LshIndex {
    num_bands: usize,
    rows_per_band: usize,
    buckets: Vec<HashMap<u64, Vec<DocId>>>,
}

impl LshIndex {
    /// Create with target similarity threshold
    /// For 0.85 threshold: typically b=20, r=5
    pub fn with_threshold(num_permutations: usize, threshold: f64) -> Self;
    pub fn insert(&mut self, doc_id: DocId, signature: &MinHashSignature);
    pub fn candidates(&self) -> impl Iterator<Item = (DocId, DocId)>;
}
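One way the banding scheme above could be sketched with the standard library only. Several things here are assumptions: `DocId` is taken to be `u64`, `insert` accepts a plain `&[u64]` instead of `MinHashSignature` so the block stands alone, and `with_threshold` picks the band/row split whose approximate threshold `(1/b)^(1/r)` is closest to the target. Under that approximation b=20, r=5 sits near 0.55, so it behaves as a high-recall setting that admits many pairs below 0.85 and relies on a later similarity check.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

type DocId = u64; // assumption: the crate's real DocId type is not shown

/// Similarity at which a pair has roughly even odds of becoming a candidate.
fn approx_threshold(bands: usize, rows: usize) -> f64 {
    (1.0 / bands as f64).powf(1.0 / rows as f64)
}

pub struct LshIndex {
    num_bands: usize,
    rows_per_band: usize,
    buckets: Vec<HashMap<u64, Vec<DocId>>>,
}

impl LshIndex {
    pub fn new(num_bands: usize, rows_per_band: usize) -> Self {
        Self { num_bands, rows_per_band, buckets: vec![HashMap::new(); num_bands] }
    }

    /// Hypothetical selection rule: closest approximate threshold wins.
    pub fn with_threshold(num_permutations: usize, threshold: f64) -> Self {
        let mut best = (1, num_permutations.max(1));
        let mut best_err = f64::INFINITY;
        for rows in 1..=num_permutations {
            let bands = num_permutations / rows;
            let err = (approx_threshold(bands, rows) - threshold).abs();
            if err < best_err {
                best_err = err;
                best = (bands, rows);
            }
        }
        Self::new(best.0, best.1)
    }

    pub fn shape(&self) -> (usize, usize) {
        (self.num_bands, self.rows_per_band)
    }

    /// Hash each band of the signature into that band's bucket table.
    pub fn insert(&mut self, doc_id: DocId, signature: &[u64]) {
        let rows = self.rows_per_band;
        assert!(signature.len() >= self.num_bands * rows, "signature too short");
        for (band, table) in self.buckets.iter_mut().enumerate() {
            let mut hasher = DefaultHasher::new();
            signature[band * rows..(band + 1) * rows].hash(&mut hasher);
            table.entry(hasher.finish()).or_default().push(doc_id);
        }
    }

    /// Every unordered pair that shares at least one bucket, deduplicated.
    pub fn candidates(&self) -> impl Iterator<Item = (DocId, DocId)> {
        let mut pairs = HashSet::new();
        for table in &self.buckets {
            for ids in table.values() {
                for i in 0..ids.len() {
                    for j in (i + 1)..ids.len() {
                        let (a, b) = (ids[i].min(ids[j]), ids[i].max(ids[j]));
                        if a != b {
                            pairs.insert((a, b));
                        }
                    }
                }
            }
        }
        pairs.into_iter()
    }
}
```

Pair enumeration is quadratic in bucket size, so very large buckets (boilerplate documents, for example) may need a cap or a different strategy.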
cluster
Union-Find for grouping duplicates:
pub struct UnionFind {
    parent: Vec<usize>,
    rank: Vec<usize>,
}

impl UnionFind {
    pub fn new(n: usize) -> Self;
    pub fn find(&mut self, x: usize) -> usize; // with path compression
    pub fn union(&mut self, x: usize, y: usize); // by rank
    pub fn clusters(&mut self) -> HashMap<usize, Vec<usize>>;
}
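The interface above maps onto the textbook disjoint-set structure. The following is a sketch of one straightforward implementation, with an iterative `find` so that long parent chains cannot overflow the stack; the details are illustrative rather than taken from the crate.

```rust
use std::cmp::Ordering;
use std::collections::HashMap;

pub struct UnionFind {
    parent: Vec<usize>,
    rank: Vec<usize>,
}

impl UnionFind {
    /// Start with every element in its own singleton set.
    pub fn new(n: usize) -> Self {
        Self { parent: (0..n).collect(), rank: vec![0; n] }
    }

    /// Find the set representative, compressing the path on the way.
    pub fn find(&mut self, x: usize) -> usize {
        // First pass: walk up to the root.
        let mut root = x;
        while self.parent[root] != root {
            root = self.parent[root];
        }
        // Second pass: point every node on the path directly at the root.
        let mut cur = x;
        while self.parent[cur] != root {
            let next = self.parent[cur];
            self.parent[cur] = root;
            cur = next;
        }
        root
    }

    /// Merge two sets, attaching the shallower tree under the deeper one.
    pub fn union(&mut self, x: usize, y: usize) {
        let (rx, ry) = (self.find(x), self.find(y));
        if rx == ry {
            return;
        }
        match self.rank[rx].cmp(&self.rank[ry]) {
            Ordering::Less => self.parent[rx] = ry,
            Ordering::Greater => self.parent[ry] = rx,
            Ordering::Equal => {
                self.parent[ry] = rx;
                self.rank[rx] += 1;
            }
        }
    }

    /// Group all elements by their representative.
    pub fn clusters(&mut self) -> HashMap<usize, Vec<usize>> {
        let mut out: HashMap<usize, Vec<usize>> = HashMap::new();
        for i in 0..self.parent.len() {
            let root = self.find(i);
            out.entry(root).or_default().push(i);
        }
        out
    }
}
```

In a dedup pipeline, each verified candidate pair would be passed to `union`, and one document per cluster would then be kept.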
io
File I/O for JSONL and Parquet:
pub fn read_jsonl(path: &Path, text_field: &str) -> Result<Vec<Document>>;
pub fn read_parquet(path: &Path, text_column: &str) -> Result<Vec<Document>>;
pub fn write_jsonl(path: &Path, docs: &[Document]) -> Result<()>;
cli (main.rs)
Command-line interface:
mithril-dedup input.jsonl -o output.jsonl --field text --threshold 0.85
Target Metrics
| Metric | Target |
|---|---|
| Throughput | ≥100K docs/sec |
| Precision | ≥0.95 |
| Recall | ≥0.90 |
| Memory (LSH) | <16GB for 1B docs |
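The memory target is easier to reason about per document. The sketch below works through the arithmetic under stated assumptions: decimal gigabytes, 8-byte document ids, and hypothetical helper names. It suggests that 16 GB for 1B docs leaves about 16 bytes per document, far less than a stored 128-value signature (1024 bytes) or one 8-byte bucket entry per band, so the index would likely need a more compact representation than the plain bucket tables.

```rust
/// Bytes available per document under a total memory budget.
fn bytes_per_doc(budget_bytes: u64, num_docs: u64) -> f64 {
    budget_bytes as f64 / num_docs as f64
}

/// Bytes needed to keep one full MinHash signature of `u64` values.
fn signature_bytes(num_permutations: u64) -> u64 {
    num_permutations * std::mem::size_of::<u64>() as u64
}

/// Lower bound for bucket tables that store one 8-byte id per band per doc
/// (assumption: ignores hash map overhead and bucket keys).
fn bucket_entry_bytes(num_bands: u64) -> u64 {
    num_bands * 8
}
```

For example, `bytes_per_doc(16_000_000_000, 1_000_000_000)` gives 16.0, while `signature_bytes(128)` gives 1024 and `bucket_entry_bytes(20)` gives 160.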
Key Dependencies
mithril-core = { workspace = true }
xxhash-rust = { workspace = true }
rayon = { workspace = true }
arrow = { workspace = true }
parquet = { workspace = true }
clap = { workspace = true }
Test Fixtures
- fixtures/datasets/duplicates.jsonl: 1000 docs with 30% known duplicates
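A fixture with known duplicates makes it possible to measure the precision and recall targets directly. The following is a sketch that compares predicted duplicate pairs against ground-truth pairs. How the fixture labels its duplicates is not shown here, so the pair representation and the `precision_recall` helper are assumptions.

```rust
use std::collections::HashSet;

/// An unordered document pair, stored with the smaller id first.
type Pair = (u64, u64);

/// Normalize a pair so (a, b) and (b, a) compare equal.
pub fn normalize(a: u64, b: u64) -> Pair {
    if a <= b { (a, b) } else { (b, a) }
}

/// Returns (precision, recall) of predicted pairs against known pairs.
/// Empty inputs are scored as 1.0 to avoid dividing by zero.
pub fn precision_recall(predicted: &HashSet<Pair>, truth: &HashSet<Pair>) -> (f64, f64) {
    let true_positives = predicted.intersection(truth).count() as f64;
    let precision = if predicted.is_empty() {
        1.0
    } else {
        true_positives / predicted.len() as f64
    };
    let recall = if truth.is_empty() {
        1.0
    } else {
        true_positives / truth.len() as f64
    };
    (precision, recall)
}
```

A test over the fixture could then assert precision of at least 0.95 and recall of at least 0.90, matching the target metrics.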
Testing
cargo test -p mithril-dedup
cargo bench -p mithril-dedup
Implementation Order
- Implement minhash module with tests
- Implement lsh module
- Implement cluster (UnionFind)
- Implement io for JSONL
- Wire up CLI
- Add Parquet support
- Run benchmarks
- Update STATUS.md
Completion Criteria
- Detects duplicates with Jaccard ≥0.85
- ≥100K docs/sec throughput
- CLI works: mithril-dedup input.jsonl -o output.jsonl
- Unit tests pass
- STATUS.md updated to COMPLETE
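The first criterion, detecting duplicates at Jaccard ≥0.85, can be checked exactly on token sets, which is useful both for verifying LSH candidates and for unit tests. The sketch below assumes documents are already reduced to `HashSet<u64>` token hashes. The `jaccard` and `is_duplicate` helpers are hypothetical names, and treating two empty sets as identical is an illustrative choice.

```rust
use std::collections::HashSet;

/// Exact Jaccard similarity: |A ∩ B| / |A ∪ B|.
pub fn jaccard(a: &HashSet<u64>, b: &HashSet<u64>) -> f64 {
    if a.is_empty() && b.is_empty() {
        return 1.0; // assumption: two empty documents count as identical
    }
    let intersection = a.intersection(b).count();
    let union = a.len() + b.len() - intersection;
    intersection as f64 / union as f64
}

/// True when the pair meets the duplicate threshold (0.85 in this document).
pub fn is_duplicate(a: &HashSet<u64>, b: &HashSet<u64>, threshold: f64) -> bool {
    jaccard(a, b) >= threshold
}
```

Two sets of 100 tokens that share 95 have a Jaccard similarity of 95/105, about 0.905, and would count as duplicates; sharing only 90 gives 90/110, about 0.818, which falls below the threshold.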