claude-skill-registry-data · mithril-dedup-agent

Build mithril-dedup for ML dataset deduplication. Use when implementing MinHash, LSH, clustering, or document I/O.

Install

Source: clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry-data

Claude Code: install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/mithril-dedup-agent" ~/.claude/skills/majiayu000-claude-skill-registry-data-mithril-dedup-agent && rm -rf "$T"

Manifest: data/mithril-dedup-agent/SKILL.md

Source Content

Mithril Dedup Agent

Build data deduplication for ML training datasets at 100K+ docs/sec.

Status

Read crates/mithril-dedup/STATUS.md for current progress.

Reference Documentation

  • dedup/SPEC.md - Full product specification
  • RESEARCH.md - Papers (the Google 2021 dedup paper, Lee et al., "Deduplicating Training Data Makes Language Models Better"; LSHBloom; text-dedup)

Module Responsibilities

minhash

MinHash signature generation:

pub struct MinHasher {
    num_permutations: usize,  // 128 default
    seeds: Vec<u64>,
}

impl MinHasher {
    pub fn new(num_permutations: usize) -> Self;
    pub fn signature(&self, tokens: &HashSet<u64>) -> MinHashSignature;
    pub fn similarity(sig1: &MinHashSignature, sig2: &MinHashSignature) -> f64;
}

pub struct MinHashSignature {
    pub values: Vec<u64>,
}

Use mithril_core::hashing::hash_with_seed() for hashing.
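
A minimal sketch of the two methods, assuming the rehash-per-seed MinHash variant and a hash_with_seed(value: u64, seed: u64) -> u64 signature (an assumption; check mithril-core for the real one):

impl MinHasher {
    pub fn signature(&self, tokens: &HashSet<u64>) -> MinHashSignature {
        // For each seed, keep the minimum hash over all tokens.
        let values = self
            .seeds
            .iter()
            .map(|&seed| {
                tokens
                    .iter()
                    // Assumed signature: hash_with_seed(value: u64, seed: u64) -> u64.
                    .map(|&t| mithril_core::hashing::hash_with_seed(t, seed))
                    .min()
                    .unwrap_or(u64::MAX) // empty token set
            })
            .collect();
        MinHashSignature { values }
    }

    pub fn similarity(sig1: &MinHashSignature, sig2: &MinHashSignature) -> f64 {
        // The fraction of positions where two signatures agree is an
        // unbiased estimator of the Jaccard similarity of the token sets.
        let matches = sig1
            .values
            .iter()
            .zip(&sig2.values)
            .filter(|(a, b)| a == b)
            .count();
        matches as f64 / sig1.values.len() as f64
    }
}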

lsh

Locality-Sensitive Hashing for candidate pair generation:

pub struct LshIndex {
    num_bands: usize,
    rows_per_band: usize,
    buckets: Vec<HashMap<u64, Vec<DocId>>>,
}

impl LshIndex {
    /// Create with target similarity threshold
    /// For a 0.85 threshold at 128 permutations: roughly b=9, r=14,
    /// since the candidate-probability curve's midpoint sits near (1/b)^(1/r)
    pub fn with_threshold(num_permutations: usize, threshold: f64) -> Self;
    pub fn insert(&mut self, doc_id: DocId, signature: &MinHashSignature);
    pub fn candidates(&self) -> impl Iterator<Item = (DocId, DocId)>;
}
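
Two signatures become a candidate pair when they agree on all r rows of at least one band; for Jaccard similarity s this happens with probability 1 - (1 - s^r)^b, with midpoint near (1/b)^(1/r). A sketch of with_threshold and insert built on that formula, assuming DocId is Copy and using xxh3 from the listed xxhash-rust dependency (the parameter search shown is illustrative, not the crate's actual one):

impl LshIndex {
    pub fn with_threshold(num_permutations: usize, threshold: f64) -> Self {
        // Search (b, r) with b * r <= num_permutations so that the
        // S-curve midpoint (1/b)^(1/r) lands closest to the threshold.
        // At 128 permutations and threshold 0.85 this picks b=9, r=14.
        let (mut best_b, mut best_r, mut best_err) = (1, num_permutations, f64::MAX);
        for r in 1..=num_permutations {
            let b = num_permutations / r;
            let err = ((1.0 / b as f64).powf(1.0 / r as f64) - threshold).abs();
            if err < best_err {
                (best_b, best_r, best_err) = (b, r, err);
            }
        }
        LshIndex {
            num_bands: best_b,
            rows_per_band: best_r,
            buckets: vec![HashMap::new(); best_b],
        }
    }

    pub fn insert(&mut self, doc_id: DocId, signature: &MinHashSignature) {
        let r = self.rows_per_band;
        for (band, bucket) in self.buckets.iter_mut().enumerate() {
            // One key per band: hash that band's r signature values together.
            let rows = &signature.values[band * r..(band + 1) * r];
            let bytes: Vec<u8> = rows.iter().flat_map(|v| v.to_le_bytes()).collect();
            let key = xxhash_rust::xxh3::xxh3_64(&bytes);
            bucket.entry(key).or_default().push(doc_id);
        }
    }
}

candidates() then only has to walk buckets and emit co-bucketed pairs; callers must still verify each pair with MinHasher::similarity, since banding produces false positives.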

cluster

Union-Find for grouping duplicates:

pub struct UnionFind {
    parent: Vec<usize>,
    rank: Vec<usize>,
}

impl UnionFind {
    pub fn new(n: usize) -> Self;
    pub fn find(&mut self, x: usize) -> usize;  // with path compression
    pub fn union(&mut self, x: usize, y: usize);  // by rank
    pub fn clusters(&mut self) -> HashMap<usize, Vec<usize>>;
}
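
A standard realization of the declared interface (textbook path compression and union by rank; the crate may differ in detail):

use std::collections::HashMap;

impl UnionFind {
    pub fn new(n: usize) -> Self {
        UnionFind { parent: (0..n).collect(), rank: vec![0; n] }
    }

    pub fn find(&mut self, x: usize) -> usize {
        // Path compression: point every node on the walk at the root.
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root;
        }
        self.parent[x]
    }

    pub fn union(&mut self, x: usize, y: usize) {
        // Union by rank: attach the shallower tree under the deeper one.
        let (rx, ry) = (self.find(x), self.find(y));
        if rx == ry {
            return;
        }
        if self.rank[rx] < self.rank[ry] {
            self.parent[rx] = ry;
        } else if self.rank[rx] > self.rank[ry] {
            self.parent[ry] = rx;
        } else {
            self.parent[ry] = rx;
            self.rank[rx] += 1;
        }
    }

    pub fn clusters(&mut self) -> HashMap<usize, Vec<usize>> {
        // Group element indices by root: one entry per duplicate cluster.
        let mut map: HashMap<usize, Vec<usize>> = HashMap::new();
        for i in 0..self.parent.len() {
            let root = self.find(i);
            map.entry(root).or_default().push(i);
        }
        map
    }
}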

io

File I/O for JSONL and Parquet:

pub fn read_jsonl(path: &Path, text_field: &str) -> Result<Vec<Document>>;
pub fn read_parquet(path: &Path, text_column: &str) -> Result<Vec<Document>>;
pub fn write_jsonl(path: &Path, docs: &[Document]) -> Result<()>;
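
A sketch of the JSONL reader, assuming serde_json and anyhow (neither is in the listed dependencies) and a hypothetical minimal Document shape:

use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;

// Hypothetical shape; the crate's real Document type may carry more fields.
pub struct Document {
    pub id: usize,
    pub text: String,
}

pub fn read_jsonl(path: &Path, text_field: &str) -> anyhow::Result<Vec<Document>> {
    // One JSON object per line, e.g. {"text": "...", "url": "..."};
    // only text_field is pulled out for dedup.
    let reader = BufReader::new(File::open(path)?);
    let mut docs = Vec::new();
    for (id, line) in reader.lines().enumerate() {
        let value: serde_json::Value = serde_json::from_str(&line?)?;
        let text = value[text_field].as_str().unwrap_or_default().to_owned();
        docs.push(Document { id, text });
    }
    Ok(docs)
}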

cli (main.rs)

Command-line interface:

mithril-dedup input.jsonl -o output.jsonl --field text --threshold 0.85
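
A possible clap layout matching that usage line (derive feature assumed; everything beyond the shown flags is illustrative):

use clap::Parser;
use std::path::PathBuf;

/// Deduplicate an ML training dataset.
#[derive(Parser)]
#[command(name = "mithril-dedup")]
struct Args {
    /// Input dataset (JSONL or Parquet)
    input: PathBuf,
    /// Where to write the deduplicated docs
    #[arg(short, long)]
    output: PathBuf,
    /// JSON field / Parquet column holding the text
    #[arg(long, default_value = "text")]
    field: String,
    /// Jaccard similarity threshold
    #[arg(long, default_value_t = 0.85)]
    threshold: f64,
}

fn main() {
    let args = Args::parse();
    // Pipeline: read docs -> MinHash signatures -> LSH candidates ->
    // verify with similarity() -> UnionFind clusters -> keep one doc
    // per cluster -> write survivors.
    println!("{} -> {} (threshold {})", args.input.display(), args.output.display(), args.threshold);
}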

Target Metrics

Metric           Target
Throughput       ≥100K docs/sec
Precision        ≥0.95
Recall           ≥0.90
Memory (LSH)     <16GB for 1B docs

Key Dependencies

mithril-core = { workspace = true }
xxhash-rust = { workspace = true }
rayon = { workspace = true }
arrow = { workspace = true }
parquet = { workspace = true }
clap = { workspace = true }

Test Fixtures

  • fixtures/datasets/duplicates.jsonl - 1000 docs with 30% known duplicates

Testing

cargo test -p mithril-dedup
cargo bench -p mithril-dedup
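
Illustrative unit-test shapes against the minhash interface above (names and tolerances are assumptions, not existing tests):

#[cfg(test)]
mod tests {
    use super::*;
    use std::collections::HashSet;

    #[test]
    fn identical_token_sets_match_exactly() {
        let hasher = MinHasher::new(128);
        let tokens: HashSet<u64> = (0..100u64).collect();
        let sig = hasher.signature(&tokens);
        assert_eq!(MinHasher::similarity(&sig, &sig), 1.0);
    }

    #[test]
    fn disjoint_token_sets_rarely_match() {
        let hasher = MinHasher::new(128);
        let a: HashSet<u64> = (0..100u64).collect();
        let b: HashSet<u64> = (1_000..1_100u64).collect();
        let sim = MinHasher::similarity(&hasher.signature(&a), &hasher.signature(&b));
        assert!(sim < 0.1); // true Jaccard is 0; estimator noise stays small
    }
}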

Implementation Order

  1. Implement minhash module with tests
  2. Implement lsh module
  3. Implement cluster (UnionFind)
  4. Implement io for JSONL
  5. Wire up CLI
  6. Add Parquet support
  7. Run benchmarks
  8. Update STATUS.md

Completion Criteria

  • Detects duplicates with Jaccard ≥0.85
  • ≥100K docs/sec throughput
  • CLI works: mithril-dedup input.jsonl -o output.jsonl
  • Unit tests pass
  • STATUS.md updated to COMPLETE