claude-skill-registry-data · mithril-dedup-agent

Build mithril-dedup for ML dataset deduplication. Use when implementing MinHash, LSH, clustering, or document I/O.

Install

Source: clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry-data

Claude Code: install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/mithril-dedup-agent" ~/.claude/skills/majiayu000-claude-skill-registry-data-mithril-dedup-agent && rm -rf "$T"

Manifest: data/mithril-dedup-agent/SKILL.md

Source Content

Mithril Dedup Agent

Build data deduplication for ML training datasets at 100K+ docs/sec.

Status

Read crates/mithril-dedup/STATUS.md for current progress.

Reference Documentation

  • dedup/SPEC.md - Full product specification
  • RESEARCH.md - Papers (the Google 2021 dedup paper, Lee et al., "Deduplicating Training Data Makes Language Models Better"; LSHBloom; text-dedup)

Module Responsibilities

minhash

MinHash signature generation:

pub struct MinHasher {
    num_permutations: usize,  // 128 default
    seeds: Vec<u64>,
}

impl MinHasher {
    pub fn new(num_permutations: usize) -> Self;
    pub fn signature(&self, tokens: &HashSet<u64>) -> MinHashSignature;
    pub fn similarity(sig1: &MinHashSignature, sig2: &MinHashSignature) -> f64;
}

pub struct MinHashSignature {
    pub values: Vec<u64>,
}

Use mithril_core::hashing::hash_with_seed() for hashing.
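
A minimal sketch of the two methods, assuming the rehash-per-seed MinHash variant and a hash_with_seed(value: u64, seed: u64) -> u64 signature (an assumption; check mithril-core for the real one):

impl MinHasher {
    pub fn signature(&self, tokens: &HashSet<u64>) -> MinHashSignature {
        // For each seed, keep the minimum hash over all tokens.
        let values = self
            .seeds
            .iter()
            .map(|&seed| {
                tokens
                    .iter()
                    // Assumed signature: hash_with_seed(value: u64, seed: u64) -> u64.
                    .map(|&t| mithril_core::hashing::hash_with_seed(t, seed))
                    .min()
                    .unwrap_or(u64::MAX) // empty token set
            })
            .collect();
        MinHashSignature { values }
    }

    pub fn similarity(sig1: &MinHashSignature, sig2: &MinHashSignature) -> f64 {
        // The fraction of positions where two signatures agree is an
        // unbiased estimator of the Jaccard similarity of the token sets.
        let matches = sig1
            .values
            .iter()
            .zip(&sig2.values)
            .filter(|(a, b)| a == b)
            .count();
        matches as f64 / sig1.values.len() as f64
    }
}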

lsh

Locality-Sensitive Hashing for candidate pair generation:

pub struct LshIndex {
    num_bands: usize,
    rows_per_band: usize,
    buckets: Vec<HashMap<u64, Vec<DocId>>>,
}

impl LshIndex {
    /// Create with target similarity threshold
    /// For a 0.85 threshold at 128 permutations: roughly b=9, r=14,
    /// since the candidate-probability curve's midpoint sits near (1/b)^(1/r)
    pub fn with_threshold(num_permutations: usize, threshold: f64) -> Self;
    pub fn insert(&mut self, doc_id: DocId, signature: &MinHashSignature);
    pub fn candidates(&self) -> impl Iterator<Item = (DocId, DocId)>;
}
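
Two signatures become a candidate pair when they agree on all r rows of at least one band; for Jaccard similarity s this happens with probability 1 - (1 - s^r)^b, with midpoint near (1/b)^(1/r). A sketch of with_threshold and insert built on that formula, assuming DocId is Copy and using xxh3 from the listed xxhash-rust dependency (the parameter search shown is illustrative, not the crate's actual one):

impl LshIndex {
    pub fn with_threshold(num_permutations: usize, threshold: f64) -> Self {
        // Search (b, r) with b * r <= num_permutations so that the
        // S-curve midpoint (1/b)^(1/r) lands closest to the threshold.
        // At 128 permutations and threshold 0.85 this picks b=9, r=14.
        let (mut best_b, mut best_r, mut best_err) = (1, num_permutations, f64::MAX);
        for r in 1..=num_permutations {
            let b = num_permutations / r;
            let err = ((1.0 / b as f64).powf(1.0 / r as f64) - threshold).abs();
            if err < best_err {
                (best_b, best_r, best_err) = (b, r, err);
            }
        }
        LshIndex {
            num_bands: best_b,
            rows_per_band: best_r,
            buckets: vec![HashMap::new(); best_b],
        }
    }

    pub fn insert(&mut self, doc_id: DocId, signature: &MinHashSignature) {
        let r = self.rows_per_band;
        for (band, bucket) in self.buckets.iter_mut().enumerate() {
            // One key per band: hash that band's r signature values together.
            let rows = &signature.values[band * r..(band + 1) * r];
            let bytes: Vec<u8> = rows.iter().flat_map(|v| v.to_le_bytes()).collect();
            let key = xxhash_rust::xxh3::xxh3_64(&bytes);
            bucket.entry(key).or_default().push(doc_id);
        }
    }
}

candidates() then only has to walk buckets and emit co-bucketed pairs; callers must still verify each pair with MinHasher::similarity, since banding produces false positives.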

cluster

Union-Find for grouping duplicates:

pub struct UnionFind {
    parent: Vec<usize>,
    rank: Vec<usize>,
}

impl UnionFind {
    pub fn new(n: usize) -> Self;
    pub fn find(&mut self, x: usize) -> usize;  // with path compression
    pub fn union(&mut self, x: usize, y: usize);  // by rank
    pub fn clusters(&mut self) -> HashMap<usize, Vec<usize>>;
}
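
A standard realization of the declared interface (textbook path compression and union by rank; the crate may differ in detail):

use std::collections::HashMap;

impl UnionFind {
    pub fn new(n: usize) -> Self {
        UnionFind { parent: (0..n).collect(), rank: vec![0; n] }
    }

    pub fn find(&mut self, x: usize) -> usize {
        // Path compression: point every node on the walk at the root.
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root;
        }
        self.parent[x]
    }

    pub fn union(&mut self, x: usize, y: usize) {
        // Union by rank: attach the shallower tree under the deeper one.
        let (rx, ry) = (self.find(x), self.find(y));
        if rx == ry {
            return;
        }
        if self.rank[rx] < self.rank[ry] {
            self.parent[rx] = ry;
        } else if self.rank[rx] > self.rank[ry] {
            self.parent[ry] = rx;
        } else {
            self.parent[ry] = rx;
            self.rank[rx] += 1;
        }
    }

    pub fn clusters(&mut self) -> HashMap<usize, Vec<usize>> {
        // Group element indices by root: one entry per duplicate cluster.
        let mut map: HashMap<usize, Vec<usize>> = HashMap::new();
        for i in 0..self.parent.len() {
            let root = self.find(i);
            map.entry(root).or_default().push(i);
        }
        map
    }
}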

io

File I/O for JSONL and Parquet:

pub fn read_jsonl(path: &Path, text_field: &str) -> Result<Vec<Document>>;
pub fn read_parquet(path: &Path, text_column: &str) -> Result<Vec<Document>>;
pub fn write_jsonl(path: &Path, docs: &[Document]) -> Result<()>;
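
A sketch of the JSONL reader, assuming serde_json and anyhow (neither is in the listed dependencies) and a hypothetical minimal Document shape:

use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::Path;

// Hypothetical shape; the crate's real Document type may carry more fields.
pub struct Document {
    pub id: usize,
    pub text: String,
}

pub fn read_jsonl(path: &Path, text_field: &str) -> anyhow::Result<Vec<Document>> {
    // One JSON object per line, e.g. {"text": "...", "url": "..."};
    // only text_field is pulled out for dedup.
    let reader = BufReader::new(File::open(path)?);
    let mut docs = Vec::new();
    for (id, line) in reader.lines().enumerate() {
        let value: serde_json::Value = serde_json::from_str(&line?)?;
        let text = value[text_field].as_str().unwrap_or_default().to_owned();
        docs.push(Document { id, text });
    }
    Ok(docs)
}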

cli (main.rs)

Command-line interface:

mithril-dedup input.jsonl -o output.jsonl --field text --threshold 0.85
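
A possible clap layout matching that usage line (derive feature assumed; everything beyond the shown flags is illustrative):

use clap::Parser;
use std::path::PathBuf;

/// Deduplicate an ML training dataset.
#[derive(Parser)]
#[command(name = "mithril-dedup")]
struct Args {
    /// Input dataset (JSONL or Parquet)
    input: PathBuf,
    /// Where to write the deduplicated docs
    #[arg(short, long)]
    output: PathBuf,
    /// JSON field / Parquet column holding the text
    #[arg(long, default_value = "text")]
    field: String,
    /// Jaccard similarity threshold
    #[arg(long, default_value_t = 0.85)]
    threshold: f64,
}

fn main() {
    let args = Args::parse();
    // Pipeline: read docs -> MinHash signatures -> LSH candidates ->
    // verify with similarity() -> UnionFind clusters -> keep one doc
    // per cluster -> write survivors.
    println!("{} -> {} (threshold {})", args.input.display(), args.output.display(), args.threshold);
}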

Target Metrics

Metric           Target
Throughput       ≥100K docs/sec
Precision        ≥0.95
Recall           ≥0.90
Memory (LSH)     <16GB for 1B docs

Key Dependencies

mithril-core = { workspace = true }
xxhash-rust = { workspace = true }
rayon = { workspace = true }
arrow = { workspace = true }
parquet = { workspace = true }
clap = { workspace = true }

Test Fixtures

  • fixtures/datasets/duplicates.jsonl - 1000 docs with 30% known duplicates

Testing

cargo test -p mithril-dedup
cargo bench -p mithril-dedup
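
Illustrative unit-test shapes against the minhash interface above (names and tolerances are assumptions, not existing tests):

#[cfg(test)]
mod tests {
    use super::*;
    use std::collections::HashSet;

    #[test]
    fn identical_token_sets_match_exactly() {
        let hasher = MinHasher::new(128);
        let tokens: HashSet<u64> = (0..100u64).collect();
        let sig = hasher.signature(&tokens);
        assert_eq!(MinHasher::similarity(&sig, &sig), 1.0);
    }

    #[test]
    fn disjoint_token_sets_rarely_match() {
        let hasher = MinHasher::new(128);
        let a: HashSet<u64> = (0..100u64).collect();
        let b: HashSet<u64> = (1_000..1_100u64).collect();
        let sim = MinHasher::similarity(&hasher.signature(&a), &hasher.signature(&b));
        assert!(sim < 0.1); // true Jaccard is 0; estimator noise stays small
    }
}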

Implementation Order

  1. Implement minhash module with tests
  2. Implement lsh module
  3. Implement cluster (UnionFind)
  4. Implement io for JSONL
  5. Wire up CLI
  6. Add Parquet support
  7. Run benchmarks
  8. Update STATUS.md

Completion Criteria

  • Detects duplicates with Jaccard ≥0.85
  • ≥100K docs/sec throughput
  • CLI works: mithril-dedup input.jsonl -o output.jsonl
  • Unit tests pass
  • STATUS.md updated to COMPLETE