Skillsbench academic-pdf-redaction
Redact text from PDF documents for blind review anonymization
install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/paper-anonymizer/environment/skills/academic-pdf-redaction" ~/.claude/skills/benchflow-ai-skillsbench-academic-pdf-redaction && rm -rf "$T"
manifest:
tasks/paper-anonymizer/environment/skills/academic-pdf-redaction/SKILL.mdsource content
PDF Redaction for Blind Review
Redact identifying information from academic papers for blind review.
CRITICAL RULES
- PRESERVE References section - Self-citations MUST remain intact
- ONLY redact specific text matches - Never redact entire pages/regions
- VERIFY output - Check that 80%+ of original text remains
Common Pitfalls to AVOID
# ❌ WRONG - This removes ALL text from the page: for block in page.get_text("blocks"): page.add_redact_annot(fitz.Rect(block[:4])) # ❌ WRONG - Drawing rectangles over text: page.draw_rect(fitz.Rect(0, 0, 600, 100), fill=(0,0,0)) # ✅ CORRECT - Only redact specific search matches: for rect in page.search_for("John Smith"): page.add_redact_annot(rect)
Patterns to Redact (Before References Only)
IMPORTANT: Use FULL names/phrases, not partial matches!
- ✅ "John Smith" (full name)
- ❌ "Smith" (partial - would incorrectly match "Smith et al." citations in References)
- Author names - FULL names only (e.g., "John Smith", not just "Smith")
- Affiliations - Universities, companies (e.g., "Duke University")
- Email addresses - Pattern:
,*@*.edu*@*.com - Venue names - Conference/workshop names (e.g., "ICML 2024", "ICML Workshop")
- arXiv identifiers - Pattern:
arXiv:XXXX.XXXXX - DOIs - Pattern:
10.XXXX/... - Acknowledgement names - Names in "Acknowledgements" section
- Equal contribution footnotes - e.g., "Equal contribution", "* Equal contribution"
PyMuPDF (fitz) - Recommended Approach
import fitz import os def redact_with_pymupdf(input_path: str, output_path: str, patterns: list[str]): """Redact specific patterns from PDF using PyMuPDF.""" doc = fitz.open(input_path) original_len = sum(len(p.get_text()) for p in doc) # Find References page - stop redacting there references_page = None for i, page in enumerate(doc): if "references" in page.get_text().lower(): references_page = i break for page_num, page in enumerate(doc): if references_page is not None and page_num >= references_page: continue # Skip References section for pattern in patterns: # ONLY redact exact search matches for rect in page.search_for(pattern): page.add_redact_annot(rect, fill=(0, 0, 0)) page.apply_redactions() os.makedirs(os.path.dirname(output_path), exist_ok=True) doc.save(output_path) doc.close() # MUST verify after saving verify_redaction(input_path, output_path)
REQUIRED: Verification Function
Always run this after ANY redaction to catch errors early:
import fitz def verify_redaction(original_path, output_path): """Verify redaction didn't corrupt the PDF.""" orig = fitz.open(original_path) redc = fitz.open(output_path) orig_len = sum(len(p.get_text()) for p in orig) redc_len = sum(len(p.get_text()) for p in redc) print(f"Original: {len(orig)} pages, {orig_len} chars") print(f"Redacted: {len(redc)} pages, {redc_len} chars") print(f"Retained: {redc_len/orig_len:.1%}") # DEFENSIVE CHECKS - fail fast if something went wrong if len(redc) != len(orig): raise ValueError(f"Page count changed: {len(orig)} -> {len(redc)}") if redc_len < 1000: raise ValueError(f"PDF corrupted: only {redc_len} chars remain!") if redc_len < orig_len * 0.7: raise ValueError(f"Too much removed: kept only {redc_len/orig_len:.0%}") orig.close() redc.close() print("✓ Verification passed")