Skillsbench academic-pdf-redaction

Redact text from PDF documents for blind review anonymization

install

source · Clone the upstream repo

git clone https://github.com/benchflow-ai/skillsbench

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/paper-anonymizer/environment/skills/academic-pdf-redaction" ~/.claude/skills/benchflow-ai-skillsbench-academic-pdf-redaction && rm -rf "$T"

manifest: tasks/paper-anonymizer/environment/skills/academic-pdf-redaction/SKILL.md

source content

PDF Redaction for Blind Review

Redact identifying information from academic papers for blind review.

CRITICAL RULES

PRESERVE References section - Self-citations MUST remain intact
ONLY redact specific text matches - Never redact entire pages/regions
VERIFY output - Check that 80%+ of original text remains

Common Pitfalls to AVOID

# ❌ WRONG - This removes ALL text from the page:
for block in page.get_text("blocks"):
    page.add_redact_annot(fitz.Rect(block[:4]))

# ❌ WRONG - Drawing rectangles only covers the text; it stays extractable underneath:
page.draw_rect(fitz.Rect(0, 0, 600, 100), fill=(0,0,0))

# ✅ CORRECT - Only redact specific search matches:
for rect in page.search_for("John Smith"):
    page.add_redact_annot(rect)

Patterns to Redact (Before References Only)

IMPORTANT: Use FULL names/phrases, not partial matches!

✅ "John Smith" (full name)
❌ "Smith" (partial - would incorrectly match "Smith et al." citations in References)

Author names - FULL names only (e.g., "John Smith", not just "Smith")
Affiliations - Universities, companies (e.g., "Duke University")
Email addresses - Patterns:
```
*@*.edu
*@*.com
```
Venue names - Conference/workshop names (e.g., "ICML 2024", "ICML Workshop")
arXiv identifiers - Pattern:
```
arXiv:XXXX.XXXXX
```
DOIs - Pattern:
```
10.XXXX/...
```
Acknowledgement names - Names in "Acknowledgements" section
Equal contribution footnotes - e.g., "Equal contribution", "* Equal contribution"
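
`page.search_for` looks for literal text, so wildcard-style entries such as `*@*.edu` or `arXiv:XXXX.XXXXX` need to be turned into the concrete strings that appear in the paper before they can be redacted. The following is a minimal sketch of one way to do that, assuming the page text has already been extracted with `page.get_text()`. The regexes and the helper name `find_identifier_strings` are illustrative assumptions, not part of the skill, and may need tuning per paper:

```python
import re

# Assumed regexes for the wildcard-style patterns listed above.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ARXIV_RE = re.compile(r"arXiv:\d{4}\.\d{4,5}(?:v\d+)?")
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s<>]+")


def find_identifier_strings(text: str) -> list[str]:
    """Return the unique literal emails, arXiv IDs, and DOIs found in text."""
    found: list[str] = []
    for regex in (EMAIL_RE, ARXIV_RE, DOI_RE):
        for match in regex.findall(text):
            # Drop trailing sentence punctuation picked up by the greedy match
            literal = match.rstrip(".,;)")
            if literal and literal not in found:
                found.append(literal)
    return found


sample = "Contact: jane.doe@duke.edu. Preprint arXiv:2401.12345v2, doi 10.1234/abcd.5678."
identifiers = find_identifier_strings(sample)
```

Each returned string can then be appended to the `patterns` list and searched for like a full author name. Text extraction may split long identifiers across lines, so a string found here is not guaranteed to be found again by `search_for`.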

PyMuPDF (fitz) - Recommended Approach

import fitz
import os

def redact_with_pymupdf(input_path: str, output_path: str, patterns: list[str]):
    """Redact specific patterns from PDF using PyMuPDF."""
    doc = fitz.open(input_path)
    original_len = sum(len(p.get_text()) for p in doc)

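    # Find the page with a standalone "References" heading - stop redacting there.
    # (A plain substring test would also fire on body text such as "see the references".)
    references_page = None
    for i, page in enumerate(doc):
        lines = page.get_text().splitlines()
        # Strip whitespace and any leading section number ("7 References", "7. References")
        if any(l.strip().lstrip("0123456789. ").lower() == "references" for l in lines):
            references_page = i
            break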

    for page_num, page in enumerate(doc):
        if references_page is not None and page_num >= references_page:
            continue  # Skip References section

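        for pattern in patterns:
            # ONLY redact exact search matches; search_for looks for the literal
            # string, so pass concrete text (e.g. a full email address), not wildcards
            for rect in page.search_for(pattern):
                page.add_redact_annot(rect, fill=(0, 0, 0))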
        page.apply_redactions()

    # dirname is "" for a bare filename, and os.makedirs("") raises
    out_dir = os.path.dirname(output_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    doc.save(output_path)
    doc.close()

    # MUST verify after saving
    verify_redaction(input_path, output_path)
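
Locating the References boundary is the step most likely to go wrong, because the word "references" can also occur in body text or sit behind a section number. The following is a sketch of a stricter heading match that works on already-extracted page texts. The helper name `find_references_page`, the regex, and the inclusion of "Bibliography" as an alternative heading are assumptions for illustration:

```python
import re
from typing import Optional

# Assumption: the heading sits on its own line, optionally numbered
# ("References", "7 References", "7. References").
HEADING_RE = re.compile(
    r"^\s*(?:\d+\.?\s+)?(?:references|bibliography)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def find_references_page(page_texts: list[str]) -> Optional[int]:
    """Return the index of the first page with a standalone References heading."""
    for index, text in enumerate(page_texts):
        if HEADING_RE.search(text):
            return index
    return None


pages = [
    "Title\nSee the references for details.",
    "5 Conclusion\nDone.",
    "7. References\n[1] Smith et al.",
]
references_index = find_references_page(pages)
```

With a PyMuPDF document, the input could be built as `[p.get_text() for p in doc]`. The result is still page-level, so text that precedes the heading on the same page (often the Acknowledgements) is skipped by a loop that ignores the whole page.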

REQUIRED: Verification Function

Always run this after ANY redaction to catch errors early:

import fitz

def verify_redaction(original_path, output_path):
    """Verify redaction didn't corrupt the PDF."""
    orig = fitz.open(original_path)
    redc = fitz.open(output_path)

    orig_len = sum(len(p.get_text()) for p in orig)
    redc_len = sum(len(p.get_text()) for p in redc)

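    print(f"Original: {len(orig)} pages, {orig_len} chars")
    print(f"Redacted: {len(redc)} pages, {redc_len} chars")
    # Guard the ratio below: a scanned PDF without a text layer yields 0 chars
    if orig_len == 0:
        raise ValueError("Original PDF has no extractable text")
    print(f"Retained: {redc_len/orig_len:.1%}")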

    # DEFENSIVE CHECKS - fail fast if something went wrong
    if len(redc) != len(orig):
        raise ValueError(f"Page count changed: {len(orig)} -> {len(redc)}")
    if redc_len < 1000:
        raise ValueError(f"PDF corrupted: only {redc_len} chars remain!")
    if redc_len < orig_len * 0.8:  # matches the 80%+ retention rule above
        raise ValueError(f"Too much removed: kept only {redc_len/orig_len:.0%}")

    orig.close()
    redc.close()
    print("✓ Verification passed")
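
The checks above catch over-redaction but not missed matches. A complementary check, sketched here as an assumption and not as part of the skill, is to confirm that none of the patterns still appear in the text that precedes References. The helper name `find_leftover_patterns` is hypothetical. It compares case-insensitively and collapses whitespace so that names broken across lines are still detected:

```python
def find_leftover_patterns(text_before_references: str, patterns: list[str]) -> list[str]:
    """Return the patterns that still appear in the given text."""
    # Collapse runs of whitespace (including newlines) and ignore case
    normalized = " ".join(text_before_references.split()).lower()
    return [p for p in patterns if " ".join(p.split()).lower() in normalized]


leftovers = find_leftover_patterns(
    "A study by  John\nSmith at MIT",
    ["John Smith", "Duke University"],
)
```

An empty result suggests every pattern was removed from the checked text. A non-empty one lists the strings that still need attention.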