Claude-skill-registry Knack Data Cleaner
Ensures accuracy for HTI compliance and performance dashboards through data validation, deduplication, normalization, and integrity checks. Critica...
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/knack-data-cleaner" ~/.claude/skills/majiayu000-claude-skill-registry-knack-data-cleaner && rm -rf "$T"
skills/data/knack-data-cleaner/SKILL.mdKnack Data Cleaner
Purpose
Ensures accuracy for HTI compliance and performance dashboards through data validation, deduplication, normalization, and integrity checks. Critical for NCDIT reporting accuracy.
Core Functions
deduplicate_records
Purpose: Identify and remove/merge duplicate device or donor records
Detection Strategies:
- Exact match: Same serial number, donor email, etc.
- Fuzzy match: Similar names, close dates
- Bulk import duplicates: Same acquisition batch
Example:
const cleaned = await deduplicate_records({ object_key: "object_1", match_fields: ["serial_number", "acquisition_date"], strategy: "keep_latest" // or "merge", "keep_first" }); // Output: // { // original_count: 3520, // duplicates_found: 37, // final_count: 3483, // removed: [ // { id: "rec_123", reason: "Duplicate serial: ABC123" }, // // ... // ] // }
Deduplication Strategies:
const strategies = { keep_latest: "Retain most recently modified record", keep_first: "Keep oldest record (by created_date)", merge: "Combine data from all duplicates", manual_review: "Flag for human verification" };
normalize_fields
Purpose: Standardize data formats for consistency
Normalization Rules:
County Names
const county_normalizer = { "wake": "Wake", "WAKE": "Wake", "Wake County": "Wake", "wake co": "Wake", "wake co.": "Wake" };
Phone Numbers
// (919) 555-1234 → 9195551234 // 919-555-1234 → 9195551234 // 919.555.1234 → 9195551234 const normalizePhone = (phone) => phone.replace(/\D/g, '');
Emails
// Convert to lowercase, trim whitespace const normalizeEmail = (email) => email.trim().toLowerCase();
Dates
// Standardize to ISO 8601: YYYY-MM-DD const normalizeDate = (date) => { const d = new Date(date); return d.toISOString().split('T')[0]; };
Device Status
const status_normalizer = { "acquired": "Acquired", "ACQUIRED": "Acquired", "received": "Acquired", "in progress": "In Progress", "ready": "Ready for Donation", "delivered": "Presented" };
Example:
const normalized = await normalize_fields({ dataset: await fetch_all_pages("object_1"), rules: { county: county_normalizer, phone: normalizePhone, email: normalizeEmail, date: normalizeDate, status: status_normalizer } });
validate_integrity
Purpose: Check for logical inconsistencies and data quality issues
Validation Checks:
Date Logic
const date_validations = [ "acquisition_date < conversion_date", "conversion_date < presentation_date", "all dates within grant period (2024-2026)", "no future dates" ];
Required Fields
const required_by_status = { "Acquired": ["donor", "acquisition_date", "quantity"], "Converted": ["converted_date", "os_installed", "tested"], "Presented": ["recipient", "county", "presentation_date"], "Discarded": ["discard_reason", "discard_date"] };
Value Ranges
const range_validations = { quantity: { min: 1, max: 500 }, training_duration: { min: 0.5, max: 8 }, participants: { min: 1, max: 100 }, donation_value: { min: 0, max: 50000 } };
Geographic Validation
const valid_counties = [ "Wake", "Durham", "Vance", "Franklin", "Granville", "Halifax", "Wilson", "Edgecombe", "Martin", "Hertford", "Greene", "Warren", "Northampton", "Person", "Nash" ]; const county_validation = (county) => { return valid_counties.includes(county); };
Example:
const validation_report = await validate_integrity({ dataset: await fetch_all_pages("object_1"), rules: { dates: date_validations, required: required_by_status, ranges: range_validations, counties: valid_counties } }); // Output: // { // total_records: 3483, // valid: 3401, // errors: [ // { record_id: "rec_123", issue: "Conversion date before acquisition" }, // { record_id: "rec_456", issue: "Invalid county: 'Raleigh'" }, // { record_id: "rec_789", issue: "Missing required field: donor" } // ], // warnings: [ // { record_id: "rec_234", issue: "Unusually high quantity: 450" } // ] // }
fix_common_issues
Purpose: Auto-correct known data entry mistakes
Auto-Fix Rules:
Trim Whitespace
const fix_whitespace = (record) => { Object.keys(record).forEach(key => { if (typeof record[key] === 'string') { record[key] = record[key].trim(); } }); return record; };
Fix Capitalization
const fix_capitalization = (record) => { if (record.donor_name) { record.donor_name = toTitleCase(record.donor_name); } if (record.county) { record.county = county_normalizer[record.county.toLowerCase()] || record.county; } return record; };
Infer Missing Data
const infer_missing = (record) => { // If county missing but zip code present, infer county if (!record.county && record.zip_code) { record.county = inferCountyFromZip(record.zip_code); } // If discard reason missing but status is Discarded, set default if (record.status === "Discarded" && !record.discard_reason) { record.discard_reason = "Not specified"; } return record; };
Example:
const fixed = await fix_common_issues({ dataset: await fetch_all_pages("object_1"), auto_fixes: [ "trim_whitespace", "fix_capitalization", "infer_missing_counties", "standardize_phone_numbers" ] }); // Output: // { // processed: 3483, // fixed: 287, // fixes_by_type: { // whitespace: 145, // capitalization: 67, // inferred_county: 42, // phone_format: 33 // } // }
remove_test_records
Purpose: Exclude dummy/test data from production reports
Detection:
const is_test_record = (record) => { const test_indicators = [ record.donor_name?.toLowerCase().includes("test"), record.donor_email?.includes("test@"), record.donor_email?.includes("example.com"), record.serial_number?.startsWith("TEST"), record.notes?.toLowerCase().includes("do not count") ]; return test_indicators.some(indicator => indicator === true); };
Example:
const production_only = await remove_test_records({ dataset: await fetch_all_pages("object_1") }); // Output: // { // original: 3483, // test_records_removed: 8, // production_records: 3475 // }
enrich_records
Purpose: Add calculated or derived fields
Enrichment Operations:
Add Calculated Fields
const enrich = (record) => { // Add age of device record.days_since_acquisition = daysBetween(record.acquisition_date, new Date()); // Add conversion time if (record.converted_date && record.acquisition_date) { record.days_to_convert = daysBetween(record.acquisition_date, record.converted_date); } // Add delivery time if (record.presentation_date && record.converted_date) { record.days_to_deliver = daysBetween(record.converted_date, record.presentation_date); } // Flag slow conversions record.conversion_delayed = record.days_to_convert > 30; return record; };
Add Geographic Context
const add_region = (record) => { const regions = { "Triangle": ["Wake", "Durham", "Franklin"], "Northeast": ["Vance", "Granville", "Warren", "Halifax"], "Eastern": ["Wilson", "Edgecombe", "Martin", "Greene"] }; record.region = Object.keys(regions).find(region => regions[region].includes(record.county) ) || "Other"; return record; };
Add Donor Tier
const add_donor_tier = (record) => { if (record.donation_value > 1000) { record.donor_tier = "Major"; } else if (record.donation_value > 500) { record.donor_tier = "Significant"; } else { record.donor_tier = "Standard"; } return record; };
HTI-Specific Cleaning Workflows
Pre-Quarterly Report Cleaning
async function clean_for_quarterly_report(quarter, year) { // 1. Fetch quarter data const raw_data = await fetch_all_pages("object_1", { filters: quarterly_filter(quarter, year) }); // 2. Remove test records const production = await remove_test_records({ dataset: raw_data }); // 3. Deduplicate const unique = await deduplicate_records({ dataset: production.production_records, match_fields: ["serial_number"] }); // 4. Normalize const normalized = await normalize_fields({ dataset: unique.records, rules: all_normalizers }); // 5. Validate const validated = await validate_integrity({ dataset: normalized, rules: compliance_rules }); // 6. Fix auto-fixable issues const fixed = await fix_common_issues({ dataset: validated.valid_records }); return { clean_data: fixed, cleaning_summary: { original: raw_data.length, test_removed: production.test_records_removed, duplicates: unique.duplicates_found, validation_errors: validated.errors.length, auto_fixed: fixed.fixed } }; }
Pre-Dashboard Data Cleaning
async function clean_for_dashboard() { const all_data = await fetch_all_pages("object_1"); // Quick cleaning for real-time dashboard const cleaned = all_data .filter(r => !is_test_record(r)) .map(fix_whitespace) .map(fix_capitalization) .map(enrich); return cleaned; }
Data Quality Monitoring
Generate Data Quality Report
async function data_quality_report() { const all_records = await fetch_all_pages("object_1"); const report = { total_records: all_records.length, completeness: { with_donor: all_records.filter(r => r.donor).length, with_county: all_records.filter(r => r.county).length, with_serial: all_records.filter(r => r.serial_number).length }, accuracy: { valid_counties: all_records.filter(r => valid_counties.includes(r.county)).length, logical_dates: all_records.filter(r => validate_dates(r)).length }, duplicates: await find_duplicates(all_records), test_records: all_records.filter(is_test_record).length }; return report; }
Automated Quality Checks
// Run daily at 2 AM cron.schedule('0 2 * * *', async () => { const report = await data_quality_report(); if (report.duplicates.length > 10) { await alert_admin("High duplicate count detected"); } if (report.accuracy.valid_counties / report.total_records < 0.95) { await alert_admin("County data quality below threshold"); } });
Integration Points
- knack_reader: Fetch data for cleaning
- knack_pagination: Process large datasets
- knack_reporting_sync: Clean before report generation
- knack_dashboard_ai: Ensure metrics accuracy
- knack_goal_tracker: Validate progress data
- knack_filter_sort: Filter out invalid records
Best Practices
- Clean before every report: Don't trust raw data
- Log all cleaning operations: Audit trail for compliance
- Manual review for ambiguous cases: Don't auto-fix everything
- Run quality checks daily: Catch issues early
- Version cleaned datasets: Track data lineage
Grant Compliance
NCDIT Data Standards
- No test/dummy records in official reports
- All dates must be within grant period
- County assignments must be accurate (15-county requirement)
- Device lifecycle dates must be logical
- Training hours must have supporting documentation
Audit Trail
const cleaning_log = { timestamp: new Date(), operation: "quarterly_report_cleaning", quarter: "Q1", year: "2025", records_processed: 875, issues_found: 23, auto_fixed: 18, manual_review_needed: 5, performed_by: "automated_system" };