# qsv reproducible-analysis

Machine-readable journal format for reproducible data analysis operations.

Install the full repository:

```bash
git clone https://github.com/dathere/qsv
```

Or copy just this skill into your local skills directory:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/reproducible-analysis" ~/.claude/skills/dathere-qsv-reproducible-analysis && rm -rf "$T"
```

Source: `.claude/skills/skills/reproducible-analysis/SKILL.md`

# Reproducible Analysis

Maintain a machine-readable journal of every data operation so that humans, agents, and machines can independently verify the analysis end-to-end.
## Core Principle
Every analysis should be independently reproducible: given the same input files, a third party should be able to replay the exact sequence of operations and arrive at bit-identical results for all deterministic steps.
## Journal Format

Create a journal file named `<analysis-name>.journal.jsonl` alongside the analysis output. Each line is a JSON object representing one operation.

### Entry Schema

```json
{"seq": 1, "ts": "2026-03-19T14:30:00Z", "op": "index", "tool": "mcp__qsv__qsv_index", "input": "sales.csv", "input_sha256": "a1b2c3...", "input_rows": 50000, "input_cols": 12, "params": {}, "output": "sales.csv.idx", "output_sha256": "d4e5f6...", "duration_ms": 45, "note": "Create index for fast access"}
{"seq": 2, "ts": "2026-03-19T14:30:01Z", "op": "stats", "tool": "mcp__qsv__qsv_stats", "input": "sales.csv", "input_sha256": "a1b2c3...", "params": {"cardinality": true, "stats_jsonl": true}, "output": "sales.stats.csv", "output_sha256": "f7a8b9...", "duration_ms": 320, "note": "Generate stats cache with cardinality"}
```
### Required Fields

| Field | Type | Description |
|---|---|---|
| `seq` | integer | 1-based sequence number within the journal |
| `ts` | string | ISO 8601 UTC timestamp of when the operation ran |
| `op` | string | Human-readable operation name (e.g., "stats", "filter", "join") |
| `tool` | string or null | Exact MCP tool name used (e.g., `mcp__qsv__qsv_index`, `mcp__qsv__qsv_stats`); null for journal-level entries (`init`, `complete`) |
| `input` | string, array, or null | Input file path(s), relative to the working directory; null for journal-level entries |
| `input_sha256` | string, array, or null | SHA-256 hash(es) of input file(s); null for journal-level entries |
| `params` | object | All parameters passed to the tool (excluding input/output paths) |
| `output` | string or null | Output file path, or null if the result was displayed only |
| `output_sha256` | string or null | SHA-256 hash of the output file, or null |
| `duration_ms` | integer | Wall-clock execution time in milliseconds |
| `note` | string | Brief explanation of why this step was performed |
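Before replaying a journal, it can help to lint it against this schema. A minimal sketch, assuming `jq` is available and the journal is named `analysis.journal.jsonl`:

```bash
# List journal lines missing any required key (keys may be null, but must exist);
# empty output means every entry passes
BAD=$(jq -c '(["seq","ts","op","tool","input","input_sha256","params","output","output_sha256","duration_ms","note"] - keys) as $missing
  | select($missing | length > 0)
  | {seq, missing: $missing}' analysis.journal.jsonl)
[ -z "$BAD" ] && echo "journal OK" || { echo "entries missing required fields:"; echo "$BAD"; }
```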
### Optional Fields

| Field | Type | Description |
|---|---|---|
| `input_rows` | integer | Row count of the input |
| `input_cols` | integer | Column count of the input |
| `output_rows` | integer | Row count of the output |
| `output_cols` | integer | Column count of the output |
| `delta_rows` | integer | Rows added/removed (`output_rows - input_rows`) |
| `deterministic` | boolean | Whether this step produces identical output every run (default: true) |
| `ai_generated` | boolean | Whether this step involved AI inference (e.g., describegpt) |
| `sql` | string | Full SQL query text (for sqlp operations) |
| `error` | string | Error message if the operation failed |
| `qsv_version` | string | qsv version string (capture once at journal start) |
## How to Compute Hashes

Use `mcp__qsv__qsv_sqlp` or shell commands to compute SHA-256 hashes:

```bash
# Via shell (when available)
shasum -a 256 sales.csv | cut -d' ' -f1

# Via qsv sqlp (for CSV content hash)
# Hash the output file after each step
```
Alternatively, note the file size and row count as a lighter-weight fingerprint when hashing is impractical:
{"seq": 1, "input": "huge_file.csv", "input_fingerprint": {"rows": 5000000, "cols": 42, "bytes": 1073741824}, "note": "additional fields omitted for brevity"}
## Journal Lifecycle

### Starting a Journal

At the beginning of any analysis, create the journal and record the environment:

```json
{"seq": 0, "ts": "2026-03-19T14:29:55Z", "op": "init", "tool": null, "input": null, "params": {"working_dir": "/path/to/data", "qsv_version": "0.142.0 (polars-0.46.0)", "platform": "darwin-aarch64"}, "output": "analysis.journal.jsonl", "note": "Initialize reproducibility journal"}
```
### During Analysis

Log every data operation. For each step (a helper sketch follows this list):

- Record the entry after the operation completes (so you have the output hash and duration)
- Include the `note` field explaining the analytical reasoning; this is what makes the journal useful to human reviewers
- Mark `deterministic: false` for any AI-generated step (describegpt, chart selection, narrative)
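A sketch of such a logging step as a shell helper. `log_step` is a hypothetical name, not part of qsv; it assumes `jq` and `shasum` are available and leaves `params` empty for the caller to fill in:

```bash
# Hypothetical helper: hash input/output and append one journal entry
log_step() {
  local seq="$1" op="$2" tool="$3" input="$4" output="$5" duration_ms="$6" note="$7"
  jq -nc \
    --argjson seq "$seq" \
    --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --arg op "$op" --arg tool "$tool" \
    --arg input "$input" \
    --arg input_sha256 "$(shasum -a 256 "$input" | cut -d' ' -f1)" \
    --arg output "$output" \
    --arg output_sha256 "$(shasum -a 256 "$output" | cut -d' ' -f1)" \
    --argjson duration_ms "$duration_ms" \
    --arg note "$note" \
    '{seq: $seq, ts: $ts, op: $op, tool: $tool, input: $input,
      input_sha256: $input_sha256, params: {}, output: $output,
      output_sha256: $output_sha256, duration_ms: $duration_ms, note: $note}' \
    >> analysis.journal.jsonl
}

# Example invocation, mirroring the seq 2 entry above:
# log_step 2 "stats" "mcp__qsv__qsv_stats" sales.csv sales.stats.csv 320 "Generate stats cache"
```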
### Closing a Journal

At the end, write a summary entry:

```json
{"seq": 99, "ts": "2026-03-19T15:10:00Z", "op": "complete", "tool": null, "input": "sales.csv", "input_sha256": "a1b2c3...", "params": {"total_steps": 98, "deterministic_steps": 95, "ai_steps": 3, "final_output": "analysis_report.md"}, "output": "analysis.journal.jsonl", "note": "Analysis complete. 95 of 98 steps are deterministic and independently reproducible."}
```
## Verification Protocol

### For Humans

- Open the `.journal.jsonl` file
- Review each `note` field to understand the analytical reasoning
- Check that the sequence of operations makes logical sense
- Verify the `input_sha256` of the first entry matches your copy of the source data
- Spot-check any step by re-running the `tool` with the recorded `params` (see the snippet below)
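For instance, the first-entry hash check takes two commands, a sketch assuming `jq` and the `sales.csv` example above:

```bash
# Compare the journal's recorded input hash against your local copy of the data
EXPECTED=$(jq -r 'select(.seq == 1) | .input_sha256' analysis.journal.jsonl)
ACTUAL=$(shasum -a 256 sales.csv | cut -d' ' -f1)
[ "$EXPECTED" = "$ACTUAL" ] && echo "source data matches the journal" || echo "MISMATCH: source data differs"
```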
### For Agents

- Parse the `.journal.jsonl` file
- Verify the `seq` 0 entry to confirm environment compatibility (qsv version, platform)
- For each entry where `deterministic` is true (or absent):
  a. Compute the `sha256` of the input file; it must match `input_sha256`
  b. Execute the `tool` with the recorded `params`
  c. Compute the `sha256` of the output; it must match `output_sha256`
  d. If there is a mismatch, flag the step and stop
- For entries where `deterministic: false`, skip hash verification but log that the step was AI-generated
- Report: `N of M deterministic steps verified, K AI-generated steps skipped`
### For Machines (CI/CD)

```bash
#!/bin/bash
# replay-journal.sh: replay and verify a journal
JOURNAL="$1"
FAILURES=0

while read -r entry; do
  SEQ=$(echo "$entry" | jq -r '.seq')
  INPUT=$(echo "$entry" | jq -r '.input')
  EXPECTED=$(echo "$entry" | jq -r '.output_sha256')

  # Verify input hash
  ACTUAL_INPUT_HASH=$(shasum -a 256 "$INPUT" | cut -d' ' -f1)
  INPUT_HASH=$(echo "$entry" | jq -r '.input_sha256')
  if [ "$ACTUAL_INPUT_HASH" != "$INPUT_HASH" ]; then
    echo "FAIL step $SEQ: input hash mismatch"
    FAILURES=$((FAILURES + 1))
    continue
  fi

  # Re-execute and verify output hash (tool-specific replay logic here)
  # ...

  echo "PASS step $SEQ"
done < <(jq -c 'select(.seq > 0 and .op != "complete" and (.deterministic // true))' "$JOURNAL")
# Process substitution (not a pipe) keeps the FAILURES counter in the current shell

echo "$FAILURES failures"
exit $FAILURES
```
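In CI, run it as `./replay-journal.sh analysis.journal.jsonl` and fail the build on a non-zero exit status; the exit code is the number of failed steps.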
## Integration with Commands

When any `/data-*` command is invoked and the user requests reproducibility (or the output is a formal deliverable), maintain a journal:

| Command | Journal Approach |
|---|---|
| Profiling | Log every profiling step (index, sniff, stats, frequency, etc.) |
| Cleaning | Log each cleaning operation with before/after row counts |
| Joining | Log both inputs with hashes, the join parameters, and output verification |
| SQL querying | Log the full SQL query text in the `sql` field |
| Validation | Log each validation check and its pass/fail result |
| Charting | Log data preparation steps; mark chart generation as `deterministic: false` |
| Describing | Log the stats step as deterministic and the describegpt step as `ai_generated: true` |
| Conversion | Log input/output formats and hashes |
## Integration with GenAI Disclaimer

The journal complements the `genai-disclaimer` skill:

- The journal records what was done and enables replay
- The disclaimer communicates which parts are AI-generated vs. deterministic
- Together, they provide full transparency for stakeholders

Use the journal's `deterministic` and `ai_generated` flags to auto-generate the disclaimer's attribution table.
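A sketch of that derivation with `jq` (the output keys mirror the summary entry's `params`):

```bash
# Count deterministic vs. AI-generated steps across the whole journal
jq -s '{
  deterministic_steps: [.[] | select(.seq > 0 and .op != "complete" and (.deterministic // true))] | length,
  ai_steps:            [.[] | select(.ai_generated == true)] | length
}' analysis.journal.jsonl
```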
## Best Practices

- **Always hash inputs**: The input hash is the anchor for reproducibility; without it, verification is impossible
- **Log failures too**: If a step fails and you retry with different parameters, log both attempts (the failure with an `error` field, then the successful retry)
- **Include row count deltas**: `delta_rows` makes it easy to spot where data was filtered, joined, or deduplicated
- **Use relative paths**: All file paths should be relative to the working directory so the journal is portable
- **Version pin**: Record `qsv_version` in the init entry; different versions may produce different stats precision
- **One journal per analysis**: Don't append unrelated analyses to the same journal file
- **Commit journals to version control**: They're small (a few KB) and provide an audit trail