# neurotrace — Agent Skill
You are an AI agent that can run neurotrace, an activation-patching framework for encoder-only transformer models. Use this skill when a user wants to understand where factual knowledge is stored inside a HuggingFace MLM, or to compare how two models encode domain-specific facts.
## When to use this skill
- User asks "does [model] actually know [domain] facts?"
- User wants to compare a fine-tuned model against its base model
- User wants to locate knowledge neurons or test factual recall
- User asks about activation patching, causal tracing, or knowledge localisation
## Prerequisites
The repo must be installed:
```bash
git clone https://github.com/cristianleoo/NeuroTrace
cd NeuroTrace
uv venv && uv pip install -e .
source .venv/bin/activate
```
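Optionally, verify the install with a quick import check. These are the same entry points used in the steps below:

```python
# Smoke test: the imports used later in this skill should resolve.
from neurotrace import CausalTracer, plot_comparison
from neurotrace.agent_dataset import DatasetSpec, FactSpec, ModelSpec
```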
## Option A: Use the built-in Strands orchestrator agent
If you have a Strands-compatible LLM provider configured, you can delegate to the built-in orchestrator:
```python
from neurotrace.providers import get_model
from neurotrace.agents import create_orchestrator

model = get_model("anthropic", client_args={"api_key": "sk-..."})  # Also available: "openai", "bedrock", "gemini"
agent = create_orchestrator(model=model)
agent("Compare SecureBERT vs ModernBERT on cybersecurity knowledge")
```
The orchestrator has tools for dataset generation, tracing, charting, and interpretation. It will drive the full pipeline autonomously.
## Option B: Drive the pipeline yourself (step-by-step below)
## Step-by-step protocol
### Step 1: Gather information from the user
Ask the following questions (skip any the user has already answered):
- Domain: What knowledge domain? (e.g. cybersecurity, medicine, law, finance, programming)
- Goal: What do you want to find out? (e.g. "Does BioBERT store drug-disease associations differently from BERT?")
- Models: Which models to compare? Need display name + HuggingFace ID for each. At least one model, ideally two for comparison.
- Facts: What factual associations to test? For each fact, you need:
  - A sentence template with `{}` where the subject goes (e.g. "The attack that exploits {} injection")
  - The correct subject and an incorrect alternative
  - The target token the model should predict
  - A context word that anchors the fact, and a corruption for it
- Variants (optional): Custom prompt phrasing variants, or use defaults.
If the user provides a domain but no specific facts, generate 8–12 representative facts yourself covering diverse concepts in that domain. Ensure each fact has an unambiguous correct answer and a plausible corruption.
### Step 2: Build the DatasetSpec
Create a JSON spec and save it:
```python
from neurotrace.agent_dataset import DatasetSpec, FactSpec, ModelSpec, generate_from_spec

spec = DatasetSpec(
    domain="cybersecurity",
    description="Test whether SecureBERT stores cybersecurity facts differently from ModernBERT",
    models=[
        ModelSpec(name="ModernBERT", huggingface_id="answerdotai/ModernBERT-base"),
        ModelSpec(name="SecureBERT", huggingface_id="cisco-ai/SecureBERT2.0-base"),
    ],
    facts=[
        FactSpec(
            clean_subject="SQL",
            corrupt_subject="HTML",
            template="The attack that exploits input sanitization failures in a database is known as {} injection",
            target="SQL",
            context_from="database",
            context_to="email",
        ),
        # ... more facts ...
    ],
    output_dir="output",
)

dataset, config = generate_from_spec(spec, save_path="output/dataset.json")
```
### Step 3: Run the trace
```python
from neurotrace import CausalTracer
from neurotrace.tracer import save_results

results = []
results_dict = {}
for model_spec in spec.models:
    tracer = CausalTracer(model_spec.name, model_spec.huggingface_id)
    result = tracer.trace(dataset)
    results.append(result)
    results_dict[model_spec.name] = result.to_dict()
    tracer.release()

save_results(results, f"{spec.output_dir}/trace_results.json")
```
### Step 4: Generate charts
```python
from neurotrace import plot_comparison, plot_patching_flow, plot_heatmap, plot_peak_bars

out = spec.output_dir

# Line chart — mask-patch restoration per layer
plot_comparison(results_dict, curve="mask_patch",
                title="Mask-Patch: Probability Restoration by Layer",
                save_path=f"{out}/fig1_mask_patch.png")

# Line chart — context-patch restoration per layer
plot_comparison(results_dict, curve="context_patch",
                title="Context-Patch: Probability Restoration by Layer",
                save_path=f"{out}/fig2_context_patch.png")

# Patching flow diagram
peak = results[0].peak_layer()[0]
plot_patching_flow(num_layers=results[0].num_layers, highlighted_layer=peak,
                   save_path=f"{out}/fig3_flow.png")

# Heatmap
plot_heatmap(results_dict, save_path=f"{out}/fig4_heatmap.png")

# Peak comparison bars
plot_peak_bars(results_dict, save_path=f"{out}/fig5_peaks.png")
```
### Step 5: Interpret the results
Report to the user:
- Peak layer: Which layer showed the highest restoration score for mask-patching? This is where the model stores the domain knowledge.
- Peak magnitude: How strong was the signal? Larger = the model relies more on context to retrieve the fact.
- Context-patch vs mask-patch: If context-patch is flat (near zero), the model stores facts directly at the prediction position, not via context tokens.
- Clean-minus-corrupt baseline (`clean_minus_corrupt` in the results): If positive, clean context helps. If negative, the model's knowledge is robust enough that corruption doesn't hurt — a sign of strong factual entrenchment.
- Cross-model comparison: If two models peak at the same layer but with different magnitudes, the one with the smaller peak may actually have stronger knowledge (less to restore because corruption matters less).
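A short sketch of how to pull these numbers from the results for the report. Only `peak_layer()[0]` (the layer index) is documented by the charting code above, so the sketch prints the full return value rather than assuming what else it contains:

```python
# Summarize each model's mask-patch peak for the report.
for model_spec, result in zip(spec.models, results):
    peak = result.peak_layer()  # peak[0] is the layer index (see Step 4)
    print(f"{model_spec.name}: {result.num_layers} layers, mask-patch peak: {peak}")
```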
## DatasetSpec JSON schema
{ "domain": "string (required)", "description": "string (required)", "models": [ { "name": "string", "huggingface_id": "string" } ], "facts": [ { "clean_subject": "string", "corrupt_subject": "string", "template": "string with {} placeholder", "target": "string", "context_from": "string", "context_to": "string" } ], "variants": [["prefix", "suffix"], ...], "output_dir": "string (default: 'output')" }
## Fact design guidelines
When generating facts for the user:
- The template MUST contain `{}` where the subject goes
- `clean_subject` should be the unambiguously correct answer
- `corrupt_subject` should be a plausible but wrong alternative from the same category
- `target` is the token the model predicts (usually the same as `clean_subject`)
- `context_from` is a word in the template that supports the correct answer
- `context_to` replaces it with something that shifts meaning but keeps the sentence grammatical
- Aim for 8–12 diverse facts per domain, covering different sub-topics
- Each fact should be testable as a fill-in-the-blank: "...is known as [MASK]..."
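For instance, a fact following these guidelines for a hypothetical medical-domain run might look like this (field names as in Step 2; the specific values are illustrative):

```python
FactSpec(
    clean_subject="insulin",      # unambiguously correct answer
    corrupt_subject="glucagon",   # plausible, same category, wrong
    template="The hormone that lowers blood glucose levels is known as {}",
    target="insulin",
    context_from="glucose",       # anchors the fact
    context_to="pressure",        # shifts meaning, stays grammatical
)
```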
## Output files
After a run, the output directory will contain:
| File | Contents |
|---|---|
| `dataset.json` | The generated prompt pairs |
| config JSON (from `generate_from_spec`) | Models, domain, metadata |
| `trace_results.json` | Per-layer restoration scores |
| `fig1_mask_patch.png` | Mask-patch line chart |
| `fig2_context_patch.png` | Context-patch line chart |
| `fig3_flow.png` | Patching flow diagram |
| `fig4_heatmap.png` | Layer × model heatmap |
| `fig5_peaks.png` | Peak layer bar chart |
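A small sanity check after a run. The filenames are the ones produced by the code in Steps 2–4:

```python
from pathlib import Path

# Verify the expected artifacts were written to the output directory.
out = Path(spec.output_dir)
expected = [
    "dataset.json", "trace_results.json",
    "fig1_mask_patch.png", "fig2_context_patch.png",
    "fig3_flow.png", "fig4_heatmap.png", "fig5_peaks.png",
]
missing = [name for name in expected if not (out / name).exists()]
if missing:
    print(f"Missing outputs: {missing}")
```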
## Error handling
- If a model fails to load, `CausalTracer.__init__` will raise an exception. Catch it and report the HuggingFace ID to the user — they may need to `pip install` a specific tokenizer or use `trust_remote_code=True`.
- If the `[MASK]` token is not found, the tracer falls back to finding the subject token position. This works but is less precise for MLMs.
- Running on CPU is fine for base-size models (~120M params). For large models, suggest the user use a GPU.
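A defensive variant of the Step 3 loop (a sketch, not part of the library) that reports load failures per model instead of aborting the whole run:

```python
for model_spec in spec.models:
    try:
        tracer = CausalTracer(model_spec.name, model_spec.huggingface_id)
    except Exception as exc:
        # Surface the failing HuggingFace ID so the user can fix the environment.
        print(f"Could not load {model_spec.name} ({model_spec.huggingface_id}): {exc}")
        continue
    try:
        result = tracer.trace(dataset)
        results.append(result)
        results_dict[model_spec.name] = result.to_dict()
    finally:
        tracer.release()  # free memory even if tracing fails
```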