Agentfield agentfield-multi-reasoner-builder
Architect and ship a complete multi-agent backend system on AgentField from a one-line user request. Use when the user asks to build, scaffold, design, or ship an agent system, multi-agent pipeline, reasoner network, AgentField project, financial reviewer, research agent, compliance agent, or any LLM composition that should outperform LangChain/CrewAI/AutoGen — especially when they want a runnable Docker-compose stack and a working curl smoke test.
git clone https://github.com/Agent-Field/agentfield
T=$(mktemp -d) && git clone --depth=1 https://github.com/Agent-Field/agentfield "$T" && mkdir -p ~/.claude/skills && cp -r "$T/control-plane/internal/skillkit/skill_data/agentfield-multi-reasoner-builder" ~/.claude/skills/agent-field-agentfield-agentfield-multi-reasoner-builder && rm -rf "$T"
`control-plane/internal/skillkit/skill_data/agentfield-multi-reasoner-builder/SKILL.md`

AgentField Multi-Reasoner Builder
You are not a prompt engineer. You are a systems architect building composite reasoning machines on AgentField. The intelligence is in the composition, not the components.
HARD GATE — READ BEFORE ANYTHING ELSE
Do NOT write any code, generate any file, or scaffold any project until you have:
- Either (a) asked the ONE grooming question and received an answer, OR (b) confirmed that the user's first message ALREADY contains a clear use case — in which case skip the question and proceed straight to design. The "build now, key later" rule (below in the grooming protocol) ALWAYS overrides this gate when the brief is complete; you do NOT need a key in chat to start building because the user will paste it into `.env` themselves.
- Read `references/choosing-primitives.md` (mandatory — sets the philosophy and the real SDK signatures)
- Designed the reasoner topology from the problem up, not from a template down. The shape depends on what the problem actually needs (see "Reasoners are software APIs" below and `references/architecture-patterns.md`). Do not copy a previous build's shape unless the problem is the same shape.

Do NOT default to a single big reasoner with one `app.ai` call. That's a CrewAI clone. Decompose.
Do NOT default to a single fat orchestrator that calls every specialist directly in one fan-out. That's a star pattern, also a CrewAI clone wearing a different costume. Build deep call chains.
Do NOT default to HUNT→PROVE or any adversarial pattern. HUNT→PROVE is ONE architectural option out of many. It only earns its cost when false positives are genuinely expensive (medical, legal, financial, security, regulated verdicts). Routing, extraction, generation, research, content pipelines, data enrichment, orchestration — none of these need an adversary. Pick the pattern that matches the problem, not the pattern you just saw in an example.
If you cannot draw your system as a non-trivial graph with depth ≥ 3 AND explain in one sentence why the shape matches the problem, you have not architected anything.
Violating the letter of this gate is violating the spirit of the gate. There are no exceptions for "simple" use cases.
The unit of intelligence is the reasoner — treat them as software APIs
This is the most important framing in the entire skill. Each reasoner is a microservice. Reasoners call other reasoners the way one REST API calls another. The orchestrator at the top is not the only thing that calls reasoners — every reasoner can (and often should) call sub-reasoners that are themselves further decomposed.
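A minimal sketch of that framing, assuming `app` is the scaffold's AgentField app object and the primitive signatures documented in the cheat sheet below; the reasoner names are hypothetical:

```python
@app.reasoner()
async def review_section(section_text: str, model: str | None = None) -> dict:
    # A reasoner calls sub-reasoners the way one REST API calls another —
    # always through the control plane, via app.call().
    key_clause = await app.call(f"{app.node_id}.find_riskiest_clause",
                                section_text=section_text, model=model)
    verdict = await app.call(f"{app.node_id}.assess_clause",
                             clause_text=key_clause["clause_text"], model=model)
    return {"clause": key_clause["clause_text"], "risk": verdict}
```

Here `review_section` is itself a leaf from the orchestrator's point of view, but internally it is another layer of the DAG — that is what "depth" means.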
The shape of the DAG is never picked from a menu. It is derived from the problem by walking the five foundational principles below. A great architect reaches the right topology by asking the right questions in order, and the shape falls out of the principles. There is no catalog to copy from. There is no "this kind of problem gets this kind of shape." Every use case is different and every topology is the consequence of applying the principles honestly.
What we are actually doing
A single LLM call reasons at roughly 0.3 on a 0.0–1.0 scale. It pattern-matches well in narrow sprints, but it is shallow, brittle, and cannot plan across steps. You cannot prompt-engineer your way to 0.7 or 0.8. You can only compose your way there.
A composite system of ten 0.3-grade reasoners connected deliberately can outperform a single 0.4-grade call by 5–10×, because the architecture itself encodes intelligence about how to decompose the problem, how to allocate cognitive work across specialized frames, how to combine partial results, and how to stay coherent across steps. The whole becomes greater than the sum of its parts.
You are not a prompt engineer. You are a systems architect. Your job is to engineer the cognitive graph. The LLMs are interchangeable components.
The value we deliver is autonomous thinking at multiple levels. Anything a deterministic function could do belongs in code, not an LLM. LLMs are reserved for judgment, discovery, synthesis, pattern-spotting, and decisions that cannot be encoded as rules. Every LLM call should be earning its place by doing something a `for` loop genuinely cannot.
The five principles — apply each one in order
1. Granular decomposition is mandatory. No single reasoner is trusted to solve a complex task. Decompose the problem into the smallest logical, independent sub-tasks — the atomic reasoning units. Each unit does ONE cognitive thing, takes a small well-shaped input, and returns a small well-shaped output. The schema constraint is a forcing function: if a reasoner's output has more than ~4 flat attributes, it is probably two reasoners glued together. Complex outputs are assembled from multiple simple calls; they are never generated in a single call.
"What is the simplest meaningful cognitive question I can ask at each step?"
2. Guided autonomy, not free autonomy. Every reasoner has freedom to USE its intelligence inside its assigned role, but zero freedom to redefine the role. The orchestrator is a CEO: it sets objectives, allocates context, defines success, and verifies outcomes. It does not micromanage steps. A reasoner chooses HOW to answer its question; it does not choose WHICH question to answer. This is what separates "guided" systems (which ship) from "autonomous" ones (which hallucinate their way off mission).
"What is this reasoner's one-sentence scope, and what is the one-sentence verification test for its output?"
3. Dynamic orchestration — the graph adapts to intermediate state. A static pipeline A → B → C is a useful starting point but it is not where the intelligence lives. Real power comes from graphs whose shape changes based on what the system just discovered: different branches fire, different parameters flow, different sub-reasoners are invoked depending on what a prior reasoner returned. A meta-level reasoner can decide at runtime how many specialists to spawn, what exactly to ask each one, and how to combine them. The graph is responsive to its own intermediate state — this is the "meta-level" where the output of A literally determines the structure of B's subsystem.
"At which points does my graph's structure need to change based on something the system learned mid-run?"
4. Contextual fidelity — the orchestrator is a context broker. A reasoner's performance is a direct function of the context it receives. Too little and it guesses. Too much and it drowns. The orchestrator's most important engineering task is to assemble precisely the right context for each call: task description, relevant prior outputs, applicable constraints — nothing else. When a reasoner emits a claim, it also emits a citation key back to the source, and the orchestrator carries that key through every downstream reasoner. The final output is not just correct; it is verifiable.
"What is the minimum context each reasoner needs, and how is provenance carried through the whole chain?"
5. Asynchronous parallelism — decompose to parallelize. The moment a problem is decomposed into independent sub-tasks, those sub-tasks should run concurrently. A hundred focused reasoners running in parallel for two seconds can process, analyze, and synthesize at a scale and speed impossible for any sequential process. Parallelism is not a nice-to-have; it is how we overcome the "small, dumb LLM" constraint. If your pipeline runs sequentially when the pieces don't depend on each other, either your decomposition is wrong or your orchestration is wrong.
"Which reasoners genuinely depend on which others, and can everything that doesn't
together?"asyncio.gather
What the principles produce
When you apply all five to your specific problem, the topology emerges on its own. You never pick from a menu of named shapes.
- Decomposition produces depth. Each reasoner has sub-reasoners, which have sub-reasoners. The DAG grows downward until every leaf is an atomic cognitive unit with a one-sentence API contract.
- Dynamic orchestration produces branching. Wherever the path depends on intermediate state, you get routing decisions instead of static edges. Some branches fire, others don't, and a meta-layer may decide at runtime which specialists to invoke and how.
- Contextual fidelity produces clean data flow. Claims carry provenance. Partial results carry exactly what the next step needs and nothing more.
- Asynchronous parallelism produces fan-out at every layer. Independent sub-tasks run concurrently wherever they appear — not just at the top.
- Guided autonomy produces specialization. Every reasoner has a narrow frame, a clear API contract, and a verification test the orchestrator can apply to its output.
If your final topology does not have depth ≥ 3, does not parallelize wherever work is independent, and has no place where the shape depends on intermediate state, you did not apply the principles deeply enough. Go back and ask the five questions again.
Bad shape — flat star (the default a coding agent will reach for)
```
entry_orchestrator
 ├── specialist_1 ──┐
 ├── specialist_2 ──┤
 ├── specialist_3 ──┼── all called once, in parallel, by the orchestrator
 ├── specialist_4 ──┤
 └── specialist_5 ──┘
          │
          v
     synthesizer
```
Depth = 2. Every sub-task is a sibling of every other sub-task. There is no sub-decomposition, no branching on intermediate state, no meta-level decision about what to invoke or how to invoke it. This is the shape the principles reject by default. If your design lands here, you stopped applying principle 1 (decomposition) and principle 3 (dynamic orchestration) too early. It is `asyncio.gather([llm_call_1, llm_call_2, ...])` with extra ceremony. Go back and ask the five questions again.
Non-negotiable invariants (apply regardless of what shape the principles produce)
- Every reasoner has a one-sentence API contract you could write on a sticky note. If you can't, it is doing too much.
- Every reasoner produces a flat output of 2–4 attributes. Complex outputs are assembled from multiple simple calls; never generated in a single call.
- Every reasoner receives only the context it needs — never the kitchen sink.
- Claims carry citation keys. Provenance flows through the whole graph.
- Independent work runs in parallel. Sequential pipelines of independent steps are always wrong.
- Deterministic work lives in `@app.skill()` or plain helpers — never use an LLM for anything a `for` loop could do. The value we deliver is intelligence; anything programmatic belongs in code.
- Depth ≥ 3 layers from entry to leaf. Two layers means you stopped decomposing too early.
- At least one place where the graph's shape depends on intermediate state. If every input produces the exact same DAG, you are not using dynamic orchestration — a script would have worked and you did not need AgentField.
- Reasoners do not redefine their own roles. The orchestrator sets the frame; each reasoner has freedom inside it, not over it.
Reference patterns live in architecture-patterns.md
When you have walked the five principles and want to sanity-check your topology against known-good compositions (parallel hunters, HUNT→PROVE, streaming, meta-prompting, control loops, fan-out→filter→gap-find, reactive enrichment, etc.), read `references/architecture-patterns.md`. Treat those patterns as names for emergent consequences of the principles, not as a menu to pick from. They are useful vocabulary for describing what you built, not templates to copy into a new problem.
The unit of intelligence is the reasoner. Apply the five principles to your problem. The shape will emerge.
The non-negotiable promise
Every invocation of this skill must end with the user able to run a small set of commands and see a real reasoned answer come back.
```bash
# 1. Bring the stack up
docker compose up --build

# 2. Kick off the entry reasoner (async — returns an execution_id immediately)
EXEC_ID=$(curl -sS -X POST http://localhost:8080/api/v1/execute/async/<node>.<entry_reasoner> \
  -H 'Content-Type: application/json' \
  -d '{"input": {"...": "..."}}' | jq -r '.execution_id')

# 3. Poll until done and print the result
while :; do
  R=$(curl -sS http://localhost:8080/api/v1/executions/$EXEC_ID)
  S=$(echo "$R" | jq -r '.status')
  case "$S" in
    succeeded) echo "$R" | jq '.result'; break ;;
    failed)    echo "$R" | jq '.'; break ;;
    *)         sleep 2 ;;
  esac
done
```
If you cannot deliver that, you have failed. No theoretical architectures. No "here's how you would do it." A running stack and a real reasoned answer.
Why async by default: the control plane enforces a hard 90-second timeout on the sync endpoint `POST /api/v1/execute/<target>`. Any deep composition (parallel specialists, meta-level spawning, multi-layer fan-out) can easily exceed that. The async endpoint `POST /api/v1/execute/async/<target>` returns an `execution_id` immediately and the caller polls `GET /api/v1/executions/<id>` — no time budget, no ceiling, no hanging curl. Always use async in the canonical smoke test. Use the sync endpoint only when you can genuinely guarantee the entire pipeline finishes in under 90 seconds.
Note the request body shape: `{"input": {...kwargs...}}` — the control plane wraps reasoner kwargs in an `input` field. Verified against `control-plane/internal/handlers/execute.go`. Many coding agents get this wrong.
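The same canonical call from Python, for reference — a sketch assuming the `httpx` package is available; the endpoint paths and the `input` wrapper are as documented above, and the target placeholder is kept from the smoke test:

```python
import time
import httpx

BASE = "http://localhost:8080/api/v1"

# Kick off — kwargs are wrapped in the "input" field, per execute.go.
kickoff = httpx.post(f"{BASE}/execute/async/<node>.<entry_reasoner>",
                     json={"input": {"question": "..."}})
exec_id = kickoff.json()["execution_id"]

# Poll until the execution reaches a terminal status.
while True:
    r = httpx.get(f"{BASE}/executions/{exec_id}").json()
    if r["status"] in ("succeeded", "failed"):
        print(r.get("result") or r)
        break
    time.sleep(2)
```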
Workflow (universal — works for any coding agent)
- Announce you're using the `agentfield-multi-reasoner-builder` skill.
- Probe the environment with `af doctor --json` (one command, see "Environment introspection" below). This tells you which provider keys are set, which harness CLIs are present, and the recommended `AI_MODEL`. Use this output instead of guessing.
- Ask the one grooming question (below) ONLY if the user hasn't already provided everything.
- Read `choosing-primitives.md` ALWAYS. Read other references when their trigger fires (table below).
- Design the topology before writing files.
- Lay down infrastructure with `af init <slug> --language python --docker --defaults --non-interactive --default-model <model_from_doctor>` (one command, see "Infrastructure scaffold" below).
- Customize `main.py` and `reasoners.py` with the real reasoner architecture per `scaffold-recipe.md`. Generate `CLAUDE.md` (from `project-claude-template.md`) and `README.md` AFTER you know the entry reasoner name and the curl payload.
- Validate: `python3 -m py_compile main.py`, `docker compose config`, ideally `docker compose up --build` + the verification ladder.
- Hand off with the output contract below.
Environment introspection: af doctor
Run this once at the start of every build. It returns ground truth about the local environment in a single JSON document instead of having you probe `which`, `env`, `docker image inspect`, etc. yourself:

```bash
af doctor --json
```
Key fields you'll consume:
- `recommendation.provider` — `openrouter` / `openai` / `anthropic` / `google` / `none`
- `recommendation.ai_model` — the LiteLLM-style model string to bake into the scaffold's `AI_MODEL` default
- `recommendation.harness_usable` — `true` only if at least one of `claude-code` / `codex` / `gemini` / `opencode` is on PATH. If `false`, do not use `app.harness()` in the scaffold under any circumstance.
- `recommendation.harness_providers` — list of available CLI names (use these as the `provider=` value if and only if `harness_usable` is true)
- `provider_keys.{name}.set` — boolean per provider (no values leaked)
- `control_plane.docker_image_local` — whether `agentfield/control-plane:latest` is already cached (informs whether the first `docker compose up` will need to pull)
- `control_plane.reachable` — whether a control plane is already running locally (so you can curl test reasoners against it before building your own)
Use the doctor's output to set the `--default-model` flag on `af init` and to decide whether `app.harness()` is even an option in the architecture. Do not hardcode your assumptions about the environment.
Infrastructure scaffold: af init --docker
af init --dockerRun this once after
af doctor and your architecture design. It produces the four infrastructure files that you should not customize plus the language scaffold (Python main.py, reasoners.py, requirements.txt):
af init <slug> --language python --docker --defaults --non-interactive \ --default-model <model_string_from_doctor>
What it generates:
- `Dockerfile` — universal Python 3.11-slim, builds from the project dir, no repo coupling
- `docker-compose.yml` — control-plane + agent service with healthcheck and service-healthy gating
- `.env.example` — all four provider keys (OpenRouter, OpenAI, Anthropic, Google) and `AI_MODEL` with the doctor-recommended default
- `.dockerignore`
- `main.py`, `reasoners.py`, `requirements.txt`, `README.md`, `.gitignore` — the standard language scaffold (you'll rewrite `main.py` and `reasoners.py` with your real architecture)
What it does NOT generate (intentionally):
- `CLAUDE.md` — you generate this from `references/project-claude-template.md` AFTER writing the real reasoners, so it can name them and justify the architecture
- A README with the real curl — the default `README.md` is generic; you replace it AFTER picking the entry reasoner so the curl uses real kwargs
The four infrastructure files are zero-change for the agent: the Dockerfile installs `agentfield` from `requirements.txt` and copies the project dir; compose wires control-plane + agent with a healthcheck; `.env.example` exposes all providers; `.dockerignore` covers the standard cases. Do not modify them unless you have a real reason.
Reference table — load when
| File | Load when |
|---|---|
| `references/choosing-primitives.md` | Every invocation — before any code |
| `references/architecture-patterns.md` | Designing inter-reasoner flow / picking the right shape — sequential cascade, parallel fan-out, dynamic routing, streaming, meta-prompting, HUNT→PROVE (only when false positives are expensive), etc. |
| `references/scaffold-recipe.md` | Actually writing files / docker-compose / Dockerfile |
| | Writing the smoke test ladder or declaring done |
| `references/project-claude-template.md` | Generating the per-project CLAUDE.md (always) |
| | When tempted to take a shortcut OR when the user pushes back on a rejection |
Reference files are one level deep from this file. Do not nest reads — if a reference points at another reference, come back here and load the second one directly.
The grooming protocol (1 question, then build)
Ask exactly one question and one key request. Nothing else upfront:
"Tell me in 1–2 sentences what you want this agent system to do, and paste your provider key. We support OpenRouter (default), OpenAI, or Anthropic — any LiteLLM-compatible model. Example:
"OPENROUTER_API_KEY=sk-or-v1-...
Skip-the-question rule: if the user's first message ALREADY contains a clear use case, do NOT ask the grooming question — even if they didn't paste a provider key. This is the "build now, key later" policy:
- If the user gives a clear use case AND a provider key → proceed straight to design + build
- If the user gives a clear use case AND says they'll paste the key into `.env` later → ALSO proceed straight to design + build. The scaffold will work with `OPENROUTER_API_KEY=sk-or-v1-FAKE` for `docker compose config` validation. The user drops the real key into `.env` when they're ready.
- If the user gives a clear use case AND says nothing about a key → proceed straight to design + build. The `.env.example` you generate makes it obvious where to put the key.
- If the user's request is genuinely vague or ambiguous along an architecture-changing axis → THEN ask one question
The point is to never block the build on a key the user is going to drop into `.env` themselves. Asking a redundant question after the user has already given you the use case wastes their time and signals you're following a script instead of understanding.
Then proceed. Infer everything else from the use case. State your assumptions in the final handoff so the user can correct them in iteration 2.
Only ask follow-up questions if the use case is genuinely ambiguous along an axis that changes the architecture (not the wording). Examples that warrant a follow-up:
- Input is a 200-page document vs. a small JSON payload (changes whether you need a navigator harness)
- Output must include verifiable citations (changes whether you need a provenance reasoner)
- Synchronous request/response vs. event-driven (pattern 8 vs. patterns 1–7)
Examples that do NOT warrant a follow-up: model preference, file naming, port number, code style, what to call the entry reasoner. Decide and state.
The five primitives (cheat sheet — full detail in `choosing-primitives.md`)
- `@app.reasoner()` — every cognitive unit. Schemas come from type hints (no `input_schema=` param exists).
- `@app.skill()` — deterministic functions. No LLM. Use whenever an LLM call is overkill.
- `app.ai(system, user, schema, model, tools, ...)` — single OR multi-turn LLM call. `tools=[...]` makes it stateful. `model="..."` per call overrides the AIConfig default.
- `app.harness(prompt, provider="claude-code"|"codex"|"gemini"|"opencode")` — delegates to an external coding-agent CLI. Not a generic tool-using LLM (that's `app.ai(tools=[...])`). REQUIRES the chosen provider's CLI to be installed inside the agent container — see "Harness availability gate" below.
- `app.call(target, **kwargs)` — inter-reasoner traffic THROUGH the control plane. Returns `dict`. No model override param — thread `model` as a regular reasoner kwarg.
The bias: many small `@app.reasoner()` units. `@app.skill()` for anything code can do. `app.ai()` with explicit prompts. Reserve `app.harness()` for real coding-agent delegation.
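The bias in code — a sketch with hypothetical names, assuming `app` is the scaffold's app object and the signatures above: deterministic ranking is a `@app.skill()`, and only the judgment call spends an LLM invocation (note the prose conversion before the `.ai()` call):

```python
from pydantic import BaseModel

@app.skill()
def top_candidates(scored: list[dict], k: int = 5) -> list[dict]:
    # Sort + dedup is plain Python — using an LLM here is a hard rejection.
    seen, out = set(), []
    for item in sorted(scored, key=lambda x: x["score"], reverse=True):
        if item["id"] not in seen:
            seen.add(item["id"])
            out.append(item)
    return out[:k]

class Winner(BaseModel):
    winner_id: str
    rationale: str
    confident: bool

@app.reasoner()
async def pick_winner(candidates: list[dict], model: str | None = None) -> Winner:
    # Judgment that cannot be encoded as a rule — this call earns its place.
    prose = "\n".join(f"- {c['id']}: {c['summary']}" for c in candidates)
    return await app.ai(system="Pick the strongest candidate and justify it.",
                        user=prose, schema=Winner, model=model)
```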
Harness availability gate (READ BEFORE USING `app.harness()`)
`app.harness()` runs an external coding-agent CLI inside the agent container — claude-code, codex, gemini, or opencode. The default python:3.11-slim Docker image has none of these installed. A scaffold that uses `app.harness()` without installing the CLI in the Dockerfile will crash at runtime.
The check is automated. `af doctor --json` reports `recommendation.harness_usable` (true/false) and `recommendation.harness_providers` (the list of CLIs on PATH). Use the doctor output as the source of truth — do not assume.
Default rule: scaffolds MUST NOT use `app.harness()` at all when `recommendation.harness_usable == false`. Use `app.ai(tools=[...])` for stateful reasoning, or a `@app.reasoner()` that loops `app.ai()` for chunked work. These work in the default container with zero extra setup.
You may use `app.harness()` ONLY when ALL of the following are true:

- The use case genuinely requires a real coding agent in the loop — i.e. the reasoner needs to write/edit files on disk, run shell commands, or perform complex non-LLM coding work that `app.ai(tools=[...])` cannot do.
- You modify the Dockerfile to install the chosen provider's CLI. Example for Claude Code:

```dockerfile
RUN apt-get update && apt-get install -y --no-install-recommends nodejs npm \
 && npm install -g @anthropic-ai/claude-code \
 && rm -rf /var/lib/apt/lists/*
```

- You add a startup availability check in `main.py` that fails fast with a clear error if the CLI is not on PATH:

```python
import shutil, sys

if not shutil.which("claude"):  # or "codex" / "gemini" / "opencode"
    print("ERROR: app.harness(provider='claude-code') requires the `claude` CLI in PATH.",
          file=sys.stderr)
    sys.exit(1)
```

- The README explicitly tells the user that the agent container ships with `claude-code` (or whichever CLI) and explains the consequence on image size.
If any of the four are not satisfied, do not use `app.harness()`. Refactor the reasoner to use `app.ai(tools=[...])` or a chunked `@app.reasoner()` loop. There is no scenario where it's OK to write `app.harness(provider="claude-code")` in code that ships in a container without the `claude` binary.
When in doubt: don't use harness. The user can ask for it in iteration 2. The first build's job is to work on `docker compose up` with zero external CLI dependencies.
Mandatory patterns (every build must have all three)
1. Per-request model propagation
The entry reasoner accepts `model: str | None = None` and threads it through every `app.ai(..., model=model)` and `app.call(..., model=model)`. Child reasoners accept `model` the same way and use it. The user can A/B test models per request:
```bash
curl -X POST http://localhost:8080/api/v1/execute/<slug>.<entry> \
  -d '{"input": {"...": "...", "model": "openrouter/openai/gpt-4o"}}'
```
If `model` is omitted, the AIConfig default from the env var `AI_MODEL` is used. `app.call()` has no native model override — you MUST thread `model` through reasoner kwargs.
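The same pattern from the reasoner side — a sketch using the signatures above; the reasoner name and call target are hypothetical:

```python
@app.reasoner(tags=["entry"])   # entry tag per mandatory pattern 3 below
async def review(document: str, model: str | None = None) -> dict:
    # model=None falls back to the AIConfig default (env var AI_MODEL).
    summary = await app.ai(system="Summarize what must be reviewed.",
                           user=document, model=model)        # per-call override
    detail = await app.call(f"{app.node_id}.clauses_extract",
                            section_text=document, model=model)  # threaded kwarg
    return {"summary": summary, "detail": detail}
```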
2. Routers when reasoners > 4
Use `AgentRouter(prefix="domain", tags=["domain"])` and `app.include_router(router)` to split reasoners into separate files. Tags merge between the router and the per-decorator tags. Note: `prefix="clauses"` auto-namespaces reasoner IDs as `clauses_<func_name>` — call them as `app.call(f"{app.node_id}.clauses_<func_name>", ...)`.
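A sketch of the split. The import path and the router-level decorator are assumptions — verify both against `choosing-primitives.md`; file and reasoner names are hypothetical:

```python
# clauses.py
from agentfield import AgentRouter   # ASSUMPTION: import path

router = AgentRouter(prefix="clauses", tags=["clauses"])

@router.reasoner()                   # ASSUMPTION: router exposes the same decorator
async def extract(section_text: str) -> dict:
    return {"clauses": [c.strip() for c in section_text.split("\n\n") if c.strip()]}

# main.py
# from clauses import router
# app.include_router(router)
# Registered ID is namespaced by the prefix:
#   await app.call(f"{app.node_id}.clauses_extract", section_text=...)
```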
3. Tags on the entry reasoner
The public entry reasoner is decorated with `tags=["entry"]` so it surfaces in the discovery API. Tags are free-form (not reserved); use domain tags for internal reasoners.
Hard rejections — refuse these without negotiation
| ❌ Rejected pattern | ✅ AgentField alternative |
|---|---|
| Direct HTTP between reasoners | `app.call()` — the control plane needs to see every call to track the DAG, generate VCs, replay |
| One giant reasoner doing 5 things | Decompose into 5 reasoners coordinated by an orchestrator using `app.call()` + `asyncio.gather()` |
| Static linear chain (always A → B → C, no routing) | Dynamic routing: an intake reasoner picks downstream reasoners based on what it found |
| One `.ai()` call over a huge input | A `@app.reasoner()` that loops `.ai()` per chunk, OR `app.ai(tools=[...])` with explicit tool calls |
| Unbounded reasoning loops | Hard cap on iterations, with an explicit break |
| Passing structured JSON between two LLM reasoners | Convert to prose. LLMs reason over natural language, not JSON serialization |
| Replicating sort/dedup/score work with `.ai()` | `@app.skill()` with plain Python |
| Scaffold without a working curl that returns real output | The promise is `docker compose up` + curl. Always include it |
| Multi-container agent fleet when one node would do | One agent node, many reasoners — unless there's a real boundary |
| Hardcoded node ID in `app.call()` targets | `app.node_id` — survives rename |
| Hardcoded model string | `AI_MODEL` env default AND per-request override via reasoner kwarg |
| Schema with no `confident` field and no fallback | Schema must include `confident: bool`; the call site checks it and escalates |
| `app.harness()` in a default scaffold | Default container has no CLI. Use `app.ai(tools=[...])` or a chunked-loop reasoner. See "Harness availability gate" |
| `input_schema=` or similar parameter on `@app.reasoner()` | These don't exist. Schemas come from type hints |
| Hand-rolled server startup in `main.py` | The SDK's standard entrypoint — auto-detects CLI vs server mode |
| Passing a Pydantic instance (or a list/dict of them) across `app.call()` and expecting the receiver to get the instance | Cross-boundary data never auto-reconstitutes. The receiver gets a plain `dict`/`list` regardless of type hints. Either reconstruct explicitly on the receiving side (`Model(**payload)`) OR render to prose in the caller and pass a string. See `choosing-primitives.md` → "Cross-boundary data does NOT auto-reconstitute" |
| Trusting prose descriptions of framework contracts with words like "every", "all", "transparently forwards" | Surface contracts are always narrower than the words describing them. Verify the exact attribute or method you plan to use against the enumerated list in `choosing-primitives.md`. If it's not on the list, treat it as unsupported |
| Treating static validation (`py_compile`, `docker compose config`, type checks) as sufficient proof the build works | Static checks catch syntax and shape, not contract drift. A build is not done until the canonical async curl has been fired against the live stack and returned with a real reasoned `result`. See "Mandatory live smoke test before handoff" |
When the user explicitly demands a rejected pattern, name the rejection, explain why in one sentence, propose the AgentField alternative, and only build it their way after they've confirmed they understand the tradeoff. Add a `# NOTE: User requested X over canonical Y` comment.
Mandatory live smoke test before handoff
Static validation — `python3 -m py_compile`, `docker compose config`, visual invariants, type checks — is necessary but not sufficient. None of those checks exercise the live call graph. None of them catch contract drift between reasoners. None of them reveal runtime errors like a cross-boundary type mismatch, a missing proxied attribute, a wrong env var name, or an unreachable sub-reasoner.
A build is not done until the canonical async curl from section 7 of the Output Contract has been fired against the live stack and returned `status: "succeeded"` with a real reasoned `result` payload. No exceptions.
The required sequence before you tell the user "it's ready":
- `docker compose up --build -d` — bring the stack up.
- Poll `/api/v1/discovery/capabilities` (see section 6) until the agent is registered and every reasoner you defined is listed.
- Fire the canonical async curl from the README with realistic input.
- Poll `/api/v1/executions/<id>` until `status` transitions to either `succeeded` or `failed`.
- If `status == "succeeded"`: read the `result` field and confirm it is a real structured response from your entry reasoner — not an empty object, not a fallback, not a half-finished dict.
- If `status == "failed"`: read the `error` field AND tail the agent container logs (`docker compose logs <slug> --tail=100`) to find the actual stack trace. Fix the bug. Go back to step 1.
Do not hand off a build that has only passed static checks. Static validation means "the files are syntactically valid and the compose graph is shaped correctly". It does not mean "the reasoners can actually call each other, the type contracts line up at runtime, the prompts fit in the context window, and the final synthesis produces something coherent." Only the live execution proves those things.
Common runtime failures that ONLY show up in the live smoke test:
- `AttributeError: '<object>' has no attribute '<X>'` — typically a cross-boundary data reconstitution bug (a structured payload arrived as a dict, code tried to access it as a model). See `choosing-primitives.md` → "Cross-boundary data does NOT auto-reconstitute".
- `AttributeError: '<framework object>' has no attribute '<X>'` — a framework surface contract was narrower than assumed. The attribute does not proxy. See `choosing-primitives.md` → router surface table.
- `TypeError: argument after ** must be a mapping, not <T>` — a downstream reasoner tried to `**payload` a value that wasn't a dict. Almost always the same cross-boundary issue.
- A reasoner returned an empty or half-filled schema — usually means an upstream `.ai()` call had a `confident=False` result and the fallback path returned a safe-default instance, but the orchestrator didn't notice and passed the default downstream as if it were real data.
- The pipeline timed out — either the model is too slow, the fan-out is too wide for the sync endpoint, or a sub-reasoner is stuck waiting on something. Switch the smoke test to async if you haven't already.
The general rule (applies to any multi-component system, not just this one): the only test that proves a distributed system works is the test that exercises the distribution. Running the canonical path against the live system is not optional; it is the single most important validation step, and the one most likely to catch subtle cross-component contract bugs. Never hand off without it.
Rationalization counters & red flags
These thoughts mean STOP. If you notice any of them, re-read the linked reference and reconsider.
| Thought / symptom | Reality / re-read |
|---|---|
| "Quick demo, I'll skip the architecture" | The skill exists to be stronger than a chain. Weak demo proves nothing |
| "I'll pass JSON between two reasoners" | LLMs reason over prose. Strings between LLMs, JSON only for code |
"One big reasoner is fewer files" | Decompose. Granularity is the forcing function for parallelism. |
| "I'll skip the CLAUDE.md / README" | They're how the next coding agent extends without breaking it. Always generate |
| "I'll ask 5 questions to be safe" | One question. State assumptions. Iterate |
| "Curl is enough, skip discovery API" | Discovery API tells you in 2s which step actually failed. |
"I need stateful tool-using → " | NO. is external coding-agent CLI delegation AND requires the CLI in the container. Use or a chunked-loop reasoner |
"I'll add for the deep reasoning step" | The default Python container has no CLI. The scaffold will crash on first run. Read "Harness availability gate" |
"I'll add to the decorator" | That param doesn't exist. Schemas come from type hints |
| ".ai() for a 50-page document" | or a chunked-loop reasoner. |
"Static loop of LLM calls, no routing" | Add dynamic routing or admit AgentField isn't justified. |
"Skipping and " | Always run. |
"I'll write to call the other reasoner" | Use . |
"I'll use in main" | Use . Auto-detects CLI vs server |
Output contract (every build)
The final message to the user MUST contain these sections, in this order, in a clean copy-pasteable format. The whole point is the first-time user can read the message top to bottom and within 60 seconds have the system running and a working curl in another terminal.
1. What was scaffolded
Generated file tree with absolute paths.
2. Architecture sketch
4–6 bullets: what each reasoner does, who calls whom, where the dynamic routing happens, where the safety guardrails fire.
3. Assumptions made
5–10 bullets — the things you inferred without asking.
4. 🚀 Run it (3 commands)
```bash
cd <absolute_project_path>
cp .env.example .env   # then paste your OPENROUTER_API_KEY into .env
docker compose up --build
```
Wait until you see `agent registered` in the logs (~30–90 seconds on the first run).
5. 🌐 Open the UI
After the stack is up, open these URLs in your browser:
| URL | What it shows |
|---|---|
| http://localhost:8080/ui/ | AgentField control plane web UI — live workflow DAG, reasoner discovery, execution history, verifiable credential chains |
| http://localhost:8080/api/v1/discovery/capabilities | JSON: every reasoner registered with the control plane (proves your build deployed) |
| http://localhost:8080/api/v1/health | Health check |
6. ✅ Verify the build (in another terminal)
Use `/api/v1/discovery/capabilities` as the primary, durable registration check. The `/api/v1/nodes` endpoint has filter behaviour that varies across control-plane versions and deployments — a node that successfully registered can still return empty under some filter combinations. Discovery is the only introspection path whose semantics are stable across versions.
```bash
# 1. Control plane up?
curl -fsS http://localhost:8080/api/v1/health | jq '.status'

# 2. Agent registered and reasoners discoverable? (PRIMARY CHECK — always use this)
#    Response shape: .capabilities[].reasoners[].id (NOT .reasoners[].name)
curl -fsS http://localhost:8080/api/v1/discovery/capabilities \
  | jq '.capabilities[]
        | select(.agent_id=="<slug>")
        | { agent_id,
            n_reasoners: (.reasoners | length),
            entry: [.reasoners[] | select(.tags[]? == "entry") | .id],
            all_reasoner_ids: [.reasoners[].id] }'
```
If the JSON above returns your `agent_id` with `n_reasoners > 0` and an `entry` array containing at least one id, registration succeeded. That is the only thing that matters. If you also want to look at `/api/v1/nodes`, do it as a secondary diagnostic — never as the primary gate — and do not worry if it shows `health: "unknown"` or returns an empty list; those are cosmetic/filter artifacts, not registration failures.
7. 🎯 Try it — sample async curl (canonical)
Use async. Multi-reasoner compositions routinely exceed the 90-second timeout on the sync endpoint. The canonical smoke test kicks off the async endpoint and polls until the execution succeeds.
```bash
# 1. Kick off — returns an execution_id immediately
EXEC_ID=$(curl -sS -X POST http://localhost:8080/api/v1/execute/async/<slug>.<entry_reasoner> \
  -H 'Content-Type: application/json' \
  -d '{
        "input": {
          "<param1>": "<realistic value>",
          "<param2>": <realistic value>,
          "model": "openrouter/google/gemini-2.5-flash"
        }
      }' | jq -r '.execution_id')
echo "Execution: $EXEC_ID"

# 2. Poll until done and print the result
while :; do
  R=$(curl -sS http://localhost:8080/api/v1/executions/$EXEC_ID)
  S=$(echo "$R" | jq -r '.status')
  case "$S" in
    succeeded) echo "$R" | jq '.result'; break ;;
    failed)    echo "$R" | jq '.'; break ;;
    *)         sleep 2 ;;
  esac
done
```
The payload must use realistic data the user can run as-is and see a real reasoned answer. Do not use placeholder values like `"foo"` or `"test"`. Use concrete values that actually exercise every reasoner in the system. The optional `"model"` field overrides the AIConfig default per-request — show it in the example so users discover the per-request override.
If the user provided test data in the brief (sample input, sample record, sample document), use THAT data verbatim in this curl. The first execution they run should be the most demonstrative one.
When it is safe to use the sync endpoint instead: only if you can genuinely guarantee the entire pipeline — every layer of fan-out, every child `app.call`, every `.ai()` call, every retry — finishes in under ~60 seconds. If the system uses parallel specialists, meta-prompting, or slow models, use async. When in doubt, use async.
8. 🏆 Showpiece — verifiable workflow chain
```bash
LAST_EXEC=$(curl -s http://localhost:8080/api/v1/executions | jq -r '.[0].workflow_id')
curl -s http://localhost:8080/api/v1/did/workflow/$LAST_EXEC/vc-chain | jq
```
This is the cryptographic verifiable credential chain — every reasoner that ran, with provenance. No other agent framework gives you this. Mention it.
9. Next iteration upgrade
One concrete suggestion tailored to the shape you actually built. Examples: "swap the intake `.ai()` for a chunked-loop reasoner if inputs grow past 2 pages", "add per-document caching via `app.memory` with `scope=agent` once volume justifies it", "split `query_planner` into `query_decomposer` + `constraint_extractor` when queries get multi-clause", "add an adversarial reviewer ONLY if the cost of a wrong verdict crosses a real-world threshold you can name", "swap the sequential cascade for a streaming pipeline once you need sub-500ms first-token latency".
TypeScript
A TypeScript SDK exists at `sdk/typescript/` and mirrors the Python API. Default to Python. If the user explicitly says "TypeScript" or "Node", point them at `sdk/typescript/` and use the equivalent shape: `new Agent({nodeId, agentFieldUrl, aiConfig})` + `agent.reasoner('name', async (ctx) => {...})`. Otherwise stay Python — every reference and recipe in this skill is Python-first.
Bottom line
Your output is judged by three things:
- Does the curl return a real reasoned answer? (the user can run the command and see intelligence happen)
- Does the architecture look like composite intelligence? (parallel reasoners, dynamic routing, decomposition — not a chain wearing a costume)
- Can a future coding agent extend it without breaking the contract? (CLAUDE.md present, anti-patterns listed, validation commands documented)
If all three are true, you've done it right. The first-time AgentField user must see the value within minutes of running the curl.