Claude-skill-registry investigate-stuck-messages

Investigate stuck messages in relayer queue. Use when alerts mention "queue length > 0", to diagnose why messages are stuck, or to get message IDs for denylisting.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/investigate-stuck-messages" ~/.claude/skills/majiayu000-claude-skill-registry-investigate-stuck-messages && rm -rf "$T"

manifest: skills/data/investigate-stuck-messages/SKILL.md

Investigate Stuck Messages

Query the relayer API to investigate stuck messages, their retry counts, and error reasons.

When to Use

Alert-based triggers:
- Alert: "Known app context relayer queue length > 0 for 40m"
- Any alert mentioning stuck messages in prepare queue
- High retry counts for specific app contexts
User request triggers:
- "Why are messages stuck for [app_context]?"
- "Investigate stuck messages on [chain]"
- "What's causing the queue alert?"
- Pasting a Grafana alert URL

Input Parameters

Option 1: Grafana Alert URL (recommended)

/investigate-stuck-messages https://abacusworks.grafana.net/alerting/grafana/cdg1ro5hi4vswb/view?tab=instances

Option 2: Manual specification

/investigate-stuck-messages app_context=EZETH/renzo-prod remote=linea

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| `alert_url` | No | - | Grafana alert URL (extracts app_context/remote from firing instances) |
| `app_context` | No* | - | The app context (e.g., `EZETH/renzo-prod`, `oUSDT/production`) |
| `remote` | No* | - | Destination chain name (e.g., `linea`, `ethereum`, `arbitrum`) |
| `environment` | No | `mainnet3` | Deployment environment |

*Either `alert_url` OR both `app_context` and `remote` must be provided.

Workflow

Step 1: Parse Input and Extract Alert Instances

If Grafana alert URL provided:

Extract the alert UID from the URL (e.g., `cdg1ro5hi4vswb` from `.../alerting/grafana/cdg1ro5hi4vswb/view`).

Query Prometheus directly for firing instances using `mcp__grafana__query_prometheus`:

sum by (app_context, remote)(
    max_over_time(
        hyperlane_submitter_queue_length{
            queue_name="prepare_queue",
            app_context!~"Unknown|merkly_eth|merkly_erc20|helloworld|velo_message_module",
            hyperlane_context!~"rc|vanguard0|vanguard1|vanguard2|vanguard3|vanguard4|vanguard5",
            operation_status!~"Retry\\(ApplicationReport\\(.*\\)\\)|FirstPrepareAttempt",
            hyperlane_deployment="mainnet3",
        }[2m]
    )
) > 0

Extract the `app_context` and `remote` labels from each result.

If manual app_context/remote provided:

Use the provided values directly.
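The input handling in this step can be sketched as a small bash function. This is only an illustration: `parse_input` is a hypothetical helper name, and the output format it prints is an assumption made for the example, not something the skill defines.

```shell
# Hypothetical helper: classify the skill's arguments (bash).
# Prints "alert_uid=<uid>" for a Grafana alert URL, or
# "app_context=<ctx> remote=<chain> environment=<env>" for manual input.
parse_input() {
  local arg rest uid app_context="" remote="" environment="mainnet3"
  for arg in "$@"; do
    case "$arg" in
      http://*|https://*)
        # Keep the path segment that follows /alerting/grafana/
        rest="${arg#*/alerting/grafana/}"
        if [ "$rest" = "$arg" ]; then
          echo "error: not a Grafana alert URL: $arg" >&2
          return 1
        fi
        uid="${rest%%/*}"    # drop "/view..." and anything after it
        uid="${uid%%\?*}"    # drop a query string if there was no "/view"
        echo "alert_uid=$uid"
        return 0
        ;;
      app_context=*) app_context="${arg#app_context=}" ;;
      remote=*)      remote="${arg#remote=}" ;;
      environment=*) environment="${arg#environment=}" ;;
    esac
  done
  # Manual mode needs both app_context and remote.
  if [ -z "$app_context" ] || [ -z "$remote" ]; then
    echo "error: provide alert_url, or both app_context and remote" >&2
    return 1
  fi
  echo "app_context=$app_context remote=$remote environment=$environment"
}
```

For example, `parse_input app_context=EZETH/renzo-prod remote=linea` prints the manual values with the default `mainnet3` environment, while passing only `app_context` fails with an error.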

Step 2: Setup Port-Forward to Relayer

Check if port 9090 is already in use:

lsof -i :9090

If not in use, start port-forward in background:

kubectl port-forward omniscient-relayer-hyperlane-agent-relayer-0 9090 -n mainnet3 &

Wait a few seconds for the port-forward to establish.

Step 3: Get Domain IDs for Chains

Look up domain IDs from the registry:

cat node_modules/.pnpm/@hyperlane-xyz+registry@*/node_modules/@hyperlane-xyz/registry/dist/chains/<chain>/metadata.json | jq '.domainId'

Common domain IDs:

ethereum: 1
optimism: 10
arbitrum: 42161
polygon: 137
base: 8453
unichain: 130
avalanche: 43114
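To avoid a registry lookup for the common chains listed above, the mapping can be kept in a small bash function. This is a sketch: `domain_id_for_chain` is a hypothetical name, and any chain not in the list (such as `linea`) still needs the registry `metadata.json` lookup.

```shell
# Hypothetical helper: map a chain name to its domain ID using the
# common IDs listed above. Unknown chains return a non-zero status.
domain_id_for_chain() {
  case "$1" in
    ethereum)  echo 1 ;;
    optimism)  echo 10 ;;
    arbitrum)  echo 42161 ;;
    polygon)   echo 137 ;;
    base)      echo 8453 ;;
    unichain)  echo 130 ;;
    avalanche) echo 43114 ;;
    *)
      echo "unknown chain '$1': look up domainId in the registry metadata.json" >&2
      return 1
      ;;
  esac
}
```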

Step 4: Query Relayer API

For each destination chain, query the relayer API:

curl -s 'http://localhost:9090/list_operations?destination_domain=<DOMAIN_ID>' > /tmp/<chain>.json

The response contains operations with:

- `id`: Message ID (H256)
- `operation.message.sender`: Sender address
- `operation.message.recipient`: Recipient address
- `operation.num_retries`: Number of retries (higher = more stuck)
- `operation.status`: Error status (e.g., `{"Retry": "ErrorEstimatingGas"}`)
- `operation.message.origin`: Origin domain ID
- `operation.message.destination`: Destination domain ID
- `operation.app_context`: App context name

Step 5: Filter Messages by App Context

Look up the `app_context` in `rust/main/app-contexts/mainnet_config.json`:

jq '.metricAppContexts[] | select(.name == "<APP_CONTEXT>")' rust/main/app-contexts/mainnet_config.json

Filter API results to only include messages where `operation.message.recipient` matches one of the `recipientAddress` values for that destination domain.

Important: Addresses are padded to 32 bytes (H256 format).
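Because of this padding, a 20-byte address has to be widened before it can be compared with a recipient from the API. The bash sketch below shows one way to do that. It assumes left-padding with zeros (the usual convention for EVM-style addresses) and lowercase hex output; `to_h256` is a hypothetical helper name, and non-EVM address formats may need different handling.

```shell
# Hypothetical helper: left-pad a hex address to 32 bytes (0x + 64 hex chars).
# Assumes zero left-padding and lowercase output for comparison.
to_h256() {
  local hex zeros
  hex="${1#0x}"
  hex="$(printf '%s' "$hex" | tr '[:upper:]' '[:lower:]')"
  # Reject empty input and anything that is not plain hex.
  case "$hex" in
    ''|*[!0-9a-f]*)
      echo "error: not a hex value: $1" >&2
      return 1
      ;;
  esac
  if [ "${#hex}" -gt 64 ]; then
    echo "error: value longer than 32 bytes: $1" >&2
    return 1
  fi
  zeros="$(printf '%064d' 0)"    # a string of 64 zeros
  printf '0x%s%s\n' "${zeros:0:$((64 - ${#hex}))}" "$hex"
}
```

When comparing, it is safest to lowercase the API's `operation.message.recipient` the same way, since the casing the relayer returns is not stated here.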

Step 6: Query GCP Logs for Actual Errors

Calculate log freshness based on retry count:

The relayer uses exponential backoff (see `calculate_msg_backoff` in `rust/main/agents/relayer/src/msg/pending_message.rs`):

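| Retries | Backoff/retry | Cumulative Time | Freshness Flag |
|---------|---------------|-----------------|----------------|
| 1-4 | 5s-1min | ~2min | `--freshness=1h` |
| 5-24 | 3min | ~1h | `--freshness=3h` |
| 25-39 | 5-26min | ~5h | `--freshness=12h` |
| 40-49 | 30min-1h | ~12h | `--freshness=24h` |
| 50-60 | 2-22h | ~35h | `--freshness=3d` |
| 60+ | 22h+ | 35h+ | `--freshness=7d` |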
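The table lookup can be expressed as a small bash function. This sketch makes two assumptions the table leaves open: a retry count of exactly 60 falls into the 50-60 row, and a count of 0 is treated like the first row. `freshness_for_retries` is a hypothetical name.

```shell
# Hypothetical helper: map a retry count to the gcloud freshness flag,
# following the table above. Assumes 60 belongs to the 50-60 row and
# 0 is treated like 1-4.
freshness_for_retries() {
  local n="$1"
  case "$n" in
    ''|*[!0-9]*)
      echo "error: retry count must be a non-negative integer" >&2
      return 1
      ;;
  esac
  if   [ "$n" -le 4 ];  then echo "--freshness=1h"
  elif [ "$n" -le 24 ]; then echo "--freshness=3h"
  elif [ "$n" -le 39 ]; then echo "--freshness=12h"
  elif [ "$n" -le 49 ]; then echo "--freshness=24h"
  elif [ "$n" -le 60 ]; then echo "--freshness=3d"
  else                       echo "--freshness=7d"
  fi
}
```

A message with 47 retries, as in the sample results later in this document, maps to `--freshness=24h`.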

For each message ID, query GCP logs with calculated freshness:

gcloud logging read 'resource.type=k8s_container AND resource.labels.namespace_name=mainnet3 AND resource.labels.pod_name:omniscient-relayer AND jsonPayload.span.id:<MESSAGE_ID> AND jsonPayload.fields.error:*' --project=abacus-labs-dev --limit=1 --format='value(jsonPayload.fields.error)' --freshness=<CALCULATED_FRESHNESS>

Extract the human-readable error from the response using `sed` (macOS compatible):

echo "$raw_error" | sed -n 's/.*execution reverted: \([^"]*\)".*/\1/p' | head -1

Common error patterns:

- `"execution reverted: Nonce already used"` → "Nonce already used"
- `"execution reverted: panic: arithmetic underflow"` → "Arithmetic underflow"

Note: Do not use `grep -P`; it is not available on macOS.
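The `sed` extraction above can be wrapped in a function that falls back to the raw error when no revert string is present, so unexpected log formats are not silently dropped. This is a sketch: `extract_revert_reason` is a hypothetical name, and the fallback behaviour is an assumption rather than part of the original workflow.

```shell
# Hypothetical helper: pull the revert reason out of a raw log error.
# Uses the same sed expression as above (works with BSD and GNU sed).
extract_revert_reason() {
  local raw="$1" reason
  reason="$(printf '%s\n' "$raw" \
    | sed -n 's/.*execution reverted: \([^"]*\)".*/\1/p' \
    | head -1)"
  if [ -n "$reason" ]; then
    printf '%s\n' "$reason"
  else
    # No revert string found: return the raw error so nothing is lost.
    printf '%s\n' "$raw"
  fi
}
```

Note that the expression only matches when the revert message is followed by a closing double quote, as in the patterns listed above.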

Step 7: Present Investigation Results

Output a detailed summary table with full message IDs and both error sources:

## Investigation Results for [APP_CONTEXT]

### Summary
- Total stuck messages: X
- Destinations affected: [list]
- Reprepare reasons: ErrorEstimatingGas (N), CouldNotFetchMetadata (M)

### Messages

| Message ID | Retries | Reprepare Reason | Error | Origin |
|------------|---------|------------------|-----------|--------|
| `0xaa18ebc1c79345e6d24984a0b9a5ab66c968d128d46b2357b641e56e71b8d30c` | 47 | ErrorEstimatingGas | Nonce already used | optimism |
| `0xd6aeef7c092a88aa23ad53227aeb834ae731d059b3ce749db8451e761f3f15ac` | 47 | ErrorEstimatingGas | Nonce already used | arbitrum |

**Important**: Always show the full 66-character message ID (0x + 64 hex chars). Do not truncate.

### Error Analysis
[Explain based on the actual log errors found]

### Next Steps
To denylist these messages, run:
/denylist-stuck-messages <message_ids> app_context=APP_CONTEXT

Column definitions:

- Reprepare Reason: From `operation.status` in the relayer API (e.g., ErrorEstimatingGas, CouldNotFetchMetadata)
- Error: Actual revert reason from GCP logs (e.g., "Nonce already used", "Arithmetic underflow")

Step 8: Output Denylist Command

At the end of the investigation results, output the full denylist command:

### Next Steps
To denylist, run:
/denylist-stuck-messages 0xaa18ebc1c79345e6d24984a0b9a5ab66c968d128d46b2357b641e56e71b8d30c 0xd6aeef7c092a88aa23ad53227aeb834ae731d059b3ce749db8451e761f3f15ac app_context=APP_CONTEXT

Always use full message IDs, never truncated.
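One way to enforce the full-ID rule is to validate each ID before assembling the command. The bash sketch below does this with a regular expression for `0x` plus 64 hex characters. `build_denylist_command` is a hypothetical helper; it only prints the `/denylist-stuck-messages` invocation in the form shown above and does not run anything.

```shell
# Hypothetical helper: assemble the denylist command from full message IDs.
# Usage: build_denylist_command <app_context> <message_id>...
build_denylist_command() {
  local app_context="$1" id
  shift
  if [ "$#" -eq 0 ]; then
    echo "error: no message IDs given" >&2
    return 1
  fi
  for id in "$@"; do
    # A full ID is 66 characters: 0x followed by 64 hex digits.
    if ! [[ "$id" =~ ^0x[0-9a-fA-F]{64}$ ]]; then
      echo "error: not a full 66-character message ID: $id" >&2
      return 1
    fi
  done
  echo "/denylist-stuck-messages $* app_context=$app_context"
}
```

A truncated ID such as `0xaa18ebc1` is rejected instead of being passed along.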

Error Status Reference

| Status | Meaning | Action |
|--------|---------|--------|
| `ErrorEstimatingGas` | Gas estimation failed (contract revert) | Usually denylist - contract won't accept |
| `CouldNotFetchMetadata` | Can't get ISM metadata | Check validators, may resolve itself |
| `ApplicationReport(...)` | App-specific error | Check the specific error message |
| `GasPaymentNotFound` | No IGP payment | May need manual relay with gas |

Error Handling

- Port-forward fails: Check kubectl context: `kubectl config current-context`
- No messages found: Queue may have cleared; alert may be stale
- API returns error: Check relayer pod: `kubectl get pods -n mainnet3 | grep relayer`
- App context not found: May be new/custom; ask user for sender/recipient addresses

Prerequisites

- `kubectl` configured with access to mainnet cluster
- Grafana MCP server connected (for alert URL parsing)