Claude-skill-registry investigate-stuck-messages

Investigate stuck messages in the relayer queue. Use when alerts mention "queue length > 0", to diagnose why messages are stuck, or to get message IDs for denylisting.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/investigate-stuck-messages" ~/.claude/skills/majiayu000-claude-skill-registry-investigate-stuck-messages && rm -rf "$T"
manifest: skills/data/investigate-stuck-messages/SKILL.md
source content

Investigate Stuck Messages

Query the relayer API to investigate stuck messages, their retry counts, and error reasons.

When to Use

  1. Alert-based triggers:

    • Alert: "Known app context relayer queue length > 0 for 40m"
    • Any alert mentioning stuck messages in the prepare queue
    • High retry counts for specific app contexts
  2. User request triggers:

    • "Why are messages stuck for [app_context]?"
    • "Investigate stuck messages on [chain]"
    • "What's causing the queue alert?"
    • Pasting a Grafana alert URL

Input Parameters

Option 1: Grafana Alert URL (recommended)

/investigate-stuck-messages https://abacusworks.grafana.net/alerting/grafana/cdg1ro5hi4vswb/view?tab=instances

Option 2: Manual specification

/investigate-stuck-messages app_context=EZETH/renzo-prod remote=linea
| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| `alert_url` | No | - | Grafana alert URL (extracts app_context/remote from firing instances) |
| `app_context` | No* | - | The app context (e.g., `EZETH/renzo-prod`, `oUSDT/production`) |
| `remote` | No* | - | Destination chain name (e.g., `linea`, `ethereum`, `arbitrum`) |
| `environment` | No | `mainnet3` | Deployment environment |

*Either `alert_url` OR both `app_context` and `remote` must be provided.

Workflow

Step 1: Parse Input and Extract Alert Instances

If Grafana alert URL provided:

  1. Extract the alert UID from the URL (e.g., `cdg1ro5hi4vswb` from `.../alerting/grafana/cdg1ro5hi4vswb/view`).

  2. Query Prometheus directly for firing instances using `mcp__grafana__query_prometheus`:

    sum by (app_context, remote)(
        max_over_time(
            hyperlane_submitter_queue_length{
                queue_name="prepare_queue",
                app_context!~"Unknown|merkly_eth|merkly_erc20|helloworld|velo_message_module",
                hyperlane_context!~"rc|vanguard0|vanguard1|vanguard2|vanguard3|vanguard4|vanguard5",
                operation_status!~"Retry\\(ApplicationReport\\(.*\\)\\)|FirstPrepareAttempt",
                hyperlane_deployment="mainnet3",
            }[2m]
        )
    ) > 0
    
  3. Extract the `app_context` and `remote` labels from each result.
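
The UID extraction in step 1 can be sketched in shell as follows. This is a minimal illustration, not part of the skill itself: the function name `alert_uid_from_url` is hypothetical, and it assumes the UID is the path segment immediately after `/alerting/grafana/`, as in the example URL above.

```shell
# Hypothetical helper: print the alert UID found after /alerting/grafana/ in a URL.
# Prints nothing if the URL does not contain that path.
alert_uid_from_url() {
  # Capture everything up to the next "/" or "?" after the fixed path prefix.
  printf '%s\n' "$1" | sed -n 's|.*/alerting/grafana/\([^/?]*\).*|\1|p'
}

# Example call:
#   alert_uid_from_url 'https://abacusworks.grafana.net/alerting/grafana/cdg1ro5hi4vswb/view?tab=instances'
```

The `sed` expression uses only basic regular expressions, so it should behave the same with BSD sed on macOS and GNU sed on Linux.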

If manual app_context/remote provided:

Use the provided values directly.

Step 2: Setup Port-Forward to Relayer

Check if port 9090 is already in use:

lsof -i :9090

If not in use, start port-forward in background:

kubectl port-forward omniscient-relayer-hyperlane-agent-relayer-0 9090 -n mainnet3 &

Wait a few seconds for the port-forward to establish.
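
Instead of sleeping for a fixed time, the wait can be made explicit by polling until something is listening on the port. The sketch below is one possible approach, assuming `lsof` is available (it is already used above); the function name `wait_for_listener` and the default of 10 attempts are illustrative choices, not part of the skill.

```shell
# Hypothetical helper: poll until a TCP listener appears on the given port.
# Usage: wait_for_listener <port> [max_attempts]
# Returns 0 once a listener is found, 1 if none appears within max_attempts seconds.
wait_for_listener() {
  port=$1
  tries=${2:-10}
  while [ "$tries" -gt 0 ]; do
    # -iTCP:<port> restricts to that port, -sTCP:LISTEN to listening sockets.
    if lsof -iTCP:"$port" -sTCP:LISTEN >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Example call after starting the port-forward in the background:
#   wait_for_listener 9090 || echo "port-forward did not come up" >&2
```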

Step 3: Get Domain IDs for Chains

Look up domain IDs from the registry:

cat node_modules/.pnpm/@hyperlane-xyz+registry@*/node_modules/@hyperlane-xyz/registry/dist/chains/<chain>/metadata.json | jq '.domainId'

Common domain IDs:

  • ethereum: 1
  • optimism: 10
  • arbitrum: 42161
  • polygon: 137
  • base: 8453
  • unichain: 130
  • avalanche: 43114
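
For the chains listed above, the lookup can be short-circuited with a small table before falling back to the registry. This is a sketch only: the function name `domain_id_for_chain` is hypothetical, and it covers just the IDs listed in this document. Any other chain (for example `linea`) still needs the registry lookup.

```shell
# Hypothetical helper: map a chain name to its domain ID for the common chains above.
# Returns non-zero for chains that must be looked up in the registry instead.
domain_id_for_chain() {
  case "$1" in
    ethereum)  echo 1 ;;
    optimism)  echo 10 ;;
    arbitrum)  echo 42161 ;;
    polygon)   echo 137 ;;
    base)      echo 8453 ;;
    unichain)  echo 130 ;;
    avalanche) echo 43114 ;;
    *)
      echo "unknown chain '$1': look it up in the registry metadata.json" >&2
      return 1
      ;;
  esac
}
```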

Step 4: Query Relayer API

For each destination chain, query the relayer API:

curl -s 'http://localhost:9090/list_operations?destination_domain=<DOMAIN_ID>' > /tmp/<chain>.json

The response contains operations with:

  • `id`: Message ID (H256)
  • `operation.message.sender`: Sender address
  • `operation.message.recipient`: Recipient address
  • `operation.num_retries`: Number of retries (higher = more stuck)
  • `operation.status`: Error status (e.g., `{"Retry": "ErrorEstimatingGas"}`)
  • `operation.message.origin`: Origin domain ID
  • `operation.message.destination`: Destination domain ID
  • `operation.app_context`: App context name
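
To get a quick overview of a saved response, the fields above can be flattened into one line per operation with `jq`. The sketch assumes the response body is a top-level JSON array of objects shaped as described above and that `num_retries` is always a number. The source does not confirm either point, so check a real response first. The function name `summarize_operations` is hypothetical.

```shell
# Hypothetical helper: print "id <TAB> retries <TAB> status <TAB> app_context"
# for each operation in a saved response, most-retried first.
# Assumes the file holds a top-level JSON array (unverified assumption).
summarize_operations() {
  jq -r 'sort_by(-.operation.num_retries)[]
         | [.id, .operation.num_retries, (.operation.status | tojson), .operation.app_context]
         | @tsv' "$1"
}

# Example call:
#   summarize_operations /tmp/linea.json
```

Sorting by retry count puts the messages that have been stuck longest at the top, which is usually where the investigation starts.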

Step 5: Filter Messages by App Context

Look up the `app_context` in `rust/main/app-contexts/mainnet_config.json`:

jq '.metricAppContexts[] | select(.name == "<APP_CONTEXT>")' rust/main/app-contexts/mainnet_config.json

Filter API results to only include messages where:

  • `operation.message.recipient` matches one of the `recipientAddress` values for that destination domain

Important: Addresses are padded to 32 bytes (H256 format).
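
When comparing a 20-byte EVM address against an H256 value, it helps to normalize both sides first. The sketch below left-pads a hex address with zeros to 64 hex characters and lowercases it. It assumes EVM-style hex addresses and a `0x`-prefixed, zero-left-padded H256 representation. Whether the API and the config file use exactly that casing and prefix is not stated in the source, so treat it as an assumption. The function name `to_h256` is hypothetical.

```shell
# Hypothetical helper: normalize a hex address to a lowercase, 0x-prefixed,
# zero-left-padded 32-byte (64 hex character) string.
to_h256() {
  # Lowercase everything, including a possible "0X" prefix, then strip the prefix.
  addr=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  hex=${addr#0x}
  # Left-pad with zeros until the value is 64 hex characters long.
  while [ "${#hex}" -lt 64 ]; do
    hex="0$hex"
  done
  printf '0x%s\n' "$hex"
}

# Example comparison (both variables are placeholders):
#   [ "$(to_h256 "$recipient_from_api")" = "$(to_h256 "$recipient_from_config")" ]
```

Normalizing both operands means the comparison works whether either side is already padded or not.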

Step 6: Query GCP Logs for Actual Errors

Calculate log freshness based on retry count:

The relayer uses exponential backoff (see `calculate_msg_backoff` in `rust/main/agents/relayer/src/msg/pending_message.rs`):

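| Retries | Backoff/retry | Cumulative Time | Freshness Flag |
|---------|---------------|-----------------|----------------|
| 1-4 | 5s-1min | ~2min | `--freshness=1h` |
| 5-24 | 3min | ~1h | `--freshness=3h` |
| 25-39 | 5-26min | ~5h | `--freshness=12h` |
| 40-49 | 30min-1h | ~12h | `--freshness=24h` |
| 50-60 | 2-22h | ~35h | `--freshness=3d` |
| 60+ | 22h+ | 35h+ | `--freshness=7d` |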
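
The table translates directly into a small lookup. The sketch below is illustrative only: the function name `freshness_for_retries` is hypothetical. The table lists 60 retries in both of its last two rows, so the sketch assumes the wider 7d window applies at 60, and it treats 0 retries like the first row.

```shell
# Hypothetical helper: map a retry count to a gcloud --freshness value
# following the table above. Assumption: 60 retries uses the 7d window.
freshness_for_retries() {
  n=$1
  if   [ "$n" -lt 5 ];  then echo 1h
  elif [ "$n" -lt 25 ]; then echo 3h
  elif [ "$n" -lt 40 ]; then echo 12h
  elif [ "$n" -lt 50 ]; then echo 24h
  elif [ "$n" -lt 60 ]; then echo 3d
  else                       echo 7d
  fi
}

# Example: build the flag for a message with 47 retries.
#   flag="--freshness=$(freshness_for_retries 47)"
```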

For each message ID, query GCP logs with calculated freshness:

gcloud logging read 'resource.type=k8s_container AND resource.labels.namespace_name=mainnet3 AND resource.labels.pod_name:omniscient-relayer AND jsonPayload.span.id:<MESSAGE_ID> AND jsonPayload.fields.error:*' --project=abacus-labs-dev --limit=1 --format='value(jsonPayload.fields.error)' --freshness=<CALCULATED_FRESHNESS>

Extract the human-readable error from the response using `sed` (macOS compatible):

echo "$raw_error" | sed -n 's/.*execution reverted: \([^"]*\)".*/\1/p' | head -1

Common error patterns:

  • "execution reverted: Nonce already used" → "Nonce already used"
  • "execution reverted: panic: arithmetic underflow" → "Arithmetic underflow"

Note: Do not use `grep -P` as it's not available on macOS.
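
The `sed` expression above returns the text after "execution reverted: " verbatim, so the second pattern would come out as "panic: arithmetic underflow". A possible way to get the tidied form shown in the examples is sketched below. It reuses the same `sed` expression, then strips a leading "panic: " and capitalizes the first letter. The function name `extract_revert_reason` and the normalization rules are assumptions inferred from the two examples, and the input is assumed to contain the closing double quote that the `sed` pattern expects.

```shell
# Hypothetical helper: pull the revert reason out of a raw error string and
# tidy it (drop a leading "panic: ", capitalize the first letter).
extract_revert_reason() {
  # Same expression as above: capture up to the closing double quote.
  msg=$(printf '%s\n' "$1" | sed -n 's/.*execution reverted: \([^"]*\)".*/\1/p' | head -1)
  msg=${msg#panic: }
  first=$(printf '%s' "$msg" | cut -c1 | tr '[:lower:]' '[:upper:]')
  rest=$(printf '%s' "$msg" | cut -c2-)
  printf '%s%s\n' "$first" "$rest"
}

# Example call:
#   extract_revert_reason "$raw_error"
```

Only POSIX tools (`sed`, `cut`, `tr`) are used, in keeping with the macOS note above.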

Step 7: Present Investigation Results

Output a detailed summary table with full message IDs and both error sources:

## Investigation Results for [APP_CONTEXT]

### Summary
- Total stuck messages: X
- Destinations affected: [list]
- Reprepare reasons: ErrorEstimatingGas (N), CouldNotFetchMetadata (M)

### Messages

| Message ID | Retries | Reprepare Reason | Error | Origin |
|------------|---------|------------------|-----------|--------|
| `0xaa18ebc1c79345e6d24984a0b9a5ab66c968d128d46b2357b641e56e71b8d30c` | 47 | ErrorEstimatingGas | Nonce already used | optimism |
| `0xd6aeef7c092a88aa23ad53227aeb834ae731d059b3ce749db8451e761f3f15ac` | 47 | ErrorEstimatingGas | Nonce already used | arbitrum |

**Important**: Always show the full 66-character message ID (0x + 64 hex chars). Do not truncate.

### Error Analysis
[Explain based on the actual log errors found]

### Next Steps
To denylist these messages, run:
/denylist-stuck-messages <message_ids> app_context=APP_CONTEXT

Column definitions:

  • Reprepare Reason: From `operation.status` in relayer API (e.g., ErrorEstimatingGas, CouldNotFetchMetadata)
  • Error: Actual revert reason from GCP logs (e.g., "Nonce already used", "Arithmetic underflow")

Step 8: Output Denylist Command

At the end of the investigation results, output the full denylist command:

### Next Steps
To denylist, run:
/denylist-stuck-messages 0xaa18ebc1c79345e6d24984a0b9a5ab66c968d128d46b2357b641e56e71b8d30c 0xd6aeef7c092a88aa23ad53227aeb834ae731d059b3ce749db8451e761f3f15ac app_context=APP_CONTEXT

Always use full message IDs, never truncated.
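
One way to enforce the full-ID rule is to validate each ID before assembling the command line. The sketch below is an illustration under stated assumptions: the function name `build_denylist_command` is hypothetical, and it only prints the `/denylist-stuck-messages` invocation as text in the format shown above. It does not run anything.

```shell
# Hypothetical helper: print a /denylist-stuck-messages command for the given IDs.
# Usage: build_denylist_command <app_context> <message_id>...
# Rejects any ID that is not exactly 0x followed by 64 hex characters.
build_denylist_command() {
  app_context=$1
  shift
  if [ "$#" -eq 0 ]; then
    echo "no message IDs given" >&2
    return 1
  fi
  ids=""
  for id in "$@"; do
    if ! printf '%s\n' "$id" | grep -Eq '^0x[0-9a-fA-F]{64}$'; then
      echo "refusing truncated or malformed message ID: $id" >&2
      return 1
    fi
    ids="$ids $id"
  done
  printf '/denylist-stuck-messages%s app_context=%s\n' "$ids" "$app_context"
}
```

Failing on the first bad ID, rather than silently skipping it, avoids producing a command that denylists fewer messages than intended.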

Error Status Reference

| Status | Meaning | Action |
|--------|---------|--------|
| `ErrorEstimatingGas` | Gas estimation failed (contract revert) | Usually denylist - contract won't accept |
| `CouldNotFetchMetadata` | Can't get ISM metadata | Check validators, may resolve itself |
| `ApplicationReport(...)` | App-specific error | Check the specific error message |
| `GasPaymentNotFound` | No IGP payment | May need manual relay with gas |

Error Handling

  • Port-forward fails: Check kubectl context: `kubectl config current-context`
  • No messages found: Queue may have cleared; alert may be stale
  • API returns error: Check relayer pod: `kubectl get pods -n mainnet3 | grep relayer`
  • App context not found: May be new/custom; ask user for sender/recipient addresses

Prerequisites

  • `kubectl` configured with access to mainnet cluster
  • Grafana MCP server connected (for alert URL parsing)