# Astro Deployment Troubleshooting

Troubleshoot Astronomer production deployments with the Astro CLI. Use this skill when investigating deployment issues, viewing production logs, analyzing failures, or managing deployment environment variables.

To get the skill, clone the agents repository; the skill file lives at `skills/troubleshooting-astro-deployments/skill.md`:

```bash
git clone https://github.com/astronomer/agents
```
This skill helps you diagnose and troubleshoot production Astronomer deployments using the Astro CLI.
For deployment management, see the managing-astro-deployments skill. For local development, see the managing-astro-local-env skill.
## Quick Health Check
Start with these commands to get an overview:
```bash
# 1. List deployments to find target
astro deployment list

# 2. Get deployment overview
astro deployment inspect <DEPLOYMENT_ID>

# 3. Check for errors
astro deployment logs <DEPLOYMENT_ID> --error -c 50
```
## Viewing Deployment Logs
Use `-c` to control log count (default: 500). Log flags cannot be combined; use one component or level flag per command.
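For example, pull scheduler logs and error logs as two separate calls rather than one combined call (a minimal illustration of the one-flag-per-command rule):

```bash
# Component and level flags go in separate invocations
astro deployment logs <DEPLOYMENT_ID> --scheduler -c 100
astro deployment logs <DEPLOYMENT_ID> --error -c 50

# Not supported: combining two component flags in one call
# astro deployment logs <DEPLOYMENT_ID> --scheduler --webserver
```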
### Component-Specific Logs
View logs from specific Airflow components:
```bash
# Scheduler logs (DAG processing, task scheduling)
astro deployment logs <DEPLOYMENT_ID> --scheduler -c 50

# Worker logs (task execution)
astro deployment logs <DEPLOYMENT_ID> --workers -c 30

# Webserver logs (UI access, health checks)
astro deployment logs <DEPLOYMENT_ID> --webserver -c 30

# Triggerer logs (deferrable operators)
astro deployment logs <DEPLOYMENT_ID> --triggerer -c 30
```
### Log Level Filtering
Filter by severity:
```bash
# Error logs only (most useful for troubleshooting)
astro deployment logs <DEPLOYMENT_ID> --error -c 30

# Warning logs
astro deployment logs <DEPLOYMENT_ID> --warn -c 50

# Info-level logs
astro deployment logs <DEPLOYMENT_ID> --info -c 50
```
### Search Logs
Search for specific keywords:
```bash
# Search for a specific error
astro deployment logs <DEPLOYMENT_ID> --keyword "ConnectionError"

# Search for a specific DAG
astro deployment logs <DEPLOYMENT_ID> --keyword "my_dag_name" -c 100

# Find import errors
astro deployment logs <DEPLOYMENT_ID> --error --keyword "ImportError"

# Find task failures
astro deployment logs <DEPLOYMENT_ID> --error --keyword "Task failed"
```
## Complete Investigation Workflow
### Step 1: Identify the Problem
```bash
# List deployments with status
astro deployment list

# Get deployment details
astro deployment inspect <DEPLOYMENT_ID>
```
Look for:
- Status: HEALTHY vs UNHEALTHY
- Runtime version compatibility
- Resource limits (CPU, memory)
- Recent deployment timestamp
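A quick way to spot-check these fields without reading the full output (a sketch assuming the default plain-text output; exact field names can vary by CLI version):

```bash
# Filter the inspect output down to health-related fields
astro deployment inspect <DEPLOYMENT_ID> | grep -iE "status|runtime_version|updated"
```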
### Step 2: Check Error Logs
```bash
# Start with errors
astro deployment logs <DEPLOYMENT_ID> --error -c 50
```
Look for:
- Recurring error patterns
- Specific DAGs failing repeatedly
- Import errors or syntax errors
- Connection or credential errors
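To surface recurring patterns without reading every line, a frequency count helps (a sketch assuming standard Unix tools):

```bash
# Rank the most frequent error lines to expose recurring failures
astro deployment logs <DEPLOYMENT_ID> --error -c 500 | sort | uniq -c | sort -rn | head -10
```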
### Step 3: Review Scheduler Logs
```bash
# Check DAG processing
astro deployment logs <DEPLOYMENT_ID> --scheduler -c 30
```
Look for:
- DAG parse errors
- Scheduling delays
- Task queueing issues
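Keyword searches against the scheduler logs can probe for these directly (the keywords below are illustrative starting points, not an exhaustive list):

```bash
# Parse errors surface as import failures in scheduler logs
astro deployment logs <DEPLOYMENT_ID> --scheduler --keyword "Failed to import" -c 50

# Queueing problems often mention queued task instances
astro deployment logs <DEPLOYMENT_ID> --scheduler --keyword "queued" -c 50
```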
### Step 4: Check Worker Logs
```bash
# Check task execution
astro deployment logs <DEPLOYMENT_ID> --workers -c 30
```
Look for:
- Task execution failures
- Resource exhaustion
- Timeout errors
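Keyword probes for the failure classes above (the exact strings depend on your operators and runtime, so treat these as starting points):

```bash
# Timeout errors
astro deployment logs <DEPLOYMENT_ID> --workers --keyword "Timeout" -c 50

# Resource exhaustion: OOM-killed tasks typically die with SIGKILL
astro deployment logs <DEPLOYMENT_ID> --error --keyword "SIGKILL"
```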
### Step 5: Verify Configuration
```bash
# Check environment variables
astro deployment variable list --deployment-id <DEPLOYMENT_ID>

# Verify deployment settings
astro deployment inspect <DEPLOYMENT_ID>
```
Look for:
- Missing or incorrect environment variables
- Secrets configuration (AIRFLOW__SECRETS__BACKEND)
- Connection configuration
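For example, to confirm a secrets backend is configured (the key name is taken from the list above):

```bash
# Check whether a secrets backend is set for this deployment
astro deployment variable list --deployment-id <DEPLOYMENT_ID> --key AIRFLOW__SECRETS__BACKEND
```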
## Common Investigation Patterns
### Recurring DAG Failures
Follow the complete investigation workflow above, then narrow to the specific DAG:
```bash
astro deployment logs <DEPLOYMENT_ID> --keyword "my_dag_name" -c 100
```
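If failures recur, it can help to capture a larger window once and analyze it locally rather than re-querying (a sketch assuming standard Unix tools; the file name is arbitrary):

```bash
# Capture a larger window of DAG-specific logs, then look for patterns
astro deployment logs <DEPLOYMENT_ID> --keyword "my_dag_name" -c 500 > my_dag_logs.txt

# How often does the task fail, and do the failures cluster in time?
grep -c "Task failed" my_dag_logs.txt
grep "Task failed" my_dag_logs.txt | head -20
```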
### Resource Issues
```bash
# 1. Check deployment resource allocation
astro deployment inspect <DEPLOYMENT_ID>
# Look for: resource_quota_cpu, resource_quota_memory
# Worker queue: max_worker_count, worker_type

# 2. Check for worker scaling issues
astro deployment logs <DEPLOYMENT_ID> --workers -c 50

# 3. Look for out-of-memory errors
astro deployment logs <DEPLOYMENT_ID> --error --keyword "memory"
```
### Configuration Problems
```bash
# 1. Review environment variables
astro deployment variable list --deployment-id <DEPLOYMENT_ID>

# 2. Check for secrets backend configuration
# Look for: AIRFLOW__SECRETS__BACKEND, AIRFLOW__SECRETS__BACKEND_KWARGS

# 3. Verify deployment settings
astro deployment inspect <DEPLOYMENT_ID>

# 4. Check webserver logs for auth issues
astro deployment logs <DEPLOYMENT_ID> --webserver -c 30
```
### Import Errors
```bash
# 1. Find import errors
astro deployment logs <DEPLOYMENT_ID> --error --keyword "ImportError"

# 2. Check scheduler for parse failures
astro deployment logs <DEPLOYMENT_ID> --scheduler --keyword "Failed to import" -c 50

# 3. Verify dependencies were deployed
astro deployment inspect <DEPLOYMENT_ID>
# Check: current_tag, last deployment timestamp
```
## Managing Environment Variables
### List Variables
```bash
# List all variables for a deployment
astro deployment variable list --deployment-id <DEPLOYMENT_ID>

# Find a specific variable
astro deployment variable list --deployment-id <DEPLOYMENT_ID> --key AWS_REGION

# Export variables to a file
astro deployment variable list --deployment-id <DEPLOYMENT_ID> --save --env .env.backup
```
### Create Variables
```bash
# Create a regular variable
astro deployment variable create --deployment-id <DEPLOYMENT_ID> \
  --key API_ENDPOINT \
  --value https://api.example.com

# Create a secret (masked in UI and logs)
astro deployment variable create --deployment-id <DEPLOYMENT_ID> \
  --key API_KEY \
  --value secret123 \
  --secret
```
### Update Variables
```bash
# Update an existing variable
astro deployment variable update --deployment-id <DEPLOYMENT_ID> \
  --key API_KEY \
  --value newsecret
```
### Delete Variables
```bash
# Delete a variable
astro deployment variable delete --deployment-id <DEPLOYMENT_ID> --key OLD_KEY
```
Note: Variables are available to DAGs as environment variables. Changes require no redeployment.
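A typical rotate-and-verify cycle, combining the commands above:

```bash
# Rotate a credential, then confirm the change is registered
astro deployment variable update --deployment-id <DEPLOYMENT_ID> \
  --key API_KEY \
  --value newsecret

# Verify the key exists (secret values appear masked in the output)
astro deployment variable list --deployment-id <DEPLOYMENT_ID> --key API_KEY
```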
## Key Metrics from `deployment inspect`

Focus on these fields when troubleshooting:
- `status`: HEALTHY vs UNHEALTHY
- `runtime_version`: Airflow version compatibility
- `scheduler_size` / `scheduler_count`: Scheduler capacity
- `executor`: CELERY, KUBERNETES, or LOCAL
- `worker_queues`: Worker scaling limits and types: `min_worker_count`, `max_worker_count`, `worker_concurrency`, `worker_type` (resource class)
- `resource_quota_cpu` / `resource_quota_memory`: Overall resource limits
- `dag_deploy_enabled`: Whether DAG-only deploys work
- `current_tag`: Last deployment version
- `is_high_availability`: Redundancy enabled
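Since the `inspect` output is plain text, standard tools can pull out just the scaling-related fields (field names as listed above; exact output format can vary by CLI version):

```bash
# Extract worker-queue and quota settings from the inspect output
astro deployment inspect <DEPLOYMENT_ID> | grep -iE "worker|resource_quota"
```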
## Investigation Best Practices
- Always start with error logs - Most obvious failures appear here
- Check error logs for patterns - Same DAG failing repeatedly? Timing patterns?
- Component-specific troubleshooting:
  - Worker logs → task execution details
  - Scheduler logs → DAG processing and scheduling
  - Webserver logs → UI issues and health checks
  - Triggerer logs → deferrable operator issues
- Use `--keyword` for targeted searches - More efficient than reading all logs
- The `inspect` command is your health dashboard - Check it first
- Environment variables in `inspect` output - May reveal configuration issues
- Log count default is 500 - Adjust with `-c` based on needs
- Don't forget to check deployment time - A recent deploy might have introduced the issue
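These practices can be bundled into a small first-pass triage script; a minimal sketch (the script name and structure are illustrative, and it assumes you are already authenticated to Astronomer):

```bash
#!/usr/bin/env bash
# triage.sh (hypothetical helper): first-pass checks for one deployment
set -euo pipefail

DEPLOYMENT_ID="$1"

echo "=== Health overview ==="
astro deployment inspect "$DEPLOYMENT_ID"

echo "=== Recent errors ==="
astro deployment logs "$DEPLOYMENT_ID" --error -c 50

echo "=== Scheduler activity ==="
astro deployment logs "$DEPLOYMENT_ID" --scheduler -c 30

echo "=== Environment variables ==="
astro deployment variable list --deployment-id "$DEPLOYMENT_ID"
```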
## Troubleshooting Quick Reference
| Symptom | Command |
|---|---|
| Deployment shows UNHEALTHY | `inspect` + `--error` logs |
| DAG not appearing | `--scheduler` logs for import errors, check `--error` logs |
| Tasks failing | `--workers` logs + search for the DAG with `--keyword` |
| Slow scheduling | `--scheduler` logs + check `inspect` for scheduler resources |
| UI not responding | `--webserver` logs |
| Connection issues | Check `variable list`, search logs for the connection name |
| Import errors | `--error` + `--keyword "ImportError"` logs |
| Out of memory | `inspect` for resources + `--error --keyword "memory"` |
## Related Skills
- `managing-astro-deployments`: Create, update, and delete deployments; deploy code
- `managing-astro-local-env`: Manage a local Airflow development environment
- `setting-up-astro-project`: Initialize and configure Astro projects