# Astro Deployment Troubleshooting

Troubleshoot Astronomer production deployments with the Astro CLI. Use this skill when investigating deployment issues, viewing production logs, analyzing failures, or managing deployment environment variables.

To get the skill, clone the agents repository; the skill file lives at `skills/troubleshooting-astro-deployments/skill.md`:

```bash
git clone https://github.com/astronomer/agents
```
This skill helps you diagnose and troubleshoot production Astronomer deployments using the Astro CLI.
For deployment management, see the managing-astro-deployments skill. For local development, see the managing-astro-local-env skill.
## Quick Health Check
Start with these commands to get an overview:
```bash
# 1. List deployments to find target
astro deployment list

# 2. Get deployment overview
astro deployment inspect <DEPLOYMENT_ID>

# 3. Check for errors
astro deployment logs <DEPLOYMENT_ID> --error -c 50
```
## Viewing Deployment Logs
Use `-c` to control log count (default: 500). Log flags cannot be combined; use one component or level flag per command.
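For example, pull scheduler logs and error logs as two separate calls rather than one combined call (a minimal illustration of the one-flag-per-command rule):

```bash
# Component and level flags go in separate invocations
astro deployment logs <DEPLOYMENT_ID> --scheduler -c 100
astro deployment logs <DEPLOYMENT_ID> --error -c 50

# Not supported: combining two component flags in one call
# astro deployment logs <DEPLOYMENT_ID> --scheduler --webserver
```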
### Component-Specific Logs
View logs from specific Airflow components:
```bash
# Scheduler logs (DAG processing, task scheduling)
astro deployment logs <DEPLOYMENT_ID> --scheduler -c 50

# Worker logs (task execution)
astro deployment logs <DEPLOYMENT_ID> --workers -c 30

# Webserver logs (UI access, health checks)
astro deployment logs <DEPLOYMENT_ID> --webserver -c 30

# Triggerer logs (deferrable operators)
astro deployment logs <DEPLOYMENT_ID> --triggerer -c 30
```
### Log Level Filtering
Filter by severity:
```bash
# Error logs only (most useful for troubleshooting)
astro deployment logs <DEPLOYMENT_ID> --error -c 30

# Warning logs
astro deployment logs <DEPLOYMENT_ID> --warn -c 50

# Info-level logs
astro deployment logs <DEPLOYMENT_ID> --info -c 50
```
### Search Logs
Search for specific keywords:
```bash
# Search for a specific error
astro deployment logs <DEPLOYMENT_ID> --keyword "ConnectionError"

# Search for a specific DAG
astro deployment logs <DEPLOYMENT_ID> --keyword "my_dag_name" -c 100

# Find import errors
astro deployment logs <DEPLOYMENT_ID> --error --keyword "ImportError"

# Find task failures
astro deployment logs <DEPLOYMENT_ID> --error --keyword "Task failed"
```
## Complete Investigation Workflow
### Step 1: Identify the Problem
```bash
# List deployments with status
astro deployment list

# Get deployment details
astro deployment inspect <DEPLOYMENT_ID>
```
Look for:
- Status: HEALTHY vs UNHEALTHY
- Runtime version compatibility
- Resource limits (CPU, memory)
- Recent deployment timestamp
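A quick way to spot-check these fields without reading the full output (a sketch assuming the default plain-text output; exact field names can vary by CLI version):

```bash
# Filter the inspect output down to health-related fields
astro deployment inspect <DEPLOYMENT_ID> | grep -iE "status|runtime_version|updated"
```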
### Step 2: Check Error Logs
```bash
# Start with errors
astro deployment logs <DEPLOYMENT_ID> --error -c 50
```
Look for:
- Recurring error patterns
- Specific DAGs failing repeatedly
- Import errors or syntax errors
- Connection or credential errors
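To surface recurring patterns without reading every line, a frequency count helps (a sketch assuming standard Unix tools):

```bash
# Rank the most frequent error lines to expose recurring failures
astro deployment logs <DEPLOYMENT_ID> --error -c 500 | sort | uniq -c | sort -rn | head -10
```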
### Step 3: Review Scheduler Logs
```bash
# Check DAG processing
astro deployment logs <DEPLOYMENT_ID> --scheduler -c 30
```
Look for:
- DAG parse errors
- Scheduling delays
- Task queueing issues
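Keyword searches against the scheduler logs can probe for these directly (the keywords below are illustrative starting points, not an exhaustive list):

```bash
# Parse errors surface as import failures in scheduler logs
astro deployment logs <DEPLOYMENT_ID> --scheduler --keyword "Failed to import" -c 50

# Queueing problems often mention queued task instances
astro deployment logs <DEPLOYMENT_ID> --scheduler --keyword "queued" -c 50
```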
### Step 4: Check Worker Logs
```bash
# Check task execution
astro deployment logs <DEPLOYMENT_ID> --workers -c 30
```
Look for:
- Task execution failures
- Resource exhaustion
- Timeout errors
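Keyword probes for the failure classes above (the exact strings depend on your operators and runtime, so treat these as starting points):

```bash
# Timeout errors
astro deployment logs <DEPLOYMENT_ID> --workers --keyword "Timeout" -c 50

# Resource exhaustion: OOM-killed tasks typically die with SIGKILL
astro deployment logs <DEPLOYMENT_ID> --error --keyword "SIGKILL"
```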
### Step 5: Verify Configuration
```bash
# Check environment variables
astro deployment variable list --deployment-id <DEPLOYMENT_ID>

# Verify deployment settings
astro deployment inspect <DEPLOYMENT_ID>
```
Look for:
- Missing or incorrect environment variables
- Secrets configuration (AIRFLOW__SECRETS__BACKEND)
- Connection configuration
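For example, to confirm a secrets backend is configured (the key name is taken from the list above):

```bash
# Check whether a secrets backend is set for this deployment
astro deployment variable list --deployment-id <DEPLOYMENT_ID> --key AIRFLOW__SECRETS__BACKEND
```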
## Common Investigation Patterns
### Recurring DAG Failures
Follow the complete investigation workflow above, then narrow to the specific DAG:
```bash
astro deployment logs <DEPLOYMENT_ID> --keyword "my_dag_name" -c 100
```
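If failures recur, it can help to capture a larger window once and analyze it locally rather than re-querying (a sketch assuming standard Unix tools; the file name is arbitrary):

```bash
# Capture a larger window of DAG-specific logs, then look for patterns
astro deployment logs <DEPLOYMENT_ID> --keyword "my_dag_name" -c 500 > my_dag_logs.txt

# How often does the task fail, and do the failures cluster in time?
grep -c "Task failed" my_dag_logs.txt
grep "Task failed" my_dag_logs.txt | head -20
```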
### Resource Issues
```bash
# 1. Check deployment resource allocation
astro deployment inspect <DEPLOYMENT_ID>
# Look for: resource_quota_cpu, resource_quota_memory
# Worker queue: max_worker_count, worker_type

# 2. Check for worker scaling issues
astro deployment logs <DEPLOYMENT_ID> --workers -c 50

# 3. Look for out-of-memory errors
astro deployment logs <DEPLOYMENT_ID> --error --keyword "memory"
```
### Configuration Problems
```bash
# 1. Review environment variables
astro deployment variable list --deployment-id <DEPLOYMENT_ID>

# 2. Check for secrets backend configuration
# Look for: AIRFLOW__SECRETS__BACKEND, AIRFLOW__SECRETS__BACKEND_KWARGS

# 3. Verify deployment settings
astro deployment inspect <DEPLOYMENT_ID>

# 4. Check webserver logs for auth issues
astro deployment logs <DEPLOYMENT_ID> --webserver -c 30
```
### Import Errors
```bash
# 1. Find import errors
astro deployment logs <DEPLOYMENT_ID> --error --keyword "ImportError"

# 2. Check scheduler for parse failures
astro deployment logs <DEPLOYMENT_ID> --scheduler --keyword "Failed to import" -c 50

# 3. Verify dependencies were deployed
astro deployment inspect <DEPLOYMENT_ID>
# Check: current_tag, last deployment timestamp
```
## Managing Environment Variables
### List Variables
```bash
# List all variables for a deployment
astro deployment variable list --deployment-id <DEPLOYMENT_ID>

# Find a specific variable
astro deployment variable list --deployment-id <DEPLOYMENT_ID> --key AWS_REGION

# Export variables to a file
astro deployment variable list --deployment-id <DEPLOYMENT_ID> --save --env .env.backup
```
### Create Variables
```bash
# Create a regular variable
astro deployment variable create --deployment-id <DEPLOYMENT_ID> \
  --key API_ENDPOINT \
  --value https://api.example.com

# Create a secret (masked in UI and logs)
astro deployment variable create --deployment-id <DEPLOYMENT_ID> \
  --key API_KEY \
  --value secret123 \
  --secret
```
### Update Variables
```bash
# Update an existing variable
astro deployment variable update --deployment-id <DEPLOYMENT_ID> \
  --key API_KEY \
  --value newsecret
```
### Delete Variables
```bash
# Delete a variable
astro deployment variable delete --deployment-id <DEPLOYMENT_ID> --key OLD_KEY
```
Note: Variables are available to DAGs as environment variables. Changes require no redeployment.
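A typical rotate-and-verify cycle, combining the commands above:

```bash
# Rotate a credential, then confirm the change is registered
astro deployment variable update --deployment-id <DEPLOYMENT_ID> \
  --key API_KEY \
  --value newsecret

# Verify the key exists (secret values appear masked in the output)
astro deployment variable list --deployment-id <DEPLOYMENT_ID> --key API_KEY
```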
## Key Metrics from `deployment inspect`

Focus on these fields when troubleshooting:
- `status`: HEALTHY vs UNHEALTHY
- `runtime_version`: Airflow version compatibility
- `scheduler_size` / `scheduler_count`: Scheduler capacity
- `executor`: CELERY, KUBERNETES, or LOCAL
- `worker_queues`: Worker scaling limits and types: `min_worker_count`, `max_worker_count`, `worker_concurrency`, `worker_type` (resource class)
- `resource_quota_cpu` / `resource_quota_memory`: Overall resource limits
- `dag_deploy_enabled`: Whether DAG-only deploys work
- `current_tag`: Last deployment version
- `is_high_availability`: Redundancy enabled
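Since the `inspect` output is plain text, standard tools can pull out just the scaling-related fields (field names as listed above; exact output format can vary by CLI version):

```bash
# Extract worker-queue and quota settings from the inspect output
astro deployment inspect <DEPLOYMENT_ID> | grep -iE "worker|resource_quota"
```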
## Investigation Best Practices
- Always start with error logs - Most obvious failures appear here
- Check error logs for patterns - Same DAG failing repeatedly? Timing patterns?
- Component-specific troubleshooting:
  - Worker logs → task execution details
  - Scheduler logs → DAG processing and scheduling
  - Webserver logs → UI issues and health checks
  - Triggerer logs → deferrable operator issues
- Use `--keyword` for targeted searches - More efficient than reading all logs
- The `inspect` command is your health dashboard - Check it first
- Environment variables in `inspect` output - May reveal configuration issues
- Log count default is 500 - Adjust with `-c` based on needs
- Don't forget to check deployment time - A recent deploy might have introduced the issue
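These practices can be bundled into a small first-pass triage script; a minimal sketch (the script name and structure are illustrative, and it assumes you are already authenticated to Astronomer):

```bash
#!/usr/bin/env bash
# triage.sh (hypothetical helper): first-pass checks for one deployment
set -euo pipefail

DEPLOYMENT_ID="$1"

echo "=== Health overview ==="
astro deployment inspect "$DEPLOYMENT_ID"

echo "=== Recent errors ==="
astro deployment logs "$DEPLOYMENT_ID" --error -c 50

echo "=== Scheduler activity ==="
astro deployment logs "$DEPLOYMENT_ID" --scheduler -c 30

echo "=== Environment variables ==="
astro deployment variable list --deployment-id "$DEPLOYMENT_ID"
```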
## Troubleshooting Quick Reference
| Symptom | Command |
|---|---|
| Deployment shows UNHEALTHY | `inspect` + `--error` logs |
| DAG not appearing | `--scheduler` logs for import errors, check `--error` logs |
| Tasks failing | `--workers` logs + search for the DAG with `--keyword` |
| Slow scheduling | `--scheduler` logs + check `inspect` for scheduler resources |
| UI not responding | `--webserver` logs |
| Connection issues | Check `variable list`, search logs for the connection name |
| Import errors | `--error` + `--keyword "ImportError"` logs |
| Out of memory | `inspect` for resources + `--error --keyword "memory"` |
## Related Skills
- `managing-astro-deployments`: Create, update, and delete deployments; deploy code
- `managing-astro-local-env`: Manage a local Airflow development environment
- `setting-up-astro-project`: Initialize and configure Astro projects