Gsd-skill-creator openstack-kolla-ansible-ops
Kolla-Ansible day-2 operations skill for post-deployment infrastructure lifecycle management. Covers service reconfiguration (globals.yml changes, config overrides, prechecks, targeted reconfigure with --tags), minor and major OpenStack upgrades (image pull, upgrade procedure, rollback), container management (restart, logs, health inspection), maintenance mode (compute disable, instance drain, host maintenance), password rotation, certificate renewal, and rolling updates. This skill is for operations after initial deployment -- the kolla-ansible deployment skill covers initial bootstrap and deploy.
git clone https://github.com/Tibsfox/gsd-skill-creator
T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/openstack/kolla-ansible-ops" ~/.claude/skills/tibsfox-gsd-skill-creator-openstack-kolla-ansible-ops && rm -rf "$T"
skills/openstack/kolla-ansible-ops/SKILL.md
Kolla-Ansible Day-2 Operations -- Infrastructure Lifecycle Management
Kolla-Ansible is not just a deployment tool -- it is the infrastructure lifecycle manager. The initial `kolla-ansible deploy` is a one-time event. Everything after that (reconfiguration, upgrades, container management, maintenance) is what this skill covers. Operators will use these procedures repeatedly throughout the cloud's operational life.
The operational command set:
| Command | Purpose | When to Use |
|---|---|---|
| `kolla-ansible reconfigure` | Regenerate configs and restart affected services | After globals.yml or config override changes |
| `kolla-ansible upgrade` | Pull new images and upgrade services | Minor or major OpenStack release upgrades |
| `kolla-ansible prechecks` | Validate configuration before applying | Always run before reconfigure or upgrade |
| `kolla-ansible stop` | Stop all or specific service containers | Maintenance, troubleshooting |
| `kolla-ansible deploy` | Deploy services (also used for redeployment) | After stop, or for new services |
| `kolla-ansible pull` | Pull container images without deploying | Pre-stage images before upgrade window |
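Every command in the table takes the same inventory flag, so a small shell wrapper can keep it consistent and allow a dry run. This is an illustrative sketch only: the `ka` function and the `KOLLA_INVENTORY` and `DRY_RUN` variables are hypothetical names, not part of Kolla-Ansible.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: KOLLA_INVENTORY and DRY_RUN are illustrative, not Kolla-Ansible settings.
KOLLA_INVENTORY="${KOLLA_INVENTORY:-/etc/kolla/inventory}"

# Run (or, with DRY_RUN set, only print) one of the kolla-ansible commands from the table.
ka() {
  case "${1:-}" in
    reconfigure|upgrade|prechecks|stop|deploy|pull) ;;
    *) echo "ka: unsupported command: ${1:-<none>}" >&2; return 2 ;;
  esac
  if [ -n "${DRY_RUN:-}" ]; then
    echo "kolla-ansible -i $KOLLA_INVENTORY $*"
  else
    kolla-ansible -i "$KOLLA_INVENTORY" "$@"
  fi
}
```

For example, `DRY_RUN=1 ka upgrade` prints the full command without running it, which is a cheap way to review the commands for a change window in advance.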
Relationship to the kolla-ansible deployment skill: The deployment skill (Phase 313) covers the initial bootstrap, deploy, and post-deploy verification that brings the cloud online for the first time. This skill picks up where deployment left off. They share the same inventory and globals.yml, but the operational context is different: deployment is about getting services running; operations is about keeping them running, changing their configuration, and upgrading them.
Container lifecycle in Kolla: Every OpenStack service runs in a Docker container managed by Kolla-Ansible. The container lifecycle is: image pull (from registry or local build) -> deploy (create container with config volumes) -> reconfigure (regenerate config, restart container) -> upgrade (pull new image, recreate container, run migrations). Understanding this lifecycle is essential for day-2 operations.
Deploy
Operational Prerequisites
Kolla-Ansible virtual environment maintenance:
```bash
# The Kolla-Ansible venv should be maintained separately from system Python
source /path/to/kolla-venv/bin/activate

# Verify the kolla-ansible version matches the deployed release
kolla-ansible --version
pip show kolla-ansible | grep Version
```
Inventory file management:
```bash
# Inventory defines which hosts run which services
#   For single-node: all-in-one inventory
#   For multi-node: multinode inventory with role assignments

# Verify inventory connectivity
ansible -i /etc/kolla/inventory all -m ping
```
globals.yml change tracking:
```bash
# globals.yml is the primary configuration file -- track all changes in git
cd /etc/kolla
git init   # If not already tracked
git add globals.yml
git commit -m "baseline: initial globals.yml configuration"

# Before every change, commit the current state
git add globals.yml
git commit -m "pre-change: document current state before <change-description>"
```
Password management:
```bash
# Initial password generation (done once during deployment).
# The generator is the separate kolla-genpwd tool, not a kolla-ansible subcommand.
kolla-genpwd
# Passwords are stored in /etc/kolla/passwords.yml
# After initial generation, rotate passwords manually
# Never regenerate all passwords on a running system
```
Image management:
```bash
# Check current image versions
docker images | grep kolla

# Pull the latest images for the current release
kolla-ansible -i /etc/kolla/inventory pull

# Pull images for a specific service
kolla-ansible -i /etc/kolla/inventory pull --tags nova
```
Configure
Service Reconfiguration Workflow
Reconfiguration is the most common operational action. It regenerates configuration files from templates and globals.yml, then restarts affected containers.
```bash
# Step 1: Make the change in globals.yml or config overrides
vim /etc/kolla/globals.yml

# Step 2: Validate the change
kolla-ansible -i /etc/kolla/inventory prechecks

# Step 3: Apply the change
kolla-ansible -i /etc/kolla/inventory reconfigure

# Step 4: Verify service health
openstack service list
openstack endpoint list
```
Partial Reconfiguration with --tags
Target specific services to minimize disruption:
```bash
# Reconfigure only Nova
kolla-ansible -i /etc/kolla/inventory reconfigure --tags nova

# Reconfigure only Neutron
kolla-ansible -i /etc/kolla/inventory reconfigure --tags neutron

# Reconfigure only Keystone
kolla-ansible -i /etc/kolla/inventory reconfigure --tags keystone

# Multiple services
kolla-ansible -i /etc/kolla/inventory reconfigure --tags nova,neutron
```
Config Validation Before Apply
Always run prechecks before reconfigure or upgrade:
```bash
# Full prechecks
kolla-ansible -i /etc/kolla/inventory prechecks

# Service-specific prechecks
kolla-ansible -i /etc/kolla/inventory prechecks --tags nova

# Prechecks verify:
#   - Container runtime is available
#   - Required images exist
#   - Configuration templates render without errors
#   - Port conflicts are detected
#   - Service dependencies are met
```
Custom Config Overrides
Kolla-Ansible supports per-service configuration overrides that persist across reconfigures:
```text
# Override directory structure
/etc/kolla/config/
  nova/
    nova.conf           # Merged into nova.conf
    nova-compute.conf   # Merged into nova-compute.conf
  neutron/
    neutron.conf        # Merged into neutron.conf
    ml2_conf.ini        # Merged into ml2_conf.ini
  keystone/
    keystone.conf       # Merged into keystone.conf
```
```ini
# Example: /etc/kolla/config/nova/nova.conf
[DEFAULT]
debug = True

[scheduler]
max_attempts = 5
```
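If `/etc/kolla/config/` is committed to the same git repository as globals.yml, the changed override files can suggest which `--tags` to pass to a partial reconfigure. This sketch assumes each override directory name matches a kolla-ansible tag, which holds for `nova`, `neutron`, and `keystone` but should be checked for other services. `tags_from_changed_paths` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper. Assumption: override directory name == kolla-ansible tag.
# Reads repo-relative paths (one per line) on stdin, e.g. "config/nova/nova.conf",
# and prints a comma-separated, de-duplicated tag list such as "neutron,nova".
tags_from_changed_paths() {
  sed -n 's|^config/\([^/]*\)/.*|\1|p' | sort -u | paste -s -d ',' -
}
```

For example, pipe `git -C /etc/kolla diff --name-only HEAD` into it and pass a non-empty result to `kolla-ansible -i /etc/kolla/inventory reconfigure --tags`. Changes to globals.yml itself produce no tag, so they still call for judgment about which services are affected.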
globals.yml Change Matrix
What changes require which operational action:
| Change Category | Example | Required Action |
|---|---|---|
| Service enable/disable | | (new service) or + remove |
| Backend change | | Full redeploy of affected services |
| Network config | | |
| Logging | | |
| TLS | | (generates certs, updates all services) |
| Allocation ratios | | |
| Image tag | | (not reconfigure) |
Operate
Service Reconfigure
Complete procedure for a configuration change:
- Document the change: `git -C /etc/kolla diff globals.yml`
- Commit pre-change state: `git -C /etc/kolla add globals.yml && git -C /etc/kolla commit -m "pre-change: <description>"`
- Make the change: edit globals.yml or config overrides
- Run prechecks: `kolla-ansible -i /etc/kolla/inventory prechecks`
- Run reconfigure: `kolla-ansible -i /etc/kolla/inventory reconfigure` (or with `--tags`)
- Verify service health: `openstack service list && openstack endpoint list`
- Commit post-change state: `git -C /etc/kolla add globals.yml && git -C /etc/kolla commit -m "post-change: <description>"`
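The procedure above can be wrapped in two shell functions, split at the point where the operator edits the files. This is a minimal sketch rather than a tested tool: `begin_change`, `apply_change`, `run`, and the `KOLLA_DIR`, `KOLLA_INVENTORY`, and `DRY_RUN` variables are hypothetical. The pre-change commit uses git's `--allow-empty` so it is recorded even when globals.yml has no pending edits.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the reconfigure procedure; adjust paths to your deployment.
KOLLA_DIR="${KOLLA_DIR:-/etc/kolla}"
KOLLA_INVENTORY="${KOLLA_INVENTORY:-$KOLLA_DIR/inventory}"

# Print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi
}

# Before editing: show pending edits and commit the current state.
begin_change() {
  local desc="${1:-}"
  [ -n "$desc" ] || { echo "usage: begin_change <description>" >&2; return 2; }
  run git -C "$KOLLA_DIR" diff -- globals.yml || return
  run git -C "$KOLLA_DIR" add globals.yml || return
  run git -C "$KOLLA_DIR" commit --allow-empty -m "pre-change: $desc"
}

# After editing: prechecks, reconfigure (optionally limited by tags), verify, commit.
# Each step runs only if the previous one succeeded.
apply_change() {
  local desc="${1:-}" tags="${2:-}"
  [ -n "$desc" ] || { echo "usage: apply_change <description> [tags]" >&2; return 2; }
  run kolla-ansible -i "$KOLLA_INVENTORY" prechecks || return
  if [ -n "$tags" ]; then
    run kolla-ansible -i "$KOLLA_INVENTORY" reconfigure --tags "$tags" || return
  else
    run kolla-ansible -i "$KOLLA_INVENTORY" reconfigure || return
  fi
  run openstack service list || return
  run openstack endpoint list || return
  run git -C "$KOLLA_DIR" add globals.yml || return
  run git -C "$KOLLA_DIR" commit -m "post-change: $desc"
}
```

A typical session is `begin_change "raise nova debug"`, then the edit, then `apply_change "raise nova debug" nova`. Prefixing either call with `DRY_RUN=1` lists the commands without running them.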
Minor Upgrade (Patch)
Within the same major OpenStack release (e.g., 2024.1 patch):
- Backup databases: `kolla-ansible -i /etc/kolla/inventory mariadb_backup`
- Pull new images: `kolla-ansible -i /etc/kolla/inventory pull`
- Run prechecks: `kolla-ansible -i /etc/kolla/inventory prechecks`
- Run upgrade: `kolla-ansible -i /etc/kolla/inventory upgrade`
- Verify all services: `openstack service list` -- all services should be `enabled` and endpoints reachable
- Run smoke test: create a test instance, verify networking, delete test instance
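The backup, pull, prechecks, and upgrade steps run strictly in sequence, and a failure at any of them should stop the upgrade. A sketch of that ordering follows. `minor_upgrade` and the `KA_BIN` and `OPENSTACK_BIN` variables are hypothetical (the variables exist only so the commands can be stubbed out), and the smoke test is left to the operator.

```bash
#!/usr/bin/env bash
# Hypothetical overrides so the commands can be stubbed; defaults call the real CLIs.
KOLLA_INVENTORY="${KOLLA_INVENTORY:-/etc/kolla/inventory}"
KA_BIN="${KA_BIN:-kolla-ansible}"
OPENSTACK_BIN="${OPENSTACK_BIN:-openstack}"

# Run the minor-upgrade steps in order and stop at the first one that fails.
minor_upgrade() {
  local step
  for step in mariadb_backup pull prechecks upgrade; do
    echo "==> kolla-ansible $step"
    if ! "$KA_BIN" -i "$KOLLA_INVENTORY" "$step"; then
      echo "minor_upgrade: '$step' failed; stopping before the next step" >&2
      return 1
    fi
  done
  # Verification: --long adds the Enabled column to the service list
  "$OPENSTACK_BIN" service list --long
}
```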
Major Upgrade
Across OpenStack releases (e.g., 2024.1 to 2024.2):
- Pre-upgrade backup:
  ```bash
  # Database backup
  kolla-ansible -i /etc/kolla/inventory mariadb_backup
  # Verify the backup
  ls -la /var/lib/docker/volumes/mariadb_backup/_data/
  # Configuration backup
  tar czf /root/kolla-config-backup-$(date +%Y%m%d).tar.gz /etc/kolla/
  ```
- Update Kolla-Ansible:
  ```bash
  pip install --upgrade kolla-ansible==<new-version>
  ```
- Update globals.yml:
  ```yaml
  openstack_release: "2024.2"
  ```
- Pull new images:
  ```bash
  kolla-ansible -i /etc/kolla/inventory pull
  ```
- Run prechecks:
  ```bash
  kolla-ansible -i /etc/kolla/inventory prechecks
  ```
- Execute upgrade:
  ```bash
  kolla-ansible -i /etc/kolla/inventory upgrade
  ```
- Post-upgrade verification:
  ```bash
  # Check all containers are running
  docker ps --format '{{.Names}} {{.Status}}' | sort
  # Check all services are healthy
  openstack service list
  openstack compute service list
  openstack network agent list
  openstack volume service list
  # Run an integration test (--wait blocks until the build finishes)
  openstack server create --flavor m1.small --image cirros --network tenant-net --wait upgrade-test
  openstack server delete upgrade-test
  ```
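The globals.yml update step, and any later rollback, both hinge on the `openstack_release` value. Below is a sketch of read and update helpers. It assumes globals.yml has exactly one uncommented `openstack_release:` line. `get_openstack_release` and `set_openstack_release` are hypothetical names. The update writes a `.bak` copy instead of relying on `sed -i`, whose flags differ between GNU and BSD sed.

```bash
#!/usr/bin/env bash
# Print the openstack_release value from a globals.yml file, without quotes.
get_openstack_release() {
  sed -n 's/^openstack_release:[[:space:]]*//p' "$1" | tr -d "\"'"
}

# Replace the openstack_release line, keeping the previous file as <file>.bak.
# Release names like "2024.2" contain no "/" and are safe in the sed replacement.
set_openstack_release() {
  local file="$1" release="$2"
  grep -q '^openstack_release:' "$file" || { echo "no openstack_release line in $file" >&2; return 1; }
  cp "$file" "$file.bak" || return
  sed "s/^openstack_release:.*/openstack_release: \"$release\"/" "$file.bak" > "$file"
}
```

For example, `set_openstack_release /etc/kolla/globals.yml 2024.2` performs the globals.yml update step, and the `.bak` file records the previous value in case a rollback is needed.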
Container Management
Restart an individual service:
```bash
# Restart a single container
docker restart nova_api
docker restart neutron_server

# Check container logs after restart
docker logs --tail 100 nova_api
docker logs --tail 100 --follow neutron_server

# Check container health
docker inspect --format='{{.State.Health.Status}}' keystone
docker inspect --format='{{.State.Status}}' nova_api
```
Inspect container configuration:
```bash
# View the running configuration (comments and blank lines removed)
docker exec nova_api cat /etc/nova/nova.conf | grep -v "^#" | grep -v "^$"

# Check container resource usage
docker stats --no-stream nova_api neutron_server keystone
```
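To check every container at once rather than one `docker inspect` at a time, a loop over `docker ps -a` can report anything that is not running or whose healthcheck says `unhealthy`. This is a sketch. `unhealthy_containers` and `DOCKER_BIN` are hypothetical, and containers without a healthcheck show `none` in the health column.

```bash
#!/usr/bin/env bash
# Hypothetical override so the function can be exercised without Docker.
DOCKER_BIN="${DOCKER_BIN:-docker}"

# Print "<name> <status> <health>" for each container that needs attention.
unhealthy_containers() {
  local name status health
  for name in $("$DOCKER_BIN" ps -a --format '{{.Names}}'); do
    status=$("$DOCKER_BIN" inspect --format '{{.State.Status}}' "$name")
    # .State.Health is absent for containers without a healthcheck
    health=$("$DOCKER_BIN" inspect --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' "$name")
    if [ "$status" != running ] || [ "$health" = unhealthy ]; then
      echo "$name $status $health"
    fi
  done
}
```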
Maintenance Mode
Procedure for compute host maintenance:
- Disable the compute service:
  ```bash
  openstack compute service set --disable --disable-reason "Scheduled maintenance" <hostname> nova-compute
  ```
- Drain instances (live migrate to other hosts):
  ```bash
  # List instances on the host
  openstack server list --host <hostname> --all-projects
  # Live migrate each instance (multi-node only)
  openstack server migrate --live-migration <instance-id>
  # For single-node: shut down instances instead
  openstack server stop <instance-id>
  ```
- Perform maintenance (hardware, OS updates, etc.)
- Re-enable the compute service:
  ```bash
  openstack compute service set --enable <hostname> nova-compute
  ```
- Verify:
  ```bash
  openstack compute service list   # Service should be enabled and up
  ```
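On a multi-node cloud, the disable and drain steps can be scripted. The sketch below assumes admin credentials are already loaded. `drain_host` and `OPENSTACK_BIN` are hypothetical, and `--wait` makes each live migration finish before the next starts. Single-node clouds should stop instances instead, as described above.

```bash
#!/usr/bin/env bash
# Hypothetical override so the function can be exercised without a cloud.
OPENSTACK_BIN="${OPENSTACK_BIN:-openstack}"

# Disable nova-compute on a host, then live-migrate every instance off it.
# Returns 1 if any migration failed, so the host is not treated as empty.
drain_host() {
  local host="${1:-}" id failed=0
  [ -n "$host" ] || { echo "usage: drain_host <hostname>" >&2; return 2; }
  "$OPENSTACK_BIN" compute service set --disable --disable-reason "Scheduled maintenance" "$host" nova-compute || return
  for id in $("$OPENSTACK_BIN" server list --host "$host" --all-projects -f value -c ID); do
    echo "migrating $id"
    "$OPENSTACK_BIN" server migrate --live-migration --wait "$id" || { echo "failed: $id" >&2; failed=1; }
  done
  return "$failed"
}
```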
Password Rotation
```bash
# 1. Edit passwords.yml with new passwords for specific services
#    (change only the passwords you need to rotate)
vim /etc/kolla/passwords.yml

# 2. Run reconfigure to apply the new passwords
kolla-ansible -i /etc/kolla/inventory reconfigure

# 3. Verify all services reconnect with the new credentials
openstack service list
openstack endpoint list

# 4. Test authentication
openstack token issue
```
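Editing passwords.yml by hand risks changing the wrong line. The sketch below replaces one top-level key and keeps a timestamped backup. `set_password` is hypothetical and does not handle nested entries such as SSH key pairs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: replace one top-level "key: value" line in passwords.yml.
# Values are passed through awk -v, so avoid backslashes in the new value.
set_password() {
  local file="$1" key="$2" value="$3" backup
  grep -q "^$key:" "$file" || { echo "key not found: $key" >&2; return 1; }
  backup="$file.$(date +%Y%m%d%H%M%S).bak"
  cp -p "$file" "$backup" || return
  # Match the exact key followed by a colon, so "nova" never matches "nova_api_..."
  awk -v k="$key" -v v="$value" 'index($0, k ":") == 1 { print k ": " v; next } { print }' "$backup" > "$file"
}
```

For example, `set_password /etc/kolla/passwords.yml <key> "$(openssl rand -hex 20)"` before running reconfigure.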
Certificate Renewal
```bash
# 1. Generate or obtain new certificates and place them in /etc/kolla/certificates/

# 2. Update globals.yml if certificate paths changed, e.g.:
#    kolla_external_fqdn_cert: /etc/kolla/certificates/haproxy.pem
#    kolla_internal_fqdn_cert: /etc/kolla/certificates/haproxy-internal.pem

# 3. Reconfigure HAProxy and affected services
kolla-ansible -i /etc/kolla/inventory reconfigure --tags haproxy

# 4. Verify TLS connectivity
openssl s_client -connect <api-endpoint>:5000 -brief
curl -v https://<api-endpoint>:5000/v3/
```
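Step 1 can be triggered by an expiry check instead of a calendar reminder. The sketch below is built on `openssl x509 -checkend`, which exits 0 when the certificate will still be valid the given number of seconds from now. `cert_needs_renewal`, `OPENSSL_BIN`, and the 30-day default are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical override so the function can be exercised without real certificates.
OPENSSL_BIN="${OPENSSL_BIN:-openssl}"

# Usage: cert_needs_renewal <pem-file> [days]
# Exit 0 = renew now, 1 = still valid for the window, 2 = file not readable.
cert_needs_renewal() {
  local pem="$1" days="${2:-30}"
  [ -r "$pem" ] || { echo "cannot read $pem" >&2; return 2; }
  if "$OPENSSL_BIN" x509 -checkend $((days * 86400)) -noout -in "$pem" >/dev/null; then
    return 1
  fi
  return 0
}
```

For example, `cert_needs_renewal /etc/kolla/certificates/haproxy.pem 30 && echo "renew haproxy.pem"`.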
Troubleshoot
Reconfigure Fails Midway
Symptoms: `kolla-ansible reconfigure` exits with an error partway through. Some services have the new config, others still have the old config.
Resolution steps:
- Check the Ansible output for the specific task that failed -- note the service name and error
- Fix the underlying issue (config syntax, missing dependency, connectivity)
- Rerun with the specific tag: `kolla-ansible -i /etc/kolla/inventory reconfigure --tags <service>`
- If a single container is stuck, restart it manually: `docker restart <container-name>`
- After resolution, run full prechecks to verify consistency: `kolla-ansible -i /etc/kolla/inventory prechecks`
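When several services may be affected, rerunning reconfigure one tag at a time shows exactly which ones still fail. The sketch below does this. `reconfigure_each`, `KA_BIN`, and `KOLLA_INVENTORY` are hypothetical, and full prechecks run only once every tag succeeds.

```bash
#!/usr/bin/env bash
# Hypothetical overrides; defaults call the real CLI with the inventory used above.
KOLLA_INVENTORY="${KOLLA_INVENTORY:-/etc/kolla/inventory}"
KA_BIN="${KA_BIN:-kolla-ansible}"

# Usage: reconfigure_each <tag> [tag...]
reconfigure_each() {
  local tag failed=""
  for tag in "$@"; do
    if "$KA_BIN" -i "$KOLLA_INVENTORY" reconfigure --tags "$tag"; then
      echo "ok: $tag"
    else
      echo "FAILED: $tag" >&2
      failed="$failed $tag"
    fi
  done
  [ -z "$failed" ] || { echo "still failing:$failed" >&2; return 1; }
  "$KA_BIN" -i "$KOLLA_INVENTORY" prechecks
}
```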
Upgrade Fails on Prechecks
Symptoms: `kolla-ansible prechecks` reports errors before the upgrade can begin.
Resolution steps:
- Read the specific precheck error message -- common causes:
  - Version incompatibility: the current release cannot upgrade directly to the target (skipping releases is generally not supported; check the release notes)
  - Unsupported configuration: a deprecated option in globals.yml
  - Missing migration: a database schema migration is required before the upgrade
- Check Kolla-Ansible release notes for the target version -- look for breaking changes and required migration steps
- Fix globals.yml: remove deprecated options, add required new options
- Rerun prechecks: `kolla-ansible -i /etc/kolla/inventory prechecks`
Container Won't Start After Upgrade
Symptoms: After `kolla-ansible upgrade`, one or more containers fail to start or crash on startup.
Resolution steps:
- Check container logs: `docker logs --tail 200 <container-name>` -- look for config parse errors or missing dependencies
- Check image version: `docker inspect <container-name> --format='{{.Config.Image}}'` -- verify the correct version was pulled
- If image pull failed: `kolla-ansible -i /etc/kolla/inventory pull --tags <service>`, then `kolla-ansible -i /etc/kolla/inventory upgrade --tags <service>`
- If config incompatibility: check `/etc/kolla/config/<service>/` for overrides that conflict with the new version
- If database migration needed: check service logs for migration errors, run migration manually if needed
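For the image-version check, comparing every container's image tag with the target release can surface containers that were not recreated. This sketch assumes image references end in `:<tag>` and that tags begin with the `openstack_release` value (for example `2024.2` or `2024.2-ubuntu-jammy`). Adjust the match for your registry's tagging scheme. `stale_images` and `DOCKER_BIN` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical override so the function can be exercised without Docker.
DOCKER_BIN="${DOCKER_BIN:-docker}"

# Usage: stale_images <expected-release>; prints "<name> <image>" for mismatches.
stale_images() {
  local expected="$1" name image tag
  for name in $("$DOCKER_BIN" ps -a --format '{{.Names}}'); do
    image=$("$DOCKER_BIN" inspect --format '{{.Config.Image}}' "$name")
    tag="${image##*:}"   # text after the last colon, i.e. the tag
    case "$tag" in
      "$expected"|"$expected"-*) ;;
      *) echo "$name $image" ;;
    esac
  done
}
```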
Service Unhealthy After Reconfigure
Symptoms: Service containers are running but health checks fail, API returns errors, or service cannot authenticate to Keystone.
Resolution steps:
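- Check service logs for config errors: `docker logs --tail 100 <service>_server` (container names vary by service, e.g. `neutron_server`, `nova_api`, `keystone`)
- Verify Keystone connectivity: `docker exec <service>_server curl -s http://<internal-vip>:5000/v3/` (Kolla service containers use host networking, so reach Keystone through the internal VIP or FQDN rather than a container name)
- Check config override syntax: `docker exec <service>_server cat /etc/<service>/<service>.conf | grep -i error` -- look for malformed INI entries
- Verify database connectivity: `docker exec <service>_server mysql -h <database-address> -u <service> -p<password> -e "SELECT 1"` (only where the container includes a MySQL client)
- If a config override is the problem: fix the file in `/etc/kolla/config/<service>/` and rerun `reconfigure --tags <service>`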
Rollback After Failed Upgrade
Symptoms: Upgrade failed partway through and the cloud is in an inconsistent state. Need to restore previous working state.
Resolution steps:
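- Stop all services: `kolla-ansible -i /etc/kolla/inventory stop`
- Restore database from backup:
  ```bash
  # Find the backup
  ls /var/lib/docker/volumes/mariadb_backup/_data/
  # Restoring the backup itself is a manual Mariabackup procedure; follow the
  # Kolla-Ansible MariaDB backup and restore documentation for your release.
  # mariadb_recovery does not restore a backup: it restarts a fully stopped
  # Galera cluster, which may be needed once the data is back in place.
  kolla-ansible -i /etc/kolla/inventory mariadb_recovery
  ```
- Revert globals.yml: `git -C /etc/kolla checkout HEAD~1 -- globals.yml`
- Reinstall the previous Kolla-Ansible release (the upgrade replaced it): `pip install kolla-ansible==<previous-version>`
- Pull previous release images:
  ```bash
  # openstack_release must be back at the previous version (the globals.yml revert does this)
  kolla-ansible -i /etc/kolla/inventory pull
  ```
- Redeploy: `kolla-ansible -i /etc/kolla/inventory deploy`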
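Before the restore step, it helps to identify the newest backup explicitly rather than by eye. The sketch below assumes backups are plain files in the volume path used above. `latest_backup` and `BACKUP_DIR` are hypothetical, and the result is only an input to the manual restore procedure.

```bash
#!/usr/bin/env bash
# Hypothetical default: the MariaDB backup volume path used in this skill.
BACKUP_DIR="${BACKUP_DIR:-/var/lib/docker/volumes/mariadb_backup/_data}"

# Print the path of the most recently modified entry in the backup directory.
latest_backup() {
  local dir="${1:-$BACKUP_DIR}" newest
  newest=$(ls -t "$dir" 2>/dev/null | head -n 1)
  [ -n "$newest" ] || { echo "no backups found in $dir" >&2; return 1; }
  echo "$dir/$newest"
}
```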
Ansible Connection Failures
Symptoms: Kolla-Ansible commands fail before any container operations with SSH or become errors.
Resolution steps:
- Check SSH connectivity: `ssh <inventory-host>` -- verify SSH key auth works
- Check become password: if `ansible_become_pass` is required, verify it in the inventory or pass it via `--ask-become-pass`
- Check inventory file: verify hostnames/IPs in `/etc/kolla/inventory` match actual reachable hosts
- Check Ansible version compatibility: `ansible --version` -- must be compatible with the Kolla-Ansible release
- Check Python on target: Kolla-Ansible requires Python on managed hosts; verify with `ssh <host> python3 --version`
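The SSH and Python checks can be run against every host in one pass. This is a sketch. `check_hosts` and `SSH_BIN` are hypothetical, and `BatchMode=yes` makes ssh fail instead of prompting, which surfaces missing key-based authentication.

```bash
#!/usr/bin/env bash
# Hypothetical override so the function can be exercised without real hosts.
SSH_BIN="${SSH_BIN:-ssh}"

# Usage: check_hosts <host> [host...]; returns 1 if any host fails.
check_hosts() {
  local host rc=0
  for host in "$@"; do
    if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$host" python3 --version >/dev/null 2>&1; then
      echo "ok: $host"
    else
      echo "UNREACHABLE or no python3: $host"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, `check_hosts $(ansible -i /etc/kolla/inventory all --list-hosts | tail -n +2)` checks every host named in the inventory.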
Integration Points
- Kolla-ansible (deployment) skill: Shares the same inventory file, globals.yml, and passwords.yml. This skill picks up where the deployment skill leaves off. The deployment skill covers the initial `bootstrap-servers` and `deploy` commands; this skill covers everything that happens after the cloud is running. Both skills reference the same Kolla-Ansible documentation and configuration patterns.
- Backup skill: Every upgrade and reconfigure operation should be preceded by a backup. The backup skill provides the procedures for database dumps, configuration archives, and volume snapshots. This skill references those procedures at the start of every upgrade workflow.
- Monitoring skill: After every operational change (reconfigure, upgrade, restart), verify service health through monitoring dashboards. The monitoring skill provides the metrics and alerts; this skill consumes those health signals to confirm operations succeeded.
- Security skill: Certificate renewal triggers reconfigure operations. Password rotation requires updating passwords.yml and running reconfigure. The security skill defines the security policies; this skill implements the operational procedures to apply those policies.
- All core skills: Every OpenStack service is managed through Kolla-Ansible containers. A reconfigure operation affects the service's configuration; an upgrade changes its container image. Changes to any service follow the procedures in this skill.
- EXEC agent: The primary consumer for infrastructure changes. When the EXEC agent needs to modify the running cloud (change a configuration, upgrade a service, perform maintenance), it uses the procedures in this skill to execute those changes safely and verifiably.
NASA SE Cross-References
| SE Phase | Kolla-Ansible Ops Activity | Reference |
|---|---|---|
| Phase C (Final Design) | Configuration management: all Kolla-Ansible configurations (globals.yml, inventory, passwords.yml, config overrides) are version-controlled and baselined. Every change produces a documented commit with rationale. | SP-6105 SS 6.5 (Configuration Management -- change control process) |
| Phase E (Operations) | Maintenance and upgrade operations: service reconfiguration, patch upgrades, major upgrades, container management, maintenance mode. Each operation follows a documented procedure with prechecks, execution, and verification steps. | SP-6105 SS 5.4 (Product Validation -- operational maintenance) |
| Phase E (Sustainment) | Infrastructure lifecycle: the ongoing cycle of monitor -> identify change -> plan change -> precheck -> execute -> verify -> document that keeps the cloud healthy and current. This cycle maps directly to NASA's sustainment process. | NPR 7123.1 SS 5.4 (Sustainment -- operational baseline evolution) |