Error Handling in Ansible
Understanding Ansible Error Handling
By default, Ansible stops running tasks on a host as soon as a task fails there, but keeps going on the other hosts. Its error-handling keywords let you change that behavior to control playbook flow, handle failures gracefully, and keep automation reliable. They cover four needs:
- Resilience: Continue execution despite non-critical failures
- Rollback: Implement recovery procedures when tasks fail
- Cleanup: Ensure cleanup tasks always run
- Custom Logic: Define custom success/failure conditions
Basic Error Handling
1. Ignoring Errors
Use ignore_errors to continue execution even when a task fails:
- name: This task might fail but that's okay
  command: /bin/false
  ignore_errors: yes

- name: Continue with next task
  debug:
    msg: "This runs even if previous task failed"

# Practical example
- name: Try to stop service (might not exist)
  systemd:
    name: optional_service
    state: stopped
  ignore_errors: yes
ignore_errors only works when tasks execute but return a failed status. It won't suppress undefined variable errors, connection failures, or syntax errors.
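Because ignore_errors does not cover undefined variables, those need a guard of their own. The following is a minimal sketch of two common guards; the variable names deploy_env and release_tag are hypothetical and exist only for illustration.

```yaml
# deploy_env and release_tag are made-up variable names for this sketch
- name: Fall back to a default when the variable is undefined
  debug:
    msg: "Deploying to {{ deploy_env | default('staging') }}"

- name: Skip the task when the variable is missing
  debug:
    msg: "Release tag is {{ release_tag }}"
  when: release_tag is defined
```

The first task always runs and substitutes a fallback value; the second is skipped rather than failed when the variable is absent.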
2. Ignoring Unreachable Hosts
Continue execution when hosts are unreachable:
- name: Task that continues despite unreachable hosts
  command: /bin/true
  ignore_unreachable: yes

# Can be set at play level
- hosts: all
  ignore_unreachable: yes
  tasks:
    - name: All tasks ignore unreachable
      ping:
3. Resetting Unreachable Hosts
Reactivate hosts that were marked as unreachable:
- name: Try to reach all hosts
  ping:
  ignore_unreachable: yes

- name: Clear unreachable status
  meta: clear_host_errors

- name: Try again on previously unreachable hosts
  ping:
Defining Failure Conditions
failed_when
Customize what constitutes a failure based on task output or return codes:
# Fail based on output content
- name: Check for errors in command output
  command: /usr/bin/mycommand
  register: result
  failed_when: "'ERROR' in result.stderr"

# Fail on specific return codes
# (diff exits 0 when the files are identical, 1 when they differ, 2 or more on error)
- name: Custom return code handling
  command: /usr/bin/diff file1 file2
  register: diff_result
  failed_when: diff_result.rc == 0 or diff_result.rc >= 2

# Never fail (alternative to ignore_errors)
- name: This task never fails
  command: /bin/might_fail
  failed_when: false

# Multiple failure conditions (list items are combined with AND)
- name: Complex failure logic
  shell: /usr/bin/deploy.sh
  register: deploy
  failed_when:
    - deploy.rc != 0
    - "'warning' not in deploy.stdout"
Practical Examples
# Example 1: Check service status without failing
- name: Check if service is running
  command: systemctl is-active myservice
  register: service_status
  failed_when: false
  changed_when: false

- debug:
    msg: "Service is {{ 'running' if service_status.rc == 0 else 'stopped' }}"

# Example 2: Fail only on critical errors
- name: Run application tests
  command: /opt/app/run_tests.sh
  register: test_results
  failed_when:
    - test_results.rc != 0
    - "'CRITICAL' in test_results.stderr"

# Example 3: Grep without failing when no matches
- name: Search logs for pattern
  shell: grep -i "pattern" /var/log/app.log
  register: grep_result
  failed_when: grep_result.rc > 1  # rc=1 means no matches, rc>1 is error
  changed_when: false
Defining Changed Status
changed_when
Control when tasks report changes and trigger handlers:
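# Never report changed
- name: Read-only operation
  command: cat /etc/hosts
  changed_when: false

# Custom changed condition
# rc is 0 whether the directory already existed or was just created,
# so print a marker on creation and test for that instead
- name: Check and create directory
  shell: test -d /mydir || (mkdir /mydir && echo created)
  register: dir_check
  changed_when: "'created' in dir_check.stdout"

# Based on output content
- name: Run command
  shell: /usr/bin/process.sh
  register: process_result
  changed_when: "'updated' in process_result.stdout"

# Idempotent shell commands
# rc is 0 in both the "already present" and the "appended" case,
# so report changed only when the marker is printed
- name: Add line to file if not present
  shell: grep -q "line" /etc/file || (echo "line" >> /etc/file && echo added)
  register: line_add
  changed_when: "'added' in line_add.stdout"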
Blocks with Rescue and Always
Blocks provide exception-handling similar to try-catch-finally in programming languages:
Basic Block Structure
- name: Handle errors with block
  block:
    # Try these tasks
    - name: Task that might fail
      command: /usr/bin/risky_operation

    - name: Another task
      debug:
        msg: "This runs if previous task succeeds"
  rescue:
    # Run these if any task in block fails
    - name: Recovery task
      debug:
        msg: "Something went wrong, recovering..."

    - name: Send notification
      mail:
        to: admin@example.com
        subject: "Deployment failed"
  always:
    # Always run these, regardless of success or failure
    - name: Cleanup
      file:
        path: /tmp/deployment
        state: absent

    - name: Log completion
      debug:
        msg: "Deployment process completed"
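Inside a rescue section, Ansible exposes the variables ansible_failed_task and ansible_failed_result, which describe the task that triggered the rescue. The following is a minimal sketch of using them for reporting; the failing command is deliberately artificial.

```yaml
- name: Report which task failed
  block:
    - name: Task that fails on purpose
      command: /bin/false
  rescue:
    # ansible_failed_task and ansible_failed_result are set by Ansible
    # for the task that triggered this rescue section
    - name: Show the failing task and its result
      debug:
        msg: "Task '{{ ansible_failed_task.name }}' failed: {{ ansible_failed_result.msg | default('no message') }}"
```

This keeps the notification generic, so the same rescue section can serve a block with many tasks without hard-coding which one failed.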
Practical Block Examples
# Example 1: Database backup with rollback
- block:
    - name: Backup database
      shell: pg_dump mydb > /backup/mydb.sql

    - name: Apply migrations
      command: /opt/app/migrate.sh

    - name: Restart application
      systemd:
        name: myapp
        state: restarted
  rescue:
    - name: Restore from backup
      shell: psql mydb < /backup/mydb.sql

    - name: Alert admins
      debug:
        msg: "Migration failed, database restored from backup"

    - fail:
        msg: "Deployment aborted due to migration failure"
  always:
    - name: Remove temporary files
      file:
        path: /tmp/migration_temp
        state: absent
# Example 2: Network configuration with rollback
- block:
    - name: Backup network config
      copy:
        src: /etc/network/interfaces
        dest: /tmp/interfaces.backup
        remote_src: yes

    - name: Apply new network config
      template:
        src: interfaces.j2
        dest: /etc/network/interfaces

    - name: Restart networking
      systemd:
        name: networking
        state: restarted

    - name: Test connectivity
      wait_for:
        host: 8.8.8.8
        port: 53
        timeout: 10
  rescue:
    - name: Restore old config
      copy:
        src: /tmp/interfaces.backup
        dest: /etc/network/interfaces
        remote_src: yes

    - name: Restart networking
      systemd:
        name: networking
        state: restarted

    - fail:
        msg: "Network configuration failed, reverted to backup"
Nested Blocks
- block:
    - name: Outer task
      debug:
        msg: "Outer block"

    - block:
        - name: Inner task that might fail
          command: /bin/false
      rescue:
        - name: Inner rescue
          debug:
            msg: "Inner block failed"
      always:
        - name: Inner cleanup
          debug:
            msg: "Inner always runs"
  rescue:
    - name: Outer rescue
      debug:
        msg: "Outer block failed"
  always:
    - name: Outer cleanup
      debug:
        msg: "Outer always runs"
Play-Level Error Controls
any_errors_fatal
Stop the play on every host as soon as a task fails on any one host. Ansible finishes the failing task on the hosts in the current batch, then ends the play:
- hosts: all
  any_errors_fatal: true
  tasks:
    - name: Critical task
      command: /usr/bin/critical_operation
      # If this fails on ANY host, stop ENTIRE play

# Practical example: Load balancer scenario
- hosts: load_balancers
  any_errors_fatal: true
  tasks:
    - name: Disable datacenter
      command: /usr/bin/disable-dc
      # Must succeed on all load balancers or abort

- hosts: webservers
  tasks:
    - name: Deploy updates
      # Only runs if all load balancers succeeded
      debug:
        msg: "Placeholder for the real deployment tasks"
max_fail_percentage
Abort the play when a percentage of hosts fail (useful for rolling updates):
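- hosts: webservers
  max_fail_percentage: 30
  serial: 10
  tasks:
    - name: Update application
      # Abort if more than 30% of the hosts in a batch fail
      debug:
        msg: "Placeholder for the real update tasks"

# Examples with different host counts (counted per batch when serial is set)
# 10 hosts with max_fail_percentage: 30 = abort after 4 failures
# 20 hosts with max_fail_percentage: 25 = abort after 6 failures

# Important: percentage must be EXCEEDED, not equaled
# For 4 hosts to abort at 2 failures, use 49% not 50%
# This assumes the databases group holds 4 hosts. serial is left out on purpose:
# with serial: 1 each batch is a single host, so one failure is already 100%
- hosts: databases
  max_fail_percentage: 49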
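The thresholds quoted above follow from one rule: the play aborts once the share of failed hosts in a batch is strictly greater than max_fail_percentage. The smallest failure count that does so is floor(hosts × percentage / 100) + 1. The play below is a hypothetical helper, not part of any real deployment, that prints this number for a few cases; the keys n and pct are names chosen for this sketch.

```yaml
# Hypothetical helper play: prints how many failed hosts abort a batch
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Show the smallest failure count that exceeds the percentage
      debug:
        # // is integer (floor) division in Jinja2
        msg: "{{ item.n }} hosts at {{ item.pct }}%: abort at {{ (item.n * item.pct // 100) + 1 }} failures"
      loop:
        - { n: 10, pct: 30 }
        - { n: 20, pct: 25 }
        - { n: 4, pct: 49 }
        - { n: 4, pct: 50 }
```

Run with ansible-playbook, this should report 4, 6, 2, and 3 failures respectively, which matches the comments above and shows why 50% is too lenient for a group of four.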
Handler Error Handling
Force Handler Execution
# By default, notified handlers do not run on a host once a later task fails there
# Force handlers to run even on failure

# In ansible.cfg
[defaults]
force_handlers = True

# Or in playbook
- hosts: all
  force_handlers: yes
  tasks:
    - name: Update config
      template:
        src: app.conf.j2
        dest: /etc/app.conf
      notify: Restart app

    - name: This might fail
      command: /bin/false
      # "Restart app" handler still runs

# Or via command line
ansible-playbook playbook.yml --force-handlers
Flush Handlers Early
- name: Update configuration
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: Reload nginx

# Flush handlers immediately instead of waiting until end
- meta: flush_handlers

- name: Test nginx is working
  uri:
    url: http://localhost
    status_code: 200
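The tasks above notify a handler named Reload nginx without showing where it is defined. As a rough sketch, a complete play might look like the following; the webservers group and the handler body are assumptions made for illustration.

```yaml
# Sketch of a full play; the host group and handler body are assumed
- hosts: webservers
  tasks:
    - name: Update configuration
      template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: Reload nginx

    # Run any notified handlers now, before the check below
    - meta: flush_handlers

    - name: Test nginx is working
      uri:
        url: http://localhost
        status_code: 200

  handlers:
    - name: Reload nginx
      systemd:
        name: nginx
        state: reloaded
```

Flushing before the uri check means the test exercises the new configuration rather than the one nginx had loaded earlier.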
Advanced Error Handling Patterns
Retry Logic
- name: Retry until success
  uri:
    url: http://api.example.com/status
    status_code: 200
  register: api_result
  until: api_result.status == 200
  retries: 5
  delay: 10

# Retry with complex conditions
- name: Wait for deployment
  shell: /usr/bin/check_deployment.sh
  register: deploy_check
  until:
    - deploy_check.rc == 0
    - "'READY' in deploy_check.stdout"
  retries: 30
  delay: 10
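When a task uses until, its registered result normally carries an attempts counter, which can be handy for spotting services that are slow to come up. The sketch below reuses the first retry example and reports that counter; the default filter is a precaution in case the key is absent.

```yaml
- name: Retry until the API answers
  uri:
    url: http://api.example.com/status
    status_code: 200
  register: api_result
  until: api_result.status == 200
  retries: 5
  delay: 10

# attempts is expected in the result of a task that uses until;
# default(1) covers the case where it is missing
- name: Report how many tries were needed
  debug:
    msg: "API responded after {{ api_result.attempts | default(1) }} attempt(s)"
```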
Conditional Failure
- name: Check required conditions
  block:
    - name: Verify disk space
      shell: df -h / | tail -1 | awk '{print $5}' | sed 's/%//'
      register: disk_usage

    - name: Fail if disk is too full
      fail:
        msg: "Insufficient disk space: {{ disk_usage.stdout }}% used"
      when: disk_usage.stdout | int > 90

    - name: Verify memory
      shell: free -m | grep Mem | awk '{print int($3/$2*100)}'
      register: mem_usage

    - name: Fail if memory is too high
      fail:
        msg: "High memory usage: {{ mem_usage.stdout }}%"
      when: mem_usage.stdout | int > 95
Assert for Validation
- name: Validate prerequisites
  assert:
    that:
      - ansible_distribution == "Ubuntu"
      - ansible_distribution_major_version | int >= 20
      - ansible_memtotal_mb >= 4096
    fail_msg: "System does not meet requirements"
    success_msg: "All prerequisites validated"

# Multiple assertions with custom messages
- name: Validate variables
  assert:
    that:
      - db_password is defined
      - db_password | length >= 8
      - app_port | int > 1024
      - app_port | int <= 65535  # 65535 is the highest valid port
    fail_msg: "Configuration validation failed"
    quiet: true  # Don't show assertion details
Graceful Degradation
- name: Try primary database
  postgresql_query:
    db: mydb
    query: SELECT 1
  register: primary_db
  ignore_errors: yes

- name: Use secondary database if primary fails
  postgresql_query:
    db: mydb
    query: SELECT 1
    login_host: "{{ secondary_db_host }}"
  when: primary_db is failed
  register: secondary_db
  # Without this, a failure here stops the host before the final check runs
  ignore_errors: yes

- name: Fail if both databases are down
  fail:
    msg: "All database connections failed"
  when:
    - primary_db is failed
    - secondary_db is failed
Error Handling Best Practices
- Use Blocks for Related Tasks: Group tasks that should succeed or fail together
- Always Clean Up: Use the always section for cleanup tasks
- Be Specific with failed_when: Define precise failure conditions
- Avoid Overusing ignore_errors: Handle errors explicitly when possible
- Test Rollback Procedures: Verify rescue blocks work as expected
- Log Errors: Record failures for troubleshooting
- Use Assertions Early: Validate prerequisites before running tasks
- Implement Retry Logic: For network operations and external services
Common Error Handling Patterns
Pattern 1: Idempotent Shell Commands
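# rc is 0 both when the line already exists and when it is appended,
# so a marker on stdout drives changed_when. Real errors are left to fail.
- name: Ensure line in file
  shell: grep -q "{{ line }}" /etc/config || (echo "{{ line }}" >> /etc/config && echo added)
  register: line_result
  changed_when: "'added' in line_result.stdout"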
Pattern 2: Optional Dependencies
- name: Install optional package
  apt:
    name: optional-tool
    state: present
  register: optional_install
  ignore_errors: yes

- name: Set feature flag based on installation
  set_fact:
    optional_feature_enabled: "{{ optional_install is succeeded }}"
Pattern 3: Pre-flight Checks
- name: Pre-flight validation
  block:
    - name: Check all required variables
      assert:
        that:
          # item is only the variable's name, so look the variable up by that name
          - lookup('vars', item, default='') | string | length > 0
        fail_msg: "Required variable {{ item }} is not defined or is empty"
      loop:
        - db_host
        - db_name
        - db_password

    - name: Test database connectivity
      wait_for:
        host: "{{ db_host }}"
        port: 5432
        timeout: 10
  rescue:
    - name: Abort on validation failure
      fail:
        msg: "Pre-flight checks failed, aborting deployment"
Pattern 4: Rollback on Failure
- name: Deploy with automatic rollback
  block:
    - name: Get current version
      shell: cat /opt/app/VERSION
      register: current_version

    - name: Deploy new version
      unarchive:
        src: "/releases/app-{{ new_version }}.tar.gz"
        dest: /opt/app

    - name: Run smoke tests
      command: /opt/app/smoke-tests.sh
  rescue:
    - name: Rollback to previous version
      unarchive:
        src: "/releases/app-{{ current_version.stdout }}.tar.gz"
        dest: /opt/app

    - name: Restart with old version
      systemd:
        name: app
        state: restarted

    - fail:
        msg: "Deployment failed, rolled back to {{ current_version.stdout }}"
Troubleshooting Error Handling
- Rescue Not Triggering: Check that the task actually failed; a failure hidden by ignore_errors or failed_when: false does not trigger rescue
- Always Not Running: Verify that always is indented at the same level as block
- failed_when Logic Wrong: Test conditions with debug before using them in failed_when
- Handlers Not Running: Enable force_handlers if a later task fails on the host
- max_fail_percentage Not Working: With serial, the percentage is evaluated per batch, and it must be exceeded, not just reached
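One way to test a failed_when condition before relying on it is to run the task with failures suppressed and print what the condition would evaluate to. The sketch below reuses /usr/bin/mycommand and the 'ERROR' check from the failed_when section; treat both as placeholders for your own command and condition.

```yaml
- name: Run the command without letting it fail yet
  command: /usr/bin/mycommand
  register: result
  failed_when: false
  changed_when: false

- name: Inspect the values the condition will see
  debug:
    msg:
      - "rc={{ result.rc }}"
      - "stderr={{ result.stderr }}"
      - "would fail: {{ 'ERROR' in result.stderr }}"
```

Once the printed value matches expectations, move the expression into failed_when and drop the debug task.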
Quick Reference
# Basic error control
ignore_errors: yes # Continue on failure
ignore_unreachable: yes # Continue if host unreachable
failed_when: condition # Custom failure condition
changed_when: condition # Custom changed condition
# Blocks
block:
  - task1
rescue:
  - recovery_task
always:
  - cleanup_task
# Play-level controls
any_errors_fatal: true # Abort on first failure
max_fail_percentage: 30 # Abort after % failures
force_handlers: yes # Run handlers even on failure
# Retry logic
until: condition
retries: 5
delay: 10
# Validation
assert:
  that:
    - condition1
    - condition2
  fail_msg: "Validation failed"
# Manual control
- fail: msg="Custom failure" # Force failure
- meta: clear_host_errors # Reset unreachable hosts
- meta: flush_handlers # Run handlers now
Next Steps
- Learn about Testing & Debugging to validate error handling
- Explore Best Practices for robust playbooks
- Master Playbooks for complex workflows
- Try the Playground to experiment with error handling