Error Handling in Ansible

Understanding Ansible Error Handling

By default, Ansible stops executing tasks on a host when a task fails, but continues on other hosts. Ansible provides comprehensive error handling mechanisms to control playbook execution flow, handle failures gracefully, and ensure reliable automation.

Why Error Handling Matters:

Resilience: Continue execution despite non-critical failures
Rollback: Implement recovery procedures when tasks fail
Cleanup: Ensure cleanup tasks always run
Custom Logic: Define custom success/failure conditions

Basic Error Handling

1. Ignoring Errors

Use ignore_errors to continue execution even when a task fails:

- name: This task might fail but that's okay
  command: /bin/false
  ignore_errors: yes

- name: Continue with next task
  debug:
    msg: "This runs even if previous task failed"

# Practical example
- name: Try to stop service (might not exist)
  systemd:
    name: optional_service
    state: stopped
  ignore_errors: yes

Important: ignore_errors only works when tasks execute but return a failed status. It won't suppress undefined variable errors, connection failures, or syntax errors.
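
One way to avoid losing an ignored failure entirely is to register the result and branch on it later. The following is a minimal sketch that reuses the optional_service example above; the variable name stop_result is an assumption chosen for illustration.

```yaml
- name: Try to stop service (might not exist)
  systemd:
    name: optional_service
    state: stopped
  register: stop_result      # hypothetical variable name
  ignore_errors: yes

# Runs only when the task above failed and the failure was ignored
- name: Report the ignored failure instead of silently dropping it
  debug:
    msg: "Stop failed: {{ stop_result.msg | default('no message returned') }}"
  when: stop_result is failed
```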

2. Ignoring Unreachable Hosts

Continue execution when hosts are unreachable:

- name: Task that continues despite unreachable hosts
  command: /bin/true
  ignore_unreachable: yes

# Can be set at play level
- hosts: all
  ignore_unreachable: yes
  tasks:
    - name: All tasks ignore unreachable
      ping:

3. Resetting Unreachable Hosts

Reactivate hosts that were marked as unreachable:

- name: Try to reach all hosts
  ping:
  ignore_unreachable: yes

- name: Clear unreachable status
  meta: clear_host_errors

- name: Try again on previously unreachable hosts
  ping:

Defining Failure Conditions

failed_when

Customize what constitutes a failure based on task output or return codes:

# Fail based on output content
- name: Check for errors in command output
  command: /usr/bin/mycommand
  register: result
  failed_when: "'ERROR' in result.stderr"

# Fail on specific return codes
- name: Custom return code handling
  command: /usr/bin/diff file1 file2
  register: diff_result
  failed_when: diff_result.rc == 0 or diff_result.rc >= 2

# Never fail (alternative to ignore_errors)
- name: This task never fails
  command: /bin/might_fail
  failed_when: false

# Multiple failure conditions (list items are combined with AND)
- name: Complex failure logic
  shell: /usr/bin/deploy.sh
  register: deploy
  failed_when:
    - deploy.rc != 0
    - "'warning' not in deploy.stdout"
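
Because a list of conditions is combined with AND, an OR has to be written as a single expression. A sketch of that form, reusing the deploy.sh script from the example above and assuming (for illustration only) that the script prints the string ERROR on failure:

```yaml
- name: Fail if the exit code is non-zero OR the output reports an error
  shell: /usr/bin/deploy.sh
  register: deploy
  # A folded scalar keeps a long OR expression readable
  failed_when: >
    (deploy.rc != 0) or
    ('ERROR' in deploy.stdout)
```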

Practical Examples

# Example 1: Check service status without failing
- name: Check if service is running
  command: systemctl is-active myservice
  register: service_status
  failed_when: false
  changed_when: false

- debug:
    msg: "Service is {{ 'running' if service_status.rc == 0 else 'stopped' }}"

# Example 2: Fail only on critical errors
- name: Run application tests
  command: /opt/app/run_tests.sh
  register: test_results
  failed_when:
    - test_results.rc != 0
    - "'CRITICAL' in test_results.stderr"

# Example 3: Grep without failing when no matches
- name: Search logs for pattern
  shell: grep -i "pattern" /var/log/app.log
  register: grep_result
  failed_when: grep_result.rc > 1  # rc=1 means no matches, rc>1 is error
  changed_when: false

Defining Changed Status

changed_when

Control when tasks report changes and trigger handlers:

# Never report changed
- name: Read-only operation
  command: cat /etc/hosts
  changed_when: false

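# Custom changed condition
# The shell line exits 0 whether or not the directory already existed,
# so print a marker on creation and key changed_when off that instead of rc
- name: Check and create directory
  shell: test -d /mydir || { mkdir /mydir && echo created; }
  register: dir_check
  changed_when: "'created' in dir_check.stdout"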

# Based on output content
- name: Run command
  shell: /usr/bin/process.sh
  register: process_result
  changed_when: "'updated' in process_result.stdout"

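# Idempotent shell commands
# rc is 0 both when the line was found and when it was appended,
# so report "changed" only when the append branch actually ran
- name: Add line to file if not present
  shell: grep -q "line" /etc/file || { echo "line" >> /etc/file && echo added; }
  register: line_add
  changed_when: "'added' in line_add.stdout"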

Blocks with Rescue and Always

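Blocks provide exception handling similar to try-catch-finally in programming languages. If every task in the rescue section succeeds, Ansible treats the failure as handled and the play continues on that host, which is why the rollback examples below end their rescue section with an explicit fail task.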

Basic Block Structure

- name: Handle errors with block
  block:
    # Try these tasks
    - name: Task that might fail
      command: /usr/bin/risky_operation

    - name: Another task
      debug:
        msg: "This runs if previous task succeeds"

  rescue:
    # Run these if any task in block fails
    - name: Recovery task
      debug:
        msg: "Something went wrong, recovering..."

    - name: Send notification
      mail:
        to: admin@example.com
        subject: "Deployment failed"

  always:
    # Always run these, regardless of success or failure
    - name: Cleanup
      file:
        path: /tmp/deployment
        state: absent

    - name: Log completion
      debug:
        msg: "Deployment process completed"
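
Inside a rescue section, Ansible exposes the variables ansible_failed_task and ansible_failed_result, which describe the task that triggered the rescue. A minimal sketch of using them for reporting, reusing the risky_operation command from the example above:

```yaml
- name: Report which task failed
  block:
    - name: Task that might fail
      command: /usr/bin/risky_operation

  rescue:
    # ansible_failed_task and ansible_failed_result are set by Ansible
    # only while the rescue section is running
    - name: Show the failed task and its error
      debug:
        msg: >-
          Task '{{ ansible_failed_task.name }}' failed:
          {{ ansible_failed_result.msg | default('no message returned') }}
```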

Practical Block Examples

# Example 1: Database backup with rollback
- block:
    - name: Backup database
      shell: pg_dump mydb > /backup/mydb.sql

    - name: Apply migrations
      command: /opt/app/migrate.sh

    - name: Restart application
      systemd:
        name: myapp
        state: restarted

  rescue:
    - name: Restore from backup
      shell: psql mydb < /backup/mydb.sql

    - name: Alert admins
      debug:
        msg: "Migration failed, database restored from backup"

    - fail:
        msg: "Deployment aborted due to migration failure"

  always:
    - name: Remove temporary files
      file:
        path: /tmp/migration_temp
        state: absent

# Example 2: Network configuration with rollback
- block:
    - name: Backup network config
      copy:
        src: /etc/network/interfaces
        dest: /tmp/interfaces.backup
        remote_src: yes

    - name: Apply new network config
      template:
        src: interfaces.j2
        dest: /etc/network/interfaces

    - name: Restart networking
      systemd:
        name: networking
        state: restarted

    - name: Test connectivity
      wait_for:
        host: 8.8.8.8
        port: 53
        timeout: 10

  rescue:
    - name: Restore old config
      copy:
        src: /tmp/interfaces.backup
        dest: /etc/network/interfaces
        remote_src: yes

    - name: Restart networking
      systemd:
        name: networking
        state: restarted

    - fail:
        msg: "Network configuration failed, reverted to backup"

Nested Blocks

- block:
    - name: Outer task
      debug:
        msg: "Outer block"

    - block:
        - name: Inner task that might fail
          command: /bin/false

      rescue:
        - name: Inner rescue
          debug:
            msg: "Inner block failed"

      always:
        - name: Inner cleanup
          debug:
            msg: "Inner always runs"

  rescue:
    - name: Outer rescue
      debug:
        msg: "Outer block failed"

  always:
    - name: Outer cleanup
      debug:
        msg: "Outer always runs"

Play-Level Error Controls

any_errors_fatal

Stop the entire play on the first failure across any host:

- hosts: all
  any_errors_fatal: true
  tasks:
    - name: Critical task
      command: /usr/bin/critical_operation

    # If this fails on ANY host, stop ENTIRE play

# Practical example: Load balancer scenario
- hosts: load_balancers
  any_errors_fatal: true
  tasks:
    - name: Disable datacenter
      command: /usr/bin/disable-dc

    # Must succeed on all load balancers or abort

- hosts: webservers
  tasks:
    - name: Deploy updates
      # Only runs if all load balancers succeeded
      command: /usr/bin/deploy-updates  # hypothetical command, for illustration

max_fail_percentage

Abort the play when a percentage of hosts fail (useful for rolling updates):

- hosts: webservers
  max_fail_percentage: 30
  serial: 10
  tasks:
    - name: Update application
      # Abort if more than 30% of the hosts in a batch fail
      command: /usr/bin/update-app  # hypothetical command, for illustration

# Examples with different host counts
# 10 hosts with max_fail_percentage: 30 = abort after 4 failures
# 20 hosts with max_fail_percentage: 25 = abort after 6 failures

# Important: percentage must be EXCEEDED, not equaled
# For 4 hosts to abort at 2 failures, use 49% not 50%
- hosts: databases[0:4]
  max_fail_percentage: 49
  serial: 1
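
The "must be exceeded" rule can be checked with a little arithmetic: for N hosts and an integer percentage P, the play aborts at failure number floor(N * P / 100) + 1. The play below is only a sketch for working through the numbers listed above; the cases variable and its keys are made up for illustration and are not Ansible settings.

```yaml
- hosts: localhost
  gather_facts: false
  vars:
    # Hypothetical test cases: host count and max_fail_percentage value
    cases:
      - { hosts: 10, pct: 30 }
      - { hosts: 20, pct: 25 }
      - { hosts: 4, pct: 49 }
      - { hosts: 4, pct: 50 }
  tasks:
    - name: Show the failure count at which the play would abort
      debug:
        # Integer division gives the largest count that does NOT exceed pct
        msg: >-
          {{ item.hosts }} hosts at {{ item.pct }}%:
          aborts at failure number {{ (item.hosts * item.pct) // 100 + 1 }}
      loop: "{{ cases }}"
```

For the four cases this gives 4, 6, 2, and 3 failures, which is why 49 rather than 50 is needed to abort a 4-host play at the second failure.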

Handler Error Handling

Force Handler Execution

# By default, when a task fails on a host, handlers already notified on that host do not run
# Force handlers to run even on failure

# In ansible.cfg
[defaults]
force_handlers = True

# Or in playbook
- hosts: all
  force_handlers: yes
  tasks:
    - name: Update config
      template:
        src: app.conf.j2
        dest: /etc/app.conf
      notify: Restart app

    - name: This might fail
      command: /bin/false

  # "Restart app" handler still runs

# Or via command line
ansible-playbook playbook.yml --force-handlers

Flush Handlers Early

- name: Update configuration
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: Reload nginx

# Flush handlers immediately instead of waiting until end
- meta: flush_handlers

- name: Test nginx is working
  uri:
    url: http://localhost
    status_code: 200

Advanced Error Handling Patterns

Retry Logic

- name: Retry until success
  uri:
    url: http://api.example.com/status
    status_code: 200
  register: api_result
  until: api_result.status == 200
  retries: 5
  delay: 10

# Retry with complex conditions
- name: Wait for deployment
  shell: /usr/bin/check_deployment.sh
  register: deploy_check
  until:
    - deploy_check.rc == 0
    - "'READY' in deploy_check.stdout"
  retries: 30
  delay: 10
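
When until is used, the registered result also carries an attempts key with the number of tries that were made. A sketch that reports it, reusing the check_deployment.sh script from the example above:

```yaml
- name: Wait for deployment
  shell: /usr/bin/check_deployment.sh
  register: deploy_check
  until: deploy_check.rc == 0
  retries: 30
  delay: 10
  changed_when: false   # a status check does not modify the host

- name: Show how many attempts were needed
  debug:
    msg: "Deployment ready after {{ deploy_check.attempts }} attempt(s)"
```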

Conditional Failure

- name: Check required conditions
  block:
    - name: Verify disk space
      shell: df -h / | tail -1 | awk '{print $5}' | sed 's/%//'
      register: disk_usage

    - name: Fail if disk is too full
      fail:
        msg: "Insufficient disk space: {{ disk_usage.stdout }}% used"
      when: disk_usage.stdout | int > 90

    - name: Verify memory
      shell: free -m | grep Mem | awk '{print int($3/$2*100)}'
      register: mem_usage

    - name: Fail if memory is too high
      fail:
        msg: "High memory usage: {{ mem_usage.stdout }}%"
      when: mem_usage.stdout | int > 95

Assert for Validation

- name: Validate prerequisites
  assert:
    that:
      - ansible_distribution == "Ubuntu"
      - ansible_distribution_major_version | int >= 20
      - ansible_memtotal_mb >= 4096
    fail_msg: "System does not meet requirements"
    success_msg: "All prerequisites validated"

# Multiple assertions with custom messages
- name: Validate variables
  assert:
    that:
      - db_password is defined
      - db_password | length >= 8
      - app_port | int > 1024
      - app_port | int <= 65535
    fail_msg: "Configuration validation failed"
    quiet: true  # Don't show assertion details

Graceful Degradation

- name: Try primary database
  postgresql_query:
    db: mydb
    query: SELECT 1
  register: primary_db
  ignore_errors: yes

- name: Use secondary database if primary fails
  postgresql_query:
    db: mydb
    query: SELECT 1
    login_host: "{{ secondary_db_host }}"
  when: primary_db is failed
  register: secondary_db
  ignore_errors: yes  # Without this, a failure here stops the host before the check below runs

- name: Fail if both databases are down
  fail:
    msg: "All database connections failed"
  when:
    - primary_db is failed
    - secondary_db is failed

Error Handling Best Practices

Use Blocks for Related Tasks: Group tasks that should succeed or fail together
Always Clean Up: Use the always section for cleanup tasks
Be Specific with failed_when: Define precise failure conditions
Avoid Overusing ignore_errors: Handle errors explicitly when possible
Test Rollback Procedures: Verify rescue blocks work as expected
Log Errors: Record failures for troubleshooting
Use Assertions Early: Validate prerequisites before running tasks
Implement Retry Logic: For network operations and external services

Common Error Handling Patterns

Pattern 1: Idempotent Shell Commands

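# rc is 0 whether the line was found or appended, so detect the append with a marker.
# failed_when: false is dropped so that real errors (for example, an unwritable file) still fail.
# In real playbooks, prefer the lineinfile module for this job.
- name: Ensure line in file
  shell: grep -q "{{ line }}" /etc/config || { echo "{{ line }}" >> /etc/config && echo added; }
  register: line_result
  changed_when: "'added' in line_result.stdout"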

Pattern 2: Optional Dependencies

- name: Install optional package
  apt:
    name: optional-tool
    state: present
  register: optional_install
  ignore_errors: yes

- name: Set feature flag based on installation
  set_fact:
    optional_feature_enabled: "{{ optional_install is succeeded }}"

Pattern 3: Pre-flight Checks

- name: Pre-flight validation
  block:
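    - name: Check all required variables
      assert:
        that:
          # item holds the variable's NAME, so look up its value;
          # an undefined variable falls back to '' and fails the length test
          - lookup('vars', item, default='') | length > 0
        fail_msg: "Required variable {{ item }} is not defined or is empty"
      loop:
        - db_host
        - db_name
        - db_password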

    - name: Test database connectivity
      wait_for:
        host: "{{ db_host }}"
        port: 5432
        timeout: 10

  rescue:
    - name: Abort on validation failure
      fail:
        msg: "Pre-flight checks failed, aborting deployment"

Pattern 4: Rollback on Failure

- name: Deploy with automatic rollback
  block:
    - name: Get current version
      shell: cat /opt/app/VERSION
      register: current_version

    - name: Deploy new version
      unarchive:
        src: "/releases/app-{{ new_version }}.tar.gz"
        dest: /opt/app

    - name: Run smoke tests
      command: /opt/app/smoke-tests.sh

  rescue:
    - name: Rollback to previous version
      unarchive:
        src: "/releases/app-{{ current_version.stdout }}.tar.gz"
        dest: /opt/app

    - name: Restart with old version
      systemd:
        name: app
        state: restarted

    - fail:
        msg: "Deployment failed, rolled back to {{ current_version.stdout }}"

Troubleshooting Error Handling

Common Issues:

Rescue Not Triggering: Check that task actually failed, not just had errors ignored
Always Not Running: Verify block syntax is correct (indentation)
failed_when Logic Wrong: Test conditions with debug before using in failed_when
Handlers Not Running: Enable force_handlers if tasks fail
max_fail_percentage Not Working: The percentage must be exceeded, not just reached; when combined with serial, it is evaluated per batch
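
As a sketch of the "test conditions with debug" advice above, the tasks below run a command with failure detection disabled and then print the values a failed_when expression would see. The command path and the 'ERROR' string are reused from the earlier failed_when example; the would_fail key is just a label chosen for this illustration.

```yaml
- name: Run the command without letting it fail yet
  command: /usr/bin/mycommand
  register: result
  failed_when: false
  changed_when: false

- name: Inspect the values the condition will see
  debug:
    msg:
      rc: "{{ result.rc }}"
      stderr: "{{ result.stderr }}"
      # The same expression you plan to put in failed_when
      would_fail: "{{ 'ERROR' in result.stderr }}"
```

Once the printed value of would_fail matches expectations, the expression can be moved into failed_when on the original task.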

Quick Reference

# Basic error control
ignore_errors: yes                    # Continue on failure
ignore_unreachable: yes               # Continue if host unreachable
failed_when: condition                # Custom failure condition
changed_when: condition               # Custom changed condition

# Blocks
block:
  - task1
rescue:
  - recovery_task
always:
  - cleanup_task

# Play-level controls
any_errors_fatal: true                # Abort on first failure
max_fail_percentage: 30               # Abort after % failures
force_handlers: yes                   # Run handlers even on failure

# Retry logic
until: condition
retries: 5
delay: 10

# Validation
assert:
  that:
    - condition1
    - condition2
  fail_msg: "Validation failed"

# Manual control
- fail: msg="Custom failure"          # Force failure
- meta: clear_host_errors             # Reset unreachable hosts
- meta: flush_handlers                # Run handlers now

Next Steps

Learn about Testing & Debugging to validate error handling
Explore Best Practices for robust playbooks
Master Playbooks for complex workflows
Try the Playground to experiment with error handling
