Error Handling in Ansible

Understanding Ansible Error Handling

By default, Ansible stops executing tasks on a host when a task fails, but continues on other hosts. Ansible provides comprehensive error handling mechanisms to control playbook execution flow, handle failures gracefully, and ensure reliable automation.

Why Error Handling Matters:
  • Resilience: Continue execution despite non-critical failures
  • Rollback: Implement recovery procedures when tasks fail
  • Cleanup: Ensure cleanup tasks always run
  • Custom Logic: Define custom success/failure conditions

Basic Error Handling

1. Ignoring Errors

Use ignore_errors to continue execution even when a task fails:

- name: This task might fail but that's okay
  command: /bin/false
  ignore_errors: yes

- name: Continue with next task
  debug:
    msg: "This runs even if previous task failed"

# Practical example
- name: Try to stop service (might not exist)
  systemd:
    name: optional_service
    state: stopped
  ignore_errors: yes

Important: ignore_errors only works when tasks execute but return a failed status. It won't suppress undefined variable errors, connection failures, or syntax errors.
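
Because ignore_errors does not cover undefined variables, a task that references a possibly missing variable needs its own guard. A minimal sketch, assuming a hypothetical variable named optional_service_name that may or may not be set:

```yaml
# optional_service_name is a hypothetical variable used only for illustration
- name: Stop optional service only when its name is defined
  systemd:
    name: "{{ optional_service_name }}"
    state: stopped
  when: optional_service_name is defined   # skip instead of raising an undefined-variable error
  ignore_errors: yes                       # still tolerates a failed stop, e.g. unit not found
```

When the condition is false the task is skipped, so its arguments are never templated.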

2. Ignoring Unreachable Hosts

Continue execution when hosts are unreachable:

- name: Task that continues despite unreachable hosts
  command: /bin/true
  ignore_unreachable: yes

# Can be set at play level
- hosts: all
  ignore_unreachable: yes
  tasks:
    - name: All tasks ignore unreachable
      ping:

3. Resetting Unreachable Hosts

Reactivate hosts that were marked as unreachable:

- name: Try to reach all hosts
  ping:
  ignore_unreachable: yes

- name: Clear unreachable status
  meta: clear_host_errors

- name: Try again on previously unreachable hosts
  ping:

Defining Failure Conditions

failed_when

Customize what constitutes a failure based on task output or return codes:

# Fail based on output content
- name: Check for errors in command output
  command: /usr/bin/mycommand
  register: result
  failed_when: "'ERROR' in result.stderr"

# Fail on specific return codes
# (diff returns 0 when the files are identical, 1 when they differ, 2 or more on error)
- name: Custom return code handling
  command: /usr/bin/diff file1 file2
  register: diff_result
  failed_when: diff_result.rc == 0 or diff_result.rc >= 2

# Never fail (alternative to ignore_errors)
- name: This task never fails
  command: /bin/might_fail
  failed_when: false

# Multiple failure conditions (list items are combined with AND)
- name: Complex failure logic
  shell: /usr/bin/deploy.sh
  register: deploy
  failed_when:
    - deploy.rc != 0
    - "'warning' not in deploy.stdout"
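
A list under failed_when requires every condition to be true. When any single condition should be enough, one way is to write a single expression with or. A sketch reusing the deploy.sh example, with the 'ERROR' marker as an assumed output convention of that script:

```yaml
- name: Fail on a non-zero return code OR an error marker in stderr
  shell: /usr/bin/deploy.sh
  register: deploy
  # The folded scalar (>) lets one expression span several lines
  failed_when: >
    (deploy.rc != 0) or
    ('ERROR' in deploy.stderr)
```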

Practical Examples

# Example 1: Check service status without failing
- name: Check if service is running
  command: systemctl is-active myservice
  register: service_status
  failed_when: false
  changed_when: false

- debug:
    msg: "Service is {{ 'running' if service_status.rc == 0 else 'stopped' }}"

# Example 2: Fail only on critical errors
- name: Run application tests
  command: /opt/app/run_tests.sh
  register: test_results
  failed_when:
    - test_results.rc != 0
    - "'CRITICAL' in test_results.stderr"

# Example 3: Grep without failing when no matches
- name: Search logs for pattern
  shell: grep -i "pattern" /var/log/app.log
  register: grep_result
  failed_when: grep_result.rc > 1  # rc=1 means no matches, rc>1 is error
  changed_when: false

Defining Changed Status

changed_when

Control when tasks report changes and trigger handlers:

# Never report changed
- name: Read-only operation
  command: cat /etc/hosts
  changed_when: false

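# Custom changed condition
# rc is 0 whether or not mkdir ran, so key off a marker printed only on creation
- name: Check and create directory
  shell: test -d /mydir || { mkdir /mydir && echo created; }
  register: dir_check
  changed_when: "'created' in dir_check.stdout"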

# Based on output content
- name: Run command
  shell: /usr/bin/process.sh
  register: process_result
  changed_when: "'updated' in process_result.stdout"

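# Idempotent shell commands
# rc is 0 both when the line already exists and when it is appended,
# so report a change only when the "added" marker is printed
- name: Add line to file if not present
  shell: grep -q "line" /etc/file || { echo "line" >> /etc/file && echo added; }
  register: line_add
  changed_when: "'added' in line_add.stdout"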
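
Since handlers fire only for tasks that report a change, changed_when also decides whether a notify takes effect. A minimal sketch of that interaction; the handler name "Reload app" and the service name myapp are assumptions for illustration:

```yaml
- hosts: all
  tasks:
    - name: Run command and notify only on a real update
      shell: /usr/bin/process.sh
      register: process_result
      changed_when: "'updated' in process_result.stdout"
      notify: Reload app   # hypothetical handler; runs only if the task reports changed

  handlers:
    - name: Reload app
      systemd:
        name: myapp        # hypothetical service name
        state: reloaded
```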

Blocks with Rescue and Always

Blocks provide exception handling similar to try-catch-finally in programming languages:

Basic Block Structure

- name: Handle errors with block
  block:
    # Try these tasks
    - name: Task that might fail
      command: /usr/bin/risky_operation

    - name: Another task
      debug:
        msg: "This runs if previous task succeeds"

  rescue:
    # Run these if any task in block fails
    - name: Recovery task
      debug:
        msg: "Something went wrong, recovering..."

    - name: Send notification
      mail:
        to: admin@example.com
        subject: "Deployment failed"

  always:
    # Always run these, regardless of success or failure
    - name: Cleanup
      file:
        path: /tmp/deployment
        state: absent

    - name: Log completion
      debug:
        msg: "Deployment process completed"
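
Inside rescue, Ansible exposes the variables ansible_failed_task and ansible_failed_result, which describe the task that triggered the rescue. A short sketch of how they might be used in a recovery message; the default filter is there because not every module returns a msg field:

```yaml
- name: Report what failed
  block:
    - name: Task that might fail
      command: /usr/bin/risky_operation

  rescue:
    - name: Show the failed task and its error
      debug:
        msg: "Task '{{ ansible_failed_task.name }}' failed: {{ ansible_failed_result.msg | default('no message') }}"
```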

Practical Block Examples

# Example 1: Database backup with rollback
- block:
    - name: Backup database
      shell: pg_dump mydb > /backup/mydb.sql

    - name: Apply migrations
      command: /opt/app/migrate.sh

    - name: Restart application
      systemd:
        name: myapp
        state: restarted

  rescue:
    - name: Restore from backup
      shell: psql mydb < /backup/mydb.sql

    - name: Alert admins
      debug:
        msg: "Migration failed, database restored from backup"

    - fail:
        msg: "Deployment aborted due to migration failure"

  always:
    - name: Remove temporary files
      file:
        path: /tmp/migration_temp
        state: absent

# Example 2: Network configuration with rollback
- block:
    - name: Backup network config
      copy:
        src: /etc/network/interfaces
        dest: /tmp/interfaces.backup
        remote_src: yes

    - name: Apply new network config
      template:
        src: interfaces.j2
        dest: /etc/network/interfaces

    - name: Restart networking
      systemd:
        name: networking
        state: restarted

    - name: Test connectivity
      wait_for:
        host: 8.8.8.8
        port: 53
        timeout: 10

  rescue:
    - name: Restore old config
      copy:
        src: /tmp/interfaces.backup
        dest: /etc/network/interfaces
        remote_src: yes

    - name: Restart networking
      systemd:
        name: networking
        state: restarted

    - fail:
        msg: "Network configuration failed, reverted to backup"

Nested Blocks

- block:
    - name: Outer task
      debug:
        msg: "Outer block"

    - block:
        - name: Inner task that might fail
          command: /bin/false

      rescue:
        - name: Inner rescue
          debug:
            msg: "Inner block failed"

      always:
        - name: Inner cleanup
          debug:
            msg: "Inner always runs"

  rescue:
    - name: Outer rescue
      debug:
        msg: "Outer block failed"

  always:
    - name: Outer cleanup
      debug:
        msg: "Outer always runs"

Play-Level Error Controls

any_errors_fatal

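Stop the entire play as soon as a task fails on any host. Ansible finishes the failing task on all hosts in the current batch, then ends the play for every host: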

- hosts: all
  any_errors_fatal: true
  tasks:
    - name: Critical task
      command: /usr/bin/critical_operation

    # If this fails on ANY host, stop ENTIRE play

# Practical example: Load balancer scenario
- hosts: load_balancers
  any_errors_fatal: true
  tasks:
    - name: Disable datacenter
      command: /usr/bin/disable-dc

    # Must succeed on all load balancers or abort

- hosts: webservers
  tasks:
    - name: Deploy updates
      # Only runs if all load balancers succeeded
      debug:
        msg: "Placeholder for the real deployment task"
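
any_errors_fatal can also be set on a block, which limits the all-or-nothing behavior to a group of tasks instead of the whole play. A sketch under that assumption, reusing the critical_operation command from the example above:

```yaml
- hosts: webservers
  tasks:
    - name: Steps that must succeed on every host
      block:
        - name: Critical task
          command: /usr/bin/critical_operation
      any_errors_fatal: true   # a failure here ends the play for all hosts

    - name: Non-critical task
      debug:
        msg: "A failure here affects only the failing host"
```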

max_fail_percentage

Abort the play when a percentage of hosts fail (useful for rolling updates):

- hosts: webservers
  max_fail_percentage: 30
  serial: 10
  tasks:
    - name: Update application
      # Abort if more than 30% of the hosts in a batch fail
      debug:
        msg: "Placeholder for the real update task"

# Examples with different host counts
# 10 hosts with max_fail_percentage: 30 = abort after 4 failures
# 20 hosts with max_fail_percentage: 25 = abort after 6 failures

# Important: percentage must be EXCEEDED, not equaled
# For 4 hosts to abort at 2 failures, use 49% not 50%
- hosts: databases[0:4]
  max_fail_percentage: 49
  serial: 1

Handler Error Handling

Force Handler Execution

# By default, a notified handler does not run on a host where a later task fails
# force_handlers makes notified handlers run on that host anyway

# In ansible.cfg
[defaults]
force_handlers = True

# Or in playbook
- hosts: all
  force_handlers: yes
  tasks:
    - name: Update config
      template:
        src: app.conf.j2
        dest: /etc/app.conf
      notify: Restart app

    - name: This might fail
      command: /bin/false

  # "Restart app" handler still runs

# Or via command line
ansible-playbook playbook.yml --force-handlers

Flush Handlers Early

- name: Update configuration
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: Reload nginx

# Flush handlers immediately instead of waiting until end
- meta: flush_handlers

- name: Test nginx is working
  uri:
    url: http://localhost
    status_code: 200

Advanced Error Handling Patterns

Retry Logic

- name: Retry until success
  uri:
    url: http://api.example.com/status
    status_code: 200
  register: api_result
  until: api_result.status == 200
  retries: 5
  delay: 10

# Retry with complex conditions
- name: Wait for deployment
  shell: /usr/bin/check_deployment.sh
  register: deploy_check
  until:
    - deploy_check.rc == 0
    - "'READY' in deploy_check.stdout"
  retries: 30
  delay: 10
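
When until is combined with register, the registered result carries an attempts key recording how many tries were made. A small sketch that reports it, using the same placeholder URL as the example above:

```yaml
- name: Wait for the API
  uri:
    url: http://api.example.com/status
    status_code: 200
  register: api_result
  until: api_result.status == 200
  retries: 5
  delay: 10

- name: Show how many attempts were needed
  debug:
    msg: "API became ready after {{ api_result.attempts }} attempt(s)"
```

If the retries are exhausted, the first task fails like any other task, so it can be wrapped in a block with rescue.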

Conditional Failure

- name: Check required conditions
  block:
    - name: Verify disk space
      shell: df -h / | tail -1 | awk '{print $5}' | sed 's/%//'
      register: disk_usage

    - name: Fail if disk is too full
      fail:
        msg: "Insufficient disk space: {{ disk_usage.stdout }}% used"
      when: disk_usage.stdout | int > 90

    - name: Verify memory
      shell: free -m | grep Mem | awk '{print int($3/$2*100)}'
      register: mem_usage

    - name: Fail if memory is too high
      fail:
        msg: "High memory usage: {{ mem_usage.stdout }}%"
      when: mem_usage.stdout | int > 95

Assert for Validation

- name: Validate prerequisites
  assert:
    that:
      - ansible_distribution == "Ubuntu"
      - ansible_distribution_major_version | int >= 20
      - ansible_memtotal_mb >= 4096
    fail_msg: "System does not meet requirements"
    success_msg: "All prerequisites validated"

# Multiple assertions with custom messages
- name: Validate variables
  assert:
    that:
      - db_password is defined
      - db_password | length >= 8
      - app_port | int > 1024
      - app_port | int < 65535
    fail_msg: "Configuration validation failed"
    quiet: true  # Don't show assertion details

Graceful Degradation

- name: Try primary database
  postgresql_query:
    db: mydb
    query: SELECT 1
  register: primary_db
  ignore_errors: yes

- name: Use secondary database if primary fails
  postgresql_query:
    db: mydb
    query: SELECT 1
    login_host: "{{ secondary_db_host }}"
  when: primary_db is failed
  register: secondary_db
  ignore_errors: yes  # without this, a failure here stops the host before the check below runs

- name: Fail if both databases are down
  fail:
    msg: "All database connections failed"
  when:
    - primary_db is failed
    - secondary_db is failed

Error Handling Best Practices

  1. Use Blocks for Related Tasks: Group tasks that should succeed or fail together
  2. Always Clean Up: Use the always section for cleanup tasks
  3. Be Specific with failed_when: Define precise failure conditions
  4. Avoid Overusing ignore_errors: Handle errors explicitly when possible
  5. Test Rollback Procedures: Verify rescue blocks work as expected
  6. Log Errors: Record failures for troubleshooting
  7. Use Assertions Early: Validate prerequisites before running tasks
  8. Implement Retry Logic: For network operations and external services
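
Item 6 (Log Errors) can be sketched as a rescue section that appends a line to a file on the control node and then re-raises the failure. The script path and log location below are hypothetical, and the approach is only one possible way to record failures:

```yaml
- block:
    - name: Deploy application
      command: /opt/app/deploy.sh            # hypothetical script path

  rescue:
    - name: Append failure details to a log on the control node
      lineinfile:
        path: /tmp/ansible-failures.log      # hypothetical log location
        line: "{{ inventory_hostname }}: task '{{ ansible_failed_task.name }}' failed"
        create: yes
      delegate_to: localhost
      throttle: 1                            # write one host at a time to avoid interleaved appends

    - name: Re-raise the failure after logging
      fail:
        msg: "Deployment failed on {{ inventory_hostname }}"
```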

Common Error Handling Patterns

Pattern 1: Idempotent Shell Commands

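- name: Ensure line in file
  # rc is 0 whether the line existed or was appended, so report a change
  # only when the "added" marker is printed; real errors still fail the task
  shell: grep -q "{{ line }}" /etc/config || { echo "{{ line }}" >> /etc/config && echo added; }
  register: line_result
  changed_when: "'added' in line_result.stdout"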

Pattern 2: Optional Dependencies

- name: Install optional package
  apt:
    name: optional-tool
    state: present
  register: optional_install
  ignore_errors: yes

- name: Set feature flag based on installation
  set_fact:
    optional_feature_enabled: "{{ optional_install is succeeded }}"

Pattern 3: Pre-flight Checks

- name: Pre-flight validation
  block:
    - name: Check all required variables
      # item holds the variable's name, so look the variable up by that name
      assert:
        that:
          - lookup('vars', item, default='') | length > 0
        fail_msg: "Required variable {{ item }} is not defined or is empty"
      loop:
        - db_host
        - db_name
        - db_password

    - name: Test database connectivity
      wait_for:
        host: "{{ db_host }}"
        port: 5432
        timeout: 10

  rescue:
    - name: Abort on validation failure
      fail:
        msg: "Pre-flight checks failed, aborting deployment"

Pattern 4: Rollback on Failure

- name: Deploy with automatic rollback
  block:
    - name: Get current version
      shell: cat /opt/app/VERSION
      register: current_version

    - name: Deploy new version
      unarchive:
        src: "/releases/app-{{ new_version }}.tar.gz"
        dest: /opt/app

    - name: Run smoke tests
      command: /opt/app/smoke-tests.sh

  rescue:
    - name: Rollback to previous version
      unarchive:
        src: "/releases/app-{{ current_version.stdout }}.tar.gz"
        dest: /opt/app

    - name: Restart with old version
      systemd:
        name: app
        state: restarted

    - fail:
        msg: "Deployment failed, rolled back to {{ current_version.stdout }}"

Troubleshooting Error Handling

Common Issues:
  • Rescue Not Triggering: Check that the task actually failed rather than having its error ignored; bad task definitions and unreachable hosts do not trigger rescue
  • Always Not Running: Verify block syntax is correct (indentation)
  • failed_when Logic Wrong: Test conditions with debug before using in failed_when
  • Handlers Not Running: Enable force_handlers if tasks fail
  • max_fail_percentage Not Working: With serial, the percentage is evaluated per batch; it must be exceeded, not just reached, and it only works with linear-based strategies
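
One way to test a condition before trusting it in failed_when is to run the task with failures disabled and print what the expression evaluates to. A minimal sketch, reusing the mycommand example and its 'ERROR' check from earlier in this section:

```yaml
- name: Run the command without letting it fail yet
  command: /usr/bin/mycommand
  register: result
  failed_when: false
  changed_when: false

- name: Inspect the values the condition will see
  debug:
    msg:
      - "rc={{ result.rc }}"
      - "stderr={{ result.stderr }}"
      - "condition={{ 'ERROR' in result.stderr }}"
```

Once the printed condition matches expectations, the expression can be moved into failed_when on the original task.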

Quick Reference

# Basic error control
ignore_errors: yes                    # Continue on failure
ignore_unreachable: yes               # Continue if host unreachable
failed_when: condition                # Custom failure condition
changed_when: condition               # Custom changed condition

# Blocks
block:
  - task1
rescue:
  - recovery_task
always:
  - cleanup_task

# Play-level controls
any_errors_fatal: true                # Abort on first failure
max_fail_percentage: 30               # Abort after % failures
force_handlers: yes                   # Run handlers even on failure

# Retry logic
until: condition
retries: 5
delay: 10

# Validation
assert:
  that:
    - condition1
    - condition2
  fail_msg: "Validation failed"

# Manual control
- fail: msg="Custom failure"          # Force failure
- meta: clear_host_errors             # Reset unreachable hosts
- meta: flush_handlers                # Run handlers now

Next Steps