Node Failures

Author: eilemann@gmail.com
State: Initialization Failure Tolerance implemented in 1.0-alpha.

Overview

Equalizer is currently designed to be completely frame-driven; as a consequence, any node failure causes the whole config to fail. This document explores ways to make Equalizer more resilient against failures.

Launching

Figure: Entity State Diagram
Figure: Server Entity State Diagram

Equalizer already detects the failure of an entity to launch. To run a configuration observing such a failure, implement the following:

This allows configurations to launch even if one of the entities in the cluster fails. If the failed node is a source-only node, the configuration will run correctly, with slightly degraded performance. If it is a destination node, the corresponding display segment will not be updated. Any dependent source nodes are automatically deactivated and will not be reassigned to other destination channels, unless they had already been assigned by a view equalizer (cross-segment load balancing).
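The deactivation rule described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the `Channel` struct, its fields and `applyLaunchFailures` are hypothetical names invented for this illustration and are not part of the Equalizer API. The model also reduces the view equalizer case to a single flag.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model for illustration only, not the Equalizer API.
struct Channel
{
    std::string name;
    bool nodeRunning;   // true if the node owning this channel launched
    int  destination;   // index of the destination channel fed by this
                        // source, or -1 if this channel is a destination
    bool crossSegment;  // already assigned by a view equalizer
    bool active;        // result: does the channel take part in rendering?
};

// Applies the rule from the text: channels on failed nodes are inactive,
// and sources of a failed destination are deactivated unless a view
// equalizer had already assigned them across segments.
void applyLaunchFailures( std::vector< Channel >& channels )
{
    // Pass 1: everything on a failed node is inactive.
    for( std::size_t i = 0; i < channels.size(); ++i )
        channels[i].active = channels[i].nodeRunning;

    // Pass 2: deactivate sources which depend on a failed destination.
    for( std::size_t i = 0; i < channels.size(); ++i )
    {
        Channel& channel = channels[i];
        if( channel.destination < 0 || !channel.active )
            continue; // destination channel, or already inactive

        const Channel& dest = channels[ channel.destination ];
        if( !dest.active && !channel.crossSegment )
            channel.active = false;
    }
}
```

In this model, a failed source-only node removes only its own contribution, while a failed destination takes its dependent sources with it. That matches the degraded-performance and non-updated-segment cases described above.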

Runtime Failures

Runtime failures and launch failures are handled in the same way: the node is considered failed and can therefore neither produce output frames nor participate in a barrier. The open issues are how to detect node failure, how to assign timeouts to all blocking operations, and how to handle those timeouts. This will be problematic, in particular for the generic operations and connection handling in eq::net.
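To make the timeout issue concrete, the following sketch shows one possible shape of a blocking barrier that reports a timeout instead of waiting forever. It is a local, in-process illustration using only the C++ standard library. The `TimedBarrier` class is a hypothetical name and does not reflect how eq::net implements barriers or connections.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical helper for illustration only, not part of eq::net.
class TimedBarrier
{
public:
    explicit TimedBarrier( const std::size_t height )
        : _height( height ), _count( 0 ), _generation( 0 ) {}

    // Blocks until 'height' participants have entered.
    // Returns true if the barrier was released, false on timeout.
    bool enter( const std::chrono::milliseconds timeout )
    {
        std::unique_lock< std::mutex > lock( _mutex );
        const std::size_t generation = _generation;

        if( ++_count == _height ) // last participant releases the others
        {
            _count = 0;
            ++_generation;
            _condition.notify_all();
            return true;
        }

        const bool released = _condition.wait_for( lock, timeout,
            [&]{ return _generation != generation; } );

        if( !released )
            --_count; // withdraw, so that a later retry counts correctly
        return released;
    }

private:
    std::mutex _mutex;
    std::condition_variable _condition;
    const std::size_t _height;
    std::size_t _count;
    std::size_t _generation;
};
```

A caller receiving `false` would have to decide which participant to declare failed and how to continue the frame without it. That decision is the handling question left open above.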

File Format

  global
  {
      EQ_CONFIG_IATTR_ROBUSTNESS OFF | ON
  }
  config
  {
      attributes
      {
          robustness OFF | ON  # tolerate resource failures (init only)
      }
  }