Author: eilemann@gmail.com
State: Initialization Failure Tolerance implemented in 1.0-alpha.
Overview
Equalizer is currently designed to be completely frame-driven: any node failure results in the failure of the whole config. This document explores ways to make Equalizer more resilient against failures.
Launching
Equalizer already detects when an entity fails to launch. To run a configuration despite such a failure, implement the following:
- A new config attribute to allow init failures
- Leave each unlaunched entity in the INIT_FAILED state
- Deactivate all dependent compounds
- Ignore any layout switches on inactive compounds
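The steps above can be sketched as follows. This is a minimal illustration, not Equalizer's implementation: the Entity, Compound and Config structs, the initConfig and switchLayout functions, and every state other than INIT_FAILED are hypothetical names chosen for the example. It also assumes that "ignoring" a layout switch means leaving a deactivated compound untouched.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical states; only INIT_FAILED is named in this document.
enum class State { STOPPED, RUNNING, INIT_FAILED };

// Stands in for any launchable entity (node, pipe, window, channel).
struct Entity
{
    State state = State::STOPPED;
};

struct Compound
{
    std::size_t entity = 0; // index of the entity this compound depends on
    bool active = true;
    int layout = 0;         // hypothetical id of the layout currently applied
};

struct Config
{
    bool robustness = false; // assumed flag for "allow init failures"
    std::vector<Entity> entities;
    std::vector<Compound> compounds;
};

// launched[i] reports whether entities[i] came up.
// Assumes every Compound::entity is a valid index into config.entities.
// Returns true if the config may run.
bool initConfig(Config& config, const std::vector<bool>& launched)
{
    bool allLaunched = true;
    for (std::size_t i = 0; i < config.entities.size(); ++i)
    {
        const bool ok = i < launched.size() && launched[i];
        config.entities[i].state = ok ? State::RUNNING : State::INIT_FAILED;
        allLaunched = allLaunched && ok;
    }
    if (!allLaunched && !config.robustness)
        return false; // without the attribute, one failure fails the config

    // Deactivate every compound that depends on an unlaunched entity.
    for (Compound& compound : config.compounds)
        if (config.entities[compound.entity].state == State::INIT_FAILED)
            compound.active = false;
    return true;
}

// Returns false, leaving the compound untouched, if it is inactive.
bool switchLayout(Compound& compound, const int layout)
{
    if (!compound.active)
        return false;
    compound.layout = layout;
    return true;
}
```

With the flag set, a config with one failed entity still initializes, and only the compounds bound to that entity are deactivated. Without it, initConfig reports failure, which matches the current all-or-nothing behaviour.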
Runtime Failures
Runtime failures are handled in the same way as launch failures: the node is considered failed and can therefore neither produce output frames nor participate in a barrier. The open issues are how to detect a node failure, how to assign timeouts to all blocking operations, and how to handle these timeouts. This will be problematic, in particular for the generic operations and connection handling in eq::net.
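To illustrate what a timeout on a blocking operation could look like, the sketch below shows a barrier whose enter() gives up after a deadline instead of waiting forever for a failed participant. It is only a sketch built on the C++ standard library: TimedBarrier is a hypothetical class, it is not part of eq::net, and it does not address failure detection across the network.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical helper, not an eq::net class.
class TimedBarrier
{
public:
    explicit TimedBarrier(const std::size_t height) : _height(height) {}

    // Returns true once all participants have arrived, false on timeout.
    bool enter(const std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(_mutex);
        const std::size_t generation = _generation;

        if (++_count == _height)
        {
            // Last participant: release everybody and start a new round.
            _count = 0;
            ++_generation;
            _condition.notify_all();
            return true;
        }

        const bool released = _condition.wait_for(
            lock, timeout, [&] { return _generation != generation; });
        if (!released)
            --_count; // withdraw, so a later round is not released early
        return released;
    }

private:
    const std::size_t _height;
    std::size_t _count = 0;
    std::size_t _generation = 0;
    std::mutex _mutex;
    std::condition_variable _condition;
};
```

A caller that receives false from enter() can treat the missing participants as failed and continue without them. How to decide which node failed, and how to propagate that decision, remains the open issue described above.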
File Format
global
{
    EQ_CONFIG_IATTR_ROBUSTNESS OFF | ON
}

config
{
    attributes
    {
        robustness OFF | ON # tolerate resource failures (init only)
    }
}