Concepts

Services

A service is a completely abstract concept for faild and failsh. All they need to know is a name (for faild) and two Tcl procedures to start and stop the service (for failsh). Failsh also uses two lists of hosts, masters and slaves, that it expects to provide the same service. Masters and slaves are explained below.
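
The pieces that make up a service can be pictured as a small record. The following is only an illustrative sketch in Python, not failsh's actual Tcl configuration: the class name Service, its field names, and the host names alpha and gamma are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Service:
    """Hypothetical model of what failsh needs to know about a service."""
    name: str                      # the only thing faild needs
    start: Callable[[], None]      # stands in for the Tcl start procedure
    stop: Callable[[], None]       # stands in for the Tcl stop procedure
    masters: List[str] = field(default_factory=list)  # hosts that take precedence
    slaves: List[str] = field(default_factory=list)   # hosts that back us up


# Record what the placeholder start/stop procedures do.
log = []

www = Service(
    name="www",
    start=lambda: log.append("start www"),
    stop=lambda: log.append("stop www"),
    masters=["alpha"],   # hypothetical host name
    slaves=["gamma"],    # hypothetical host name
)

www.start()
www.stop()
```

Nothing in the record says what the service actually does; that stays entirely inside the start and stop procedures.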

States

A service can be in one of three states.

UP If a service is UP, it does not need a backup and will do whatever the definition of the service expects it to do. Note that it is up to you to define what UP means for the service you define.

RECOVER If a service is in the RECOVER state, it is not currently performing its duties, but can be brought up at any time. Whenever a service is started, it should first be brought to this state. Failsh brings it to the UP state only once it has decided that doing so would not interfere with what other hosts are doing.

FAIL If a service is in the FAIL state, it cannot be brought up. If a host is not reachable, then the services failsh expects that host to provide are also considered to be in the FAIL state.

Masters and Slaves

Failsh needs to decide whether to bring a service up or down. It uses two lists of hosts for this purpose. If any host in the list of masters provides a service in either the UP or the RECOVER state, the service should not be started on the local host. The local service should then go to the RECOVER or FAIL state itself.

If none of the masters is ready to provide a service, failsh tries to start the service on the local host. But it must first ensure that this would not interfere with some other, inferior host still providing the service. For example, the local host could be rebooting after a long downtime, so some backup system may still be providing the service. So before starting the service, failsh queries all the slaves to see whether any of them is UP. If the local machine is in the masters list of all the slaves, then all slaves will see that the local host is in the RECOVER state and will themselves go to this state. After this, failsh may start the service on the local host.
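
The decision described above can be sketched as a small function. This is a simplified illustration in Python, not failsh's real code: the names State and may_start and the dictionary layout (host name mapped to last known state) are assumptions made for this example.

```python
from enum import Enum


class State(Enum):
    UP = "UP"
    RECOVER = "RECOVER"
    FAIL = "FAIL"


def may_start(master_states, slave_states):
    """Return True if the local host may bring the service UP.

    Both arguments map a host name to the State last seen on that host
    (hypothetical data layout).
    """
    # A master that is UP or RECOVER takes precedence over the local host.
    if any(s in (State.UP, State.RECOVER) for s in master_states.values()):
        return False
    # A slave that is still UP must fall back to RECOVER before we start.
    if any(s is State.UP for s in slave_states.values()):
        return False
    return True


# All masters failed, slave already stepped back: the local host may start.
ok = may_start({"alpha": State.FAIL}, {"gamma": State.RECOVER})

# A slave is still UP (e.g. we just rebooted): wait.
wait = may_start({"alpha": State.FAIL}, {"gamma": State.UP})
```

In this sketch the local host simply tries again later when a slave is still UP; by then the slave should have seen the local host in the RECOVER state and gone to RECOVER itself.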

Internal States

Internally, failsh knows two more states:

INDOUBT A service is INDOUBT when its state information could not be retrieved for some time, e.g. due to failure of the other host. After a certain time, failsh marks INDOUBT services as FAIL.
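
One way to picture the INDOUBT rule is as two timeouts applied to the time since a host last answered. This is a rough Python sketch only: the function name observed_state and the two thresholds doubt_after and fail_after are assumptions for this example, and the text above does not specify the actual intervals failsh uses.

```python
def observed_state(last_state, seconds_since_reply, doubt_after, fail_after):
    """Return the state to assume for a remote service.

    last_state           state reported in the last successful query
    seconds_since_reply  time elapsed since that query
    doubt_after          hypothetical threshold for entering INDOUBT
    fail_after           hypothetical threshold for giving up (FAIL)
    """
    if seconds_since_reply >= fail_after:
        return "FAIL"
    if seconds_since_reply >= doubt_after:
        return "INDOUBT"
    return last_state


# With made-up thresholds of 10 and 60 seconds:
fresh = observed_state("UP", 3, doubt_after=10, fail_after=60)
stale = observed_state("UP", 20, doubt_after=10, fail_after=60)
gone = observed_state("UP", 90, doubt_after=10, fail_after=60)
```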

RECLAIM One can think of a scenario where a slave is UP although it should not be up. If a master sees this, it goes into the RECLAIM state and waits until the slave recovers from its current insanity (goes back to RECOVER). As soon as no client is providing the service, the master reclaims the service. The difference from a service simply going down and up again is that the service on the master was never really stopped. Such a situation can occur if there is only one heartbeat link and the network cable on the master breaks: the slave will go up, but the master will never really stop the service. When the master is reconnected, the slave and master are simultaneously up. There is one race condition remaining: the master may not see that the slave is up, because it was just a few seconds faster. In this situation, the master will not reclaim the service. This problem of "hidden" state changes will be addressed in a future release.
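
The RECLAIM behaviour on a master can be sketched as a tiny state transition. This is a simplified Python illustration under assumptions: the function name master_next_state and the plain-string states are made up for this example, and it deliberately ignores the race condition mentioned above.

```python
def master_next_state(current, slave_states):
    """Return the master's next state for a service it never really stopped.

    current       the master's current state ("UP" or "RECLAIM")
    slave_states  list of states last seen on the slaves
    """
    if "UP" in slave_states:
        # A slave is UP although it should not be: wait for it to step back.
        return "RECLAIM"
    if current == "RECLAIM":
        # No slave provides the service any more: take it back.
        return "UP"
    return current


# Master reconnects and finds the slave UP, then the slave goes to RECOVER.
step1 = master_next_state("UP", ["UP"])
step2 = master_next_state(step1, ["RECOVER"])
```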
