Failover Example

Problem Description
Solution Concept
Details of the configuration

1. Problem Description

A proxy server should be backed up using the failover utilities, so that a second proxy takes over if the main proxy fails. The user should hardly notice that she is now working with a different proxy: it should not be necessary to restart either the proxy server or the browser.

Note that DNS cannot provide a solution for this problem, since the browser will resolve the name of the proxy only once. However, DNS round robin together with the failover utilities can give a natural load balancing solution with high availability.

2. Solution using the Failover Utilities

The proposed solution uses an additional IP address, different from the main addresses of both machines, which will be active on a virtual interface of the active machine. Logically, this IP address is the service that is active either on the main or on the backup machine.

To fix ideas let us consider two machines:

name	ip address	interface	function
lukretia	193.5.25.27	hme0	master
macbeth	193.5.25.20	le0	slave

Both systems happen to be Sun systems running supported versions of Solaris 2.x.

Failover should happen for the address 193.5.25.62. To use the failover utilities, we have to make sure that the following steps are implemented on the two machines:

On both machines, the software must be installed, and the faild must be started whenever the system is booted. Furthermore, the service named 193.5.25.62 must be registered in faild, in state FAIL.
On lukretia, failsh must be started with a configuration file that declares macbeth a slave.
On lukretia, the service 193.5.25.62 should be brought to the RECOVER state, so that failsh can start it as soon as it is sure that the service 193.5.25.62 is not provided by the slave (this is important to prevent duplicate IP addresses).
On macbeth, failsh must be started with a configuration file that declares lukretia a master.
On macbeth, the service 193.5.25.62 is brought to the RECOVER state to signal to the master that it could take over the service at any time.

If the master fails or the service is brought down manually on lukretia, the slave will notice and start to provide the service. When the master comes up again, the slave will see the service 193.5.25.62 in the RECOVER state on the master, stop the service and go to the RECOVER state itself. As soon as the master sees that the slave is in the RECOVER state, it will bring up the service itself.
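
The takeover rules just described can be summarised as a small decision table. The following is only an illustrative sketch in plain sh, not the logic failsh actually implements. The function name decide and the pseudo-state UNREACHABLE (standing for a peer whose faild does not answer) are assumptions made for this example, and combinations the text does not describe are deliberately left undecided.

```sh
#!/bin/sh
# Illustrative sketch only: models the takeover rules described in the text.
# decide ROLE LOCAL_STATE PEER_STATE prints the action for the local machine:
#   start  bring the service up locally
#   stop   give the service up and go to RECOVER
#   none   no action (also used for cases the text does not describe)
# UNREACHABLE is a hypothetical pseudo-state for a peer that does not answer.
decide() (
	role=$1 localstate=$2 peerstate=$3
	case "${role}:${localstate}:${peerstate}" in
		# the slave takes over when the master is gone or its service failed
		slave:RECOVER:UNREACHABLE|slave:RECOVER:FAIL)
			echo start;;
		# the slave hands the service back when the master is ready again
		slave:UP:RECOVER)
			echo stop;;
		# the master starts once it sees the slave in RECOVER
		master:RECOVER:RECOVER)
			echo start;;
		*)
			echo none;;
	esac
)

decide slave RECOVER UNREACHABLE
decide slave UP RECOVER
decide master RECOVER RECOVER
```

The three calls at the end walk through one complete cycle: the slave takes over, the slave hands back, and the master resumes the service.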

3. Details of the configuration

The configuration consists of the following parts:

Start script for faild on both machines
Start script for the service on both machines
Start script for failsh on both machines
Configuration file for failsh on lukretia
Configuration file for failsh on macbeth

All configuration files are in the sample subdirectory of the Solaris package distribution or in the samples/ipfailover-solaris directory of the source distribution.

Start script for faild

The faild start script should be installed in /etc/init.d and linked to the appropriate directories /etc/rc?.d. It makes sure faild is started and all services are correctly registered.

#!/bin/sh
#
# start / stop the faild daemon
#
servicenames="193.5.25.62"
port=1291

case `hostname` in
	macbeth)	other=lukretia;;
	lukretia)	other=macbeth;;
esac

case .$1 in
	.start)
		# if an old state file is around, we use this, this makes
		# it easier to start/stop a faild without losing all
		# the state. But note that a slave will take over a service,
		# if it can no longer talk to the faild. You have to stop
		# all services that are UP
		/opt/AFMfail/sbin/faild -c communityfile -U -p ${port} \
			-r /opt/AFMfail/var/faild.pid
		for n in ${servicenames}
		do
			echo adding service ${n}
			/opt/AFMfail/bin/failc -c public -T -p ${port} \
				-s FAIL -b ${other} ${n}
		done
		;;
	.stop)
		kill `cat /opt/AFMfail/var/faild.pid`
		;;
esac

The failc call in the script makes sure all the parameters are set, including the purely informational name of the backup system for a service.

Start script for the service

Previous versions of this file started failsh in this script. This becomes inconvenient if there are several services, so failsh is now started from a separate script.

The service start script is usually named after the service, in this case just the IP address to be failed over. It must be installed as /etc/init.d/193.5.25.62 and linked to the appropriate directories in /etc/rc?.d. Make sure it is started after faild.
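
One way to get the ordering right is to choose the sequence numbers of the links so that faild sorts before the service. The sketch below only prints the ln commands instead of running them. The init script name faild, the run level directories rc2.d and rc0.d, and the sequence numbers are assumptions for illustration; rc_links is a hypothetical helper, not part of the distribution.

```sh
#!/bin/sh
# Sketch: print the ln commands that would link an init script into the
# run level directories. Nothing is changed on the system.
# rc_links NAME STARTSEQ KILLSEQ
# Run levels (rc2.d, rc0.d) and sequence numbers are assumptions.
rc_links() (
	name=$1 startseq=$2 killseq=$3
	echo "ln /etc/init.d/${name} /etc/rc2.d/S${startseq}${name}"
	echo "ln /etc/init.d/${name} /etc/rc0.d/K${killseq}${name}"
)

# faild gets the lower start number, so it is started before the service,
# and the higher kill number, so it is stopped after it
rc_links faild 90 10
rc_links 193.5.25.62 91 09
```

The failsh start script described later can be linked in the same way.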

In addition to the standard arguments start and stop, it supports several further arguments. The complete set is:

start: Bring the service to a well-defined state and register it as RECOVER; used at boot time. The convention for RECOVER is that a service in this state should be startable at any time.
stop: Kill the service. This usually brings it down to the FAIL state. After stop, a service can usually not be started without going through the start option first.
fail: Bring down the service and mark it as FAIL so that failsh will not try to bring it up again.
recover: Place the service in the RECOVER state in faild so that failsh can bring it up if necessary.
reclaim: Reclaim the service after a slave has spuriously brought it up.
up: Start the service without touching the entries in faild; used by the Tcl scripts below to start the service.
down: Stop the service without touching the entries in faild; used by the Tcl scripts below to stop the service.

#!/bin/sh
#
# IP address failover using faild/failsh
#
servicename=193.5.25.62
netmask=255.255.255.0
broadcast=193.5.25.255
interface=hme0
subif=:1
port=1291
monitorport=1848

case .$1 in
	.start)
		# The start argument brings the service into a well defined
		# recoverable state. If this is a master, it will later be
		# started by the up method
		/sbin/ifconfig ${interface}${subif} plumb
		/sbin/ifconfig ${interface}${subif} inet ${servicename} \
			netmask ${netmask} broadcast ${broadcast} down
		# for the next step to work, faild must be fully up, it
		# may be necessary to include a small pause to give faild
		# another chance
		/opt/AFMfail/bin/failc -c public -T -p ${port} -sRECOVER \
			${servicename}
		;;
	.recover)
		/opt/AFMfail/bin/failc -c public -T -p ${port} -sRECOVER \
			${servicename}
		;;
	.reclaim)
		# reclaiming an IP address: send gratuitous ARP only
		/opt/AFMfail/sbin/grarp  ${interface} ${servicename}
		;;
	.fail)
		# The fail argument brings the service down, and marks it
		# as failed so that it will not be started again
		/sbin/ifconfig ${interface}${subif} down
		/opt/AFMfail/bin/failc -c public -T -p ${port} -sFAIL \
			${servicename}
		;;
	.stop)
		# The stop argument kills all processes associated with
		# the failover 
		kill `cat /opt/AFMfail/var/failsh.pid`
		;;
	.up)
		# starting the service without putting any information in
		# the faild
		/sbin/ifconfig ${interface}${subif} up
		# send gratuitous ARP
		/opt/AFMfail/sbin/grarp  ${interface} ${servicename}
		;;
	.down)
		# Take the service down without putting any information in
		# the faild
		/sbin/ifconfig ${interface}${subif} down
		;;
esac

exit 0
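
The comment in the start branch notes that faild may not be fully up when failc is called, and that a small pause may be needed. Instead of a fixed pause, the call could be retried a few times. The following is a minimal sketch of such a helper; retry is a hypothetical function, not part of the failover utilities, and it assumes that failc returns a non-zero exit status when it cannot reach faild.

```sh
#!/bin/sh
# Sketch: run a command until it succeeds, pausing between attempts.
# retry ATTEMPTS DELAY COMMAND [ARG...]
# Returns 0 as soon as the command succeeds, 1 after ATTEMPTS failures.
retry() (
	attempts=$1 delay=$2
	shift 2
	n=1
	while :
	do
		"$@" && exit 0
		[ "${n}" -ge "${attempts}" ] && exit 1
		n=`expr ${n} + 1`
		sleep "${delay}"
	done
)

# possible use in the start branch of the service script:
#   retry 5 2 /opt/AFMfail/bin/failc -c public -T -p ${port} -sRECOVER \
#	${servicename}
```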

More complex applications, e.g. firewalls with several interfaces, or application servers with a large number of server processes, will have more complex service start/stop scripts. However, they will want to use the same targets, so that the same relatively simple failsh configuration files can be used.
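
With these targets in place, a planned handover for maintenance can be done by hand on the master: fail gives the service up so that the slave takes over, and recover announces that the master is ready to take it back. The sketch below wraps the two calls. The function names handover and takeback and the variables SVC and RUN are assumptions for this example. By default the commands are only printed; set RUN to the empty string to execute them.

```sh
#!/bin/sh
# Sketch: planned handover of the service, to be run on the master.
# SVC and RUN are hypothetical overrides for this example.
svc=${SVC-/etc/init.d/193.5.25.62}
run=${RUN-echo}		# dry run by default; RUN= executes the commands

# give the service up; the slave will notice and take over
handover() { ${run} "${svc}" fail; }

# announce that the master is ready again; the slave will hand back
takeback() { ${run} "${svc}" recover; }

handover
takeback
```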

failsh start script

The failsh start script starts failsh with the appropriate arguments. In particular, it should tell the shell which Tcl script to use. It is also recommended that the monitor mode be activated (using the -m option), as in the example below.

#! /bin/sh
#
# IP address failover using faild/failsh
#
# This script only starts the failsh, any initializations of services
# are left to individual scripts for each service
#
# $Id: example.html,v 2.3 2006/03/20 22:44:05 afm Exp $
#
port=1291
monitorport=1848
debugport=1918

case .$1 in
	.start)
		# start the failsh
		/opt/AFMfail/sbin/failsh 				\
			-f /opt/AFMfail/etc/`uname -n`.tcl 		\
			-i /opt/AFMfail/var/failsh.pid 			\
			-m ${monitorport} -D ${debugport}
		;;
	.stop)
		# The stop argument kills all processes associated with
		# the failover 
		kill `cat /opt/AFMfail/var/failsh.pid`
		;;
esac

exit 0

The standard installation uses the names master.tcl and slave.tcl. Using other names for the configuration files is recommended, as package installation or deinstallation may remove or overwrite your carefully crafted configuration files.
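
Since the start script above derives the file name from the host name, a missing per-host file is an easy mistake to make. A small check before starting failsh can catch it. This is a minimal sketch; check_config is a hypothetical helper, not part of the failover utilities.

```sh
#!/bin/sh
# Sketch: refuse to start failsh when the per-host Tcl file is missing.
# check_config ETCDIR HOST prints the file name on success and
# returns 1 with a message on stderr otherwise.
check_config() (
	file="$1/$2.tcl"
	if [ -r "${file}" ]
	then
		echo "${file}"
	else
		echo "missing configuration file ${file}" >&2
		exit 1
	fi
)

# possible use in the start branch of the failsh start script:
#   conf=`check_config /opt/AFMfail/etc \`uname -n\`` || exit 1
#   /opt/AFMfail/sbin/failsh -f ${conf} ...
```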

Configuration file on lukretia

The configuration file is a Tcl script that must specify which hosts are master and slave, which parameters should be used when talking to the remote failds, and how to start and stop the service. In this example, the script /etc/init.d/193.5.25.62 is called with the arguments up and down from the procedures startservice and stopservice to bring the service up or down.

#
# IP address failover on Host lukretia (the master)
#

#
# the startservice procedure starts the IP address on the virtual
# interface
#
proc	startservice {} {
	exec /etc/init.d/193.5.25.62 up
	return "UP"
}

#
# the stopservice procedure stops the IP service on the virtual interface
#
proc	stopservice {} {
	exec /etc/init.d/193.5.25.62 down
	return "FAIL"
}

#
# the reclaimservice procedure recovers from a spurious slave
#
proc	reclaimservice {} {
	exec /etc/init.d/193.5.25.62 reclaim
	return "UP"
}

failsvc 193.5.25.62 create
failsvc 193.5.25.62 interval 5
failsvc 193.5.25.62 start startservice
failsvc 193.5.25.62 stop stopservice
failsvc 193.5.25.62 reclaim reclaimservice
failsvc 193.5.25.62 slaves macbeth

failhost macbeth port 1291
failhost macbeth protocol tcp
failhost macbeth community public

failhost localhost port 1291
failhost localhost protocol tcp
failhost localhost community genesis

Configuration file on macbeth

#
# IP Address failover on host macbeth (the slave)
#

#
# the startservice procedure starts the IP address on the virtual
# interface
#
proc	startservice {} {
	exec /etc/init.d/193.5.25.62 up
	return "UP"
}

#
# the stopservice procedure stops the IP service on the virtual interface
#
proc	stopservice {} {
	exec /etc/init.d/193.5.25.62 down
	return "RECOVER"
}

#
# the reclaimservice procedure recovers from a spurious slave
#
proc    reclaimservice {} {
        exec /etc/init.d/193.5.25.62 reclaim
        return "UP"
}

failsvc 193.5.25.62 create
failsvc 193.5.25.62 interval 5
failsvc 193.5.25.62 start startservice
failsvc 193.5.25.62 stop stopservice
failsvc 193.5.25.62 reclaim reclaimservice
failsvc 193.5.25.62 masters lukretia

failhost lukretia port 1291
failhost lukretia protocol tcp
failhost lukretia community public

failhost localhost port 1291
failhost localhost protocol tcp
failhost localhost community genesis