ChangeLog for the failover package
----------------------------------

Notation:

* indicates a point implemented in the corresponding release. Bugs found
  and fixed or new features added in a release usually carry this symbol.
o indicates a point recognized as needing change or a bug fix but not
  fixed in the corresponding release. Whenever I find a problem I want
  to be fixed immediately, I add such an entry to the ChangeLog. If I
  fail to fix it by the release date, the point stays there as an open
  issue.

Release 0.5.22:

* clean up includes
* work around issues with ncurses and std != c99
* remove references to curses, which aren't good enough on Solaris
  anyway
* many cleanups for 32/64 bit types in BER, code is now fully 64bit
* returned to second time differences (instead of ticks), we were not
  able to display anything more precise anyway
* fix some time formatting bugs

Release 0.5.21:

* fix some more bugs that prevented failover from working on 64bit
  systems

Release 0.5.20:

* fix some problems related to pointers or ints being cast to long or
  other types that are of different size on a 64bit architecture. This
  version of failover should work ok on x86_64

Release 0.5.19:

* some cleanup work to make sure failover compiles cleanly on x86_64,
  unfortunately it doesn't work completely correctly

Release 0.5.18:

* removed send_arp_sol program, which was not needed and had some bugs
  that caused it not to work on Solaris 8/9.
* configure now checks for sys/ethernet.h to prevent duplicate
  definitions of ether_ntoa and related functions
* make sure everything compiles smoothly under Solaris 10

Release 0.5.17:

* workaround for a bug in pth.m4: the _ac_pth_line macro causes automake
  and autoconf to fail. Just remove the macro definition and all the
  lines that refer to it. The macro just provides eye candy anyway.
* fixed two missing prototypes in ber.h: spaceFor_int and spaceFor_int64
* fixed unused variables in asyncconn.c, and signed pointer comparison
* fixed missing include of string.h in eber.c
* fixed quite a few non-ANSI typedefs in cgi-lib.c and html-lib.c
* fixed warning in faildsnmp.h: index shadows global declaration
* faildsnmp.c used index as a parameter or local variable many times,
  always conflicting with index in string.h
* fixed format string type error in socketpool.c
* fixed bad include in community.c: config.h was not included, thus
  config constants were not known
* fixed bad index parameter in faildsnmp.h
* removed apparently unused function host_setstate in statehost.c
* missing include time.h in failtrap.c, also fixed prototypes with void
  arguments
* fixed wrong number of arguments in failstress.c:123
* fixed nonstatic faild_rcs_id in faildebug.c
* fixed incorrect format type in dump.c
* fixed incorrect globals argument, causing many warnings in failtcl.c,
  and removed unused functions
* fixed bad comment in failsh.c
* fixed missing static qualifiers in display.c
* fixed overlong string in html.c
* fixed missing include of string.h in monlist.c and type cast error in
  parse_len
* fixed shadowing variable name socket in receiver.c
* fixed missing time.h include in mondate.c
* fixed missing prototype in fwarp.c
* removed unused trim function in mondate.c
* added missing prototype html_servicedir to html.h
* this is the first release compiled with gcc 3 on Linux

Release 0.5.16:

* Typo in getting_started.html.in pointed out by Petre Scheie
* some sockets were not initialized to close on exec, which caused an
  nmbd started from the failsh to still have the debug socket open, so
  that failsh could no longer be restarted

Release 0.5.15:

* fix a bug in socketpool.c caused by gcc 3.3 no longer accepting
  newlines in strings.

Release 0.5.14:

* fix some bugs showing up with modern versions of RPM
* removed some unsigned typedefs as they are now standard
* added a status.php script to illustrate how failmon -h can be used to
  display a status page
* integrated modified setup script provided by Mark Akermann for
  symmetric failover (the sample had not been updated for a long time,
  and was not useful any longer). Thanks, Mark.
* fixed a bug in the configure script that caused the
  --with-pth-library option not to work correctly
* updated autoconf (2.57), automake (1.7.2) and libtool (1.4.3). Note
  that libtool is particularly nasty, as it includes a configure script
  from autoconf 2.13, which does not work with autoconf 2.57. It is
  necessary to run autoconf, autoheader and automake in libltdl
* fixed getopt in failsh.c, which did not check for the documented
  option -t (noted by Stefan Glueckler)
* fixed a bug noted by Stefan Glueckler that caused the reclaim
  procedure not to be called
* fixed RPM download links in index.html.in (they did not point to the
  download directory)

Release 0.5.13:

* update build system to current versions of autoconf (2.53), automake
  (1.6) and libtool (1.4.2)
* migrate to pth-1.4.1
* adapted the service start scripts for the Linux failover sample
  configuration to use the new feature of the send_arp utility (see
  next item)
* send_arp.c (used on Linux only) always sent the gratuitous ARPs out
  through the first interface. It has been extended to take the
  interface name as the fifth argument.

Release 0.5.12:

* Version numbers in manual pages were not properly updated, but this
  does not affect the accuracy of the pages otherwise.
* time ago string was not properly aligned.

Release 0.5.11:

* fix a small bug in the monitor client, which displayed a zero offset
  as (future?). This should only happen for negative relative times
  (which shouldn't happen at all).

Release 0.5.10:

* updated to shtool 1.5.2, as 1.5 failed on Solaris because Solaris awk
  cannot cope with %X formatting strings.
* the FreeBSD setup script overwrote the Linux setup script on a Linux
  system, which really isn't what we want. Caused by a mistake in the
  automake file.
* another mistake in the setup script for ipfailover-linux caused the
  broadcast address to be set incorrectly.

Release 0.5.9:

* updated documentation for FreeBSD, info about warnings from ber.h
* added --with-pth-library to link with a static Pth library. This
  allows Solaris packages to be self-contained
* Sample configuration added for FreeBSD, look in
  samples/ipfailover-freebsd for examples
* FreeBSD does not need anything special for ARP, as it works with
  aliased interfaces as it should. Whenever an alias is added to an
  interface a gratuitous ARP is sent.
* Add a delete function to fwarp.c so that /usr/sbin/arp can be used if
  deleting an ARP entry ``manually'' is too cumbersome.
* NOTE: Pth should not be configured with prefix /usr. This causes the
  M4 macros coming with Pth to add the preprocessor flag
  -I/usr/include, which causes gcc on Solaris to fail for files that
  include stdarg.h.
* configure.in bug: --with-tcl-version did not set correct -l flag
* as we are no longer using kernel threads, it has become irrelevant to
  use the reentrant versions of gethostbyname and getservbyname.
  Hence we can now do without netdb_r.h (not fixed yet in the
  healthchecker)
* crypt.h does not exist on FreeBSD; community.c should not include the
  header if it does not exist
* ber.h: defines for types ulong etc. are not present in types.h on all
  platforms, don't know how to fix this (I don't want to rewrite
  ber.{c,h})
* missing includes:
    sys/types.h in src/include/fail.h
    sys/types.h in src/server/faild.c
    unistd.h in src/include/utils.h
* some *.x files were still included in the Solaris package, which made
  it impossible to build the package from the standard distribution
* added configureit sample for FreeBSD
* portability fixes for FreeBSD:

Release 0.5.8:

* genesis community in sample/ipfailover-linux was RO instead of RW,
  causing out of the box installations to fail
* Add note about Tcl being a prerequisite package to the README
* add option -n to faild to decrease nice number/priority of the process
  to improve interactive response
* add the option -m to faild, which locks all its memory pages to keep
  them from being paged out
* improve start scripts
* fix an inclusion bug in community.h
* check the modification timestamp of the communityfile at least once a
  minute and reread it in case it has changed. Allows changes to
  community strings without restarting the faild
* faild failed to remove the pid file at exit
* samples/multi/Makefile.in was missing in the distribution
* if failmon is called as failmon.cgi, it defaults to HTML instead of
  curses, and also includes the content-type header
* Add option -1 to failmon for a single text based display of the state
  information
* make sure that when faild daemonizes by forking, the parent will not
  exit before the child is ready to serve requests. This is done by
  opening a pipe between the two, with the parent waiting for the child
  to close it.
* using shtool to propagate version information in a more consistent way
* corrected libtool library version numbers using shtool, this also has
  consequences for the package and rpm build

Release 0.5.7:

* a debugging fprintf was left in the code

Release 0.5.6:

* failstat dumps core if target host not resolvable
* failmon should connect to localhost by default, now it takes the
  address associated with the nodename
* monitor displays the time of last connect incorrectly: needs new entry
  in the host structure: lastconnect. Also modifies the monitor
  protocol.
* fix some minor HTML bugs in html.c
* documentation: failmon only gets new info from the shell every minute,
  unless something changes. Therefore the time of last call is not
  correct, the header now says time of last reported call

Release 0.5.5:

* the default setup installed with the Solaris package installation now
  uses more secure community strings: `public' for read only, updates
  need `genesis'
* same for the Linux sample installation
* all communityfiles in the distribution now document the fields of that
  file
* version number on monitor page included
* removed unnecessary blank inside time ago string in HTML output of
  failmon
* version number for Solaris failover package was wrong

Release 0.5.4:

* remove the images run.png and sleep.png, which are no longer used with
  the new scheduler.
* add image size information to all img tags to improve loading
  performance.

Release 0.5.3:

* improve the location for the default communityfile, and make sure it
  is included in the Solaris package.
* make sure the communityfile has permissions 0600, owner root, to
  prevent local users from stealing community strings.

Release 0.5.2:

* local last_change variables not updated properly.
  This was caused by the fact that the remote change time always
  overwrote the local time, and by the fact that monitor updates are
  only sent when necessary or at large intervals
* fix a bug in the display of the time ago string in the monitor: it was
  not right-aligned and thus did not overwrite some characters from a
  previous display
* Send debug messages to stderr instead of stdout, otherwise debugging
  failmon is much more difficult than necessary

Release 0.5.1:

* Bug in host.c and hostlist.c: delays/timeouts were still specified in
  milliseconds, although the new version uses seconds.
* send in a remote call was not protected from blocking, now using
  pth_write_ev with a timeout event.
* parms.in.h still contained 30 seconds as a check interval, which is
  terribly long.
* extend local.c to handle dead local faild (0.4 never did this)
* changed title line in host listing: the time stamp indicates the last
  connect, not the last call
* bugfix: indoubt state from a remote faild was not properly propagated
  to the failmon. Furthermore, the FAIL state was reset with every
  probe, resulting in a meaningless display in the monitor.
* defaults in the package (Solaris) and the setup script (Linux) are now
  more suitable for the othello private network, to simplify testing

Release 0.5.0:

This release is a major overhaul of failover. The whole communication
architecture and the threading model are completely new (and more
correct I hope). In view of the pervasiveness of changes, it wouldn't be
very helpful to document every single modification, hence some of the
change log entries below are rather sketchy.

* replace RPC communication by SNMP, which helps prevent problems with
  RPC libraries that plagued earlier releases. Also gone are the
  annoying differences in RPC implementations and the necessity to keep
  client and server libraries separate because client and server stubs
  use the same name in some implementations.
* the failc client includes an undocumented option -n which can be used
  to repeat a call to faild many times. This is useful for memory
  debugging (memory leaks in the client or server show up much earlier)
  and performance measurement (run time is no longer dominated by
  process startup time). Of course, the undocumented failstress client
  can also be used for the same purposes. Some performance numbers:
  - On a Pentium 233 under Linux, failmon consumes about 4 seconds of
    CPU time per hour.
  - A failsh without any monitors attached does not consume more than
    60 seconds per day.
  - A call to faild takes less than 3.2 ms in the client, and as
    encoding and decoding are quite symmetric, it is reasonable to
    assume that the server uses about the same time.
  To summarize: failover uses less than 0.1% of the CPU on a typical
  system.
* replace XDR encoding of monitor communication by ASN.1
* removed rmalloc, which is superseded by the more sensitive efence
  (which e.g. many Linux distributions include). There also were some
  problems compiling rmalloc due to the fact that it is not entirely
  ANSI compliant.
  Here are some numbers about memory consumption (in 8k pages on Linux
  2.2.13 with glibc2.1) of failover processes:

  Process     Linux VM usage       Solaris VM usage
            virtual  resident    virtual  resident
  --------  -------  --------    -------  --------
  failsh       3588       588       4064      1728
  faild        2340       290       2864      1576
  failmon      3260       596       3448      2096

* remove pthreads and associated concurrency problems, use GNU portable
  threads (Pth) instead
* use host index and service index in the monitor display, which is less
  confusing (for the nonhacker) and gives some more space for other
  parameters
* get rid of state files, which were never used in practice
* designed and implemented the socketpool abstraction for the faild
  server
* simplify the design of the monitor: instead of one large event loop
  that handles every type of event (keyboard, new data from server),
  there are now two independent threads, one receiving data from the
  server, the other handling keyboard input. The receiver thread sends a
  complete set of host and service data to the display thread through
  Pth's message port facility.
* fix some missing memdebug calls.
* connecting to a remote host no longer uses the asyncconnect library
  function but instead uses pth_connect_ev, which is simpler to use and
  integrates nicely with the new per host thread model (see next item).
* there is now a separate thread per host, while the previous versions
  used a thread per service. As queries are also per host (see next
  item) and not per service, many parameters that were previously
  associated with a service have gone or are now associated with a host.
* calls to a remote host no longer call for the state of individual
  services, but rather for the state of all services. This creates a
  slight overhead (on average 24 bytes per service that is not used),
  but simplifies the design considerably, as there is no possibility of
  interleaving requests.
* hc has been turned into a subpackage to simplify building with or
  without it.
  As hc is currently unusable due to a bug in Pth-1.3.6, we prefer to
  exclude it from the package
* building failover against libpth.so (the shared object) is the
  default if that library is available. There is no flag like
  --with-tcl-library for the Pth library (libtool balks on it), so if
  you want to build statically linked binaries, you should temporarily
  remove libpth.so.

Release 0.4.8:

* realized that the ChangeLog was not updated in previous releases,
  promised to do better in forthcoming releases
* failsh daemonizes now even if debugging, use -F to prevent that: this
  simplifies debugging (just add -d all to the command line in the start
  script /etc/init.d/failsh)
* removed annoying and nonsensical error message when failstat was
  called without arguments
* failmon connects now asynchronously, and gives up after 30 seconds
* moved asyncconn.c/h and netdb_r.h to the top level library/includes,
  as they seem to be quite useful there as well
* modified debug.{c,h} so that new facilities can be added on the fly
* fixed quite a few typos in hc modules
* rmalloc can write its output to a separate file, but this is not yet
  used in all programs. Some messages, however, will always go to the
  console: the ones generated before the file is opened
* all clients now use rmalloc
* extensive fixes in several healthchecker modules, which have been
  brought to the same standard of debugging and configuration
* module tests now all succeed

Release 0.4.1:

* some minor bug fixes
* fwarp was missing from package (necessary with Firewall-1)
* all GIFs have been converted to PNGs to avoid patent problems with
  Unisys

Release 0.4.0:

* first public appearance of healthchecker, still quite buggy
* fixes some serious memory leaks.
  However, there still seem to be some left

Release 0.3.4:

* fixed a bug that caused a seg fault if start/stop/reclaim proc was not
  defined for a service
* failover.mib was missing from the distribution
* a header was not included in failtcl.c, so that two functions were
  called with incorrect arguments, which could cause seg faults in some
  cases

Release 0.3.3:

* added an option -u to the faild to run as a different user. faild
  normally does not need to run as root, while failsh most probably does
  (it needs to start/stop network interfaces, only root can do that).
* added snmp trap support
* rpm support to create a Linux binary package

Release 0.3.2:

* update the ChangeLog, this was not done in 0.3.1
* added two configure options:
    --with-tcl-version   specify tcl version
    --with-tcl-library   statically link a tcl archive library
* fixed an inclusion problem in eventloop.c, switch between curses.h and
  ncurses.h (Solaris does not have ncurses)
* fixed a sed script bug in the linux ipfailover example setup

Release 0.3.1:

* bug fixes in the shell, should hopefully remove crashes seen in Red
  Hat

Release 0.3.0:

* new state "reclaim" to recover from erratic behaviour of slaves added
* many typos fixed
* added script man2html to work around missing man2html
* added configuration for linux ip failover

Release 0.2.2:

* configure on Linux never actually worked automagically, this should be
  fixed now

Release 0.2.1:

* fixed a bug in the distribution, where some files for the faildebug
  utility were not included in the distribution
* allowed the debug facility 'none' that clears all previously specified
  facilities
* automatically include version numbers and last change dates in manual
  pages
* fixed a bug in the Solaris package installation where failover was
  configured even if the administrator had said no to that question

Release 0.2:

* Solaris package now comes with an automated installation procedure for
  an implementation for IP address failover
* Linux pthreads use
  USR1 to handle pthread_cond_timedwait, which interferes with the
  signaling mechanism to modify the debug level. On the other hand, the
  old scheme used is really only appropriate if debug levels are
  ordered. So we remove this interface completely and replace it by
  another RPC interface
* provide a simple replacement for the posix4 system calls
  clock_gettime, clock_getres and clock_settime (which is not really
  implemented, but returns ENOSYS). In contrast to a real implementation
  of clock_gettime, the replacement has only 1 second resolution.
  clock_gettime is missing from libc on Linux
* change name errno in union state_result in failover.x to err_no,
  because Linux with glibc (SuSE 6.0) has errno as a macro that
  interferes with structure expansions.

$Id: ChangeLog,v 2.48 2009/06/05 10:11:38 afm Exp $