| 
 ![[Photo of the Author]](../../common/images/Guido-S.gif)  by  Guido Socher (homepage)
 
 About the author:
 
 Guido loves Linux because it is always interessting
    to discover how computers really work. Linux with its
    modularity and open design is the best system for such
    adventures. Content:
 | 
 
A Hardware watchdog and shutdown button
 ![[Illustration]](../../common/images/article239/final_circuit_th.gif) 
Abstract:
    The LCD control panel
    article explained how to build a small microcontroller
    based LCD panel with enormous possibilities. Sometimes you
    don't need all those features. The hardware which we design in
    this article is a lot cheaper (the LCD panel was already a bargain)
    and includes just 2 important features from the LCD panel:
    
- A button to shutdown the server
- A watchdog to supervise the server
The hardware consist only of widely available parts. You will
    have no problems to get those parts. All parts together will
    cost you about 5 Euro.
     
What is a watchdog?
    A watchdog in computer terms is a very reliable hardware which
    ensures that the computer is always running. You find such
    devices in the Mars Pathfinder (who wants to send a person to the
    mars to press the reset button?) or in some extra expensive
    servers. 
    
    The idea behind such a watchdog is very simple: The computer
    has to "say hello" from time to time to the watchdog hardware
    to let it know that it is still alive. If it fails to do that
    then it will get a hardware reset. 
    
    Note that a normal Linux server should be able to run
    uninterrupted for several month, in average probably 1-2 years
    without locking up. If you have machine that locks up every
    week then there is something else wrong and a watchdog is not
    the solution. You should check for defect RAM (see memtest86.com) overheated CPUs,
    too long IDE cables ... 
    
    If Linux is so reliable that it will run for a year without
    any problems then why do you need a watchdog? Well the answer
    is simple to: make it even more reliable. There is as well a
    human problem related to that. A server that made no trouble
    for a year is basically unknown to the service personal. If it
    fails then nobody knows where it is? It might as well lock up
    just before Christmas when everybody is at home. In all such
    cases a watchdog can be very useful. 
    
    A watchdog does however not solve all of the problems. It is no
    protection against defect hardware. If you include a watchdog
    in your server then you should also ensure that you have well
    dimensioned (probably not the latest BIOS bugs and chipset
    bugs, properly cooled hardware).
     
How to use the watchdog?
    The watchdog we design here only ensures that user space
    programs are still executing. To have a truly reliable system
    you still have to monitor your applications (web-servers,
    databases) and your system resources (disk space, perhaps CPU
    temperature). You can do this via other user space applications
    (crontab). All this is already described in the LCD control panel article.
    Therefore I will not go into further details here. 
    
    Examples? Here is a small script that can monitor networking, swap usage
    and disk usage. 
    
    #!/bin/sh
    PATH=/bin:/usr/bin:/usr/local/bin
    export PATH
    #
    # Monitor the disk
    # ----------------
    # check if any of the partitions are more than 80% full.
    # (crontab will automatically send an e-mail if this script
    # produces some output)
    df | egrep ' (8.%|9.%|100%) '
    #
    # Monitor the swap
    # A server should normally be dimensioned such that it
    # does not swap. Swap space should therefore be constant
    # and limited.
    # ----------------
    # check if more than 6 Mb of swap are used
    swpfree=`free | awk '/Swap:/{ print $3 }'`
    if expr $swpfree \> 6000 > /dev/null ; then
        echo "$0 warning! swap usage is now $swpfree"
        echo " "
        free
        echo " "
        ps auxw
    fi
    #
    # Monitor the network
    # -------------------
    # your _own_ IP addr or hostname:
    hostn="linuxbox.your.supercomputer"
    #
    if ping -w 5 -qn -c 1 $hostn > /dev/null ; then
        # ok host is up
        echo "0" > /etc/pingfail
    else
        # no answer count up the ping failures
        if [ -r /etc/pingfail ]; then
           pingfail=`cat /etc/pingfail`
        else
           # we do not handle the case where the
           # pingfail file is missing
           exit 0
        fi
        pingfail=`expr "$pingfail" "+" 1`
        echo "$pingfail ping failures"
        echo "$pingfail" > /etc/pingfail
        if [ $pingfail -gt 10 ]; then
           echo "more than 10 ping failures. System reboot..."
           /sbin/shutdown -t2 -r now
        fi
    fi
    # --- end of monitor script ---
    You can combine this with a crontab entry that will run the
    script every 15 minutes:
    1,15,30,45 * * * * /where/the/script/is
     
The watchdog hardware
    There is no standard relay. Every manufacture has it's own
    design. For our circuit it matters very much what the inner
    resistance of the coil is. Therefore you find below 2 circuits
    one for a 5V, 500 Ohm relay and one for a 5V, 120 Ohm relay.
    Ask for the impedance of the relay or measure it with a
    Ohmmeter before you buy it. You can click on the schematic for
    a bigger picture. 
     120 Ohm relay:
     ![[120 Ohm relay]](../../common/images/article239/linuxpcwd120_schematic_th.gif) 
 
     500 Ohm relay:
     ![[500 Ohm relay]](../../common/images/article239/linuxpcwd500_schematic_th.gif) 
 
     
The shutdown button is a push button that connects RTS and CD when
pressed. It looks a bit strange in the schematic because Eagle does
not have a better symbol.
     
     I don't include a part list in this article. You can see what
    you need in the above schematic (Don't forget the DB9 connector for
    the serial line). For the diodes you can use any diode, e.g 1N4148.
Personally I believe that the
    circuit with the 500 Ohm relay is better because you do not need R4
    and do not need a 2000uF (or 2200uF) capacitor. You can use a smaller
    1000uF capacitor for C1.
     
    Note: That for the 120 Ohm circuit you need a Red LED and for
    the 500 Ohm Relay a green LED. This is not joke. The voltage
    drop over a green LED is higher than over a red LED. 
     Board layout, eagle files and postscript files for etching the
    board are included in the software package which you can
    download at the end of the article. The Eagle CAD software for
    Linux is available from cadsoftusa.com.
     
How the circuit works
    The watchdog circuit is build around the NE555 timer chip.
    This chip includes 2 comparators, a Flipflop and 3 resistors 5K
    Ohm each to have a reference for the comparators. Whenever the pin
    named threshold (6) goes above 2/3 of the supply voltage then
    the Flipflop is set (state on). 
    ![[ne555]](../../common/images/article239/ne555_schematic.gif) 
 
     Now look at the schematic of our circuit: We use the RTS pin
    from the serial line as supply voltage. The voltages on the
    RS232 serial line interface are +/- 10V therefore we need a diode
    before capacitor C1. The capacitor C1 is charged very quickly and
    serves as a energy storage to be able to switch on the relay
    for a moment. Capacitor C2 is charged very slowly over the 4.7M
    resistor. The transistor T1 discharges the capacitor C2 if it
    gets a short pulse via the RS232 DTR pin. If the pulses are not
    coming (because the computer has locked up) then the capacitor
    C2 will eventually (the time is about 40 seconds) be  charged
    above 2/3 of the supply voltage and the Flipflop goes to "on".
    
    
    The capacitor C1, resistor R2 the LED and the relay have to be
    dimensioned such that the relay is switched on shortly from
    the energy in capacitor C1 but there is not enough current to
    keep the relay on all the time. We want the "reset button" to
    be "pressed" just for a second or two.
    
    
    The LED will stay on until the server comes up again after a
    reset. 
    
    As you can see in the schematic there is as well a shutdown
    button connected to pin CD. If you press it for a short while
    (15 sec) then the driver software will run "shutdown -h now"
    and shutdown the server. This is for normal maintenance
    operations and has nothing to do with the watchdog.
     
The driver software
    The driver software is a small C program that can be started
    from the /etc/init.d/ scripts. It will permanently switch on
    the RS232 pin RTS and then send pulses to DTR every 12 seconds
    (the timeout of our watchdog is 40 seconds). If you shut down
    your computer normally then the program will switch off RTS and
    give a last pulse to DTR. The effect is that the supply voltage
    capacitor (C1) will already be discharged before the timeout
    comes. Therefore the watchdog will not hit under normal
    operations. To install the software unpack the
    linuxwd-0.3.tar.gz file which you can get from the download page. Then unpack it and run
    make 
    to compile. Copy the resulting linuxwd executable to
    /usr/sbin/linuxwd. Edit the provided linuxwd_rc script (for
    redhat/mandrake, or linuxwd_rc_anydist for any other
    distribution) and enter the right serial port where the hardware
    is connected (ttyS1=COM2 or ttyS0=COM1). Copy the rc script
    then to 
/etc/rc3.d/S21linuxwd
 and 
/etc/rc5.d/S21linuxwd
 That's
    it.
     
Testing
    When you have soldered everything together you should test
    first the circuit before connecting it to the computer. Connect
    the pin that will later connect to the RTS line of the serial
    port to a 9-10V DC power supply and wait 40-50 seconds. You
    should hear a little click when the relay is switched on and the
    LED should go on. The relay should not stay permanently on. The
    LED will stay on until you connect as well the line that will
    later go to DTR to +10V. 
    When you have verified that this works you can connect it
    properly to the computer. The linuxwd program has a test mode
    where it produces some printouts and stops after some time
    to sent pulses over DTR to simulate a locked up system. Run the
    command
    linuxwd -t /dev/ttyS0
    to run linuxwd in test mode (use /dev/ttyS1 if you have the
    hardware on COM2).
     
Hardware installation
    The RS232 interface has the following pinout: 
     
    9 PIN D-SUB MALE at the Computer.
    
      
        | 9 PIN-connector | 25 PIN-connector | Name | Dir | Description | 
      
        | 1 | 8 | CD | input | Carrier Detect | 
      
        | 2 | 3 | RXD | input | Receive Data | 
      
        | 3 | 2 | TXD | output | Transmit Data | 
      
        | 4 | 20 | DTR | output | Data Terminal Ready | 
      
        | 5 | 7 | GND | -- | System Ground | 
      
        | 6 | 6 | DSR | input | Data Set Ready | 
      
        | 7 | 4 | RTS | output | Request to Send | 
      
        | 8 | 5 | CTS | input | Clear to Send | 
      
        | 9 | 22 | RI | input | Ring Indicator | 
    
    
    Connecting the circuit to the RS232 should be straight forward.
    To connect the CPU reset line with the relay you need to locate
    the wires that go to the reset button on your computer. Connect
    the relay from our circuit in parallel to the reset button.
    
 
Conclusion
A watchdog is certainly not a 100% guarantee to have reliable system but
it adds another level of security. A problem can be a situation where
the file system check does not complete after a hardware reset. The new
journaling filesystems might help here but I have not tried them out yet.
The watchdog presented here is inexpensive, not too complex to build
and almost as good as most commercial products.
     
References
    
    
  
 
Talkback form for this article
Every article has its own talkback page. On this page you can submit a comment or look at comments from other readers:
2002-06-22, generated by lfparser version 2.28