Robert J Gautier, Ateb Ltd.

Design and Management of a Brute-Force Cluster: Use of a 64-node Beowulf Cluster in Bioinformatics

This paper describes Grendel, a 64-node Beowulf cluster comprising 40 Athlons at 600MHz and 24 PII/350s (it was originally two separate clusters). Grendel is used by the Bioinformatics Research Group at the Computer Science Department, University of Wales, Aberystwyth for work on functional genomics. This workload consists of a large number of database searches, each of which might take several days to complete, but which can be executed in any order, as long as they are all eventually done. We describe how those characteristics influenced the choice of hardware and software for Grendel.

We chose to use local disks to store the databases used by Grendel jobs, in order to avoid the need for a very high-speed network. This helps to keep jobs independent of each other, and provides us with useful redundancy. However, it also means we have had to solve the problem of keeping 60 disk partitions consistent.

Because Grendel's job steps can be executed in any order, there is no need for a 'job queue' in the usual sense. However, with jobs taking a week or two to run, we have to be particularly careful that management of the cluster (as well as crashes and reboots) causes minimal loss of work in progress.

A scheduling and management system ('farmer'), based on a PostgreSQL database and implemented mainly in Tcl, provides (via both command line and Web interfaces) a means to enter work, to control availability of nodes for work, and to monitor both the software and the hardware state of nodes. In order to support long-lived jobs, farmer is built so that it is possible to restart the cluster scheduler, the node supervisor or even the PostgreSQL database itself without disturbing the progress of a job on any node.
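The idea of keeping all scheduling state in a database, so that the scheduler itself can be stopped and restarted freely, can be illustrated with a minimal sketch. This is not farmer's code: it is written in Python rather than Tcl, uses the standard-library sqlite3 module as a stand-in for PostgreSQL, and the table layout and function names (jobs, open_farm, claim_next, requeue_node and so on) are hypothetical. It also models only the bookkeeping, not the running job processes themselves.

```python
import sqlite3

# Hypothetical schema: one row per job step, with its current state.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id      INTEGER PRIMARY KEY,
    command TEXT NOT NULL,
    state   TEXT NOT NULL DEFAULT 'pending',
    node    TEXT
)
"""


def open_farm(path):
    """Open (or reopen) the work database; all state lives in the file."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def add_work(conn, command):
    """Enter a new job step and return its id."""
    with conn:
        cur = conn.execute("INSERT INTO jobs (command) VALUES (?)", (command,))
    return cur.lastrowid


def claim_next(conn, node):
    """Hand any pending job to `node`; order is deliberately unspecified."""
    with conn:
        row = conn.execute(
            "SELECT id, command FROM jobs WHERE state = 'pending' LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE jobs SET state = 'running', node = ? WHERE id = ?",
            (node, row[0]),
        )
    return row


def mark_done(conn, job_id):
    """Record that a job step has finished."""
    with conn:
        conn.execute(
            "UPDATE jobs SET state = 'done', node = NULL WHERE id = ?",
            (job_id,),
        )


def requeue_node(conn, node):
    """After a node crash, return its unfinished work to the pool."""
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET state = 'pending', node = NULL "
            "WHERE state = 'running' AND node = ?",
            (node,),
        )
    return cur.rowcount


def counts(conn):
    """Summarise how many jobs are in each state."""
    return dict(conn.execute("SELECT state, COUNT(*) FROM jobs GROUP BY state"))
```

In this sketch, closing the connection and calling open_farm again on the same file plays the part of a scheduler restart: the record of which node holds which job survives, because nothing is kept in the scheduler's memory. A real system would also need the safeguards a multi-user database server provides when many nodes claim work concurrently.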

We describe how all this works, how it occasionally doesn't, problems we met in building Grendel, and some of the lessons we learnt along the way.

