Since late 2000 we have developed and maintained a general-purpose technical and scientific computing cluster running the FreeBSD operating system. In that time we have grown from a cluster of 8 dual Intel Pentium III systems to our current mix of 64 dual Intel Xeon and 289 dual AMD Opteron systems. This paper looks back on the system architecture as documented in our BSDCon 2003 paper "Building a High-performance Computing Cluster Using FreeBSD" and on our changes since that time. After a brief overview of the current cluster, we revisit the architectural decisions in that paper and reflect on their long-term success. We then discuss lessons learned in the process. Finally, we conclude with thoughts on future cluster expansion and designs.
Our cluster, Fellowship (for "The Fellowship of the Ring"), consists of 233 dual-CPU nodes. 64 of these nodes use Intel Xeon CPUs and 169 use AMD Opterons (32 of which are dual core). Another 120 dual-core Opteron nodes are on order and will be installed shortly, yielding a total of 1010 CPU cores. The nodes are connected to each other and to an array of core servers via a gigabit Ethernet switch. The majority of these servers run FreeBSD 6, as do the nodes. The nodes are booted over the network using the Intel PXE framework. They have internal disks, but those disks are used solely for local, temporary storage and are automatically configured during boot. To control access to the cluster we run Sun Grid Engine (SGE), an open source batch queuing system. Users submit job scripts, which may consist of independent single-system jobs or of parallel jobs using MPI (the Message Passing Interface), PVM (the Parallel Virtual Machine), or Grid Mathematica. System operations are monitored using the Ganglia cluster monitor, the Nagios system and service monitor, and SGE's internal data collection capabilities. We currently have over 100 users and see nearly 100% utilization during the day.
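The core total above can be checked with a little shell arithmetic. The following is a minimal sketch that uses only the node counts given in the text. It assumes every node has exactly two CPU sockets and that the Xeons are single core, which is consistent with the stated total. The variable names are illustrative.

```shell
#!/bin/sh
# Node counts from the text; every node is assumed to have two CPU sockets.
xeon_nodes=64            # dual Xeon nodes, assumed one core per CPU
opteron_nodes=169        # dual Opteron nodes currently installed
opteron_dualcore=32      # of those, nodes whose CPUs are dual core
ordered_dualcore=120     # dual-core Opteron nodes on order

# Opteron nodes with single-core CPUs.
opteron_singlecore=$((opteron_nodes - opteron_dualcore))

# Single-core nodes contribute 2 cores each, dual-core nodes 4 each.
installed_cores=$(( (xeon_nodes + opteron_singlecore) * 2 + opteron_dualcore * 4 ))
total_cores=$(( installed_cores + ordered_dualcore * 4 ))

echo "installed cores: $installed_cores"
echo "cores after expansion: $total_cores"
```

Under these assumptions the installed nodes account for 530 cores and the pending order brings the total to 1010.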
While designing and implementing Fellowship, we made a large number of architectural decisions. Some, such as the choice of network booting using the FreeBSD diskless infrastructure, have continually paid dividends, in this case by dramatically simplifying maintenance. Others, such as using custom chassis in 2-post racks, have worked well for us but aren't for everyone. Similarly, the use of serial consoles was a good idea in the abstract, but it is a feature we have pulled back from due to its high cost and its limited utility most of the time. Finally, some decisions, such as allowing users direct system access during early development and attempting to encourage voluntary migration to SGE, were definite mistakes, in this case due to user inertia. In this section we discuss the decisions we needed to make, the options we considered then or would consider today, our initial decision, any deviations from that decision, and how the decision played out. The goal of this section is to evaluate our architectural decisions and, in the process, to give readers the tools they need to begin designing their own clusters.
In the process of operating Fellowship, we have had a number of lessons driven home to us. None of them are truly shocking, but they were not things we were expecting. For example, relatively uncommon problems can be major issues in a cluster. In one case, we were forced to disable BIOS access via serial console on some machines because they hung at boot around 1 time in 30. At that rate, we thought we had corrected the problem during testing by reducing the baud rate, when in fact we had only made it less common. In production, several machines hung at boot every time we rebooted the cluster, which forced us to disable the feature. The need to perform tasks on all nodes has really driven home the power of the Unix tool model. For example, the following command was used to automatically add each host to the correct per-CPU-type host group:
grep 'r[01][01789]' hostlist | fping -a | \
    xargs -n1 -I node rsh node 'qconf -aattr hostgroup hostlist `hostname` \
    @cpu_`opteron_cpuid | grep "name string" | sed -e "s/[^0-9]*//"`'
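A pipeline like this is easiest to trust when it is built up one stage at a time. The sketch below exercises only the first stage, the host selection, against a made-up host list, so that the set of selected names can be inspected before anything is run on a node. The rNN host names and the variable names are hypothetical; a real run would read the names from the hostlist file.

```shell
#!/bin/sh
# Hypothetical host list standing in for the contents of the hostlist file.
hosts='r00
r01
r02
r10
r17
r19
r21'

# Same pattern as the first stage of the command above, quoted so the
# shell does not try to expand it as a filename glob.
selected=$(printf '%s\n' "$hosts" | grep 'r[01][01789]')

# Show which hosts the later stages would act on.
printf '%s\n' "$selected"
```

With this sample list the pattern selects r00, r01, r10, r17, and r19, and skips r02 and r21. Checking each stage this way before adding the next helps keep a mistyped pattern from reaching the whole cluster.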
This and other similar examples highlight the value of knowing the powerful set of tools at your fingertips as an administrator. Another key lesson is how hard it is to remember that our users are experts in one or more domains, but usually not in Computer Science or related fields, despite the often large and complex applications they have written. This is helpful to keep in mind when debugging, as it generally means the users don't know where to start. It is also important to keep in mind when discussing complex issues like scheduling, because users often have a mental model of computation that does not match reality. For example, we have found that many users believe that a job that starts sooner, no matter how contended the system is, will finish sooner, when there are a number of cases where this is clearly untrue.
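The scheduling misconception can be illustrated with a deliberately simple model. The model is an assumption introduced here for illustration, not a description of how SGE schedules jobs. A job either starts immediately on a node shared equally with other long-running jobs, or waits in the queue for a dedicated node. All of the numbers and variable names below are made up.

```shell
#!/bin/sh
# Toy model (an assumption for illustration, not SGE behaviour):
# CPU time on a node is divided equally among the jobs running on it,
# and the other jobs run for at least as long as ours.
work=10        # hours of CPU the job needs when it runs alone
sharers=3      # other jobs already running on the contended node
queue_wait=5   # hours spent queued before a dedicated node is free

# Start now: the job gets 1/(sharers + 1) of the CPU throughout.
finish_contended=$(( work * (sharers + 1) ))

# Start later: wait in the queue, then run at full speed.
finish_queued=$(( queue_wait + work ))

echo "start immediately, contended: done after $finish_contended hours"
echo "queue for a free node:        done after $finish_queued hours"
```

In this model, waiting for a free node wins whenever the queue wait is shorter than the job's solo run time multiplied by the number of sharers. With the numbers above, the job that starts five hours later finishes 25 hours earlier.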
Thus far we have followed a model of continued expansion and refresh of Fellowship's hardware and software rather than wholesale replacement. This has had the advantage of allowing us to achieve a larger size than would otherwise have been possible given our capital budget. One downside is that some decisions, such as node form factor and network interconnect, are hard to change incrementally. With an eventual second cluster in mind, for more capabilities and improved redundancy, we have tracked both technology trends that are easy to apply to Fellowship and ones that would only apply to a clean slate. We are in the process of designing a second, redundant cluster to be installed at another location some time this fiscal year. In many respects it will be similar to Fellowship. The main expected differences are a faster network, probably 10 Gigabit Myrinet, and the use of clustered storage, such as that provided by Panasas and Isilon, to completely eliminate node disks, our single largest source of hardware problems.
Brooks Davis is a Senior Member of Technical Staff in the High Performance Computing Section of the Computer Systems Research Department at the Aerospace Corporation. He has been a FreeBSD user since 1994 and a FreeBSD committer since 2001. He earned a Bachelor's degree in Computer Science from Harvey Mudd College in 1998. His computing interests include high performance computing, networking, security, mobility, and, of course, finding ways to use FreeBSD in all these areas. When not computing, he enjoys reading, brewing, cooking, and pounding on red-hot iron in his garage blacksmith shop.