Cluster Management
What are your cluster management options?

The type of cluster you operate drives its management and support requirements.
A “lights out” cluster is one that is installed in a remote or co-location environment, with limited physical access by you or your administrators. Lights out clusters need remote power management, remote Keyboard, Video, Mouse (KVM) or serial port access, and automated tools that monitor the cluster and alert you when problems are encountered.
“In house” clusters are located in areas that are easily accessible by you or your administrators. While full remote capabilities may not be as critical in an in house cluster, they may still be desirable.
Cluster management and support is perhaps one of the most overlooked facets of operating a cluster. Two questions must be answered for a successful cluster deployment: what hardware and software capabilities will be installed on your cluster to facilitate management and support, and what are your cluster management, warranty, and support options?
Cluster Management Hardware & Software
Intelligent Platform Management Interface
Aspen highly recommends that you configure Intelligent Platform Management Interface (IPMI) on your cluster nodes. IPMI is a specification for a set of common interfaces for system administrators to monitor and manage any system. IPMI operates independently of the operating system, using a Baseboard Management Controller (BMC) on each node. A BMC is a small solid state card with a network interface that plugs into the node motherboard and is powered by the node power supply. The node does not have to be operational or booted for this access to work.
The IPMI BMC provides remote access to serial ports (Serial over LAN, or SOL) or even the actual video display on the node (KVM over LAN), allowing you to use a web browser or IPMI client to troubleshoot boot problems, diagnose hardware faults, or even modify BIOS settings.
IPMI is also used to power the node on or off remotely, and to retrieve sensor values from the motherboard such as voltages, temperatures, and fan speeds.
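As an illustration of the remote power and sensor access described above, the following is a minimal Python sketch that drives the open source ipmitool command line client. It is not an Aspen tool. The host address, user name, and password file path shown in the usage note are hypothetical, and the sketch assumes ipmitool is installed and the BMC accepts IPMI v2.0 (lanplus) connections.

```python
import subprocess


def ipmi_command(host, user, password_file, *args):
    """Build an ipmitool argument list for one BMC.

    -I lanplus selects the IPMI v2.0 LAN interface; -f reads the password
    from a file so that it does not appear in the process list.
    """
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user,
            "-f", password_file, *args]


def run_ipmi(host, user, password_file, *args):
    """Run ipmitool against one BMC and return its standard output."""
    cmd = ipmi_command(host, user, password_file, *args)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def parse_sensor_table(text):
    """Parse the pipe-separated table printed by `ipmitool sensor`.

    Columns are: name | value | units | status | thresholds...
    Returns {name: (value, units, status)} for sensors with a numeric reading.
    """
    readings = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 4:
            continue
        name, value, units, status = fields[:4]
        try:
            readings[name] = (float(value), units, status)
        except ValueError:
            # Unreadable sensors report "na"; discrete sensors report hex codes.
            continue
    return readings
```

With a hypothetical BMC at 10.0.1.21, run_ipmi("10.0.1.21", "admin", "/root/.ipmipass", "chassis", "power", "status") would query the power state, replacing "status" with "on", "off", or "cycle" would change it, and parse_sensor_table(run_ipmi("10.0.1.21", "admin", "/root/.ipmipass", "sensor")) would collect the numeric sensor readings.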
The IPMI BMC communicates over Ethernet. On some nodes, the BMC can be “vampired” to the primary Ethernet interface, so that no additional cabling is needed; the node can still use that interface for its own communications. Almost all nodes can be configured with an IPMI “3rd LAN” interface, which is a separate Ethernet interface dedicated to IPMI communications. Vampired IPMI interfaces can cut cost, while a 3rd LAN interface allows the IPMI network to be separated from the operational network for security or traffic purposes.
IPMI KVM over LAN interfaces can be used instead of or in conjunction with local KVM console capability on the cluster. The cost of a local KVM connection to each node is roughly equivalent to the cost of an IPMI BMC with KVM over LAN capabilities, and IPMI provides remote power and sensor capabilities to the node in addition to the console capability that a KVM solution provides.
Many customers find that the most cost effective solution for remote (via network) and local (at the cluster) management is to configure IPMI for their cluster, then install a single 1U integrated video console (TFT) unit in the cluster, attached to a master or administrative node with a long cable. In normal operations, the console unit remains attached to the master or administrative node, and a web browser on that node can be used to access every node's console via the node IPMI interfaces. If desired for convenience or on-site work, the console cable can be reattached directly to any node in the cluster and used on that specific node for a time. The master and administrative nodes are normally configured with IPMI “3rd LAN” interfaces, and an additional Ethernet connection from your organization is run to those interfaces to allow remote console and power control from your organizational network should a fault situation occur.
Aspen Beowulf Cluster Management System
The Aspen Beowulf Cluster (ABC) Management System is a commercial, web-based application suite from Aspen that you can purchase and use to monitor and manage your Aspen cluster. ABC requires IPMI interfaces on all nodes in the cluster, and uses them to present all your cluster management tools in a single secure web browser connection that you can access from anywhere you allow.

Using ABC, you can remotely clone and install nodes, attach to any node's video screen (if KVM over LAN IPMI is purchased) or serial port, connect to any node via SSH, monitor all cluster hardware, upgrade software packages, define alarm thresholds for every monitored item, submit, review, and manage scheduled jobs, and set up alerts for different fault conditions.
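To make the idea of per-item alarm thresholds concrete, here is a small Python sketch of how warning and critical limits might be evaluated against monitored values. It is a generic illustration, not ABC's implementation. The metric names, limits, and the Threshold class are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Warning and critical limits for one monitored item (hypothetical model)."""
    warn: Optional[float] = None
    critical: Optional[float] = None
    low_is_bad: bool = False  # True for items such as fan speed, where low values are faults


def evaluate(value, threshold):
    """Return "critical", "warning", or "ok" for a single reading."""
    def crossed(limit):
        if limit is None:
            return False
        return value <= limit if threshold.low_is_bad else value >= limit

    if crossed(threshold.critical):
        return "critical"
    if crossed(threshold.warn):
        return "warning"
    return "ok"


def check_readings(readings, thresholds):
    """Return {(node, metric): level} for every reading that is not "ok".

    readings maps (node, metric) to a value; metrics without a configured
    threshold are ignored.
    """
    alerts = {}
    for (node, metric), value in readings.items():
        threshold = thresholds.get(metric)
        if threshold is None:
            continue
        level = evaluate(value, threshold)
        if level != "ok":
            alerts[(node, metric)] = level
    return alerts
```

For example, with a Threshold(warn=70, critical=85) on a hypothetical "cpu_temp" metric, a reading of 88 on one node would be reported as critical while a reading of 65 on another would raise no alert.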
ABC is especially valuable to the beginning cluster user. It transforms a sometimes complex set of software tools into a converged, homogeneous environment that does not require the in-depth knowledge normally needed to operate, manage, upgrade, or maintain a cluster.
Every user on your cluster automatically has an ABC account. ABC can also provide a web portal through which your cluster users submit scheduler jobs, if the TORQUE or Sun Grid Engine scheduler is used. By default, only root is an administrator, but any user account can be configured as an administrator in order to configure ABC and maintain devices and nodes in the cluster.

Using the ABC “Tools” proxy web service, RAID systems, Ethernet, Myrinet, and InfiniBand switches, and other peripherals that run web servers for their management interfaces can be displayed through the ABC UI even though they do not have externally reachable IP addresses. Access to these sensitive configuration interfaces is normally limited to administrators.
One of the many strengths of ABC is the ability to quickly copy any node and use that copy to add a new node, or to re-image an existing node or all nodes in the cluster. This makes major node upgrades, maintenance, and node recovery quick and easy. Aspen also provides command line tools on our clusters for imaging, remote power control, and sensor monitoring. These are often used by more advanced cluster users to quickly check status on nodes, remotely power them on or off, or re-image large groups of nodes.
Ganglia
Aspen normally also installs and configures Ganglia on your cluster, and can make Ganglia available as a Tools menu option inside ABC, or externally as a default web page for organizations that are used to seeing Ganglia as the front end web page for their clusters. Ganglia is a popular, scalable distributed monitoring system for clusters and grids, and many HPC customers do not consider a cluster complete without it. Aspen will turn Ganglia on or off on your cluster based on your Statement of Work selection.
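Beyond its web pages, Ganglia's monitoring daemon (gmond) can also be queried directly: it sends an XML description of the cluster state to any client that connects to its TCP port, which is 8649 by default. The Python sketch below reads that XML and extracts one metric per host. It is an illustration only; the host name used in the usage note is hypothetical, and it assumes gmond is listening on the default port and permits the connection.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Return the raw XML that gmond sends when a client connects.

    gmond writes the whole document and then closes the connection, so we
    read until recv() returns no more data.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def metric_by_host(xml_data, metric_name):
    """Return {host name: metric value string} for one metric.

    Expects Ganglia's GANGLIA_XML > CLUSTER > HOST > METRIC layout, where
    each METRIC element carries NAME and VAL attributes.
    """
    root = ET.fromstring(xml_data)
    values = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                values[host.get("NAME")] = metric.get("VAL")
    return values
```

For example, metric_by_host(fetch_gmond_xml("master"), "load_one") would return the one-minute load average reported for each host, assuming a node named "master" runs gmond.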
Switched and Metered Power Distribution Unit (PDU) Options
Your cluster can also be configured with remotely switchable PDUs, which can turn off or cycle power to any system connected to them. Prior to the wide adoption of IPMI, any cluster that needed to power nodes on and off remotely required these PDUs, and they are still often used for peripherals that do not support IPMI. If you wish to monitor power consumption on circuits, you may configure either switched or metered PDUs for power connections. Metered PDUs can be polled to determine total power consumption on a circuit, but unlike switched PDUs they cannot power off an attached system. Any Uninterruptible Power Supply (UPS) systems installed in your cluster can also be remotely polled for circuit power consumption.
Serial console switches or servers can also be configured and installed in your cluster to provide serial console access and to log console events in a central location. In this case, nodes are configured identically to IPMI Serial over LAN (SOL) equipped nodes, with the BIOS, the boot loader, and the operating system redirected to a serial port on the node. ABC supports these serial consoles, or Aspen can install the “ConMan” console manager for console logging and command line connections if your organization prefers.
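For the operating system part of that redirection on a Linux node, the kernel is told which consoles to use through console= arguments on its command line. The short Python sketch below builds those arguments. It is an illustration under assumptions: the serial port name and speed vary with the motherboard and BIOS settings, and the BIOS and boot loader must be configured separately to match.

```python
def kernel_console_args(port="ttyS0", baud=115200, keep_video=True):
    """Build Linux kernel command-line console arguments for serial redirection.

    The default port and speed are placeholders; use the values that match
    the node's BIOS serial redirection settings.
    """
    args = []
    if keep_video:
        # Keep kernel messages on the local video console as well.
        args.append("console=tty0")
    # The last console= entry becomes /dev/console, so the serial port goes last.
    # "n8" means no parity and 8 data bits.
    args.append(f"console={port},{baud}n8")
    return " ".join(args)
```

Calling kernel_console_args() with its defaults yields "console=tty0 console=ttyS0,115200n8", which would be appended to the kernel line in the boot loader configuration.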
KVM (Keyboard, Video, Mouse) Switches
Aspen can configure your cluster with a KVM system that can be connected to some or all nodes on your cluster. This allows a user located at the cluster to use a local console unit to “hot key” switch between the video consoles of all connected nodes for maintenance purposes. If you desire remote console connectivity to your cluster KVM, an additional remote unit can be installed and accessed from a web browser just as IPMI interfaces are, allowing you to reach your KVM remotely and then hot key between the displays of different hosts in your cluster. Normally, the remote KVM unit is connected to your organization's Ethernet and given an IP address within that space, not your cluster's internal IP space, to allow remote access by administrators on the organizational network.





