Local Disk Configurations



Compute Node Local Disk Deployments

 

In some cases, the compute nodes in your cluster will have disks. Compute node storage could be a single disk or even a RAID configuration on each node. There are several reasons to equip your compute nodes with disks.

 

Your cluster implementation could be disked, which means that each node in the cluster has a local operating system installed on a disk in the node. This design has the advantage of initial simplicity and is familiar to many HPC administrators, so little or no additional administration training is necessary. It also allows local swap partitions to be configured on each node. If a code's memory requirements exceed the physical RAM in the node, the Linux kernel swaps out less-used memory pages and gives the freed memory to your currently running application. This expands the amount of virtual memory available for your application's use by combining the physical memory installed in the node with the space configured for swap. However, disk access times and throughput are much slower than memory, so a code that uses swap often will run much more slowly.
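As a rough illustration of the "physical memory plus swap" arithmetic above, the following minimal sketch sums the MemTotal and SwapTotal fields from text in the format of the Linux /proc/meminfo file. The function name and the sample values are hypothetical and chosen only for illustration.

```python
def virtual_memory_kib(meminfo_text):
    """Return (ram_kib, swap_kib, total_kib) parsed from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    ram = fields.get("MemTotal", 0)
    swap = fields.get("SwapTotal", 0)
    return ram, swap, ram + swap


# Hypothetical sample: a node with about 16 GB of RAM and 8 GB of swap.
SAMPLE = """MemTotal:       16384000 kB
MemFree:         1234567 kB
SwapTotal:       8192000 kB
SwapFree:        8192000 kB
"""

ram, swap, total = virtual_memory_kib(SAMPLE)
print(f"RAM {ram} kB + swap {swap} kB = {total} kB of virtual memory")
```

On a real Linux node, the same function could be given the contents of /proc/meminfo instead of the sample text.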

 

Some HPC codes require local scratch space on each node. The usage is temporary and occurs only while the code is running. At the end of the model run, these codes normally delete the temporary files they used during the calculation, freeing up the space for the next iteration. Some single image clusters place disks in their compute nodes even though those disks don't have a local operating system installed. They do this so that local disks can be configured with swap and scratch space, making the cluster more flexible.
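The create, use, and delete pattern for local scratch space can be sketched with Python's standard tempfile module. This is only an illustration: the function names are hypothetical, and the scratch location (the scratch_root argument) is an assumption that would be set to whatever local scratch file system your nodes provide.

```python
import os
import tempfile


def run_with_scratch(work, scratch_root=None):
    """Run work(scratch_dir) in a private scratch directory, then delete it.

    scratch_root is a hypothetical parameter; None falls back to the
    system default temporary directory.
    """
    with tempfile.TemporaryDirectory(prefix="job-", dir=scratch_root) as scratch_dir:
        result = work(scratch_dir)
    # The directory and everything in it has been removed at this point.
    return result, scratch_dir


def example_work(scratch_dir):
    """Stand-in for a calculation that writes an intermediate file."""
    path = os.path.join(scratch_dir, "intermediate.dat")
    with open(path, "wb") as f:
        f.write(b"\0" * 1024)
    return os.path.getsize(path)


size, used_dir = run_with_scratch(example_work)
print(size, os.path.exists(used_dir))
```

Because cleanup happens when the with block exits, the space is freed for the next run even if the calculation raises an exception.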

 

In some cases where code execution cycles are critical, and it is imperative that the run not be interrupted, nodes may be configured with a RAID 1 mirror in order to prevent a single disk failure from aborting the run. In other cases, local scratch space performance is critical and becomes the limiting factor for code execution. In these situations, designers use software or hardware RAID 0 striped configurations in order to achieve the fastest possible speed for local scratch. Finally, you occasionally have codes with high performance local scratch requirements that are also critical, or not easily rerun, which is the worst of all possible worlds. In this case, nodes are often configured with RAID 5 or even RAID 10 local disk configurations.

 

Another local scratch space solution exists for codes that take advantage of MPI-IO: configure and use dedicated I/O nodes, usually with local RAIDed file systems, to serve PVFS shares to the rest of the cluster nodes. Usually, nodes serving PVFS shares are not used as compute nodes for stability reasons, so some thought is necessary to deploy a PVFS solution cost-effectively.

 

Master or Storage Node Local Disk Deployments

 

On small and medium clusters (256 nodes or fewer), it is quite common to use integrated local disk storage on either a master node or a dedicated I/O node to service the cluster's data needs using the NFS protocol. You can expect aggregate data speeds of ~100 MB/s when your cluster mounts an NFS data share via Gigabit Ethernet, and ~170 MB/s when the share traverses the high speed interconnect (InfiniBand, Myrinet, 10 Gbit Ethernet). Please note that this is the total performance available from that single NFS server to all nodes in the cluster. This is quite sufficient performance for many HPC codes, and is a cost-effective way to make anywhere from 5 to 100 TB of disk storage available for your computing needs.
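Because the figures above are totals for the single NFS server, the bandwidth each node sees shrinks as more nodes read at once. The sketch below works through that arithmetic under a simplifying assumption that all active nodes share the server's throughput evenly; real workloads will not divide this neatly. The function name and the 64-node example are hypothetical.

```python
def per_node_mb_per_s(aggregate_mb_per_s, active_nodes):
    """Even share of one NFS server's aggregate throughput per active node."""
    if active_nodes < 1:
        raise ValueError("active_nodes must be at least 1")
    return aggregate_mb_per_s / active_nodes


# Aggregate figures quoted in the text: ~100 MB/s over Gigabit Ethernet,
# ~170 MB/s over the high speed interconnect.
for label, aggregate in (("Gigabit Ethernet", 100), ("high speed interconnect", 170)):
    share = per_node_mb_per_s(aggregate, 64)
    print(f"{label}: {share:.2f} MB/s per node with 64 nodes active")
```

With 64 nodes reading simultaneously, each node gets on the order of 1.6 MB/s over Gigabit Ethernet and 2.7 MB/s over the faster interconnect, which is why this design suits codes with modest I/O needs.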

 

In a small or medium cluster, the size of your cluster and your data serving needs will dictate whether you incorporate storage on your master node or dedicate a separate node for storage. In both cases, Aspen recommends that any head node, be it a master or dedicated I/O node, have dedicated O.S. disks that are RAID 1 mirrored in order to prevent a single disk failure from crippling the node. If the node is to be used as a data server for your cluster, it generally will incorporate a high speed RAID card that supports RAID 5, RAID 6, RAID 10, RAID 50, or RAID 60. While the O.S. disk RAID 1 might be on the same RAID card, a separate RAID volume with separate disks is used to house the data space that will be NFS exported to the cluster. Aspen always recommends the use of hot spare drives in your RAID partitions.
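To compare the capacity cost of the RAID levels and hot spares mentioned above, here is a minimal sketch of the usual usable-capacity arithmetic. The function name, parameters, and the default of two sub-arrays for RAID 50 and RAID 60 are assumptions made for illustration; actual RAID cards may impose other layouts and reserve additional space.

```python
def usable_tb(level, disks, disk_tb, hot_spares=0, groups=2):
    """Approximate usable capacity in TB for a RAID volume.

    disks      total drives assigned, including hot spares
    hot_spares drives held in reserve (they store no data)
    groups     number of sub-arrays for RAID 50/60 (assumed default: 2)
    """
    active = disks - hot_spares
    if level == 1:
        if active != 2:
            raise ValueError("this sketch treats RAID 1 as a two-disk mirror")
        data = 1
    elif level == 5:
        if active < 3:
            raise ValueError("RAID 5 needs at least 3 active disks")
        data = active - 1  # one disk's worth of parity
    elif level == 6:
        if active < 4:
            raise ValueError("RAID 6 needs at least 4 active disks")
        data = active - 2  # two disks' worth of parity
    elif level == 10:
        if active < 4 or active % 2:
            raise ValueError("RAID 10 needs an even number of disks, 4 or more")
        data = active // 2  # every disk is mirrored
    elif level in (50, 60):
        parity = 1 if level == 50 else 2
        if active % groups or active // groups < parity + 2:
            raise ValueError("disks do not divide into valid sub-arrays")
        data = active - groups * parity
    else:
        raise ValueError(f"unsupported RAID level: {level}")
    return data * disk_tb


# Eight 2 TB drives, one of them a hot spare.
print(usable_tb(5, 8, 2, hot_spares=1))  # RAID 5
print(usable_tb(6, 8, 2, hot_spares=1))  # RAID 6
```

In this example, moving from RAID 5 to RAID 6 costs one additional drive's worth of capacity (12 TB versus 10 TB usable).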

 

Given today's larger drives (2 TB drives are now available as of mid-2009), it's a good idea to limit the number of disks in your RAID volumes and to seriously consider using RAID 6 volumes to house your data instead of the more traditional, lower overhead RAID 5. The reason is somewhat complex, but important. The more disks in the RAID volume, the more likely you are to encounter an unrecoverable sector during a RAID reconstruction. This problem is exacerbated by the fact that the read error rate of disks has not improved in recent years, even though the disks have become larger. The industry standard unrecoverable read error rate for SATA drives is about 1 in 10^14 bits. That means that, on average, one read error is expected for every 10^14 bits (about 12.5 TB) read. During a RAID reconstruction on a 10 TB RAID volume using larger drives, it is quite likely that a read error will occur at least once during the rebuild. Enterprise quality drives push the read error rate to about 1 in 10^15 bits, which is safer, but still a scary proposition. The second parity drive used in a RAID 6 configuration can protect you when that read error occurs during the RAID rebuild cycle after a disk fails. Let us be perfectly clear:

 

  1. The larger drives used in RAID sets today virtually guarantee that a read error will occur during a RAID rebuild.
  2. Drive failures follow the Bathtub Curve, so disks tend to fail at the same time.
  3. If you have configured your disks as a RAID 5 and a read error occurs on one of the non-failed drives during that rebuild, you are virtually guaranteed to lose data.
  4. RAID 6 gives you another level of protection against this possibility.
  5. Always use hot spare drives in your RAID partitions to lower your exposure time.
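The rebuild risk described above can be estimated with a short calculation. This is a simplified sketch, not a vendor formula: it assumes bit errors are independent, that the quoted rate (one error per 10^14 or 10^15 bits) applies uniformly, and that drive capacities are decimal terabytes. The function name and parameters are hypothetical.

```python
import math


def rebuild_read_error_probability(tb_read, bits_per_error=1e14):
    """Chance of at least one unrecoverable read error while reading tb_read TB.

    Assumes independent bit errors at a rate of 1 per bits_per_error bits.
    """
    bits = tb_read * 1e12 * 8  # decimal TB to bits
    # 1 - (1 - p)**bits, evaluated with log1p/expm1 for numerical stability
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))


for tb in (10, 40):
    sata = rebuild_read_error_probability(tb, 1e14)
    enterprise = rebuild_read_error_probability(tb, 1e15)
    print(f"{tb} TB read: SATA {sata:.0%}, enterprise {enterprise:.0%}")
```

Under these assumptions, reading 10 TB during a rebuild gives roughly a 55% chance of hitting a read error at the 10^14 rate and about 8% at the 10^15 rate. At 40 TB the 10^14 figure rises to about 96%, which is why larger volumes make the second parity drive of RAID 6 so valuable.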

 

Modern RAID cards from Adaptec and LSI can be used to connect up to 24 disks inside a single chassis, and external SAS expansion ports are available on some models. This allows additional chassis to be chained to the RAID card, expanding your storage space even more. Performance can be quite good: it's not uncommon to see ~500 to ~700 MB/s local disk read performance, with write slightly less, although your results will depend on the RAID configuration and local file system used on your system. Local disk performance is somewhat misleading in this configuration, however, because the compute nodes cannot reach it: the network and the overhead of the NFS protocol become the bottlenecks.

 




 
