Storage

What are your storage options?

Almost all clusters require shared data space that is exported to all nodes in the cluster. This space can be the users' home directories, dedicated data mounts, or a combination of both. Modern HPC clusters require ever larger amounts of data space, and as much performance as the budget allows.

Serial ATA (SATA) II drives of 1 terabyte (TB) and larger, and smaller-capacity Serial Attached SCSI (SAS) drives, combined with high-performance internal RAID cards or external RAID systems, help make these goals achievable. In a small cluster with a single master node, the master node can be configured with either internal or external RAID storage.

Internal RAID Systems

In internal RAID storage configurations, the master node can have up to 24 drive slots, two of which may be used for the mirrored (RAID 1) O.S. drives. Aspen strongly recommends that every RAID 5 or RAID 6 set be configured with at least one hot spare drive, and more if you can afford it. Given a master node with 24 drive slots, subtracting 2 slots for the RAID 1 O.S. drives, using one drive as a hot spare, and using RAID 5, between 3 and 19 TB can currently be configured on a single master node. RAID 6, which survives two drive failures (RAID 5 survives only one), can be configured on most systems as well, and would yield approximately 18 TB with one hot spare disk. These numbers represent raw space; your choice of file system will affect how much is actually usable for user data. Most current 15-, 16-, and 24-slot (3.5-inch disk) chassis have limited front panel space, so there may be no room for a DVD or CD burner; a slimline DVD or CD drive is normally fitted in these chassis instead.
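
The capacity figures above come from simple arithmetic, sketched below in Python under stated assumptions: identical drives, parity costing one drive's worth of space for RAID 5 and two for RAID 6, and no allowance for controller metadata, decimal-versus-binary TB ratings, or file system overhead, so treat the results as upper bounds. The function names are illustrative only.

```python
# Minimal sketch: usable capacity of one RAID 5/6 set, before file system overhead.
# Assumes identical drives; parity consumes one (RAID 5) or two (RAID 6) drives' worth.

PARITY_DRIVES = {"raid5": 1, "raid6": 2}
MIN_MEMBERS = {"raid5": 3, "raid6": 4}


def data_set_members(bays: int, os_drives: int = 2, hot_spares: int = 1) -> int:
    """Bays left for the data RAID set after the mirrored O.S. pair and hot spares."""
    return bays - os_drives - hot_spares


def raid_usable_tb(members: int, drive_tb: float, level: str) -> float:
    """Usable TB of a RAID 5 or RAID 6 set built from `members` identical drives."""
    if level not in PARITY_DRIVES:
        raise ValueError(f"unsupported RAID level: {level}")
    if members < MIN_MEMBERS[level]:
        raise ValueError(f"{level} needs at least {MIN_MEMBERS[level]} drives")
    return (members - PARITY_DRIVES[level]) * drive_tb


if __name__ == "__main__":
    # Small RAID 5 data set: four 1 TB drives.
    print(raid_usable_tb(4, 1.0, "raid5"), "TB")  # 3.0 TB
```

Four 1 TB drives in RAID 5 give 3.0 TB, the low end of the range quoted above; for larger chassis, pass data_set_members(bays) and your own drive size.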

Hot Spare Disks

Configure as many spare disks as you can afford. A hot spare is an idle drive that the RAID controller uses automatically to rebuild a degraded set as soon as a member disk fails, without waiting for anyone to replace hardware. All disks eventually fail, and hot spares for critical data partitions are a small investment compared to the value of your data. Aspen will always configure the RAID system to notify you (and, where possible, Aspen) of any disk failure, and we will ship you a new disk to replace the failed unit (unless you have contracted for on-site maintenance). So why use hot spares? There are several reasons:

  • all disks are mechanical devices, and all mechanical devices eventually fail
  • disks of the same model tend to fail within a short time of each other
  • other bad things happen, sometimes at the same time your disks are failing


Let's say that you do not have a hot spare disk configured in your RAID system, and you leave town on Friday afternoon for a well-earned 3-day weekend. Friday night, a single drive in your RAID 5 data set fails, notifications are duly sent to you by e-mail, and the RAID set keeps running in degraded mode, with no redundancy left. This particular drive is model “XXXX” from manufacturer “YYYY”, and it has a previously unknown adverse reaction to the slightly higher temperatures it must endure in your facility.

Perhaps you haven't been as diligent about backups as you could have been, because you've been busy. We all know how that goes. Perhaps your organization does not allow the cluster to send e-mail to external addresses, so Aspen does not receive a notice of the disk failure either. Perhaps your organization's mail server is down for maintenance that weekend, so no mail can be sent externally. Or perhaps Internet routing between your organization and Aspen is down that weekend.

On Tuesday morning you return, read your e-mail, and contact Aspen for a replacement drive. At this point, your best-case scenario is that the replacement drive arrives mid-morning on Wednesday, leaving your data unprotected, one more disk failure away from loss, for over 4 days!

But perhaps something else bad happens. Let's say that the work on the mail server trips a breaker that also powers the temperature monitoring for your building's cooling system, which in turn causes a temperature spike and the failure of another model “XXXX” drive. The result is total loss of your current data set, perhaps including the output of the applications that were running over the weekend while you were on that well-deserved vacation, and a period of downtime for your cluster while you retrieve backups, receive additional drives, and rebuild the RAID system. A single hot spare disk is cheap insurance: the RAID would have rebuilt onto it Friday night, and this user's data would have been saved.
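
The weekend scenario is easy to put in numbers. The sketch below uses hypothetical dates and an assumed 8-hour rebuild time (real rebuilds vary with drive size, controller, and load) to compare how long the RAID 5 set runs without redundancy when the rebuild must wait for a shipped drive, versus when a hot spare lets it start immediately.

```python
from datetime import datetime, timedelta

# Hypothetical timeline for the scenario above (dates are illustrative only).
FAILURE = datetime(2024, 6, 7, 22, 0)        # Friday night: first drive fails
REPLACEMENT = datetime(2024, 6, 12, 10, 30)  # Wednesday mid-morning: new drive arrives

# Assumed rebuild duration; not a measured figure.
REBUILD = timedelta(hours=8)


def degraded_window(failure: datetime, rebuild_start: datetime,
                    rebuild_time: timedelta) -> timedelta:
    """Time the set runs with no redundancy: wait for a drive, then rebuild."""
    return (rebuild_start - failure) + rebuild_time


without_spare = degraded_window(FAILURE, REPLACEMENT, REBUILD)
with_spare = degraded_window(FAILURE, FAILURE, REBUILD)  # spare rebuild starts at once

print("without hot spare:", without_spare)  # 4 days, 20:30:00
print("with hot spare:   ", with_spare)     # 8:00:00
```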

If your cluster data is critical:

  • use at least one hot spare disk per RAID set unless the set is RAID 10 or RAID 1
  • use RAID 6 rather than RAID 5 if your performance requirements allow
  • allow Aspen to configure the RAID to notify the appropriate users or administrators in case of disk failure
  • if you can, allow the RAID to be configured to notify Aspen Systems when a drive fails (via email to support@aspsys.com, which automatically opens a support ticket)
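
The checklist can also be applied mechanically when planning a layout. This is a hypothetical sketch, not an Aspen tool: the RaidSet fields and level names are assumptions, and real RAID controllers configure spares and alerts through their own management software.

```python
from dataclasses import dataclass

# Levels the checklist exempts from the hot-spare rule.
MIRRORED_LEVELS = {"raid1", "raid10"}


@dataclass
class RaidSet:
    name: str          # hypothetical label, e.g. "os" or "data"
    level: str         # "raid1", "raid5", "raid6" or "raid10"
    hot_spares: int
    notify: list[str]  # addresses alerted when a disk fails


def check_raid_set(rs: RaidSet) -> list[str]:
    """Return warnings for one RAID set, following the checklist above."""
    warnings = []
    if rs.level not in MIRRORED_LEVELS and rs.hot_spares < 1:
        warnings.append(f"{rs.name}: no hot spare for {rs.level}")
    if rs.level == "raid5":
        warnings.append(f"{rs.name}: consider RAID 6 if performance allows")
    if not rs.notify:
        warnings.append(f"{rs.name}: no failure notification configured")
    return warnings


if __name__ == "__main__":
    layout = [
        RaidSet("os", "raid1", 0, ["admin@example.org"]),  # hypothetical address
        RaidSet("data", "raid5", 0, ["admin@example.org", "support@aspsys.com"]),
    ]
    for rs in layout:
        for warning in check_raid_set(rs):
            print(warning)
```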

External RAID Systems

External RAID systems can be attached to your master node or to a dedicated storage node. While often more expensive than an internal RAID solution, these systems can offer more expandability (additional expansion chassis can be added), better performance in some cases, and more streamlined management through an embedded web server, telnet and ssh access, and the ability to be managed independently of the master node. Some internal RAID cards can also drive external expansion chassis, which can eliminate the expandability advantage. External RAID systems can also be configured to attach to multiple nodes, normally via Fibre Channel, allowing for node fail-over or a parallel file system configuration.

The choice between internal and external RAID depends on your storage requirements and budget. If you intend to greatly expand your data storage or upgrade the master node in the future, an external RAID system may suit you better. If you intend to use storage fail-over or connect the same RAID system to multiple hosts, you will need an external RAID system. If your data requirements are known and within the capacity of an internal RAID solution, an internal RAID system may be more cost-effective. In all cases, follow the minimum rule of one hot spare drive per RAID set.
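
The rules of thumb above can be written as a small decision function. This is only an illustrative sketch with hypothetical parameter names; it leaves out budget, which in practice weighs heavily.

```python
def choose_raid(multi_host: bool, growth_or_upgrade_planned: bool,
                required_tb: float, internal_max_tb: float) -> str:
    """Return "internal" or "external" following the rules of thumb above."""
    if multi_host:
        # Storage fail-over or shared access from several hosts needs an external array.
        return "external"
    if growth_or_upgrade_planned or required_tb > internal_max_tb:
        # Large future growth, a master-node upgrade, or a need beyond the chassis.
        return "external"
    # Known requirements that fit inside the master node: usually more cost-effective.
    return "internal"


if __name__ == "__main__":
    print(choose_raid(multi_host=False, growth_or_upgrade_planned=False,
                      required_tb=10.0, internal_max_tb=19.0))  # internal
```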

Network File System

Many HPC clusters, especially small and medium clusters whose data access requirements can be served by a single host, use the Network File System (NFS) to mount data directories on the compute nodes. NFS is the de facto standard for data sharing in HPC clusters because it is easy to configure and universally supported. An NFS server runs on the node that serves the data space, and an NFS client runs on each compute node. This allows users' home directories, or a shared data directory, to be mounted on all compute nodes so that applications running there have access to the same data. NFS directories can be shared over the cluster's administrative network, which is often Gigabit Ethernet, or over the cluster's high-speed interconnect, which can significantly increase data access speed.
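
On Linux, exporting and mounting a directory comes down to one line in /etc/exports on the server and one line in /etc/fstab on each compute node. The sketch below renders both. The subnet, the host name master-ib (assumed to resolve to the master's address on the high-speed interconnect, so NFS traffic uses that network), and the export options are illustrative assumptions, not a recommended configuration.

```python
from typing import Optional


def exports_line(path: str, clients: str,
                 options: str = "rw,sync,no_subtree_check") -> str:
    """One /etc/exports entry on the NFS server: directory, then clients(options)."""
    return f"{path} {clients}({options})"


def fstab_line(server: str, path: str, mount_point: Optional[str] = None,
               options: str = "defaults") -> str:
    """Matching /etc/fstab entry on an NFS client (compute node)."""
    # Fields: device, mount point, type, options, dump, fsck pass.
    return f"{server}:{path} {mount_point or path} nfs {options} 0 0"


if __name__ == "__main__":
    # Hypothetical interconnect subnet and server host name.
    print(exports_line("/home", "10.10.0.0/16"))
    print(fstab_line("master-ib", "/home"))
```

After editing /etc/exports, running exportfs -ra on the server reloads the export table.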

While there are projects to let several NFS servers work in parallel on the same data, this capability is not mainstream at this time. NFS relies on a single server for any particular data space, which means that the protocol overhead, local RAID speed, bus architecture, network interfaces, memory, and Central Processing Unit (CPU) speed of that server limit the speed at which NFS data can be accessed. If the data requirements of your applications, or of the cluster as a whole, exceed what this single server can deliver, the NFS server becomes a performance bottleneck for the cluster. Parallel file systems can be used to solve this issue.
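
The single-server limit can be estimated by treating the NFS server as a chain of components and taking the slowest one. All throughput figures below are hypothetical placeholders in MB/s, except the Gigabit Ethernet line rate (1 Gb/s divided by 8 bits per byte, before protocol overhead); measure your own hardware before sizing.

```python
def nfs_ceiling(components: dict[str, float]) -> tuple[str, float]:
    """Return (name, MB/s) of the slowest component, which caps NFS throughput."""
    name = min(components, key=lambda k: components[k])
    return name, components[name]


server = {
    "raid_array": 600.0,          # hypothetical
    "pci_bus": 1000.0,            # hypothetical
    "memory": 3000.0,             # hypothetical
    "cpu_nfs_processing": 400.0,  # hypothetical
    "gigabit_ethernet": 125.0,    # 1 Gb/s / 8, before protocol overhead
}

bottleneck, ceiling = nfs_ceiling(server)
nodes = 32  # compute nodes reading at the same time (assumed)
print(f"limited by {bottleneck}: {ceiling:.0f} MB/s total, "
      f"about {ceiling / nodes:.1f} MB/s per node")
# limited by gigabit_ethernet: 125 MB/s total, about 3.9 MB/s per node
```

Moving NFS to a faster interconnect raises only the network term; once the RAID array or CPU becomes the minimum, a single server cannot go any faster, which is the situation parallel file systems address.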

Parallel File Systems

Aspen offers GFS, GPFS, Lustre, OCFS, and PVFS parallel file system options as well as other commercial products. Each parallel file system has distinct characteristics and suits specific types of data serving needs, so close consultation with your Aspen sales engineer is necessary to select the right one for your needs. Parallel file systems are complex and can require specialized knowledge to configure and maintain, so your staff may need additional training. In many cases, a given parallel file system is available both as an unsupported open source version and as a supported commercial product. Direct commercial support may be necessary to achieve optimum performance and reliability in your configuration.

GFS

GFS is the Red Hat Global File System, which supports shared disk access from multiple nodes to a single RAID. GFS is available on RHEL servers with the Red Hat Cluster Suite as a supported commercial application, or can be installed in a more limited fashion as an open source application on Red Hat or Red Hat derivatives such as CentOS. Each compute node can run GFS directly, or each GFS server can also act as an NFS server for the compute nodes. GFS is normally deployed on at most 8 servers, although larger deployments are possible and do exist.

GPFS

GPFS is the IBM General Parallel File System, a licensed commercial product from IBM. Aspen is an IBM partner and can build your cluster with IBM components, the GPFS file system, and a customized software stack as outlined here and in our Configuration Guide and SOW. GPFS servers can also act as NFS servers for the compute nodes.

Lustre

Lustre is a parallel file system originally developed by Cluster File Systems, Inc., and now owned by Sun Microsystems, with both commercially licensed and open source versions. Lustre is used in some of the largest HPC clusters in the world, and while some consider it difficult to configure, tune, and maintain, it runs in many very high performance environments. Aspen Systems also partners with Terascala, and can integrate Terascala's high-throughput, scalable Lustre storage appliances into your cluster to meet your performance needs. The Terascala appliance avoids many of the pitfalls of managing a Lustre installation yourself while providing the speed and scalability of a Lustre parallel file system.

OCFS

OCFS is the Oracle Cluster File System, an open source project from Oracle. OCFS is meant for use in Oracle database environments, not as a general-purpose file system.

PVFS

Parallel Virtual File System (PVFS) version 2 is an open source project designed to provide high performance for parallel applications, where concurrent, large I/O and access to many files are common. PVFS is built as a set of clients and servers: normally a subset of dedicated nodes provides the storage space and acts as PVFS servers, while all other nodes function as clients to access data. PVFS can be configured in multiple ways, but it is recommended not to use PVFS servers as compute nodes, as a crash of an application running on such a node could leave the entire cluster inoperable. High-availability configurations of PVFS are possible, but PVFS is designed not as long-term storage but as very fast scratch space for parallel applications.

Parallel File System Hardware Requirements

Almost all parallel file system implementations other than PVFS require an external RAID unit that can be connected to more than one host, either by SCSI (not recommended because of its limited speed) or by Fibre Channel or InfiniBand.

Other Options

Other commercial software or hybrid software and hardware products, such as Panasas, can provide extreme reliability and access speed for your cluster. Speak to your Aspen sales engineer about other solutions we offer to meet your requirements. Almost all parallel file systems require additional hardware as well as custom software configuration, and Aspen can help you design your storage to meet your needs and wishes.