Reference Center >> Storage >> Parallel File Systems
- HPC Buyers Guide
Detailed Buyers Reference- Applications
- Budget
- Processors
- Interconnects
- Memory
- Distributions
- Cluster Management
- Warranty and Support
- Cluster Racking
- Cluster Logical Layout
- Master Node
- Storage
- Backup
- Nodes
- Power
- Cooling
- Commercial Software
- Schedulers
- Account Management
- Security
- Windows Integration
- Shipping and Delivery
- On-Site and Training
- Next Steps
HPC Applications
Processors
System
Network
Storage- Power
- Cooling
- Facilities Estimator
- Remote Management
GPU Computing- Green Computing
- Visualization
- Compilers
- Operating Systems
- Virtualization
- Schedulers and Tools
Parallel File Systems
<< NFS | Parallel File Systems | Overview >>
Aspen offers GFS
,
GPFS
,
Lustre
,
Panasas
PanFS
,
OCFS
,
and
PVFS
parallel file system options to our customers.
Each parallel file system solution has distinct characteristics, and is used for specific types of data serving needs, so close consultation with your Aspen sales engineer is necessary to select the proper parallel file system that best fits your specific needs.
Parallel file systems are complex, and can require specialized knowledge to configure and maintain, so some additional organization training may be necessary. In some cases, a specific parallel file system can be obtained both in an open source, unsupported version, and as a commercial product with support. Direct commercial support for your parallel file system may be necessary to achieve optimum performance and reliability in your configuration. We will cover the parallel file system options we have outlined, and discuss advantages, disadvantages, and situations where a particular solution might be desirable.
Click the
icon to
expand a section.
GFS

GFS is the Red Hat Global File System. GFS is available on RHEL servers along with their Red Hat Cluster Suite as a supported commercial application, or can be installed as an open source application.
GFS Client software on each GFS node can be used to access a GFS share
that resides on a shared block device. The device is normally an external
RAID system or SAN, and is connected to the GFS nodes via Fibre Channel.
You may also use
GNDB
or
ISCSI
to export a device to GFS nodes via an IP network. Generally, GNBD is a lighter
protocol, and requires less CPU overhead, so most configurations of this
nature utilize GNBD.
GFS nodes utilize fencing to isolate failed nodes. Fencing is done via
Switched PDUs
or
IPMI
,
and is used by the other GFS nodes to control
the power on a suspect node. GFS uses a Logical Volume Manager called
CLVM (Clustered Logical Volume Manager) which is a modified to be
cluster aware, and a distributed lock manager to control file modification.
There are two versions of GFS now, so the original version is called GFS.
The newer version is called GFS2, and is included with RHEL as of version
5.3. GFS is also available on and included with the Fedora and CentOS
distributions.
A GFS file system requires quorum to allow the file system to remain active and accessible. Quorum is simply the need to have more than half the designated nodes active in order to continue to operate the file system. This means that the file system will only continue to operate if half of the nodes plus one are active and responding. If your system had 8 GFS nodes, 5 of them must be up and operating for the file system to be active. If you had only four GFS nodes, three must be active to maintain the file system. GFS has a special quorum configuration for 2 GFS nodes, that allow one node to be active and the file system to continue to be served.
GFS uses a cache control mechanism called glocks. Each inode in the file system has two glocks, an iopen, and an inode. There are four modes a glock can be in;
UN (unlocked)
SH (shared - a read lock)
DF (deferred - a read lock incompatible with SH)
EX (exclusive)
Each glock mode allows different caching of the data and metadata, and any operation which changes an inode or the metadata associated with it sets the glock to exclusive lock(EX) mode. This causes an interesting performance problem with some HPC codes when they're ran inside a GFS environment.
When a glock is set to EX mode, the node who set the glock owns the inode for the period that the mode is set. File creations and deletions within the same directory, and operations on a specific file causes the glock for that inode to be set in EX mode. This means creation and deletion of files within the same directory, as well as writes to the same file, should be restricted to the same node within the GFS cluster if optimal performance is desired. If an HPC code specifically creates and deletes files within the same directory from different nodes, which they often do, performance can suffer, as each node must wait for the glock set by the other node to be unset before it can access the inode to set its own glock. This can cause unacceptable delays in the model execution. Adjusting the model to write to separate directories on different nodes can solve this problem.
GFS is often deployed using Fibre Channel adaptors in each node, which are then connected to a common SAN storage device which houses the GFS data store. The requirement to equip all nodes with Fibre Channel adaptors, and fibre channel switches if the node count is higher than 4 or possibly 8 nodes, can become cost prohibitive to a GFS deployment. Utilizing GNBD to mount block devices via network interfaces can cause performance issues if the network is blocking or improperly designed.
One very useful
4 node GFS configuration
is four GFS nodes connected directly to a high performance fibre channel RAID
and
functioning as NFS servers. Some external RAID systems provide multiple
Fibre Channel ports directly on the RAID, and GFS file sytems can be NFS
exported. This configuration can serve a GFS shared file system via NFS to
many nodes, eliminating stock NFS server bottle necks by static load
balancing NFS clients across the available GFS nodes. Every node sees the
same global name space, and performance can be quite good.
For even more fault tolerance in this configuration, RedHat Cluster Suite can be used to serve virtual IP addresses which are shared between GFS nodes, providing NFS fail-over should a single GFS node fail. An NFS server failure becomes almost transparent, and this configuration provides very good scaling, as multiple NFS servers can access the global name space and serve the same data.
If larger numbers of GFS nodes are used, a fibre channel switch may be required that can connect the number of GFS nodes needed and the desired number of ports on the storage system. SAN configurations with larger port counts require larger director class switches, which are commensurately more expensive.
GPFS

GPFS is the IBM General Parallel File System, a commercial product from IBM. Aspen is an IBM partner, and can build your cluster with IBM components and a customized software stack as outlined in our Configuration Guide and SOW, and include the GPFS file system. GPFS is a licensed commercial product, and GPFS servers can also serve as NFS servers to compute nodes. The current version (mid 2009) is GPFS v3.2.
GPFS is only supported on Enterprise Linux distributions from RedHat or Novell, so your cluster must run either RHEL 4, RHEL 5, or SLES 9, 10, or 11. GPFS is more commonly deployed on Novell SuSe systems, rather than RHEL. GPFS also requires specific kernel versions within these distributions. GPFS does not allow SELinux in enforcing mode, and certain kernel configurations are not supported.
GPFS utilizes NSDs (Network Shared Disks), either directly attached via a SAN, or via an NSD server to share disk resources across the cluster. Each NSD is identified uniquely across the name space via its NSD name, and is presented inside a cluster file system to each node. NSDs that are not directly attached to the node via SAN are served from NSD servers. You may have primary and backup NSD servers to facilitate fail-over and fault tolerance. GPFS also utilizes fencing to isolate failing cluster nodes, and has a GPFS specific fencing mechanism called disk leasing. If a host has disk leases, it can submit I/O to that file system. If the lease is withdrawn (node fails), the node can no longer access that file system. In fail-over mode, the metadata log from the failing node is replayed to return the file system to a consistent state.
As with GFS, GPFS requires quorum to allow the file system to remain active. Quorum is simply the need to have more than half the designated nodes active in order to continue to operate the file system. GPFS also uses the concept of tie breaker disks, which can allow you to run with as little as one quorum node active. If tie breaker disks are enabled, you may only have 8 quorum nodes. When tie breaker disks are not used, GPFS does not limit the number of quorum nodes that can be used.
GPFS requires IP connectivity, and
it is best to configure GPFS with a high speed Interconnect, as network
traffic can be substantial. GPFS is better suited for larger cluster
deployments and bigger data shares, and can be expensive. GPFS is licensed
per core, and as two different types of systems.
GPFS servers mount file systems, but also have management functions,
can act as a quorum node and/or as an NSD server, and can provide NFS shares
to non-GPFS nodes.
GPFS clients can only mount GPFS file systems and access the data
on them. The
GPFS FAQ
and the
GPFS Wiki
are both good resources for further study on GPFS configuration and deployment.
If you are considering GPFS, Aspen recommends that we configure your HPC
solution with IBM systems and components where possible.
Lustre

Lustre is a parallel file system originally developed by Cluster File Systems, Inc., and now owned by
Sun Microsystems Oracle,
with both commercial licensing and open source versions. Lustre is used in
some of the largest HPC clusters in the world, but it can be difficult to
configure, tune, and maintain. The current version of Lustre is version 1.8,
and we recommend that any new deployments be that version or newer unless
an older version is predicated by compatibility issues.
Lustre scales to thousands or tens of thousands of client systems, can
handle Petabytes of storage, and provides hundreds of Gigabytes per second
of storage access speed. If you're looking for the large scale cluster file
system of choice, in use by several of the top 10 largest clusters in the
world at any given time, Lustre is it. Both RHEL and Novell SuSe offer
standard enterprise kernels with Lustre support.
Lustre utilizes the following concepts and components to present a unified Lustre file system;
Metadata Servers (MDS),
Metadata Targets (MDT),
Object Storage Servers (OSS),
Object Storage Targets (OST),
Lustre clients
MGS
The MDS makes metadata available to clients via MDTs. So each MDS manages names and directories in the Lustre file system, and provides network connectivity for one or more MDTs, which are local to the MDS. MDTs store metadata (filenames, directories, permissions), and there is only one MDT per Lustre file system.
The OSS provides I/O and network connectivity for one or more local OSTs. A typically configured OSS serves 2 to 8 OSTs, and each OST can be as large as 8 TB. The OST stores the actual file data on one or more OSSs. A single Lustre file system can have many OSTs, and you can stripe across many OSTs for performance using a Logical Object Volume (LOV).
Lustre clients run on nodes, and are used to mount and access the
Lustre file system. Lustre presents a single name space across the
clients, and simultaneous read/write by multiple clients to the
same file is supported. Lustre utilizes a Lustre Networking API
(LNET) to handle metadata and file I/0 across the file system. LNET
supports most of the high performance Interconnects in use today as
well as TCP/IP. Lustre can also take advantage of
remote
direct memory access
over Infiniband to improve throughput and reduce CPU utilization.
Finally, Lustre stores configuration information about all Lustre file systems inside a cluster on an MGS, which is used by Lustre targets and clients to exchange metadata information.
A modified version of the EXT3 file system is used on Lustre MDTs and OSTs to
actually store data, and there are plans to utilize
ZFS
as a possible storage target as well. Lustre is famous for scaling because of
its design. A Lustre client accessing a file requests a file name look up
on its MDS. If the file is to be read or written, the client passes that
info off to a LOV, which maps the offset and size to an object which is
served by an OST. The client then locks the file and executes reads and/or
writes in parallel directly to the OSS containing that OST. This design
allows direct client to
OSS/OST communications, and there can be many OSTs, so total bandwidth for
read/write operations can be increased almost linearly by adding more
OSSs with OSTs to the file system.
There is a lot of flexibility regarding where the MDT or OSTs are located, but normally, an OSS has two to four OSTs, and the MDS is separated for performance. Meta-data is one of the most critical and limiting factors in a Lustre file system. Some configurations place the metadata on solid state drives to increase speed and lower latency.
Lustre does have high availability features, such as active/active OSSs with SAN connectivity to shared disks, and fail-over MDS systems. One interesting reliability feature that Lustre implements, and that other parallel file systems do not, is that the Lustre client does not directly write to the file system served by the OST. Instead, the OSS does the file system modifications. This can isolate the file system from incorrectly configured or defective clients, and forms an additional layer of protection against file system corruption.
The Lustre client can be configured as both a patched and an
unpatched version. The unpatched client does not require a kernel
patch to operate. Aspen generally configures only patched clients
for reliability. One of the major learning curves and administration
issues customers may encounter when deploying and maintaining a Lustre
file system is the configuration and deployment of Lustre servers and
components.
Aspen partners with
Terascala
to provide high throughput and scalable Lustre appliances. The Terascala
appliances are purpose built to serve Lustre, and they are very easy to
configure and administer. They also offer superior scaling. The
Terascala
RTS
1000
system can provide 2 to 10 GB/s via Infiniband to your cluster and up to
192TB in a single rack.
The Terascala appliance removes many of the pitfalls of managing a Lustre
implementation yourself, while providing the superior speed and scalability
of a Lustre parallel file system implementation.
Panasas

Aspen partners with
Panasas
to deploy their
parallel NFS
solution. Panasas is one of the creators of parallel NFS, which is currently
in the process of IETF standards approval.
Parallel NFS (pNFS) is part of the NFS v4.1 standard, and allows
clients to access stroage devices directly, and in parallel. pNFS eliminates
the scalability and performance issues assocated with standard NFS servers
in deployment today. pNFS utilizes metadata, much like Lustre, and moves
the metadata server out of the direct path, just as Lustre does. pNFS, when
ratified, will bring together the benefits of a parallel file system with
the ubiquitous NFS standard. This may well be the long-term solution
of choice for HPC parallel file sytems, however, it is still being developed.
Some of the major players and developers of pNFS are Panasas, EMC, IBM,
LSI, NetApp, and Sun Oracle. Panasas
is arguably the leader in implementing pNFS as a working protocol today.
Panasas open sourced their
DirectFLOW
client software in 2007, and that code is being used as a foundation for
the pNFS client that is being developed by the community today.
pNFS works a bit differently than other parallel file systems we have
covered. pNFS supports files, blocks, and objects for storage access. Files
are served using the NFS v4.1 protocol, much like standard current NFS, while
blocks are served by Fibre Channel, iSCSI, or even
FCOE
.
The pNFS client requests and recieves a layout from the pNFS
server. A layout maps the file into storage devices and addresses,
and the pNFS client uses the layout to perform direct I/O to the
storage system. The server has the capability to recall the layout during
the process, but assuming that doesn't happen, the client commits the
changes and returns the layout to the server.
One of the powerful capabilities of pNFS is the ability for the protocol to gracefully degrade, and use regular NFS v4.1 without the parallel extensions on a standard NFS client. The client is transparent to applications, much like any current NFS client. pNFS may be able to normalize access across different storage vendors in future, using the same client for all of them, which would be greatly appreciated by cluster administrators everywhere. The standard is quite extensible, and future back-end technologies could be used to extend the standard to future storage technology. The current schedule has major Linux distributions shipping supported pNFS in 2010, and pNFS has already been demonstrated via open source clients at SC08. The Panasas storage cluster is pNFS compatible today, and Panasas is heavily involved in the pNFS standards process, so an investment in Panasas is automatically future-proofed.
Panasas has developed the
PanFS
file system, which allows a pNFS server to be layered on
top of PanFS without a protocol gateway. Panasas supports standard NFS client
access to their storage solution in addition to their parallel client, which
can greatly increase the flexibility of the solution. Panasas
offers several different levels of systems, which they call the
7, 8, and 9 Series. All the systems provide a
global name space and unified management capabilities and all the systems
are easily expanded by adding more storage shelves to an existing system, increasing performance
and storage space.
OCFS

OCFS is the Oracle Cluster File System, an open source project from Oracle. OCFS version 1 was meant for use only in an Oracle database environment, not as a general use file system. OCFS version 2 is POSIX compliant, and can be used as a cluster file system.
OCFS provides variable block sizes, flexible extent allocations, and journaling, both ordered and writeback, a distributed lock manager, and support for different kinds of I/O types.
OCFS2 is open source, and included in the mainline Linux kernel as of 2.6.19, but Aspen recommends that OCFS only be deployed as a supported version at this time. This requires that an enterprise distribution from RedHat or SuSe be selected for your cluster operating system, and specific kernel versions are needed
OCFS2 has not yet made many inroads into HPC space. Its roots were more in the Oracle database space, and its original design goals were to service Oracle databases. At this time, Aspen recommends that OCFS and OCFS2 only be deployed in a database cluster, specifically to serve Oracle databases. With the acquisition of Sun Microsystems, it will be interesting to see how Oracle handles its file system deployments in the future.
PVFS

Parallel Virtual File System (
PVFS
) Version 2 is an open source
client/server designed to provide high performance for parallel applications
with a large amount of concurrent I/O and a high number of file accesses.
PVFS is designed as a set of clients and servers, so normally a subset of
dedicated nodes inside your cluster provides the storage space and act as
PVFS servers, while all other nodes function as clients to access data.
PVFS can be configured in multiple ways, but it is recommended not
to use PVFS servers themselves as compute nodes, as any crash of a running
application on that node could cause the entire cluster to become inoperable.
There are
high availability
configurations for PVFS which can be configured, but PVFS is not
designed to serve as long term storage. PVFS functions better as
as very fast scratch space for parallel applications, with output and
non-ephemeral data kept on other types of more reliable storage media.
The use of nodes, usually with high performance Interconnects, as
dedicated I/O servers can become expensive, so some thought needs to go into
your PVFS design and the advantages it brings to your code execution.
The current release of PVFS2 is 2.8.1 (March, 2009). PVFS works
best with codes that have been written with
MPI-IO
( Message Passing Interface - Input/Output ), which is part of
the MPI standard. If your code is written with MPI-IO, PVFS2 might be
a peformance enabler. If it is not, other technologies can give the same
performance while providing much more reliability. To be clear, PVFS
is really meant to be used with codes that utilize MPI-IO. In some
conditions where this is not true and more general I/O requirements
apply, a single NFS server can out-perform a PVFS share.
PVFS clients are available for all major linux distributions, and PVFS kernel modules exist on any Linux kernel after 2.6.0. PVFS is tunable to reach the performance you wish. For instance, adding additional metadata servers can greatly increase small file access and performance. Larger files are better served by a larger number of servers. PVFS doesn't support hard links, shared mmapping, or locking. Aspen recommends configuring PVFS only on systems that run codes that can take advantage of the PVFS design and that heavily utilize MPI-IO. Speak to your sales engineer about your requirements and whether PVFS can help your performance on your cluster.
<< NFS | Parallel File Systems | Overview >>



