Interconnects

What are your Interconnect Options?

 

Many HPC applications and codes are written to take advantage of parallel computation and use the Message Passing Interface (MPI) as their primary method of communication between processes running on the same or different nodes in your cluster. Other applications are serial in nature, with an entire computational job running as a single process. These serial applications can also have specific data or processing requirements that mandate a high-speed interconnect.


The most basic HPC cluster uses a single Gigabit Ethernet network for administrative traffic, data sharing (NFS or other protocols), and MPI or application processing traffic. If your applications are bandwidth or latency sensitive, relying on a single Gigabit Ethernet network for all cluster traffic is perhaps the least desirable of your choices.


Often, an HPC cluster will be configured with two internal networks. The first is a Gigabit Ethernet network connected to an on-board Network Interface Card (NIC) on each node, identical to the single Ethernet network used by the basic HPC cluster. It is used for scheduling, node maintenance, basic logins, and perhaps data sharing. The second internal network is dedicated to computational traffic. This configuration ensures that critical computational traffic is not hampered by other traffic, which is normally much less bandwidth or latency sensitive.

 

The second, dedicated internal network can also be Gigabit Ethernet. This option has one advantage: cost. Most systems used for HPC processing have two on-board Gigabit Ethernet interfaces, so the only cost of an additional, dedicated Gigabit Ethernet network is the extra switch ports and cabling. For MPI-based codes, Aspen recommends, as a minimum, a second internal Gigabit Ethernet network dedicated to your MPI or application processing traffic.

 

 

Gigabit Ethernet


Gigabit Ethernet provides full duplex communications at 1 Gb/s (or 1,000 Mb/s) with latencies ranging from ~40us to ~300us. However, to operate efficiently, many codes require higher bandwidth or lower latency than standard Gigabit Ethernet interfaces and switches can provide. Both InfiniBand and Myrinet technologies are commonly used in these cases.
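To illustrate why latency can matter as much as bandwidth, the sketch below uses a first-order cost model: message time equals latency plus message size divided by bandwidth. The model, the function name, and the 1 KiB sample message are assumptions made for illustration; only the latency and bandwidth figures come from the Gigabit Ethernet numbers quoted above. It is not a substitute for benchmarking your own code.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_gbps):
    """Estimate one-way message time in microseconds.

    First-order model (an assumption): time = latency + size / bandwidth.
    Ignores protocol overhead, contention, and host CPU cost.
    """
    bits = message_bytes * 8
    # A 1 Gb/s link moves 1000 bits per microsecond.
    serialization_us = bits / (bandwidth_gbps * 1000.0)
    return latency_us + serialization_us


if __name__ == "__main__":
    # A hypothetical 1 KiB message over Gigabit Ethernet at the lowest
    # and highest latencies quoted in the text (~40 us and ~300 us).
    for latency in (40.0, 300.0):
        total = transfer_time_us(1024, latency, 1.0)
        print(f"latency {latency:.0f} us -> total {total:.1f} us")
```

In this model the 1 KiB payload spends only about 8 us on the wire, so for small messages the latency term dominates the total, which is why codes that exchange many small messages tend to benefit from a low latency interconnect.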

 

 

InfiniBand


InfiniBand is a switched fabric network that uses Host Channel Adapters (HCAs) installed in each node of the cluster to communicate. InfiniBand links are made up of 2.5 Gb/s lanes that are used in parallel to communicate between nodes.

 

A Single Data Rate (SDR) 4x (4 lane) InfiniBand connection provides 10 Gb/s (or 10,000 Mb/s) raw full duplex bandwidth, with 8 Gb/s usable to processes, because the 8b/10b line encoding carries 8 data bits in every 10 transmitted bits. An InfiniBand Double Data Rate (DDR) 4x connection provides 20 Gb/s (20,000 Mb/s) raw full duplex bandwidth, with 16 Gb/s usable to processes. InfiniBand latencies range from ~1us to ~5us, depending on the HCA and switch topology used, and relatively large non-blocking fabrics can be constructed. InfiniBand also supports other protocols that provide capabilities such as access to remote memory, sockets, and storage.
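The raw and usable figures above follow from simple arithmetic, sketched below: lanes times the per-lane rate times the data-rate multiplier, then scaled by the 8b/10b line encoding used on SDR and DDR InfiniBand links. The function and constant names are hypothetical and chosen only for this illustration.

```python
LANE_RAW_GBPS = 2.5                     # per-lane SDR signaling rate, from the text
RATE_MULTIPLIER = {"SDR": 1, "DDR": 2}  # DDR doubles the per-lane rate


def infiniband_bandwidth(lanes, rate):
    """Return (raw_gbps, usable_gbps) for a link with `lanes` lanes.

    Usable bandwidth accounts for 8b/10b encoding: 8 data bits are
    carried in every 10 bits on the wire.
    """
    raw = lanes * LANE_RAW_GBPS * RATE_MULTIPLIER[rate]
    usable = raw * 8 / 10
    return raw, usable


if __name__ == "__main__":
    for rate in ("SDR", "DDR"):
        raw, usable = infiniband_bandwidth(4, rate)
        print(f"4x {rate}: {raw:g} Gb/s raw, {usable:g} Gb/s usable")
```

For example, `infiniband_bandwidth(4, "DDR")` gives 20 Gb/s raw and 16 Gb/s usable, matching the 4x DDR figures above.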

 

An InfiniBand HPC network is commonly implemented using CX-4 copper cables, which are thicker than a standard Cat 5 Ethernet cable and are subject to length limitations. The maximum cable length using CX-4 cables is 15 meters for 4x SDR and 10 meters for 4x DDR. Fiber optic cable options for InfiniBand networks do exist, but they are quite expensive. Aspen recommends using the OpenFabrics Enterprise Distribution (OFED) InfiniBand stack on InfiniBand clusters unless it does not support your codes or applications.

 

The InfiniBand specification is supported by multiple vendor implementations, and current information on vendor implementation and MPI selection is necessary to determine support for and performance of your code(s) on any given implementation. For instance, VASP, a molecular dynamics package from the University of Vienna, currently seems to run best on InfiniBand with Intel compilers (9.1, specifically), using Open MPI version 1.2.6 and the Intel Math Kernel Library on OFED version 1.3.1 with FFTW version 3.1.2. Over a hundred combinations of different compilers, MPIs, and utilities were tested to arrive at this selection. As this example shows, some codes can be quite complex to build for best performance.

 

Many gateway solutions also exist that can connect your cluster's InfiniBand fabric directly to other enterprise networks if needed.

 

 

Myrinet


Myrinet is a high speed interconnect supplied by Myricom, an HPC interconnect company. Myricom originally manufactured a 2 Gb/s technology (Myri-2G), which was arguably the most widely deployed low latency clustering technology of its time. Myricom now provides Myri-10G solutions, which combine Myricom's Myrinet Express (MX) capabilities with near wire-speed 10 Gigabit Ethernet. Myri-10G NICs include processors and firmware to offload network protocol processing, lower node CPU utilization, and provide communications paths that bypass the host kernel. Myri-10G also supports fiber optic cables with a maximum cable length of 85 to 200 meters (depending on the protocol used), and can provide ~2.3us MPI latency at 9.6 Gb/s (9,600 Mb/s) raw full duplex bandwidth. Myrinet switches are used inside the Myrinet network, and on nodes configured to use MX, software encapsulation at the node is used to carry 10 Gigabit Ethernet protocols. The ability to mix and match 10 Gigabit Ethernet and Myrinet protocols on the same network is a major advantage of Myrinet technology.

 

 

Other 10 Gigabit Ethernet


10 Gigabit Ethernet networking (other than from Myricom) can also be used to interconnect your HPC cluster. However, the latency incurred by the protocols and switches does not currently lend itself well to the requirements of most MPI codes, and the cost relative to the performance gained can be high. Low latency switches are available, and we have had some success with low latency drivers such as Open-MX on commodity 1 and 10 Gigabit Ethernet. Contact Aspen for more information if you are interested in this type of solution.

 

 

Ethernet Bonding


Some clusters use Ethernet bonding, which bonds two Gigabit Ethernet interfaces together to provide more bandwidth than a single Gigabit Ethernet interface can provide. Both your switch and your Linux distribution must support this capability, and most distributions do. Bonding is an efficient way to provide additional bandwidth for the higher data transfer requirements some applications exhibit. It does not, however, help your MPI implementation, and it can interfere with it. If your applications use MPI and you do not intend to configure a high speed interconnect, Aspen recommends that you do not use channel bonding on your Gigabit Ethernet interfaces, but instead use two dedicated networks: one for storage and administration, and one dedicated to your application MPI traffic.
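One possible reason bonding adds little for MPI can be sketched in code, under a stated assumption: the text does not say which bonding mode is in use, and this example assumes a hash-based transmit policy that picks the outgoing link from the source and destination MAC addresses. The hash below is a simplified illustration, not the exact algorithm of any particular driver or switch, and the MAC addresses are hypothetical.

```python
def pick_link(src_mac, dst_mac, link_count):
    """Choose a bonded link for a frame from its MAC addresses.

    Simplified hash (an assumption for illustration): XOR the last
    byte of each MAC address, then take the result modulo the number
    of links in the bond.
    """
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last) % link_count


if __name__ == "__main__":
    node_a = "00:11:22:33:44:01"  # hypothetical MAC addresses
    node_b = "00:11:22:33:44:02"
    node_c = "00:11:22:33:44:03"
    # Every frame from node A to node B hashes to the same link, so a
    # single point-to-point stream is limited to one link's bandwidth.
    print("A -> B uses link", pick_link(node_a, node_b, 2))
    # Traffic to a different peer may be placed on another link.
    print("A -> C uses link", pick_link(node_a, node_c, 2))
```

Under this kind of policy the aggregate bandwidth of a node talking to many peers can increase, but any one pair of MPI ranks on two nodes still sees a single Gigabit Ethernet link, and latency is not improved at all.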

 

It is difficult to say with any certainty which interconnect will give your cluster the best price/performance ratio without knowing your specific code(s), situation, and requirements. Low latency interconnects can add significant per-node cost to your cluster. Your Aspen Systems Sales Engineer can work with you to determine your requirements and customize a solution that will serve you well. We also provide benchmarking across the different interconnects so that you can see the differences and implementation specifics of these interconnects with your code(s). Ask your Aspen Sales Engineer about accessing our benchmarking clusters. We highly recommend that you benchmark your code to determine your best configuration if you are at all unsure which interconnect will serve you best.

 



