High Performance Computing (HPC) technologies have come a long way in recent years. Gone are the days when “supercomputers” were “big iron” or “mainframe” machines with many processors all sharing the same memory in a shared-memory, symmetric multiprocessing (SMP) design. Now, more than ever, researchers are using HPC Clusters (HPCC) for their parallel HPC applications.
For a full HPC solution, you need not only the server compute nodes but also low-latency networking, rack enclosure infrastructure, power, potentially liquid cooling, and cluster management software. Designing a supercomputer is not always as simple as one may think. You have to ask questions such as: Would NVIDIA GPUs, or Intel Xeon Phi processors or coprocessors, be beneficial? Which high-speed, low-latency networking should I use? How many CPU cores and how much memory per node should I have? Which CPUs give the best memory and bus speeds? Do I need CPU cache for my jobs? Should I use the cloud instead? How about a high availability cluster or AMD solutions? Do I need a parallel filesystem? There are so many out there. Would LTS Lustre or even GPFS be right for me? What is ZFS? Complete system design is what Aspen Systems is known for. A cluster of high performance computers working together also requires maintenance and support, and our Engineers are experts in this field.
Both NVIDIA GPUs and Intel Xeon Phi CPUs need code that has been optimized to run well on them. Which option you choose should depend on how ready your code is to run on each of them. In addition to the KNL optimized code list, we also have one for NVIDIA GPU applications. Once you see that your application is ready to run on either of these platforms, contact your Aspen Systems Engineer about which one would be best for your requirements.
Intel has come out with Xeon Phi CPUs in the Knights Landing (KNL) series of MIC (Many Integrated Core) processors. There are benefits to using KNL CPUs, but it is best if the code you want to use has already been optimized to run on this platform. Please take a look at the list of codes already optimized for KNL to see if your favorite code has been ported and optimized to run on KNL. If you’re still not sure, one of our Engineers would be glad to help you decide what’s right for you.
At Aspen Systems, we usually recommend ZFS (Z File System) storage for a more traditional NFS server. Adding an SSD or NVMe disk can speed up your filesystem performance dramatically. Parallel filesystems are also an option with the right configuration, but they can be a bit more difficult to maintain and troubleshoot. They also do not perform at their best at smaller storage sizes: until you have a few Object Storage Servers (OSSes), you won’t start to see the performance of a parallel filesystem. About 500TB may therefore be thought of as a bare minimum when considering a parallel filesystem, and 1PB as the size where you really start to gain some performance. You can purchase professional support and releases of Lustre with LTS Lustre, or purchase a full solution such as Panasas File System or GPFS. When you are ready for more information, an Aspen Systems Engineer will gladly help you choose the right filesystem for you.
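The sizing rule of thumb above can be written down as a small helper. This is only a sketch of the thresholds quoted in the text (about 500TB and 1PB); the function name `suggest_storage` and the wording of its return values are made up for illustration, and a real design would weigh workload, budget, and support needs as well.

```python
def suggest_storage(capacity_tb):
    """Map a planned capacity (in TB) to the rule of thumb from the text.

    Hypothetical helper: the 500 TB and 1 PB (1000 TB) thresholds come from
    the paragraph above; everything else is illustrative.
    """
    if capacity_tb < 0:
        raise ValueError("capacity must be non-negative")
    if capacity_tb < 500:
        # Below the suggested bare minimum for a parallel filesystem.
        return "ZFS-backed NFS server"
    if capacity_tb < 1000:
        # At or above the bare minimum, but below the 1 PB mark.
        return "parallel filesystem possible (bare minimum size)"
    # 1 PB and up: where the text says performance gains really start.
    return "parallel filesystem (performance gains expected)"


for size_tb in (100, 500, 1200):
    print(size_tb, "TB ->", suggest_storage(size_tb))
```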
There are different types of networking available for your HPCC. Most clusters have an Ethernet network and a secondary cluster network of either Mellanox InfiniBand or Intel OmniPath Architecture (OPA). While Ethernet now has the bandwidth to match what IB and OPA offer, its latency is not quite as low, and the price of higher-bandwidth Ethernet (10Gb, 25Gb, 40Gb and 100Gb) is still rather high compared to the other two options. Both InfiniBand and OPA are designed for HPC loads, so there are libraries and tools for making good use of these two technologies. Mellanox InfiniBand and Intel OPA are both capable of delivering 56Gbps or 100Gbps of bandwidth, with InfiniBand now also able to reach 200Gbps. There are benefits and drawbacks to each. Your Aspen Systems Engineer will be able to guide you to the right solution for your needs and budget.
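To see why latency matters even when bandwidth is equal, a common simplification is to estimate message time as latency plus size divided by bandwidth. The sketch below uses that model; the two latency values are hypothetical placeholders chosen only to show the effect, not measurements of any particular Ethernet, InfiniBand, or OPA product.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_gbps):
    """Estimate one-way message time in microseconds.

    Simple model: time = latency + size / bandwidth.
    1 Gbps equals 1e3 bits per microsecond.
    """
    bits = message_bytes * 8
    return latency_us + bits / (bandwidth_gbps * 1e3)


# Hypothetical latencies (assumptions for illustration only).
HIGH_LATENCY_US = 5.0
LOW_LATENCY_US = 1.0
BANDWIDTH_GBPS = 100  # same bandwidth on both fabrics

for size in (64, 1_000_000_000):  # a tiny message and a 1 GB transfer
    slow = transfer_time_us(size, HIGH_LATENCY_US, BANDWIDTH_GBPS)
    fast = transfer_time_us(size, LOW_LATENCY_US, BANDWIDTH_GBPS)
    print(f"{size} bytes: {slow:.3f} us vs {fast:.3f} us (ratio {slow / fast:.4f})")
```

Under these assumptions the small message is several times slower on the higher-latency fabric, while the large transfer takes almost the same time on both, which is why tightly coupled parallel codes that exchange many small messages tend to care more about latency than about peak bandwidth.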
Yes! Intel has CPUs that support faster memory speeds than others. Just because a server has a 2666MHz memory DIMM doesn’t mean the memory will run at 2666MHz. There are also rules for how many memory DIMMs per channel you can use without losing memory speed, and they depend on which memory you use. Your Aspen Systems Engineer has more information about this confusing topic.
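As a rough illustration of what a lower memory speed costs, theoretical peak memory bandwidth can be estimated as transfer rate times bus width times the number of memory channels. The sketch below assumes a 64-bit (8-byte) channel, a hypothetical CPU with six memory channels, and a hypothetical case where 2666 memory is clocked down to 2400; none of these figures describe a specific product.

```python
def peak_memory_bandwidth_gbs(transfer_rate_mts, channels, bus_width_bytes=8):
    """Theoretical peak memory bandwidth per socket, in GB/s.

    transfer_rate_mts: memory speed in mega-transfers per second
                       (the "2666" in a 2666MHz DIMM).
    channels:          populated memory channels on the CPU.
    bus_width_bytes:   bytes moved per transfer per channel (8 for a 64-bit channel).
    """
    return transfer_rate_mts * 1e6 * bus_width_bytes * channels / 1e9


CHANNELS = 6  # hypothetical channel count, for illustration only

rated = peak_memory_bandwidth_gbs(2666, CHANNELS)   # DIMMs running at rated speed
actual = peak_memory_bandwidth_gbs(2400, CHANNELS)  # same DIMMs clocked down

print(f"At 2666: {rated:.1f} GB/s, at 2400: {actual:.1f} GB/s")
print(f"Bandwidth lost: {100 * (1 - actual / rated):.1f}%")
```

These are upper bounds; real applications see less, but the relative loss from running memory below its rated speed carries over.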
There are lots of cluster managers out there, from Bright Cluster Manager to Intel Orchestrator, ROCKS to Warewulf, and others written by industry experts. At Aspen Systems, we have our own cluster manager, ACME Software, which consists of open source code and code written by our HPC Engineers. ACME has both a Linux management command console and a web GUI where you can monitor and maintain your HPC cluster. Which management software you use will depend on your preferences and your budget. Your Aspen Systems Engineer will be able to guide you through your options.
Two main commercial HPC compilers are commonly used on HPC clusters: Intel Parallel Studio and the Portland Group (PGI) compilers from NVIDIA. In addition, GCC and Open64 are two open source compilers that are commonly available and free to use. Purchasing one of the two commercial compilers can speed up your code, and the Intel compilers are essential for compiling code on the Intel Xeon Phi processors. PGI is still the leader for using OpenACC directives.
Other necessities for compiling HPC code include MPI libraries, CUDA libraries, and math libraries. Several MPI libraries exist, including Intel’s IMPI, OpenMPI, and MPICH/MVAPICH. Each of these libraries can be used with different compilers. CUDA libraries are used for running code on NVIDIA GPUs, and a very popular math library is the Intel Math Kernel Library (MKL). Choosing which compiler and libraries to use can get very confusing, but Aspen Engineers have been working with these tools for a long time. Ask us and we’ll be happy to help you decide what’s best for your applications.
There are different reasons to buy into the cloud. Not wanting to manage your own cluster, not having the facilities to house the cluster, or even not wanting to deal with the purchasing process are just a few reasons. But if you are cost conscious, is it cheaper to run in the cloud than it is to own your own equipment?
The folks at the Holland Computing Center at the University of Nebraska-Lincoln ran some quick calculations. Their equations took into account the cost of a full-time employee, power, cooling, space, racks, and infrastructure. They left out networking costs because the Amazon Web Services (AWS) offering they compared did not have HPC networking available. Their results can be found here. According to those results, it would cost about $4 million to run on AWS with the same performance as one of our HPC clusters. The cluster hardware (with HPC networking) came in at much less than $4 million to run on their own hardware.
This isn’t an entirely fair comparison, as not every cost was taken into account, but as noted in their study, the cost of educating people to run on the cloud can also add up. Additionally, running on the cloud still carries some security risks, and transferring files to and from the cloud takes time. The cloud can be the better choice for smaller systems, where the cost of hiring a full-time employee for just a few servers weighs much more heavily. There are times when the cloud makes sense, but having your own machine can come at a lower price.
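The shape of this comparison can be sketched with a few lines of arithmetic. Every dollar figure, node count, and rate below is a hypothetical placeholder chosen for illustration; none of them are taken from the Holland Computing Center study or from AWS pricing. The cost categories (hardware, staff, power and cooling, space) follow the ones listed above.

```python
def on_premises_cost(hardware, annual_staff, annual_power_cooling, annual_space, years):
    """Total cost of owning a cluster: one-time hardware plus yearly running costs."""
    return hardware + years * (annual_staff + annual_power_cooling + annual_space)


def cloud_cost(nodes, hourly_rate_per_node, utilization, years):
    """Total cloud cost: node-hours actually used times the hourly rate."""
    hours_used = years * 365 * 24 * utilization
    return nodes * hourly_rate_per_node * hours_used


YEARS = 5            # hypothetical system lifetime
RATE = 1.50          # hypothetical cloud price per node-hour, in dollars
UTILIZATION = 0.8    # hypothetical fraction of time the nodes are busy

# Hypothetical large cluster: 200 nodes, one full-time administrator.
cloud_large = cloud_cost(200, RATE, UTILIZATION, YEARS)
owned_large = on_premises_cost(1_500_000, 120_000, 150_000, 30_000, YEARS)

# Hypothetical small system: 4 nodes, but still one full-time administrator.
cloud_small = cloud_cost(4, RATE, UTILIZATION, YEARS)
owned_small = on_premises_cost(30_000, 120_000, 3_000, 1_000, YEARS)

print(f"Large: cloud ${cloud_large:,.0f} vs owned ${owned_large:,.0f}")
print(f"Small: cloud ${cloud_small:,.0f} vs owned ${owned_small:,.0f}")
```

With these made-up inputs, owning wins for the large cluster while the cloud wins for the small one, because the staff cost is roughly fixed regardless of how many servers it is spread across. Plugging in your own quotes and salaries is the only way to get a meaningful answer.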