GPU Clusters for High Performance Computing
Aspen Systems has extensive experience developing and deploying GPU clusters
Graphics Processing Units (GPUs) have rapidly evolved to become high performance accelerators for data-parallel computing. Modern GPUs contain hundreds of processing units, capable of achieving up to 1 TFLOPS (trillion floating point operations per second) for single-precision (SP) arithmetic, and over 80 GFLOPS (billion floating point operations per second) for double-precision (DP) calculations. Recent HPC-optimized GPUs contain up to 4GB of on-board memory, and are capable of sustaining memory bandwidths exceeding 100GB/sec. The parallel hardware architecture and the high performance of floating point arithmetic and memory operations on GPUs make them particularly well-suited to many of the same scientific and engineering workloads that occupy HPC clusters, leading to their incorporation as HPC accelerators.
GPUs have the potential to significantly reduce space, power and cooling demands, and reduce the number of operating system images that must be managed relative to traditional CPU-only clusters of similar aggregate computational capability. NVIDIA has begun producing commercially available Tesla GPU accelerators tailored for GPU clusters. The Tesla GPUs for HPC are available either as standard add-on boards, or in high-density self-contained 1U rack mount cases containing four GPU devices with independent power and cooling, for attachment to rack-mounted HPC nodes that lack adequate internal space, power, or cooling for internal installation.
Tesla K80 Accelerators for GPU Clusters
A dual-GPU board that combines 24 GB of memory with blazing fast memory bandwidth and up to 2.91 Tflops double precision performance with NVIDIA GPU Boost, the Tesla K80 GPU is designed for the most demanding computational tasks. It’s ideal for single and double precision workloads that not only require leading compute performance but also demand high data throughput.
Peak double precision floating point performance
- 2.91 Tflops (GPU boost clocks)
- 1.87 Tflops (Base clocks)
Peak single precision floating point performance
- 8.74 Tflops (GPU boost clocks)
- 5.6 Tflops (Base clocks)
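To put the base and boost figures above in proportion, the following minimal sketch computes how much GPU Boost adds over base clocks and what the board-level peak means per GPU. The numbers come from the list above; the helper names are hypothetical, and splitting the board peak evenly across the two GPUs is an assumption made for illustration.

```python
# Peak figures for the Tesla K80 board, taken from the list above (Tflops).
K80_PEAK_TFLOPS = {
    "double": {"base": 1.87, "boost": 2.91},
    "single": {"base": 5.6, "boost": 8.74},
}
GPUS_PER_BOARD = 2  # the K80 is described as a dual-GPU board


def boost_uplift(precision):
    """Return the fractional gain of boost clocks over base clocks."""
    peak = K80_PEAK_TFLOPS[precision]
    return peak["boost"] / peak["base"] - 1.0


def per_gpu_peak(precision, clock):
    """Split the board-level peak evenly across both GPUs (an assumption)."""
    return K80_PEAK_TFLOPS[precision][clock] / GPUS_PER_BOARD


if __name__ == "__main__":
    for precision in ("double", "single"):
        print(
            f"{precision}: boost uplift {boost_uplift(precision):.1%}, "
            f"per-GPU boost peak {per_gpu_peak(precision, 'boost'):.3f} Tflops"
        )
```

By this arithmetic, boost clocks add roughly 56% to the peak rate in both precisions, and single-precision peak is about three times the double-precision peak.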
Tesla P100 is a GPU Cluster Beast
The Tesla P100 enables a new class of servers that can deliver the performance of hundreds of CPU server nodes. Based on the new NVIDIA Pascal GPU architecture with five breakthrough technologies, the Tesla P100 delivers unmatched performance and efficiency to power the most computationally demanding applications.
Specifications of the Tesla P100 GPU accelerator
- 5.3 teraflops double-precision performance, 10.6 teraflops single-precision performance and 21.2 teraflops half-precision performance with NVIDIA GPU BOOST technology
- 160GB/sec bi-directional interconnect bandwidth with NVIDIA NVLink
- 16GB of CoWoS HBM2 stacked memory
- 720GB/sec memory bandwidth with CoWoS HBM2 stacked memory
- Enhanced programmability with page migration engine and unified memory
- ECC protection for increased reliability
- Server-optimized for highest data center throughput and reliability
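The compute and bandwidth figures above can be combined into a rough estimate of when a kernel is limited by memory rather than by arithmetic. The sketch below uses a simple roofline-style bound, which is an illustrative model brought in here and not something the P100 specifications state; the function names and the example kernel are assumptions.

```python
# Tesla P100 figures taken from the specification list above.
P100 = {
    "dp_tflops": 5.3,
    "sp_tflops": 10.6,
    "hp_tflops": 21.2,
    "mem_bw_gbs": 720.0,
}


def machine_balance(peak_tflops, mem_bw_gbs):
    """FLOPs the GPU can execute per byte moved from memory at peak rates."""
    return (peak_tflops * 1e12) / (mem_bw_gbs * 1e9)


def attainable_tflops(flops_per_byte, peak_tflops, mem_bw_gbs):
    """Roofline-style upper bound: min(compute peak, bandwidth * intensity)."""
    bandwidth_bound = mem_bw_gbs * 1e9 * flops_per_byte / 1e12
    return min(peak_tflops, bandwidth_bound)


if __name__ == "__main__":
    balance = machine_balance(P100["dp_tflops"], P100["mem_bw_gbs"])
    print(f"DP machine balance: {balance:.2f} flops/byte")

    # Hypothetical kernel: y = a*x + y in double precision performs
    # 2 flops per element while moving 24 bytes (read x, read y, write y).
    intensity = 2 / 24
    bound = attainable_tflops(intensity, P100["dp_tflops"], P100["mem_bw_gbs"])
    print(f"Bound for the example kernel: {bound:.3f} Tflops")
```

Under these assumptions, a double-precision kernel needs more than about 7 flops per byte of memory traffic before the 5.3 teraflops peak, rather than the 720GB/sec memory bandwidth, becomes the limit.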
Some of the fastest computers in the world are cluster computers. A cluster is a computer system comprising two or more computers (nodes) connected with a high-speed network. Clusters can achieve higher availability, reliability, and scalability than is possible with a single computer.
GPU Cluster Architecture
There are three principal components used in GPU clusters: host nodes, GPUs and interconnects. Since the expectation is for the GPUs to carry out a substantial portion of the calculations, host memory, PCIe bus and network interconnect performance characteristics need to be matched with the GPU performance in order to maintain a well-balanced system. In particular, high-end GPUs, such as the NVIDIA Tesla, require full-bandwidth PCIe Gen 2 x16 slots that do not degrade to x8 speeds when multiple GPUs are used. Also, Mellanox InfiniBand QDR interconnects are highly desirable to match the GPU-to-host bandwidth. Host memory also needs to at least match the amount of memory on the GPUs in order to enable their full utilization, and a one-to-one ratio of CPU cores to GPUs may be desirable from the software development perspective, as it greatly simplifies the development of MPI-based applications.
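The balance guidelines above can be written down as a simple checklist for a proposed host node. The sketch below is one possible reading of those guidelines, not a sizing tool: the `NodeSpec` fields, the function name and the example values are hypothetical, and the one-to-one core guideline is interpreted here as "at least one CPU core per GPU".

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    host_mem_gb: float       # total host (CPU) memory
    cpu_cores: int           # CPU cores in the node
    num_gpus: int            # GPUs installed in the node
    gpu_mem_gb: float        # memory per GPU
    pcie_lanes_per_gpu: int  # lanes each GPU gets with all GPUs installed


def balance_issues(node):
    """Return a list of guideline violations for a host node (empty if none)."""
    issues = []
    total_gpu_mem = node.num_gpus * node.gpu_mem_gb
    if node.host_mem_gb < total_gpu_mem:
        issues.append(
            f"host memory {node.host_mem_gb} GB is below "
            f"total GPU memory {total_gpu_mem} GB"
        )
    if node.cpu_cores < node.num_gpus:
        issues.append(
            f"{node.cpu_cores} CPU cores for {node.num_gpus} GPUs "
            "(fewer than one core per GPU)"
        )
    if node.pcie_lanes_per_gpu < 16:
        issues.append(
            f"GPUs run at x{node.pcie_lanes_per_gpu} instead of x16"
        )
    return issues


if __name__ == "__main__":
    # Hypothetical node: four 16 GB GPUs whose slots drop to x8 when populated.
    node = NodeSpec(host_mem_gb=32, cpu_cores=8, num_gpus=4,
                    gpu_mem_gb=16, pcie_lanes_per_gpu=8)
    for issue in balance_issues(node):
        print("WARNING:", issue)
```

A node that passes all three checks is not guaranteed to be well balanced for a given application, but a node that fails one of them conflicts with the guidelines stated above.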
A 7 step process to build a basic GPU cluster:
The Challenges Presented by GPU Accelerators
Any accelerator technology is, by definition, an addition to the baseline processor. In modern HPC environments, the dominant baseline architecture is x86_64 servers. Virtually all HPC systems use the Linux operating system (OS) and associated tools as a foundation for HPC processing. Both the Linux OS and the underlying x86_64 processors are highly integrated and are heavily used in areas outside of HPC, particularly web servers.
Virtually all GPU accelerators are added via the PCIe bus. This arrangement provides a level of separation from the core OS/processor environment. Because of this separation, the OS cannot manage processes that run on the accelerator as if they were running on the main system. Even though the accelerator processes are leveraged by the main processors, the host OS does not track memory usage, processor load, power usage or temperatures for the accelerators. In one sense, the GPU is a separate computing domain with its own distinct memory and computing resources.
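Because the host OS does not report these values, they are typically collected through the GPU vendor's own tooling. The sketch below is one minimal way to do that on NVIDIA hardware, assuming the `nvidia-smi` utility is installed on the node; it is not mentioned in the text above and is used here only as an illustration. The function names are hypothetical, and the parser is kept separate so it can be exercised without a GPU present.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi: memory in MiB, utilization in percent,
# power in watts, temperature in degrees Celsius.
QUERY_FIELDS = ["index", "memory.used", "utilization.gpu",
                "power.draw", "temperature.gpu"]


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip lines that do not match the requested fields
        row = {}
        for name, value in zip(QUERY_FIELDS, values):
            try:
                row[name] = float(value)
            except ValueError:
                row[name] = None  # non-numeric, e.g. "[N/A]"
        stats.append(row)
    return stats


def query_gpu_stats():
    """Run nvidia-smi if it is available; return [] when it is not."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpu_stats():
        print(gpu)
```

A cluster monitoring agent could call `query_gpu_stats()` periodically on each node and forward the results alongside the CPU and memory metrics it already gathers from the OS.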