Graphics Processing Units (GPUs) have rapidly evolved to become high-performance accelerators for data-parallel computing. Modern GPUs contain hundreds of processing units, capable of achieving up to 1 TFLOPS (trillion floating point operations per second) for single-precision (SP) arithmetic, and over 80 GFLOPS (billion floating point operations per second) for double-precision (DP) calculations. Recent GPUs optimized for high-performance computing (HPC) contain up to 4 GB of on-board memory and can sustain memory bandwidths exceeding 100 GB/s. The parallel hardware architecture and the high performance of floating point arithmetic and memory operations on GPUs make them particularly well suited to many of the same scientific and engineering workloads that occupy HPC clusters, which has led to their incorporation as HPC accelerators.
GPUs have the potential to significantly reduce space, power and cooling demands, and to reduce the number of operating system images that must be managed, relative to traditional CPU-only clusters of similar aggregate computational capability. NVIDIA has begun producing commercially available Tesla GPU accelerators tailored for GPU clusters. The Tesla GPUs for HPC are available either as standard add-on boards, or in high-density, self-contained 1U rack-mount cases containing four GPU devices with independent power and cooling. The 1U units attach to rack-mounted HPC nodes that lack adequate internal space, power, or cooling for internal installation.
The Tesla K80 is a dual-GPU board that combines 24 GB of memory with high memory bandwidth and up to 2.91 TFLOPS of double-precision performance with NVIDIA GPU Boost. It is designed for the most demanding computational tasks, and is suited to single- and double-precision workloads that require both leading compute performance and high data throughput.
The Tesla P100 enables a new class of servers that can deliver the performance of hundreds of CPU server nodes. It is based on the NVIDIA Pascal GPU architecture, which NVIDIA describes as introducing five breakthrough technologies, and it is aimed at the most computationally demanding applications.
There are three principal components used in a GPU cluster: host nodes, GPUs and interconnects. Because the GPUs are expected to carry out a substantial portion of the calculations, the performance characteristics of host memory, the PCIe bus and the network interconnect need to be matched with the GPU performance to maintain a well-balanced system. In particular, high-end GPUs, such as the NVIDIA Tesla, require full-bandwidth PCIe Gen 2 x16 slots that do not degrade to x8 speeds when multiple GPUs are used. Mellanox InfiniBand FDR or EDR interconnects are also highly desirable to match the GPU-to-host bandwidth. Host memory needs to at least match the amount of memory on the GPUs to enable their full utilization. A one-to-one ratio of CPU cores to GPUs may be desirable from the software development perspective, as it greatly simplifies the development of MPI-based applications.
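The balance rules above can be expressed as a short check. The following Python sketch is illustrative only: the NodeConfig fields and the balance_issues function are hypothetical names, and the one-to-one core rule is interpreted here as "at least one CPU core per GPU", which is an assumption rather than something the text states.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeConfig:
    """Hypothetical description of one GPU cluster host node."""
    host_mem_gb: float       # total host RAM
    cpu_cores: int           # CPU cores available to applications
    gpu_count: int           # GPUs attached to this node
    gpu_mem_gb: float        # on-board memory per GPU
    pcie_lanes_per_gpu: int  # lanes each GPU gets with all GPUs installed


def balance_issues(node: NodeConfig) -> List[str]:
    """Return rule-of-thumb violations; an empty list means no rule is broken."""
    issues = []

    # Host memory should at least match the combined GPU memory.
    total_gpu_mem = node.gpu_count * node.gpu_mem_gb
    if node.host_mem_gb < total_gpu_mem:
        issues.append(
            f"host memory {node.host_mem_gb} GB is below "
            f"total GPU memory {total_gpu_mem} GB"
        )

    # Assumption: read the one-to-one ratio as at least one core per GPU.
    if node.cpu_cores < node.gpu_count:
        issues.append(
            f"{node.cpu_cores} CPU cores for {node.gpu_count} GPUs"
        )

    # Slots should not degrade below x16 when every GPU is populated.
    if node.pcie_lanes_per_gpu < 16:
        issues.append(
            f"PCIe link is x{node.pcie_lanes_per_gpu}, below the required x16"
        )

    return issues


# Made-up example node: four 4 GB GPUs sharing lanes, so each runs at x8.
example = NodeConfig(host_mem_gb=16, cpu_cores=4, gpu_count=4,
                     gpu_mem_gb=4, pcie_lanes_per_gpu=8)
for issue in balance_issues(example):
    print(issue)
```

A check like this only covers the node-local rules. Interconnect bandwidth is left out because it depends on the fabric and on the application's communication pattern.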
Any accelerator technology is, by definition, an addition to the baseline processor. In modern HPC environments, the dominant baseline architecture is the x86_64 server. Virtually all HPC systems use the Linux operating system (OS) and associated tools as a foundation for HPC processing. Both the Linux OS and the underlying x86_64 processors are highly integrated and are heavily used outside of HPC, particularly in web servers.
Until recently, GPU accelerators were added via the PCIe bus. This arrangement provided a level of separation from the core OS/processor environment, and from the other GPUs in the server. With the introduction of the P100, the connection from one GPU to another is much better, and connections to the CPUs over the same SXM2 link are coming. However, GPUs will still be separated in a way that does not allow the OS to manage processes that run on the accelerator as if they were running on the main system. Even though the accelerator processes are used by the main processors, the host OS does not track memory usage, processor load, power usage or temperatures for the accelerators. In one sense, the GPU is a separate computing domain with its own distinct memory and computing resources.
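Because the host OS does not track these figures, they are usually read through the vendor's own tools. The Python sketch below shows one possible approach, assuming NVIDIA's nvidia-smi utility is installed and on the PATH. The function names and the sample text are hypothetical and made up for illustration. Values are kept as strings because nvidia-smi may report a field as unavailable on some boards.

```python
import subprocess
from typing import Dict, List

# Query fields passed to nvidia-smi, one per metric the host OS does not track.
FIELDS = ["index", "memory.used", "utilization.gpu",
          "power.draw", "temperature.gpu"]


def parse_gpu_stats(csv_text: str) -> List[Dict[str, str]]:
    """Turn headerless CSV output into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip lines that do not match the requested fields
        rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpu_stats() -> List[Dict[str, str]]:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Made-up sample in the shape nvidia-smi prints for two GPUs
# (memory in MiB, utilization in %, power in W, temperature in C).
sample = "0, 1024, 37, 95.20, 61\n1, 0, 0, 25.10, 34\n"
for gpu in parse_gpu_stats(sample):
    print(gpu["index"], gpu["memory.used"], gpu["temperature.gpu"])
```

On a node with GPUs, calling query_gpu_stats() in place of the sample would return the live values. Parsing is kept separate from the subprocess call so that it can be exercised on machines without a GPU.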