Graphics Processing Units (GPUs) have rapidly evolved into high-performance accelerators for data-parallel computing. Modern GPUs contain hundreds of processing units and can achieve up to 1 TFLOPS (trillion floating-point operations per second) for single-precision (SP) arithmetic and over 80 GFLOPS (billion floating-point operations per second) for double-precision (DP) calculations. Recent GPUs optimized for high-performance computing (HPC) contain up to 4 GB of on-board memory and can sustain memory bandwidths exceeding 100 GB/s. Their parallel hardware architecture and fast floating-point and memory operations make GPUs particularly well suited to many of the same scientific and engineering workloads that occupy HPC clusters, which has led to their incorporation as HPC accelerators.
GPUs have the potential to significantly reduce space, power, and cooling demands, and to reduce the number of operating system images that must be managed, relative to traditional CPU-only clusters of similar aggregate computational capability. NVIDIA has begun producing commercially available Tesla GPU accelerators tailored for GPU clusters. The Tesla GPUs for HPC are available either as standard add-on boards or in high-density, self-contained 1U rack-mount cases. Each case contains four GPU devices with independent power and cooling and attaches to rack-mounted HPC nodes that lack adequate internal space, power, or cooling for internal installation.
The NVIDIA® Tesla® V100S Tensor Core is the most advanced GPU ever built for the data center to accelerate AI, high-performance computing (HPC), and graphics. It’s powered by the NVIDIA Volta architecture, comes in 16 GB and 32 GB configurations, and offers the performance of up to 100 CPUs in a single GPU. Data scientists, researchers, and engineers can now spend less time optimizing memory usage and more time designing the next AI breakthrough.
The NVIDIA T4 is a single-slot, low-profile, 6.6-inch PCI Express Gen3 Universal Deep Learning Accelerator based on the NVIDIA TU104 GPU. The T4 Tensor Core GPU has 16 GB of GDDR6 memory and a 70 W maximum power limit. It is supplied as a passively cooled board, so it relies on system airflow to keep the card within its thermal limits. NVIDIA supports both x8 and x16 PCI Express for the T4.
There are three principal components used in a GPU cluster: host nodes, GPUs and interconnects. Since the expectation is for the GPUs to carry out a substantial portion of the calculations, host memory, PCIe bus and network interconnect performance characteristics need to be matched with the GPU performance to maintain a well-balanced system. In particular, high-end GPUs, such as the NVIDIA Tesla, require full-bandwidth PCIe Gen 2 x16 slots that do not degrade to x8 speeds when multiple GPUs are used. Also, Mellanox InfiniBand interconnects are highly desirable to match the GPU-to-host bandwidth. Host memory also needs to at least match the amount of memory on the GPUs to enable their full utilization, and a one-to-one ratio of CPU cores to GPUs may be desirable from the software development perspective as it greatly simplifies the development of MPI-based applications.
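The balance rules above can be sketched as a small check. This is only an illustration under stated assumptions: the `NodeSpec` type, its field names, and the `balance_warnings` helper are hypothetical and not part of any vendor tool. The per-lane bandwidth figures are derived from the PCIe signalling rates (5 GT/s with 8b/10b encoding for Gen 2, 8 GT/s with 128b/130b encoding for Gen 3) and are theoretical peaks per direction, not measured throughput.

```python
from dataclasses import dataclass

# Theoretical usable bandwidth per lane, per direction, in GB/s.
# Gen 1 and Gen 2 use 8b/10b line encoding; Gen 3 uses 128b/130b.
PCIE_GBS_PER_LANE = {
    1: 2.5 * (8 / 10) / 8,     # 0.25 GB/s
    2: 5.0 * (8 / 10) / 8,     # 0.50 GB/s
    3: 8.0 * (128 / 130) / 8,  # about 0.985 GB/s
}


def pcie_bandwidth_gbs(gen, lanes):
    """Peak one-direction bandwidth of a PCIe link in GB/s."""
    return PCIE_GBS_PER_LANE[gen] * lanes


@dataclass
class NodeSpec:
    """Hypothetical description of one host node in a GPU cluster."""
    host_mem_gb: float
    cpu_cores: int
    gpu_count: int
    gpu_mem_gb: float      # memory per GPU
    lanes_per_gpu: int     # link width each GPU gets with all GPUs installed


def balance_warnings(node):
    """Return a list of ways the node departs from the rules of thumb."""
    warnings = []
    if node.host_mem_gb < node.gpu_count * node.gpu_mem_gb:
        warnings.append("host memory is smaller than total GPU memory")
    if node.cpu_cores < node.gpu_count:
        warnings.append("fewer CPU cores than GPUs")
    if node.lanes_per_gpu < 16:
        warnings.append("GPU links run below x16")
    return warnings


# Dropping from x16 to x8 halves the peak host-to-GPU bandwidth.
print(pcie_bandwidth_gbs(2, 16))  # 8.0
print(pcie_bandwidth_gbs(2, 8))   # 4.0

# An illustrative node: four 4 GB GPUs, but only 8 GB of host memory,
# two CPU cores, and slots that fall back to x8.
node = NodeSpec(host_mem_gb=8, cpu_cores=2, gpu_count=4,
                gpu_mem_gb=4, lanes_per_gpu=8)
for warning in balance_warnings(node):
    print("warning:", warning)
```

The check deliberately leaves out the network interconnect, since what counts as a matching bandwidth depends on the application's communication pattern.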
Any accelerator technology is, by definition, an addition to the baseline processor. In modern HPC environments, the dominant baseline architecture is the x86_64 server. Virtually all HPC systems use the Linux operating system (OS) and associated tools as a foundation for HPC processing. Both the Linux OS and the underlying x86_64 processors are highly integrated and are heavily used outside of HPC, particularly in web servers.
Until recently, GPU accelerators were attached only through the PCIe bus. This arrangement kept each GPU separate from the core OS/processor environment and from the other GPUs in the server. With the introduction of the P100, GPU-to-GPU connections improved considerably, and connections to the CPUs over the same SXM2 link are coming. Even so, GPUs remain separate enough that the OS cannot manage processes running on the accelerator as if they were running on the main system. Even though the main processors leverage the accelerator processes, the host OS does not track memory usage, processor load, power usage, or temperatures for the accelerators. In this sense, the GPU is a separate computing domain with its own distinct memory and computing resources.
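Because the host OS does not track these values itself, they are usually read through the vendor's own tooling. The following is a minimal sketch, assuming an NVIDIA driver with the `nvidia-smi` command-line tool installed. The helper names `parse_gpu_stats` and `query_gpu_stats` are hypothetical, and the sample text at the end is made up for illustration rather than captured from real hardware.

```python
import csv
import subprocess

# nvidia-smi query fields: memory in MiB, utilization in percent,
# power in watts, temperature in degrees Celsius.
QUERY_FIELDS = ["memory.used", "utilization.gpu", "power.draw", "temperature.gpu"]


def _to_float(cell):
    """Convert one CSV cell to float; unsupported readings become None."""
    try:
        return float(cell)
    except ValueError:
        return None  # e.g. "[N/A]" on boards that do not report a value


def parse_gpu_stats(csv_text):
    """Parse header-less, unit-less CSV text into one dict per GPU."""
    stats = []
    for row in csv.reader(csv_text.strip().splitlines()):
        if not row:
            continue  # skip blank lines
        values = [_to_float(cell.strip()) for cell in row]
        stats.append(dict(zip(QUERY_FIELDS, values)))
    return stats


def query_gpu_stats():
    """Run nvidia-smi on the local host and return the parsed readings."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Illustrative input only: two GPUs, the second without a power reading.
sample = "1024, 35, 61.50, 48\n0, 0, [N/A], 33\n"
for index, gpu in enumerate(parse_gpu_stats(sample)):
    print(index, gpu)
```

Separating the parser from the subprocess call keeps the parsing logic usable on hosts without a GPU, for example when processing readings collected from other nodes.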