NVIDIA GPU Clusters for High Performance Computing

Aspen Systems has extensive experience developing and deploying GPU servers and GPU clusters
NVIDIA Preferred Solutions Provider

Graphics Processing Units (GPUs) have rapidly evolved into high-performance accelerators for data-parallel computing. Modern GPUs contain hundreds of processing units, capable of achieving up to 1 TFLOPS (one trillion floating-point operations per second) for single-precision (SP) arithmetic, and over 80 GFLOPS (billion floating-point operations per second) for double-precision (DP) calculations. Recent high-performance computing (HPC) optimized GPUs contain up to 4 GB of on-board memory and can sustain memory bandwidths exceeding 100 GB/s. The parallel hardware architecture and high performance of floating-point arithmetic and memory operations on GPUs make them particularly well suited to many of the same scientific and engineering workloads that occupy HPC clusters, leading to their incorporation as HPC accelerators.
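As a rough, back-of-the-envelope illustration of where headline figures like "1 TFLOPS" come from, peak theoretical throughput is simply the product of processing-unit count, clock rate, and floating-point operations issued per unit per cycle. The figures below are hypothetical, not tied to any specific product:

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak throughput in GFLOPS:
    cores x clock (GHz) x floating-point ops per core per cycle."""
    return cores * clock_ghz * flops_per_cycle

# Illustrative (hypothetical) figures: 240 processing units at 1.3 GHz,
# each issuing one fused multiply-add (2 FLOPs) per cycle:
print(peak_gflops(240, 1.3, 2))  # 624.0 GFLOPS
```

Real devices rarely sustain this peak; memory bandwidth and kernel structure usually dominate achieved performance.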

GPUs have the potential to significantly reduce space, power, and cooling demands, and to reduce the number of operating system images that must be managed, relative to traditional CPU-only clusters of similar aggregate computational capability. NVIDIA produces commercially available Tesla GPU accelerators tailored for GPU clusters. Tesla GPUs for HPC are available either as standard add-on boards, or in high-density, self-contained 1U rack-mount enclosures containing four GPU devices with independent power and cooling, for attachment to rack-mounted HPC nodes that lack adequate internal space, power, or cooling for internal installation.

1U Tesla GPU server with four NVIDIA V100s for GPU clusters

Contact Aspen to find out which solution is best for your organization.

NVIDIA® Tesla® V100 and V100S Tensor Core GPUs

The NVIDIA® Tesla® V100S Tensor Core is the most advanced GPU ever built for the data center, accelerating AI, high performance computing (HPC), and graphics. Powered by the NVIDIA Volta architecture and available in 16 GB and 32 GB configurations, it offers the performance of up to 100 CPUs in a single GPU. Data scientists, researchers, and engineers can now spend less time optimizing memory usage and more time designing the next AI breakthrough.

NVIDIA Tesla V100 with NVLink
  • 7.8 teraFLOPS double precision
  • 15.7 teraFLOPS single precision
  • 125 teraFLOPS deep learning
  • 300 GB/s NVLink interconnect bandwidth
  • 32/16 GB HBM2 memory capacity
  • 900 GB/s memory bandwidth
  • 300 W max power consumption
NVIDIA Tesla V100 for PCIe
  • 7 teraFLOPS double precision
  • 14 teraFLOPS single precision
  • 112 teraFLOPS deep learning
  • 32 GB/s PCIe interconnect bandwidth
  • 32/16 GB HBM2 memory capacity
  • 900 GB/s memory bandwidth
  • 250 W max power consumption
NVIDIA Tesla V100S for PCIe
  • 8.2 teraFLOPS double precision
  • 16.4 teraFLOPS single precision
  • 130 teraFLOPS deep learning
  • 32 GB/s PCIe interconnect bandwidth
  • 32 GB HBM2 memory capacity
  • 1,134 GB/s memory bandwidth
  • 250 W max power consumption
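The deep-learning figure for the V100 can be sanity-checked from the Volta architecture's published characteristics: 640 Tensor Cores, each performing 64 fused multiply-adds (128 FLOPs) per clock, at roughly a 1,530 MHz boost clock. A quick check (the boost clock here is an approximation):

```python
def tensor_tflops(tensor_cores: int, clock_ghz: float, fmas_per_core: int) -> float:
    """Peak mixed-precision Tensor Core throughput in TFLOPS.
    Each fused multiply-add counts as 2 floating-point operations."""
    return tensor_cores * clock_ghz * fmas_per_core * 2 / 1000.0

# V100: 640 Tensor Cores, ~1.53 GHz boost clock, 64 FMAs per core per clock
print(round(tensor_tflops(640, 1.53, 64)))  # ~125 TFLOPS
```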

NVIDIA T4 70 W GPU Deep Learning Accelerator

NVIDIA Tesla T4 used in GPU clusters and GPU servers
Start building your HPC system with the NVIDIA T4 GPU

The NVIDIA T4 is a single-slot, low-profile, 6.6-inch PCI Express Gen3 Universal Deep Learning Accelerator based on the NVIDIA TU104 GPU. The T4 Tensor Core GPU has 16 GB of GDDR6 memory and a 70 W maximum power limit. It is supplied as a passively cooled board that relies on system airflow to keep the card within its thermal limits. NVIDIA supports x8 and x16 PCI Express connections for the T4.

Product Specifications:
  • 320 Turing Tensor Cores
  • 2,560 NVIDIA CUDA Cores
  • 8.1 teraFLOPS single-precision (FP32) performance
  • 65 teraFLOPS mixed-precision (FP16/FP32) performance
  • 130 TOPS INT8 performance
  • 260 TOPS INT4 performance
  • x16 PCIe Gen3
  • 16 GB GDDR6 memory capacity
  • 320+ GB/s memory bandwidth
  • 70 W max power consumption
  • ECC support
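The T4's "320+ GB/s" figure follows directly from its memory configuration. Assuming the T4's 256-bit bus and GDDR6 running at 10 Gbps per pin (reasonable published values), peak bandwidth is bus width times per-pin data rate:

```python
def mem_bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s:
    bus width (bits) x per-pin data rate (Gbps) / 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8

# T4: 256-bit GDDR6 interface at 10 Gbps per pin
print(mem_bandwidth_gbs(256, 10))  # 320.0 GB/s
```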

Some of the fastest computers in the world are cluster computers. A cluster is a computer system comprising two or more computers (nodes) connected by a high-speed network. A GPU cluster can achieve higher availability, reliability, and scalability than is possible with a single computer.

GPU Cluster Architecture

There are three principal components used in a GPU cluster: host nodes, GPUs and interconnects. Since the expectation is for the GPUs to carry out a substantial portion of the calculations, host memory, PCIe bus and network interconnect performance characteristics need to be matched with the GPU performance to maintain a well-balanced system. In particular, high-end GPUs, such as the NVIDIA Tesla, require full-bandwidth PCIe Gen 2 x16 slots that do not degrade to x8 speeds when multiple GPUs are used. Also, Mellanox InfiniBand interconnects are highly desirable to match the GPU-to-host bandwidth. Host memory also needs to at least match the amount of memory on the GPUs to enable their full utilization, and a one-to-one ratio of CPU cores to GPUs may be desirable from the software development perspective as it greatly simplifies the development of MPI-based applications.
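One common way to realize the one-to-one CPU-core-to-GPU mapping described above is to bind each MPI rank on a node to a distinct local GPU. The sketch below shows just the mapping logic, framework-agnostic; in a real application the node-local rank would come from the MPI runtime (e.g. Open MPI's OMPI_COMM_WORLD_LOCAL_RANK environment variable) and the device would then be selected with cudaSetDevice or an equivalent call:

```python
def select_gpu(local_rank: int, gpus_per_node: int) -> int:
    """Map an MPI rank's node-local index to a GPU device ID.
    With ranks_per_node == gpus_per_node this yields a one-to-one pairing;
    with more ranks than GPUs, devices are shared round-robin."""
    if gpus_per_node < 1:
        raise ValueError("node has no GPUs")
    return local_rank % gpus_per_node

# Four ranks on a node with four GPUs -> one GPU each
print([select_gpu(r, 4) for r in range(4)])  # [0, 1, 2, 3]
```

Keeping this mapping explicit is what makes MPI-based GPU applications simple to reason about: each process owns exactly one device, its PCIe lanes, and its share of host memory.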

The Challenges Presented by GPU Accelerators

Any accelerator technology by definition is an addition to the baseline processor. In modern HPC environments, the dominant baseline architecture is x86_64 servers. Virtually all HPC systems use the Linux operating system (OS) and associated tools as a foundation for HPC processing. Both the Linux OS and underlying x86_64 processors are highly integrated and are heavily used in other areas outside of HPC — particularly web servers.

Until recently, GPU accelerators were added via the PCIe bus. This arrangement provided a level of separation from the core OS/processor environment and from the other GPUs in the server. With the introduction of the P100, GPU-to-GPU communication improved substantially through NVLink on the SXM2 form factor, and NVLink connections between GPUs and CPUs are emerging as well. However, GPUs still retain a separation that prevents the OS from managing processes running on the accelerator as if they were running on the main system. Even though accelerator workloads are launched by the host processors, the host OS does not track memory usage, processor load, power usage, or temperatures for the accelerators. In one sense, the GPU is a separate computing domain with its own distinct memory and computing resources.