NVIDIA GPU – Accelerating scientific discovery, visualizing big data for insights, and providing smart services to consumers are everyday challenges for researchers and engineers. Meeting these challenges takes increasingly complex and precise simulations, the processing of tremendous amounts of data, and the training of sophisticated deep learning networks. These workloads also require accelerated data centers that can keep up with exponentially growing demand for computing.

NVIDIA Ampere is the world’s leading platform for accelerated data centers, deployed by some of the world’s largest supercomputing centers and enterprises. It combines GPU accelerators, accelerated computing systems, interconnect technologies, development tools, GPU applications, and compilers such as PGI to enable faster scientific discoveries and big data insights.

Ampere is incredibly fast for both training and inference. It can run as a single large GPU for maximum scale-up performance, or partition itself into as many as seven independent GPU instances to scale out and accelerate multiple smaller applications. The Ampere architecture thus enables a data center design for acceleration that is flexible, delivers high throughput, and achieves higher utilization.

Speak with One of Our System Engineers Today


For more than two decades, NVIDIA has pioneered visual computing, the art and science of computer graphics. With a singular focus on this field, NVIDIA offers specialized GPU platforms for the gaming, professional visualization, data center, GPU server, and automotive markets. NVIDIA’s work is at the center of the most consequential mega-trends in GPU cluster technology: virtual reality, artificial intelligence, and self-driving cars.

GPU servers have become an essential part of the computational research world. From bioinformatics to weather modeling, GPUs have delivered speedups of over 70x on researchers’ code. With hundreds of applications already accelerated by these cards, check to see if your favorite applications are on the GPU applications list.

NVIDIA H100 Tensor Core GPU

An Order-of-Magnitude Leap for Accelerated Computing

Tap into unprecedented performance, scalability, and security for every workload with the NVIDIA® H100 Tensor Core GPU. With the NVIDIA NVLink® Switch System, up to 256 H100 GPUs can be connected to accelerate exascale workloads. The GPU also includes a dedicated Transformer Engine to solve trillion-parameter language models. The H100’s combined technology innovations can speed up large language models (LLMs) by an incredible 30X over the previous generation to deliver industry-leading conversational AI.


Platform Features

H100 GPU Feature Highlights

Up to 4X Higher AI Training on GPT-3

Projected performance subject to change. GPT-3 175B training: A100 cluster with HDR IB network; H100 cluster with NDR IB network. Mixture of Experts (MoE) training: Transformer Switch-XXL variant with 395B parameters on a 1T-token dataset; A100 cluster with HDR IB network; H100 cluster with NDR IB network, with NVLink Switch System where indicated.

Securely Accelerate Workloads From Enterprise to Exascale

Transformational AI Training

H100 features fourth-generation Tensor Cores and a Transformer Engine with FP8 precision that provides up to 4X faster training over the prior generation for GPT-3 (175B) models. The combination of fourth-generation NVLink, which offers 900 gigabytes per second (GB/s) of GPU-to-GPU interconnect; NDR Quantum-2 InfiniBand networking, which accelerates communication by every GPU across nodes; PCIe Gen5; and NVIDIA Magnum IO™ software delivers efficient scalability from small enterprise systems to massive, unified GPU clusters.

Deploying H100 GPUs at data center scale delivers outstanding performance and brings the next generation of exascale high-performance computing (HPC) and trillion-parameter AI within the reach of all researchers.

Up to 30X Higher AI Inference Performance on the Largest Models

Megatron chatbot inference (530 billion parameters)


Projected performance subject to change. Inference on Megatron 530B parameter model chatbot for input sequence length=128, output sequence length=20 | A100 cluster: HDR IB network | H100 cluster: NDR IB network for 16 H100 configurations | 32 A100 vs 16 H100 for 1 and 1.5 sec | 16 A100 vs 8 H100 for 2 sec

Real-Time Deep Learning Inference

AI solves a wide array of business challenges, using an equally wide array of neural networks. A great AI inference accelerator has to not only deliver the highest performance but also the versatility to accelerate these networks.

H100 extends NVIDIA’s market leadership in inference with several advancements that accelerate inference by up to 30X and deliver the lowest latency. Fourth-generation Tensor Cores speed up all precisions, including FP64, TF32, FP32, FP16, INT8, and now FP8, reducing memory usage and increasing performance while still maintaining accuracy for LLMs.
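The memory-saving half of that trade-off is easy to demonstrate without a GPU: Python’s struct module supports IEEE 754 half precision, so a quick sketch shows how FP16 halves storage relative to FP32 at the cost of rounding. FP8 pushes the same trade further, with the Transformer Engine managing the accuracy side; this is only a conceptual illustration, not NVIDIA code.

```python
import struct

def round_trip(value, fmt):
    """Pack a float at the given precision, then unpack it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

pi = 3.141592653589793
fp16 = round_trip(pi, "<e")  # IEEE 754 half precision (FP16): 2 bytes
fp32 = round_trip(pi, "<f")  # single precision (FP32): 4 bytes

print(f"FP16 ({struct.calcsize('e')} bytes): {fp16}")  # 3.140625
print(f"FP32 ({struct.calcsize('f')} bytes): {fp32}")
```

Half the bytes per value means twice the model fits in the same memory and half the data moves over the interconnect, which is why lower precisions matter as much for bandwidth as for raw math rates.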

Up to 7X Higher Performance for HPC Applications

Projected performance subject to change. 3D FFT (4K^3) throughput | A100 cluster: HDR IB network | H100 cluster: NVLink Switch System, NDR IB | Genome Sequencing (Smith-Waterman) | 1 A100 | 1 H100

Exascale High-Performance Computing

The NVIDIA data center platform consistently delivers performance gains beyond Moore’s law. And H100’s new breakthrough AI capabilities further amplify the power of HPC+AI to accelerate time to discovery for scientists and researchers working on solving the world’s most important challenges.

H100 triples the floating-point operations per second (FLOPS) of double-precision Tensor Cores, delivering 60 teraflops of FP64 computing for HPC. AI-fused HPC applications can also leverage H100’s TF32 precision to achieve one petaflop of throughput for single-precision matrix-multiply operations, with zero code changes.
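To get a rough sense of scale for those numbers, the ideal runtime of a dense matrix multiply can be worked out from the classic 2n³ operation count. A back-of-envelope sketch (peak figures only; real kernels sustain a fraction of peak, so treat these as lower bounds):

```python
def matmul_seconds(n, flops_per_second):
    """A dense n x n matrix multiply costs about 2 * n**3 floating-point
    operations; dividing by a sustained rate gives the ideal runtime."""
    return (2 * n ** 3) / flops_per_second

PEAK_FP64 = 60e12  # the 60-teraflop FP64 Tensor Core figure quoted above

for n in (4_096, 16_384, 65_536):
    print(f"n = {n:>6}: {matmul_seconds(n, PEAK_FP64) * 1e3:10.2f} ms at peak")
```

Even a 65,536-square FP64 multiply is a matter of seconds at that rate, which is what puts formerly exascale-only problem sizes within reach of a single system.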

H100 also features new DPX instructions that deliver 7X higher performance than A100 and 40X speedups over CPUs on dynamic programming algorithms such as Smith-Waterman, used for DNA sequence alignment and for protein alignment in protein structure prediction.
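Smith-Waterman itself makes the DPX speedup concrete: it is a small dynamic program in which every cell of a scoring matrix is a max over its three neighbors, exactly the per-cell min/max recurrence that DPX instructions accelerate. A minimal, unoptimized pure-Python sketch of the scoring recurrence (the scoring parameters here are illustrative defaults, not NVIDIA’s):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Classic O(len(a) * len(b)) dynamic program: each cell H[i][j] is a
    max over its diagonal, upper, and left neighbors, clamped at zero.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCU"))
```

The quadratic cell count is why genome-scale alignment is expensive on CPUs, and the purely local per-cell dependencies are why hardware instructions can pipeline it so effectively.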


Accelerated Data Analytics

Data analytics often consumes the majority of time in AI application development. Since large datasets are scattered across multiple servers, scale-out solutions with commodity CPU-only servers get bogged down by a lack of scalable computing performance.

Accelerated servers with H100 deliver the compute power—along with 3 terabytes per second (TB/s) of memory bandwidth per GPU and scalability with NVLink and NVSwitch™—to tackle data analytics with high performance and scale to support massive datasets. Combined with NVIDIA Quantum-2 InfiniBand, Magnum IO software, GPU-accelerated Spark 3.0, and NVIDIA RAPIDS™, the NVIDIA data center platform is uniquely able to accelerate these huge workloads with unparalleled levels of performance and efficiency.
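Conceptually, the scale-out pattern those technologies accelerate is: partition the data, reduce each partition locally, then combine the partial results. A CPU-only Python sketch of that shape (RAPIDS and Spark distribute the same map/reduce pattern across GPUs and nodes; this toy version uses threads purely to illustrate):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Map step: each worker reduces only its own partition.
    return sum(chunk)

def scale_out_sum(data, workers=4):
    """Partition the data, reduce each partition in parallel, then
    combine the partial results -- the map/reduce shape that Spark 3.0
    and RAPIDS execute across GPU-accelerated, InfiniBand-linked nodes."""
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)

print(scale_out_sum(list(range(1_000_000))))  # 499999500000
```

The combine step is where interconnect bandwidth dominates, which is why NVLink, NVSwitch, and Quantum-2 InfiniBand figure so prominently in analytics performance.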

NVIDIA L40 GPU

Unprecedented visual computing performance for the data center.

The NVIDIA L40, powered by the Ada Lovelace architecture, delivers revolutionary neural graphics, virtualization, compute, and AI capabilities for GPU-accelerated data center workloads.


Double Precision & Compute GPUs

Name                        H100 SXM             H100 PCIe            H100 NVL             A30
Architecture                Hopper               Hopper               Hopper               Ampere
FP64                        34 TF                26 TF                68 TF                5.2 TF
FP64 Tensor Core            67 TF                51 TF                134 TF               10.3 TF
FP32                        67 TF                51 TF                134 TF               10.3 TF
Tensor Float 32 (TF32)      156 TF (312 TF*)     156 TF (312 TF*)     156 TF (312 TF*)     82 TF (165 TF*)
FP16/BFLOAT16 Tensor Core   1,979 TF             1,513 TF             3,958 TF             165 TF (330 TF*)
INT8 Tensor Core            3,958 TOPS           3,206 TOPS           7,916 TOPS           330 TOPS (661 TOPS*)
GPU Memory                  80 GB                80 GB                188 GB               24 GB
GPU Memory Bandwidth        3.35 TB/s            2 TB/s               7.8 TB/s             933 GB/s
TDP                         700 W                300-350 W            2x 350-400 W         165 W
Interconnect                NVLink: 900 GB/s     NVLink: 600 GB/s     NVLink: 600 GB/s     PCIe Gen4: 64 GB/s
                            PCIe Gen5: 128 GB/s  PCIe Gen5: 128 GB/s  PCIe Gen5: 128 GB/s

* With sparsity.

Single Precision & Visualization GPUs

Name                        L40                    A40                       A6000                A5000                A4000               A2000
Architecture                Ada Lovelace           Ampere                    Ampere               Ampere               Ampere              Ampere
FP64                        -                      -                         1,250 GF             867.8 GF             599 GF              125 GF
FP32                        90.5 TF                37.4 TF                   38.7 TF              27.8 TF              19.2 TF             8 TF
Tensor Float 32 (TF32)      90.5 TF (181 TF*)      74.8 TF (149.6 TF*)       309.7 TF             222.2 TF             153.4 TF            63.9 TF
FP16/BFLOAT16 Tensor Core   181.05 TF (362.1 TF*)  149.7 TF (299.4 TF*)      -                    -                    -                   -
INT8 Tensor Core            362 TOPS (724 TOPS*)   299.3 TOPS (598.6 TOPS*)  -                    -                    -                   -
GPU Memory                  48 GB GDDR6 w/ ECC     48 GB GDDR6 w/ ECC        48 GB GDDR6          24 GB GDDR6          16 GB GDDR6         6 GB GDDR6
GPU Memory Bandwidth        864 GB/s               696 GB/s                  768 GB/s             768 GB/s             448 GB/s            288 GB/s
TDP                         300 W                  300 W                     300 W                230 W                140 W               70 W
Interconnect                PCIe Gen4: 64 GB/s     NVLink: 112.5 GB/s        NVLink: 112.5 GB/s   NVLink: 112.5 GB/s   PCIe Gen4: 64 GB/s  PCIe Gen4: 64 GB/s
                                                   PCIe Gen4: 31.5 GB/s      PCIe Gen4: 64 GB/s   PCIe Gen4: 64 GB/s

* With sparsity.

Virtualization GPUs

Name                        A16                     A10                   T4
Architecture                Ampere                  Ampere                Turing
FP64                        271.2 GF                -                     -
FP32                        8.678 TF                31.2 TF               8.1 TF
Tensor Float 32 (TF32)      -                       62.5 TF (125 TF*)     -
FP16/BFLOAT16 Tensor Core   8.678 TF                125 TF (250 TF*)      65 TF
INT8 Tensor Core            -                       250 TOPS (500 TOPS*)  130 TOPS
GPU Memory                  4x 16 GB GDDR6 w/ ECC   24 GB GDDR6           16 GB GDDR6
GPU Memory Bandwidth        4x 232 GB/s             600 GB/s              300 GB/s
TDP                         250 W                   150 W                 70 W
Interconnect                PCIe Gen4 x16           PCIe Gen4: 64 GB/s    PCIe 3.0 x16

* With sparsity.


Software Tools for GPU Computing

TensorFlow Artificial Intelligence Library

TensorFlow, developed by Google, is an open-source symbolic math library for high-performance computation. It has quickly become an industry standard for artificial intelligence and machine learning applications and is known for its flexibility, seeing use across many scientific disciplines. It is built around the concept of a tensor, which, as you may have guessed, is where NVIDIA’s Tensor Cores get their name.
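The tensor itself is just an n-dimensional array. A minimal pure-Python sketch of how shape (and hence rank) falls out of nested lists, purely to illustrate the concept rather than TensorFlow’s actual implementation:

```python
def shape(t):
    """Shape of a nested-list 'tensor' (assumes rectangular nesting)."""
    s = []
    while isinstance(t, list):
        s.append(len(t))
        t = t[0]
    return tuple(s)

scalar = 5.0                       # rank 0: a single number
vector = [1.0, 2.0, 3.0]           # rank 1: shape (3,)
matrix = [[1.0, 2.0], [3.0, 4.0]]  # rank 2: the shape Tensor Cores multiply

print(shape(scalar), shape(vector), shape(matrix))  # () (3,) (2, 2)
```

Frameworks like TensorFlow generalize exactly this idea to arbitrary rank, then map the resulting array operations onto GPU hardware.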

GPU Accelerated Libraries

There are a handful of GPU-accelerated libraries that developers can use to speed up applications on GPUs. Many are NVIDIA CUDA libraries (such as cuBLAS and the CUDA Math Library), but there are others, such as the IMSL Fortran libraries and HiPLAR (High Performance Linear Algebra in R). These libraries can be linked in place of the standard libraries commonly used in non-GPU-accelerated computing.

CUDA Development Toolkit

NVIDIA has created an entire toolkit devoted to computing on its CUDA-enabled GPUs. The CUDA Toolkit, which includes the CUDA libraries, is the core of many GPU-accelerated programs. CUDA is one of the most widely used toolkits in the GPGPU world today.

NVIDIA Deep Learning SDK

In today’s world, deep learning is becoming essential in many segments of industry. For instance, deep learning is key to voice and image recognition, where the machine must learn from the input it receives. Writing algorithms for machines to learn from data is a difficult task, so NVIDIA’s Deep Learning SDK provides the tools necessary to design code that runs on GPUs.

OpenACC Parallel Programming Model

OpenACC is a user-driven, directive-based, performance-portable parallel programming model. It is designed for scientists and engineers interested in porting their codes to a wide variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than a low-level model requires. The OpenACC directives can be a powerful tool in porting a user’s application to run on GPU servers. OpenACC has two key strengths: ease of use and portability. Applications that use OpenACC can run not only on NVIDIA GPUs but also on other GPUs, x86 CPUs, and POWER CPUs.

NVIDIA accelerators dramatically lower data center costs by delivering exceptional performance with fewer, more powerful servers. This increased throughput means more scientific discoveries delivered to researchers every day.