NVIDIA® GPUs

NVIDIA GPU – One Platform. Unlimited Data Center Acceleration.


Accelerating scientific discovery, visualizing big data for insights, and providing smart services to consumers are everyday challenges for researchers and engineers. Solving these challenges takes increasingly complex and precise simulations, the processing of tremendous amounts of data, and the training of sophisticated deep learning networks. These workloads also require accelerated data centers to meet exponentially growing demand for computing.

NVIDIA Ampere is the world’s leading platform for accelerated data centers, deployed by some of the world’s largest supercomputing centers and enterprises. It combines GPU accelerators, accelerated computing systems, interconnect technologies, development tools, GPU-accelerated applications, and compilers such as PGI to enable faster scientific discoveries and big data insights.

Ampere is incredibly fast for training and inference, and it can run as one large GPU for maximum scale-up performance or be partitioned into up to seven independent GPU instances to accelerate multiple smaller applications for scale-out. The Ampere architecture thus yields a new data center architecture for acceleration that is flexible, delivers high throughput, and enables higher utilization.

The Exponential Growth of GPU Computing

For more than two decades, NVIDIA has pioneered visual computing, the art and science of computer graphics. With a singular focus on this field, NVIDIA offers specialized platforms for the gaming, professional visualization, data center, GPU server, and automotive markets. NVIDIA’s work is at the center of the most consequential mega-trends in GPU cluster technology: virtual reality, artificial intelligence, and self-driving cars.

GPU servers have become an essential part of the computational research world. From bioinformatics to weather modeling, GPUs have delivered speedups of over 70x on researchers’ code. With hundreds of applications already accelerated by these cards, check to see if your favorite applications are on the GPU applications list.

NVIDIA Ampere A100

8th Generation Data Center GPU for the Age of Elastic Computing

The NVIDIA Ampere A100 Tensor Core GPU adds many new features while delivering significantly faster performance for HPC, AI, and data analytics workloads. Powered by NVIDIA’s latest Ampere GPU architecture, the A100 features 3rd-generation Tensor Cores, sparsity acceleration, Multi-Instance GPU (MIG), and 3rd-generation NVLink and NVSwitch.

Read more about NVIDIA A100 Ampere GPU


Three Reasons to Upgrade to the NVIDIA Ampere A100

1. A100 Tensor Cores Accelerate HPC

The performance needs of HPC applications are growing rapidly. The A100 GPU supports Tensor operations that accelerate IEEE-compliant FP64 computations, delivering up to 2.5x the FP64 performance of the NVIDIA Tesla V100 GPU. Each SM in the A100 computes a total of 64 FP64 FMA operations per clock (128 FP64 FLOPs per clock), twice the throughput of the Tesla V100.
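
A minimal sketch of the arithmetic behind that claim: the per-SM rate comes from this page, while the SM count (108) and the ~1.41 GHz boost clock are the published A100 SXM4 figures, which are assumptions not quoted here.

```
// Back-of-the-envelope peak FP64 Tensor Core throughput for the A100.
#include <cstdio>

int main() {
    const double sms            = 108;     // streaming multiprocessors (assumed A100 SXM4 config)
    const double flops_per_clk  = 128;     // FP64 FLOPs/clock/SM (64 FMAs, from the text above)
    const double boost_clock_hz = 1.41e9;  // ~1410 MHz boost clock (assumed)

    double peak_tflops = sms * flops_per_clk * boost_clock_hz / 1e12;
    std::printf("A100 peak FP64 Tensor Core rate: ~%.1f TFLOPS\n", peak_tflops);
    // ~19.5 TFLOPS, i.e. 2.5x the V100's 7.8 TFLOPS FP64 peak.
    return 0;
}
```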

2. Multi-Instance GPU (MIG)

The new MIG feature can partition each A100 into as many as seven GPU instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores, for optimal utilization, effectively expanding access to every user and application. On NVIDIA Ampere architecture-based GPUs, administrators can see and schedule jobs on these GPU instances as if they were physical GPUs.
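
For administrators who want to script around MIG, here is a minimal sketch of checking MIG mode with NVIDIA's NVML library; it assumes an NVML version new enough to expose the MIG API (CUDA 11 or later) and omits most error handling.

```
// A minimal sketch, not a full MIG workflow: query whether MIG mode is
// enabled on GPU 0 using NVML. Compile with: g++ mig_check.cpp -lnvidia-ml
#include <nvml.h>
#include <cstdio>

int main() {
    if (nvmlInit_v2() != NVML_SUCCESS) {
        std::fprintf(stderr, "failed to initialize NVML\n");
        return 1;
    }

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex_v2(0, &dev) == NVML_SUCCESS) {
        unsigned int current = 0, pending = 0;
        // Returns NVML_ERROR_NOT_SUPPORTED on GPUs without MIG (pre-A100).
        if (nvmlDeviceGetMigMode(dev, &current, &pending) == NVML_SUCCESS) {
            std::printf("MIG mode: current=%s, pending=%s\n",
                        current == NVML_DEVICE_MIG_ENABLE ? "enabled" : "disabled",
                        pending == NVML_DEVICE_MIG_ENABLE ? "enabled" : "disabled");
        } else {
            std::printf("MIG is not supported on this GPU\n");
        }
    }

    nvmlShutdown();
    return 0;
}
```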

3. AI Training and Inference

From scaling up AI training and scientific computing, to scaling out inference applications, to enabling real-time conversational AI, NVIDIA GPUs provide the horsepower needed to accelerate numerous complex and unpredictable workloads. The NVIDIA A100 GPU delivers exceptional speedups over the V100 for AI training and inference workloads.


NVIDIA Tesla V100 and V100S

NVIDIA® Tesla® V100 is built to accelerate AI, HPC, and graphics. Powered by NVIDIA Volta™, the Tesla V100 offers the performance of 100 CPUs in a single GPU, enabling data scientists, researchers, and engineers to tackle challenges that were once impossible.

The V100S improves on the V100’s clock speeds, increasing deep learning performance to up to 130 TFLOPS and memory bandwidth to up to 1,134 GB/s.

Read more about NVIDIA Tesla V100 Volta GPU

NVIDIA Volta V100

GPU | Form Factor | Single-Precision | Double-Precision | Deep Learning | Interconnect (Bi-Directional) | Power | Memory Capacity | Memory Bandwidth
V100 | SXM2 | 15.7 TFLOPS | 7.8 TFLOPS | 125 TFLOPS | 300 GB/s | 300 W | 16 or 32 GB HBM2 | 900 GB/s
V100 | PCIe | 14 TFLOPS | 7 TFLOPS | 112 TFLOPS | 32 GB/s | 250 W | 16 or 32 GB HBM2 | 900 GB/s
V100S | PCIe | 16.4 TFLOPS | 8.2 TFLOPS | 130 TFLOPS | 32 GB/s | 250 W | 32 GB HBM2 | 1,134 GB/s

Understanding NVIDIA’s product line

Tesla? Volta? GeForce? Turing? Pascal? NVIDIA has a very diverse product line, which some find challenging to navigate. While you should contact an expert to determine your specific needs, here is a simplified rundown of NVIDIA’s product line to help you understand the basics.

Tesla

Tesla GPUs are aimed at data center compute

The Tesla brand of products is geared toward the data center. These are highly specialized compute cards and, as such, the GPUs of choice for data centers and supercomputers. They do not have video outputs, for example, and often use passive cooling. If you are using GPUs for clusters or for compute, you want a Tesla card, and Aspen Systems recommends the Tesla series for HPC.

Quadro

Quadro GPUs are aimed at data science and advanced graphic applications

Quadro cards are usually more powerful than GeForce cards, but the two are similar, and in some cases they have nearly identical specifications and core technology. However, you get something much more valuable whenever you purchase a card with the name “Quadro” on it: world-class support from NVIDIA. In mission-critical applications, the name “Quadro” can shorten downtime and get you up and running fast.


GeForce

GeForce GPUs are aimed at the consumer market and gaming

You may have heard of GeForce cards before, or seen them on the shelves of your local consumer electronics store. These are consumer-oriented cards, usually used for gaming and displays. Aspen Systems does not recommend GeForce GPUs for HPC, data science, artificial intelligence, machine learning, or other compute-intensive applications, as they are not specialized for those purposes.

Architectures

The architecture refers to the generation of technology used in the card, and each new generation usually introduces something new to the mix.

Ampere

(e.g., A100, A100 PCIe). The NVIDIA Ampere architecture increases the throughput of scale-up applications and now provides the flexibility to accelerate scale-out applications with 3rd-generation Tensor Cores, Multi-Instance GPU (MIG), and 3rd-generation NVLink connections to accelerate AI and High-Performance Computing (HPC).

Turing

(e.g., RTX 8000, RTX 6000, Titan RTX). The Turing generation of NVIDIA technology is focused on graphics. The Turing architecture introduced new RT Cores that perform real-time ray tracing directly in hardware rather than through the more cycle-hungry process of software ray tracing.

Volta

(e.g., V100, GV100). Volta introduced Tensor Cores into the mix. These cores are a huge leap forward for applications involving artificial intelligence and machine learning, as they are highly specialized for the tensor operations used in such applications and libraries, such as TensorFlow, and are designed from the ground up to handle these workloads.

Quadro RTX Series

While we recommend the V100 for the data center, some applications, especially in data science and visualization, can benefit from the technology behind the new RTX series of cards, especially the new Turing RT Cores, which build ray-tracing capability into the hardware itself. Since these cards also have CUDA and Tensor Cores, they are well suited to data science and machine learning applications that require advanced graphics capabilities, supporting up to four 8K displays.

Quadro RTX 8000

Quadro RTX 6000

Read about the new RTX Data Science Workstations featuring these innovative new cards.

Choose the right NVIDIA GPU for you

Specification | A100 SXM4 | V100 SXM2 | V100 PCIe | RTX 8000 | RTX 6000 | Titan RTX | T4
GPU Architecture | Ampere | Volta | Volta | Turing | Turing | Turing | Turing
Family | Tesla | Tesla | Tesla | Quadro | Quadro | N/A | Tesla
Form Factor | SXM4 | SXM2 | PCIe x16, dual-slot, full-height | PCIe x16, dual-slot, full-height | PCIe x16, dual-slot, full-height | PCIe x16, dual-slot, full-height | PCIe x16, single-slot, low-profile
CUDA Cores | 6,912 | 5,120 | 5,120 | 4,608 | 4,608 | 4,608 | 2,560
Tensor Cores | 432 | 640 | 640 | 576 | 576 | 576 | 320
RT Cores | N/A | N/A | N/A | 72 | 72 | 72 | N/A
Interconnect Bandwidth | 600 GB/s | 300 GB/s | 32 GB/s (no NVLink) | 100 GB/s | 100 GB/s | 100 GB/s | 32 GB/s
Double-Precision | 9.7 TFLOPS | 7.8 TFLOPS | 7 TFLOPS | N/A | N/A | 0.51 TFLOPS | N/A
Single-Precision | 19.5 TFLOPS | 15.7 TFLOPS | 14 TFLOPS | 16.3 TFLOPS | 16.3 TFLOPS | 16.3 TFLOPS | 8.1 TFLOPS
Low-Precision | FP16: 312 TFLOPS; INT8: 624 TOPS | FP16: 29.6 TFLOPS; INT8: 59.3 TOPS | FP16: 29.6 TFLOPS; INT8: 59.3 TOPS | FP16: 32.6 TFLOPS; INT8: 206.1 TOPS | FP16: 16.3 TFLOPS | FP16: 16.3 TFLOPS | INT8: 130 TOPS; INT4: 260 TOPS
Special | Deep Learning: 312 TFLOPS | Deep Learning: 118.5 TFLOPS | Deep Learning: 118.5 TFLOPS | RTX-OPS: 84T | RTX-OPS: 84T | Tensor: 130 TFLOPS | Mixed-Precision (FP16/FP32): 65 TFLOPS
Memory | 40 GB HBM2 | 16 or 32 GB HBM2, 900 GB/s, ECC | 16 or 32 GB HBM2, 900 GB/s, ECC | 48 GB GDDR6, 672 GB/s, ECC | 24 GB GDDR6, 672 GB/s, ECC | 24 GB GDDR6, 672 GB/s | 16 GB GDDR6, 300 GB/s, ECC
TDP | 400 W | 300 W | 250 W | 295 W | 295 W | 290 W | 70 W

Software Tools for GPU Computing

TensorFlow

TensorFlow, developed by Google, is an open-source symbolic math library for high-performance computation.

It has quickly become an industry standard for artificial intelligence and machine learning applications and is known for its flexibility, finding use across many scientific disciplines.

It is based on the concept of a tensor, which, as you may have guessed, is where the Volta Tensor Cores get their name.

GPU Accelerated Libraries

There are a handful of GPU-accelerated libraries that developers can use to speed up applications on GPUs. Many of them are NVIDIA CUDA libraries (such as cuBLAS and the CUDA Math Library), but there are others, such as the IMSL Fortran libraries and HiPLAR (High Performance Linear Algebra in R). These libraries can be linked in to replace the standard libraries commonly used in non-GPU-accelerated computing.
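
As a concrete illustration of the drop-in nature of these libraries, here is a minimal sketch of a SAXPY (y = alpha·x + y) offloaded to the GPU with cuBLAS; the sizes and values are arbitrary choices for the example.

```
// Offloading SAXPY to the GPU with cuBLAS.
// Compile with: nvcc saxpy.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    const float alpha = 2.0f;

    // Host data
    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    // Device copies
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    // y = alpha*x + y, computed on the GPU by the library
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);
    cublasDestroy(handle);

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("y[0] = %.1f (expect 4.0)\n", hy[0]);

    cudaFree(dx); cudaFree(dy);
    delete[] hx; delete[] hy;
    return 0;
}
```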

CUDA

NVIDIA has created an entire toolkit devoted to computing on its CUDA-enabled GPUs. The CUDA Toolkit, which includes the CUDA libraries, is the core of many GPU-accelerated programs, and CUDA is one of the most widely used toolkits in the GPGPU world today.
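
For flavor, a minimal CUDA C++ sketch of the classic vector-add kernel, built with the toolkit’s nvcc compiler; the array size and launch configuration are arbitrary choices for the example.

```
// Each GPU thread adds one pair of elements.
// Compile with: nvcc vector_add.cu
#include <cuda_runtime.h>
#include <cstdio>

__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Unified memory is visible to both CPU and GPU
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    std::printf("c[0] = %.1f (expect 3.0)\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```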

Deep Learning SDK

In today’s world, deep learning is becoming essential in many segments of industry. For instance, deep learning is key to voice and image recognition, where the machine must learn as it receives input. Writing algorithms for machines to learn from data is a difficult task, and NVIDIA’s Deep Learning SDK provides the tools necessary to help design code that runs on GPUs.

OpenACC

OpenACC is a user-driven, directive-based, performance-portable parallel programming model. It is designed for scientists and engineers interested in porting their codes to a wide variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than a low-level model requires. OpenACC directives can be a powerful tool in porting an application to run on GPU servers. OpenACC has two key strengths: ease of use and portability. Applications that use OpenACC can run not only on NVIDIA GPUs but also on other GPUs, x86 CPUs, and POWER CPUs.
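
A minimal sketch of what that looks like in practice: a single directive on an ordinary C++ loop. The data clauses shown are illustrative; OpenACC-aware compilers such as those in the NVIDIA HPC SDK (formerly PGI) honor the pragma, while other compilers simply ignore it and run the loop serially.

```
// One directive asks the compiler to offload the loop to a GPU.
// Compile with, e.g.: nvc++ -acc saxpy_acc.cpp
#include <cstdio>

int main() {
    const int n = 1 << 20;
    float *x = new float[n];
    float *y = new float[n];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // copyin: x is only read on the device; copy: y is read and written back.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = 2.0f * x[i] + y[i];

    std::printf("y[0] = %.1f (expect 4.0)\n", y[0]);
    delete[] x;
    delete[] y;
    return 0;
}
```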

NVIDIA Accelerators dramatically lower data center costs by delivering exceptional performance with fewer, more powerful servers. This increased throughput means more scientific discoveries delivered to researchers every day.

Choose from Some of Our Most Popular GPU Capable Servers


4U SuperWorkstations

Up to 4 GPUs or Coprocessors.

Shop 4U 7048GR-TR


1U SuperServer 1029GQ-TXRT

Up to 4 P100s with 10GBase-T Ethernet

Shop 1U 1029GQ-TXRT

NVIDIA Tesla T4 Tensor Core GPU: The Price Performance Leader

The next level of acceleration, the NVIDIA T4 is a single-slot, 6.6-inch, Gen3 PCIe Universal Deep Learning Accelerator based on the TU104 NVIDIA GPU. It supports both x8 and x16 PCI Express with 32 GB/s of interconnect bandwidth. The T4’s small form factor delivers all of this while remaining energy efficient, consuming only 70 watts of power. Its passive thermal design supports bi-directional airflow (right-to-left or left-to-right).

The T4 utilizes the Turing™ architecture and has 320 Turing™ Tensor Cores as well as 2,560 CUDA® cores, supporting the CUDA™, TensorRT™, and ONNX compute APIs.


Multi-precision performance specifications

  • Single-precision FP32: 8.1 TFLOPS
  • Mixed-precision FP16/FP32: 65 TFLOPS
  • INT8: 130 TOPS
  • INT4: 260 TOPS

Memory:

The T4 boasts 16 GB of GDDR6 ECC memory with a 256-bit memory bus, a memory clock of up to 5001 MHz, and peak memory bandwidth of up to 320 GB/s.
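
Those figures are self-consistent. A minimal sketch of the arithmetic, assuming the usual double-data-rate reading of the quoted memory clock (an assumption of this sketch, not stated on this page):

```
// Peak bandwidth = bus width (bytes) x memory clock x transfers per clock.
#include <cstdio>

int main() {
    const double bus_bytes       = 256.0 / 8.0;  // 256-bit bus
    const double mem_clock_hz    = 5.001e9;      // 5001 MHz memory clock
    const double xfers_per_clock = 2.0;          // double data rate (assumed)

    double gbytes_per_s = bus_bytes * mem_clock_hz * xfers_per_clock / 1e9;
    std::printf("Peak memory bandwidth: ~%.0f GB/s\n", gbytes_per_s);  // ~320 GB/s
    return 0;
}
```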

The T4 provides up to 9.3X higher performance than CPUs on training and up to 36X on inference.

Inference Performance vs. CPU*


Training Performance vs. CPU*

*Comparison made between servers with dual NVIDIA T4 GPUs and servers with dual-socket Xeon Gold 6140 CPUs.

Some of Our Most Popular NVIDIA Capable Servers


1U 1029GQ-TRT SuperServer

Holds 4 GPUs and has Dual Port 10GbE.

Shop 1U 1029GQ-TRT


2U 2028GR-TRT SuperServer

Up to 4 GPUs or Coprocessors.

Shop 2U 2028GR-TRT


4U 4028GR-TR2 SuperServer

Dual Socket Intel Xeon E5-2600 v3/v4.

Shop 4U 4028GR-TR2