GPU Hardware at the CHPC
The CHPC has several compute nodes equipped with GPUs. The GPU devices are found on the granite, notchpeak, kingspeak, lonepeak, and redwood (i.e., the Protected Environment) clusters. This document describes the hardware in more detail.
GPU Hardware Overview
The available hardware is listed from the most modern architecture CHPC provides (Hopper) to the oldest (Maxwell).
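The peak FP32 figures quoted below follow directly from the core counts: each CUDA core can retire one fused multiply-add (2 FLOPs) per cycle, so peak FP32 throughput is roughly 2 × #CUDA cores × boost clock. A minimal sketch (the ~1.98 GHz boost clock used for the H200 is an assumption for illustration, not a CHPC-published figure):

```python
def peak_fp32_tflops(cuda_cores: int, boost_clock_ghz: float) -> float:
    """Peak FP32 throughput in TFLOPS: one FMA (2 FLOPs) per CUDA core per cycle."""
    return cuda_cores * boost_clock_ghz * 2 / 1000

# H200: 16,896 CUDA cores at an assumed ~1.98 GHz boost clock
print(round(peak_fp32_tflops(16_896, 1.98), 1))  # prints 66.9, matching the listed 67 TFLOPS
```

The Tensor Core numbers are not derivable this way; they depend on the per-generation matrix-multiply throughput and (where noted) structured sparsity.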
- Hopper architecture
- NVIDIA H200
- #CUDA Cores: 16,896 (132 SMs)
- Compute Capability: 9.0
- Global Memory: 141 GB (HBM3e)
- Global Memory Bandwidth: 4.8 TB/s
- FP64 Performance: 34 TFLOPS
- FP64 Tensor Core Performance: 67 TFLOPS
- FP32 Performance: 67 TFLOPS
- TF32 Tensor Core Performance (with sparsity): 989 TFLOPS
- BFLOAT16 Tensor Core Performance (with sparsity): 1,979 TFLOPS
- FP16 Tensor Core Performance (with sparsity): 1,979 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 3,958 TFLOPS
- INT8 Tensor Core Performance (with sparsity): 3,958 TOPS
- NVIDIA H100 NVL
- #CUDA Cores: 14,592
- Compute Capability: 9.0
- Global Memory: 96 GB HBM3
- Memory Bandwidth: 3.94 TB/s
- FP64 Performance: 30 TFLOPS
- FP64 Tensor Core Performance: 60 TFLOPS
- FP32 Performance: 60 TFLOPS
- TF32 Tensor Core Performance (with sparsity): 835 TFLOPS
- BFLOAT16 Tensor Core Performance (with sparsity): 1,671 TFLOPS
- FP16 Tensor Core Performance (with sparsity): 1,671 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 3,341 TFLOPS
- INT8 Tensor Core Performance (with sparsity): 3,341 TOPS
- Ada Lovelace architecture
- NVIDIA L40 (A)/NVIDIA L40S (B)
- #CUDA Cores: 18,176
- Compute Capability: 8.9
- #Third-generation RT (Ray-Tracing) Cores: 142
- #Fourth-generation Tensor Cores: 568
- Global Memory: 48 GB GDDR6 with ECC
- Memory Bandwidth: 864 GB/s
- RT (Ray-Tracing) Core Performance: 209 TFLOPS
- FP64 Performance: ~1.41 TFLOPS
- FP32 Performance: 90.5 TFLOPS (A) / 91.6 TFLOPS (B)
- TF32 Tensor Core Performance (with sparsity): 181 TFLOPS (A) / 366 TFLOPS (B)
- BFLOAT16 Tensor Core Performance (with sparsity): 362.1 TFLOPS (A) / 733 TFLOPS (B)
- FP16 Tensor Core Performance (with sparsity): 362.1 TFLOPS (A) / 733 TFLOPS (B)
- FP8 Tensor Core Performance (with sparsity): 724 TFLOPS (A) / 1,466 TFLOPS (B)
- INT8 Tensor Core Performance (with sparsity): 724 TOPS (A) / 1,466 TOPS (B)
- INT4 Tensor Core Performance (with sparsity): 1,448 TOPS (A) / 1,466 TOPS (B)
- NVIDIA L4 Tensor Core
- #CUDA Cores: 7,424
- #Tensor Cores: 240
- #RT (Ray Tracing) Cores: 60
- Compute Capability: 8.9
- Global Memory: 24 GB GDDR6
- Memory Bandwidth: 300 GB/s
- FP64 Performance: 0.473 TFLOPS
- FP32 Performance: 30.29 TFLOPS
- TF32 Tensor Core Performance (with sparsity): 120 TFLOPS
- FP16 Tensor Core Performance (with sparsity): 242 TFLOPS
- BFLOAT16 Tensor Core Performance (with sparsity): 242 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 485 TFLOPS
- INT8 Tensor Core Performance (with sparsity): 485 TOPS
- NVIDIA RTX 6000 Ada Generation
- #CUDA Cores: 18,176
- #Fourth-generation Tensor Cores: 568
- #Third-generation RT Cores: 142
- Compute Capability: 8.9
- Global Memory: 48 GB GDDR6
- Memory Bandwidth: 960 GB/s
- FP64 Performance: ~1.42 TFLOPS
- FP32 Performance: 91.1 TFLOPS
- RT Core Performance: 210.6 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 1,457 TFLOPS
- NVIDIA RTX 5000 Ada Generation
- #CUDA Cores: 12,800
- #Fourth-generation Tensor Cores: 400
- #Third-generation Ray-Tracing Cores: 100
- Compute Capability: 8.9
- Global Memory: 32 GB GDDR6
- Memory Bandwidth: 576 GB/s
- FP64 Performance: ~1.02 TFLOPS
- FP32 Performance: 65.3 TFLOPS
- RT Core Performance: 151.0 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 1,044.4 TFLOPS
- NVIDIA RTX 4500 Ada Generation
- #CUDA Cores: 7,680
- #Fourth-generation Tensor Cores: 240
- #Third-generation RT Cores: 60
- Compute Capability: 8.9
- Global Memory: 24 GB GDDR6
- Memory Bandwidth: 432 GB/s
- FP64 Performance: 0.619 TFLOPS
- FP32 Performance: 39.63 TFLOPS
- RT Core Performance: 91.6 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 634.0 TFLOPS
- NVIDIA RTX 2000 Ada Generation
- #CUDA Cores: 2,816
- #Fourth-generation Tensor Cores: 88
- #Third-generation RT Cores: 22
- Compute Capability: 8.9
- Global Memory: 16 GB GDDR6 with ECC
- Memory Bandwidth: 224 GB/s
- FP64 Performance: 0.19 TFLOPS
- FP32 Performance: 12.0 TFLOPS
- RT Core Performance: 27.7 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 191.9 TFLOPS
- Ampere architecture
- NVIDIA A100-PCIE-40GB (A) / NVIDIA A100 80GB PCIe (B) / NVIDIA A100-SXM4-40GB (C) / NVIDIA A100-SXM4-80GB (D)
- #CUDA Cores: 6,912
- #Tensor Cores: 432
- Compute Capability: 8.0
- Global Memory: 40 GB HBM2 (A), 80 GB HBM2e (B), 40 GB HBM2 (C), 80 GB HBM2e (D)
- Memory Bandwidth: 1.555 TB/s (A), 1.935 TB/s (B), 1.555 TB/s (C), 2.039 TB/s (D)
- FP64 Performance: 9.7 TFLOPS
- FP64 Tensor Core: 19.5 TFLOPS
- FP32 Performance: 19.5 TFLOPS
- TF32 Tensor Core Performance (with sparsity): 312 TFLOPS
- BFLOAT16 Tensor Core Performance (with sparsity): 624 TFLOPS
- FP16 Tensor Core Performance (with sparsity): 624 TFLOPS
- INT8 Tensor Core Performance (with sparsity): 1,248 TOPS
- NVIDIA A800 40GB Active
- #CUDA Cores: 6,912
- #Third-generation Tensor Cores: 432
- Compute Capability: 8.0
- Global Memory: 40 GB HBM2
- Memory Bandwidth: 1.555 TB/s
- FP64 Performance: 9.7 TFLOPS
- FP32 Performance: 19.5 TFLOPS
- FP16 Tensor Core Performance: 623.8 TFLOPS
- INT8 Tensor Core Performance: 1,247 TOPS
- NVIDIA A30
- #CUDA Cores: 3,584
- #Tensor Cores: 224
- Compute Capability: 8.0
- Global Memory: 24 GB HBM2e
- Memory Bandwidth: 0.933 TB/s
- FP64 Performance: 5.2 TFLOPS
- FP32 Performance: 10.3 TFLOPS
- FP64 Tensor Core Performance: 10.3 TFLOPS
- TF32 Tensor Core Performance (with sparsity): 165 TFLOPS
- BFLOAT16 Tensor Core Performance (with sparsity): 330 TFLOPS
- FP16 Tensor Core Performance (with sparsity): 330 TFLOPS
- INT8 Tensor Core Performance (with sparsity): 661 TOPS
- INT4 Tensor Core Performance (with sparsity): 1,321 TOPS
- NVIDIA A40
- #CUDA Cores: 10,752
- #Third-generation Tensor Cores: 336
- #Second-generation RT Cores: 84
- Compute Capability: 8.6
- Global Memory: 48 GB GDDR6 with ECC
- Memory Bandwidth: 696 GB/s
- FP64 Performance: 0.585 TFLOPS
- FP32 Performance: 37.4 TFLOPS
- TF32 Tensor Core Performance (with sparsity): 149.6 TFLOPS
- FP16 Tensor Core Performance (with sparsity): 299.4 TFLOPS
- RT Core Performance: 73.1 TFLOPS
- BF16 Tensor Core Performance (with sparsity): 299.4 TFLOPS
- INT8 Tensor Core Performance (with sparsity): 598.6 TOPS
- INT4 Tensor Core Performance (with sparsity): 1,197.4 TOPS
- NVIDIA RTX A6000
- #CUDA Cores: 10,752
- #Third-generation Tensor Cores: 336
- #Second-generation RT Cores: 84
- Compute Capability: 8.6
- Global Memory: 48 GB GDDR6 with ECC
- Memory Bandwidth: 768 GB/s
- FP64 Performance: ~0.605 TFLOPS
- FP32 Performance: 38.7 TFLOPS
- RT Core Performance: 75.6 TFLOPS
- Tensor Core Performance (with sparsity): 309.7 TFLOPS
- NVIDIA RTX A5500
- #CUDA Cores: 10,240
- #Third-generation Tensor Cores: 320
- #Second-generation RT Cores: 80
- Compute Capability: 8.6
- Global Memory: 24 GB GDDR6
- Memory Bandwidth: 768 GB/s
- FP64 Performance: 0.53 TFLOPS
- FP32 Performance: 34.1 TFLOPS
- RT Core Performance: 66.6 TFLOPS
- Tensor Core Performance (with sparsity): 272.8 TFLOPS
- NVIDIA GeForce RTX 3090
- #CUDA Cores: 10,496
- #Third-generation Tensor Cores: 328
- #Second-generation RT Cores: 82
- Compute Capability: 8.6
- Global Memory: 24 GB GDDR6X
- Memory Bandwidth: 936.2 GB/s
- FP64 Performance: ~0.56 TFLOPS
- FP32 Performance: ~35.6 TFLOPS
- RT Core Performance: 56 TFLOPS
- Tensor Core Performance (with sparsity): 285 TFLOPS
- Turing architecture
- NVIDIA GeForce RTX 2080 Ti
- #CUDA Cores: 4,352
- #Second-generation Tensor Cores: 544
- #First-generation RT Cores: 68
- Compute Capability: 7.5
- Global Memory: 11 GB GDDR6
- Memory Bandwidth: 616 GB/s
- FP64 Performance: ~0.42 TFLOPS
- FP32 Performance: ~13.4 TFLOPS
- Tensor Core Performance (with sparsity): 114 TFLOPS
- Tesla T4
- #CUDA Cores: 2,560
- #Tensor Cores: 320
- #First-generation RT Cores: 40
- Compute Capability: 7.5
- Global Memory: 16 GB GDDR6
- Memory Bandwidth: 300 GB/s
- FP64 Performance: ~0.25 TFLOPS
- FP32 Performance: 8.1 TFLOPS
- FP16 Performance: 65 TFLOPS
- INT8 Performance: 130 TOPS
- INT4 Performance: 260 TOPS
- Volta architecture
- Tesla V100-PCIE-16GB
- #CUDA Cores: 5,120
- #Tensor Cores: 640
- Compute Capability: 7.0
- Global Memory: 16 GB HBM2
- Memory Bandwidth: 897.0 GB/s
- FP64 Performance: 7.066 TFLOPS
- FP32 Performance: 14.13 TFLOPS
- FP16 Performance: 28.26 TFLOPS
- NVIDIA TITAN V
- #CUDA Cores: 5,120
- #First-generation Tensor Cores: 640
- Compute Capability: 7.0
- Global Memory: 12 GB HBM2
- Memory Bandwidth: 652.8 GB/s
- FP64 Performance: 7.450 TFLOPS
- FP32 Performance: 14.9 TFLOPS
- FP16 Performance: 29.80 TFLOPS
- Pascal architecture
- Tesla P40
- #CUDA Cores: 3,840
- Compute Capability: 6.1
- Global Memory: 24 GB GDDR5
- Memory Bandwidth: 346 GB/s
- FP64 Performance: ~0.375 TFLOPS
- FP32 Performance: 12 TFLOPS
- NVIDIA GeForce GTX 1080 Ti
- #CUDA Cores: 3,584
- Compute Capability: 6.1
- Global Memory: 11 GB GDDR5X
- Memory Bandwidth: 484 GB/s
- FP64 Performance: 0.35 TFLOPS
- FP32 Performance: 11.30 TFLOPS
- Tesla P100-PCIE-16GB
- #CUDA Cores: 3,584
- Compute Capability: 6.0
- Global Memory: 16 GB HBM2
- Memory Bandwidth: 732 GB/s
- FP64 Performance: 4.67 TFLOPS
- FP32 Performance: 9.34 TFLOPS
- Maxwell architecture
- NVIDIA GeForce GTX Titan X
- #CUDA Cores: 3,072
- Compute Capability: 5.2
- Global Memory: 12 GB GDDR5
- Memory Bandwidth: 336.5 GB/s
- FP64 Performance: ~0.22 TFLOPS
- FP32 Performance: ~7.0 TFLOPS
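The compute capability values above determine which CUDA features and toolkit versions a device supports, and they map one-to-one onto the architecture generations in this list. A small sketch of that mapping, using only the capabilities that appear above:

```python
# Compute capability (major, minor) -> architecture, per the hardware listed above.
ARCH_BY_CC = {
    (9, 0): "Hopper",
    (8, 9): "Ada Lovelace",
    (8, 6): "Ampere",
    (8, 0): "Ampere",
    (7, 5): "Turing",
    (7, 0): "Volta",
    (6, 1): "Pascal",
    (6, 0): "Pascal",
    (5, 2): "Maxwell",
}

def architecture(major: int, minor: int) -> str:
    """Return the architecture name for a compute capability, or 'unknown'."""
    return ARCH_BY_CC.get((major, minor), "unknown")
```

On a compute node you can query this directly; with a reasonably recent NVIDIA driver, `nvidia-smi --query-gpu=name,compute_cap --format=csv` reports the device name and compute capability of each allocated GPU.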