GPU Hardware at the CHPC
The CHPC has several compute nodes equipped with GPUs. The GPU devices are found on the granite, notchpeak, kingspeak, lonepeak, and redwood (i.e., the Protected Environment) clusters. This document describes the hardware in more detail.
GPU Hardware Overview
The available hardware is listed from the most modern architecture CHPC provides (Hopper) to the oldest (Maxwell).
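The peak FP32 figures quoted below follow directly from the core counts: each CUDA core can retire one fused multiply-add (2 FLOPs) per cycle, so peak FP32 throughput is roughly 2 × #CUDA cores × boost clock. A minimal sketch (the ~1.98 GHz boost clock used for the H200 is an assumption for illustration, not a CHPC-published figure):

```python
def peak_fp32_tflops(cuda_cores: int, boost_clock_ghz: float) -> float:
    """Peak FP32 throughput in TFLOPS: one FMA (2 FLOPs) per CUDA core per cycle."""
    return cuda_cores * boost_clock_ghz * 2 / 1000

# H200: 16,896 CUDA cores at an assumed ~1.98 GHz boost clock
print(round(peak_fp32_tflops(16_896, 1.98), 1))  # prints 66.9, matching the listed 67 TFLOPS
```

The Tensor Core numbers are not derivable this way; they depend on the per-generation matrix-multiply throughput and (where noted) structured sparsity.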
- Hopper architecture
- NVIDIA H200
- #CUDA Cores: 16,896 (132 SMs)
- Compute Capability: 9.0
- Global Memory: 141 GB (HBM3e)
- Global Memory Bandwidth: 4.8 TB/s
- FP64 Performance: 34 TFLOPS
- FP64 Tensor Core Performance: 67 TFLOPS
- FP32 Performance: 67 TFLOPS
- TF32 Tensor Core Performance (with sparsity): 989 TFLOPS
- BFLOAT16 Tensor Core Performance (with sparsity): 1,979 TFLOPS
- FP16 Tensor Core Performance (with sparsity): 1,979 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 3,958 TFLOPS
- INT8 Tensor Core Performance (with sparsity): 3,958 TOPS
- NVIDIA H100 NVL
- #CUDA Cores: 14,592
- Compute Capability: 9.0
- Global Memory: 96 GB HBM3
- Memory Bandwidth: 3.94 TB/s
- FP64 Performance: 30 TFLOPS
- FP64 Tensor Core Performance: 60 TFLOPS
- FP32 Performance: 60 TFLOPS
- TF32 Tensor Core Performance (with sparsity): 835 TFLOPS
- BFLOAT16 Tensor Core Performance (with sparsity): 1,671 TFLOPS
- FP16 Tensor Core Performance (with sparsity): 1,671 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 3,341 TFLOPS
- INT8 Tensor Core Performance (with sparsity): 3,341 TOPS
- Ada Lovelace architecture
- NVIDIA L40 (A)/NVIDIA L40S (B)
- #CUDA Cores: 18,176
- Compute Capability: 8.9
- #Third-generation RT (Ray-Tracing) Cores: 142
- #Fourth-generation Tensor Cores: 568
- Global Memory: 48 GB GDDR6 with ECC
- Memory Bandwidth: 864 GB/s
- RT (Ray-Tracing) Core Performance: 209 TFLOPS
- FP64 Performance: ~1.41 TFLOPS
- FP32 Performance: 90.5 TFLOPS (A) / 91.6 TFLOPS (B)
- TF32 Tensor Core Performance (with sparsity): 181 TFLOPS (A) / 366 TFLOPS (B)
- BFLOAT16 Tensor Core Performance (with sparsity): 362.1 TFLOPS (A) / 733 TFLOPS (B)
- FP16 Tensor Core Performance (with sparsity): 362.1 TFLOPS (A) / 733 TFLOPS (B)
- FP8 Tensor Core Performance (with sparsity): 724 TFLOPS (A) / 1,466 TFLOPS (B)
- INT8 Tensor Core Performance (with sparsity): 724 TOPS (A) / 1,466 TOPS (B)
- INT4 Tensor Core Performance (with sparsity): 1,448 TOPS (A) / 1,466 TOPS (B)
- NVIDIA L4 Tensor Core
- #CUDA Cores: 7,424
- #Tensor Cores: 240
- #RT (Ray Tracing) Cores: 60
- Compute Capability: 8.9
- Global Memory: 24 GB GDDR6
- Memory Bandwidth: 300 GB/s
- FP64 Performance: 0.473 TFLOPS
- FP32 Performance: 30.29 TFLOPS
- TF32 Tensor Core Performance (with sparsity): 120 TFLOPS
- FP16 Tensor Core Performance (with sparsity): 242 TFLOPS
- BFLOAT16 Tensor Core Performance (with sparsity): 242 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 485 TFLOPS
- INT8 Tensor Core Performance (with sparsity): 485 TOPS
- NVIDIA RTX 6000 Ada Generation
- #CUDA Cores: 18,176
- #Fourth-generation Tensor Cores: 568
- #Third-generation RT Cores: 142
- Compute Capability: 8.9
- Global Memory: 48 GB GDDR6
- Memory Bandwidth: 960 GB/s
- FP64 Performance: ~1.42 TFLOPS
- FP32 Performance: 91.1 TFLOPS
- RT Core Performance: 210.6 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 1,457 TFLOPS
- NVIDIA RTX 5000 Ada Generation
- #CUDA Cores: 12,800
- #Fourth-generation Tensor Cores: 400
- #Third-generation Ray-Tracing Cores: 100
- Compute Capability: 8.9
- Global Memory: 32 GB GDDR6
- Memory Bandwidth: 576 GB/s
- FP64 Performance: ~1.02 TFLOPS
- FP32 Performance: 65.3 TFLOPS
- RT Core Performance: 151.0 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 1,044.4 TFLOPS
- NVIDIA RTX 4500 Ada Generation
- #CUDA Cores: 7,680
- #Fourth-generation Tensor Cores: 240
- #Third-generation RT Cores: 60
- Compute Capability: 8.9
- Global Memory: 24 GB GDDR6
- Memory Bandwidth: 432 GB/s
- FP64 Performance: 0.619 TFLOPS
- FP32 Performance: 39.63 TFLOPS
- RT Core Performance: 91.6 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 634.0 TFLOPS
- NVIDIA RTX 2000 Ada Generation
- #CUDA Cores: 2,816
- #Fourth-generation Tensor Cores: 88
- #Third-generation RT Cores: 22
- Compute Capability: 8.9
- Global Memory: 16 GB GDDR6 with ECC
- Memory Bandwidth: 224 GB/s
- FP64 Performance: 0.19 TFLOPS
- FP32 Performance: 12.0 TFLOPS
- RT Core Performance: 27.7 TFLOPS
- FP8 Tensor Core Performance (with sparsity): 191.9 TFLOPS
- Ampere architecture
- NVIDIA A100-PCIE-40GB (A) / NVIDIA A100 80GB PCIe (B) / NVIDIA A100-SXM4-40GB (C) / NVIDIA A100-SXM4-80GB (D)
- #CUDA Cores: 6,912
- #Tensor Cores: 432
- Compute Capability: 8.0
- Global Memory: 40 GB HBM2 (A), 80 GB HBM2e (B), 40 GB HBM2 (C), 80 GB HBM2e (D)
- Memory Bandwidth: 1.555 TB/s (A), 1.935 TB/s (B), 1.555 TB/s (C), 2.039 TB/s (D)
- FP64 Performance: 9.7 TFLOPS
- FP64 Tensor Core: 19.5 TFLOPS
- FP32 Performance: 19.5 TFLOPS
- TF32 Tensor Core Performance (with sparsity): 312 TFLOPS
- BFLOAT16 Tensor Core Performance (with sparsity): 624 TFLOPS
- FP16 Tensor Core Performance (with sparsity): 624 TFLOPS
- INT8 Tensor Core Performance (with sparsity): 1,248 TOPS
- NVIDIA A800 40GB Active
- #CUDA Cores: 6,912
- #Third-generation Tensor Cores: 432
- Compute Capability: 8.0
- Global Memory: 40 GB HBM2
- Memory Bandwidth: 1.555 TB/s
- FP64 Performance: 9.7 TFLOPS
- FP32 Performance: 19.5 TFLOPS
- FP16 Tensor Core Performance: 623.8 TFLOPS
- INT8 Tensor Core Performance: 1,247 TOPS
- NVIDIA A30
- #CUDA Cores: 3,584
- #Tensor Cores: 224
- Compute Capability: 8.0
- Global Memory: 24 GB HBM2e
- Memory Bandwidth: 0.933 TB/s
- FP64 Performance: 5.2 TFLOPS
- FP32 Performance: 10.3 TFLOPS
- FP64 Tensor Core Performance: 10.3 TFLOPS
- TF32 Tensor Core Performance (with sparsity): 165 TFLOPS
- BFLOAT16 Tensor Core Performance (with sparsity): 330 TFLOPS
- FP16 Tensor Core Performance (with sparsity): 330 TFLOPS
- INT8 Tensor Core Performance (with sparsity): 661 TOPS
- INT4 Tensor Core Performance (with sparsity): 1,321 TOPS
- NVIDIA A40
- #CUDA Cores: 10,752
- #Third-generation Tensor Cores: 336
- #Second-generation RT Cores: 84
- Compute Capability: 8.6
- Global Memory: 48 GB GDDR6 with ECC
- Memory Bandwidth: 696 GB/s
- FP64 Performance: 0.585 TFLOPS
- FP32 Performance: 37.4 TFLOPS
- TF32 Tensor Core Performance (with sparsity): 149.6 TFLOPS
- FP16 Tensor Core Performance (with sparsity): 299.4 TFLOPS
- RT Core Performance: 73.1 TFLOPS
- BF16 Tensor Core Performance (with sparsity): 299.4 TFLOPS
- INT8 Tensor Core Performance (with sparsity): 598.6 TOPS
- INT4 Tensor Core Performance (with sparsity): 1,197.4 TOPS
- NVIDIA RTX A6000
- #CUDA Cores: 10,752
- #Third-generation Tensor Cores: 336
- #Second-generation RT Cores: 84
- Compute Capability: 8.6
- Global Memory: 48 GB GDDR6 with ECC
- Memory Bandwidth: 768 GB/s
- FP64 Performance: ~0.605 TFLOPS
- FP32 Performance: 38.7 TFLOPS
- RT Core Performance: 75.6 TFLOPS
- Tensor Core Performance (with sparsity): 309.7 TFLOPS
- NVIDIA RTX A5500
- #CUDA Cores: 10,240
- #Third-generation Tensor Cores: 320
- #Second-generation RT Cores: 80
- Compute Capability: 8.6
- Global Memory: 24 GB GDDR6
- Memory Bandwidth: 768 GB/s
- FP64 Performance: 0.53 TFLOPS
- FP32 Performance: 34.1 TFLOPS
- RT Core Performance: 66.6 TFLOPS
- Tensor Core Performance (with sparsity): 272.8 TFLOPS
- NVIDIA GeForce RTX 3090
- #CUDA Cores: 10,496
- #Third-generation Tensor Cores: 328
- #Second-generation RT Cores: 82
- Compute Capability: 8.6
- Global Memory: 24 GB GDDR6X
- Memory Bandwidth: 936.2 GB/s
- FP64 Performance: ~0.56 TFLOPS
- FP32 Performance: ~35.6 TFLOPS
- RT Core Performance: 56 TFLOPS
- Tensor Core Performance (with sparsity): 285 TFLOPS
- Turing architecture
- NVIDIA GeForce RTX 2080 Ti
- #CUDA Cores: 4,352
- #Second-generation Tensor Cores: 544
- #First-generation RT Cores: 68
- Compute Capability: 7.5
- Global Memory: 11 GB GDDR6
- Memory Bandwidth: 616 GB/s
- FP64 Performance: ~0.42 TFLOPS
- FP32 Performance: ~13.4 TFLOPS
- Tensor Core Performance (with sparsity): 114 TFLOPS
- Tesla T4
- #CUDA Cores: 2,560
- #Tensor Cores: 320
- #First-generation RT Cores: 40
- Compute Capability: 7.5
- Global Memory: 16 GB GDDR6
- Memory Bandwidth: 300 GB/s
- FP64 Performance: ~0.25 TFLOPS
- FP32 Performance: 8.1 TFLOPS
- FP16 Performance: 65 TFLOPS
- INT8 Performance: 130 TOPS
- INT4 Performance: 260 TOPS
- Volta architecture
- Tesla V100-PCIE-16GB
- #CUDA Cores: 5,120
- #Tensor Cores: 640
- Compute Capability: 7.0
- Global Memory: 16 GB HBM2
- Memory Bandwidth: 897.0 GB/s
- FP64 Performance: 7.066 TFLOPS
- FP32 Performance: 14.13 TFLOPS
- FP16 Performance: 28.26 TFLOPS
- NVIDIA TITAN V
- #CUDA Cores: 5,120
- #First-generation Tensor Cores: 640
- Compute Capability: 7.0
- Global Memory: 12 GB HBM2
- Memory Bandwidth: 652.8 GB/s
- FP64 Performance: 7.450 TFLOPS
- FP32 Performance: 14.9 TFLOPS
- FP16 Performance: 29.80 TFLOPS
- Pascal architecture
- Tesla P40
- #CUDA Cores: 3,840
- Compute Capability: 6.1
- Global Memory: 24 GB GDDR5
- Memory Bandwidth: 346 GB/s
- FP64 Performance: ~0.375 TFLOPS
- FP32 Performance: 12 TFLOPS
- NVIDIA GeForce GTX 1080 Ti
- #CUDA Cores: 3,584
- Compute Capability: 6.1
- Global Memory: 11 GB GDDR5X
- Memory Bandwidth: 484 GB/s
- FP64 Performance: 0.35 TFLOPS
- FP32 Performance: 11.30 TFLOPS
- Tesla P100-PCIE-16GB
- #CUDA Cores: 3,584
- Compute Capability: 6.0
- Global Memory: 16 GB HBM2
- Memory Bandwidth: 732 GB/s
- FP64 Performance: 4.67 TFLOPS
- FP32 Performance: 9.34 TFLOPS
- Maxwell architecture
- NVIDIA GeForce GTX Titan X
- #CUDA Cores: 3,072
- Compute Capability: 5.2
- Global Memory: 12 GB GDDR5
- Memory Bandwidth: 336.5 GB/s
- FP64 Performance: ~0.22 TFLOPS
- FP32 Performance: ~7.0 TFLOPS
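The compute capability values above determine which CUDA features and toolkit versions a device supports, and they map one-to-one onto the architecture generations in this list. A small sketch of that mapping, using only the capabilities that appear above:

```python
# Compute capability (major, minor) -> architecture, per the hardware listed above.
ARCH_BY_CC = {
    (9, 0): "Hopper",
    (8, 9): "Ada Lovelace",
    (8, 6): "Ampere",
    (8, 0): "Ampere",
    (7, 5): "Turing",
    (7, 0): "Volta",
    (6, 1): "Pascal",
    (6, 0): "Pascal",
    (5, 2): "Maxwell",
}

def architecture(major: int, minor: int) -> str:
    """Return the architecture name for a compute capability, or 'unknown'."""
    return ARCH_BY_CC.get((major, minor), "unknown")
```

On a compute node you can query this directly; with a reasonably recent NVIDIA driver, `nvidia-smi --query-gpu=name,compute_cap --format=csv` reports the device name and compute capability of each allocated GPU.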