ga104 architecture

notes on the ga104 chip, since that's what's in my 3070; mostly taken from NVIDIA AMPERE GPU ARCHITECTURE

plus notes and first attempts at writing CUDA kernels

----------------------------------------------------

ga104

----------------------------------------------------

SMs

each SM in ga10x contains 128 CUDA cores, 4 Tensor Cores, a 256 KB register file, and 128 KB of L1/shared mem

the ga10x SM is partitioned into 4 processing blocks (partitions), each with a 64 KB register file, an L0 instruction cache, one warp scheduler, one dispatch unit, and sets of math and other units

the four partitions together share the 128 KB L1 data cache/shared mem subsystem

ga104

----------------------------------------------------

in the last gen (Turing), each SM partition had two main datapaths, but only one of them could process FP32 ops; the other was limited to integer ops

ampere has FP32 processing on both datapaths, doubling the peak processing rate for FP32 ops

one datapath in each partition consists of 16 FP32 CUDA cores, capable of executing 16 FP32 ops per clock

the other datapath in each partition consists of both 16 FP32 CUDA cores and 16 INT32 cores, capable of executing either 16 FP32 or 16 INT32 ops per clock

----------------------------------------------------

additional gpu arch notes

notes on SMs for Fermi architecture

things to learn:

notes

more notes:

----------------------------------------------------

directory