ga104 architecture
notes on the ga104 chip, as that's what's in my 3070; mostly taken from the
NVIDIA AMPERE GPU ARCHITECTURE whitepaper
plus notes from starting to write CUDA kernels
- gpu architecture: NVIDIA ampere
- GPCs: 6
- TPCs: 23
- SMs: 46
- CUDA Cores / SM: 128
- CUDA Cores / GPU: 5888
- Tensor Cores / SM: 4
- Tensor Cores / GPU: 184
- Peak FP16 Tensor TFLOPS with FP16 Accum: 81.3/162.6 (dense / with sparsity)
- Peak INT8 Tensor TOPS: 162.6/325.2 (dense / with sparsity)
- L1 Data Cache / Shared Memory: 5888 KB (128 KB per SM)
- L2 Cache Size: 4096 KB
- Register File Size: 11776 KB (256 KB per SM)
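a quick way to sanity check these numbers against the actual card is to query
the runtime. a minimal sketch using standard cudaDeviceProp fields (the values
printed are whatever the board reports):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("name: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);        // expect 46
    printf("L2: %d KB\n", prop.l2CacheSize / 1024);       // expect 4096 KB
    printf("regs/SM: %d\n", prop.regsPerMultiprocessor);  // 65536 x 4 B = 256 KB
    // sharedMemPerMultiprocessor is the usable part of the 128 KB L1/smem
    printf("smem/SM: %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    return 0;
}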
----------------------------------------------------
SMs
each SM in ga10x contains 128 CUDA Cores, 4 Tensor Cores, a 256 KB register
file, and 128 KB of L1/shared mem
the ga10x SM is partitioned into 4 processing blocks or partitions, each with
a 64 KB register file, an L0 instruction cache, one warp scheduler, one
dispatch unit, and sets of math and other units
the four partitions share the combined 128 KB L1 data cache/shared mem
subsystem
----------------------------------------------------
in the last gen (turing), each SM partition had only 1 datapath for FP32 ops
ampere has FP32 processing on both datapaths, doubling the peak processing
rate for FP32 ops
one datapath in each partition consists of 16 FP32 CUDA Cores capable of
executing 16 FP32 operations per clock
the other datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores,
capable of executing either 16 FP32 or 16 INT32 operations per clock
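a toy kernel to poke at this: the int32 index arithmetic and the fp32 math
below are independent, so the warp scheduler can in principle feed both
datapaths. just a sketch to experiment with, not a benchmark; whether both
datapaths actually get used depends on the SASS the compiler emits:

#include <cuda_runtime.h>

__global__ void fma_mix(float* out, const float* a, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // int32 work
    if (i < n)
        out[i] = fmaf(a[i], b[i], 1.0f);            // fp32 work (one fma)
}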
----------------------------------------------------
additional gpu arch notes
notes on SMs for Fermi architecture
things to learn:
- Streaming Multiprocessor (SM)
- warp and warp scheduler
- registers
notes
- a cuda core is the execution unit; it has one floating-point and one
  integer compute unit
- the SM schedules threads in groups of 32 threads called warps
- the two warp schedulers mean two warps can be issued at the same time
- registers are the fastest memory; L1 cache and shared memory come second
more notes:
- a thread is the finest granularity. each thread has a unique identifier
  within its block (threadIdx) which is used to select which data to operate
  on. a thread can have a relatively large number of registers and also has a
  private area of memory known as local memory, used for register file
  spilling and any large automatic variables
- a block is a group of threads which execute together as a batch. the point
  of this level of granularity is that threads within a block can cooperate
  by communicating through the fast shared memory. each block has a unique
  identifier (blockIdx) which, in conjunction with threadIdx, is used to
  select data (see the indexing sketch after this list)
- a grid is the set of blocks which together execute the GPU operation
- a warp is a set of 32 threads, so with 128 threads per block, threads 0-31
  will be in one warp, 32-63 in another, and so on
- threads within a warp are bound together: they fetch data together, so if
  you can ensure all threads in a warp fetch data within the same 'segment'
  you only pay one memory transaction; if they all fetch from random
  addresses you pay 32 memory transactions (the sketch after this list shows
  a coalesced pattern)
- each block is launched on an SM and runs there until it is done, then the
  next block is launched. so with e.g. 30 SMs, blocks are scheduled across
  the SMs dynamically. when you launch a GPU function, make sure the grid is
  composed of a large number of blocks (at least hundreds) so it scales
  across the GPU
- an SM can execute more than one block at any given time. this is why an SM
  can handle 768 (or more) threads while a block is limited to 512 threads
- if the SM has resources available it will take on additional blocks
  (up to 8)
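a minimal sketch tying these terms together (hypothetical kernel; assumes
d_in/d_out are device buffers of n floats):

#include <cuda_runtime.h>

// global index from blockIdx/threadIdx. with 128 threads per block each
// block holds 4 warps (threads 0-31, 32-63, 64-95, 96-127). consecutive
// threads read consecutive addresses, so a warp's 32 loads land in the
// same segment(s) -> one (or few) memory transactions, i.e. coalesced.
__global__ void scale(float* out, const float* in, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];
}

// launch with a large number of blocks so the grid spreads across the SMs:
//   scale<<<(n + 127) / 128, 128>>>(d_out, d_in, 2.0f, n);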
----------------------------------------------------
directory