cuda and my gpu

it's time to learn cuda

my notes on nvidia ampere (ga104) gpu architecture

----------------------------------------------------

specs

----------------------------------------------------

notes

device: grid > blocks > threads (a grid contains blocks, a block contains threads)
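
in code, that hierarchy shows up as the built-in index variables. a trivial sketch (kernel name is made up):

    // grid > blocks > threads: each thread works out its own global index
    __global__ void globalIndex(int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // block offset + thread offset within the block
        if (i < n)
            out[i] = i;
    }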

data: split into tiles that fit in shared memory

all threads within a single block can share data through shared memory, but threads in different blocks can only share via global memory

first version of matx with a tile size of 16 ended up doing better than both 32 and 8

max shared memory per block: 48 KB. at 4 KB per 16x16 tile that leaves plenty of room for double-buffered tiling; 32x32 would use up too much shared mem
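
roughly what that tiled kernel looks like. a minimal sketch, assuming square float matrices with N a multiple of the tile size and no double buffering (this is the textbook version, not my actual matx code):

    #define TILE 16

    // each block computes one TILE x TILE patch of C; each thread one element
    __global__ void matMulTiled(const float *A, const float *B, float *C, int N)
    {
        __shared__ float As[TILE][TILE];   // tile of A staged in shared mem
        __shared__ float Bs[TILE][TILE];   // tile of B staged in shared mem

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            // every thread loads one element of each tile from global mem
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();               // wait until the whole tile is loaded

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();               // wait before overwriting the tiles
        }
        C[row * N + col] = acc;
    }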

gpt says "Aim to keep all SMs busy by launching enough blocks to maximize occupancy. For example, launching around 46 * 2 = 92 or more blocks would help keep all SMs engaged."

block size should be a multiple of the warp size (32) and fit within the max threads per block (1024), so something like 256 or 512
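
for a concrete size, say 1024x1024 with the 16x16 tiles from above (dA/dB/dC are assumed device pointers that were already cudaMalloc'd and filled):

    dim3 block(16, 16);                    // 256 threads = 8 warps per block
    dim3 grid(1024 / 16, 1024 / 16);       // 64 x 64 = 4096 blocks, far more than the ~92 needed to cover 46 SMs
    matMulTiled<<<grid, block>>>(dA, dB, dC, 1024);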

gpt says "Coalescing Memory Accesses: Design your memory access patterns to be aligned to 32-thread warps to improve memory coalescing, meaning that threads in a warp access contiguous memory addresses. This reduces the number of memory transactions and increases bandwidth utilization."
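
a toy pair of copy kernels to make the coalescing point concrete (names are made up):

    // coalesced: consecutive threads in a warp read consecutive addresses,
    // so the warp's 32 loads collapse into a few wide memory transactions
    __global__ void copyCoalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // strided: neighboring threads hit addresses `stride` elements apart,
    // so the same 32 loads need many separate transactions
    __global__ void copyStrided(const float *in, float *out, int n, int stride)
    {
        int j = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (j < n)
            out[j] = in[j];
    }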

again, gpt says "Each SM can handle a maximum of 16 resident blocks at once. Therefore, try to design your grid and block configuration to stay below this limit. If your kernel uses very few registers and shared memory, aim for higher occupancy by launching multiple blocks per SM to leverage all available SMs."
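
instead of guessing, the runtime can report the resident-blocks number directly. a sketch assuming the matMulTiled kernel above with 256-thread blocks:

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, matMulTiled, 256, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("resident blocks per SM: %d, SMs: %d -> %d blocks fill the gpu\n",
           blocksPerSM, prop.multiProcessorCount, blocksPerSM * prop.multiProcessorCount);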

need to read up on SMs, registers, and coalesced memory access

----------------------------------------------------

cuda sm

the total run comes out around 200 ms

but the actual matxMul kernel is only 142 us

the call to cudaMalloc alone is 75 ms (presumably mostly one-time cuda context initialization, since it's the first cuda call)

this can't be the right way to do this stuff: 99% of the time is spent not actually doing the thing?
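
a sketch of timing just the kernel with cuda events, so the one-time cudaMalloc / context setup doesn't get mixed in (grid, block and the device pointers as in the launch example above):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    matMulTiled<<<grid, block>>>(dA, dB, dC, 1024);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed wall time in milliseconds
    printf("kernel: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);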

----------------------------------------------------

concepts

----------------------------------------------------

directory