it's time to learn cuda
my notes on nvidia ampere (ga104) gpu architecture
----------------------------------------------------
----------------------------------------------------
device side: threads group into blocks, blocks group into a grid
data side: the matrix gets carved into tiles (in the matmul, one output tile per block)
all threads within a single block see the same shared mem; threads in different blocks can only share data via global mem
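quick toy sketch to make that concrete (written for these notes, not taken from matx; assumes a launch with 256 threads per block):

// each block sums its own chunk in shared mem, then hands one partial result
// to everyone else through global mem; there is no direct block-to-block channel
__global__ void blockSum(const float *in, float *partials, int n) {
    __shared__ float smem[256];                    // visible to this block only
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    smem[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               // barrier only covers this block

    // tree reduction inside the block, entirely in shared mem
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }

    if (tid == 0) partials[blockIdx.x] = smem[0];  // global mem is the cross-block handoff
}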
first version of matx with tile size 16 ended up doing better than both 32 and 8
max shared memory per block: 48 KB. a 16x16 float tile is 1 KB, so a double-buffered pair of A/B tiles is only ~4 KB, which leaves plenty of room. 32x32 tiles would be 4 KB each (~16 KB double-buffered) and would use up too much shared mem to keep several blocks resident per SM
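rough sketch of the 16x16 tiled kernel (my reconstruction for these notes, not the actual matx code; assumes a square N x N float matrix with N a multiple of 16). the two tiles below are 2 * 16*16*4 B = 2 KB of shared mem, so even doubling that for double buffering stays far under the 48 KB cap:

#define TILE 16

__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // tile of A for this block
    __shared__ float Bs[TILE][TILE];   // tile of B for this block

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // every thread loads one element of each tile into shared mem
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}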
gpt says: "Aim to keep all SMs busy by launching enough blocks to maximize occupancy. For example, launching around 46 * 2 = 92 or more blocks would help keep all SMs engaged."
block size should be a multiple of the warp size (32) and fit within the max threads per block (1024), so something like 256 or 512
gpt says: "Coalescing Memory Accesses: Design your memory access patterns to be aligned to 32-thread warps to improve memory coalescing, meaning that threads in a warp access contiguous memory addresses. This reduces the number of memory transactions and increases bandwidth utilization."
again, gpt says: "Each SM can handle a maximum of 16 resident blocks at once. Therefore, try to design your grid and block configuration to stay below this limit. If your kernel uses very few registers and shared memory, aim for higher occupancy by launching multiple blocks per SM to leverage all available SMs."
need to read up on SMs, registers, and coalesced memory access
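host-side sketch of how I'd sanity check a launch against those tips (assumes the tiledMatMul sketch above; the occupancy query is the real runtime API, the wrapper function is just mine):

#include <cstdio>

void launchConfigCheck(int N) {
    dim3 block(TILE, TILE);                                   // 256 threads = 8 warps
    dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);

    int device = 0, numSMs = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);  // 46 on this ga104

    int blocksPerSM = 0;   // how many of these blocks can be resident on one SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, tiledMatMul,
                                                  block.x * block.y, 0);

    printf("grid %d x %d = %d blocks, %d SMs, up to %d blocks/SM, want >= %d blocks\n",
           grid.x, grid.y, grid.x * grid.y, numSMs, blocksPerSM, 2 * numSMs);
}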
----------------------------------------------------
the whole run clocks in at ~200 ms
but the actual matxMul kernel is only 142 us
the call to cudaMalloc alone is 75 ms
this can't be the right way to do this stuff, 99% of the time spent not actually doing the thing?
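as far as I can tell this is the classic first-cuda-call trap: the first runtime call (here cudaMalloc) also pays the one-time driver/context init, so most of that 75 ms isn't really the malloc. sketch of how I'd split the timing to see that (assumes the tiledMatMul sketch above, N a multiple of 16; A/B contents don't matter for timing):

#include <chrono>
#include <cstdio>

void timeBreakdown(int N) {
    using clk = std::chrono::high_resolution_clock;
    size_t bytes = (size_t)N * N * sizeof(float);

    auto t0 = clk::now();
    cudaFree(0);                       // force context init so it doesn't hide inside cudaMalloc
    auto t1 = clk::now();

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    auto t2 = clk::now();

    // time just the kernel with cuda events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    dim3 block(16, 16), grid(N / 16, N / 16);
    cudaEventRecord(start);
    tiledMatMul<<<grid, block>>>(d_A, d_B, d_C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float kernelMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, stop);

    double initMs   = std::chrono::duration<double, std::milli>(t1 - t0).count();
    double mallocMs = std::chrono::duration<double, std::milli>(t2 - t1).count();
    printf("context init: %.1f ms, 3x cudaMalloc: %.1f ms, kernel: %.3f ms\n",
           initMs, mallocMs, kernelMs);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}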
----------------------------------------------------
----------------------------------------------------
directory