"Tuning CUDA Applications for NVIDIA Ampere GPU Architecture"

link to the guide here

CUDA best practices

the performance guidelines and best practices described in the CUDA C++ Programming Guide and the CUDA C++ Best Practices Guide apply to all CUDA-capable GPU architectures

architecture tuning

Streaming Multiprocessors (SMs)

max number of concurrent warps per SM for compute capability 8.6 is 48

max number of registers per thread is 255

max number of thread blocks per SM is 16 for GPUs with compute capability 8.6

register file size is 64K (65,536) 32-bit registers per SM

shared memory capacity per SM is 100 KB

max shared memory per thread block is 99 KB

100 KB = 102,400 bytes = 25,600 FP32 values (4 bytes each)

a 16x16 tile is 256 values and a 112x112 tile is 12,544 values (~49 KB as FP32), which fits in the 99 KB of shared memory per thread block

the `pipeline` API in CUDA allows asynchronous data copies from global to shared memory. these copies avoid staging the data through per-thread registers and can also bypass the L1 cache
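A minimal sketch of this using the cooperative-groups `memcpy_async` front end to the async-copy hardware (the kernel name, `factor` parameter, and launch geometry are made up for illustration):

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Each block stages one tile of `in` into shared memory. The async copy
// moves data global -> shared directly, instead of a plain load/store loop
// that would route every element through a register.
__global__ void scale_kernel(const float* in, float* out, float factor) {
    extern __shared__ float tile[];
    cg::thread_block block = cg::this_thread_block();
    const size_t offset = (size_t)blockIdx.x * blockDim.x;

    cg::memcpy_async(block, tile, in + offset, sizeof(float) * blockDim.x);
    cg::wait(block);  // block until this thread block's copy has landed

    out[offset + threadIdx.x] = tile[threadIdx.x] * factor;
}
```

The same mechanism is exposed with finer control (multi-stage prefetching) through `cuda::pipeline` in `<cuda/pipeline>`.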

compute capability 8.6 has 2x the FP32 operations per cycle per SM compared to 8.0

----------------------------

255 registers per thread

1 register holds one 32-bit value (FP32 or int)

1 thread can hold at most 255 values in registers

1 warp has max of 32*255=8160 ints

1 SM can run 48 concurrent warps, so the theoretical max is 48*8160 = 391,680 values; but the register file is only 65,536 registers per SM, so at 255 registers per thread only 65,536 / 8,160 = 8 warps can actually be resident

----------------------------------------------------
