notes on optimization methods for performance of fp32 gemm on nvidia gpus
-----------------------
after reading a few things it seems like matching cuBLAS-level perf means going from no optimization all the way up to 2D block tiling + 2D warp tiling + 2D thread tiling + vectorized memory access
adding cuBLAS documentation to reading list
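one of the steps above, vectorized memory access, can be sketched on its own. a minimal example, not from the source: each thread moves 128 bits per load via float4 instead of four separate 32-bit loads, so the warp issues 4x fewer load instructions. the kernel name and the divisible-by-4 size assumption are mine.

```cuda
// sketch: vectorized global memory access with float4
// assumes n is divisible by 4 and pointers are 16-byte aligned
__global__ void copy_vec4(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * 4 < n) {
        // one 128-bit load + one 128-bit store per thread
        float4 v = reinterpret_cast<const float4 *>(in)[i];
        reinterpret_cast<float4 *>(out)[i] = v;
    }
}
```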
global mem -> shared mem -> register file -> SM CUDA core
blocked GEMM -> thread block tile -> warp tile -> thread tile
global memory = DRAM
L2 cache is not GMEM — it's on-chip, shared by all SMs, and caches GMEM traffic
L1 and shared memory (SMEM) share the same on-chip storage per SM
register file = fastest mem
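the two hierarchies above line up: each thread block stages a tile of A and B from global mem into shared mem, and each thread accumulates in registers. a minimal sketch of just that one level (block tiling + SMEM, no warp/thread tiling yet), not from the source — BLOCK and the divisible-sizes assumption are mine:

```cuda
#define BLOCK 32

// C = A * B, row-major; M, N, K assumed divisible by BLOCK for brevity
__global__ void sgemm_smem(const float *A, const float *B, float *C,
                           int M, int N, int K) {
    __shared__ float As[BLOCK][BLOCK];   // thread block tile of A in SMEM
    __shared__ float Bs[BLOCK][BLOCK];   // thread block tile of B in SMEM

    int row = blockIdx.y * BLOCK + threadIdx.y;
    int col = blockIdx.x * BLOCK + threadIdx.x;
    float acc = 0.0f;                    // accumulator lives in a register

    for (int t = 0; t < K; t += BLOCK) { // march tiles along the K dim
        // GMEM -> SMEM: each thread stages one element of each tile
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();                 // tile fully staged

        for (int k = 0; k < BLOCK; ++k)  // SMEM -> registers -> CUDA core
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // done reading this tile
    }
    C[row * N + col] = acc;              // register -> GMEM write
}
```

warp tiling and thread tiling extend this by having each thread compute a small sub-tile of C instead of one element, reusing the staged data more before going back to SMEM.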
-----------------------
most of this section comes from this article
-----------------------
other methods taken from here: CUDA matmul kernel for cuBLAS-like performance
-----------------------
directory