matmul optimization

notes on methods for optimizing fp32 gemm performance on nvidia gpus

-----------------------

own notes

after reading a few things, it seems like getting from no optimization up to cuBLAS-level perf takes 2D block tiling + 2D warp tiling + 2D thread tiling + vectorized mem access
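
a minimal sketch of just the vectorized mem access part (my own illustration, not from any of the articles): reinterpreting a 16-byte-aligned fp32 pointer as float4 turns four 32-bit loads into a single 128-bit load

```cuda
// hedged sketch: float4 loads/stores compile to 128-bit transactions
// (LDG.128 / STG.128) instead of four 32-bit ones.
// assumes in/out are 16-byte aligned and n is a multiple of 4.
__global__ void copy_vectorized(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one float4 per thread
    if (i < n / 4) {
        float4 v = reinterpret_cast<const float4 *>(in)[i];  // 4 floats, 1 load
        reinterpret_cast<float4 *>(out)[i] = v;              // 4 floats, 1 store
    }
}
```

note: cudaMalloc already returns pointers aligned to at least 256 bytes, so the alignment requirement mostly matters when indexing at an offset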

adding cuBLAS documentation to reading list

data path: global mem -> shared mem -> register file -> SM CUDA core

tiling hierarchy: blocked GEMM -> thread block tile -> warp tile -> thread tile

the two lists line up: each tiling level stages its data one step down the memory hierarchy (the block tile lives in SMEM, the warp/thread tiles live in registers)
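
a minimal sketch of the block tile -> thread tile levels, loosely in the style of those articles (warp tiling and vectorized loads omitted to keep it short); the tile sizes BM/BN/BK/TM, the kernel name, and the launch shape are my own illustrative choices, not taken from the sources. it also walks the data path above: GMEM -> SMEM -> registers -> FMA on a CUDA core

```cuda
// hedged sketch of a block-tiled + thread-tiled fp32 GEMM (C = A * B,
// all matrices row-major). assumes M % BM == 0, N % BN == 0, K % BK == 0.
// launch: grid dim3(N / BN, M / BM), block dim3((BM / TM) * BN) = 512 threads.
#define BM 64  // rows of C per thread block
#define BN 64  // cols of C per thread block
#define BK 8   // K-slice staged in SMEM per iteration
#define TM 8   // rows of C per thread (the thread tile)

__global__ void sgemm_thread_tiled(int M, int N, int K,
                                   const float *A, const float *B, float *C) {
    __shared__ float As[BM * BK];  // block tile of A (SMEM)
    __shared__ float Bs[BK * BN];  // block tile of B (SMEM)

    // this thread owns a TM x 1 strip of the BM x BN block tile of C
    const int threadCol = threadIdx.x % BN;  // 0..63
    const int threadRow = threadIdx.x / BN;  // 0..7

    // indices for the cooperative GMEM -> SMEM loads (one element per thread)
    const int innerRowA = threadIdx.x / BK, innerColA = threadIdx.x % BK;
    const int innerRowB = threadIdx.x / BN, innerColB = threadIdx.x % BN;

    // advance pointers to this block's tile
    A += blockIdx.y * BM * K;
    B += blockIdx.x * BN;
    C += blockIdx.y * BM * N + blockIdx.x * BN;

    float acc[TM] = {0.0f};  // per-thread results live in the register file

    for (int k0 = 0; k0 < K; k0 += BK) {
        // GMEM -> SMEM: stage one BM x BK tile of A and one BK x BN tile of B
        As[innerRowA * BK + innerColA] = A[innerRowA * K + innerColA];
        Bs[innerRowB * BN + innerColB] = B[innerRowB * N + innerColB];
        __syncthreads();
        A += BK;
        B += BK * N;

        // SMEM -> registers -> FMA: each Bs value is reused TM times
        for (int k = 0; k < BK; ++k) {
            float b = Bs[k * BN + threadCol];
            for (int m = 0; m < TM; ++m)
                acc[m] += As[(threadRow * TM + m) * BK + k] * b;
        }
        __syncthreads();
    }

    // registers -> GMEM
    for (int m = 0; m < TM; ++m)
        C[(threadRow * TM + m) * N + threadCol] = acc[m];
}
```

the point of the thread tile: each Bs value pulled into a register gets reused for TM fused multiply-adds, cutting SMEM traffic by that factor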

global memory = DRAM

L2 cache = on-chip, device-wide cache for GMEM accesses (not GMEM itself)

L1 and shared memory (SMEM) = same physical on-chip storage per SM (on most architectures); SMEM is the software-managed part

register file = fastest mem, private to each thread
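
the sizes of each level can be queried directly; these are standard cudaDeviceProp fields

```cuda
// prints the capacity of each memory level on device 0
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("GMEM (DRAM):      %zu MiB\n", p.totalGlobalMem >> 20);
    printf("L2 cache:         %d KiB\n", p.l2CacheSize >> 10);
    printf("SMEM per SM:      %zu KiB\n", p.sharedMemPerMultiprocessor >> 10);
    printf("registers per SM: %d x 32-bit\n", p.regsPerMultiprocessor);
    return 0;
}
```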

-----------------------

most of this section comes from this article

-----------------------

other methods taken from: CUDA matmul kernel for cuBLAS-like perf

-----------------------

directory