notes on optimization methods for performance of fp32 gemm on nvidia gpus
-----------------------
after reading a few things it seems like matching cuBLAS-level perf means going from no optimization all the way up to 2D block tiling + 2D warp tiling + 2D thread tiling + vectorized memory access
adding cuBLAS documentation to reading list
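one of the steps above, vectorized memory access, can be sketched on its own. a minimal example, not from the source: each thread moves 128 bits per load via float4 instead of four separate 32-bit loads, so the warp issues 4x fewer load instructions. the kernel name and the divisible-by-4 size assumption are mine.

```cuda
// sketch: vectorized global memory access with float4
// assumes n is divisible by 4 and pointers are 16-byte aligned
__global__ void copy_vec4(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * 4 < n) {
        // one 128-bit load + one 128-bit store per thread
        float4 v = reinterpret_cast<const float4 *>(in)[i];
        reinterpret_cast<float4 *>(out)[i] = v;
    }
}
```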
global mem -> shared mem -> register file -> SM CUDA core
blocked GEMM -> thread block tile -> warp tile -> thread tile
global memory = DRAM
L2 cache is not GMEM — it's on-chip, shared by all SMs, and caches GMEM traffic
L1 and shared memory (SMEM) share the same on-chip storage per SM
register file = fastest mem
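the two hierarchies above line up: each thread block stages a tile of A and B from global mem into shared mem, and each thread accumulates in registers. a minimal sketch of just that one level (block tiling + SMEM, no warp/thread tiling yet), not from the source — BLOCK and the divisible-sizes assumption are mine:

```cuda
#define BLOCK 32

// C = A * B, row-major; M, N, K assumed divisible by BLOCK for brevity
__global__ void sgemm_smem(const float *A, const float *B, float *C,
                           int M, int N, int K) {
    __shared__ float As[BLOCK][BLOCK];   // thread block tile of A in SMEM
    __shared__ float Bs[BLOCK][BLOCK];   // thread block tile of B in SMEM

    int row = blockIdx.y * BLOCK + threadIdx.y;
    int col = blockIdx.x * BLOCK + threadIdx.x;
    float acc = 0.0f;                    // accumulator lives in a register

    for (int t = 0; t < K; t += BLOCK) { // march tiles along the K dim
        // GMEM -> SMEM: each thread stages one element of each tile
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();                 // tile fully staged

        for (int k = 0; k < BLOCK; ++k)  // SMEM -> registers -> CUDA core
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                 // done reading this tile
    }
    C[row * N + col] = acc;              // register -> GMEM write
}
```

warp tiling and thread tiling extend this by having each thread compute a small sub-tile of C instead of one element, reusing the staged data more before going back to SMEM.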
-----------------------
most of this section comes from this article
-----------------------
other methods taken from here: CUDA matmul kernel for cuBLAS-like performance
-----------------------
directory