llms, training, inference, 2025

some understanding of attention and modern transformers is assumed.

kv cache attention variations

a kv cache enables reuse of the key and value vectors computed for previous tokens. because autoregressive models attend over the full prefix, producing the nth token requires the keys and values of tokens 0 to n-1 as well. instead of recomputing those at every step, we cache them and reuse them when computing token n. the drawback is memory: the cached keys and values are not small vectors, and they grow linearly with context length, which gets expensive at long context, think 100k tokens up to 1m tokens.
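a minimal sketch of the idea, assuming a toy single-head setup; the names here (decode_step, the W_* matrices, d) are illustrative, not any library's API:

```python
import torch

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """one autoregressive decode step with a kv cache.

    x_new: (1, d) embedding of the newest token only.
    k_cache, v_cache: (t, d) keys/values of the t previous tokens.
    returns the attention output for the new token and the grown caches.
    """
    q = x_new @ W_q                      # query for the new token
    k = x_new @ W_k                      # key for the new token only
    v = x_new @ W_v                      # value for the new token only

    # reuse cached keys/values instead of recomputing tokens 0..t-1
    k_all = torch.cat([k_cache, k], dim=0)   # (t+1, d)
    v_all = torch.cat([v_cache, v], dim=0)   # (t+1, d)

    scores = q @ k_all.T / k_all.shape[-1] ** 0.5   # (1, t+1)
    out = torch.softmax(scores, dim=-1) @ v_all     # (1, d)
    return out, k_all, v_all

# toy usage: start with an empty cache and decode 3 tokens
d = 8
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
for _ in range(3):
    x_new = torch.randn(1, d)
    out, k_cache, v_cache = decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache)
```

for a sense of scale (illustrative numbers, not any specific model): 32 layers with 8 kv heads of dim 128 in fp16 cache 2 * 32 * 8 * 128 * 100000 * 2 bytes, roughly 13 gb, for a single 100k-token sequence.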

because of these memory requirements, the kv cache is a major bottleneck and a constant focus for innovation. a few approaches are mqa, gqa, and mla, alongside separate efforts to quantize the kv cache itself.
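a rough sketch of the grouped-query idea (gqa), assuming toy shapes I made up: several query heads share one kv head, so the cache shrinks by the same ratio; mqa is the extreme case of a single shared kv head.

```python
import torch

# toy shapes: 8 query heads share 2 kv heads (gqa); mqa would be n_kv_heads = 1
n_q_heads, n_kv_heads, d_head, seq_len = 8, 2, 64, 1024

q = torch.randn(n_q_heads, 1, d_head)                # queries for the newest token
k_cache = torch.randn(n_kv_heads, seq_len, d_head)   # only n_kv_heads worth of keys
v_cache = torch.randn(n_kv_heads, seq_len, d_head)   # ...and values are ever stored

group = n_q_heads // n_kv_heads
# each kv head serves `group` query heads (materialized here for clarity;
# real kernels broadcast instead of copying)
k = k_cache.repeat_interleave(group, dim=0)          # (n_q_heads, seq_len, d_head)
v = v_cache.repeat_interleave(group, dim=0)

scores = q @ k.transpose(-2, -1) / d_head ** 0.5     # (n_q_heads, 1, seq_len)
out = torch.softmax(scores, dim=-1) @ v              # (n_q_heads, 1, d_head)

# cached tensors are n_kv_heads * seq_len * d_head per layer (x2 for k and v),
# 4x smaller than full multi-head here; mqa would make them 8x smaller
```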

[figure: ga104]

to summarize the figure above:

post training

pre training is the step that feeds the model huge amounts of unstructured data to teach it general language and world knowledge, while post training is what gives models their assistant-like behavior and makes them useful to users.

post training is also the primary concern for most companies, since foundation models and pretraining are the work of labs and teams not necessarily focused on a specific user base. llama and qwen are the most popular foundations to build from.

[figure: grpo vs ppo]

to summarize the figure above:

hardware aware algorithms and techniques


directory