llms, training, inference, 2025
some understanding of attention and modern transformers is assumed.
kv cache attention variations
the kv cache enables reuse of the key and value vectors computed for previous tokens. because the model is autoregressive, producing the nth token requires attending over tokens 0 to n-1, which naively means recomputing their keys and values at every step. luckily we can just cache those computations and reuse them when computing token n. the drawback is that the cached kv tensors are not small vectors, but rather pretty large, and especially so with long context, think 100k tokens up to 1m tokens.
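a minimal sketch of the idea during decoding, assuming a single attention layer; the head counts are illustrative and the random tensors stand in for the model's real q/k/v projections, so this is not any particular library's implementation:

```python
import torch

# illustrative shapes; real models have many layers, each with its own cache
n_heads, head_dim = 32, 128

k_cache = torch.empty(0, n_heads, head_dim)  # [tokens_so_far, heads, head_dim]
v_cache = torch.empty(0, n_heads, head_dim)

def decode_step(q_new, k_new, v_new):
    """append this token's k/v to the cache, then attend over the full history."""
    global k_cache, v_cache
    k_cache = torch.cat([k_cache, k_new], dim=0)  # reuse all previously computed k
    v_cache = torch.cat([v_cache, v_new], dim=0)  # reuse all previously computed v
    # only the new token's query is needed; keys and values come from the cache
    scores = torch.einsum("hd,shd->hs", q_new[0], k_cache) / head_dim**0.5
    return torch.einsum("hs,shd->hd", scores.softmax(dim=-1), v_cache)

# one decoding step with random stand-ins for the token's projections
out = decode_step(torch.randn(1, n_heads, head_dim),
                  torch.randn(1, n_heads, head_dim),
                  torch.randn(1, n_heads, head_dim))
```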
because of these memory requirements, the kv cache is a major inference bottleneck and a constant focus for new solutions, a few being mqa, gqa, and mla, in addition to other efforts to quantize kv caches.
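as a rough illustration of the quantization angle, here's a minimal int8 sketch with a single per-tensor scale; real schemes usually use per-channel or per-token scales, and quantize_kv/dequantize_kv are hypothetical helpers, not from any library:

```python
import torch

def quantize_kv(x):
    # one scale for the whole tensor; finer-grained scales preserve more accuracy
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    x_q = (x / scale).round().clamp(-127, 127).to(torch.int8)
    return x_q, scale

def dequantize_kv(x_q, scale):
    return x_q.float() * scale  # int8 storage roughly halves memory vs fp16

# e.g. an illustrative cached key tensor of shape [tokens, heads, head_dim]
k_q, k_scale = quantize_kv(torch.randn(1024, 32, 128))
k_approx = dequantize_kv(k_q, k_scale)
```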
to summarize the figure above:
- mha uses 1:1 kv to q heads, resulting in the largest kv requirements
- mqa uses a single kv head for all q heads, the smallest memory requirements but perf is hurt
- gqa uses a single kv head per group of q heads, middle memory requirements with minimal perf loss (a rough sketch of the kv sharing follows this list)
- mla caches a compressed latent representation of kv instead of the actual kv, more memory than mqa but better perf than gqa
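a rough sketch of the gqa idea under assumed head counts (32 query heads sharing 8 kv heads); the tensors are random stand-ins and the causal mask is omitted for brevity:

```python
import torch

n_q_heads, n_kv_heads, head_dim, seq = 32, 8, 128, 16
group_size = n_q_heads // n_kv_heads  # here, 4 query heads share each kv head

q = torch.randn(seq, n_q_heads, head_dim)
k = torch.randn(seq, n_kv_heads, head_dim)  # cached kv is 4x smaller than mha here
v = torch.randn(seq, n_kv_heads, head_dim)

# broadcast each kv head to its group of query heads
k_shared = k.repeat_interleave(group_size, dim=1)  # [seq, n_q_heads, head_dim]
v_shared = v.repeat_interleave(group_size, dim=1)

scores = torch.einsum("qhd,khd->hqk", q, k_shared) / head_dim**0.5
out = torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v_shared)
# group_size == 1 recovers mha; n_kv_heads == 1 recovers mqa
```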
post training
pre training is the step that gives the model huge amounts of unstructured data to teach it general text and world knowledge, while post training is what gives models their assistant-like characteristics and makes them useful to users.
post training is also the primary concern for companies, given that most foundation models and pretraining are the work of labs and teams not necessarily focused on a specific user base. llama and qwen are the most popular foundations to build from.
- openai researchers published a notable paper in 2022 (instructgpt) that described how they conduct reinforcement learning from human feedback in stages:
  - use SFT to finetune a base model on human-annotated candidate generations, resulting in an instruct model
  - sample from this model and have human labelers rank candidate outputs in order of preference. these comparisons are used to train a reward model (its loss is sketched after this list)
  - finally, PPO is conducted with a base model/policy that generates outputs, the reward model from step 2 calculates a reward for that output, and the reward is used to update the policy
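a minimal sketch of the step-2 preference loss (a bradley-terry style pairwise ranking loss); reward_model, chosen_tokens, and rejected_tokens are hypothetical stand-ins for a scalar-output network and one labeled comparison pair:

```python
import torch.nn.functional as F

def reward_loss(reward_model, chosen_tokens, rejected_tokens):
    """train the reward model to score the human-preferred output above the rejected one."""
    r_chosen = reward_model(chosen_tokens)      # scalar reward per sequence
    r_rejected = reward_model(rejected_tokens)
    # -log sigmoid(r_chosen - r_rejected): minimized when chosen outranks rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```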
to summarize the figure above:
- grpo was recently introduced by deepseek as an updated policy optimization method relative to the prior ppo. ppo can incur large resource requirements because it maintains multiple similarly sized models throughout training, whereas grpo does without the value model and instead uses the average reward of the group as the baseline (sketched below). grpo also plays well into the nature of policy optimization here, in which we compare multiple sampled generations with the goal of getting closer to a model that produces the most desired generation.
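a minimal sketch of that group baseline, assuming the mean/std normalization commonly described for grpo; the reward values are made-up numbers:

```python
import torch

def group_advantages(rewards):
    """rewards: [group_size] scalar rewards for multiple sampled outputs of one prompt."""
    # the group mean replaces ppo's learned value model as the baseline
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# e.g. 4 sampled completions for the same prompt, scored by a reward model
adv = group_advantages(torch.tensor([0.2, 0.9, 0.4, 0.9]))
# completions above the group mean get positive advantage and are reinforced
```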
hardware-aware algorithms and techniques
positional
sequence length