Topic hub · 10 claims
Inference optimization — quantization, attention, and serving
The techniques that take a frontier model from "impossible to deploy" to "$0.001 per call." Quantization, attention algorithms, fine-tuning adapters, and serving systems.
The inference-cost wall
Training a 70B-parameter model is expensive once; running it for millions of users is expensive forever. Inference optimization has driven most of the practical-deployment progress from 2022 to 2025, along three axes: faster attention (FlashAttention, PagedAttention), smaller weights (GPTQ, AWQ, QLoRA, GGUF quantization), and better serving (vLLM, llama.cpp, Ollama, TGI).
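A minimal sketch of the serving axis, assuming vLLM's offline Python API; the model ID below is a placeholder for a 4-bit AWQ checkpoint, and the quantization and sampling settings would vary by deployment.

```python
# Sketch: batched generation with vLLM's offline API on a quantized model.
# Model ID and settings are illustrative placeholders, not recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # placeholder: a 4-bit AWQ checkpoint
    quantization="awq",               # tell vLLM the weights are AWQ-quantized
)
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches these requests and pages their KV caches internally.
prompts = [
    "Summarize FlashAttention in one sentence.",
    "Why is 4-bit weight quantization often nearly lossless?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```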
Attention improvements
FlashAttention (Dao et al., 2022) computes exact attention with IO-aware tiling, never materializing the full attention matrix in GPU high-bandwidth memory, so it gives the same output with far less memory traffic. PagedAttention (Kwon et al., 2023; the core of vLLM) manages the KV cache in fixed-size blocks the way an OS manages virtual-memory pages, eliminating fragmentation and enabling dense request batching. Together they unlock context lengths and throughput that were previously impractical on commodity hardware.
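To make the tiling idea concrete, here is a minimal PyTorch sketch of online-softmax attention over key/value blocks. It reproduces plain (non-causal) attention numerically, but unlike the real fused FlashAttention kernel it is not IO-optimized; it exists only to show the running-max and normalizer bookkeeping that lets the full score matrix stay unmaterialized.

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """q, k, v: (seq_len, head_dim). Returns softmax(q k^T / sqrt(d)) @ v,
    processing keys/values one block at a time (online softmax)."""
    seq_len, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq_len, 1), float("-inf"))  # running max per query row
    row_sum = torch.zeros(seq_len, 1)                  # running softmax normalizer

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]            # (B, d)
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                 # (seq_len, B)

        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        correction = torch.exp(row_max - new_max)      # rescale old accumulators
        p = torch.exp(scores - new_max)                # unnormalized block weights

        out = out * correction + p @ v_blk
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max

    return out / row_sum
```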
Quantization + adapters
LoRA (Hu et al., 2021) fine-tunes small low-rank adapters on top of frozen base weights; QLoRA (Dettmers et al., 2023) does the same over a 4-bit-quantized base, bringing a 65B-parameter fine-tune within reach of a single 48 GB GPU. GPTQ (Frantar et al., 2022) and AWQ (Lin et al., 2023) quantize trained models to 4-bit with minimal quality loss. The combined effect: a frontier-quality model that runs locally on a $1,500 GPU.
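A minimal sketch of the LoRA idea (not the `peft` library's API): a frozen base linear layer plus a trainable low-rank update scaled by alpha/r. QLoRA additionally stores the frozen base in 4-bit NF4, which this sketch omits.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Wrapping, say, the attention projection layers of a frozen transformer with this module and training only A and B is the essence of LoRA fine-tuning; "merging" afterwards means folding scale * B @ A back into the base weight matrix.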
Defined terms (4)
- FlashAttention
- IO-aware exact attention algorithm by Dao et al. (2022) that reduces memory pressure during attention computation without changing outputs.
- Quantization
- Reducing the bit-precision of model weights (typically from 16-bit to 4-bit or 8-bit) to lower memory footprint and inference cost; a toy round-trip sketch follows this list.
- LoRA
- Low-Rank Adaptation. Trains a small low-rank adapter alongside frozen base-model weights; the adapter can be merged into the base after training. Drastically cheaper than full fine-tuning.
- PagedAttention
- KV-cache management technique from vLLM that treats GPU memory like OS-managed pages, allowing flexible request scheduling at high throughput.
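To make the quantization entry concrete, here is a toy round-trip sketch using group-wise symmetric absmax int4. This is a deliberate simplification of what GPTQ, AWQ, and GGUF formats actually do (they add error-compensating updates, activation-aware scaling, and packed storage), but it shows where the memory savings and the approximation error come from.

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 64):
    """Symmetric absmax quantization to 4-bit integers, one scale per group."""
    w = w.reshape(-1, group_size)
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)  # int4 range ~ [-7, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).reshape(-1)

w = torch.randn(4096)                 # stand-in for one weight row
q, scale = quantize_int4_groupwise(w)
w_hat = dequantize(q, scale)
print("mean abs error:", (w - w_hat).abs().mean().item())
```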
All claims in this topic (10)
- GPTQ · introduced in Frantar et al. 2022 — accurate post-training quantization for GPT models (1.00 · 2 sources)
- llama.cpp · publicly released on 2023-03-10 by Georgi Gerganov (1.00 · 2 sources)
- LoRA (Low-Rank Adaptation) · introduced in paper LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021) (1.00 · 2 sources)
- Low-Rank Adaptation (LoRA) · introduced in paper LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021) (1.00 · 2 sources)
- QLoRA · introduced in paper QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023) (1.00 · 2 sources)
- SGLang · introduced in Zheng et al. 2024 — efficient LLM serving with structured outputs (1.00 · 2 sources)
- Speculative decoding · introduced in Leviathan, Kalman, Matias 2023 — Google Research (1.00 · 2 sources)
- Triton Inference Server · publicly released in 2018-11 by NVIDIA — formerly TensorRT Inference Server (1.00 · 2 sources)
- vLLM · introduced in Kwon et al. 2023 — high-throughput LLM serving via PagedAttention (1.00 · 2 sources)
- Groq LPU · publicly released on 2024-02-19 by Groq — language processing unit inference (0.95 · 2 sources)