Topic hub · 10 claims
Inference optimization — quantization, attention, and serving
The techniques that take a frontier model from "impossible to deploy" to "$0.001 per call." Quantization, attention algorithms, fine-tuning adapters, and serving systems.
The inference-cost wall
Training a 70B-parameter model is expensive once; running it for millions of users is expensive forever. Inference optimization has driven most of the practical-deployment progress from 2022 to 2025, along three axes: faster attention (FlashAttention, PagedAttention), smaller weights (GPTQ, AWQ, QLoRA, GGUF quantization), and better serving (vLLM, llama.cpp, Ollama, TGI).
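A minimal sketch of the serving axis, assuming vLLM's offline Python API; the model ID below is a placeholder for a 4-bit AWQ checkpoint, and the quantization and sampling settings would vary by deployment.

```python
# Sketch: batched generation with vLLM's offline API on a quantized model.
# Model ID and settings are illustrative placeholders, not recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # placeholder: a 4-bit AWQ checkpoint
    quantization="awq",               # tell vLLM the weights are AWQ-quantized
)
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches these requests and pages their KV caches internally.
prompts = [
    "Summarize FlashAttention in one sentence.",
    "Why is 4-bit weight quantization often nearly lossless?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```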
Attention improvements
FlashAttention (Dao et al., 2022) computes exact attention with IO-aware tiling, never materializing the full attention matrix in GPU high-bandwidth memory, so it gives the same output with far less memory traffic. PagedAttention (Kwon et al., 2023; the core of vLLM) manages the KV cache in fixed-size blocks the way an OS manages virtual-memory pages, eliminating fragmentation and enabling dense request batching. Together they unlock context lengths and throughput that were previously impractical on commodity hardware.
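To make the tiling idea concrete, here is a minimal PyTorch sketch of online-softmax attention over key/value blocks. It reproduces plain (non-causal) attention numerically, but unlike the real fused FlashAttention kernel it is not IO-optimized; it exists only to show the running-max and normalizer bookkeeping that lets the full score matrix stay unmaterialized.

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """q, k, v: (seq_len, head_dim). Returns softmax(q k^T / sqrt(d)) @ v,
    processing keys/values one block at a time (online softmax)."""
    seq_len, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq_len, 1), float("-inf"))  # running max per query row
    row_sum = torch.zeros(seq_len, 1)                  # running softmax normalizer

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]            # (B, d)
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                 # (seq_len, B)

        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        correction = torch.exp(row_max - new_max)      # rescale old accumulators
        p = torch.exp(scores - new_max)                # unnormalized block weights

        out = out * correction + p @ v_blk
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max

    return out / row_sum
```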
Quantization + adapters
LoRA (Hu et al., 2021) fine-tunes small low-rank adapters on top of frozen base weights; QLoRA (Dettmers et al., 2023) does the same over a 4-bit-quantized base, bringing a 65B-parameter fine-tune within reach of a single 48 GB GPU. GPTQ (Frantar et al., 2022) and AWQ (Lin et al., 2023) quantize trained models to 4-bit with minimal quality loss. The combined effect: a frontier-quality model that runs locally on a $1,500 GPU.
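A minimal sketch of the LoRA idea (not the `peft` library's API): a frozen base linear layer plus a trainable low-rank update scaled by alpha/r. QLoRA additionally stores the frozen base in 4-bit NF4, which this sketch omits.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Wrapping, say, the attention projection layers of a frozen transformer with this module and training only A and B is the essence of LoRA fine-tuning; "merging" afterwards means folding scale * B @ A back into the base weight matrix.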
Defined terms (4)
- FlashAttention
- IO-aware exact attention algorithm by Dao et al. (2022) that reduces memory pressure during attention computation without changing outputs.
- Quantization
- Reducing the bit-precision of model weights (typically from 16-bit to 4-bit or 8-bit) to lower memory footprint and inference cost; a toy round-trip sketch follows this list.
- LoRA
- Low-Rank Adaptation. Trains a small low-rank adapter alongside frozen base-model weights; the adapter can be merged into the base after training. Drastically cheaper than full fine-tuning.
- PagedAttention
- KV-cache management technique from vLLM that treats GPU memory like OS-managed pages, allowing flexible request scheduling at high throughput.
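To make the quantization entry concrete, here is a toy round-trip sketch using group-wise symmetric absmax int4. This is a deliberate simplification of what GPTQ, AWQ, and GGUF formats actually do (they add error-compensating updates, activation-aware scaling, and packed storage), but it shows where the memory savings and the approximation error come from.

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 64):
    """Symmetric absmax quantization to 4-bit integers, one scale per group."""
    w = w.reshape(-1, group_size)
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp(min=1e-8)  # int4 range ~ [-7, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).reshape(-1)

w = torch.randn(4096)                 # stand-in for one weight row
q, scale = quantize_int4_groupwise(w)
w_hat = dequantize(q, scale)
print("mean abs error:", (w - w_hat).abs().mean().item())
```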
All claims in this topic (10)
- GPTQ · introduced in Frantar et al. 2022 — accurate post-training quantization for GPT models (1.00 · 2 sources)
- llama.cpp · publicly released on 2023-03-10 by Georgi Gerganov (1.00 · 2 sources)
- LoRA (Low-Rank Adaptation) · introduced in paper LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021) (1.00 · 2 sources)
- Low-Rank Adaptation (LoRA) · introduced in paper LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021) (1.00 · 2 sources)
- QLoRA · introduced in paper QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023) (1.00 · 2 sources)
- SGLang · introduced in Zheng et al. 2024 — efficient LLM serving with structured outputs (1.00 · 2 sources)
- Speculative decoding · introduced in Leviathan, Kalman, Matias 2023 — Google Research (1.00 · 2 sources)
- Triton Inference Server · publicly released in 2018-11 by NVIDIA — formerly TensorRT Inference Server (1.00 · 2 sources)
- vLLM · introduced in Kwon et al. 2023 — high-throughput LLM serving via PagedAttention (1.00 · 2 sources)
- Groq LPU · publicly released on 2024-02-19 by Groq — language processing unit inference (0.95 · 2 sources)