vLLM introduced in: Kwon et al. 2023 — high-throughput LLM serving via PagedAttention.
Kwon et al. 2023 — high-throughput LLM serving via PagedAttention
Primary source · preprint · 2023-09-12
Efficient Memory Management for Large Language Model Serving with PagedAttention — arXiv:2309.06180 (Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica / UC Berkeley)