Topic hub · 79 claims
Foundational AI/ML papers — the canonical reading list
The papers that everything builds on. Each is hand-verified against the primary source — author, date, venue, and a verbatim excerpt from the abstract.
Why a canonical reading list matters
Production AI engineers don't have time to chase publication dates through sometimes-wrong blog posts. "When was the Transformer paper published?" should be a 100ms lookup, not a 10-minute SERP triangulation. This hub catalogs the foundational papers with verified dates, authors, venues, and verbatim excerpts; nearly every claim is backed by at least two primary sources.
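The lookup itself is trivial once the metadata is verified. A minimal sketch, assuming nothing about this hub's actual storage or API (the `PAPERS` dict and `cite` helper are illustrative names, seeded with three entries from the list below):

```python
# Illustrative only: a hand-rolled catalog keyed by claim name, so
# "when was X published?" is a dict lookup instead of a web search.
PAPERS = {
    "Transformer": ("Attention Is All You Need", "Vaswani et al.", 2017),
    "BERT": ("BERT: Pre-training of Deep Bidirectional Transformers "
             "for Language Understanding", "Devlin et al.", 2018),
    "LoRA": ("LoRA: Low-Rank Adaptation of Large Language Models",
             "Hu et al.", 2021),
}

def cite(name: str) -> str:
    """Return a one-line citation, or flag the claim as uncataloged."""
    if name not in PAPERS:
        return f"{name}: not in catalog"
    title, authors, year = PAPERS[name]
    return f"{name}: {title} ({authors}, {year})"

print(cite("Transformer"))
# Transformer: Attention Is All You Need (Vaswani et al., 2017)
```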
Pre-Transformer era
The deep-learning revival ran on architectures and ideas that pre-date the Transformer. LSTM (Hochreiter & Schmidhuber 1997), Word2Vec (Mikolov et al. 2013), Dropout (Srivastava et al. 2014), and GloVe (Pennington, Socher, Manning 2014) form the recurrent + embedding foundation that the Transformer era (2017 onward) would surpass but not erase.
Transformer + pretraining era (2017-2020)
Attention Is All You Need (Vaswani et al. 2017) opened the door. BERT (Devlin et al. 2018) defined the encoder-only branch. GPT-2 (Radford et al. 2019) scaled the decoder-only recipe that would eventually power frontier models. T5 (Raffel et al. 2019), RoBERTa (Liu et al. 2019), DistilBERT (Sanh et al. 2019), and ELECTRA (Clark et al. 2020) each refined the pretraining recipe.
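The encoder-only / decoder-only split those papers established comes down to the attention mask: BERT-style encoders attend bidirectionally, GPT-style decoders mask out future positions. A minimal numpy sketch of the scaled dot-product attention from Attention Is All You Need, where the `causal` flag is the illustrative difference (a sketch, not any model's actual implementation):

```python
import numpy as np

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention (Vaswani et al. 2017):
    softmax(Q K^T / sqrt(d_k)) V, optionally with a causal mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # (seq, seq) similarities
    if causal:
        # Decoder-only (GPT-style): each position attends only to
        # itself and earlier positions; -inf above the diagonal.
        scores += np.triu(np.full(scores.shape, -np.inf), k=1)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                     # 4 tokens, d_k = 8
print(attention(x, x, x, causal=False).shape)   # encoder-style (BERT)
print(attention(x, x, x, causal=True).shape)    # decoder-style (GPT-2)
```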
Frontier methods (2021-2025)
Once architectures stabilized, the innovation moved to alignment (RLHF, Constitutional AI, DPO), efficient fine-tuning and inference (LoRA, QLoRA, GPTQ, FlashAttention, vLLM), retrieval grounding (RAG, Self-RAG, ReAct), and tool use (Toolformer, MCP). Each claim here is a paper that downstream work builds on.
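To make the efficient fine-tuning thread concrete, here is a minimal numpy sketch of the LoRA parameterization from Hu et al. 2021: freeze the pretrained weight W and train only a low-rank product B·A added on top. The class name, initialization constants, and hyperparameter defaults are illustrative, not the paper's reference code:

```python
import numpy as np

class LoRALinear:
    """Minimal sketch of the LoRA update (Hu et al. 2021): keep the
    pretrained weight W frozen and train only a low-rank product B @ A,
    so the effective weight is W + (alpha / r) * (B @ A)."""

    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                 # frozen pretrained weight
        d_out, d_in = W.shape
        self.A = rng.normal(0.0, 0.01, (r, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, r))              # trainable up-projection;
        self.scale = alpha / r                     # zero init => no-op at start

    def __call__(self, x):
        # Adapter adds r*(d_in + d_out) trainable params instead of d_in*d_out.
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

W = np.random.default_rng(1).normal(size=(16, 32))  # toy 16x32 layer
layer = LoRALinear(W, r=4)
x = np.ones(32)
assert np.allclose(layer(x), W @ x)                 # B == 0: exact no-op
```

Because B starts at zero, the adapter is an exact no-op before training; QLoRA (Dettmers et al., 2023, below) keeps the same parameterization but stores the frozen base weights in 4-bit.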
Defined terms (3)
- Foundational paper
- A research paper that other AI/ML papers cite as the canonical reference for an architecture, method, or technique.
- Pretraining
- Training a model on a large general dataset before fine-tuning for a downstream task.
- RLHF
- Reinforcement learning from human feedback — the alignment technique that produced InstructGPT and ChatGPT.
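At the reward-modeling step, the RLHF defined above reduces to a pairwise preference loss (Christiano et al. 2017; Ouyang et al. 2022): score the human-preferred response above the rejected one. A minimal sketch with toy scalar rewards (the function name is illustrative):

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Pairwise Bradley-Terry preference loss used in RLHF reward
    modeling: -log sigmoid(r_chosen - r_rejected). Minimizing it pushes
    the reward model to rank preferred responses above rejected ones."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Toy scalar rewards for one (chosen, rejected) response pair:
print(preference_loss(2.0, 0.5))   # small loss: ranking already correct
print(preference_loss(0.5, 2.0))   # large loss: ranking inverted
```

DPO (Rafailov et al., 2023, below) folds this same objective directly into the policy's log-probabilities, skipping the separate reward model.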
All claims in this topic (79)
- Adam optimizer·introduced in paper Adam: A Method for Stochastic Optimization (Kingma, Ba, 2014)(1.00 · 2 sources)
- AdamW optimizer·introduced in paper Decoupled Weight Decay Regularization (Loshchilov & Hutter, 2017)(1.00 · 2 sources)
- AlexNet·introduced in paper ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky, Sutskever, Hinton, 2012)(1.00 · 2 sources)
- AlpacaEval·introduced in Li et al. 2023 — LLM-as-judge evaluation benchmark(1.00 · 2 sources)
- AlphaFold 1·introduced in Senior et al. 2020 — DeepMind protein structure prediction(1.00 · 2 sources)
- AlphaGo·defeated Lee Sedol 4-1 in March 2016(1.00 · 2 sources)
- AlphaZero·published in Science journal December 2018(1.00 · 2 sources)
- Anthropic HH-RLHF assistant·introduced in paper Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022)(1.00 · 2 sources)
- Backpropagation algorithm·popularized in Rumelhart, Hinton, Williams 1986 — Nature paper(1.00 · 2 sources)
- BART·introduced in Lewis et al. 2019 — denoising sequence-to-sequence pretraining(1.00 · 2 sources)
- Batch Normalization·introduced in paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Ioffe & Szegedy, 2015)(1.00 · 2 sources)
- BERT (Bidirectional Encoder Representations from Transformers)·introduced in paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)(1.00 · 2 sources)
- BLEU score·introduced in paper BLEU: a Method for Automatic Evaluation of Machine Translation (Papineni et al., 2002)(1.00 · 2 sources)
- Byte-Pair Encoding (BPE) for Neural Machine Translation·introduced in paper Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)(1.00 · 2 sources)
- C4 (Colossal Clean Crawled Corpus)·introduced in paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)(1.00 · 2 sources)
- Chain-of-Thought prompting·introduced in paper Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)(1.00 · 2 sources)
- Chatbot Arena·introduced in Zheng et al. 2023 — LMSYS open platform for evaluating LLMs by human preference(1.00 · 2 sources)
- Chinchilla scaling laws·introduced in paper Training Compute-Optimal Large Language Models (Hoffmann et al., 2022)(1.00 · 2 sources)
- CLIP (Contrastive Language-Image Pretraining)·introduced in paper Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021)(1.00 · 2 sources)
- Codex·introduced in paper Evaluating Large Language Models Trained on Code (Chen et al., 2021)(1.00 · 2 sources)
- Constitutional AI (CAI)·introduced in paper Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)(1.00 · 2 sources)
- Denoising Diffusion Probabilistic Models (DDPM)·introduced in paper Denoising Diffusion Probabilistic Models (Ho, Jain, Abbeel, 2020)(1.00 · 2 sources)
- Direct Preference Optimization (DPO)·introduced in paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023)(1.00 · 2 sources)
- DistilBERT·introduced in Sanh et al. 2019 — a smaller, faster, cheaper BERT via knowledge distillation(1.00 · 2 sources)
- Dropout·introduced in paper Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava et al., 2014)(1.00 · 2 sources)
- ELECTRA·introduced in paper ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (Clark et al., 2020)(1.00 · 2 sources)
- ELMo (Embeddings from Language Models)·introduced in paper Deep contextualized word representations (Peters et al., 2018)(1.00 · 2 sources)
- FAISS·introduced in Johnson, Douze, Jégou 2017 — Facebook AI Similarity Search(1.00 · 2 sources)
- Flamingo·introduced in Alayrac et al. 2022 — DeepMind few-shot vision-language model(1.00 · 2 sources)
- FlashAttention·introduced in paper FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022)(1.00 · 2 sources)
- Generative Adversarial Networks (GANs)·introduced in paper Generative Adversarial Networks (Goodfellow et al., 2014)(1.00 · 2 sources)
- GloVe·introduced in Pennington, Socher, Manning 2014 — global vectors for word representation(1.00 · 2 sources)
- GLUE benchmark·introduced in paper GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (Wang et al., 2018)(1.00 · 2 sources)
- GPT-2·introduced in paper Language Models are Unsupervised Multitask Learners (Radford et al., 2019)(1.00 · 2 sources)
- GPT-3·introduced in paper Language Models are Few-Shot Learners (Brown et al., 2020)(1.00 · 2 sources)
- GPTQ·introduced in Frantar et al. 2022 — accurate post-training quantization for GPT models(1.00 · 2 sources)
- HumanEval benchmark·introduced in paper Evaluating Large Language Models Trained on Code (Chen et al., 2021)(1.00 · 2 sources)
- Imagen·introduced in paper Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Saharia et al., 2022)(1.00 · 2 sources)
- ImageNet dataset·introduced in paper ImageNet: A Large-Scale Hierarchical Image Database (Deng et al., 2009)(1.00 · 2 sources)
- InstructGPT·introduced in paper Training language models to follow instructions with human feedback (Ouyang et al., 2022) — RLHF-tuned GPT-3, direct ancestor of ChatGPT(1.00 · 2 sources)
- Instructor library·introduced in Jason Liu 2023 — structured outputs from LLMs via Pydantic(1.00 · 2 sources)
- Knowledge Distillation·popularized in Hinton, Vinyals, Dean 2015 — distilling the knowledge in a neural network(1.00 · 2 sources)
- Latent Diffusion Models (LDM)·introduced in paper High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2021)(1.00 · 2 sources)
- Layer Normalization·introduced in paper Layer Normalization (Ba, Kiros, Hinton, 2016)(1.00 · 1 source)
- Long Short-Term Memory (LSTM)·introduced in 1997 by Hochreiter and Schmidhuber(1.00 · 2 sources)
- Longformer·introduced in paper Longformer: The Long-Document Transformer (Beltagy, Peters, Cohan, 2020)(1.00 · 2 sources)
- LoRA (Low-Rank Adaptation)·introduced in paper LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021)(1.00 · 2 sources)
- Mamba state-space model·introduced in paper Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu, Dao, 2023)(1.00 · 2 sources)
- Mamba-2·introduced in Dao & Gu 2024 — structured state space duality(1.00 · 2 sources)
- MMLU benchmark·introduced in paper Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020)(1.00 · 2 sources)
- MTEB benchmark·introduced in Muennighoff et al. 2022 — Massive Text Embedding Benchmark(1.00 · 2 sources)
- PaLM·introduced in paper PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022)(1.00 · 2 sources)
- Proximal Policy Optimization (PPO)·introduced in paper Proximal Policy Optimization Algorithms (Schulman et al., 2017)(1.00 · 2 sources)
- QLoRA·introduced in paper QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)(1.00 · 2 sources)
- ReAct (Reasoning + Acting)·introduced in paper ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022)(1.00 · 2 sources)
- Reformer·introduced in paper Reformer: The Efficient Transformer (Kitaev, Kaiser, Levskaya, 2020)(1.00 · 2 sources)
- Reinforcement Learning from Human Feedback (RLHF)·introduced in paper Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017)(1.00 · 3 sources)
- ResNet (Residual Networks)·introduced in paper Deep Residual Learning for Image Recognition (He et al., 2015)(1.00 · 2 sources)
- Retrieval-Augmented Generation (RAG)·introduced in paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (Lewis et al., 2020)(1.00 · 2 sources)
- RoBERTa·introduced in Liu et al. 2019 — A Robustly Optimized BERT Pretraining Approach(1.00 · 2 sources)
- Rotary Position Embedding (RoPE)·introduced in paper RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021)(1.00 · 2 sources)
- ROUGE score·introduced in paper ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004)(1.00 · 2 sources)
- Self-RAG·introduced in Asai et al. 2023 — self-reflective retrieval-augmented generation(1.00 · 2 sources)
- SentencePiece tokenizer·introduced in paper SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (Kudo & Richardson, 2018)(1.00 · 2 sources)
- Sequence-to-Sequence Learning (seq2seq)·introduced in paper Sequence to Sequence Learning with Neural Networks (Sutskever, Vinyals, Le, 2014)(1.00 · 2 sources)
- SGLang·introduced in Zheng et al. 2024 — efficient LLM serving with structured outputs(1.00 · 2 sources)
- Sparsely-Gated Mixture-of-Experts (MoE)·introduced in paper Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Shazeer et al., 2017)(1.00 · 1 source)
- Speculative decoding·introduced in Leviathan, Kalman, Matias 2023 — Google Research(1.00 · 2 sources)
- SuperGLUE benchmark·introduced in paper SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (Wang et al., 2019)(1.00 · 2 sources)
- Switch Transformer·introduced in paper Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Fedus et al., 2021)(1.00 · 2 sources)
- T5 (Text-to-Text Transfer Transformer)·introduced in paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)(1.00 · 2 sources)
- Toolformer·introduced in Schick et al. 2023 — self-supervised LLM tool-use(1.00 · 2 sources)
- Transformer architecture·introduced in paper Attention Is All You Need (Vaswani et al., 2017)(1.00 · 3 sources)
- Tree of Thoughts·introduced in Yao et al. 2023 — deliberate problem solving with LLMs(1.00 · 2 sources)
- U-Net·introduced in Ronneberger, Fischer, Brox 2015 — biomedical image segmentation(1.00 · 2 sources)
- Variational Autoencoder (VAE)·introduced in paper Auto-Encoding Variational Bayes (Kingma, Welling, 2013)(1.00 · 2 sources)
- Vision Transformer (ViT)·introduced in paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2020)(1.00 · 2 sources)
- vLLM·introduced in Kwon et al. 2023 — high-throughput LLM serving via PagedAttention(1.00 · 2 sources)
- Word2Vec·introduced in paper Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013)(1.00 · 2 sources)