Tag: alignment
6 verified claims carrying this tag. Each has 2+ primary sources and an HMAC-SHA256 signature.
Reinforcement Learning from Human Feedback (RLHF) introduced in paper: Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017).
67866330cd60e54d · 3 sources · 100% confidence
Direct Preference Optimization (DPO) introduced in paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023).
a3e691683a4577af · 2 sources · 100% confidence
Constitutional AI (CAI) introduced in paper: Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022).
ba1eb83c14795107 · 2 sources · 100% confidence
InstructGPT methodology introduced in paper: Training language models to follow instructions with human feedback (Ouyang et al., 2022).
5da8f8dffc038b8e · 2 sources · 100% confidence
InstructGPT introduced in Ouyang et al. 2022: an RLHF-tuned GPT-3 and the direct ancestor of ChatGPT.
590b9de765b8126e · 2 sources · 100% confidence
Anthropic helpful and harmless assistant training introduced in paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022).
6fa575eb9df5ac32 · 2 sources · 100% confidence
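The 16-hex-character IDs above are consistent with an HMAC-SHA256 digest truncated to 64 bits. A minimal sketch of how such a signature could be computed, assuming the claim text is the signed message, the key shown is hypothetical (the real signing key is not part of this listing), and the truncation length is inferred from the ID width:

```python
import hmac
import hashlib

def sign_claim(claim_text: str, key: bytes) -> str:
    """Return a truncated HMAC-SHA256 signature over a claim's text.

    Assumption: the record IDs are the first 16 hex chars (64 bits)
    of the full 256-bit digest.
    """
    digest = hmac.new(key, claim_text.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]

# Hypothetical key for illustration only; this will not reproduce
# the signatures listed above.
key = b"example-signing-key"
sig = sign_claim(
    "Reinforcement Learning from Human Feedback (RLHF) introduced in paper: "
    "Deep Reinforcement Learning from Human Preferences (Christiano et al., 2017).",
    key,
)
print(sig)  # 16 lowercase hex characters, deterministic for a fixed key
```

Because HMAC is deterministic for a fixed key, re-signing an unchanged claim yields the same ID, so any edit to a claim's text is detectable by recomputing and comparing signatures.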