Topic hub · 19 claims
Multimodal AI — vision, image generation, and cross-modal models
Models that combine vision, text, audio, or video. Hand-verified release dates, foundational papers, and the organizations behind them.
The vision-language unification
Until 2021, vision and language were largely separate research stacks. CLIP (Radford et al., OpenAI 2021) unified them with contrastive image-text pretraining. Flamingo (DeepMind 2022) showed that bridging a frozen vision encoder to a frozen language model enables few-shot learning on vision-language tasks. By 2024 every frontier family (GPT-4o, Claude 3, Gemini 1.5/2.0) shipped with native vision, and GPT-4o and Gemini processed audio, vision, and text in a single forward pass.
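The contrastive objective itself is compact. Below is a minimal PyTorch sketch of a CLIP-style symmetric loss; the encoder networks, batch construction, and learned (rather than fixed) temperature are simplified away, so treat it as an illustration of the idea, not OpenAI's training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of N matched image-text pairs."""
    # L2-normalize embeddings so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature

    # Matched pairs sit on the diagonal, so the target class is the row index.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Pull matched pairs together, push mismatched ones apart, in both directions.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```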
Image generation — diffusion takes over
GANs (Goodfellow et al. 2014) ruled image synthesis for roughly seven years. Then diffusion took over: DDPM (Ho et al. 2020), latent diffusion (Rombach et al. 2021), Stable Diffusion (CompVis 2022), Imagen (Google 2022), DALL·E 3 (OpenAI 2023), Stable Diffusion 3 (Stability AI 2024). Each generation improved photorealism and prompt-following. The ecosystem split between closed, API-only models (DALL·E, Imagen) and open-weights ones (Stable Diffusion, Flux).
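The core of DDPM is small enough to show directly. This is a minimal sketch of the forward noising process and the training target from Ho et al. 2020; the linear schedule values match the paper, while function and variable names are ours.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule from the paper
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal-retention factor

def ddpm_forward_sample(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) in closed form; no step-by-step noising loop."""
    noise = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over image dims
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise
    # Training: a network eps_theta(xt, t) regresses `noise` with an MSE loss;
    # sampling then runs the learned reversal from pure noise back to an image.
    return xt, noise
```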
Speech + video — the remaining modalities
Whisper (OpenAI 2022, large-v3 in 2023) made high-quality speech-to-text publicly available. Sora (OpenAI 2024) and Veo (Google 2024) opened up text-to-video. The pattern so far: within roughly a year of the breakthrough paper, each new modality becomes accessible behind a single API call.
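Whisper is the easiest of these to try locally. A minimal sketch using the open-source openai-whisper package; the audio filename is a placeholder, and ffmpeg must be installed for decoding:

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("large-v3")    # downloads the checkpoint on first use
result = model.transcribe("meeting.mp3")  # placeholder path to any audio file
print(result["text"])                     # the full transcript as one string
print(result["language"])                 # Whisper also detects the spoken language
```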
Defined terms (3)
- Multimodal model
- A model that accepts and/or generates more than one modality (text, image, audio, video) in a unified architecture.
- Diffusion model
- A generative model that learns to reverse a gradual noising process. Produces high-quality image, audio, and video samples (formalized after this list).
- Contrastive pretraining
- Training paradigm that learns by pulling matched pairs together and pushing unmatched pairs apart in embedding space. Used by CLIP (formalized after this list).
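Both definitions have compact formal statements. A sketch in standard notation, following Ho et al. 2020 and the InfoNCE formulation CLIP builds on; the symbols are the conventional ones, not quoted verbatim from either paper:

```latex
% Diffusion: forward (noising) process and its closed-form marginal
q(x_t \mid x_{t-1}) = \mathcal{N}\!\bigl(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\bigr),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\bigl(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\mathbf{I}\bigr)

% Contrastive pretraining: InfoNCE-style loss over N matched pairs (u_i, v_i)
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N}
  \log \frac{\exp(u_i^{\top} v_i / \tau)}{\sum_{j=1}^{N} \exp(u_i^{\top} v_j / \tau)}
```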
All claims in this topic (19)
- AlexNet·introduced in paper ImageNet Classification with Deep Convolutional Neural Networks (Krizhevsky, Sutskever, Hinton, 2012)(1.00 · 2 sources)
- Black Forest Labs Flux·publicly released on 2024-08-01 — Flux.1 [pro/dev/schnell] image generation(1.00 · 2 sources)
- CLIP·introduced in paper Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021)(1.00 · 2 sources)
- CLIP (Contrastive Language-Image Pretraining)·introduced in paper Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021)(1.00 · 2 sources)
- DALL·E 3·announced on 2023-09-20(1.00 · 1 source)
- DALL·E 2·released on 2022-04-06(1.00 · 2 sources)
- Denoising Diffusion Probabilistic Models (DDPM)·introduced in paper Denoising Diffusion Probabilistic Models (Ho, Jain, Abbeel, 2020)(1.00 · 2 sources)
- Flamingo·introduced in Alayrac et al. 2022 — DeepMind few-shot vision-language model(1.00 · 2 sources)
- GPT-4 Vision·publicly released on 2023-09-25 by OpenAI(1.00 · 2 sources)
- GPT-4o·released on 2024-05-13(1.00 · 1 source)
- ImageNet dataset·introduced in paper ImageNet: A Large-Scale Hierarchical Image Database (Deng et al., 2009)(1.00 · 2 sources)
- Latent Diffusion Models (LDM)·introduced in paper High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2021)(1.00 · 2 sources)
- Llama 3.2 (multimodal release including 11B and 90B vision models)·released on 2024-09-25(1.00 · 2 sources)
- Llama 4·released on 2025-04-05 by Meta — Scout + Maverick + Behemoth lineup(1.00 · 2 sources)
- Midjourney·publicly released on 2022-07-12 — public beta launch(1.00 · 2 sources)
- ResNet (Residual Networks)·introduced in paper Deep Residual Learning for Image Recognition (He et al., 2015)(1.00 · 2 sources)
- Stable Diffusion 1.0·released on 2022-08-22(1.00 · 2 sources)
- Stable Diffusion 1.x·released on 2022-08-22(1.00 · 2 sources)
- Whisper large-v3·publicly released on 2023-11-06 by OpenAI(1.00 · 2 sources)