Multimodal AI — vision, image generation, and cross-modal models

Models that combine vision, text, audio, or video. Hand-verified release dates, foundational papers, and the organizations behind them.

The vision-language unification

Until 2021, vision and language were largely separate research stacks. CLIP (Radford et al., OpenAI 2021) unified them with contrastive image-text pretraining. Flamingo (DeepMind 2022) demonstrated few-shot multimodal learning. By 2024 every frontier family (GPT-4o, the Claude 3 models, Gemini 1.5/2.0) handled images natively, and GPT-4o and Gemini went further, processing vision, audio, and text in a single forward pass.
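
As one illustration of the contrastive objective CLIP popularized, here is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired image and text embeddings. The embedding dimension, batch size, and `temperature` value are assumptions for the example, not CLIP's exact configuration, and random tensors stand in for real encoder outputs.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits: logits[i, j] = sim(image_i, text_j).
    logits = image_emb @ text_emb.t() / temperature

    # Matched pairs sit on the diagonal; every other entry is a negative.
    targets = torch.arange(len(logits), device=logits.device)

    # Symmetric cross-entropy: pick the right caption for each image,
    # and the right image for each caption.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2

# Toy usage with random tensors standing in for encoder outputs.
batch, dim = 8, 512
loss = clip_style_loss(torch.randn(batch, dim), torch.randn(batch, dim))
```

Minimizing this loss pulls matched image-text pairs together and pushes mismatched pairs apart in the shared embedding space, which is the mechanism the "contrastive pretraining" definition below describes.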

Image generation — diffusion takes over

GANs (Goodfellow et al. 2014) ruled image synthesis for roughly seven years. Then diffusion arrived: DDPM (Ho et al. 2020), Stable Diffusion (CompVis 2022), Imagen (Google 2022), DALL·E 3 (OpenAI 2023), Stable Diffusion 3 (Stability AI 2024). Each generation refined photorealism and prompt-following. The community split between closed models (DALL·E, Imagen) and open-weight ones (Stable Diffusion, Flux).
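
To make "learning to reverse a noising process" concrete, below is a minimal sketch of a DDPM training step: sample a random timestep, corrupt a clean batch with the closed-form forward process, and train the network to predict the added noise. The linear beta schedule matches the values reported in the DDPM paper; the `model` argument is a placeholder for any noise-prediction network, not a specific architecture.

```python
import torch
import torch.nn.functional as F

# Linear noise schedule (DDPM uses betas from 1e-4 to 0.02 over T=1000 steps).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_training_loss(model, x0):
    """One DDPM training step on a clean image batch x0 of shape (B, C, H, W)."""
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,))      # random timestep per sample
    noise = torch.randn_like(x0)           # epsilon ~ N(0, I)

    # Closed-form forward process:
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * epsilon
    abar = alphas_cumprod[t].view(batch, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

    # The network learns to recover the noise from the corrupted sample;
    # sampling later runs this prediction in reverse, step by step.
    predicted_noise = model(x_t, t)
    return F.mse_loss(predicted_noise, noise)
```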

Speech + video — the remaining modalities

Whisper (OpenAI 2022, large-v3 2023) made high-quality speech-to-text public. Sora (OpenAI 2024) and Veo (Google DeepMind 2024) opened up text-to-video. The pattern so far: each new modality becomes accessible via a single API call within roughly a year of the breakthrough paper.
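
As a concrete example of that accessibility, transcription with the open-source whisper package takes a few lines; the model size and file name here are arbitrary choices for illustration.

```python
import whisper  # pip install openai-whisper

# Load a pretrained checkpoint; "base" trades accuracy for speed.
model = whisper.load_model("base")

# Transcribe a local audio file; Whisper handles audio decoding and
# language detection internally.
result = model.transcribe("audio.mp3")
print(result["text"])
```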

Defined terms (3)

Multimodal model
A model that accepts and/or generates more than one modality (text, image, audio, video) in a unified architecture.
Diffusion model
A generative model that learns to reverse a noising process. Produces high-quality image, audio, and video samples.
Contrastive pretraining
A training paradigm that learns by pulling matched pairs together and pushing unmatched pairs apart in embedding space. Used by CLIP.
