Multimodal
Models that process and generate multiple types of data - text, images, audio, and video - within a unified architecture.
Multimodal models accept inputs beyond text: images, audio clips, video frames, documents. Early AI systems used separate specialized models for each modality; multimodal architectures process them in a unified way, allowing the model to reason across modalities simultaneously.
Input modalities in 2025-2026
- Text: Universal baseline for all frontier models
- Images: GPT-4o, Claude 3+, Gemini, Llama 4, Qwen-VL - all support image understanding
- Video: Gemini 2.5 Pro, GPT-4o (limited), Amazon Nova Pro - analyze video frames in context
- Audio: GPT-4o, Gemini 2.5 Flash - real-time audio understanding and generation
- Documents (PDF/layout): Most frontier models can parse document structure via vision
How vision works in LLMs
Images are typically processed by a vision encoder (like ViT) that converts image patches into embeddings. These embeddings are projected into the LLM's token embedding space and concatenated with text tokens. The LLM then processes text and image tokens together with the same attention mechanism.
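The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the vision encoder is replaced by a single random linear map (a real ViT is a full transformer stack), and the names `patchify`, `w_encoder`, `w_proj`, and the sizes `D_VISION` and `D_MODEL` are hypothetical choices made for the example. Only the 224x224 input with 16x16 patches mirrors a common ViT configuration.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
PATCH, D_VISION, D_MODEL = 16, 64, 128      # hypothetical sizes for illustration

image = rng.random((224, 224, 3))           # stand-in for a real RGB image
patches = patchify(image, PATCH)            # one row per 16x16 patch

# Stand-in for the vision encoder: one linear map instead of a transformer.
w_encoder = rng.normal(size=(patches.shape[1], D_VISION))
vision_features = patches @ w_encoder

# Projection from the encoder's width into the LLM's token embedding width.
w_proj = rng.normal(size=(D_VISION, D_MODEL))
image_tokens = vision_features @ w_proj

# Stand-in for five embedded text tokens, then one joint sequence.
text_tokens = rng.normal(size=(5, D_MODEL))
sequence = np.concatenate([image_tokens, text_tokens], axis=0)

print(patches.shape, image_tokens.shape, sequence.shape)
```

After concatenation the LLM sees a single sequence of vectors with the same width, so its attention layers need no special handling to attend between image and text positions.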
Limitations
Spatial reasoning ("which object is to the left of the red box?") remains weak in most models. Fine-grained OCR in complex layouts is unreliable. Video understanding is limited to relatively short clips sampled at low frame rates. Audio understanding in non-English languages lags significantly behind text-only performance.
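One plausible contributor to the video limitation is context budget: every sampled frame occupies tokens in the same context window as the text. The arithmetic below is a rough sketch under stated assumptions. The helper names `frames_that_fit` and `max_sampling_fps` are hypothetical, and the 256 tokens per frame is an illustrative figure, not a documented value for any model.

```python
def frames_that_fit(context_tokens, tokens_per_frame, reserved_tokens=0):
    """Number of whole frames that fit once text/output tokens are reserved."""
    usable = context_tokens - reserved_tokens
    if usable <= 0:
        return 0
    return usable // tokens_per_frame

def max_sampling_fps(context_tokens, tokens_per_frame, clip_seconds, reserved_tokens=0):
    """Highest uniform sampling rate (frames per second) the budget allows."""
    frames = frames_that_fit(context_tokens, tokens_per_frame, reserved_tokens)
    return frames / clip_seconds

# Assumed numbers: a 1M-token context, 256 tokens per frame, a one-hour clip.
frames = frames_that_fit(1_000_000, 256)
fps = max_sampling_fps(1_000_000, 256, clip_seconds=3600)
print(frames, round(fps, 2))
```

Under these assumed numbers even a very large context supports only about one frame per second for an hour of video, which is consistent with the short-clip, low-frame-rate limitation described above.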
Models relevant to Multimodal
- Gemini 2.5 Pro: Google's most capable model, with a 1M-token context and top science benchmark scores
- GPT-4o: OpenAI's multimodal workhorse - fast, affordable, and widely integrated
- GPT-5: OpenAI's most capable general-purpose model, with strong multimodal and reasoning abilities
- Amazon Nova Pro: AWS's multimodal frontier model - natively integrated with Bedrock and the AWS ecosystem
- Llama 4: Meta's multimodal open-weights model family, with a 10M-token context window variant