Multimodal

Multimodal models accept inputs beyond text, such as images, audio clips, video frames, and documents. Early AI systems used a separate specialized model for each modality; multimodal architectures process all of them in a single model, which lets it reason across modalities at the same time.

Input modalities in 2025-2026

Text: Universal baseline for all frontier models
Images: GPT-4o, Claude 3.5 Sonnet, Gemini 2.5 Pro, Llama 3.2, Qwen-VL - all support image understanding
Video: Gemini 2.5 Pro (native video input), GPT-4o (limited frame rates), Claude 3.5 Sonnet (frames supplied as images) - analyze video frames in sequence
Audio: GPT-4o, Gemini 2.5 Flash - real-time audio understanding and generation
Documents (PDF/layout): Most frontier models parse document structure via vision, though OCR reliability varies

How vision works in LLMs

Images are typically processed by a vision encoder (like ViT) that converts image patches into embeddings. These embeddings are projected into the LLM's token embedding space and concatenated with text tokens. The LLM then processes text and image tokens together with the same attention mechanism.
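The pipeline described above can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions, not any particular model's implementation: the "vision encoder" is reduced to a single linear patch embedding (a real ViT adds position embeddings and transformer layers), all weights are random, and the names `patchify`, `linear`, `random_matrix`, `PATCH`, `D_VISION`, and `D_MODEL` are hypothetical.

```python
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened,
    non-overlapping patch vectors in row-major order."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def linear(rows, weight):
    """Multiply each row vector by a (d_in x d_out) weight matrix."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns]
            for row in rows]


def random_matrix(d_in, d_out, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]


rng = random.Random(0)
PATCH, D_VISION, D_MODEL = 4, 8, 16  # toy sizes, chosen for illustration

# 1. An 8x8 toy image becomes four 4x4 patches of 16 pixels each.
image = [[rng.random() for _ in range(8)] for _ in range(8)]
patches = patchify(image, PATCH)

# 2. Stand-in vision encoder: one linear map from pixels to features.
vision_features = linear(patches, random_matrix(PATCH * PATCH, D_VISION, rng))

# 3. Projector: map vision features into the LLM's embedding width.
image_tokens = linear(vision_features, random_matrix(D_VISION, D_MODEL, rng))

# 4. Text tokens come from an ordinary embedding-table lookup.
embedding_table = random_matrix(100, D_MODEL, rng)
text_ids = [5, 17, 42]
text_tokens = [embedding_table[i] for i in text_ids]

# 5. Concatenate: the LLM attends over one sequence of 4 + 3 vectors.
sequence = image_tokens + text_tokens
print(len(sequence), len(sequence[0]))  # 7 16
```

The point of the sketch is the shape bookkeeping: once the image patches have been projected to the same width as the text embeddings, the language model has no separate code path for them and treats them as extra positions in its input sequence.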

Limitations

Spatial reasoning (for example, deciding which object is to the left of the red box) remains weak in most models. Fine-grained OCR in complex layouts is unreliable. Most multimodal models still process video by sampling frames rather than analyzing continuous temporal information, so video understanding is limited to relatively short clips and lower frame rates than specialized video models handle. Audio understanding in non-English languages lags text-only performance significantly.
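To make the frame-sampling limitation concrete, here is a minimal sketch of uniform frame sampling under a frame budget. The function name `sample_frame_indices` and its parameters are hypothetical, and the policy (fixed sampling rate, then even thinning down to a cap) is an assumption for illustration; actual providers choose their own sampling strategies.

```python
def sample_frame_indices(total_frames, native_fps, sample_fps, max_frames):
    """Pick frame indices at roughly `sample_fps`, capped at `max_frames`.

    Hypothetical policy: take every Nth frame, then thin the candidates
    evenly if there are more than the budget allows.
    """
    if total_frames <= 0 or max_frames <= 0:
        return []
    step = max(1, round(native_fps / sample_fps))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices


# A 60-second clip at 30 fps has 1800 frames; with a 16-frame budget
# the model sees fewer than 1% of them.
picked = sample_frame_indices(total_frames=1800, native_fps=30,
                              sample_fps=1, max_frames=16)
print(len(picked), picked[:3])  # 16 [0, 90, 210]
```

Anything that happens entirely between two sampled frames, such as a brief gesture or a fast motion, is invisible to a model fed this way, which is one reason short clips and low frame rates are the practical limit.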
