Multimodal

Models that process and generate multiple types of data - text, images, audio, and video - within a unified architecture.

Multimodal models accept inputs beyond text, such as images, audio clips, video frames, and documents, and some also generate non-text output. Early AI systems used a separate specialized model for each modality; multimodal architectures handle all of them in one model, which lets it reason across modalities in a single pass.

Input modalities in 2025-2026

  • Text: Universal baseline for all frontier models
  • Images: GPT-4o, Claude 3+, Gemini, Llama 4, Qwen-VL - all support image understanding
  • Video: Gemini 2.5 Pro, GPT-4o (limited), Amazon Nova Pro - analyze video frames in context
  • Audio: GPT-4o, Gemini 2.5 Flash - real-time audio understanding and generation
  • Documents (PDF/layout): Most frontier models can parse document structure via vision

How vision works in LLMs

Images are typically processed by a vision encoder (such as a ViT, or Vision Transformer) that splits the image into fixed-size patches and converts each patch into an embedding. These embeddings are projected into the LLM's token embedding space and concatenated with the text token embeddings. The LLM then processes text and image tokens together with the same attention mechanism.
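
The flow described above (patches, vision embeddings, projection, concatenation) can be sketched in plain Python. This is a toy illustration under stated assumptions, not how any particular model implements it: the sizes PATCH, D_VISION, and D_MODEL are made up, the weights are random rather than learned, and a single linear map stands in for the full vision encoder. All function names are hypothetical.

```python
import random

# All sizes are illustrative assumptions, far smaller than in real models.
PATCH = 4       # patch side length in pixels
D_VISION = 8    # width of the vision encoder's embeddings
D_MODEL = 16    # width of the LLM's token embeddings

rng = random.Random(0)


def random_matrix(rows, cols):
    """Stand-in for learned weights: a rows x cols matrix of small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def patchify(image, patch):
    """Split an H x W grayscale image into flattened patch vectors, row-major."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def build_input_sequence(image, text_token_ids, embed_table, w_encoder, w_projector):
    """Return one sequence of D_MODEL-wide vectors: image tokens, then text tokens."""
    patches = patchify(image, PATCH)                       # (num_patches, PATCH * PATCH)
    vision_embeddings = matmul(patches, w_encoder)         # toy encoder: (num_patches, D_VISION)
    image_tokens = matmul(vision_embeddings, w_projector)  # projected: (num_patches, D_MODEL)
    text_tokens = [embed_table[i] for i in text_token_ids] # embedding lookup per token id
    return image_tokens + text_tokens                      # concatenate along the sequence axis


image = [[rng.random() for _ in range(8)] for _ in range(8)]  # 8 x 8 toy image
w_encoder = random_matrix(PATCH * PATCH, D_VISION)
w_projector = random_matrix(D_VISION, D_MODEL)
embed_table = random_matrix(100, D_MODEL)                     # toy vocabulary of 100 tokens

sequence = build_input_sequence(image, [5, 17, 42], embed_table, w_encoder, w_projector)
print(len(sequence), len(sequence[0]))  # 7 16
```

The 8 x 8 image yields four 4 x 4 patches, so the sequence holds four image tokens followed by three text tokens, all of width D_MODEL. Because every vector has the same width, the attention layers that follow need no special handling for image positions.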

Limitations

Spatial reasoning ("which object is to the left of the red box?") remains weak in most models. Fine-grained OCR in complex layouts is unreliable. Video understanding is limited to relatively short clips and low frame rates. Audio understanding in non-English languages lags well behind text-only performance.
