DeepSeek V4 Flash

Compact 284B-param open MoE that keeps a 1M context at a fraction of Pro's cost.

Context window: 1M tokens
Input (per 1M tokens): Free
Output (per 1M tokens): Free
Provider: DeepSeek

Open-weight (MIT license) - free to self-host. The official DeepSeek API is also available: deepseek-v4-flash costs approximately $0.14/1M input (cache miss) and $0.28/1M output; cache-hit input is ~$0.0028/1M. · Data verified 2026-07-02
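To make the hosted-API prices in the note above concrete, here is a minimal cost estimator. It is a sketch that assumes the approximate prices quoted there and that billing is linear in token counts; the function name `estimate_cost_usd` and its parameters are illustrative and not part of any DeepSeek SDK.

```python
# Approximate USD prices per 1M tokens, copied from the pricing note above.
PRICE_INPUT_CACHE_MISS = 0.14
PRICE_INPUT_CACHE_HIT = 0.0028
PRICE_OUTPUT = 0.28


def estimate_cost_usd(input_tokens, output_tokens, cached_input_tokens=0):
    """Estimate the cost of one request in USD.

    cached_input_tokens is the part of input_tokens billed at the cache-hit rate.
    Assumes simple linear billing (an assumption, not a documented guarantee).
    """
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    miss_tokens = input_tokens - cached_input_tokens
    total = (
        miss_tokens * PRICE_INPUT_CACHE_MISS
        + cached_input_tokens * PRICE_INPUT_CACHE_HIT
        + output_tokens * PRICE_OUTPUT
    )
    return total / 1_000_000


# Example: 200K-token prompt, 150K of it served from cache, 2K output tokens.
print(f"${estimate_cost_usd(200_000, 2_000, cached_input_tokens=150_000):.5f}")
```

Under these assumptions the example request costs roughly $0.008, and most of that comes from the 50K uncached input tokens.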

DeepSeek V4 Flash is the lightweight sibling of V4 Pro, released in preview on April 24, 2026 under the MIT license. It has 284B total parameters, of which 13B are active, yet it keeps the same 1M-token default context window and 384K max output as V4 Pro. Built on the same DeepSeek Sparse Attention architecture as V4 Pro, Flash targets high-throughput, cost-sensitive workloads while, per DeepSeek, closely approaching V4 Pro's reasoning quality.
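For a rough sense of what the 284B total / 13B active split means when self-hosting, the arithmetic can be sketched as below. The storage precisions are hypothetical choices for illustration, not a statement about the format of the released checkpoint, and the estimate covers weights only (no KV cache or activations). The helper names are made up for this example.

```python
TOTAL_PARAMS = 284e9   # total parameters, from the description above
ACTIVE_PARAMS = 13e9   # active parameters, from the description above

# Hypothetical storage precisions in bytes per parameter (assumptions).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, bytes_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def active_fraction(active_params, total_params):
    """Share of the parameters that is active."""
    return active_params / total_params


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_memory_gb(TOTAL_PARAMS, nbytes):.0f} GB of weights")
print(f"active share: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
```

In a typical MoE deployment all expert weights still need to be stored somewhere reachable, so the 13B active figure mainly lowers per-token compute rather than total weight storage.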

Capability index

Relative estimates (0-100) to place this model against its peers, grounded in published benchmarks.

[Chart: relative scores for Coding, Reasoning, Math, Multimodal, Long context, Speed, and Cost efficiency]

How to access it

Download open weights from Hugging Face (deepseek-ai/DeepSeek-V4-Flash) to self-host, or call the hosted DeepSeek API with model 'deepseek-v4-flash'. Also available via inference providers and locally through Ollama/vLLM.


Strengths

✓Open weights under MIT license with a small active-parameter footprint (13B) for cheaper serving
✓Full 1M-token context despite the compact size
✓Reasoning quality that closely approaches V4 Pro at much lower cost
✓Very low API pricing ($0.14/$0.28 per 1M) with steep cache-hit discounts
✓Tool calls, JSON mode, and both thinking/non-thinking modes

Best for developers who...

High-volume, cost-sensitive inference
Long-context tasks on a budget
Self-hosting on modest hardware
Fast agentic and chat workloads

When to choose it (and when not to)

Reach for DeepSeek V4 Flash when...

→You want most of V4 Pro's capability at a lower price and higher throughput
→You need long context but on a tighter compute or cost budget
→You are serving high request volumes where per-token cost dominates
→You want an open model small enough to self-host on modest multi-GPU setups

Look elsewhere if...

✕You need the absolute best coding/reasoning scores (choose V4 Pro)
✕You require a GA (non-preview) model with SLAs
✕You need multimodal (image/audio) input
✕Your task demands maximum accuracy on the hardest agentic benchmarks

How to use it

›Default to non-thinking mode for chat and simple tasks to minimize latency and cost; switch to thinking mode for hard reasoning
›Reuse stable prefixes/system prompts to trigger cache-hit pricing
›Chunk very long inputs with clear delimiters to make the most of the 1M window
›Only non-thinking mode supports FIM (fill-in-the-middle) completion - use it for code infilling
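The prefix-reuse and delimiter tips above can be combined when building a request. The following is a minimal sketch: the `<document>` delimiter format, the system prompt text, and the helper names are all hypothetical choices, and it assumes that caching keys on an identical leading portion of the prompt, as the tip about stable prefixes suggests.

```python
# Kept byte-for-byte identical across requests so it can form a reusable prefix.
STABLE_SYSTEM_PROMPT = (
    "You are a careful analyst. Answer only from the provided documents."
)


def wrap_documents(docs):
    """Join (title, text) pairs into one string with explicit delimiters."""
    parts = []
    for index, (title, text) in enumerate(docs, start=1):
        parts.append(
            f'<document index="{index}" title="{title}">\n{text}\n</document>'
        )
    return "\n\n".join(parts)


def build_messages(docs, question):
    """Put stable content first and the changing question last."""
    user_content = wrap_documents(docs) + "\n\nQuestion: " + question
    return [
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


docs = [("Q1 report", "Revenue rose."), ("Q2 report", "Revenue fell.")]
first = build_messages(docs, "What changed between quarters?")
second = build_messages(docs, "Which quarter was stronger?")
# Both requests share the system prompt and the full document block.
print(first[0] == second[0])
```

Placing the question after the documents means two requests over the same documents differ only at the very end, which is the layout most likely to benefit from prefix caching.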

Quickstart

Python

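from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible endpoint, so the OpenAI SDK works
# once base_url points at it.
client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",  # replace with your own key
    base_url="https://api.deepseek.com",
)

# Single-turn chat completion against V4 Flash.
resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Summarize this long report in 5 bullet points."}],
)
print(resp.choices[0].message.content)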

OpenAI-compatible endpoint. For local use, pull deepseek-ai/DeepSeek-V4-Flash from Hugging Face and serve with vLLM or Ollama.

API model id: deepseek-v4-flash
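The Strengths list mentions JSON mode. Assuming the endpoint accepts the OpenAI-style `response_format={"type": "json_object"}` parameter (an assumption here; check the DeepSeek API docs for V4 Flash), a request and a defensive parser might look like the sketch below. The `bullets` schema and both helper names are hypothetical.

```python
import json


def build_json_mode_request(text):
    """Return keyword arguments for client.chat.completions.create."""
    return {
        "model": "deepseek-v4-flash",
        "messages": [
            {
                "role": "system",
                # Hypothetical schema; describe the JSON you expect in the prompt.
                "content": 'Reply with a JSON object of the form {"bullets": ["..."]}.',
            },
            {"role": "user", "content": text},
        ],
        # Assumed OpenAI-compatible JSON mode switch.
        "response_format": {"type": "json_object"},
    }


def parse_bullets(raw):
    """Parse the model's reply and validate the hypothetical schema."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("bullets"), list):
        raise ValueError("expected a JSON object with a 'bullets' list")
    return [str(item) for item in data["bullets"]]


# Usage with the client from the Quickstart (not executed here):
# resp = client.chat.completions.create(**build_json_mode_request(report_text))
# print(parse_bullets(resp.choices[0].message.content))
```

Validating the parsed reply before using it keeps downstream code safe if the model returns valid JSON in an unexpected shape.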

Benchmarks

Benchmark	Score	Notes
Reasoning (vs V4 Pro)	Closely approaches V4 Pro	DeepSeek's announcement states V4-Flash's reasoning capabilities closely approach V4-Pro; no standalone numeric score is published on the announcement page.

Source: DeepSeek V4 Preview announcement (DeepSeek API Docs)

