Inference

Temperature

A sampling parameter that controls how random or deterministic an LLM's output is: lower values make the output more predictable, higher values make it more varied.

When an LLM generates the next token, it produces a raw score (a logit) for every token in its vocabulary, and a softmax turns those scores into a probability distribution. Temperature divides the logits before the softmax: low temperatures sharpen the distribution (the most likely token becomes overwhelmingly favored), while high temperatures flatten it (unlikely tokens get more chances to appear). A temperature of 1.0 leaves the model's distribution unchanged.
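The scaling step can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's implementation; the function name softmax_with_temperature and the three example logits are made up for the demonstration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores (logits) into probabilities, dividing by temperature first."""
    if temperature <= 0:
        # Dividing by zero is undefined; temperature 0 is usually treated as greedy decoding.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-token vocabulary.
logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)  # top token gains probability
base = softmax_with_temperature(logits, 1.0)   # unscaled distribution
flat = softmax_with_temperature(logits, 2.0)   # probabilities move closer together

for name, probs in [("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

The ranking of the tokens never changes with temperature; only the gap between their probabilities does.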

Temperature values in practice

  • 0: Greedy decoding - always pick the highest-probability token. Deterministic in principle, though hosted APIs can still show small run-to-run differences. Best for factual extraction, code generation, structured data.
  • 0.1 - 0.3: Near-deterministic. Small variation between runs. Good for most production tasks.
  • 0.7 - 1.0: Noticeable variation. Good for creative writing, brainstorming, diverse generation.
  • Above 1.0: Flatter than the model's unscaled distribution, so unlikely tokens are sampled often. Output quality degrades quickly. Rarely useful in practice.
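A rough way to see these ranges in action is to draw many tokens from the same logits at different temperatures. The sketch below assumes temperature 0 is implemented as a plain argmax, which is a common convention but not a guarantee about any specific API; sample_token and the example logits are hypothetical.

```python
import math
import random

def sample_token(logits, temperature, rng=random):
    """Return the index of one sampled token; temperature 0 means greedy decoding."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    weights = [math.exp(s - peak) for s in scaled]
    # random.choices normalises the weights itself.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical logits for a four-token vocabulary.
logits = [1.0, 3.0, 2.0, -1.0]
rng = random.Random(0)  # seeded so the experiment is repeatable

for temperature in (0, 0.2, 1.0, 2.0):
    draws = [sample_token(logits, temperature, rng) for _ in range(1000)]
    counts = [draws.count(i) for i in range(len(logits))]
    print(f"temperature={temperature}: counts per token = {counts}")
```

At temperature 0 every draw is the same token; as temperature rises, the counts spread across the lower-scoring tokens.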

Temperature and token sampling

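Temperature works together with top-p (nucleus) sampling, which restricts sampling to the smallest set of most-likely tokens whose cumulative probability reaches p. A common production setting is temperature=0.2, top-p=0.95 for most tasks - this gives slightly varied output while cutting off the low-probability tail that produces degenerate tokens.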
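The combination can be sketched as follows. This assumes temperature is applied first and the top-p cut second, which is one common ordering; real inference stacks may differ. The names nucleus and sample_top_p and the example logits are hypothetical.

```python
import math
import random

def nucleus(probs, top_p):
    """Indices of the smallest set of most-likely tokens whose probabilities sum to >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

def sample_top_p(logits, temperature=0.2, top_p=0.95, rng=random):
    """Scale by temperature (must be > 0), keep the nucleus, then sample from it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    kept = nucleus(probs, top_p)
    # Sample only among the kept tokens, weighted by their probabilities.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

# Hypothetical logits; the last token is a very unlikely one.
logits = [2.0, 1.0, 0.0, -3.0]
print(sample_top_p(logits))                                 # temperature=0.2, top_p=0.95
print(sample_top_p(logits, temperature=1.0, top_p=0.95))    # wider nucleus, more variety
```

With these example logits, temperature=0.2 concentrates so much probability on the first token that the nucleus contains only that token; at temperature=1.0 several tokens remain eligible, but the tail is still excluded.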

Temperature does not equal quality

A common misconception is that higher temperature means worse quality and lower temperature means better quality. The right temperature depends on the task: code generated at temperature=1.0 tends to be unreliable, while creative writing at temperature=0 tends to read as flat and repetitive. Match temperature to the task.

Related terms