AI Glossary

Plain-English explanations of LLM and AI terms that matter for developers. No fluff, no hype - just what you need to understand to build with AI.

Core Concepts

Agentic AI

AI systems that autonomously execute multi-step tasks by deciding which tools to use, in what order, and when to stop.
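The decide-act-stop cycle described above can be sketched as a small loop. This is a minimal illustration, not a real agent framework: `scripted_decide` is a hypothetical stand-in for the LLM that picks the next tool, and the `search` and `summarize` tools are made-up stubs.

```python
def run_agent(decide, tools, goal, max_steps=5):
    """Ask `decide` for the next action until it says stop or the step budget runs out."""
    history = []
    for _ in range(max_steps):
        action = decide(goal, history)
        if action["tool"] == "stop":
            break
        result = tools[action["tool"]](action["input"])
        history.append((action["tool"], result))
    return history


def scripted_decide(goal, history):
    """Hypothetical stand-in for an LLM: search first, then summarize, then stop."""
    if not history:
        return {"tool": "search", "input": goal}
    if len(history) == 1:
        return {"tool": "summarize", "input": history[-1][1]}
    return {"tool": "stop", "input": None}


# Made-up tools; a real system would call APIs, databases, or code here.
TOOLS = {
    "search": lambda query: f"results for {query}",
    "summarize": lambda text: f"summary of {text}",
}

steps = run_agent(scripted_decide, TOOLS, "llm pricing")
```

The `max_steps` cap is a common safeguard so that a model that never chooses to stop cannot loop forever.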

Context Window

The maximum number of tokens an LLM can process in a single inference call, including both input (prompt, context, history) and output (generated response).
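Because input and output share one budget, a long prompt leaves less room for the response. A minimal sketch of that arithmetic follows; the 8,192-token window is a hypothetical figure, and real limits vary by model.

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def max_output_budget(prompt_tokens: int, context_window: int) -> int:
    """Tokens left for the response once the prompt is counted (never negative)."""
    return max(0, context_window - prompt_tokens)


CONTEXT_WINDOW = 8192  # hypothetical window size, for illustration only

ok = fits_in_context(7000, 1000, CONTEXT_WINDOW)        # prompt + output fits
too_long = fits_in_context(7500, 1000, CONTEXT_WINDOW)  # exceeds the window
```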

Inference

The process of running a trained model to generate outputs from inputs. Also called serving, as opposed to training.

Tokenization

The process of converting text into a sequence of integers that an LLM can process, where each integer represents a token - typically a subword fragment, full word, or punctuation character.
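To make the text-to-integers step concrete, here is a toy greedy longest-match tokenizer. The vocabulary and IDs are invented for illustration; production tokenizers learn their vocabularies from data (for example with byte-pair encoding) and are far larger.

```python
# Hypothetical toy vocabulary: token string -> integer ID.
VOCAB = {"token": 0, "ization": 1, "un": 2, "believ": 3, "able": 4, " ": 5, "!": 6}
UNK_ID = 7  # fallback ID for characters the vocabulary cannot cover


def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization against VOCAB."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            ids.append(UNK_ID)  # no vocabulary entry starts at this position
            i += 1
    return ids
```

With this vocabulary, a single word such as "tokenization" splits into two subword tokens, which is why token counts and word counts differ.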

Architecture

Embeddings

Numerical vector representations of text, images, or other data that capture semantic meaning so similar content has similar vector coordinates.
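"Similar vector coordinates" is usually measured with cosine similarity. The sketch below uses hand-written three-dimensional vectors as stand-ins; real embeddings come from a model and typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, chosen by hand for illustration.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.75, 0.2],
    "invoice": [0.1, 0.2, 0.95],
}

related = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["kitten"])
unrelated = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["invoice"])
```

Here `related` comes out much higher than `unrelated`, which is the property that semantic search relies on.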

MCP (Model Context Protocol)

An open protocol released by Anthropic that standardizes how AI assistants connect to external data sources, APIs, and tools through a unified interface.

MoE (Mixture of Experts)

A neural network architecture where many specialized sub-networks (experts) exist within a single model, but only a subset is activated for each input token, enabling large total parameter counts with lower per-token compute costs.
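The "only a subset is activated" part is commonly implemented with top-k routing. The sketch below assumes the router scores are already given and uses scalar functions as hypothetical experts; in a real model a learned layer computes the scores per token, and each expert is a feed-forward network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])  # renormalize over the chosen experts
    return dict(zip(top, weights))


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by routing weight."""
    routing = route_top_k(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in routing.items())


# Hypothetical experts; only two of the four run for this input.
EXPERTS = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x ** 2]
output = moe_forward(3.0, EXPERTS, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
```

All four experts count toward the model's total parameters, but each input pays the compute cost of only `k` of them.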

Multimodal

Models that process and generate multiple types of data - text, images, audio, and video - within a unified architecture.

RAG (Retrieval-Augmented Generation)

A technique that improves LLM responses by retrieving relevant documents from an external knowledge base before generation, reducing hallucinations and enabling access to information beyond training data.
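The retrieve-then-generate flow can be sketched without any model at all. This example ranks documents by simple word overlap, which is an assumption made to keep it self-contained; real systems usually rank by embedding similarity. The knowledge base and prompt wording are invented.

```python
import re


def words(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=2):
    """Rank documents by how many query words they share (toy relevance score)."""
    query_words = words(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & words(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, documents, top_k=2):
    """Place the retrieved documents ahead of the question in the prompt."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base.
KNOWLEDGE_BASE = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping to Canada takes 5 to 7 business days.",
]

prompt = build_prompt("How long is the refund window?", KNOWLEDGE_BASE, top_k=1)
```

The resulting prompt is what gets sent to the LLM, so the model can answer from the retrieved text rather than from memory.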

Vector Database

A database optimized for storing and querying high-dimensional embedding vectors, enabling fast approximate nearest-neighbor search.

Training

Distillation

Training a smaller model to replicate the behavior of a larger model by learning from its output distributions rather than ground-truth labels alone.
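One common way to learn from output distributions is to combine a soft loss against the teacher's temperature-softened probabilities with a hard loss against the true label. The sketch below follows that widely used formulation; the temperature, the mixing weight `alpha`, and the logits are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label, temperature=2.0, alpha=0.5):
    """alpha * cross-entropy on the label + (1 - alpha) * T^2 * KL(teacher || student)."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_loss = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    hard_loss = -math.log(softmax(student_logits)[label])
    return alpha * hard_loss + (1 - alpha) * soft_loss


teacher = [2.0, 1.0, 0.1]  # hypothetical teacher logits for a 3-class problem
matched = distillation_loss(teacher, teacher, label=0)
mismatched = distillation_loss([0.1, 1.0, 2.0], teacher, label=0)
```

A student that reproduces the teacher's logits gets a lower loss than one that ranks the classes in the opposite order.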

Fine-tuning

Continued training of a pre-trained model on a smaller, task-specific dataset to adapt its weights for a particular domain, task, or output format.

LoRA (Low-Rank Adaptation)

A parameter-efficient fine-tuning method that freezes the pre-trained weights and trains small low-rank adapter matrices instead, reducing trainable parameters by orders of magnitude (up to about 10,000x in the original GPT-3 175B experiments) while maintaining comparable task performance.
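The savings come from replacing an update to a full weight matrix with two thin matrices of rank r. A rough parameter count for a single layer, using a hypothetical 4096 x 4096 matrix and rank 8:

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for adapters B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, for illustration only.
full = full_finetune_params(4096, 4096)
adapter = lora_params(4096, 4096, rank=8)
reduction = full // adapter
```

For this one matrix the reduction is 256x. The model-wide ratio is usually larger because adapters are attached to only some of the weight matrices while everything else stays frozen.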

RLHF (Reinforcement Learning from Human Feedback)

A training method that uses human preference ratings to fine-tune a model's behavior after initial pre-training, making it more aligned with desired outputs for helpfulness, accuracy, and safety.

Inference

JSON Mode

An API setting that constrains an LLM to output valid JSON without enforcing a specific schema.

KV Cache

A performance optimization that stores key and value tensors from attention computations for previously seen tokens, eliminating recomputation during autoregressive generation.

Latency vs Throughput

Latency measures how long a single request takes to complete; throughput measures how much work the system completes per unit of time, usually counted in requests or tokens per second.

Structured Output

Constraining an LLM to produce output in a specific format (JSON, XML, a defined schema) rather than free-form text.
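The difference between "valid JSON" and "matches a schema" shows up when the output is consumed. The sketch below checks a model response against a small hand-written schema; the field names are hypothetical, and production code would more likely use a schema library or the provider's own schema enforcement.

```python
import json

# Hypothetical schema: required field name -> expected Python type.
REQUIRED_FIELDS = {"name": str, "age": int}


def parse_structured(raw: str) -> dict:
    """Parse a model response and verify it has the required fields and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data


person = parse_structured('{"name": "Ada", "age": 36}')
```

A response such as `{"name": "Ada"}` is valid JSON but fails this check, which is the gap that schema-constrained output is meant to close.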

Temperature

A sampling parameter that controls randomness in an LLM's output by scaling the probability distribution of tokens - lower values produce more deterministic results, higher values produce more varied results.
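Concretely, the logits are divided by the temperature before the softmax. A minimal sketch with made-up logits shows how the same scores turn into sharper or flatter distributions:

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


LOGITS = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

sharp = softmax_with_temperature(LOGITS, 0.5)    # top token dominates
neutral = softmax_with_temperature(LOGITS, 1.0)  # unmodified distribution
flat = softmax_with_temperature(LOGITS, 2.0)     # probabilities move closer together
```

Many APIs accept a temperature of 0 and treat it as always picking the most likely token; this sketch rejects it instead, because dividing by zero is undefined.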

Top-p (Nucleus Sampling)

A sampling strategy that selects tokens by including the most probable candidates until their cumulative probability reaches a threshold p, allowing the number of candidates to vary based on the model's confidence.
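The sketch below shows the filtering step only: keep the most probable tokens until their cumulative probability reaches p, then renormalize. The token probabilities are invented; sampling from the result is left out.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}  # renormalize to sum to 1


# Hypothetical next-token distributions.
CONFIDENT = {"Paris": 0.92, "Lyon": 0.05, "Nice": 0.03}
UNCERTAIN = {"red": 0.4, "blue": 0.3, "green": 0.2, "pink": 0.1}

narrow = top_p_filter(CONFIDENT, 0.8)  # one candidate survives
wide = top_p_filter(UNCERTAIN, 0.8)    # three candidates survive
```

The same threshold keeps one candidate when the model is confident and three when it is not, which is the adaptive behavior the definition describes.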

Prompting

Chain-of-Thought (CoT)

A prompting technique where a model is instructed to produce intermediate reasoning steps before generating a final answer, improving performance on complex tasks like math, logic, and code generation.

Prompt Injection

An attack where malicious instructions embedded in external content attempt to override an LLM's system prompt or alter its behavior.

System Prompt

Instructions provided to an LLM at the start of a conversation that define its persona, behavior, response format, and constraints.

Tool Use / Function Calling

The ability of an LLM to request execution of external functions, APIs, or tools by outputting structured function calls, enabling it to take actions beyond text generation.
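On the application side, a function call is just structured text that your code parses and dispatches. The JSON shape and the `get_weather` stub below are hypothetical and do not match any specific provider's format.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather API."""
    return f"Sunny in {city}"


# Registry of functions the model is allowed to request.
TOOLS = {"get_weather": get_weather}


def execute_tool_call(model_output: str) -> str:
    """Parse a structured call emitted by the model and run the matching function."""
    call = json.loads(model_output)
    func = TOOLS.get(call["name"])
    if func is None:
        raise KeyError(f"unknown tool: {call['name']}")
    return func(**call["arguments"])


# Example of what a model might emit instead of plain text.
result = execute_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
```

The model never runs anything itself. It only requests the call, and the application decides whether to execute it and what to send back.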

Evaluation

Hallucination

When an LLM generates confident, fluent text that is factually incorrect or entirely fabricated.
