Plain-English definitions for every term you'll encounter when evaluating, budgeting, and optimizing AI API costs.
The fundamental unit of measurement that AI language model providers use to calculate API costs. A token is not exactly a word — it's closer to a syllable or a common character sequence. On average, one token represents approximately 0.75 English words, which means 100 words equals roughly 133 tokens. Short, common words like "the" or "is" are often single tokens, while longer or less common words may be split into two or three tokens. Numbers and punctuation are also tokenized separately.
See also: Input Tokens, Output Tokens, Token-to-Word Ratio
The tokens contained in the content you send to the model — including your system prompt, conversation history, and the user's message. Input tokens are processed by the model before it generates any response. Providers price input tokens separately from output tokens, and input tokens are typically 3 to 6 times cheaper than output tokens because reading and processing existing text requires less compute than generating new text.
See also: Output Tokens, System Prompt, Context Window
The tokens contained in the content the model generates in response to your prompt. Output tokens are the most significant cost driver for most applications because they are priced substantially higher than input tokens — typically 3 to 6× more — and because the volume of output depends on what you ask the model to generate. A request that asks for a full 800-word article will consume far more output tokens than one asking for a one-sentence summary.
See also: Input Tokens, Batch Processing
The conversion factor used to estimate how many tokens a given number of words will consume when processed by a language model. The widely accepted industry standard for English text is 1.33 tokens per word (or 133 tokens per 100 words). This ratio varies by language, content type, and model — code and technical content often tokenize differently from conversational prose, and some languages like Chinese or Japanese may tokenize more efficiently.
The standard pricing unit used by all major AI API providers in 2026. Rather than charging per individual token (which would produce very small decimal numbers), providers quote rates per one million tokens. This makes it easier to compare costs across providers and model tiers. To calculate your actual cost per request, you take your token count, divide by one million, and multiply by the applicable rate.
A pricing tier offered by OpenAI (Batch API) and Anthropic (Message Batches API) that provides a 50% discount on all token costs in exchange for a relaxed service level agreement — typically 24 hours for completion rather than milliseconds. Batch requests are submitted as a file of multiple API calls, queued, processed during off-peak hours on the provider's infrastructure, and returned as a downloadable results file. The model and prompt quality are identical to real-time API calls; only the delivery timing differs.
See also: Real-time Inference, SLA, JSONL
The standard mode of API access where requests are processed immediately and responses are returned within milliseconds to seconds. Real-time inference is priced at the full token rate (no batch discount) because the provider must maintain dedicated compute capacity to handle requests on-demand. This mode is required for any user-facing application where a person is actively waiting for a response — chat interfaces, voice assistants, interactive coding tools, and live Q&A systems.
See also: Batch Processing, Rate Limit
A feature offered by Anthropic and OpenAI that stores frequently repeated prompt prefixes (such as long system prompts or static context) in the provider's cache, allowing subsequent requests that use the same prefix to be charged at a heavily discounted rate — typically 80–90% cheaper than the standard input token rate. Prompt caching is most valuable for applications with stable, lengthy system prompts that are sent on every request, such as document analysis tools with large reference contexts or coding assistants with extensive code repositories loaded as context.
The maximum number of tokens a model can process in a single API call — encompassing both your input (system prompt + conversation history + user message) and the model's output. If the total token count of a request exceeds the context window limit, the request will either fail or the provider will truncate the oldest content. In 2026, context windows range from 128K tokens (GPT-5 Nano) to 2 million tokens (Gemini 3.1 Pro). Larger context windows enable use cases like full document analysis, long-form research, and processing entire codebases.
An architectural pattern where a language model's responses are augmented with information retrieved from an external knowledge base at query time. Instead of relying solely on the model's training data, a RAG system first searches a vector database for relevant documents or passages, then injects those passages into the prompt context before calling the LLM. RAG is widely used to give models access to proprietary, up-to-date, or domain-specific information without the cost and complexity of fine-tuning.
See also: Embeddings, Context Window
Numerical vector representations of text that capture semantic meaning, allowing a computer to measure how conceptually similar two pieces of text are. Embeddings are generated by dedicated embedding models (separate from chat completion models) and stored in a vector database. They are the foundation of RAG systems and semantic search. Embedding API calls are priced differently from chat completion calls — typically at a flat rate per million input tokens with no separate output token cost, and at much lower rates than frontier models.
The process of further training a pre-trained language model on a custom dataset to adapt its behavior, style, or domain knowledge for a specific use case. Fine-tuned models can learn a company's unique tone, domain-specific terminology, or specialized output format. In 2026, fine-tuning is offered by OpenAI and Anthropic as a managed service. Fine-tuned models are typically more expensive per token than their base model counterparts, but can achieve the same quality with shorter prompts (because the model has internalized instructions), potentially producing net savings on complex tasks.
A special prompt sent at the beginning of every API request that provides the model with persistent instructions, persona, context, or constraints that apply to the entire conversation. System prompts are invisible to end users in most applications but are charged as input tokens on every single API call. Because they are repeated across all requests, system prompt length has an outsized impact on total input token costs relative to its length — a bloated system prompt is one of the most common sources of hidden waste in AI budgets.
A prompting technique where one or more examples of the desired input/output format are included in the prompt to demonstrate to the model what is expected. "Zero-shot" means no examples; "one-shot" means one example; "few-shot" means two to five examples. Each example adds to the input token count, so there is a direct cost trade-off between providing more examples (better format compliance) and using fewer tokens (lower cost). For well-defined tasks, zero-shot prompting is often sufficient and significantly cheaper.
A feature supported by most major model providers that constrains the model to produce valid JSON output, eliminating the risk of malformed responses that break downstream parsing logic. Structured output mode is particularly valuable for data extraction, classification, and any application that needs to process model responses programmatically. JSON-formatted responses are also typically more token-efficient than equivalent prose, reducing output costs while simultaneously improving reliability.
An informal term for the most capable, state-of-the-art language models available at any given time — representing the current technological frontier of AI capability. In 2026, frontier models include GPT-5.4, Claude 4.6 Sonnet, and Gemini 3.1 Pro. Frontier models are consistently the most expensive per token, and are best suited for complex reasoning, nuanced generation, multi-step analysis, and customer-facing applications where output quality directly impacts user experience. They are generally not the right choice for high-volume, low-complexity tasks.
The spectrum of models available within a provider's lineup, typically categorized from flagship/frontier (highest capability, highest cost) to nano/mini/edge (lower capability, dramatically lower cost). Every major provider maintains multiple tiers simultaneously. GPT-5.4 vs GPT-5 Nano is a clear example: the Nano model costs 37.5× less on output but handles many tasks with comparable quality. Cost-optimized architectures route tasks to the cheapest tier capable of meeting the quality bar, reserving frontier models for the small percentage of tasks that genuinely require them.
A type of artificial intelligence model trained on massive datasets of text using a technique called self-supervised learning, enabling it to understand and generate human language with remarkable fluency. LLMs underpin all the major AI API services — GPT-5, Claude, Gemini, and DeepSeek are all LLMs. The "large" refers to the number of parameters (internal weights) the model contains, which can range from billions to trillions. Larger models are generally more capable but more expensive to run, which is directly reflected in their per-token pricing.
A model or API endpoint that can process multiple types of input beyond text — including images, audio, video, and documents. Multimodal capabilities are natively integrated into Gemini 3.1 Pro and GPT-5.4, allowing a single API call to analyze an image alongside a text prompt. Image inputs are typically priced differently from text tokens — often as a flat cost per image or as a token-equivalent based on image resolution. Multimodal APIs are essential for document processing, visual QA, invoice extraction, and any application that needs to "see" as well as "read."
A restriction imposed by API providers on how many requests or tokens a given API key can consume within a specified time window — typically per minute and per day. Rate limits vary by tier (higher-spending accounts get higher limits), by model, and by provider. Exceeding a rate limit results in a 429 (Too Many Requests) HTTP error. Rate limits are a key architectural consideration for high-volume applications: if your workload approaches limits during peak hours, you may need to implement request queuing, exponential backoff, or multiple API keys — or switch to batch processing to avoid real-time rate constraints entirely.
A commitment from an API provider regarding the quality, availability, and performance of their service — typically expressed as uptime guarantees and response time targets. For AI APIs, the most relevant SLA dimension is latency: real-time API calls are expected to return within seconds, while batch API calls have a 24-hour completion SLA. Enterprise API agreements may include custom SLAs with financial penalties for non-compliance. Understanding your application's actual latency requirements (vs. assumed requirements) is often the key to unlocking batch pricing discounts.
A text file format where each line is a valid, self-contained JSON object — as opposed to standard JSON, where all objects are wrapped in a single array or object. JSONL is the standard format for Batch API request files: each line in the file represents one API call with its own model, messages, parameters, and a custom ID for tracking. JSONL files can be streamed and processed line-by-line, making them efficient for handling very large batches of requests (up to 50,000 per file for OpenAI's Batch API) without loading the entire dataset into memory at once.