22 terms defined
4 categories covered
5 providers referenced
Last reviewed: 2026
Pricing Fundamentals
Token Pricing

The fundamental unit of measurement that AI language model providers use to calculate API costs. A token is not exactly a word — it's closer to a syllable or a common character sequence. On average, one token represents approximately 0.75 English words, which means 100 words equals roughly 133 tokens. Short, common words like "the" or "is" are often single tokens, while longer or less common words may be split into two or three tokens. Numbers and punctuation are also tokenized separately.

Example: The sentence "AI pricing is complex" contains 4 words but approximately 7 tokens (short, common words like "is" typically map to one token each, while "pricing" and "complex" may each be split into several).

See also: Input Tokens, Output Tokens, Token-to-Word Ratio

Input Tokens Pricing

The tokens contained in the content you send to the model — including your system prompt, conversation history, and the user's message. Input tokens are processed by the model before it generates any response. Providers price input tokens separately from output tokens, and input tokens are typically 3 to 6 times cheaper than output tokens because reading and processing existing text requires less compute than generating new text.

Example: A 500-word system prompt + 200-word user message = 700 words, or approximately 933 input tokens. At GPT-5.4's rate of $2.50 per million input tokens, that single request costs about $0.0023 in input costs.

See also: Output Tokens, System Prompt, Context Window

Output Tokens Pricing

The tokens contained in the content the model generates in response to your prompt. Output tokens are the most significant cost driver for most applications because they are priced substantially higher than input tokens — typically 3 to 6× more — and because the volume of output depends on what you ask the model to generate. A request that asks for a full 800-word article will consume far more output tokens than one asking for a one-sentence summary.

Example: A 600-word response from GPT-5.4 generates approximately 800 output tokens. At $15.00 per million output tokens, that single response costs $0.012. Across 100,000 monthly requests, the same response length would cost $1,200/month in output alone.

See also: Input Tokens, Batch Processing

Token-to-Word Ratio Pricing

The conversion factor used to estimate how many tokens a given number of words will consume when processed by a language model. A widely used rule of thumb for English text is 1.33 tokens per word (or 133 tokens per 100 words). This ratio varies by language, content type, and model: code and technical content often tokenize differently from conversational prose, and languages such as Chinese or Japanese, which are not written with spaces between words, can produce noticeably different token counts for the same content.

How AICostHub uses it: When you enter 500 words per prompt in our calculator, we multiply by 1.333 to estimate your token consumption before applying the provider's per-million-token rate. This gives a realistic cost estimate rather than a theoretical minimum.
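
The multiplication described above can be sketched in a few lines of Python. This is only an illustration: the function name and the round-to-nearest step are assumptions, not AICostHub's actual implementation.

```python
TOKENS_PER_WORD = 1.333  # English-prose approximation quoted above


def estimate_tokens(word_count: int, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Estimate how many tokens a given number of English words will consume."""
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    # Rounding to the nearest whole token is an assumption made for this sketch.
    return round(word_count * tokens_per_word)


print(estimate_tokens(100))  # the 100-word case from the definition above
```

For non-English text or code, pass a different `tokens_per_word` value measured on your own content.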
Per-Million Pricing ($/1M tokens) Pricing

The standard pricing unit used by all major AI API providers in 2026. Rather than charging per individual token (which would produce very small decimal numbers), providers quote rates per one million tokens. This makes it easier to compare costs across providers and model tiers. To calculate your actual cost per request, you take your token count, divide by one million, and multiply by the applicable rate.

Formula: Cost = (tokens ÷ 1,000,000) × rate. A request using 2,000 input tokens at $3.00/M costs (2,000 ÷ 1,000,000) × $3.00 = $0.006.
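
The formula translates directly into code. The sketch below uses hypothetical function names; the second helper simply applies the same formula twice, since input and output tokens are billed at separate rates.

```python
def token_cost(tokens: int, rate_per_million: float) -> float:
    """Cost in dollars for `tokens` at a rate quoted in $ per 1M tokens."""
    return tokens / 1_000_000 * rate_per_million


def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """Total cost of one request, with input and output billed separately."""
    return token_cost(input_tokens, input_rate) + token_cost(output_tokens, output_rate)


# The worked example from the formula: 2,000 input tokens at $3.00/M.
print(f"${token_cost(2_000, 3.00):.4f}")
```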
Processing Modes
Batch Processing Strategy

A pricing tier offered by OpenAI (Batch API) and Anthropic (Message Batches API) that provides a 50% discount on all token costs in exchange for a relaxed service level agreement: results are typically returned within 24 hours rather than within seconds. Batch requests are submitted as a file of multiple API calls, queued, processed asynchronously on the provider's infrastructure, and returned as a downloadable results file. The model and output quality are identical to real-time API calls; only the delivery timing differs.

Best for: Content generation pipelines, nightly data enrichment, bulk classification, embedding generation, LLM-as-a-judge evaluations — any workload where results aren't needed immediately. See our full Batch API guide for a step-by-step migration walkthrough.

See also: Real-time Inference, SLA, JSONL

Real-time Inference Technical

The standard mode of API access where requests are processed immediately and responses are returned within milliseconds to seconds. Real-time inference is priced at the full token rate (no batch discount) because the provider must maintain dedicated compute capacity to handle requests on-demand. This mode is required for any user-facing application where a person is actively waiting for a response — chat interfaces, voice assistants, interactive coding tools, and live Q&A systems.

See also: Batch Processing, Rate Limit

Prompt Caching Strategy

A feature offered by Anthropic and OpenAI that stores frequently repeated prompt prefixes (such as long system prompts or static context) in the provider's cache, allowing subsequent requests that use the same prefix to be charged at a heavily discounted rate — typically 80–90% cheaper than the standard input token rate. Prompt caching is most valuable for applications with stable, lengthy system prompts that are sent on every request, such as document analysis tools with large reference contexts or coding assistants with extensive code repositories loaded as context.

Design principle: Structure prompts so static, cacheable content comes first (system prompt, reference documents) and dynamic per-request content (user message) comes last. This maximizes cache hit rates and cost savings.
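
To see why prefix ordering matters, the following rough sketch compares the input cost of one request with and without a cache hit. The 90% discount, the token counts, and the $3.00/M rate are illustrative assumptions; the sketch also ignores cache-write surcharges and cache expiry, which vary by provider.

```python
def input_cost_with_cache(prefix_tokens: int, dynamic_tokens: int,
                          rate_per_million: float,
                          cache_discount: float = 0.90,
                          cache_hit: bool = True) -> float:
    """Input cost of one request whose static prefix may be served from cache."""
    # Only the static prefix is eligible for the discounted rate.
    prefix_rate = rate_per_million * (1 - cache_discount) if cache_hit else rate_per_million
    return (prefix_tokens * prefix_rate + dynamic_tokens * rate_per_million) / 1_000_000


# Hypothetical request: 10,000-token static prefix, 200-token user message.
hit = input_cost_with_cache(10_000, 200, 3.00, cache_hit=True)
miss = input_cost_with_cache(10_000, 200, 3.00, cache_hit=False)
print(f"cache hit: ${hit:.4f}  cache miss: ${miss:.4f}")
```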
Architecture
Context Window Technical

The maximum number of tokens a model can process in a single API call — encompassing both your input (system prompt + conversation history + user message) and the model's output. If the total token count of a request exceeds the context window limit, the request will either fail or the provider will truncate the oldest content. In 2026, context windows range from 128K tokens (GPT-5 Nano) to 2 million tokens (Gemini 3.1 Pro). Larger context windows enable use cases like full document analysis, long-form research, and processing entire codebases.

Practical implication: 1 million tokens ≈ 750,000 words ≈ a 1,500-page book. Gemini 3.1 Pro's 2M context window can theoretically process War and Peace, Anna Karenina, and three more novels simultaneously in a single prompt.
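
Because input and output share the same window, a pre-flight check before each call can catch oversized requests. The helper names below are hypothetical, and the 128,000-token limit is just the smallest window mentioned in this entry.

```python
def fits_context_window(input_tokens: int, max_output_tokens: int,
                        context_window: int) -> bool:
    """True if the input plus the reserved output budget fits in the window."""
    return input_tokens + max_output_tokens <= context_window


def tokens_to_trim(input_tokens: int, max_output_tokens: int,
                   context_window: int) -> int:
    """How many input tokens must be dropped for the request to fit (0 if none)."""
    return max(0, input_tokens + max_output_tokens - context_window)


WINDOW = 128_000  # illustrative limit taken from the range quoted above
print(fits_context_window(120_000, 4_000, WINDOW))
print(tokens_to_trim(127_000, 4_000, WINDOW))
```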
RAG (Retrieval-Augmented Generation) Architecture

An architectural pattern where a language model's responses are augmented with information retrieved from an external knowledge base at query time. Instead of relying solely on the model's training data, a RAG system first searches a vector database for relevant documents or passages, then injects those passages into the prompt context before calling the LLM. RAG is widely used to give models access to proprietary, up-to-date, or domain-specific information without the cost and complexity of fine-tuning.

Cost implication: RAG systems increase input token counts because retrieved documents are added to the context. However, they typically produce more accurate responses, reducing the need for expensive retry calls. The net effect on cost depends on the length and number of retrieved passages.
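
One way to keep the added input cost predictable is to cap retrieved context at a token budget. The sketch below assumes the passages are already sorted by relevance and estimates tokens from word counts using the 1.333 ratio; the function names and the prompt layout are hypothetical.

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.333) -> int:
    """Rough token estimate from a whitespace word count (an approximation)."""
    return round(len(text.split()) * tokens_per_word)


def build_rag_prompt(question: str, passages: list[str],
                     max_context_tokens: int) -> tuple[str, int]:
    """Add passages in relevance order until the token budget is spent."""
    selected = []
    used = 0
    for passage in passages:
        cost = estimate_tokens(passage)
        if used + cost > max_context_tokens:
            break  # stop at the first passage that would exceed the budget
        selected.append(passage)
        used += cost
    context = "\n\n".join(selected)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return prompt, used
```

The second return value is the estimated number of context tokens added, which can be fed into a per-request cost calculation.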

See also: Embeddings, Context Window

Embeddings Technical

Numerical vector representations of text that capture semantic meaning, allowing a computer to measure how conceptually similar two pieces of text are. Embeddings are generated by dedicated embedding models (separate from chat completion models) and stored in a vector database. They are the foundation of RAG systems and semantic search. Embedding API calls are priced differently from chat completion calls — typically at a flat rate per million input tokens with no separate output token cost, and at much lower rates than frontier models.

Typical use: A user asks "what is our refund policy?" — the question is converted to an embedding vector, which is used to search a vector database of company documentation embeddings. The most semantically similar passages are retrieved and injected into the prompt as context.
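
The "most semantically similar" step is usually a cosine-similarity comparison between vectors. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings have hundreds or thousands of dimensions and come from an embedding model, and a vector database would perform this search at scale.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def most_similar(query_vec: list[float], doc_vecs: dict[str, list[float]]) -> str:
    """Return the name of the document whose vector is closest to the query."""
    return max(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]))


# Toy vectors standing in for real embeddings.
docs = {"refund-policy": [0.9, 0.1, 0.0], "shipping-times": [0.0, 1.0, 0.2]}
print(most_similar([1.0, 0.0, 0.0], docs))
```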
Fine-tuning Architecture

The process of further training a pre-trained language model on a custom dataset to adapt its behavior, style, or domain knowledge for a specific use case. Fine-tuned models can learn a company's unique tone, domain-specific terminology, or specialized output format. In 2026, fine-tuning is offered by OpenAI and Anthropic as a managed service. Fine-tuned models are typically more expensive per token than their base model counterparts, but can achieve the same quality with shorter prompts (because the model has internalized instructions), potentially producing net savings on complex tasks.

System Prompt Technical

A special prompt sent at the beginning of every API request that provides the model with persistent instructions, persona, context, or constraints that apply to the entire conversation. System prompts are invisible to end users in most applications but are charged as input tokens on every single API call. Because they are repeated across all requests, system prompt length has an outsized impact on total input token costs relative to its length — a bloated system prompt is one of the most common sources of hidden waste in AI budgets.

Cost awareness: A 2,000-token system prompt sent with 500,000 monthly requests = 1 billion input tokens consumed by the system prompt alone. At $3.00/M, that's $3,000/month — often more than the user content itself.
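
The arithmetic above is easy to rerun for your own numbers. In this sketch the function name is hypothetical, and the 1,200-token variant is an invented comparison showing what trimming the prompt would save at the same volume and rate.

```python
def monthly_system_prompt_cost(prompt_tokens: int, monthly_requests: int,
                               rate_per_million: float) -> float:
    """Monthly input cost attributable to the system prompt alone."""
    total_tokens = prompt_tokens * monthly_requests
    return total_tokens / 1_000_000 * rate_per_million


current = monthly_system_prompt_cost(2_000, 500_000, 3.00)   # the example above
trimmed = monthly_system_prompt_cost(1_200, 500_000, 3.00)   # hypothetical shorter prompt
print(f"current: ${current:,.0f}/month, trimmed: ${trimmed:,.0f}/month")
```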
Few-Shot Prompting Technical

A prompting technique where one or more examples of the desired input/output format are included in the prompt to demonstrate to the model what is expected. "Zero-shot" means no examples; "one-shot" means one example; "few-shot" means two to five examples. Each example adds to the input token count, so there is a direct cost trade-off between providing more examples (better format compliance) and using fewer tokens (lower cost). For well-defined tasks, zero-shot prompting is often sufficient and significantly cheaper.
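
The trade-off can be made concrete by building the same prompt with and without examples and comparing sizes. This is a minimal sketch: the "Input:/Output:" layout is one common convention rather than a required format, and the word count is only a rough proxy for tokens.

```python
def build_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a zero-shot (no examples) or few-shot prompt as plain text."""
    parts = [instruction]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


instruction = "Classify the sentiment as positive or negative."
examples = [("Great service, thanks!", "positive"), ("Never ordering again.", "negative")]

zero_shot = build_prompt(instruction, [], "The app keeps crashing.")
few_shot = build_prompt(instruction, examples, "The app keeps crashing.")
# Every example lengthens the prompt, and therefore the input token bill.
print(len(zero_shot.split()), len(few_shot.split()))
```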

JSON Mode / Structured Output Technical

A feature supported by most major model providers that constrains the model to produce valid JSON output, eliminating the risk of malformed responses that break downstream parsing logic. Structured output mode is particularly valuable for data extraction, classification, and any application that needs to process model responses programmatically. JSON-formatted responses are also typically more token-efficient than equivalent prose, reducing output costs while simultaneously improving reliability.
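
Downstream code often still checks the response before using it, for example to confirm that expected fields are present. The sketch below is a defensive parser using only the standard library; the function name and the required keys are hypothetical.

```python
import json


def parse_model_json(raw: str, required_keys: set[str]) -> dict:
    """Parse a model response as a JSON object and check for required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = required_keys - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


result = parse_model_json('{"label": "invoice", "confidence": 0.93}', {"label", "confidence"})
print(result["label"])
```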

Models & Strategy
Frontier Model Technical

An informal term for the most capable, state-of-the-art language models available at any given time — representing the current technological frontier of AI capability. In 2026, frontier models include GPT-5.4, Claude 4.6 Sonnet, and Gemini 3.1 Pro. Frontier models are consistently the most expensive per token, and are best suited for complex reasoning, nuanced generation, multi-step analysis, and customer-facing applications where output quality directly impacts user experience. They are generally not the right choice for high-volume, low-complexity tasks.

Model Tiers Strategy

The spectrum of models available within a provider's lineup, typically categorized from flagship/frontier (highest capability, highest cost) to nano/mini/edge (lower capability, dramatically lower cost). Every major provider maintains multiple tiers simultaneously. GPT-5.4 vs GPT-5 Nano is a clear example: the Nano model costs 37.5× less on output but handles many tasks with comparable quality. Cost-optimized architectures route tasks to the cheapest tier capable of meeting the quality bar, reserving frontier models for the small percentage of tasks that genuinely require them.

Strategy: Classification and routing → smallest capable model. Drafting and summarization → mid-tier. Customer-facing generation and complex reasoning → frontier model. This tiered approach typically reduces blended API costs by 40–60%.
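
A routing table is the simplest way to express this strategy in code. The task labels and tier names below are hypothetical placeholders; falling back to the frontier tier for unknown task types is a conservative assumption, not a rule from any provider.

```python
# Hypothetical mapping from task type to model tier, following the strategy above.
TIER_BY_TASK = {
    "classification": "small",
    "routing": "small",
    "drafting": "mid",
    "summarization": "mid",
    "customer_generation": "frontier",
    "complex_reasoning": "frontier",
}


def pick_tier(task_type: str) -> str:
    """Return the cheapest tier assumed capable of handling the task type."""
    # Unknown task types fall back to the most capable tier to protect quality.
    return TIER_BY_TASK.get(task_type, "frontier")


print(pick_tier("classification"), pick_tier("summarization"), pick_tier("unknown"))
```

In practice each tier name would map to a concrete model ID and per-million-token rate from your provider's price list.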
LLM (Large Language Model) Technical

A type of artificial intelligence model trained on massive datasets of text using a technique called self-supervised learning, enabling it to understand and generate human language with remarkable fluency. LLMs underpin all the major AI API services — GPT-5, Claude, Gemini, and DeepSeek are all LLMs. The "large" refers to the number of parameters (internal weights) the model contains, which can range from billions to trillions. Larger models are generally more capable but more expensive to run, which is directly reflected in their per-token pricing.

Multimodal Technical

A model or API endpoint that can process multiple types of input beyond text — including images, audio, video, and documents. Multimodal capabilities are natively integrated into Gemini 3.1 Pro and GPT-5.4, allowing a single API call to analyze an image alongside a text prompt. Image inputs are typically priced differently from text tokens — often as a flat cost per image or as a token-equivalent based on image resolution. Multimodal APIs are essential for document processing, visual QA, invoice extraction, and any application that needs to "see" as well as "read."

Rate Limit Pricing

A restriction imposed by API providers on how many requests or tokens a given API key can consume within a specified time window — typically per minute and per day. Rate limits vary by tier (higher-spending accounts get higher limits), by model, and by provider. Exceeding a rate limit results in a 429 (Too Many Requests) HTTP error. Rate limits are a key architectural consideration for high-volume applications: if your workload approaches limits during peak hours, you may need to implement request queuing, exponential backoff, or multiple API keys — or switch to batch processing to avoid real-time rate constraints entirely.
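
Exponential backoff can be sketched as a small retry wrapper. `RateLimitError` here is a hypothetical stand-in for whatever exception your API client raises on a 429 response, and the delay values and 10% jitter are illustrative defaults rather than provider recommendations.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error a client raises on HTTP 429."""


def call_with_backoff(send_request, max_retries: int = 5, base_delay: float = 1.0,
                      max_delay: float = 60.0, sleep=time.sleep):
    """Call send_request(), retrying with exponentially growing waits on 429s."""
    for attempt in range(max_retries + 1):
        try:
            return send_request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists so the wait can be replaced in tests; production code would leave it at the default.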

SLA (Service Level Agreement) Strategy

A commitment from an API provider regarding the quality, availability, and performance of their service — typically expressed as uptime guarantees and response time targets. For AI APIs, the most relevant SLA dimension is latency: real-time API calls are expected to return within seconds, while batch API calls have a 24-hour completion SLA. Enterprise API agreements may include custom SLAs with financial penalties for non-compliance. Understanding your application's actual latency requirements (vs. assumed requirements) is often the key to unlocking batch pricing discounts.

JSONL (JSON Lines) Technical

A text file format where each line is a valid, self-contained JSON object — as opposed to standard JSON, where all objects are wrapped in a single array or object. JSONL is the standard format for Batch API request files: each line in the file represents one API call with its own model, messages, parameters, and a custom ID for tracking. JSONL files can be streamed and processed line-by-line, making them efficient for handling very large batches of requests (up to 50,000 per file for OpenAI's Batch API) without loading the entire dataset into memory at once.

Example line: {"custom_id": "req-001", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-5-4", "messages": [{"role": "user", "content": "Summarize this document..."}]}}
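
A file of such lines can be generated with the standard library alone. The sketch below mirrors the shape of the example line, including its model string; the function names and the `req-001` numbering scheme are assumptions, and the exact fields required should be checked against your provider's batch documentation.

```python
import json


def build_batch_lines(prompts: list[str], model: str) -> list[str]:
    """Serialize one request per prompt, each as a single-line JSON object."""
    lines = []
    for index, prompt in enumerate(prompts, start=1):
        request = {
            "custom_id": f"req-{index:03d}",  # lets you match results to requests
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": model, "messages": [{"role": "user", "content": prompt}]},
        }
        lines.append(json.dumps(request))
    return lines


def write_jsonl(path: str, lines: list[str]) -> None:
    """Write the lines to a file, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


batch = build_batch_lines(["Summarize document A", "Summarize document B"], "gpt-5-4")
print(batch[0])
```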