Token Core
The fundamental unit of text that language models process and charge for. Roughly 0.75 English words per token on average. All LLM API pricing is denominated in tokens per million โ not words or characters.
Input Token Pricing
Tokens in your prompt โ the system instruction, user message, conversation history, and any context you pass to the model. Input tokens are typically 3โ6ร cheaper than output tokens because processing text requires less GPU compute than generating it.
Output Token Pricing
Tokens generated by the model in its response. Output tokens require one GPU forward pass per token, making them computationally expensive. For generation-heavy workloads, output costs dominate the total bill โ often 70โ85% of total spend.
Context Window Core
The maximum number of tokens a model can process in a single request, including both input and output. Claude 4.6 Sonnet: 200K tokens. GPT-5.4: 128K tokens. Gemini 3.1 Pro: up to 2 million tokens. Larger windows enable full-document processing without chunking.
Batch Processing Pricing
A pricing mode where requests are submitted in bulk and processed within 24 hours instead of real time. Offered by OpenAI and Anthropic at a flat 50% discount on all token costs. Ideal for any non-interactive background workload: pipelines, enrichment, generation at scale.
Prompt Caching Technique
A feature that caches the beginning of a prompt so it doesn't need to be fully re-processed on every call. Cached tokens are re-billed at approximately 10% of the standard input rate โ up to 90% savings on stable, repeated system prompts and context.
RAG โ Retrieval-Augmented Generation Infra
An architecture where relevant documents are retrieved from a vector database and inserted into context at query time. Dramatically reduces token costs for large knowledge bases compared to inserting the entire corpus โ only relevant chunks are sent per request.
Embedding Infra
A numerical vector representation of text, produced by an embedding model and used for semantic search in RAG pipelines. Priced separately from generation models, typically $0.02โ$0.13 per million tokens. Re-embedding costs accumulate as your knowledge base grows and updates.
System Prompt Core
An instruction block sent before the user's message that sets the model's behavior, persona, and output format. System prompt tokens are billed as input tokens on every call โ even when unchanged โ making their length a significant ongoing cost driver at high volume.
Few-Shot Prompting Technique
A prompting strategy that includes example input-output pairs to demonstrate desired behavior. Each example adds 100โ300 tokens per request. Use selectively โ zero-shot often suffices for well-defined tasks and avoids the compounding token cost of examples included in every call.
JSON Mode / Structured Output Technique
A configuration that forces model output to conform to a specific JSON schema. Eliminates format-related retry calls and is slightly more token-efficient than equivalent prose, since it removes transitional language. A best practice for any pipeline requiring machine-readable structured data.
Token-to-Word Ratio Core
The average number of tokens per word for a given content type. English prose: ~1.333 tokens/word. Code: 1.5โ2.0 tokens/word. Non-Latin scripts (CJK, Arabic): 2.0โ4.0 tokens/word. Getting this ratio right is the foundation of any accurate API cost estimate.
Model Tier Pricing
A classification of AI models by capability and cost. Frontier: GPT-5.4, Claude 4.6 Sonnet โ maximum quality, premium rates. Mid-tier: Gemini 3.1 Pro โ strong balance of quality and cost. Budget: GPT-5 Nano, DeepSeek V4 โ optimized for cost-sensitive high-volume tasks, up to 50ร cheaper than frontier.
Retry Rate Infra
The percentage of API calls that fail and must be retried. A 5% retry rate with three retries per failure effectively adds 15% to your token consumption. Tracking retry rate as a KPI surfaces both prompt engineering problems (inconsistent output format) and infrastructure problems (rate limit tier).
Rate Limit Infra
A per-minute or per-day cap on API requests or tokens, set by the provider based on your account tier. Exceeding limits returns a 429 error requiring retry logic. High-volume production deployments must implement queuing, exponential backoff, and request distribution to stay within limits.
LLM-as-a-Judge Technique
A quality evaluation approach where one language model scores or critiques the outputs of another. Provides scalable quality assurance without manual review, but incurs token costs proportional to evaluation sample rate and the length of outputs being scored. Factor into true cost-of-ownership calculations.