Quick Reference

Token Pricing Cheat Sheet

The essential conversion formulas and benchmarks every developer working with LLM APIs should know. Bookmark this.

Token Conversion Rules
Words → Tokens (English): × 1.333
Tokens → Words: ÷ 1.333
100 words: ≈ 133 tokens
1,000 words (one essay): ≈ 1,333 tokens
1 page (~250 words): ≈ 333 tokens
Code (denser than prose): × 1.5–2.0
Non-Latin scripts (CJK etc.): × 2.0–4.0
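The conversion rules above can be sketched as a small estimator. This is a rough sketch, not a tokenizer: the names `estimate_tokens` and `estimate_words` and the content-type labels are illustrative assumptions, and for code and non-Latin scripts it assumes the midpoint of the ranges listed here. Real counts depend on each model's tokenizer.

```python
# Tokens-per-word multipliers from the conversion rules above.
# For the ranged entries the midpoint is used, which is an assumption.
TOKENS_PER_WORD = {
    "english": 1.333,
    "code": 1.75,       # midpoint of 1.5-2.0
    "non_latin": 3.0,   # midpoint of 2.0-4.0
}


def estimate_tokens(words: int, content_type: str = "english") -> int:
    """Return a rough token estimate for a given word count."""
    return round(words * TOKENS_PER_WORD[content_type])


def estimate_words(tokens: int) -> int:
    """Return a rough English word estimate for a given token count."""
    return round(tokens / TOKENS_PER_WORD["english"])


if __name__ == "__main__":
    print(estimate_tokens(100))           # 133
    print(estimate_tokens(250))           # 333
    print(estimate_tokens(100, "code"))   # 175
```

For budgeting, an estimate like this is usually enough; for exact billing figures, count tokens with the provider's own tokenizer.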
Cost Benchmarks (2026)
1M words input, GPT-5.4: ≈ $3.33
1M words output, GPT-5.4: ≈ $20.00
1M words input, Gemini 3.1 Pro: ≈ $2.67
1M words, DeepSeek V4: ≈ $0.37
Batch discount (most models): −50% on all tokens
Prompt cache savings: ~90% off cached tokens
Typical output:input ratio: 1.5× input tokens
Monthly Cost Formula
// Your inputs
requests = 10,000 / mo
words = 500 / request
in_rate = $3.00 / 1M
out_rate = $15.00 / 1M

// Calculate
in_tok = (requests × words × 1.333) / 1M
out_tok = in_tok × 1.5

cost = (in_tok × in_rate) + (out_tok × out_rate)
// With the inputs above: in_tok ≈ 6.67M, out_tok ≈ 10.0M, cost ≈ $169.96 / mo
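As a runnable sketch of the formula, the following Python uses the example inputs listed above and scales the per-request word count by the monthly request count. The function name `monthly_cost` and its parameter names are illustrative assumptions; the 1.333 and 1.5 constants are this sheet's rules of thumb, not measured values.

```python
TOKENS_PER_WORD = 1.333       # English prose, from the conversion rules
OUTPUT_TO_INPUT_RATIO = 1.5   # typical output:input ratio from this sheet


def monthly_cost(requests: int, words_per_request: int,
                 in_rate: float, out_rate: float) -> float:
    """Estimate monthly spend in USD; rates are USD per 1M tokens."""
    # Total input tokens for the month, expressed in millions.
    in_tok_millions = requests * words_per_request * TOKENS_PER_WORD / 1_000_000
    # Output volume is estimated from the input volume.
    out_tok_millions = in_tok_millions * OUTPUT_TO_INPUT_RATIO
    return in_tok_millions * in_rate + out_tok_millions * out_rate


if __name__ == "__main__":
    cost = monthly_cost(10_000, 500, in_rate=3.00, out_rate=15.00)
    print(f"${cost:,.2f}")  # $169.96
```

Note how the output side accounts for roughly $150 of the $170 total, which is why output length limits tend to matter more than input trimming for this kind of workload.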
True Cost Multipliers
Base token cost: 1.0×
+ System prompt overhead: +10–20%
+ Retry / failure rate: +5–15%
+ Embedding pipeline (RAG): +5–15%
+ Evaluation / monitoring: +5–10%
Engineering maintenance: 0.25–0.5 FTE
Recommended total buffer: × 1.5 total
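One way to apply these multipliers is to turn a base token cost into a low/high range. The sketch below assumes the overheads add up rather than compound (the sheet does not say which), leaves out engineering maintenance because it is measured in FTE rather than percent, and uses the hypothetical name `true_cost_range`.

```python
# Overhead ranges (low, high) as fractions of base token cost,
# copied from the multiplier list above.
OVERHEADS = {
    "system_prompt": (0.10, 0.20),
    "retries": (0.05, 0.15),
    "embeddings_rag": (0.05, 0.15),
    "evaluation": (0.05, 0.10),
}


def true_cost_range(base_cost: float) -> tuple[float, float]:
    """Return (low, high) cost estimates, assuming overheads are additive."""
    low = 1.0 + sum(lo for lo, _ in OVERHEADS.values())
    high = 1.0 + sum(hi for _, hi in OVERHEADS.values())
    return base_cost * low, base_cost * high


if __name__ == "__main__":
    low, high = true_cost_range(1000.0)
    print(f"${low:,.0f} to ${high:,.0f}")  # $1,250 to $1,600
```

Under that additive assumption the range works out to 1.25× to 1.60× of base cost, so the recommended 1.5× buffer sits toward the upper end.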
2026 Rate Card

Model Pricing Reference

Current market rates for all major models tracked by AICostHub. Rates verified June 2026; report a discrepancy if you spot one.

Model              Provider   Tier      Input / 1M  Output / 1M  Batch Input / 1M  Batch Output / 1M
GPT-5.4            OpenAI     Frontier  $2.50       $15.00       $1.25             $7.50
Claude 4.6 Sonnet  Anthropic  Frontier  $3.00       $15.00       $1.50             $7.50
Gemini 3.1 Pro     Google     Mid-Tier  $2.00       $12.00       $1.00             $6.00
GPT-5 Nano         OpenAI     Budget    $0.05       $0.40        $0.025            $0.20
DeepSeek V4        DeepSeek   Budget    $0.28       $0.28        N/A               N/A
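The rate card above can be used as a lookup table for per-request or per-job estimates. The sketch below copies the listed rates into a dictionary; the structure, the key names, and the `request_cost` helper are illustrative assumptions, and the figures should be re-checked against provider pricing pages before use.

```python
# Rates in USD per 1M tokens, copied from the rate card above.
# None means no batch rate is listed for that model.
RATE_CARD = {
    "GPT-5.4": {"input": 2.50, "output": 15.00, "batch_input": 1.25, "batch_output": 7.50},
    "Claude 4.6 Sonnet": {"input": 3.00, "output": 15.00, "batch_input": 1.50, "batch_output": 7.50},
    "Gemini 3.1 Pro": {"input": 2.00, "output": 12.00, "batch_input": 1.00, "batch_output": 6.00},
    "GPT-5 Nano": {"input": 0.05, "output": 0.40, "batch_input": 0.025, "batch_output": 0.20},
    "DeepSeek V4": {"input": 0.28, "output": 0.28, "batch_input": None, "batch_output": None},
}


def request_cost(model: str, input_tokens: int, output_tokens: int,
                 batch: bool = False) -> float:
    """Return the USD cost of a request (or job) at the listed rates."""
    rates = RATE_CARD[model]
    in_key, out_key = ("batch_input", "batch_output") if batch else ("input", "output")
    if rates[in_key] is None or rates[out_key] is None:
        raise ValueError(f"No batch pricing listed for {model}")
    return (input_tokens * rates[in_key] + output_tokens * rates[out_key]) / 1_000_000


if __name__ == "__main__":
    print(request_cost("GPT-5.4", 1_000_000, 1_000_000))              # 17.5
    print(request_cost("GPT-5.4", 1_000_000, 1_000_000, batch=True))  # 8.75
```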
Terminology

AI Pricing Glossary

Plain-English definitions for every term you'll encounter when working with LLM APIs, from fundamentals to advanced optimization techniques.

Token (Core)
The fundamental unit of text that language models process and charge for. One token is roughly 0.75 English words on average. All LLM API pricing is denominated in dollars per million tokens, not per word or character.
Input Token (Pricing)
Tokens in your prompt: the system instruction, user message, conversation history, and any context you pass to the model. Input tokens are typically 3–6× cheaper than output tokens because processing text requires less GPU compute than generating it.
Output Token (Pricing)
Tokens generated by the model in its response. Each output token requires its own GPU forward pass, which makes output computationally expensive. For generation-heavy workloads, output costs dominate the bill, often 70–85% of total spend.
Context Window (Core)
The maximum number of tokens a model can process in a single request, including both input and output. Claude 4.6 Sonnet: 200K tokens. GPT-5.4: 128K tokens. Gemini 3.1 Pro: up to 2 million tokens. Larger windows enable full-document processing without chunking.
Batch Processing (Pricing)
A pricing mode where requests are submitted in bulk and processed within 24 hours instead of real time. Offered by OpenAI and Anthropic at a flat 50% discount on all token costs. Ideal for any non-interactive background workload: pipelines, enrichment, generation at scale.
Prompt Caching (Technique)
A feature that caches the beginning of a prompt so it doesn't need to be fully re-processed on every call. Cached tokens are re-billed at approximately 10% of the standard input rate, which means up to 90% savings on stable, repeated system prompts and context.
RAG: Retrieval-Augmented Generation (Infra)
An architecture where relevant documents are retrieved from a vector database and inserted into context at query time. Dramatically reduces token costs for large knowledge bases compared to inserting the entire corpus, because only relevant chunks are sent per request.
Embedding (Infra)
A numerical vector representation of text, produced by an embedding model and used for semantic search in RAG pipelines. Priced separately from generation models, typically $0.02–$0.13 per million tokens. Re-embedding costs accumulate as your knowledge base grows and updates.
System Prompt (Core)
An instruction block sent before the user's message that sets the model's behavior, persona, and output format. System prompt tokens are billed as input tokens on every call, even when unchanged, making their length a significant ongoing cost driver at high volume.
Few-Shot Prompting (Technique)
A prompting strategy that includes example input-output pairs to demonstrate desired behavior. Each example adds 100–300 tokens per request. Use selectively: zero-shot often suffices for well-defined tasks and avoids the compounding token cost of examples included in every call.
JSON Mode / Structured Output (Technique)
A configuration that forces model output to conform to a specific JSON schema. Eliminates format-related retry calls and is slightly more token-efficient than equivalent prose, since it removes transitional language. A best practice for any pipeline requiring machine-readable structured data.
Token-to-Word Ratio (Core)
The average number of tokens per word for a given content type. English prose: ~1.333 tokens/word. Code: 1.5–2.0 tokens/word. Non-Latin scripts (CJK, Arabic): 2.0–4.0 tokens/word. Getting this ratio right is the foundation of any accurate API cost estimate.
Model Tier (Pricing)
A classification of AI models by capability and cost. Frontier (GPT-5.4, Claude 4.6 Sonnet): maximum quality, premium rates. Mid-tier (Gemini 3.1 Pro): strong balance of quality and cost. Budget (GPT-5 Nano, DeepSeek V4): optimized for cost-sensitive high-volume tasks, up to 50× cheaper than frontier.
Retry Rate (Infra)
The percentage of API calls that fail and must be retried. A 5% retry rate with three retries per failure effectively adds 15% to your token consumption. Tracking retry rate as a KPI surfaces both prompt engineering problems (inconsistent output format) and infrastructure problems (rate limit tier).
Rate Limit (Infra)
A per-minute or per-day cap on API requests or tokens, set by the provider based on your account tier. Exceeding limits returns a 429 error requiring retry logic. High-volume production deployments must implement queuing, exponential backoff, and request distribution to stay within limits.
LLM-as-a-Judge (Technique)
A quality evaluation approach where one language model scores or critiques the outputs of another. Provides scalable quality assurance without manual review, but incurs token costs proportional to evaluation sample rate and the length of outputs being scored. Factor into true cost-of-ownership calculations.
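Two of the glossary entries above carry arithmetic worth making concrete: the prompt caching discount and the retry rate overhead. The sketch below is a simplified model under stated assumptions: cached tokens are billed at 10% of the input rate as described in the Prompt Caching entry, every retry re-sends the full request, and the helper names `cached_input_cost` and `retry_overhead` are hypothetical.

```python
def cached_input_cost(cached_tokens: int, fresh_tokens: int, in_rate: float,
                      cached_fraction: float = 0.10) -> float:
    """Input cost in USD when cached tokens bill at a fraction of in_rate.

    in_rate is USD per 1M tokens; cached_fraction=0.10 follows the
    "approximately 10%" figure in the Prompt Caching entry.
    """
    billed = cached_tokens * in_rate * cached_fraction + fresh_tokens * in_rate
    return billed / 1_000_000


def retry_overhead(retry_rate: float, retries_per_failure: int) -> float:
    """Extra token consumption as a fraction of baseline usage."""
    return retry_rate * retries_per_failure


if __name__ == "__main__":
    # A 2,000-token cached system prompt plus a 500-token user message at $3.00 / 1M.
    with_cache = cached_input_cost(2000, 500, 3.00)
    without_cache = cached_input_cost(0, 2500, 3.00)
    print(f"{1 - with_cache / without_cache:.0%} saved on input")  # 72% saved on input
    print(f"{retry_overhead(0.05, 3):.0%} extra tokens")           # 15% extra tokens
```

The caching example shows why savings on the whole input bill are lower than the 90% headline: only the cached portion of each prompt gets the discount.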
Action Plan

AI Cost Optimization Checklist

A prioritized action list for reducing your AI API spend. Check off each item as you implement it.

⚡ Quick Wins (Days 1–3)

Enable Batch Processing for all non-interactive API calls.
↓ 50% on eligible traffic
Audit system prompt length: every word costs tokens on every single call.
↓ 10–30% input cost
Add output length constraints to all prompts ("respond in under N words").
↓ 20–35% output cost
Set budget alerts at 50%, 75%, 90% of monthly target in provider dashboard.
Switch to JSON output mode for all structured data extraction tasks.

🔧 Infrastructure (Weeks 1–2)

Enable prompt caching for stable system prompts and repeated context.
↓ 40–60% on cached input tokens
Implement cost attribution: tag every API call by team, feature, and use case.
Add intelligent retry logic with exponential backoff and jitter.
Validate structured outputs before retrying to avoid duplicate token costs.
Log token usage per request and track P95 cost as a KPI.
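For the retry item in the list above, one common pattern is exponential backoff with full jitter. The sketch below is provider-agnostic: `RateLimitError` is a hypothetical stand-in for whatever exception your client library raises on an HTTP 429, and the base delay, cap, and attempt count are assumed defaults rather than provider recommendations.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


def call_with_retry(call, max_attempts: int = 5, sleep=time.sleep):
    """Run call(), retrying on RateLimitError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            # Out of attempts: surface the error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

The jitter spreads retries from many clients over time so they do not all hit the rate limit again at the same instant. Passing `sleep` as a parameter keeps the function easy to test without real delays.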

📐 Architecture (Weeks 2–4)

Implement model tiering: route classification and extraction to budget models.
↓ 40–55% on routed traffic
Evaluate RAG vs. long-context insertion for knowledge-base applications.
Build a batch job queue for all background and scheduled AI processing.
Evaluate fine-tuning a smaller model on your specific task to replace a frontier model.
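Model tiering from the list above can start as a very small routing function. This sketch assumes each request already carries a task-type label; the task labels, the `pick_model` name, and the default model choices are illustrative assumptions, and a production router would also need quality checks and a fallback path.

```python
# Task types simple enough to route to a budget model (illustrative labels).
BUDGET_TASKS = {"classification", "extraction"}


def pick_model(task_type: str,
               budget_model: str = "GPT-5 Nano",
               frontier_model: str = "GPT-5.4") -> str:
    """Return the model name to use for a given task type."""
    return budget_model if task_type in BUDGET_TASKS else frontier_model


if __name__ == "__main__":
    print(pick_model("classification"))  # GPT-5 Nano
    print(pick_model("summarization"))   # GPT-5.4
```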

🔄 Ongoing Maintenance

Quarterly prompt audits: compress, remove redundancy, update examples.
Monitor new model releases: providers frequently cut prices on new model generations.
Track retry rate as a KPI: above 2% signals a prompt engineering or infrastructure issue.
Re-evaluate model tier assignments as budget model capabilities improve each quarter.
Compare budgeted vs. actual spend monthly: variance signals hidden waste to investigate.
Official Sources

Official Provider Pricing Pages

Always verify current rates on each provider's official documentation before making large budget commitments. AI pricing changes frequently, sometimes multiple times per quarter, and our calculator is updated to match.