AI API Resources, Glossary & Cost Optimization Guides 2026

Quick Reference

Token Pricing Cheat Sheet

The essential conversion formulas and benchmarks every developer working with LLM APIs should know. Bookmark this.

Token Conversion Rules

Words → Tokens (English)	× 1.333
Tokens → Words	÷ 1.333
100 words	≈ 133 tokens
1,000 words (one essay)	≈ 1,333 tokens
1 page (~250 words)	≈ 333 tokens
Code (denser than prose)	× 1.5–2.0
Non-Latin scripts (CJK etc.)	× 2.0–4.0
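The conversion rules above can be sketched as a small estimator. This is a rough sketch, not a tokenizer: the function and dictionary names are illustrative, and using the midpoint of each range for code and non-Latin text is an assumption of this example.

```python
# Tokens-per-word multipliers from the conversion rules above.
# For ranges, the midpoint is used (an assumption of this sketch).
TOKENS_PER_WORD = {
    "english": 1.333,
    "code": 1.75,       # midpoint of 1.5-2.0
    "non_latin": 3.0,   # midpoint of 2.0-4.0
}


def estimate_tokens(words: int, content_type: str = "english") -> int:
    """Rough token estimate for a word count of the given content type."""
    return round(words * TOKENS_PER_WORD[content_type])


def estimate_words(tokens: int, content_type: str = "english") -> int:
    """Inverse of estimate_tokens: rough word count for a token count."""
    return round(tokens / TOKENS_PER_WORD[content_type])


page_tokens = estimate_tokens(250)    # about 333 tokens for one page
essay_tokens = estimate_tokens(1000)  # about 1,333 tokens for one essay
```

For billing-grade numbers, count tokens with the provider's own tokenizer; these multipliers are only planning estimates.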

Cost Benchmarks (2026)

1M words input, GPT-5.4	≈ $3.33
1M words output, GPT-5.4	≈ $20.00
1M words input, Gemini 3.1	≈ $2.67
1M words, DeepSeek V4	≈ $0.37
Batch discount (most models)	−50% all tokens
Prompt cache savings	~90% off cached tokens
Typical output:input ratio	1.5× input tokens

Monthly Cost Formula

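// Your inputs
requests = 10,000 / mo
words    = 500 / request
in_rate  = $3.00 / 1M tokens
out_rate = $15.00 / 1M tokens

// Calculate (token counts in millions per month)
in_tok  = (requests × words × 1.333) / 1M
out_tok = in_tok × 1.5

cost = (in_tok × in_rate) + (out_tok × out_rate)
// = (6.665 × $3.00) + (9.9975 × $15.00) ≈ $169.96 / mo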
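As a runnable sketch, the formula can be written as a Python function. The function and constant names are illustrative; the 1.333 tokens-per-word and 1.5× output ratio come from this cheat sheet, and the calculation multiplies by the monthly request count so the result is a monthly total.

```python
TOKENS_PER_WORD = 1.333        # English prose, from the conversion rules
OUTPUT_TO_INPUT_RATIO = 1.5    # typical output:input ratio from the benchmarks


def monthly_cost(requests: int, words_per_request: int,
                 in_rate: float, out_rate: float) -> float:
    """Estimated monthly spend in dollars.

    in_rate and out_rate are dollars per 1M tokens.
    """
    # Monthly input tokens, expressed in millions.
    in_tok_m = requests * words_per_request * TOKENS_PER_WORD / 1_000_000
    # Output tokens are estimated from the input volume.
    out_tok_m = in_tok_m * OUTPUT_TO_INPUT_RATIO
    return in_tok_m * in_rate + out_tok_m * out_rate


# 10,000 requests of 500 words at $3.00 in / $15.00 out: about $169.96
example = monthly_cost(10_000, 500, in_rate=3.00, out_rate=15.00)
```

If your workload's output length is not proportional to input length, replace the ratio with a measured average of output tokens per request.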

True Cost Multipliers

Base token cost	1.0×
+ System prompt overhead	+10–20%
+ Retry / failure rate	+5–15%
+ Embedding pipeline (RAG)	+5–15%
+ Evaluation / monitoring	+5–10%
Engineering maintenance	0.25–0.5 FTE
Recommended total buffer	× 1.5 total
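To see how the percentage overheads combine, here is a minimal sketch that turns a base token cost into a low/high range. It assumes the overheads are additive (each is a percentage of the base cost), which the table does not state explicitly, and it leaves out the FTE line because that is not a token cost. All names are illustrative.

```python
# (low, high) overhead fractions from the multipliers table above.
OVERHEADS = {
    "system_prompt": (0.10, 0.20),
    "retries": (0.05, 0.15),
    "embeddings": (0.05, 0.15),
    "evaluation": (0.05, 0.10),
}


def true_cost_range(base_cost: float) -> tuple[float, float]:
    """Return (low, high) estimates, assuming overheads add up linearly."""
    low = base_cost * (1 + sum(lo for lo, _ in OVERHEADS.values()))
    high = base_cost * (1 + sum(hi for _, hi in OVERHEADS.values()))
    return low, high


low, high = true_cost_range(1000.0)  # roughly $1,250 to $1,600
buffered = 1000.0 * 1.5              # the recommended 1.5x buffer: $1,500
```

Under this additive assumption the overheads total 25–60%, so the recommended 1.5× buffer sits toward the upper end of the range.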

2026 Rate Card

Model Pricing Reference

Current market rates for all major models tracked by AICostHub. Rates verified June 2026 — report a discrepancy if you spot one.

Model	Provider	Tier	Input / 1M	Output / 1M	Batch Input	Batch Output
GPT-5.4	OpenAI	Frontier	$2.50	$15.00	$1.25	$7.50
Claude 4.6 Sonnet	Anthropic	Frontier	$3.00	$15.00	$1.50	$7.50
Gemini 3.1 Pro	Google	Mid-Tier	$2.00	$12.00	$1.00	$6.00
GPT-5 Nano	OpenAI	Budget	$0.05	$0.40	$0.025	$0.20
DeepSeek V4	DeepSeek	Budget	$0.28	$0.28	N/A	N/A
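The rate card can be encoded as a lookup table for per-request cost estimates. This is a sketch using the figures from the table above; the `Rate` type and `request_cost` function are illustrative names, not part of any provider SDK.

```python
from typing import NamedTuple, Optional


class Rate(NamedTuple):
    """Dollars per 1M tokens; batch fields are None where no batch rate is listed."""
    input_per_m: float
    output_per_m: float
    batch_input_per_m: Optional[float]
    batch_output_per_m: Optional[float]


RATES = {
    "GPT-5.4": Rate(2.50, 15.00, 1.25, 7.50),
    "Claude 4.6 Sonnet": Rate(3.00, 15.00, 1.50, 7.50),
    "Gemini 3.1 Pro": Rate(2.00, 12.00, 1.00, 6.00),
    "GPT-5 Nano": Rate(0.05, 0.40, 0.025, 0.20),
    "DeepSeek V4": Rate(0.28, 0.28, None, None),
}


def request_cost(model: str, input_tokens: int, output_tokens: int,
                 batch: bool = False) -> float:
    """Cost in dollars for one request (or any token totals) on the given model."""
    rate = RATES[model]
    if batch:
        if rate.batch_input_per_m is None or rate.batch_output_per_m is None:
            raise ValueError(f"No batch rate listed for {model}")
        in_rate, out_rate = rate.batch_input_per_m, rate.batch_output_per_m
    else:
        in_rate, out_rate = rate.input_per_m, rate.output_per_m
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```

For example, `request_cost("GPT-5.4", 1_000_000, 1_000_000)` gives $17.50 at standard rates and $8.75 with `batch=True`.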

Terminology

AI Pricing Glossary

Plain-English definitions for every term you'll encounter when working with LLM APIs — from fundamentals to advanced optimization techniques.

Token (Core)

The fundamental unit of text that language models process and charge for. One token is roughly 0.75 English words on average. LLM API pricing is typically quoted in dollars per million tokens, not per word or character.

Input Token (Pricing)

Tokens in your prompt: the system instruction, user message, conversation history, and any context you pass to the model. Input tokens are usually cheaper than output tokens (5–8× in the rate card above, except DeepSeek V4, which charges the same rate for both) because a prompt is processed in a single parallel pass, while output is generated one token at a time.

Output Token (Pricing)

Tokens generated by the model in its response. Output tokens require one GPU forward pass per token, making them computationally expensive. For generation-heavy workloads, output costs dominate the total bill — often 70–85% of total spend.

Context Window (Core)

The maximum number of tokens a model can process in a single request, including both input and output. Claude 4.6 Sonnet: 200K tokens. GPT-5.4: 128K tokens. Gemini 3.1 Pro: up to 2 million tokens. Larger windows enable full-document processing without chunking.

Batch Processing (Pricing)

A pricing mode where requests are submitted in bulk and processed within 24 hours instead of in real time. The rate card above lists batch rates for the OpenAI, Anthropic, and Google models at a flat 50% discount on all token costs; no batch rate is listed for DeepSeek V4. Ideal for any non-interactive background workload: pipelines, enrichment, generation at scale.

Prompt Caching (Technique)

A feature that caches the beginning of a prompt so it doesn't need to be fully re-processed on every call. Cached tokens are re-billed at approximately 10% of the standard input rate — up to 90% savings on stable, repeated system prompts and context.
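A quick sketch of the arithmetic: the cached prefix is billed at about 10% of the input rate and the rest of the prompt at the full rate. The function name and the example token counts are illustrative, and the sketch ignores any cache-write surcharge or cache expiry rules a provider may apply.

```python
def input_cost_per_call(cached_tokens: int, fresh_tokens: int,
                        in_rate: float,
                        cache_read_fraction: float = 0.10) -> float:
    """Input cost in dollars for one call.

    cached_tokens are billed at cache_read_fraction of in_rate
    (dollars per 1M tokens); fresh_tokens are billed at the full rate.
    """
    cached_cost = cached_tokens * in_rate * cache_read_fraction
    fresh_cost = fresh_tokens * in_rate
    return (cached_cost + fresh_cost) / 1_000_000


# Hypothetical call: 2,000-token stable system prompt + 500 fresh tokens at $3.00/1M.
without_cache = input_cost_per_call(0, 2_500, in_rate=3.00)   # $0.0075
with_cache = input_cost_per_call(2_000, 500, in_rate=3.00)    # $0.0021
savings = 1 - with_cache / without_cache                      # about 72%
```

The overall input saving is lower than 90% because only the cached prefix gets the discount; the larger the stable share of the prompt, the closer the saving gets to 90%.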

RAG: Retrieval-Augmented Generation (Infra)

An architecture where relevant documents are retrieved from a vector database and inserted into context at query time. Dramatically reduces token costs for large knowledge bases compared to inserting the entire corpus — only relevant chunks are sent per request.
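To make the comparison concrete, here is a minimal sketch of per-query context cost with and without retrieval. The corpus size, chunk size, and number of retrieved chunks are hypothetical values chosen for illustration; the $3.00 input rate is the Claude 4.6 Sonnet figure from the rate card. Embedding and vector database costs are not included.

```python
def context_cost_per_query(context_tokens: int, in_rate: float) -> float:
    """Dollar cost of the context tokens sent with a single query."""
    return context_tokens * in_rate / 1_000_000


IN_RATE = 3.00            # dollars per 1M input tokens
CORPUS_TOKENS = 150_000   # hypothetical knowledge base that still fits in context
CHUNK_TOKENS = 500        # hypothetical chunk size
TOP_K = 5                 # hypothetical number of chunks retrieved per query

full_corpus = context_cost_per_query(CORPUS_TOKENS, IN_RATE)          # $0.45
retrieved = context_cost_per_query(CHUNK_TOKENS * TOP_K, IN_RATE)     # $0.0075
ratio = full_corpus / retrieved                                       # 60x
```

Under these assumptions, retrieval sends 60× fewer context tokens per query; the real figure depends on corpus size and on how many chunks a good answer needs.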

Embedding (Infra)

A numerical vector representation of text, produced by an embedding model and used for semantic search in RAG pipelines. Priced separately from generation models, typically $0.02–$0.13 per million tokens. Re-embedding costs accumulate as your knowledge base grows and updates.

System Prompt (Core)

An instruction block sent before the user's message that sets the model's behavior, persona, and output format. System prompt tokens are billed as input tokens on every call — even when unchanged — making their length a significant ongoing cost driver at high volume.
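The effect of prompt length at volume is easy to estimate. The sketch below assumes English prose at 1.333 tokens per word and no prompt caching; the prompt lengths and request volume are hypothetical, and the function name is illustrative.

```python
def system_prompt_monthly_cost(prompt_words: int, requests_per_month: int,
                               in_rate: float,
                               tokens_per_word: float = 1.333) -> float:
    """Monthly dollars spent re-sending the system prompt on every call."""
    prompt_tokens = prompt_words * tokens_per_word
    return prompt_tokens * requests_per_month * in_rate / 1_000_000


# Hypothetical: 1M requests per month at $3.00 per 1M input tokens.
before = system_prompt_monthly_cost(600, 1_000_000, in_rate=3.00)  # about $2,399
after = system_prompt_monthly_cost(300, 1_000_000, in_rate=3.00)   # about $1,200
```

Halving the prompt halves this line item, which is why the prompt audit appears among the quick wins in the checklist.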

Few-Shot Prompting (Technique)

A prompting strategy that includes example input-output pairs to demonstrate desired behavior. Each example adds 100–300 tokens per request. Use selectively — zero-shot often suffices for well-defined tasks and avoids the compounding token cost of examples included in every call.

JSON Mode / Structured Output (Technique)

A configuration that forces model output to conform to a specific JSON schema. Eliminates format-related retry calls and is slightly more token-efficient than equivalent prose, since it removes transitional language. A best practice for any pipeline requiring machine-readable structured data.
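Even with structured output enabled, it is cheap to validate a response locally before deciding whether a retry is needed. The following is a minimal sketch using only the standard-library `json` module; the function name and the required-key check are illustrative and much simpler than full JSON Schema validation.

```python
import json
from typing import Optional


def parse_if_valid(raw: str, required_keys: set) -> Optional[dict]:
    """Return the parsed object if it is a JSON object with all required keys.

    Returns None when the text is not valid JSON, is not an object,
    or is missing a required key, so the caller can decide to retry.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not required_keys.issubset(data):
        return None
    return data


ok = parse_if_valid('{"name": "widget", "price": 9.5}', {"name", "price"})
bad = parse_if_valid('Sure! Here is the JSON you asked for', {"name", "price"})
```

Here `ok` is the parsed dictionary and `bad` is `None`; only the second case would justify spending tokens on another call.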

Token-to-Word Ratio (Core)

The average number of tokens per word for a given content type. English prose: ~1.333 tokens/word. Code: 1.5–2.0 tokens/word. Non-Latin scripts (CJK, Arabic): 2.0–4.0 tokens/word. Getting this ratio right is the foundation of any accurate API cost estimate.

Model Tier (Pricing)

A classification of AI models by capability and cost. Frontier: GPT-5.4, Claude 4.6 Sonnet — maximum quality, premium rates. Mid-tier: Gemini 3.1 Pro — strong balance of quality and cost. Budget: GPT-5 Nano, DeepSeek V4 — optimized for cost-sensitive high-volume tasks, up to 50× cheaper than frontier.

Retry Rate (Infra)

The percentage of API calls that fail and must be retried. A 5% failure rate in which every failed call consumes three retries adds 15% to your token consumption. Tracking retry rate as a KPI surfaces both prompt engineering problems (inconsistent output format) and infrastructure problems (rate limit tier).
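The 15% figure is the case where each failed call goes on to use all three retries. As a sketch, the two functions below compare that case with a model in which a retry stops as soon as an attempt succeeds. Both assume every attempt resends the full request, and the second assumes attempts fail independently at the same rate, which is a simplification. The function names are illustrative.

```python
def overhead_all_retries_used(failure_rate: float,
                              retries_per_failure: int) -> float:
    """Extra token consumption (as a fraction) if every failure burns all retries."""
    return failure_rate * retries_per_failure


def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Average attempts per call when retrying stops at the first success.

    Attempt k+1 only happens if the first k attempts all failed,
    so the expectation is the sum of failure_rate**k for k = 0..max_retries.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


upper = overhead_all_retries_used(0.05, 3)   # 0.15, i.e. +15%
typical = expected_attempts(0.05, 3) - 1     # about 0.0526, i.e. +5.3%
```

Which number applies depends on why calls fail: transient errors tend toward the lower figure, while a prompt that reliably produces malformed output tends toward the upper one.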

Rate Limit (Infra)

A per-minute or per-day cap on API requests or tokens, set by the provider based on your account tier. Exceeding limits returns a 429 error requiring retry logic. High-volume production deployments must implement queuing, exponential backoff, and request distribution to stay within limits.
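A minimal sketch of exponential backoff with jitter follows. `RateLimitError` is a hypothetical stand-in for whatever exception your client raises on a 429, and the delay parameters are illustrative defaults, not provider recommendations. The `sleep` parameter exists only so the function can be tested without waiting.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0,
                      max_delay: float = 30.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with capped exponential backoff.

    Uses "full jitter": each wait is a random duration between 0 and the
    capped exponential delay, which spreads out retries from many clients.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

If the provider returns a retry-after hint with the 429 response, honoring it is usually better than a computed delay; this sketch does not attempt that.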

LLM-as-a-Judge (Technique)

A quality evaluation approach where one language model scores or critiques the outputs of another. Provides scalable quality assurance without manual review, but incurs token costs proportional to evaluation sample rate and the length of outputs being scored. Factor into true cost-of-ownership calculations.
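The cost of a judge can be estimated from the sample rate and the length of what it reads and writes. In this sketch the judge's input is the output being scored plus a rubric prompt, and its output is a short verdict; all parameter names and example values are hypothetical.

```python
def judge_monthly_cost(requests_per_month: int, sample_rate: float,
                       avg_output_tokens: int, rubric_tokens: int,
                       verdict_tokens: int,
                       in_rate: float, out_rate: float) -> float:
    """Monthly dollars spent on LLM-as-a-judge evaluation.

    in_rate and out_rate are the judge model's dollars per 1M tokens.
    """
    evaluated = requests_per_month * sample_rate
    # The judge reads the scored output plus the rubric on every evaluation.
    input_tokens = evaluated * (avg_output_tokens + rubric_tokens)
    output_tokens = evaluated * verdict_tokens
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Hypothetical: 100,000 requests, 5% sampled, judged at GPT-5.4 rates.
judge_cost = judge_monthly_cost(
    100_000, 0.05,
    avg_output_tokens=1_000, rubric_tokens=300, verdict_tokens=100,
    in_rate=2.50, out_rate=15.00,
)  # about $23.75
```

Because cost scales linearly with the sample rate, lowering it is the most direct lever when evaluation spend grows.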

Action Plan

AI Cost Optimization Checklist

A prioritized action list for reducing your AI API spend. Check off each item as you implement it.

⚡ Quick Wins (Days 1–3)

Enable Batch Processing for all non-interactive API calls.
↓ 50% on eligible traffic

Audit system prompt length — every word costs tokens on every single call.
↓ 10–30% input cost

Add output length constraints to all prompts ("respond in under N words").
↓ 20–35% output cost

Set budget alerts at 50%, 75%, 90% of monthly target in provider dashboard.

Switch to JSON output mode for all structured data extraction tasks.
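If your provider dashboard does not support alerts at these thresholds, the check is simple to run yourself against a daily spend figure. A minimal sketch, with illustrative names and the 50%, 75%, and 90% thresholds from the item above:

```python
def crossed_alerts(spend: float, monthly_budget: float,
                   thresholds=(0.50, 0.75, 0.90)) -> list:
    """Return the budget thresholds that month-to-date spend has reached."""
    return [t for t in thresholds if spend >= t * monthly_budget]


alerts = crossed_alerts(spend=800.0, monthly_budget=1000.0)  # [0.5, 0.75]
```

In practice you would also remember which alerts have already fired so each threshold notifies only once per month.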

🔧 Infrastructure (Week 1–2)

Enable prompt caching for stable system prompts and repeated context.
↓ ~90% on cached input tokens

Implement cost attribution — tag every API call by team, feature, and use case.

Add intelligent retry logic with exponential backoff and jitter.

Validate structured outputs before retrying to avoid duplicate token costs.

Log token usage per request and track P95 cost as a KPI.
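Cost attribution and P95 tracking can both be derived from a per-request usage log. The sketch below assumes each log record is a `(tag, input_tokens, output_tokens)` tuple, which is a hypothetical format, and uses a nearest-rank percentile for simplicity.

```python
import math
from collections import defaultdict


def cost_of(input_tokens: int, output_tokens: int,
            in_rate: float, out_rate: float) -> float:
    """Dollar cost of one request; rates are dollars per 1M tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def spend_by_tag(records, in_rate: float, out_rate: float) -> dict:
    """Total spend per tag (team, feature, or use case)."""
    totals = defaultdict(float)
    for tag, input_tokens, output_tokens in records:
        totals[tag] += cost_of(input_tokens, output_tokens, in_rate, out_rate)
    return dict(totals)


# Hypothetical log records at $3.00 in / $15.00 out per 1M tokens.
records = [("search", 1_000, 500), ("search", 2_000, 1_000), ("chat", 4_000, 2_000)]
totals = spend_by_tag(records, 3.00, 15.00)
p95 = percentile([cost_of(i, o, 3.00, 15.00) for _, i, o in records], 95)
```

P95 is more useful than the mean here because a small number of very long requests often accounts for a large share of spend.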

📐 Architecture (Weeks 2–4)

Implement model tiering — route classification and extraction to budget models.
↓ 40–55% on routed traffic

Evaluate RAG vs. long-context insertion for knowledge-base applications.

Build a batch job queue for all background and scheduled AI processing.

Evaluate fine-tuning a smaller model on your specific task to replace a frontier model.
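Model tiering can start as a simple routing table. In this sketch the task-type labels are hypothetical, the model names and input rates come from the rate card above, and the blended-rate helper assumes routed and unrouted requests are the same size.

```python
# Task types routed to the budget tier, per the model tiering item above.
# The label strings are hypothetical; use whatever your pipeline emits.
BUDGET_TASKS = {"classification", "extraction"}

BUDGET_MODEL = "GPT-5 Nano"
FRONTIER_MODEL = "GPT-5.4"


def pick_model(task_type: str) -> str:
    """Route well-defined tasks to the budget tier, everything else to frontier."""
    return BUDGET_MODEL if task_type in BUDGET_TASKS else FRONTIER_MODEL


def blended_input_rate(budget_share: float, budget_rate: float = 0.05,
                       frontier_rate: float = 2.50) -> float:
    """Average input rate (dollars per 1M tokens) for a given budget-tier share."""
    return budget_share * budget_rate + (1 - budget_share) * frontier_rate


half_routed = blended_input_rate(0.5)  # $1.275 per 1M input tokens
```

Defaulting unknown task types to the frontier model is a deliberately conservative choice: a misrouted hard task costs quality, while a misrouted easy task only costs money.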

🔄 Ongoing Maintenance

Quarterly prompt audits — compress, remove redundancy, update examples.

Monitor new model releases — providers frequently cut prices on new model generations.

Track retry rate as a KPI — above 2% signals a prompt engineering or infrastructure issue.

Re-evaluate model tier assignments as budget model capabilities improve each quarter.

Compare budgeted vs. actual spend monthly — variance signals hidden waste to investigate.

Official Sources

Official Provider Pricing Pages

Always verify current rates on each provider's official documentation before making large budget commitments. AI pricing changes frequently — sometimes multiple times per quarter — and our calculator is updated to match.

OpenAI

GPT-5.4, GPT-5 Nano, o3, GPT-4o — full model pricing

openai.com/api/pricing →

Anthropic

Claude 4.6 Sonnet & Opus — rates and batch pricing

anthropic.com/pricing →

Google

Gemini 3.1 Pro and Flash — Vertex AI pricing

cloud.google.com →