How LLM API Pricing Works
Every major LLM API bills by the token, not by the API call. There are two categories of tokens, priced separately:
- Input tokens — the entire content sent to the model: system prompt + conversation history + user message + any documents or retrieved context. Every API call re-sends the full context.
- Output tokens — the tokens generated in the model's response. Output tokens are almost always 3–5× more expensive than input tokens.
Why the asymmetry? Processing input tokens uses the model's KV cache (efficient parallel computation). Generating output tokens requires sequential autoregressive sampling — each output token requires a full forward pass, making it computationally more expensive.
The unit of pricing is per million tokens (abbreviated as /MTok or /1M tokens). A request with 1,000 input tokens and 500 output tokens at $2.50/MTok input and $10/MTok output costs:
(1,000 / 1,000,000) × $2.50 = $0.0025 input (500 / 1,000,000) × $10.00 = $0.0050 output Total per request: $0.0075
At 10,000 such requests per day: $75/day, $2,250/month. Scale this to 100,000 requests/day and you are at $22,500/month — from a single LLM integration.
Model Pricing Comparison (2025)
Prices change frequently as competition intensifies. The figures below are accurate as of May 2025 — always verify against the provider's current pricing page before budgeting.
| Model | Input ($/MTok) | Output ($/MTok) | Context |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-4o mini | $0.15 | $0.60 | 128K |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| Claude 3.5 Haiku | $0.80 | $4.00 | 200K |
| Gemini 1.5 Pro | $1.25 | $5.00 | 2M |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M |
| DeepSeek V3 | $0.27 | $1.10 | 64K |
| Llama 3.3 70B (Groq) | $0.59 | $0.79 | 128K |
The cost difference between frontier and economy tiers is stark: GPT-4o vs GPT-4o mini is a 16× difference on input tokens. For applications where the mini/flash tier performs adequately, running everything on frontier models is one of the most common (and correctable) cost mistakes in AI product development.
Prompt Caching: The Biggest Cost Reduction Nobody Uses
Prompt caching allows the model to cache the KV (key-value) attention state of a repeated prompt prefix and reuse it across multiple requests. If you have a 10,000-token system prompt that stays constant across requests, without caching you pay for 10,000 input tokens every single request. With caching, you pay a fraction for cache reads instead.
Both Anthropic and OpenAI support prompt caching:
| Provider | Cache write | Cache read | TTL |
|---|---|---|---|
| Anthropic (Claude) | 1.25× base input price | 0.10× base input price (90% discount) | 5 minutes |
| OpenAI (GPT-4o) | Standard input price | 0.50× base input price (50% discount) | ~1 hour |
Example impact: A RAG application with a 20,000-token system prompt + retrieved context, serving 10,000 requests/day at Claude 3.5 Sonnet input price ($3.00/MTok):
- Without caching: 10,000 × 20,000 tokens × $3.00/MTok = $600/day
- With caching (90% cache hit rate): write cost + 10% full reads + 90% cache reads = approximately $75/day
- Savings: $525/day = $15,750/month
Prompt caching requires structuring your prompts so that the stable prefix (system prompt, few-shot examples, large document) comes first, followed by the variable user message. The cache is keyed on the exact token sequence — even a single token change invalidates the cache.
Batch Processing: 50% Off for Non-Real-Time Workloads
Both OpenAI and Anthropic offer batch APIs that process requests asynchronously (typically within 24 hours) at 50% of the standard per-token price. For workloads that do not require real-time responses, this is a straightforward cost halving.
Ideal batch processing use cases:
- Nightly classification or labelling of new database records
- Bulk content generation (product descriptions, summaries)
- Embedding generation for search indices
- Evaluation runs comparing prompt variations
- Offline document processing pipelines
Not suitable for batch: user-facing chat, real-time autocomplete, any workflow where a human is waiting for the response.
Cost Estimation: Building a Budget Model
Accurate cost projection requires three measured inputs for your specific use case:
- Average input tokens per request — system prompt + average conversation history + average user message + average retrieved context
- Average output tokens per request — measure this from 100+ real requests; do not guess
- Daily request volume — use your current traffic or a growth projection
def estimate_cost(
avg_input_tokens: int,
avg_output_tokens: int,
daily_requests: int,
input_price_per_mtok: float,
output_price_per_mtok: float,
) -> dict:
"""
Returns daily and monthly cost estimates.
Prices are per million tokens (MTok).
"""
input_cost_per_req = (avg_input_tokens / 1_000_000) * input_price_per_mtok
output_cost_per_req = (avg_output_tokens / 1_000_000) * output_price_per_mtok
total_per_req = input_cost_per_req + output_cost_per_req
daily_cost = total_per_req * daily_requests
monthly_cost = daily_cost * 30
return {
'cost_per_request': round(total_per_req, 6),
'daily_cost': round(daily_cost, 2),
'monthly_cost': round(monthly_cost, 2),
'input_fraction': round(input_cost_per_req / total_per_req * 100, 1),
}
# Example: RAG chatbot on GPT-4o
result = estimate_cost(
avg_input_tokens=3_000, # system + history + retrieved context + user msg
avg_output_tokens=400, # typical assistant reply
daily_requests=5_000,
input_price_per_mtok=2.50, # GPT-4o
output_price_per_mtok=10.00,
)
print(f"Cost per request: ${result['cost_per_request']:.4f}")
print(f"Daily cost: ${result['daily_cost']:,.2f}")
print(f"Monthly cost: ${result['monthly_cost']:,.2f}")
# Cost per request: $0.0115
# Daily cost: $57.50
# Monthly cost: $1,725.00