
LLM API Cost Calculator: Pricing, Optimization, and Real-World Budget Planning

LLM API costs can range from a fraction of a cent per request to several dollars — for the exact same task, depending on the model chosen. For startups and enterprises building AI-powered products, API costs are often the single largest variable infrastructure expense and the primary factor in model selection. This guide covers how LLM pricing actually works (input vs output tokens, batch vs real-time, prompt caching), per-model rate comparison across GPT-4o, Claude, Gemini, and DeepSeek, practical cost optimization strategies, and how to build a budget model for your specific use case.

How LLM API Pricing Works

Every major LLM API bills by the token, not by the API call. There are two categories of tokens, priced separately:

  • Input tokens: the entire content sent to the model, meaning system prompt + conversation history + user message + any documents or retrieved context. Every API call re-sends the full context.
  • Output tokens: the tokens generated in the model's response. Output tokens are typically 3–5× more expensive than input tokens, though the ratio varies by provider.
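
Because every call re-sends the full context, the input tokens billed over a multi-turn conversation grow much faster than the number of turns. The sketch below illustrates this under simplifying assumptions: fixed-size user messages and replies, and no history truncation or caching. The function name and the token counts are hypothetical.

```python
def cumulative_input_tokens(system_tokens: int, user_tokens: int,
                            reply_tokens: int, turns: int) -> int:
    """Total input tokens billed across a conversation of `turns` calls.

    Assumes each call sends the system prompt, the full prior history,
    and the new user message (no truncation, no caching).
    """
    total = 0
    history = 0  # tokens of earlier user messages and replies
    for _ in range(turns):
        total += system_tokens + history + user_tokens
        history += user_tokens + reply_tokens
    return total


# Hypothetical sizes: 500-token system prompt, 100-token messages, 300-token replies.
one_turn = cumulative_input_tokens(500, 100, 300, turns=1)
ten_turns = cumulative_input_tokens(500, 100, 300, turns=10)
print(one_turn, ten_turns)  # 600 24000
```

Ten turns cost 40 times the input of one turn here, not 10 times, which is why long chat sessions tend to dominate input spend.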

Why the asymmetry? Input tokens are processed together in a single parallel forward pass (the prefill phase), which uses the hardware efficiently. Output tokens are generated one at a time by autoregressive decoding: each new token needs its own forward pass, so output is more expensive to serve per token.

The unit of pricing is per million tokens (abbreviated as /MTok or /1M tokens). A request with 1,000 input tokens and 500 output tokens at $2.50/MTok input and $10/MTok output costs:

(1,000 / 1,000,000) × $2.50 = $0.0025 input
(500 / 1,000,000) × $10.00  = $0.0050 output
Total per request: $0.0075

At 10,000 such requests per day: $75/day, $2,250/month. Scale this to 100,000 requests/day and you are at $22,500/month — from a single LLM integration.

Model Pricing Comparison (2025)

Prices change frequently as competition intensifies. The figures below are accurate as of May 2025 — always verify against the provider's current pricing page before budgeting.

Model                  Input ($/MTok)  Output ($/MTok)  Context
GPT-4o                 $2.50           $10.00           128K
GPT-4o mini            $0.15           $0.60            128K
Claude 3.5 Sonnet      $3.00           $15.00           200K
Claude 3.5 Haiku       $0.80           $4.00            200K
Gemini 1.5 Pro         $1.25           $5.00            2M
Gemini 2.0 Flash       $0.10           $0.40            1M
DeepSeek V3            $0.27           $1.10            64K
Llama 3.3 70B (Groq)   $0.59           $0.79            128K

The cost difference between frontier and economy tiers is stark: GPT-4o vs GPT-4o mini is roughly a 17× difference ($2.50 vs $0.15 on input, $10.00 vs $0.60 on output). For applications where the mini/flash tier performs adequately, running everything on frontier models is one of the most common (and correctable) cost mistakes in AI product development.
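
To see what these rates mean for a concrete workload, the following sketch ranks the models in the table by monthly cost. The prices are copied from the table above; the function names and the 30-day month are assumptions made for illustration.

```python
# (input $/MTok, output $/MTok), copied from the table above (May 2025 figures).
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "Gemini 1.5 Pro": (1.25, 5.00),
    "Gemini 2.0 Flash": (0.10, 0.40),
    "DeepSeek V3": (0.27, 1.10),
    "Llama 3.3 70B (Groq)": (0.59, 0.79),
}


def monthly_cost(model, input_tokens, output_tokens, daily_requests, days=30):
    """Monthly cost in dollars for a fixed per-request token profile."""
    in_price, out_price = PRICES[model]
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * daily_requests * days


def rank_models(input_tokens, output_tokens, daily_requests):
    """Return (model, monthly cost) pairs sorted from cheapest to most expensive."""
    costs = {m: monthly_cost(m, input_tokens, output_tokens, daily_requests)
             for m in PRICES}
    return sorted(costs.items(), key=lambda item: item[1])


# Same workload as the earlier example: 1,000 input + 500 output tokens, 10,000 requests/day.
for model, cost in rank_models(1_000, 500, 10_000):
    print(f"{model:<22} ${cost:>9,.2f}/month")
```

The ranking depends on the input/output mix, so it is worth rerunning with your own token profile rather than comparing list prices alone.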

Prompt Caching: The Biggest Cost Reduction Nobody Uses

Prompt caching allows the model to cache the KV (key-value) attention state of a repeated prompt prefix and reuse it across multiple requests. If you have a 10,000-token system prompt that stays constant across requests, without caching you pay for 10,000 input tokens every single request. With caching, you pay a fraction for cache reads instead.

Both Anthropic and OpenAI support prompt caching:

Provider            Cache write             Cache read                             TTL
Anthropic (Claude)  1.25× base input price  0.10× base input price (90% discount)  5 minutes, refreshed on each cache hit
OpenAI (GPT-4o)     Standard input price    0.50× base input price (50% discount)  Typically 5–10 minutes of inactivity, at most ~1 hour

Example impact: A RAG application with a 20,000-token system prompt + retrieved context, serving 10,000 requests/day at Claude 3.5 Sonnet input price ($3.00/MTok), assuming the full 20,000-token prefix is identical across requests:

  • Without caching: 10,000 × 20,000 tokens × $3.00/MTok = $600/day
  • With caching (90% cache hit rate, misses billed as cache writes at 1.25×): 9,000 cache reads at $0.30/MTok ($54) + 1,000 cache writes at $3.75/MTok ($75) = approximately $129/day
  • Savings: about $471/day, or roughly $14,130/month
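
The caching arithmetic can be checked with a short sketch. It assumes Anthropic-style multipliers (1.25× for writes, 0.10× for reads) and that every cache miss is billed as a cache write; real hit rates depend on traffic patterns and the cache TTL. The function name is hypothetical.

```python
def daily_input_cost_with_cache(requests_per_day, prefix_tokens, base_price_per_mtok,
                                hit_rate, write_multiplier=1.25, read_multiplier=0.10):
    """Daily input cost for a cached prefix.

    Hits pay the cache-read rate; misses are assumed to pay the cache-write rate.
    """
    prefix_mtok = prefix_tokens / 1_000_000
    hits = requests_per_day * hit_rate
    misses = requests_per_day - hits
    read_cost = hits * prefix_mtok * base_price_per_mtok * read_multiplier
    write_cost = misses * prefix_mtok * base_price_per_mtok * write_multiplier
    return read_cost + write_cost


uncached = 10_000 * (20_000 / 1_000_000) * 3.00
cached = daily_input_cost_with_cache(10_000, 20_000, 3.00, hit_rate=0.90)
print(f"Uncached: ${uncached:,.2f}/day")
print(f"Cached:   ${cached:,.2f}/day")
print(f"Savings:  ${(uncached - cached) * 30:,.2f}/month")
```

Lowering `hit_rate` shows how quickly the benefit erodes: because writes cost more than uncached reads, a prefix that is rarely reused can end up costing slightly more with caching than without.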

Prompt caching requires structuring your prompts so that the stable prefix (system prompt, few-shot examples, large document) comes first, followed by the variable user message. The cache matches on the exact token sequence from the start of the prompt: changing even one token inside the prefix invalidates the cache from that point onward.
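
The effect of prompt ordering can be illustrated without calling any API. The sketch below uses characters as a rough stand-in for tokens and compares how much of the prompt two requests share when variable data is placed last versus first. All names and the placeholder policy text are hypothetical.

```python
# Stand-in for a large, stable system prompt (hypothetical content).
STABLE_PREFIX = ("You are a support assistant. Answer using the policy document.\n"
                 + "POLICY TEXT " * 200)


def cache_friendly(user_message: str, request_id: str) -> str:
    """Stable content first, per-request data last."""
    return STABLE_PREFIX + f"\n[request {request_id}]\nUser: {user_message}"


def cache_hostile(user_message: str, request_id: str) -> str:
    """Per-request data first: the prompts diverge almost immediately."""
    return f"[request {request_id}]\n" + STABLE_PREFIX + f"\nUser: {user_message}"


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


good = shared_prefix_len(cache_friendly("Where is my order?", "a1"),
                         cache_friendly("Cancel my plan", "b2"))
bad = shared_prefix_len(cache_hostile("Where is my order?", "a1"),
                        cache_hostile("Cancel my plan", "b2"))
print(f"Shared prefix, stable-first:   {good} characters")
print(f"Shared prefix, variable-first: {bad} characters")
```

Timestamps, request IDs, and user names in the system prompt are common culprits; moving them after the stable content keeps the reusable prefix intact.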

Batch Processing: 50% Off for Non-Real-Time Workloads

Both OpenAI and Anthropic offer batch APIs that process requests asynchronously (typically within 24 hours) at 50% of the standard per-token price. For workloads that do not require real-time responses, this is a straightforward cost halving.

Ideal batch processing use cases:

  • Nightly classification or labelling of new database records
  • Bulk content generation (product descriptions, summaries)
  • Embedding generation for search indices
  • Evaluation runs comparing prompt variations
  • Offline document processing pipelines

Not suitable for batch: user-facing chat, real-time autocomplete, any workflow where a human is waiting for the response.
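
One way to apply this rule is to tag each workload with whether a human is waiting and how long it can tolerate, then route accordingly. The following is a minimal sketch: the `Workload` type, the example workloads, and their costs are hypothetical, and the 24-hour window and 50% rate are taken from the description above.

```python
from dataclasses import dataclass

BATCH_PRICE_FACTOR = 0.50   # batch price as a fraction of the standard price
BATCH_WINDOW_HOURS = 24     # typical batch completion window


@dataclass
class Workload:
    name: str
    human_waiting: bool
    max_latency_hours: float
    standard_daily_cost: float  # cost at real-time prices, in dollars


def route(workload: Workload) -> str:
    """Send a workload to batch only if nobody is waiting and it can tolerate the window."""
    if workload.human_waiting or workload.max_latency_hours < BATCH_WINDOW_HOURS:
        return "realtime"
    return "batch"


def daily_cost(workload: Workload) -> float:
    factor = BATCH_PRICE_FACTOR if route(workload) == "batch" else 1.0
    return workload.standard_daily_cost * factor


workloads = [
    Workload("support chat", human_waiting=True, max_latency_hours=0.01, standard_daily_cost=120.0),
    Workload("nightly labelling", human_waiting=False, max_latency_hours=24, standard_daily_cost=200.0),
    Workload("eval runs", human_waiting=False, max_latency_hours=48, standard_daily_cost=80.0),
]
for w in workloads:
    print(f"{w.name:<18} {route(w):<9} ${daily_cost(w):,.2f}/day")
```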

Cost Estimation: Building a Budget Model

Accurate cost projection requires three measured inputs for your specific use case:

  1. Average input tokens per request — system prompt + average conversation history + average user message + average retrieved context
  2. Average output tokens per request — measure this from 100+ real requests; do not guess
  3. Daily request volume — use your current traffic or a growth projection
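
The first two inputs can be derived from logged API usage rather than guessed. The sketch below assumes each log record is a dict with `input_tokens` and `output_tokens` keys; these field names are hypothetical, so map them to whatever usage metadata your provider returns. The sample records are synthetic.

```python
from statistics import mean


def summarize_usage(records, min_samples=100):
    """Average and 95th-percentile token counts from logged requests."""
    if len(records) < min_samples:
        raise ValueError(f"need at least {min_samples} records, got {len(records)}")
    inputs = [r["input_tokens"] for r in records]
    outputs = sorted(r["output_tokens"] for r in records)
    # Nearest-rank 95th percentile, using integer arithmetic to avoid float rounding.
    p95_index = (95 * len(outputs) + 99) // 100 - 1
    return {
        "samples": len(records),
        "avg_input_tokens": mean(inputs),
        "avg_output_tokens": mean(outputs),
        "p95_output_tokens": outputs[p95_index],
    }


# Synthetic stand-in for real logs.
sample_logs = [{"input_tokens": 2_000 + 10 * i, "output_tokens": 300 + i} for i in range(200)]
summary = summarize_usage(sample_logs)
print(summary)
```

The 95th percentile is included because a few very long responses can pull real spend well above what the average suggests.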

Python cost projector:
def estimate_cost(
    avg_input_tokens: int,
    avg_output_tokens: int,
    daily_requests: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> dict:
    """
    Returns daily and monthly cost estimates.
    Prices are per million tokens (MTok).
    """
    input_cost_per_req  = (avg_input_tokens / 1_000_000) * input_price_per_mtok
    output_cost_per_req = (avg_output_tokens / 1_000_000) * output_price_per_mtok
    total_per_req = input_cost_per_req + output_cost_per_req

    daily_cost   = total_per_req * daily_requests
    monthly_cost = daily_cost * 30

    return {
        'cost_per_request': round(total_per_req, 6),
        'daily_cost':        round(daily_cost, 2),
        'monthly_cost':      round(monthly_cost, 2),
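        # Share of per-request cost that comes from input tokens, in percent (0.0 if the cost is zero)
        'input_fraction':    round(input_cost_per_req / total_per_req * 100, 1) if total_per_req else 0.0,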
    }

# Example: RAG chatbot on GPT-4o
result = estimate_cost(
    avg_input_tokens=3_000,    # system + history + retrieved context + user msg
    avg_output_tokens=400,     # typical assistant reply
    daily_requests=5_000,
    input_price_per_mtok=2.50,  # GPT-4o
    output_price_per_mtok=10.00,
)
print(f"Cost per request: ${result['cost_per_request']:.4f}")
print(f"Daily cost:       ${result['daily_cost']:,.2f}")
print(f"Monthly cost:     ${result['monthly_cost']:,.2f}")
# Cost per request: $0.0115
# Daily cost:       $57.50
# Monthly cost:     $1,725.00
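
If request volume is expected to grow, a flat monthly figure understates the budget. The sketch below compounds daily volume month over month; the 20% growth rate is a hypothetical input, the per-request cost is taken from the GPT-4o example above, and the function is self-contained rather than part of the projector.

```python
def project_monthly_costs(cost_per_request, daily_requests, monthly_growth,
                          months, days_per_month=30):
    """Return a list of (month, daily requests, monthly cost) with compounding growth."""
    projection = []
    volume = float(daily_requests)
    for month in range(1, months + 1):
        monthly_cost = cost_per_request * volume * days_per_month
        projection.append((month, round(volume), round(monthly_cost, 2)))
        volume *= 1 + monthly_growth
    return projection


# $0.0115 per request and 5,000 requests/day as in the example above; growth is assumed.
projection = project_monthly_costs(0.0115, 5_000, monthly_growth=0.20, months=6)
for month, volume, cost in projection:
    print(f"Month {month}: {volume:>6,} req/day  ${cost:>9,.2f}")
print(f"Six-month total: ${sum(cost for _, _, cost in projection):,.2f}")
```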

Frequently Asked Questions

How is LLM API pricing calculated?
LLM APIs charge per token, not per API call. There are two price tiers: input tokens (everything you send to the model — system prompt, conversation history, user message) and output tokens (the model's generated response). Output tokens are typically 3–5× more expensive than input tokens. The unit of pricing is "per million tokens" ($/MTok). Your cost = (input tokens / 1,000,000) × input price + (output tokens / 1,000,000) × output price.
Which LLM is cheapest for production use?
For cost per token, Gemini 2.0 Flash ($0.10/$0.40 per MTok) and GPT-4o mini ($0.15/$0.60 per MTok) are among the cheapest options from the major providers as of mid-2025. DeepSeek V3 offers competitive quality at $0.27/$1.10. For self-hosted open-source models (Llama 3, Mistral), compute cost depends on your GPU infrastructure but can reach $0.10–$0.50/MTok all-in at scale. The cheapest option is only the right answer if the model quality meets your application's requirements, so test before committing.
What is prompt caching and how much does it save?
Prompt caching stores the computed attention state (KV cache) of a stable prompt prefix on the provider's servers. Subsequent requests that share the same prefix pay a reduced "cache read" rate instead of the full input token price. Anthropic offers a 90% discount on cached tokens (0.10× the base rate); OpenAI offers 50%. For applications with large, repeated system prompts or few-shot examples, prompt caching can reduce input costs by 50–90%.
Is there a free tier for LLM APIs?
Google Gemini offers a free tier with rate limits (15 RPM, 1 million tokens/minute, 1,500 requests/day as of 2025). OpenAI, Anthropic, and most others generally require a paid account for sustained use, though some offer limited trial credits for new users. For development and testing, the free Gemini tier is often sufficient. Groq offers very fast inference at low cost for open-source models. For self-hosting, Ollama lets you run Llama 3 and Mistral models locally with no per-token cost, limited by your hardware.
How do I reduce LLM API costs without degrading quality?
The highest-impact cost reduction strategies are: (1) use a smaller model for tasks that do not require frontier capability (GPT-4o mini vs GPT-4o is roughly a 17× cost difference); (2) implement prompt caching if your system prompt is large and stable; (3) use batch processing (50% discount) for offline workloads; (4) shorten system prompts and strip unnecessary context from input; (5) use RAG instead of long-context stuffing, retrieving only relevant chunks; (6) set appropriate max_tokens limits to prevent runaway outputs.
How do I monitor and control LLM API spend?
All major providers offer spend dashboards and budget alerts: OpenAI has usage limits and email alerts at configured thresholds; Anthropic has usage dashboards with daily breakdowns. For application-level monitoring, track tokens per request (log them from API response metadata), set per-user rate limits in your backend, and alert on anomalous cost spikes (a prompt injection that causes 50,000-token outputs will show up immediately in spend monitoring). For multi-model applications, use a middleware layer like LangSmith, Helicone, or a custom logging solution to attribute costs to features or user segments.
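
Application-level tracking can start very small. The following sketch attributes estimated cost to features and users and flags unusually long outputs. The class, its threshold, and the feature and user names are hypothetical, and the token counts are assumed to come from the usage metadata in each API response.

```python
from collections import defaultdict


class SpendTracker:
    """Accumulates estimated cost per feature and user, and flags very long outputs."""

    def __init__(self, input_price_per_mtok, output_price_per_mtok,
                 output_alert_tokens=10_000):
        self.input_price = input_price_per_mtok
        self.output_price = output_price_per_mtok
        self.output_alert_tokens = output_alert_tokens
        self.cost_by_feature = defaultdict(float)
        self.cost_by_user = defaultdict(float)
        self.alerts = []

    def record(self, feature, user_id, input_tokens, output_tokens):
        """Log one request and return its estimated cost in dollars."""
        cost = (input_tokens * self.input_price
                + output_tokens * self.output_price) / 1_000_000
        self.cost_by_feature[feature] += cost
        self.cost_by_user[user_id] += cost
        if output_tokens > self.output_alert_tokens:
            self.alerts.append((feature, user_id, output_tokens, round(cost, 4)))
        return cost


tracker = SpendTracker(input_price_per_mtok=2.50, output_price_per_mtok=10.00)
tracker.record("chat", "user-1", input_tokens=1_000, output_tokens=500)
tracker.record("summarize", "user-2", input_tokens=3_000, output_tokens=50_000)
print(dict(tracker.cost_by_feature))
print(tracker.alerts)
```

In production the in-memory dictionaries would typically be replaced by a metrics store, and the alert list by a notification hook.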
