Updated June 2026. Real per-token prices for GPT, Claude, Gemini, DeepSeek and Llama — ranked by what you are actually doing: chat, RAG, content, batch and agents.
"Which LLM API is cheapest?" is the wrong question. The right one is cheapest for what you're actually doing. The model that wins a retrieval workload loses a content-generation workload, because input and output tokens are priced differently — output costs roughly 5× input on every frontier model. This guide ranks the cheapest option for each common workload, using live per-token pricing. Want an exact number for your own prompt? Drop it into the calculator — it shows the bill before you ship.
At the raw-price floor, three models are effectively tied:
DeepSeek is the cheapest of the three for output-heavy work because its output rate ($0.28) is the lowest on the market. For input-heavy work, the two $0.10 models edge ahead. See every rate side by side on the all-models page.
Chat is a balanced workload: a system prompt plus history on input, a few hundred tokens on output. Two levers decide the winner here — the output rate and prompt caching, because a chat app sends the same system prefix on every turn.
Winner: Claude Haiku 4.5 ($1 / $5) or Gemini 3.5 Flash ($1.50 / $9) once caching is on. Caching a static system prompt bills the cached prefix at ~10% of the input rate, which routinely cuts a chat app's bill 40–60%. The slightly higher headline rate buys noticeably better answer quality than the sub-$0.20 floor models, which matters when a user is reading every reply. Compare them directly: Gemini 3.5 Flash vs Claude Haiku 4.5.
RAG has a lopsided cost profile: long input (retrieved chunks + question), short output (the answer). That means the input rate dominates and the output rate barely matters.
Winner: Gemini 2.5 Flash-Lite or GPT-4.1 Nano ($0.10 input). At ten cents per million input tokens, stuffing 8K tokens of context into a query costs a fraction of a cent. One caveat from real testing: on hard questions the very cheapest model sometimes misses, and a slightly pricier model (Gemini 3.5 Flash) becomes the better economic choice once you count re-asks and wrong answers. Budget for quality, not just the sticker price.
Content generation flips RAG on its head: short input (a brief), long output (the article). Here the output rate is everything, and output is the expensive direction.
Winner: DeepSeek V3.2 ($0.28 output). It has the lowest output rate on the market by a wide margin. If you need stronger writing quality, Gemini 2.5 Flash ($2.50 output) is the next step up. Whatever you pick, cap max_tokens and prompt for terse drafts — a 50% cut in output length is a ~40% cut in total cost.
If a job can wait 24 hours — nightly summaries, data enrichment, eval runs, embedding refresh — every major vendor offers a flat 50% discount through their Batch API. That discount stacks on top of the rates above.
Winner: whatever you'd use for the synchronous version, halved. Run DeepSeek V3.2 output-heavy batch jobs and you are paying $0.14 per million output tokens. There is no cheaper way to process tokens at scale in 2026.
Autonomous agents burn tokens in long loops, so per-token price matters — but a cheap model that fails the task and loops again is not cheap. The economical pattern is two-tier routing: a budget model (Haiku, Flash-Lite, Nano) triages 70–80% of steps, and a flagship (GPT-5.4, Claude Sonnet 4.6) handles the hard 20%. This typically saves 60–85% versus running everything through the flagship, with no meaningful quality loss. For the hardest long-horizon coding, the Mythos-class models (Claude Fable 5) cost more per run but save engineer-hours — see the model list for current rates.
max_tokens is free money.By raw price, Gemini 2.5 Flash-Lite and GPT-4.1 Nano at $0.10/$0.40 per million tokens, with DeepSeek V3.2 ($0.14/$0.28) cheapest for output-heavy work. The best choice depends on your input/output ratio.
For output-heavy workloads, yes — DeepSeek V3.2's $0.28 output rate is the lowest among production models. For input-heavy RAG, the $0.10-input models are cheaper.
Anthropic, OpenAI and Google all support it, with different syntax. Cached input is billed at roughly 10% of the standard rate.
Paste your real prompt into the gpt-cost.com calculator, pick a model, and it shows per-call and at-scale cost with input and output priced separately.
Prices synced from each vendor's official pricing page, June 2026. For exact billing, use each provider's official tokenizer.