"The Subsidy I'm Living Inside: My Real Token Cost vs. What I Pay"

2026-06-23 [llm-economics, claude-code, codex, pricing, strategy, cash-flow]

0. TL;DR

14-day API-equivalent spend: $12,971 (~$927/day). Claude Code is 96% of it ($12,486); Codex is 4% ($485).
Inside Claude: Opus 4.8 $7,869, Fable 5 $3,916, Sonnet $502, Haiku $199.
95% of my tokens are cache reads — so the bill is not driven by model output. For Opus, cache-read ($3,716) and 1-hour cache-writes ($2,549) each individually cost more than all of Opus's actual output ($1,408).
Blended effective rate: ~$1.02 per million tokens, all-in. That number only exists because of caching; the same work uncached would be multiples higher.
Versus a ~$400/mo combined consumer subscription, two weeks of my consumption is worth ~32 months of subscription — and no single consumer plan would even permit this volume. The subsidy is real and enormous.
Why they do it: the data flywheel on verifiable-reward coding traces and switching-cost lock-in, enabled by a bet on ~10×/yr inference deflation. Confidence: moderate.
Three futures, each with a precedent: open-source converges (value → compute + distribution; Linux/Android), one model runs away to 70% (Google search / Wintel / TSMC), or a 5-way 20% split (airlines / cloud IaaS). Only the runaway case produces durable pure-play model-lab profits.

1. What I actually consumed (14 days, priced at API list)

Measured directly from ~/.claude/projects/**/*.jsonl (all projects, including subagent transcripts) and ~/.codex/sessions/**, timestamp-filtered to the last 14 days. Priced at June-2026 list rates (see Appendix).

Claude Code — $12,485.55

Model	Turns	Total tokens	Output tokens	API cost	Blended $/Mtok
claude-opus-4-8	28,581	7.78B	56.3M	$7,868.58	$1.012
claude-fable-5	11,494	2.48B	10.2M	$3,916.01	$1.582
claude-sonnet-4-6	13,231	0.85B	3.5M	$501.59	$0.593
claude-haiku-4-5	25,526	1.06B	2.6M	$199.36	$0.187
Total	78,832	12.17B	72.5M	$12,485.55	$1.026

Cost decomposition for the two that matter:

Model	input	output	cache-write 5m	cache-write 1h	cache-read
opus-4-8	$71	$1,408	$124	$2,549	$3,716
fable-5	$41	$512	$265	$693	$2,405
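
Each cell in the decomposition is a token count multiplied by the matching list rate from the Appendix. A minimal sketch of that arithmetic, assuming a hypothetical `RATES` table and `line_costs` helper (this is not the script used for the post). The only token counts used are the two the post states outright for Opus: 56.3M output and 254.9M 1-hour cache-write.

```python
# USD per 1M tokens, copied from the Appendix price list.
RATES = {
    "claude-opus-4-8": {"input": 5.00, "output": 25.00, "cache_write_5m": 6.25,
                        "cache_write_1h": 10.00, "cache_read": 0.50},
    "claude-fable-5": {"input": 10.00, "output": 50.00, "cache_write_5m": 12.50,
                       "cache_write_1h": 20.00, "cache_read": 1.00},
}


def line_costs(model, tokens):
    """Return {component: USD} for a dict of raw token counts (hypothetical helper)."""
    rates = RATES[model]
    return {kind: count / 1_000_000 * rates[kind] for kind, count in tokens.items()}


# Token counts quoted in the post for Opus 4.8.
opus = line_costs("claude-opus-4-8",
                  {"output": 56_300_000, "cache_write_1h": 254_900_000})
for kind, usd in opus.items():
    print(f"{kind}: ${usd:,.0f}")
```

With those inputs the 1-hour cache-write line comes out at roughly $2,549 against roughly $1,408 for output, matching the table.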

Codex — $485.42

In-window Codex is 100% gpt-5.5 (the older gpt-5.3-codex and the self-hosted -spark variants are all >14 days old, so they don't appear here).

Model	Sessions	Billed tokens	Breakdown	API cost	Blended $/Mtok
gpt-5.5	50	556.1M	31.4M uncached-in ($157) + 522.5M cached-in ($261) + 2.23M out ($67)	$485.42	$0.873
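
The Codex row can be reproduced from its own breakdown. A small sketch, assuming a hypothetical `codex_bill` helper and using the rounded token figures from the table, so the result lands a few tenths of a percent off the unrounded $485.42.

```python
# gpt-5.5 list rates in USD per 1M tokens (Appendix).
GPT_5_5 = {"uncached_in": 5.00, "cached_in": 0.50, "out": 30.00}


def codex_bill(input_m, cached_m, output_m, rates=GPT_5_5):
    """Price one usage total; all token arguments are in millions (hypothetical helper)."""
    uncached_m = input_m - cached_m  # cached tokens are a subset of input tokens
    parts = {
        "uncached_in": uncached_m * rates["uncached_in"],
        "cached_in": cached_m * rates["cached_in"],
        "out": output_m * rates["out"],  # output already includes reasoning tokens
    }
    total = sum(parts.values())
    billed_m = input_m + output_m
    return parts, total, total / billed_m


# 31.4M uncached + 522.5M cached = 553.9M input; 2.23M output.
parts, total, blended = codex_bill(input_m=553.9, cached_m=522.5, output_m=2.23)
print(f"total ${total:,.2f}, blended ${blended:.3f}/Mtok")
```

Cached input is about 94% of the billed tokens, which is why the blended rate sits far below the $5 input price.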

The two headline numbers

$12,971 over 14 days of API-equivalent value. ~$927/day. Claude 96% / Codex 4%.
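
For anyone checking the headline arithmetic, here is a short sketch using only figures stated in this post. The variable names are illustrative, and the ~$400/mo subscription figure is the one quoted in the TL;DR.

```python
claude_usd = 12_485.55
codex_usd = 485.42
days = 14
subscription_usd_per_month = 400  # combined consumer plans, per the TL;DR

total = claude_usd + codex_usd
per_day = total / days
claude_share = claude_usd / total
months_of_subscription = total / subscription_usd_per_month

print(f"total ${total:,.0f}, about ${per_day:,.1f}/day")
print(f"Claude share {claude_share:.0%}, Codex share {1 - claude_share:.0%}")
print(f"equivalent to about {months_of_subscription:.0f} months of subscription")
```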

2. The cost structure is the actual insight

Intuition says "you pay for what the model writes." Wrong, for this workload.

95.4% of my Claude tokens are cache reads (11.6B of 12.17B). At $0.10–$1.00/Mtok they're cheap per token, but the volume is so vast that cache-read is still the single largest line ($3,716 on Opus alone).
Output is a rounding error by volume (72.5M of 12.17B = 0.6%) but expensive per token. It's the second or third line, not the first.
The sleeper cost is the 1-hour cache write. Opus burned 254.9M tokens of 1h cache-write at $10/Mtok = $2,549 — more than its entire output bill. That's the price of keeping long-lived context (the 1M window, agent-fleet scaffolding) warm across a run. If I'm optimizing my own bill, that is the lever, not "write less."
Fable is the premium tax. Fable 5 lists at 2× Opus ($10/$50 vs $5/$25). 11,494 Fable turns cost $3,916 — nearly half of Opus's bill for a third of the tokens. "Fast mode" is not free; it's a deliberate 2× premium for latency.

Implication for me: my effective ~$1/Mtok is a caching artifact. The architecture (huge re-sent context, tiny diffs, heavy fan-out to subagents) is exactly the shape that caching subsidizes. If prompt caching ever got repriced toward true cost, my bill — not the median user's — would move the most.
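
A rough sensitivity check makes the repricing risk concrete. The sketch below takes Opus's dollar components from the decomposition table and scales individual lines. The `repriced_total` helper is hypothetical, and the 10× case assumes cache reads would otherwise bill at the plain input rate ($5 vs. $0.50 per Mtok), which is a simplification.

```python
# Opus 4.8 dollar components over 14 days, from the cost-decomposition table.
OPUS = {"input": 71, "output": 1408, "cache_write_5m": 124,
        "cache_write_1h": 2549, "cache_read": 3716}


def repriced_total(components, multipliers):
    """Total bill after scaling selected lines; unlisted lines keep a 1.0 multiplier."""
    return sum(usd * multipliers.get(kind, 1.0) for kind, usd in components.items())


baseline = repriced_total(OPUS, {})
doubled_reads = repriced_total(OPUS, {"cache_read": 2.0})
halved_output = repriced_total(OPUS, {"output": 0.5})
no_read_discount = repriced_total(OPUS, {"cache_read": 10.0})

for label, usd in [("baseline", baseline), ("cache-read x2", doubled_reads),
                   ("output x0.5", halved_output), ("cache-read x10", no_read_discount)]:
    print(f"{label}: ${usd:,.0f} ({usd / baseline:.2f}x)")
```

Under these assumptions, doubling the cache-read price adds about 47% to the Opus bill, while halving output saves about 9%.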

3. Why are they torching margin to hand me this?

Ranked by how much I think each explains the discount, with confidence.

(A) The data flywheel on verifiable-reward coding traces. — Primary. Moderate-high confidence. Coding is the one frontier use case with a cheap, automatic ground-truth signal: does it compile, do the tests pass, did the human accept the diff. That turns every session into labeled RL data of a quality you cannot buy. Math and code are where RLVR (reinforcement learning from verifiable rewards) actually works, because, unlike RLHF, the reward isn't a fickle human rating; it's a green test suite. My fleet runs — thousands of agent turns with tool calls, file edits, and accept/reject outcomes — are precisely the trajectories that train the next coding model. They are not discounting compute to sell me compute; they are buying my trajectories with discounted compute. The price you see is a data-acquisition cost booked as COGS.

(B) Switching-cost land-grab. — Primary. Moderate-high confidence. Coding is the stickiest, highest-willingness-to-pay, highest-retention surface in all of software. Once a team's muscle memory, CI, and agent scaffolding are wired to one CLI, the switching cost is real. Whoever owns the developer's default in 2026 owns a recurring, expanding seat for a decade. That's worth losing money on now. This is textbook Day-1-unprofitable platform capture.

(C) An explicit bet on ~10×/yr inference deflation. — Enabler. Moderate confidence. Per-token inference cost has fallen roughly an order of magnitude per year (better kernels, quantization, speculative decoding, cheaper FLOPs, distillation to smaller serving models). If you believe today's loss-leading price is next year's healthy-margin price at constant list, then subsidizing usage now is just pulling demand forward into a cost curve you're confident collapses. The discount is a loan against your own roadmap.

(D) Capex utilization. — Secondary. Moderate. The clusters are built and depreciating whether or not they're busy. Marginal inference on owned, idle GPUs is near-free relative to the sunk training and capex. Filling capacity at a low price beats stranded silicon. This argues for low marginal pricing specifically.

(E) Subscription/API arbitrage on power users. — Secondary. Moderate. Flat-rate subscriptions cap the lab's downside while harvesting the most valuable users and — crucially — their data (see A). A $200 "unlimited-ish" plan is a customer-acquisition + data-acquisition instrument with a built-in loss ceiling, not a profit center.
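
The bet in (C) can be put into numbers. A minimal sketch, assuming list price stays constant while serving cost falls 10× per year. The starting point (serving cost at 2× list price) is an invented illustration, not a measured figure, and `gross_margin_after` is a hypothetical helper.

```python
import math


def gross_margin_after(years, cost_to_price_today, deflation_per_year=10.0):
    """Gross margin at constant list price once serving cost has deflated for `years`."""
    cost_to_price = cost_to_price_today / deflation_per_year ** years
    return 1.0 - cost_to_price


def years_to_break_even(cost_to_price_today, deflation_per_year=10.0):
    """Time until serving cost equals list price (zero gross margin)."""
    return math.log(cost_to_price_today) / math.log(deflation_per_year)


# Hypothetical: today it costs $2 to serve every $1 of list-price usage.
today = gross_margin_after(0, cost_to_price_today=2.0)
next_year = gross_margin_after(1, cost_to_price_today=2.0)
break_even = years_to_break_even(2.0)
print(f"margin today {today:.0%}, in one year {next_year:.0%}, "
      f"break-even after {break_even:.2f} years")
```

In this toy case a -100% margin turns into roughly +80% within a year, which is the sense in which the discount is a loan against the roadmap.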

My read: It's (A)+(B) — buy the irreplaceable coding-trajectory data and capture the stickiest seat — financed by the confidence in (C). The "turf war" framing (pure share grab) is real but downstream: share matters because share is what produces the data and the lock-in. The deepest reason is the flywheel, not the market-share scoreboard.

4. "If they charged real cost tomorrow"

"Real cost" is two different numbers, and conflating them is the usual mistake.

Marginal inference cost (COGS). For most frontier models the API list price is already at or above marginal serving cost: public teardowns put frontier API gross margin meaningfully positive. So "charge real cost" at the per-API-token level would change little for a casual user. If anything, caching is the one place I'm plausibly priced below a fully attributed marginal cost, because keeping my multi-gigabyte context warm only pays off for them at fleet scale.

Fully-loaded cost (the part that's actually subsidized). The real subsidy is amortizing R&D + frontier training runs ($100M–$1B+ per generation) + the data-acquisition discount + the all-you-can-eat subscription over a user base that is mostly free or flat-rate. That is what's not in my per-token price. If a lab had to recover full loaded cost at today's volume with no outside capital, prices would have to rise materially — or volume would have to 10× to amortize the fixed cost — or the capital markets keep funding the gap. They are choosing door #3 (capital) plus a bet on door #2 (volume).

Concretely for me: the per-token price is roughly honest at the margin; the thing I'm getting for free is a share of a multi-billion-dollar R&D program and a subscription that ignores how much I actually burn. If that vanished tomorrow, my $927/day stays roughly $927/day at the API level — but the $400/mo flat-rate door slams shut, and that's the real gift I'd lose.

5. Three end-states (cash-flow + precedent)

All three cash-flow sketches are illustrative orders of magnitude. Scenario 1 normalizes each player to a notional $10B revenue year so the shapes are comparable; Scenarios 2 and 3 instead scale revenue to market share. Confidence: low; the point is the structure, not the digits.

Scenario 1 — Open source asymptotes to closed; the model layer commoditizes

The capability gap (frontier closed vs. best open weights) keeps shrinking until "good enough" open models cover ~90% of jobs. Inference becomes a feature, not a product. Value migrates off the model layer to (a) compute/hardware and (b) distribution + workflow ownership.

Annual ($B, illustrative)	Closed pure-play lab	Compute owner (NVIDIA/cloud)	App/workflow owner
Revenue	10	10	10
COGS (inference/silicon)	6	4	6
Gross margin	40%	60%	40%
R&D / training	6	2	1.5
Operating FCF	(2)	+4	+2.5

The pure-play model lab is squeezed: it must keep funding frontier training to stay ahead of free, but can't price for it because open weights cap the ceiling. Margin accrues to whoever owns the GPUs and whoever owns the customer.
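
The Scenario 1 table follows one identity: operating FCF is revenue minus COGS minus R&D, with every other cost ignored. A short sketch that recomputes the three columns from the table's own inputs (the `operating_fcf` helper is hypothetical):

```python
def operating_fcf(revenue, cogs, rnd):
    """Return (gross margin, operating FCF) in the table's simplified $B terms."""
    gross_margin = (revenue - cogs) / revenue
    return gross_margin, revenue - cogs - rnd


# (revenue, COGS, R&D) in $B, from the Scenario 1 table.
SCENARIO_1 = {
    "closed pure-play lab": (10, 6, 6),
    "compute owner": (10, 4, 2),
    "app/workflow owner": (10, 6, 1.5),
}

results = {name: operating_fcf(*row) for name, row in SCENARIO_1.items()}
for name, (margin, fcf) in results.items():
    print(f"{name}: gross margin {margin:.0%}, operating FCF {fcf:+.1f}")
```

The lab and the app owner share the same 40% gross margin; the whole gap in FCF comes from the R&D line.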

Precedents: Linux vs. proprietary UNIX — the OS went free and the money moved to Intel (hardware), Red Hat (support/services), and ultimately AWS (managed distribution). MySQL/Postgres commoditized the database engine; the profit pool moved to managed cloud DBs (RDS, Aurora, Snowflake). Android open-sourced the mobile OS and the value accrued to Google's services layer and Qualcomm/Samsung silicon. In every case the commoditized layer was real and useful — and captured almost none of the economic surplus.

My takeaway: in this world, I win as a buyer (cheap, swappable models) and the smart business is owning the workflow (which is, not coincidentally, what Digital Surface is) or owning compute — not being a model lab.

Scenario 2 — One model runs away to ~70% share

A capability + ecosystem lead compounds: best model → most usage → most (verifiable) data → best next model. Cash flow turns sharply positive and self-funds the moat — the leader pays for the next $1B training run out of operating cash while rivals raise dilutive capital.

Annual ($B, illustrative)	The 70% leader	A 10%-share also-ran
Revenue ($B; equals % share of a notional $100B sector)	70	10
Gross margin	65%	45%
R&D / training (absolute)	18	9
R&D as % of revenue	26%	90%
Operating FCF	+27	(4.5)

The leader spends more on R&D in absolute dollars yet a smaller fraction of revenue — the definition of an unassailable flywheel. The also-ran spends nearly its whole revenue chasing and still bleeds. NPV of the sector concentrates almost entirely in the leader; runners-up are acqui-hired or become niche.
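
The asymmetry is easier to see when R&D is expressed both ways. A sketch using the table's illustrative figures, with a hypothetical `lab_year` helper; the table rounds the leader's FCF down to +27.

```python
def lab_year(revenue, gross_margin, rnd):
    """Return R&D intensity and operating FCF for one lab-year (all $B, illustrative)."""
    gross_profit = revenue * gross_margin
    return {"rnd_share_of_revenue": rnd / revenue,
            "operating_fcf": gross_profit - rnd}


leader = lab_year(revenue=70, gross_margin=0.65, rnd=18)
also_ran = lab_year(revenue=10, gross_margin=0.45, rnd=9)

for name, row in [("leader", leader), ("also-ran", also_ran)]:
    print(f"{name}: R&D {row['rnd_share_of_revenue']:.0%} of revenue, "
          f"operating FCF {row['operating_fcf']:+.1f}")
```

The leader outspends the also-ran 2:1 in absolute terms while carrying less than a third of its R&D intensity.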

Precedents: Google in search (~90% share; the ad cash flow funds every other bet and a near-permanent data/quality moat). Wintel (Windows + Intel each ran away in their layer for ~two decades). TSMC in leading-edge foundry (one firm, structurally, because the capex + learning curve are self-reinforcing). The "best → most usage → most data → best" loop is the most common big-tech end-state.

Risks to the leader: antitrust (Google is the cautionary tale), a genuine step-change from a rival (the thing that ended prior runaways), or Scenario 1 eating the floor from below while the leader defends the ceiling.

My takeaway: if I'm betting which lab, this is the world where picking right matters most — and where being locked into the eventual loser is most expensive. Hence: build router-agnostic.

Scenario 3 — ~20% each across five (Mistral, Gemini, OpenAI/Codex, Anthropic/Claude, Grok)

A fragmented oligopoly. Near-parity models, cheap switching (router layers, multi-model apps), Bertrand-style price competition compressing prices toward marginal cost (and margins toward zero). Differentiation is by niche and distribution, not raw capability: coding → Claude/Codex, multimodal/search → Gemini, sovereign/EU + open weights → Mistral, X-ecosystem + realtime → Grok.

Annual ($B, illustrative)	Pure-play (Anthropic/Mistral/xAI)	Platform-subsidized (Gemini/Copilot)
Revenue (20% share each)	4	4
Gross margin (price war)	35%	35%
R&D / training	5	5 (cross-subsidized)
Adjacent profit pool backstop	none	Google ads / MS Office
Standalone Operating FCF	(3.6)	(3.6) but absorbed

The brutal asymmetry: in a price war, the survivors are the ones whose model losses are cross-subsidized by an adjacent monopoly — Gemini funded by Google ads, Copilot funded by Office/Azure. The pure-plays (Anthropic, Mistral, xAI) face the hardest financing path because they have no other profit pool to bleed from; they must either reach Scenario-2 escape velocity in a niche (Anthropic is explicitly trying this in coding/agents) or consolidate.

Precedents: airlines (commodity service, capacity wars, chronic sub-cost-of-capital returns, periodic bankruptcies). Telco carriers (undifferentiated pipes, margin set by the most desperate competitor). The kinder precedent is cloud IaaS top-3 (AWS/Azure/GCP) — an oligopoly that held decent margins because switching costs and scale economies were high; whether LLMs look like airlines or like cloud depends entirely on how high switching costs end up being (today: low and falling, which points toward airlines).

My takeaway: this is the best world for me as a buyer and the worst for a standalone model lab. Aggregate sector profit is lowest here; the consumer surplus (cheap, interchangeable intelligence) is highest.

Which is most likely?

A blend, time-phased: Scenario 3 at the commodity tier now (5 near-parity models, falling prices — exactly what the pricing table already shows: Opus 4.8 cut 3× vs 4.1, Mistral Large 3 cut 75%, Grok output cut ~58%), with the labs sprinting to convert a niche into Scenario 2 before the floor of Scenario 1 rises to meet them. Coding is the chosen niche for that escape attempt — which is exactly why my coding tokens are the most aggressively subsidized ones I buy.

6. What this means for me (the buyer)

I am the subsidized party. Act like it. $13k/two-weeks of value for a flat fee is a transfer from venture/cloud balance sheets to me. Use it heavily now — the fan-out fleet workloads are precisely what's cheapest under the current caching + subscription regime and most likely to reprice later.
Stay router-agnostic. In Scenario 2 the cost of being locked to the loser is catastrophic; in Scenario 3 the cost of not arbitraging across five near-parity vendors is leaving money on the table. Either way, abstraction over a single provider is the hedge. (I already do this via getLlmProvider() — keep it.)
The 1h-cache-write line is my real cost lever, not output. If I want to cut spend without cutting work, target long-lived context warmth, not verbosity.
The durable business is the workflow, not the model. In two of three scenarios, the model layer commoditizes and the surplus accrues to whoever owns the customer's workflow. That is the thesis Digital Surface is already built on — this analysis is a reason to lean harder into owning the surface and treating models as swappable fuel.
Expect a bifurcation, not a single price. A cheap commodity tier (Scenario 3 dynamics) plus a premium frontier tier (Scenario 2 aspirants charging for the lead). Buy the commodity tier for breadth; pay up for the frontier only where the lead actually changes the outcome.
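
One way to picture the router-agnostic point is a thin provider layer that picks a vendor by estimated cost for a given workload shape. The sketch below is a hypothetical illustration, not the `getLlmProvider()` mentioned above, whose implementation is not shown here. `Provider`, `stub`, and `pick_provider` are invented names, the `complete` callables are stand-ins for real SDK calls, and cache-write charges and quality differences are deliberately ignored.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Provider:
    name: str
    usd_per_mtok_in: float      # uncached input
    usd_per_mtok_cached: float  # cached input / cache read
    usd_per_mtok_out: float
    complete: Callable[[str], str]  # a real vendor SDK call would go here

    def estimate(self, in_m, cached_m, out_m):
        """Estimated USD for a workload given in millions of tokens."""
        return (in_m * self.usd_per_mtok_in
                + cached_m * self.usd_per_mtok_cached
                + out_m * self.usd_per_mtok_out)


def stub(name):
    """Stand-in completion function so the sketch runs without any SDK."""
    return lambda prompt: f"[{name}] {prompt}"


# Rates from the Appendix price list.
REGISTRY: Dict[str, Provider] = {
    p.name: p for p in (
        Provider("claude-opus-4-8", 5.00, 0.50, 25.00, stub("claude-opus-4-8")),
        Provider("gpt-5.5", 5.00, 0.50, 30.00, stub("gpt-5.5")),
    )
}


def pick_provider(in_m, cached_m, out_m, registry=REGISTRY):
    """Return the cheapest registered provider for this workload shape."""
    return min(registry.values(), key=lambda p: p.estimate(in_m, cached_m, out_m))


provider = pick_provider(in_m=1.0, cached_m=20.0, out_m=0.2)
reply = provider.complete("ping")
print(provider.name, reply)
```

Because callers only see `pick_provider` and `complete`, swapping or adding a vendor is a registry change rather than a rewrite, which is the hedge in both Scenario 2 and Scenario 3.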

Appendix — methodology & prices

Sources. Claude: every ~/.claude/projects/**/*.jsonl (incl. **/subagents/agent-*.jsonl), assistant messages, summed .message.usage.{input_tokens,output_tokens,cache_creation.{ephemeral_5m,ephemeral_1h}_input_tokens,cache_read_input_tokens} by .message.model, timestamp ≥ now−14d. Validated no iteration double-count (top-level output 72,471,157 vs. Σ iterations 72,474,123 = 0.004% diff). Codex: ~/.codex/sessions/**/rollout-*.jsonl, token_count events; total_token_usage is cumulative per session, so took the last per session (validated monotonic across all 50 sessions; Σ-last-total 556.1M vs Σ-per-turn 558.0M = 0.3%). OpenAI cached_input_tokens is a discounted subset of input_tokens; uncached = input − cached. OpenAI output_tokens already includes reasoning tokens — priced once.
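
The Claude aggregation described above can be sketched as follows. The usage field paths are the ones listed in the Sources paragraph. Everything else is an assumption: that assistant records carry a top-level `"type": "assistant"`, that the timestamp lives in a top-level ISO-8601 `"timestamp"` field, and the `sum_usage` name itself. This is not the script that produced the numbers.

```python
import json
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def sum_usage(lines, now, days=14):
    """Sum token usage per model over JSONL lines newer than `now - days`."""
    cutoff = now - timedelta(days=days)
    totals = defaultdict(lambda: defaultdict(int))
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        msg = rec.get("message")
        # Assumption: assistant turns are marked by a top-level "type" field.
        if rec.get("type") != "assistant" or not isinstance(msg, dict):
            continue
        usage = msg.get("usage")
        if not usage:
            continue
        # Assumption: top-level ISO-8601 timestamp, possibly with a trailing "Z".
        ts = datetime.fromisoformat(rec["timestamp"].replace("Z", "+00:00"))
        if ts < cutoff:
            continue
        cache = usage.get("cache_creation") or {}
        row = totals[msg.get("model", "unknown")]
        row["input"] += usage.get("input_tokens", 0)
        row["output"] += usage.get("output_tokens", 0)
        row["cache_write_5m"] += cache.get("ephemeral_5m_input_tokens", 0)
        row["cache_write_1h"] += cache.get("ephemeral_1h_input_tokens", 0)
        row["cache_read"] += usage.get("cache_read_input_tokens", 0)
    return {model: dict(row) for model, row in totals.items()}


# Tiny synthetic sample: one in-window turn, one stale turn, one user record.
usage = {"input_tokens": 10, "output_tokens": 5, "cache_read_input_tokens": 1000,
         "cache_creation": {"ephemeral_5m_input_tokens": 0,
                            "ephemeral_1h_input_tokens": 200}}
message = {"model": "claude-opus-4-8", "usage": usage}
sample = [
    json.dumps({"type": "assistant", "timestamp": "2026-06-20T10:00:00Z", "message": message}),
    json.dumps({"type": "assistant", "timestamp": "2026-05-01T10:00:00Z", "message": message}),
    json.dumps({"type": "user", "timestamp": "2026-06-20T10:00:01Z", "message": {"content": "hi"}}),
]
result = sum_usage(sample, now=datetime(2026, 6, 23, tzinfo=timezone.utc))
print(result)
```

To run it over real transcripts, the lines could come from files matched by `pathlib.Path.home().glob(".claude/projects/**/*.jsonl")`, with `now` set to `datetime.now(timezone.utc)`.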

Prices used (USD per 1M tokens), June 2026 list. Claude — Opus 4.8: 5 / 25 / cw-5m 6.25 / cw-1h 10 / read 0.50. Sonnet 4.6: 3 / 15 / 3.75 / 6 / 0.30. Haiku 4.5: 1 / 5 / 1.25 / 2 / 0.10. Fable 5: 10 / 50 / 12.50 / 20 / 1.00 (from the on-machine claude-api skill reference). gpt-5.5: in 5 / cached-in 0.50 / out 30 (OpenAI docs).

Caveats. (1) Subscription comparison uses standard public tiers; I don't encode which plan/seats are actually billed — and notably no single consumer subscription would permit this volume, so heavy fleet load realistically runs on metered API or multiple seats. The per-token value stands regardless. (2) "Run-rate" projections are illustrative only — velocity is elastic and deliberately bursty. (3) Scenario cash-flows are structural sketches, not forecasts. (4) Prices move fast (this whole thesis is about that); re-pull before quoting.