Neural Net Landing Page Optimization for OpenArcade
The Problem
OpenArcade is a collection of browser arcade games where every play session generates training data for a vision model learning to play from raw pixels. The landing page shows game cards and live stats. The goal: maximize the percentage of visitors who click into a game and actually play.
The question is whether a neural network — potentially running client-side in the browser — can observe user behavior in real time and dynamically adapt the page to increase click-throughs. This article surveys the state of the art, evaluates what's practical at different traffic levels, and proposes a phased implementation plan.
Bottom line: Full deep RL for landing page optimization is overkill and sample-starved at any realistic OpenArcade traffic level. The right approach is a phased progression: analytics-driven card reordering first, then Thompson Sampling bandits, then contextual bandits as traffic grows. Client-side inference is technically feasible with sub-millisecond latency for small behavioral models, but the bottleneck is training data, not inference speed. Start with instrumentation.
1. What Actually Works: RL and Bandits for Landing Pages
The Academic Landscape
True deep RL (DQN, PPO, SAC) applied to live landing page optimization is virtually nonexistent in production. The reasons are fundamental:
- Sample complexity: Deep RL needs orders of magnitude more interactions than website traffic can provide
- Sparse, delayed rewards: Conversions happen minutes to days after page load — temporal credit assignment is brutal
- Non-stationary environment: User populations shift; policies overfit to historical distributions
- Enormous action space: All possible page configurations form a mostly-discrete, combinatorial action space
What does work falls into two categories:
Evolutionary Computation (Evolv AI). The most rigorous production system, based on Risto Miikkulainen's work at UT Austin (arXiv:1703.00556, AI Magazine 2020). A human defines the search space (which elements, which variants), and an evolutionary algorithm breeds candidate designs across generations, evaluated on real users. Multi-armed bandits handle traffic allocation. Reported results: 20-200% improvements over human design. One case study showed 43.5% lift after 60 days with ~600K interactions.
Contextual Bandits. The practical RL variant that dominates production. Technically a single-step RL problem: the state is visitor context (device, referrer, time of day), the action is which page variant to serve, the reward is binary conversion. This sidesteps temporal credit assignment entirely. Statsig, Optimizely, Kameleoon, and the now-merged VWO/AB Tasty all implement this.
Multi-Armed Bandits vs. A/B Testing
| Dimension | A/B Test | MAB (Thompson Sampling) |
|---|---|---|
| Traffic allocation | Fixed 50/50 split | Dynamic, shifts to winner |
| Statistical rigor | Frequentist, controlled error | Bayesian, probability-of-best |
| Sample size needed | Pre-calculated, fixed | Equal or greater for same guarantees |
| Opportunity cost | High (50% traffic on loser) | Lower (shifts away from losers) |
| Best for | Causal knowledge | Revenue optimization during test |
Critical insight: MABs don't need less data to reach statistical certainty — they reduce regret (lost conversions) during the learning phase by directing traffic away from losers faster. Thompson Sampling is the best algorithm for this: it concentrates exploration where uncertainty is highest, handles delayed feedback well, and has logarithmic cumulative regret vs. linear for epsilon-greedy.
For detecting a 10% relative lift from a 30% baseline at 95% confidence / 80% power, a standard two-proportion power calculation gives roughly 3,800 visitors per variant regardless of method. Thompson Sampling just wastes fewer of those visitors on the loser.
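A sketch of that power calculation using the standard two-proportion z-test formula (online calculators differ slightly in the exact variant they use):

```python
from math import sqrt
from statistics import NormalDist

def sample_size_per_variant(p_base, rel_lift, alpha=0.05, power=0.80):
    """Per-variant n for a two-sided two-proportion z-test."""
    p1 = p_base
    p2 = p_base * (1 + rel_lift)
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, two-sided
    z_b = NormalDist().inv_cdf(power)           # power term
    p_bar = (p1 + p2) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p2 - p1) ** 2

n = sample_size_per_variant(0.30, 0.10)   # 30% baseline, 10% relative lift
```

The result is in the high-3,000s per variant; halving the baseline rate or the lift roughly quadruples the requirement.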
Existing Tools and Platforms
Google Optimize is dead (sunset September 2023). Here's what replaced it:
| Tool | Type | Under the Hood | Pricing | Best For |
|---|---|---|---|---|
| Statsig | MAB + contextual bandits | Thompson Sampling, automatic attribute encoding | Free: 2M events/mo. Pro: $0.05/1K above 5M | Best free tier |
| GrowthBook | A/B + MAB | Bayesian stats, self-hostable Docker | Open source (MIT). Bandits in Pro | Self-hosted |
| Evolv AI | Evolutionary + MAB | Population-based search with crossover/mutation | Enterprise (~$50-150K/yr) | Multivariate at scale |
| Webflow Optimize | ML multivariate | Continuous learning, per-segment optimization | Webflow add-on | Webflow users |
| VWO + AB Tasty | Full-stack experimentation | MAB + "Evi" AI agent for test setup | Enterprise ($10-50K/yr) | Full-stack |
| Vowpal Wabbit | Contextual bandits | IPS, direct method, doubly robust evaluation | Open source (Microsoft) | Build-your-own |
For OpenArcade's scale and self-hosted constraint (Jetson Orin Nano), the practical options are GrowthBook (Docker, MIT license) or a homegrown Thompson Sampling implementation (~50 lines of JS).
2. Browser-Side Neural Net Inference: Current State
Can You Run a Model in the Browser?
Yes, and for behavioral prediction models, it's fast enough to be imperceptible.
Framework benchmarks (from ACM TOSEM 2024 study across 50 PCs + 20 mobile devices):
| Model | Size | WebGL (ms) | WASM (ms) |
|---|---|---|---|
| MobileNetV2 | 14MB | 20 | 89 |
| ResNet-50 | 98MB | 80 | 172 |
| Small MLP (behavioral) | 50KB | <1 | <1 |
For behavioral prediction (15-30 input features → intent score), you need a 2-3 layer MLP with 32-64 neurons. Model size: 10-50KB. Inference: <1ms via WASM in a Web Worker. You don't even need TensorFlow.js or ONNX — a 3-layer MLP is ~50 lines of typed-array JS.
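To make the scale concrete, here is a pure-Python sketch of the forward pass such a model computes; a typed-array JS version is a line-for-line translation. The weight shapes below are placeholders, not trained values:

```python
import math

def dense(x, weights, biases):
    """One fully connected layer. weights: one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(v):
    return [max(0.0, x) for x in v]

def intent_score(features, params):
    """Small MLP (two hidden layers) -> sigmoid intent score in (0, 1)."""
    h1 = relu(dense(features, params["W1"], params["b1"]))
    h2 = relu(dense(h1, params["W2"], params["b2"]))
    logit = dense(h2, params["W3"], params["b3"])[0]
    return 1.0 / (1.0 + math.exp(-logit))
```

With 30 inputs and two 64-unit hidden layers, the parameter count is about 6K floats, roughly 25KB as float32, consistent with the 10-50KB budget above.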
Backend Comparison
| Backend | Best For | Watch Out |
|---|---|---|
| WebGPU | Large transformers, matmul-heavy | Chrome/Edge only, ~85-90% coverage |
| WebGL | Medium models, broad compat | Shader warmup: up to 64x first-inference penalty. GPU contention degrades UI by up to 62.7% |
| WASM (SIMD+threads) | Small models, CPU-only, mobile | 4GB memory ceiling. Perfect for behavioral classifiers |
Recommendation for OpenArcade: WASM backend in a Web Worker. The model is tiny, inference is sub-millisecond, and there's zero risk of UI jank from GPU contention.
Model Size Guidelines
- Under 5MB: Loads instantly, negligible page-load impact. Ideal for behavioral classifiers.
- 5-50MB: Acceptable if lazy-loaded after interactive. MobileNet-class.
- 50-200MB: Must be cached in IndexedDB. Not suitable for embedded personalization.
3. What Signals to Capture
High-Predictive Signals (Research-Backed)
Based on the Smashing Magazine analysis by Eduard Kuric and SIGIR 2020 research on mouse movement representations:
Mouse/Cursor (highest signal-to-noise):
- Cursor velocity (mean/max) — browsing style indicator; max velocity flags frustration
- Hesitation time — average time from hover to click; measures decision difficulty
- Cursor-to-CTA trajectory — directional movement toward the call-to-action; predictive 3-5s before click
- Direction changes — trajectory reversals indicate confusion or comparison
- Path straightness ratio — direct distance / actual path; straighter = more decisive

Scroll:
- Scroll depth — single most-tracked engagement metric
- Scroll velocity — speed indicates skimming vs reading; pauses indicate interest
- Reverse scrolls — scrolling back up = high engagement

Temporal:
- Time-to-first-interaction — correlates with intent strength
- Viewport dwell time per section (via IntersectionObserver)

Contextual (no tracking needed):
- Device type, screen size, referrer source, time of day, connection speed, browser language
What's noise: Raw cursor position (layout-dependent), total click count without context, raw session duration (confounded by tab-away). Also: mouse DPI/OS acceleration curves differ across hardware — normalize velocity to percentiles, not absolute px/s.
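One way to implement that normalization, assuming a reference distribution of velocities collected from earlier sessions (an assumption; any representative offline sample works):

```python
from bisect import bisect_right

def percentile_rank(value, reference_sorted):
    """Map a raw reading (e.g. cursor velocity in px/s) to a 0-1 percentile
    rank against a pre-sorted reference distribution, so the feature is
    comparable across mice, DPIs, and OS acceleration curves."""
    if not reference_sorted:
        return 0.5  # no reference yet: return a neutral mid value
    return bisect_right(reference_sorted, value) / len(reference_sorted)
```

The same sorted reference array can ship inside the model bundle, so the client needs no server round-trip to normalize.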
Client-Side Architecture
MAIN THREAD                          WEB WORKER
===========                          ==========
Signal Collector     --transfer-->   Inference Engine
(passive listeners,                  (50KB MLP via WASM)
 rAF batching)                             |
       |             <--transfer--   Prediction
       v                             {intent: 0.73}
DOM Adapter
(rAF-batched writes,
 CSS order/opacity only)
Latency budget: Signal collection 0ms (passive) → feature extraction <3ms every 500ms → inference <1ms → DOM update <5ms (CSS-only). Total: ~500ms perception-to-adaptation, dominated by the collection window.
Cold-start: First 0-3s use contextual defaults (referrer × device × time of day). First behavioral signal available at 3s (scroll velocity, cursor trajectory). Confident adaptation at 10s+.
Key technique: Use CSS order on flex/grid containers to reorder sections without DOM manipulation. Use opacity, transform, visibility for visual changes — these don't trigger layout reflow.
Privacy
Client-side-only processing with no data exfiltration is lower risk than server-side tracking, but not automatically exempt from ePrivacy/GDPR. The ePrivacy Directive (Article 5(3)) covers "accessing information on terminal equipment" — which technically includes reading mouse positions via JavaScript.
Practical position: keep all processing in-memory (no localStorage/cookies for behavioral data), respect navigator.globalPrivacyControl, disclose in privacy policy under "automated decision-making."
4. What to Optimize: Highest Leverage Elements
Research on what actually moves the needle for landing page conversion:
Ranked by Expected Impact for OpenArcade
1. Card ordering (15-40% relative lift)
The serial position effect (CXL research) shows position 1 gets 10.5% click-through vs 7.3% for position 5 — a 44% relative difference just from position. The first 3-4 cards above the fold capture disproportionate attention. Simply reordering by actual engagement data from the recorder is the single highest-impact change.
2. Reducing visible cards (10-30% lift)
Landingi research found pages with fewer than 10 elements convert at 2x the rate of pages with 40+. Showing 6-8 top games with a "Show All" expansion could increase engagement with visible games.
3. Adding explicit CTA (10-28% lift)
Cards are clickable <a> tags but have no explicit "Play Now" button. VWO research shows CTA buttons outperform text-based CTAs by up to 28%. Unbounce documented a case where a three-word CTA change produced 104% conversion lift.
4. Mission bar position (5-15% lift)
The mission bar sits between the hero and the game grid. On smaller screens it may push popular games below the fold. LandingPageFlow research found moving primary CTAs above the fold produced 101% increase in clicks.
5. Hero copy and card descriptions (5-10% lift each)
NEW: Card Visual Treatment (30-200% relative lift)
This is the highest-potential optimization dimension, but it was missing from the original analysis because the current cards are text-only. The question: does adding a visual preview — a static screenshot or a short looping gameplay video — dramatically increase the rate at which visitors click through and actually play?
The research strongly suggests yes:
Static images vs. text-only: - Landing pages with relevant images convert 21% higher than text-only layouts (Zebracat 2025) - Artist photos replacing text descriptions produced a 95% lift in one VWO case study (VWO) - On Steam, the capsule image (the small card thumbnail) is the single biggest driver of click-through — most users decide from the image alone, never reading the description (Phiture/ASOStack)
Animated previews vs. static images: - Animated GIF recipients clicked through 203% more than those shown static images in a MarketingSherpa A/B test (MarketingSherpa) - Cinemagraphs (subtle looping animations) drove 110% higher engagement than still photos in Microsoft's Twitter ad experiments, with cost per engagement dropping 45% (MarketingProfs 2018) - Animated ads averaged 7% higher conversion rate overall, but 6-9 second animations achieved 138% higher conversions (AdEspresso) - itch.io supports animated GIF thumbnails on hover, and indie developers consistently report these are critical for standing out in browse feeds (itch.io docs)
The counterexample (important): - Apple's own A/B test on App Store product pages found that the page without a video preview outperformed the page with one (Apple Developer). The lesson: in fast-scroll browse contexts, a well-designed static thumbnail can outperform video because it's more scannable. This matters for OpenArcade because the grid shows 100 games — scroll speed is high.
The performance governor: - Every additional second of page load drops conversions by 4.42% (Portent) - Pages with images over 1MB average 9.8% conversion vs 11.4% without — a 14% relative decrease (involve.me) - 53% of mobile visitors leave if load exceeds 3 seconds (Instapage)
Format and file size guidance:
| Format | Size per card | Quality | Browser support |
|---|---|---|---|
| Animated GIF (320px) | 200-500KB | Mediocre (256 colors) | Universal |
| WebP animation (320px) | 80-150KB | Good | 97%+ |
| MP4 <video> (320px, 3s loop) | 30-80KB | Excellent | Universal |
| WebM <video> (320px, 3s loop) | 20-60KB | Excellent | 96%+ |
| Static WebP screenshot | 30-80KB | Excellent | 97%+ |
Recommendation: Use <video autoplay muted loop playsinline> with a WebP poster frame. This gives GIF-like behavior at 1/5th the bandwidth. The poster doubles as the static image arm. With lazy loading via IntersectionObserver, only above-fold cards load media initially, keeping initial page weight under 500KB.
Testing approach — two independent bandits:
The visual treatment question is independent of card ordering. Running them as a single combined test (5 orderings × 3 visual treatments = 15 arms) would require ~45 days to converge at 100 visitors/day. Instead, run two independent Thompson Sampling bandits:
- Ordering bandit (existing): which games appear first (classics, casual, action, random, default)
- Visual treatment bandit (new): how cards look (text-only, static image, autoplay video)
Each converges in ~10-15 days at 100 visitors/day. The interaction between ordering and visual treatment is unlikely to be significant at low traffic, and this approach gets answers 3x faster.
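The arithmetic behind the 3x claim, treating ~300 trials per arm as the convergence budget implied by the figures above:

```python
def days_to_converge(n_arms, visitors_per_day, trials_per_arm=300):
    """Back-of-envelope: each arm needs ~trials_per_arm samples, and
    visitors are split across arms, so total trials scale with n_arms."""
    return n_arms * trials_per_arm / visitors_per_day

combined = days_to_converge(15, 100)            # one 15-arm bandit: 45.0 days
independent = max(days_to_converge(5, 100),     # ordering bandit: 15.0 days
                  days_to_converge(3, 100))     # visual bandit: 9.0 days
```

Running the two bandits in parallel is gated by the slower one, so 15 days vs 45.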
The three visual treatment arms:
- text: Current cards. Title, description, controls. No image.
- image: Same text content, but with a static WebP screenshot at the top of each card showing a few seconds into gameplay.
- video: Same as image but the poster frame is replaced with a 3-5 second looping <video> that autoplays muted. Falls back to poster on mobile/slow connections.
5. Reward Function Design
Composite Reward
For a page where the goal is "user clicks into a game and actually plays":
reward = 0.2 * clicked
+ 0.4 * played_30s
+ 0.3 * min(play_duration / 300, 1.0)
+ 0.1 * returned_within_24h
The heavy weight on played_30s is deliberate: a click that leads to a bounce is worse than no click — it's a false positive that misleads the optimizer. The 30-second threshold filters accidental clicks and immediate bounces.
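The composite reward as the daily batch job might compute it (argument names are illustrative; Python booleans coerce to 0/1 in the arithmetic):

```python
def composite_reward(clicked: bool, played_30s: bool,
                     play_duration_s: float, returned_24h: bool) -> float:
    """Composite reward from above; play time saturates at 300s (5 min)."""
    return (0.2 * clicked
            + 0.4 * played_30s
            + 0.3 * min(play_duration_s / 300.0, 1.0)
            + 0.1 * returned_24h)
```

A click that bounces immediately scores 0.2 at most, while a full play session plus a return visit scores 1.0, which is exactly the asymmetry the optimizer needs to see.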
Two-Speed Feedback Loop
- Fast loop (seconds): Use the clicked signal only. Update bandit parameters immediately. ~100 signals/day at 100 visitors/day.
- Slow loop (daily batch): Compute the full composite reward by joining landing page events with recorder.js session data (same collector_id, same game, within 60s of click). Correct bandit parameters for the clickbait failure mode.
Attribution is straightforward: recorder.js already assigns a persistent collector_id via localStorage that links landing page visits → card clicks → game sessions → play duration → return visits.
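At this traffic level the join can be a linear scan; the field names below are assumptions based on the Phase 0 payloads and would need adjusting to the actual recorder.js schema:

```python
def attribute_session(click, sessions, window_s=60):
    """Match a card click to the recorder.js session it produced:
    same collector_id, same game, session starting within window_s
    after the click. Returns the session dict or None."""
    for s in sessions:
        if (s["collector_id"] == click["collector_id"]
                and s["game"] == click["game"]
                and 0 <= s["start_ts"] - click["timestamp"] <= window_s):
            return s
    return None
```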
6. Phased Implementation Plan
Phase 0: Instrumentation (Build First, 0 Traffic Needed)
Add a lightweight event tracker to index.html:
- Page load: {event: 'pageview', collector_id, viewport, referrer, timestamp, user_agent}
- Card impressions via IntersectionObserver: which cards were visible
- Card click: {event: 'card_click', collector_id, game, position, time_since_load}
- Scroll depth: max scroll percentage reached
Route: POST /api/events/landing on the ingest hub. Store as JSONL on the SSD.
Time to build: 1-2 days. Prerequisite for everything else.
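A minimal sketch of the ingest-side handler logic, independent of web framework (field names follow the payloads above; the log path is hypothetical):

```python
import json
import time

def append_landing_event(event: dict, log_path: str) -> dict:
    """Validate the minimum fields and append one JSONL line.
    The route handler for POST /api/events/landing would call this."""
    required = {"event", "collector_id"}
    missing = required - event.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    record = {**event, "server_ts": time.time()}  # server-side timestamp
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Appending JSONL keeps writes atomic enough at this volume; log rotation can come later.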
Phase 1: Analytics-Driven Static Ordering (100 visitors/day)
After 1-2 weeks of event data:
1. Python script reads JSONL log
2. Computes per-game quality = click_rate × avg(min(play_duration/300, 1))
3. Joins with recorder.js session data via collector_id
4. Outputs card_order.json
5. 5 lines of JS in index.html fetches it and reorders .games-grid children
6. Cron runs daily
Expected lift: 10-20%. Zero ML, just analytics.
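A sketch of steps 1-4; the event names and fields are assumptions about the Phase 0 schema, and the real script would join recorder.js sessions via collector_id rather than relying on a per-game session_end event:

```python
import json
from collections import defaultdict

def compute_card_order(events_path):
    """Rank games by quality = click_rate * avg(min(play_duration/300, 1)).
    Reads one JSON object per line from the Phase 0 event log."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    play_scores = defaultdict(list)
    with open(events_path) as f:
        for line in f:
            e = json.loads(line)
            game = e.get("game")
            if e["event"] == "card_impression":
                impressions[game] += 1
            elif e["event"] == "card_click":
                clicks[game] += 1
            elif e["event"] == "session_end":
                play_scores[game].append(min(e["play_duration"] / 300.0, 1.0))

    def quality(game):
        rate = clicks[game] / impressions[game] if impressions[game] else 0.0
        depth = (sum(play_scores[game]) / len(play_scores[game])
                 if play_scores[game] else 0.0)
        return rate * depth

    return sorted(impressions, key=quality, reverse=True)
```

The returned list is what the cron would write to card_order.json for the client JS to consume.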
Phase 2: Thompson Sampling Bandit (100+ visitors/day)
Pure client-side JavaScript, no server changes beyond a cron job:
- Define 5 card orderings as arms (by quality, by click rate, by play duration, by popularity, reverse/novelty)
- Bandit state in bandit_state.json: {arms: [{alpha: 1, beta: 1}, ...], orderings: [...]}
- Client JS samples from Beta(alpha_i, beta_i) for each arm, picks the highest, reorders cards
- Event payload includes which arm was shown
- Hourly Python cron reads events, updates alpha/beta, writes JSON
Thompson Sampling for Bernoulli bandits converges to near-optimal arm selection within ~200-300 trials per arm. With 5 arms at 100 visitors/day, meaningful convergence in 10-15 days. At 1,000/day, convergence in 1-2 days.
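The core of the homegrown option really is small. A Python sketch of both halves (the client-side JS mirrors choose_arm; the hourly cron runs record_outcome over the logged events):

```python
import random

def choose_arm(arms):
    """Thompson Sampling for Bernoulli rewards: draw one sample from each
    arm's Beta posterior and serve the arm with the highest draw.
    arms = [{"alpha": a, "beta": b}, ...] as in bandit_state.json."""
    draws = [random.betavariate(a["alpha"], a["beta"]) for a in arms]
    return max(range(len(arms)), key=draws.__getitem__)

def record_outcome(arms, arm, converted):
    """Conjugate Beta-Bernoulli update: success bumps alpha, failure beta."""
    if converted:
        arms[arm]["alpha"] += 1
    else:
        arms[arm]["beta"] += 1
```

Because sampling from the posterior is the exploration mechanism, no epsilon or decay schedule needs tuning.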
Expected lift over Phase 1: 5-15% additional.
Phase 3: Contextual Bandit (500+ visitors/day)
Replace static JSON with a lightweight Python Flask endpoint on the Jetson:
- Accepts context: {is_mobile, is_returning, hour_bucket, viewport_bucket}
- Returns card ordering via LinUCB or logistic Thompson Sampling
- Retrains hourly on batch of last 24h
Context-aware: mobile users might prefer simpler games (Snake, Flappy); returning visitors see games they haven't tried; evening visitors get different ordering than morning.
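A minimal sketch of the contextual step, simplified to one independent Thompson sampler per context bucket rather than full LinUCB (bucket keys are illustrative; this is a stepping stone, not the endpoint's final algorithm):

```python
import random
from collections import defaultdict

class BucketedThompson:
    """Simplest contextual bandit: an independent Beta posterior per
    (context bucket, arm). Works when buckets are few and coarse."""
    def __init__(self, n_arms):
        self.n_arms = n_arms
        # state[bucket] = [[alpha, beta], ...] one pair per arm
        self.state = defaultdict(lambda: [[1, 1] for _ in range(n_arms)])

    def _key(self, context):
        return (context["is_mobile"], context["hour_bucket"])

    def choose(self, context):
        arms = self.state[self._key(context)]
        draws = [random.betavariate(a, b) for a, b in arms]
        return max(range(self.n_arms), key=draws.__getitem__)

    def update(self, context, arm, reward):
        self.state[self._key(context)][arm][0 if reward else 1] += 1
```

The cost of bucketing is that data is split across buckets, which is why this only makes sense at 500+ visitors/day; LinUCB shares information across contexts instead.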
Expected lift over Phase 2: 5-15% additional.
Phase 4: In-Browser Behavioral Model (1,000+ visitors/day)
This is where the neural net enters:
1. Train a small MLP offline on accumulated behavioral data (mouse velocity, scroll patterns, hesitation → composite reward)
2. Export as a 50KB ONNX model
3. Load in a Web Worker via ONNX Runtime Web (WASM backend)
4. Capture behavioral signals passively for 3-10 seconds
5. Run inference (<1ms), get an intent score
6. Adapt the page: reorder cards, adjust visual emphasis, show/hide the mission bar
This phase is only justified when you have enough labeled behavioral data to train a model that outperforms the bandit heuristics. At 1,000 visitors/day, you accumulate ~30K labeled sessions/month — potentially enough for a simple MLP.
Phase 5: Full RL (10,000+ visitors/day)
Formulate as MDP per Amazon's offline DQN approach: state = user context + page config, action = layout variant, reward = composite signal. Train offline on logged interactions, deploy policy as inference endpoint.
Only justified at massive scale. Marginal lift over contextual bandits is typically 5-10%.
Summary
| Phase | Traffic | Complexity | Cumulative Lift | Time |
|---|---|---|---|---|
| 0: Instrumentation | 0 | Low | Baseline | 1-2 days |
| 1: Analytics ordering | 100/day | Low | 10-20% | 1 day |
| 2: Thompson Sampling | 100/day | Medium | 15-30% | 2-3 days |
| 3: Contextual bandit | 500/day | High | 20-40% | 1-2 weeks |
| 4: Browser behavioral model | 1,000/day | High | 25-45% | 2-4 weeks |
| 5: Full RL | 10,000/day | Very High | 30-50% | Months |
Recommendation
Start with Phase 0 immediately. Without landing page event tracking, nothing else is possible. The instrumentation is ~50 lines of JS and a new POST route on the ingest hub.
Phase 1 is the highest-ROI change. Reordering cards by actual engagement data will likely produce 10-20% lift with a day of work. This doesn't need ML — it needs analytics.
Phase 2 (Thompson Sampling) is the sweet spot for an HN launch. It's intellectually interesting ("our landing page uses multi-armed bandits to optimize in real time"), practically effective, and achievable in 2-3 days. It also generates a compelling narrative: "the same site that collects training data for our vision model also optimizes itself."
Skip straight to Phase 4 only if you're getting 1,000+ visitors/day consistently. Below that threshold, the bandit will outperform any neural net because it has better sample efficiency for this problem structure.
The in-browser neural net is the long-term play, but the bandit is the right tool for launch.
Open Questions
- What's the current traffic level? The entire phasing depends on daily visitor count. If the HN launch brings a spike, Phase 2 could converge in hours rather than weeks.
- Should the optimization be visible? An HN audience might appreciate transparency: "This page is optimizing itself — here's what arm you're seeing." This could be a feature, not a hidden trick.
- Card ordering vs. card content vs. card visual treatment: The bandit now tests ordering AND visual treatment as independent dimensions. The remaining question is whether to also test different descriptions/CTAs per game — a third bandit. At low traffic, two bandits is the practical limit.
- Interaction between landing page optimization and training data collection: If the optimizer concentrates traffic on 2-3 games, training data diversity drops. Should there be an exploration bonus that values training data diversity alongside click-throughs?
- Reward attribution latency: The composite reward requires joining landing page events with recorder.js sessions. Is the ingest hub already storing both in a way that supports this join, or does the schema need work?
Key Sources
- Miikkulainen et al. — CRO through Evolutionary Computation (arXiv:1703.00556)
- Qiu & Miikkulainen — Evolutionary CRO via MAB (AAAI 2019)
- Chapelle & Li — Empirical Evaluation of Thompson Sampling (Microsoft Research)
- Russo et al. — Tutorial on Thompson Sampling (Stanford)
- Anatomizing Deep Learning Inference in Web Browsers (ACM TOSEM 2024)
- Kuric — Mouse Interaction Data for ML (Smashing Magazine)
- Amazon — Offline Deep Q-Learning for Page Layout (RecSys 2022)
- Expedia — Contextual Bandits for Web Optimization
- CXL — Serial Position Effect in CRO
- Unbounce Smart Traffic Documentation
- Zebracat — Video Marketing Statistics 2025
- MarketingSherpa — Animated GIF vs Static Image A/B Test
- MarketingProfs — Cinemagraph Engagement Data (Microsoft)
- Phiture/ASOStack — Steam Page Optimization
- Portent — Site Speed and Conversion Rate
- Apple Developer — Product Page Optimization
- AdEspresso — Animation vs Still Images