Digital Surface Labs -- Research

Screen-Self-Driving: Training Analysis and Architecture Roadmap

Deep analysis of current model performance, failure modes, and proposed improvements for visual imitation learning on arcade games

1. Project Overview

Screen-Self-Driving (SSD) trains neural networks to play arcade games by learning from screen recordings paired with keyboard inputs. The system captures screenshots at 2 FPS, constructs 4-frame context windows at 84x84 pixels, and predicts which of 5 keys (UP, DOWN, LEFT, RIGHT, SPACE) should be pressed at each timestep. This is fundamentally a multi-label binary classification problem over visual sequences -- or, in the softmax formulation, a 6-class single-label problem (NOOP + 5 keys).
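The two formulations can be sketched in PyTorch as follows (a minimal sketch; the module names are illustrative, not the project's actual classes):

```python
import torch
import torch.nn as nn

KEYS = ["UP", "DOWN", "LEFT", "RIGHT", "SPACE"]

# Multi-label formulation: 5 independent sigmoid outputs; any subset of keys
# can be active simultaneously.
multi_label_head = nn.Linear(128, len(KEYS))
multi_label_loss = nn.BCEWithLogitsLoss()

# Softmax formulation: 6 mutually exclusive classes (NOOP + 5 keys); exactly
# one class per timestep.
softmax_head = nn.Linear(128, 1 + len(KEYS))
softmax_loss = nn.CrossEntropyLoss()

features = torch.randn(8, 128)                     # batch of 8 feature vectors
ml_logits = multi_label_head(features)             # (8, 5)
sm_logits = softmax_head(features)                 # (8, 6)

ml_targets = torch.randint(0, 2, (8, 5)).float()   # any key combination
sm_targets = torch.randint(0, 6, (8,))             # exactly one class

loss_ml = multi_label_loss(ml_logits, ml_targets)
loss_sm = softmax_loss(sm_logits, sm_targets)
```

The practical difference: the sigmoid head can express chords like LEFT+SPACE, while the softmax head cannot.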

The pipeline spans data generation (browser-based game simulators running on GCP VMs), data processing (frame extraction, key-state alignment, memory-mapped dataset creation), training (7 model architectures, from 25K to 10M parameters), and planned live inference (screen capture, model prediction, and optional CGEvent injection on macOS).

This report analyzes the only completed training run at scale (the "l4-large" run on Feb 15, 2026), identifies systematic failure modes, and proposes concrete architectural improvements.


2. Training Results: The Complete Picture

2.1 Dataset Summary

Training data was generated using browser-based arcade game simulators with automated play scripts. Sessions are 3 minutes long at 10 FPS capture, then downsampled to 2 FPS with 4-frame context windows, yielding approximately 357 training windows per session.
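The per-session window count follows directly from these numbers; a quick back-of-envelope check (assuming stride-1 sliding windows over the downsampled frames):

```python
SESSION_SECONDS = 3 * 60   # 3-minute sessions
DOWNSAMPLED_FPS = 2        # frames kept after downsampling from 10 FPS
CONTEXT_FRAMES = 4         # frames per context window

frames_per_session = SESSION_SECONDS * DOWNSAMPLED_FPS           # 360 frames
# A stride-1 sliding window of length 4 over 360 frames:
windows_per_session = frames_per_session - (CONTEXT_FRAMES - 1)  # 357

print(windows_per_session)  # → 357
```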

Game             Sessions   Train/Eval Split   Approx. Training Windows
Tetris           82         66 / 16            ~23,500
Tetris-Sim       60         48 / 12            ~17,100
Tetris-Sim-Test  1          1 / 0 (overfit)    ~357

The pipeline has since scaled to ~6,400 processed sessions across 7 games (tetris, snake, space-invaders, flappy, breakout, pong, asteroids), with 4 additional data-generation VMs (gen7-10) producing ~200 more hours. A "fulldata0" training run has been launched but results are pending.

2.2 Model Performance Comparison

Tetris (82 sessions, multi-label sigmoid models)

Model      Params  Hamming  Exact Match  F1-UP  F1-DOWN  F1-LEFT  F1-RIGHT  F1-SPACE  Epochs  Time
cnn_lstm   9.8M    0.651    0.162        0.353  0.0      0.349    0.512     0.475     31      5.1min
small_vit  9.8M    0.624    0.134        0.348  0.0      0.367    0.515     0.484     24      28.6min

Tetris-Sim (60 sessions, multi-label sigmoid models)

Model      Params  Hamming  Exact Match  F1-UP  F1-DOWN  F1-LEFT  F1-RIGHT  F1-SPACE  Epochs  Time
cnn_lstm   9.8M    0.665    0.177        0.382  0.0      0.276    0.612     0.762     44      4.8min
small_vit  9.8M    0.642    0.047        0.560  0.0      0.154    0.625     0.762     44      34.5min

Tetris-Sim-Test (1 session, overfitting diagnostic)

Model      Params  Hamming  Exact Match  F1-UP  F1-DOWN  F1-LEFT  F1-RIGHT  F1-SPACE
cnn_lstm   9.8M    0.689    0.0          --     --       --       --        1.0
small_vit  9.8M    0.756    0.222        0.941  --       0.800    --        1.0

Historical Tetris Results (earlier runs, smaller models, local training)

From the full experiment CSV, the scaling trajectory on tetris alone:

Sessions  Model       Hamming  Exact Match  F1-SPACE  F1-LEFT  F1-RIGHT
1         linear      0.774    0.057        0.0       0.0      0.107
25        medium_cnn  0.897    0.580        0.0       0.065    0.133
65        linear      0.635    0.141        0.451     0.367    0.499
107       medium_cnn  0.642    0.118        0.561     0.449    0.671
139       linear      0.629    0.098        0.673     0.431    0.523

2.3 Key Scaling Observations

  1. Hamming score degrades as data increases for simple models. The 25-session medium_cnn hit 0.897 hamming -- the highest across all runs -- but this was due to predicting "all zeros" (no keys pressed) and being correct most of the time. The model learned the prior distribution, not the actual mapping. As more varied data was added, this shortcut stopped working and hamming dropped.

  2. F1 scores improve with data, especially for high-frequency keys. F1-SPACE went from 0.0 at 1 session to 0.673 at 139 sessions. F1-LEFT and F1-RIGHT showed similar trajectories. This confirms that the models are learning meaningful features, not just the class prior.

  3. Training is extremely fast. Even the largest model (small_vit, 10M params) trains in under 35 minutes on the L4 GPU for 44 epochs. The bottleneck is data volume, not compute.


3. Failure Mode Analysis

3.1 The F1-DOWN = 0 Problem

The most striking pattern across every single training run is that F1-DOWN is exactly 0.0 -- no model, at any scale, has ever correctly predicted a DOWN key press.

Root cause: extreme class imbalance combined with contextual ambiguity.

In Tetris, DOWN is the "soft drop" key -- it accelerates the piece downward. Players use it sparingly compared to LEFT, RIGHT (piece positioning), SPACE (hard drop), and UP (rotation). Analysis of the training data shows DOWN is the rarest key press, occurring in roughly 2-5% of frames where any key is active. The model learns that predicting DOWN is almost never worth the loss penalty because:

  • The pos_weight compensation in BCEWithLogitsLoss does upweight DOWN positives, but the visual signal for "should soft drop now" is subtle -- it depends on the gap geometry below the current piece, which is hard to distinguish from the visual signal for "should hard drop" (SPACE).
  • In the softmax formulation (TinySoftmax, GameNet), DOWN competes directly with NOOP and SPACE. Since SPACE (hard drop) is almost always the better move when moving downward, the model converges to never predicting DOWN.
  • Even the 1-session overfitting test (small_vit, 0.756 hamming) did not learn DOWN -- suggesting the visual signal for "soft drop vs. hard drop vs. do nothing" is genuinely ambiguous at 84x84 resolution with 2 FPS capture.

Implications: To learn DOWN, the model likely needs either (a) much higher resolution to distinguish gap geometry, (b) frame differencing to detect piece velocity, or (c) explicit reward shaping that separates soft-drop scenarios from hard-drop scenarios in the training data. Alternatively, removing DOWN from the action space (since hard drop is almost always superior) may be the pragmatic choice for Tetris.
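The pos_weight compensation referenced above can be derived from per-key positive rates; a minimal sketch (the rates here are illustrative, with DOWN set to the ~3% frequency discussed above):

```python
import torch
import torch.nn as nn

# Illustrative per-key positive rates: UP, DOWN, LEFT, RIGHT, SPACE.
pos_rate = torch.tensor([0.15, 0.03, 0.25, 0.25, 0.20])

# pos_weight = (# negatives) / (# positives) per key, so a rare key like
# DOWN gets a much larger weight on its positive term in the loss.
pos_weight = (1.0 - pos_rate) / pos_rate
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.zeros(4, 5)                          # uninformative predictions
targets = torch.tensor([[0., 1., 0., 0., 1.]] * 4)  # DOWN and SPACE pressed
loss = criterion(logits, targets)
```

With these rates, DOWN's weight is 0.97/0.03 ≈ 32x, which upweights its gradient but, as noted, cannot manufacture a visual signal that is not there.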

3.2 Overfitting Capacity vs. Generalization Gap

The 1-session overfitting test is diagnostic: small_vit reaches 0.756 hamming and 0.222 exact match, with high F1 on UP (0.941) and LEFT (0.800) and perfect F1 on SPACE (1.0). This proves the architecture has sufficient capacity to memorize the input-output mapping for a single game session.

However, scaling to 82 sessions drops hamming to 0.624 and exact match to 0.134. The generalization gap is 0.132 hamming points. This gap tells us:

  • The model can represent the mapping (capacity is not the bottleneck).
  • The model cannot generalize across sessions with current features (the visual context at 84x84 is insufficient or the model needs better inductive biases for generalization).
  • cnn_lstm generalizes slightly better (0.651 vs. 0.624 hamming on 82 sessions), suggesting that explicit temporal modeling via LSTM provides a useful inductive bias even though both models have ~10M parameters.

3.3 small_vit vs. cnn_lstm: Different Failure Modes

Despite having the same parameter count (~10M), these models fail differently:

  • cnn_lstm achieves higher exact match (0.162 vs. 0.134) -- it is better at predicting the correct combination of keys for a given frame.
  • small_vit achieves higher per-key F1 on some keys -- F1-LEFT is clearly higher (0.367 vs. 0.349), while F1-UP is essentially tied (0.348 vs. 0.353). It is slightly better at detecting individual key presses but worse at combining them correctly.
  • cnn_lstm trains 5.6x faster (5.1min vs. 28.6min) because the LSTM over 4 frame features is computationally cheaper than self-attention over 576 patch tokens.

The architectural implication: the CNN-LSTM's bottleneck through a fixed-size LSTM hidden state forces it to compress temporal information in a way that helps multi-label coherence. The ViT's full attention over all patches lacks this compression, leading to better individual key detection but worse combinatorial accuracy.

3.4 The Tetris-Sim Advantage

Tetris-Sim (programmatic simulator data) shows higher F1-SPACE (0.762 for both models vs. 0.475/0.484 on real Tetris) and higher F1-RIGHT (0.612/0.625 vs. 0.512/0.515). This is likely because:

  • Simulator games have more consistent visual appearance (no browser chrome variation, consistent colors).
  • Automated play scripts produce more deterministic action patterns, making the mapping easier to learn.
  • The simulator may generate more situations where SPACE is the clearly correct action.

This suggests that visual consistency in training data matters significantly -- preprocessing to remove non-game visual elements (browser chrome, OS UI) could improve real-game performance.


4. Architecture Analysis: Current Models

4.1 Architecture Taxonomy

The 7 current architectures can be grouped by their approach to the two core challenges: spatial feature extraction and temporal reasoning.

No temporal modeling:

  • linear (15K params): Adaptive pool + single FC layer. Baseline. No capacity for learning spatial features or temporal dynamics.
  • small_cnn (358K params): DQN-style 3-layer CNN. Learns spatial features but stacks all 4 frames as 12 input channels, treating time as just more channels.
  • medium_cnn (2.4M params): Deeper 4-block CNN with BatchNorm and GAP. Same temporal treatment as small_cnn but with more capacity and regularization.
  • tiny_softmax (25K params): Two FC layers. Minimal capacity. 6-class softmax output.

Explicit temporal modeling:

  • cnn_lstm (4.8M params): Per-frame CNN encoder with shared weights, 2-layer LSTM over frame features. The LSTM's sequential processing provides explicit temporal ordering.
  • small_vit (10M params): Patches from all 4 frames fed into a single transformer with positional embeddings. Attention can model arbitrary spatial-temporal relationships but has no explicit temporal bias.
  • gamenet (1.2M params): Per-frame ResNet encoder with shared weights, multi-head self-attention over frame features, mean pooling. 6-class softmax output. The lightest temporal model.

4.2 Architectural Gaps

Reviewing the current model set against the state of the art in video understanding and game-playing agents, several gaps are apparent:

  1. No frame differencing or motion features. All models operate on raw pixel frames. Computing optical flow or simple frame differences would provide explicit motion signals that are critical for predicting actions in fast-paced games. The model currently has to learn to compute motion differences internally.

  2. No multi-scale spatial features. All models process frames at a single resolution. Games like Tetris require both fine-grained features (individual block gaps, piece shape) and coarse features (overall board state, height profile).

  3. No action history conditioning. The models predict actions from visual input only. In practice, the correct action often depends on what the player has been doing -- if you just moved left, you are more likely to continue left or press SPACE. Conditioning on previous predicted actions would reduce ambiguity.

  4. Coarse temporal resolution. At 2 FPS with 4-frame context, the model sees 2 seconds of gameplay. For fast games (Flappy Bird, Space Invaders), this may be insufficient. For Tetris, it covers one or two piece placements, which is reasonable.

  5. No game-conditioned prediction. When training across multiple games, the model has no explicit signal for which game is being played. The model must infer game identity from visual features, wasting capacity. A simple game-embedding or game-conditioned head would help.


5. Proposed Architecture Improvements

5.1 MotionNet: Frame-Differencing CNN-LSTM

Motivation: The biggest information gap in current models is the lack of explicit motion features. A piece falling in Tetris, a bullet moving in Space Invaders, and the bird's velocity in Flappy Bird are all encoded in frame-to-frame differences, which current architectures must learn implicitly.

Architecture:

Input: 4 frames x 3 x 84 x 84

Frame Differencing Branch:
  Compute 3 difference frames: diff[i] = frame[i+1] - frame[i]  (3 x 3 x 84 x 84)
  Compute abs + signed channels: (3 x 6 x 84 x 84)
  CNN encoder (shared weights): Conv(6,32,5,2) -> Conv(32,64,3,2) -> Conv(64,128,3,1) -> GAP
  Output: 3 x 128-dim motion features

Spatial Branch:
  Per-frame CNN encoder (shared weights): Conv(3,32,5,2) -> Conv(32,64,3,2) -> Conv(64,128,3,1) -> GAP
  Output: 4 x 128-dim appearance features

Fusion:
  Concatenate: [appearance_1..4, motion_1..3] -> 7 x 128-dim
  2-layer LSTM(128, 256) over the 7-step sequence
  Last hidden state -> FC(256, 128) -> FC(128, 5)

Total params: ~2.5M

Key design choices:

  • Both signed and absolute differences capture direction and magnitude of motion.
  • Shared CNN weights across frames and across motion/appearance branches reduce parameters while maintaining expressiveness.
  • Interleaving appearance and motion features in the temporal sequence lets the LSTM learn which motion corresponds to which visual state.

Expected improvement: F1 on all keys should improve because motion features directly encode piece velocity (DOWN prediction), lateral movement (LEFT/RIGHT), and rotation (UP). The motion branch should be especially helpful for F1-DOWN, which requires detecting whether a piece is falling slowly (soft drop active) vs. stationary.
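The MotionNet spec above can be sketched in PyTorch as follows (a minimal sketch: layer sizes follow the spec, but padding, activation, and the interleaved temporal ordering are assumptions):

```python
import torch
import torch.nn as nn


def make_encoder(in_ch: int) -> nn.Sequential:
    """Shared encoder: Conv(in,32,5,2) -> Conv(32,64,3,2) -> Conv(64,128,3,1) -> GAP."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, kernel_size=5, stride=2), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
        nn.Conv2d(64, 128, kernel_size=3, stride=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )


class MotionNet(nn.Module):
    def __init__(self, num_keys: int = 5):
        super().__init__()
        self.appearance_enc = make_encoder(3)   # per-frame RGB
        self.motion_enc = make_encoder(6)       # signed + absolute diff channels
        self.lstm = nn.LSTM(128, 256, num_layers=2, batch_first=True)
        self.head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, num_keys))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 4, 3, 84, 84)
        b, t = frames.shape[:2]
        diffs = frames[:, 1:] - frames[:, :-1]           # (B, 3, 3, 84, 84)
        motion = torch.cat([diffs, diffs.abs()], dim=2)  # (B, 3, 6, 84, 84)

        app = self.appearance_enc(frames.flatten(0, 1)).view(b, t, 128)   # (B, 4, 128)
        mot = self.motion_enc(motion.flatten(0, 1)).view(b, t - 1, 128)   # (B, 3, 128)

        # Interleave appearance and motion along time: a1, m1, a2, m2, a3, m3, a4.
        seq = torch.stack(
            [app[:, 0], mot[:, 0], app[:, 1], mot[:, 1], app[:, 2], mot[:, 2], app[:, 3]],
            dim=1,
        )                                                # (B, 7, 128)
        _, (h, _) = self.lstm(seq)
        return self.head(h[-1])                          # (B, 5) multi-label logits


logits = MotionNet()(torch.randn(2, 4, 3, 84, 84))
```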

5.2 EfficientGameNet: Multi-Scale Feature Extraction with Temporal Attention

Motivation: The current gamenet uses a ResNet-style encoder that processes frames at a single scale. Game screenshots contain information at multiple scales: individual pixels matter for detecting piece boundaries, but the overall board shape and fill level are global features. EfficientNet-style compound scaling provides a principled way to extract multi-scale features efficiently.

Architecture:

Input: 4 frames x 3 x 84 x 84

Per-Frame Multi-Scale Encoder (shared weights):
  Stage 1: MBConv(3, 24, expand=1, stride=2)  [42x42]  -> features_s1
  Stage 2: MBConv(24, 48, expand=6, stride=2) [21x21]  -> features_s2
  Stage 3: MBConv(48, 96, expand=6, stride=2) [11x11]  -> features_s3
  Stage 4: MBConv(96, 192, expand=6, stride=2) [6x6]   -> features_s4

  FPN-style fusion (features are globally pooled first, so "upsampling" reduces to a linear projection between feature dims):
    f4 = GAP(features_s4)                 (192,)
    f3 = GAP(features_s3) + Proj(f4)      (96,)
    f2 = GAP(features_s2) + Proj(f3)      (48,)
    frame_feature = concat([f2, f3, f4])  (336,)

Per-frame features: 4 x 336-dim

Temporal Transformer:
  4 learned temporal position embeddings
  2-layer TransformerEncoder(d=336, heads=4, ff=672)
  Mean pool over time -> 336-dim

Head:
  FC(336, 128) -> GELU -> Dropout(0.2) -> FC(128, 6)  [softmax]

Total params: ~1.8M

Key design choices:

  • MBConv (Mobile Inverted Bottleneck) blocks are more parameter-efficient than standard convolutions, achieving better accuracy per FLOP.
  • FPN-style multi-scale fusion brings fine-grained (block-level) and coarse (board-level) features into the same representation.
  • A temporal transformer with only 4 tokens (one per frame) is efficient and lets the model learn which frame contains the most decision-relevant information.

Expected improvement: Multi-scale features should help with actions that depend on fine spatial structure (soft drop decisions, piece rotation alignment). The FPN fusion is particularly relevant for Tetris, where both the immediate piece position and the overall board topology determine the correct action.
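A minimal sketch of one MBConv stage as specified above (squeeze-excitation is omitted for brevity, and skipping the residual when stride or channel count changes is an assumption about the intended design):

```python
import torch
import torch.nn as nn


class MBConv(nn.Module):
    """Mobile inverted bottleneck: expand 1x1 -> depthwise 3x3 -> project 1x1."""

    def __init__(self, in_ch: int, out_ch: int, expand: int, stride: int):
        super().__init__()
        mid = in_ch * expand
        layers = []
        if expand != 1:  # pointwise expansion
            layers += [nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.SiLU()]
        layers += [
            # depthwise spatial conv (groups=channels)
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.SiLU(),
            # linear projection back down
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        ]
        self.block = nn.Sequential(*layers)
        self.use_residual = stride == 1 and in_ch == out_ch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        return x + out if self.use_residual else out


# Stage 1 of the spec: MBConv(3, 24, expand=1, stride=2) on an 84x84 frame.
x = torch.randn(2, 3, 84, 84)
y = MBConv(3, 24, expand=1, stride=2)(x)   # (2, 24, 42, 42)
```

Chaining the four stages with these settings reproduces the 84 -> 42 -> 21 -> 11 -> 6 spatial schedule in the spec.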

5.3 ActionConditionedViT: Vision Transformer with Action History

Motivation: Current models treat each frame window independently, with no memory of past predictions. In practice, gameplay actions are highly autocorrelated -- a player moving left will often continue pressing left for several frames. Conditioning on recent action history reduces ambiguity and should improve exact match accuracy.

Architecture:

Input:
  frames: 4 x 3 x 84 x 84
  action_history: 4 x 5 (binary key states from previous 4 predictions)

Visual Encoding:
  Extract 12x12-pixel patches (a 7x7 grid) from each frame: 4 frames x 49 patches = 196 tokens
  Patch embedding: Linear(3*12*12, 256) per patch
  Add spatial position embeddings (49 learned positions, shared across frames)
  Add temporal embeddings (4 learned, added per-frame)

Action Encoding:
  action_history -> Linear(5, 256) per timestep -> 4 action tokens
  Add temporal position embeddings (same as visual)

Token Assembly:
  [CLS] + 196 visual tokens + 4 action tokens = 201 tokens

Transformer:
  6-layer TransformerEncoder(d=256, heads=4, ff=512, GELU, pre-norm)
  LayerNorm on CLS output -> 256-dim

Head:
  FC(256, 5) [sigmoid, multi-label]

Total params: ~5M

Key design choices:

  • Action tokens attend to visual tokens and vice versa via self-attention, letting the model learn correlations like "if I pressed LEFT last frame and the piece hasn't reached the wall yet, press LEFT again."
  • A larger patch size (12x12 instead of small_vit's 7x7) and a smaller embedding dim (256 vs. 384) cut the visual token count from 576 to 196, accommodating the additional action tokens without increasing compute.
  • During inference, predicted actions are fed back as the action history for the next prediction, creating an autoregressive loop.

Expected improvement: Exact match should improve significantly because the action history resolves multi-modal ambiguity (when the visual context alone is compatible with multiple actions, the recent action context disambiguates). This is especially important for sustained key presses (holding LEFT/RIGHT during piece positioning).
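The token assembly above can be sketched as follows (patching and embedding only; the transformer layers are omitted for brevity, and the variable names are illustrative):

```python
import torch
import torch.nn as nn

B, T, P, D = 2, 4, 49, 256          # batch, frames, patches per frame, embed dim

patch_embed = nn.Linear(3 * 12 * 12, D)
action_embed = nn.Linear(5, D)
cls_token = nn.Parameter(torch.zeros(1, 1, D))
spatial_pos = nn.Parameter(torch.zeros(1, P, D))    # shared across frames
temporal_pos = nn.Parameter(torch.zeros(1, T, D))   # one per frame

frames = torch.randn(B, T, 3, 84, 84)
action_history = torch.randint(0, 2, (B, T, 5)).float()

# Cut each 84x84 frame into 49 non-overlapping 12x12 patches.
patches = frames.unfold(3, 12, 12).unfold(4, 12, 12)        # (B, T, 3, 7, 7, 12, 12)
patches = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(B, T, P, -1)  # (B, T, 49, 432)

vis = patch_embed(patches)                                  # (B, T, 49, 256)
vis = vis + spatial_pos.unsqueeze(1) + temporal_pos.unsqueeze(2)
vis = vis.reshape(B, T * P, D)                              # 196 visual tokens

act = action_embed(action_history) + temporal_pos           # 4 action tokens

tokens = torch.cat([cls_token.expand(B, -1, -1), vis, act], dim=1)  # (B, 201, 256)
```

The resulting 201-token sequence is what the 6-layer transformer encoder would consume.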


6. What Would Improve Performance: Ranked by Expected Impact

6.1 More Data (HIGH IMPACT)

The scaling curve from 1 to 139 sessions shows monotonic improvement in per-key F1 scores. The project is scaling from 82 to 6,400+ sessions (78x increase). Based on log-linear scaling trends in the existing data:

  • F1-SPACE: Currently 0.475 at 82 sessions. Projecting the log-linear trend, 6,400 sessions should yield F1-SPACE > 0.85.
  • F1-LEFT/RIGHT: Currently 0.35-0.51. Should reach 0.7-0.8 with 6,400 sessions.
  • F1-UP: Currently 0.35. Should reach 0.6-0.7.
  • F1-DOWN: Unlikely to improve from data alone -- the issue is signal ambiguity, not sample count.
  • Exact match: Currently 0.16. May reach 0.3-0.4 with more data, but the ceiling is limited by multi-label independence assumptions.

Cost: Already underway. The 4 gen VMs plus fulldata0 training run represent ~$25 in cloud compute.

6.2 Multi-Game Training (HIGH IMPACT)

Training on all 7 games simultaneously (tetris, snake, space-invaders, flappy, breakout, pong, asteroids) provides two benefits:

  1. Cross-game transfer: All games use the same 5-key control scheme. A model that learns "object approaching from left means press RIGHT to dodge" from Space Invaders can transfer that to Pong and Breakout.
  2. Regularization through diversity: Multi-game data acts as a natural regularizer, preventing overfitting to game-specific visual patterns.

The run_cross_game() function in the experiment runner already supports this. The fulldata0 run trains per-game, but a cross-game run should be queued next.

6.3 Frame Differencing (MEDIUM-HIGH IMPACT)

As discussed in the MotionNet proposal, adding explicit frame differences as input features would directly address the motion-blindness that causes F1-DOWN = 0. This is a preprocessing change that benefits all architectures:

# In dataset __getitem__:
diffs = frames[1:] - frames[:-1]  # (3, 3, 84, 84)
abs_diffs = np.abs(diffs)
motion_input = np.concatenate([diffs, abs_diffs], axis=1)  # (3, 6, 84, 84)

This doubles the input channel count for difference frames but provides the model with explicit velocity and acceleration signals.

Cost: One-time code change in ArcadeDataset.__getitem__() and model input channel adjustments. No additional data needed.

6.4 Game-Conditioned Heads (MEDIUM IMPACT)

When training across 7 games, a single output head must predict actions for games with fundamentally different dynamics. A game-conditioned architecture adds a game embedding:

game_embed = self.game_embedding(game_id)  # (batch, 64)
features = torch.cat([visual_features, game_embed], dim=-1)
output = self.head(features)

This tells the model "this is Tetris, so UP means rotate" vs. "this is Snake, so UP means move up." The game ID can be stored in the dataset metadata (the tag field already exists).

Cost: Minor architectural change. Requires adding game ID to dataset output.

6.5 Better Augmentation (MEDIUM IMPACT)

Current augmentation is limited to brightness jitter, horizontal flip with LEFT/RIGHT swap, and random crop. Additional augmentations that would help:

  • Color jitter (hue, saturation, contrast): Makes models robust to game theme variations.
  • Cutout/random erasing: Forces the model to use multiple spatial cues rather than relying on a single pixel region.
  • Temporal jitter: Randomly dropping or duplicating frames in the context window to simulate variable FPS.
  • Mixup/CutMix: Blend two training examples to smooth the decision boundary.

The horizontal flip augmentation is well-designed (swapping LEFT/RIGHT labels), but it only applies to laterally symmetric games. For games like Snake (where flipping is valid in all directions), a vertical flip with UP/DOWN swap should also be added.
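The flip-with-label-swap logic can be sketched as follows (the key index order UP, DOWN, LEFT, RIGHT, SPACE is an assumption; the project's actual ordering may differ):

```python
import numpy as np

UP, DOWN, LEFT, RIGHT, SPACE = range(5)


def hflip_with_label_swap(frames: np.ndarray, keys: np.ndarray):
    """Mirror frames left-right and swap the LEFT/RIGHT labels to match.

    frames: (T, C, H, W) context window; keys: (5,) binary key vector.
    """
    flipped = frames[..., ::-1].copy()  # mirror along the width axis
    swapped = keys.copy()
    swapped[LEFT], swapped[RIGHT] = keys[RIGHT], keys[LEFT]
    return flipped, swapped


frames = np.zeros((4, 3, 84, 84), dtype=np.float32)
frames[..., 0] = 1.0                                  # mark the leftmost column
keys = np.array([0, 0, 1, 0, 0], dtype=np.float32)    # LEFT pressed

f2, k2 = hflip_with_label_swap(frames, keys)
# The marked column is now on the right edge, and the active label is RIGHT.
```

A vertical flip with UP/DOWN swap for direction-symmetric games like Snake would follow the same pattern on the height axis.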

6.6 Curriculum Learning (LOW-MEDIUM IMPACT)

Start training on simple games (Snake, Pong -- only 2-4 keys, simple dynamics) before introducing complex games (Tetris, Asteroids -- 4-5 keys, complex dynamics). This provides:

  1. A pre-trained encoder that understands basic visual-motor relationships.
  2. Gradual increase in action space complexity.
  3. Better gradient signal early in training (simple games have clearer visual-action mappings).

Implementation: Sort training data by estimated difficulty (Snake < Pong < Breakout < Flappy < Tetris < Space Invaders < Asteroids) and use a curriculum that introduces games progressively.

6.7 Higher Resolution Input (LOW-MEDIUM IMPACT)

The current 84x84 input resolution is inherited from the classic DQN paper. For Tetris specifically, the 10x20 game board at 84x84 resolution means each Tetris cell is approximately 4x4 pixels -- enough to detect piece presence but possibly insufficient to distinguish piece shapes or precise gap geometry needed for soft-drop decisions.

Increasing to 128x128 (already the frame_size used in processing, as shown by the training log metadata: "frame_size": 128) or 160x160 would improve spatial precision at the cost of increased compute. The existing ViT and CNN architectures handle variable resolution through adaptive pooling and interpolation.


7. Evaluation Methodology Critique

7.1 Metric Limitations

Hamming score (per-bit accuracy across the 5-key vector) is inflated by true negatives. When no key is pressed (the most common case), predicting all zeros gets ~80% hamming accuracy. This is why the 25-session medium_cnn hit 0.897 hamming -- it learned the trivial solution.

Exact match (all 5 keys correct simultaneously) is too strict. A prediction of [UP, LEFT] when the ground truth is [UP, LEFT, SPACE] gets 0 exact match despite being 80% correct. This makes exact match noisy for measuring progress.

Per-key F1 is the most informative metric. It rewards true positive predictions and penalizes both false positives and false negatives, making it robust to class imbalance. Macro-averaged F1 (the unweighted mean of the 5 per-key F1 scores) would be a better single summary metric than either hamming or exact match.

Recommendation: Adopt macro-averaged F1 across the 5 keys as the primary metric, with per-key F1 breakdown for diagnostics. Report exact match and hamming only as secondary context.

7.2 Missing Evaluation Dimensions

The current evaluation does not measure:

  1. Temporal coherence: Are predictions stable across frames, or does the model oscillate between actions? A model that predicts LEFT, RIGHT, LEFT, RIGHT across consecutive frames is unusable even if per-frame accuracy is reasonable.
  2. Game score: The ultimate metric is "can the model actually play the game?" None of the training runs have evaluated model predictions in a live game loop. The arcade_inference.py module exists but has not been integrated into the evaluation pipeline.
  3. Reaction time: For time-critical games (Flappy Bird, Space Invaders), the correct action changes rapidly. Evaluation should measure whether the model predicts the correct action within an acceptable temporal window, not just at the exact frame.
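A temporal-coherence measure of the kind described in point 1 could be as simple as a per-window flip rate (an illustrative definition, not an existing project metric):

```python
import numpy as np


def flip_rate(preds: np.ndarray) -> float:
    """Fraction of consecutive timesteps where any key prediction changes.

    preds: (T, 5) binary predictions over T consecutive frames.
    A model oscillating LEFT/RIGHT every frame scores near 1.0;
    a model holding its keys steadily scores near 0.0.
    """
    changes = np.any(preds[1:] != preds[:-1], axis=1)
    return float(np.mean(changes))


# Oscillating LEFT/RIGHT every frame vs. holding LEFT for the whole window.
osc = np.array([[0, 0, 1, 0, 0], [0, 0, 0, 1, 0]] * 3)
steady = np.array([[0, 0, 1, 0, 0]] * 6)
```

Reporting this alongside macro-F1 would catch models that score well per-frame but would be unusable in a live loop.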

8. Concrete Next Steps

8.1 Immediate (This Week)

  1. Wait for fulldata0 results. The L4 training run on 6,400+ sessions across 7 games and 7 models will provide the first real multi-game, at-scale results. Expect results within 24-48 hours unless the spot instance is preempted.

  2. Implement frame differencing. Add frame-difference channels to the ArcadeDataset as a configurable option. Test with cnn_lstm on tetris to measure F1-DOWN impact.

  3. Add macro-F1 as primary metric. Update compute_metrics() and compute_softmax_metrics() to report macro-averaged F1 across keys. Update the CSV output to include this metric.

8.2 Short-Term (Next 2 Weeks)

  1. Implement MotionNet. The frame-differencing CNN-LSTM described in Section 5.1 is the lowest-risk architectural improvement. It can reuse the existing training pipeline and dataset format with minimal changes.

  2. Run cross-game training. Use run_cross_game() with the full 6,400+ session dataset. Compare per-game performance of cross-game models vs. per-game specialists.

  3. Add game-conditioned output heads. Modify the cross-game training to use game embeddings. Compare against unconditioned cross-game models.

  4. Integrate live game evaluation. Modify arcade_inference.py to run predictions through the browser simulator and record game scores. This is the only way to measure whether improved metrics translate to improved gameplay.

8.3 Medium-Term (Next Month)

  1. Implement EfficientGameNet. The multi-scale architecture described in Section 5.2 requires more engineering but should provide the best accuracy-per-parameter ratio.

  2. Implement ActionConditionedViT. The action-history conditioning described in Section 5.3 is the most ambitious architectural change and should be attempted after the simpler improvements are validated.

  3. Scale data to 20,000+ sessions. Based on the scaling trends, achieving 90%+ exact match on easy games and 70%+ on hard games requires roughly 3,000-5,000 sessions per game, or ~25,000 total sessions across 7 games.

  4. Explore self-play. Once a model can play at a basic level, use its gameplay to generate additional training data, creating a self-improvement loop. This is particularly relevant for games where automated play scripts are suboptimal (they play randomly, not intelligently).


9. Summary

The Screen-Self-Driving project has demonstrated that visual imitation learning for arcade games is feasible with current neural network architectures. The key findings from the Feb 15 training run are:

  1. Models can learn meaningful spatial features -- F1 scores for LEFT, RIGHT, SPACE, and UP are well above random, proving the models extract game-relevant visual information.
  2. Temporal models outperform static models -- cnn_lstm and small_vit (both with explicit temporal processing) outperform the stacked-frame approaches on per-key F1.
  3. Scale matters -- every increase in training data has improved per-key F1 scores log-linearly.
  4. F1-DOWN = 0 is a systematic failure requiring architectural intervention (frame differencing) rather than just more data.
  5. The overfitting test proves capacity is sufficient -- the gap is in generalization, which data volume, better augmentation, and inductive biases (motion features, action history) should address.

The most impactful next steps are: (1) analyzing the fulldata0 results when available, (2) implementing frame differencing as a preprocessing step, and (3) building the MotionNet architecture that makes motion a first-class input feature. The project is well-positioned to achieve competitive gameplay on easy/medium games within the next training cycle and on hard games within the next month.