Unifying MacroHard and Screen-Self-Driving Training Data
The Question
MacroHard records desktop screen + keystrokes. Screen-Self-Driving (SSD) collects arcade gameplay via OpenArcade in-browser and native macOS capture. Both produce screen frames paired with input events. Can we unify these into a single training pipeline?
Current State: Two Separate Pipelines
MacroHard (Desktop Recording)
| Property | Value |
|---|---|
| Location | ~/dev/macro-hard/, data at ~/.macrohard/sessions/ |
| Capture | ffmpeg h264_videotoolbox, 1 FPS, 1512x982 |
| Video codec | H.264 (.mp4) |
| Input format | JSONL (keys.jsonl) with fields: ts (ms epoch), type, keycode, flags, key_name |
| Processed frames | float16, shape (n, 3, 224, 224) |
| Processed labels | int32, shape (n, 3) -- up to 3 simultaneous key class indices |
| Model | ScreenKeyNet (ViT + 3 softmax heads), predicts 3 keys per frame |
| Sessions | 0 recorded so far (system ready, never started) |
Screen-Self-Driving (Native + Browser)
| Property | Value |
|---|---|
| Location | ~/dev/Screen-Self-Driving/, Jetson at /ssd/screen-self-driving/ |
| Native capture | CGEventTap + H.265, configurable FPS |
| Browser capture | OpenArcade recorder.js, 2 FPS JPEG + keyboard events |
| Video codec | H.265 (.mp4 or raw .h265) |
| Input format | Binary, 27 bytes/event: <QBHIffhh> (ts_ns, type, keycode, flags, mouse_xy, scroll) |
| Processed frames | float16, shape (n_windows, context_frames, 3, H, W) where H,W = 128 or 384 |
| Processed actions | int32, shape (n_windows, max_actions) -- token sequences from 4,327-token vocab |
| Model | Vision-Action Transformer (VAT), predicts action token sequences |
| Sessions | 148 collected (5.6 GB), mostly Tetris |
| Hub ingest | FastAPI on Jetson port 8090, processes both native and browser uploads |
OpenArcade (Browser Feeder into SSD)
| Property | Value |
|---|---|
| Location | ~/dev/openarcade/ |
| Capture | Canvas frames at 2 FPS (JPEG) + keyboard events with macOS keycodes |
| Upload | POST to /api/ingest/browser on Jetson hub (port 8090) |
| Processing | Hub converts JPEG stream to H.265 MP4, writes binary events, feeds into SSD pipeline |
| Games | Tetris, Snake, Pong, Breakout, Flappy Bird, Asteroids, Space Invaders |
Key Differences
| Dimension | MacroHard | Screen-Self-Driving |
|---|---|---|
| Video codec | H.264 | H.265 |
| Input logging | JSONL (text) | Binary 27-byte structs |
| Keycodes | macOS virtual keycodes | macOS virtual keycodes (same!) |
| Mouse tracking | No | Yes (normalized x,y + scroll) |
| Temporal model | Single frame -> 3 keys | Sliding window (4-8 frames) -> token sequence |
| Action space | ~128 key classes (multi-label) | 4,327 discrete tokens (keys + mouse + position + waits) |
| Frame size | 224x224 | 128x128 (arcade) or 384x384 (desktop) |
| FPS | 1 | 2 (configurable) |
| Target domain | General desktop | Arcade games (expanding to desktop) |
Compatibility Analysis
What Already Aligns
- Keycodes are identical. Both use macOS virtual keycodes (e.g., Space=49, ArrowLeft=123). OpenArcade's recorder.js explicitly maps browser key names to macOS keycodes. No translation needed.
- Frame processing is similar. Both decode video, resize, and normalize to float16 (C, H, W) arrays. The resize target differs, but that's a config parameter.
- Both produce memory-mapped output. frames.mmap + labels.mmap/actions.mmap with metadata JSON. PyTorch Datasets load from mmap.
- Timestamp precision is sufficient. MacroHard uses millisecond epoch; SSD uses nanosecond epoch. Both can be aligned to frame boundaries at 1-2 FPS granularity.
What Needs Bridging
1. Input format conversion (easy). MacroHard's JSONL needs conversion to SSD's binary format, or both need to feed into a common tokenizer. The simplest path: write a macrohard_to_ssd() adapter that reads keys.jsonl and outputs SSD-compatible binary events. The fields map directly:
JSONL ts (ms) * 1_000_000 → binary timestamp_ns
JSONL keycode → binary keycode (identical)
JSONL type "key_down" → binary event_type 7
JSONL type "key_up" → binary event_type 8
JSONL flags → binary flags (same macOS modifier bitmask)
mouse_x, mouse_y → 0.0, 0.0 (MacroHard doesn't track mouse)
scroll_dx, scroll_dy → 0, 0
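For illustration, a minimal sketch of that per-event conversion; the struct layout and event_type values come from the tables above, and the function name is just a placeholder:

```python
import json
import struct

# 27-byte SSD event layout from the table above: <QBHIffhh
# (ts_ns, type, keycode, flags, mouse_x, mouse_y, scroll_dx, scroll_dy)
SSD_EVENT = struct.Struct("<QBHIffhh")

# Event-type values as listed in the mapping above (verify against the SSD recorder).
EVENT_TYPES = {"key_down": 7, "key_up": 8}

def jsonl_event_to_binary(line: str) -> bytes:
    """Convert one MacroHard keys.jsonl line into one SSD binary event."""
    ev = json.loads(line)
    return SSD_EVENT.pack(
        int(ev["ts"]) * 1_000_000,   # ms epoch -> ns
        EVENT_TYPES[ev["type"]],     # key_down=7, key_up=8
        ev["keycode"],               # macOS virtual keycode, identical in both systems
        ev["flags"],                 # same macOS modifier bitmask
        0.0, 0.0,                    # MacroHard doesn't track mouse
        0, 0,                        # or scroll
    )
```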
2. Video codec (trivial). Both are standard codecs in standard containers, and PyAV decodes H.264 and H.265 through the same API. The SSD decode_video_frames() function works on any codec ffmpeg supports. No change needed.
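For reference, a minimal PyAV read loop; the real decode_video_frames() surely looks different, but the point is that the same call handles either codec:

```python
import av
import numpy as np

def decode_frames(path: str) -> list[np.ndarray]:
    """Decode any ffmpeg-supported container (H.264 or H.265 alike) into RGB arrays."""
    frames = []
    with av.open(path) as container:
        for frame in container.decode(video=0):
            frames.append(frame.to_ndarray(format="rgb24"))  # (H, W, 3) uint8
    return frames
```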
3. Action representation (the real question). MacroHard's ScreenKeyNet predicts 3 simultaneous keys as a multi-label classification. SSD's VAT predicts a variable-length sequence of action tokens. These are fundamentally different output heads. Options:
- Option A: Retokenize MacroHard data through SSD's tokenizer. Convert JSONL events to binary, run through processing/tokenizer.py. MacroHard sessions become SSD training data natively. Desktop frames + key-only tokens train the same VAT model. Mouse/position tokens would simply be absent (NOOP-padded).
- Option B: Keep separate models, shared backbone. Use a common ViT encoder pretrained on both datasets, with task-specific heads. More complex, less immediate value.
- Option C: Reduce SSD data to MacroHard's format. Strip mouse/position tokens, keep only key tokens, train ScreenKeyNet on combined data. Loses SSD's richer action space.
4. Frame rate and resolution (configurable). MacroHard records at 1 FPS, SSD at 2 FPS. Both can be resampled at processing time. Resolution is just a resize parameter. Standardizing on 2 FPS / 224x224 for all data is straightforward.
Recommendation: Option A -- Feed MacroHard Into the SSD Pipeline
The SSD pipeline is strictly more capable. Its tokenizer handles everything MacroHard captures (keystrokes) plus more (mouse, scroll, position, waits). The processing infrastructure is already deployed on the Jetson with automatic ingestion. The path of least resistance:
Implementation Plan
Step 1: MacroHard JSONL-to-Binary Adapter (~50 lines of Python)
Write macrohard_adapter.py in the SSD hub that converts a MacroHard session directory into SSD raw format:
- Read keys.jsonl, convert each event to 27-byte binary struct
- Symlink or transcode the H.264 .mp4 (PyAV handles it as-is, so symlink works)
- Generate session.json metadata with tag: "desktop" and source: "macrohard"
- Write to /ssd/ssd-data/raw/macrohard-mac/{session_id}/
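A sketch of what that adapter could look like, reusing the per-event conversion from the mapping above; the output filenames (events.bin, screen.mp4) and the exact session.json fields are assumptions about what the SSD processor expects:

```python
import json
from pathlib import Path

RAW_ROOT = Path("/ssd/ssd-data/raw/macrohard-mac")

def macrohard_to_ssd(session_dir: Path) -> Path:
    """Convert one MacroHard session directory into SSD raw format (sketch)."""
    out_dir = RAW_ROOT / session_dir.name
    out_dir.mkdir(parents=True, exist_ok=True)

    # 1. keys.jsonl -> 27-byte binary events
    #    (jsonl_event_to_binary is the helper sketched earlier in this doc)
    with open(session_dir / "keys.jsonl") as src, open(out_dir / "events.bin", "wb") as dst:
        for line in src:
            if line.strip():
                dst.write(jsonl_event_to_binary(line))

    # 2. Symlink the H.264 .mp4 as-is; PyAV decodes it without transcoding
    video_src = next(session_dir.glob("*.mp4"))
    video_dst = out_dir / "screen.mp4"
    if not video_dst.exists():
        video_dst.symlink_to(video_src.resolve())

    # 3. Session metadata so the processor can tag and filter it downstream
    (out_dir / "session.json").write_text(json.dumps({
        "session_id": session_dir.name,
        "tag": "desktop",
        "source": "macrohard",
    }, indent=2))

    return out_dir
```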
Step 2: Add /ingest/macrohard Endpoint to Hub (~30 lines)
Accept a MacroHard session upload (zip or multipart with .mp4 + keys.jsonl), run the adapter, trigger processing. Or simply run the adapter locally and use the existing /ingest endpoint.
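A rough FastAPI sketch of that endpoint; the staging path is hypothetical, and the real hub would hand the result to its existing processing queue rather than just returning it:

```python
import shutil
import zipfile
from pathlib import Path

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

# Hypothetical staging area for uploaded MacroHard sessions; the adapter symlinks
# the video, so the extracted files need to outlive the request.
STAGING = Path("/ssd/ssd-data/staging/macrohard")

@app.post("/ingest/macrohard")
def ingest_macrohard(session: UploadFile = File(...)):
    """Accept a zipped MacroHard session (.mp4 + keys.jsonl) and run the adapter."""
    session_id = Path(session.filename).stem
    session_dir = STAGING / session_id
    session_dir.mkdir(parents=True, exist_ok=True)

    zip_path = session_dir / "upload.zip"
    with open(zip_path, "wb") as f:
        shutil.copyfileobj(session.file, f)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(session_dir)
    zip_path.unlink()

    out_dir = macrohard_to_ssd(session_dir)   # Step 1 adapter
    return {"status": "ingested", "raw_dir": str(out_dir)}
```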
Step 3: Tag-Based Dataset Filtering
The SSD dataset.py already reads from /ssd/ssd-data/ready/. Add a tag filter so training can select arcade, desktop, or all. The metadata.json in each processed session carries the tag from session.json.
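Something along these lines, assuming each processed session under ready/ has a metadata.json carrying the tag; the actual Dataset class in dataset.py will differ:

```python
import json
from pathlib import Path

READY_ROOT = Path("/ssd/ssd-data/ready")

def list_sessions(tag: str | None = None) -> list[Path]:
    """Return processed session dirs, optionally filtered by their metadata tag.

    tag=None selects everything; "arcade" or "desktop" selects one domain.
    """
    sessions = []
    for meta_path in sorted(READY_ROOT.glob("*/metadata.json")):
        meta = json.loads(meta_path.read_text())
        if tag is None or meta.get("tag") == tag:
            sessions.append(meta_path.parent)
    return sessions

# e.g. desktop_only = list_sessions("desktop"); everything = list_sessions()
```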
Step 4: Adjust MacroHard Config
- Change FPS from 1 to 2 (match SSD standard)
- This is a one-line config change: python3 -m macrohard.cli config --set fps 2
What This Gets You
- Unified training data across desktop productivity and arcade gameplay
- One model architecture (VAT) that learns from both domains
- Leverage existing infra -- Jetson hub processes everything, dashboard shows all sessions
- Domain diversity improves generalization -- desktop sessions have complex UIs, arcade sessions have clear action-reward signals
- MacroHard becomes a collector feeding the SSD pipeline, same as OpenArcade
Data Flow After Unification
MacroHard (Mac) OpenArcade (Browser) Native Collector (Mac)
H.264 + JSONL JPEG + JS events H.265 + binary events
| | |
v v v
adapter.py POST /ingest/browser POST /ingest
(JSONL→binary) (JPEG→H.265, JS→binary) (already native)
| | |
+----------→ /ssd/ssd-data/raw/ ←-----------------------+
|
processor.py (Jetson)
decode → tokenize → align → mmap
|
/ssd/ssd-data/ready/
frames.mmap + actions.mmap
|
VAT training (any GPU)
Effort Estimate
| Step | Lines of Code | Complexity |
|---|---|---|
| JSONL-to-binary adapter | ~50 | Low -- direct field mapping |
| Hub endpoint (optional) | ~30 | Low -- wraps adapter |
| Tag-based filtering | ~10 | Trivial -- add filter param to dataset |
| MacroHard FPS config | 1 | Trivial |
| Total | ~90 | Afternoon of work |
Open Questions
- Should MacroHard upload to the Jetson hub automatically? Could add a cron job or post-session hook that rsyncs new sessions to the Jetson, similar to the memex sync. Or add a direct HTTP upload mode.
- Desktop frame complexity vs arcade simplicity. Desktop screens have far more visual information than a Tetris board. The 128x128 arcade resolution may lose too much desktop detail. May need 224x224 or 384x384 for desktop sessions, with the dataset handling mixed resolutions (resize at load time).
- Mouse data gap. MacroHard doesn't capture mouse events. Desktop tasks heavily involve mouse interaction. Consider adding mouse capture to MacroHard (CGEventTap already captures mouse in screen-pirate and SSD's native recorder -- the code exists). Without mouse data, desktop sessions can only train keyboard prediction.
- Should ScreenKeyNet be retired? If MacroHard data flows through SSD's VAT pipeline, the separate ScreenKeyNet model in MacroHard becomes redundant. The VAT architecture subsumes it. Could keep ScreenKeyNet as a lightweight baseline for comparison.
- Privacy tagging. Desktop recordings may contain sensitive content (passwords, emails, financial data). Need a privacy: sensitive tag in session metadata so these sessions can be excluded from any shared training or kept local-only (see the sketch after this list).
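Extending the Step 3 filter, exclusion could be as simple as checking that field; the privacy field name and value are the ones proposed above, not something the pipeline has today:

```python
import json
from pathlib import Path

def exclude_sensitive(sessions: list[Path]) -> list[Path]:
    """Drop sessions whose metadata.json carries privacy: sensitive (proposed field)."""
    kept = []
    for session_dir in sessions:
        meta = json.loads((session_dir / "metadata.json").read_text())
        if meta.get("privacy") != "sensitive":
            kept.append(session_dir)
    return kept
```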