Unifying MacroHard and Screen-Self-Driving Training Data
The Question
MacroHard records desktop screen + keystrokes. Screen-Self-Driving (SSD) collects arcade gameplay via OpenArcade in-browser and native macOS capture. Both produce screen frames paired with input events. Can we unify these into a single training pipeline?
Current State: Two Separate Pipelines
MacroHard (Desktop Recording)
| Property | Value |
|---|---|
| Location | ~/dev/macro-hard/, data at ~/.macrohard/sessions/ |
| Capture | ffmpeg h264_videotoolbox, 1 FPS, 1512x982 |
| Video codec | H.264 (.mp4) |
| Input format | JSONL (keys.jsonl) with fields: ts (ms epoch), type, keycode, flags, key_name |
| Processed frames | float16, shape (n, 3, 224, 224) |
| Processed labels | int32, shape (n, 3) -- up to 3 simultaneous key class indices |
| Model | ScreenKeyNet (ViT + 3 softmax heads), predicts 3 keys per frame |
| Sessions | 0 recorded so far (system ready, never started) |
Screen-Self-Driving (Native + Browser)
| Property | Value |
|---|---|
| Location | ~/dev/Screen-Self-Driving/, Jetson at /ssd/screen-self-driving/ |
| Native capture | CGEventTap + H.265, configurable FPS |
| Browser capture | OpenArcade recorder.js, 2 FPS JPEG + keyboard events |
| Video codec | H.265 (.mp4 or raw .h265) |
| Input format | Binary, 27 bytes/event: <QBHIffhh> (ts_ns, type, keycode, flags, mouse_xy, scroll) |
| Processed frames | float16, shape (n_windows, context_frames, 3, H, W) where H,W = 128 or 384 |
| Processed actions | int32, shape (n_windows, max_actions) -- token sequences from 4,327-token vocab |
| Model | Vision-Action Transformer (VAT), predicts action token sequences |
| Sessions | 148 collected (5.6 GB), mostly Tetris |
| Hub ingest | FastAPI on Jetson port 8090, processes both native and browser uploads |
OpenArcade (Browser Feeder into SSD)
| Property | Value |
|---|---|
| Location | ~/dev/openarcade/ |
| Capture | Canvas frames at 2 FPS (JPEG) + keyboard events with macOS keycodes |
| Upload | POST to /api/ingest/browser on Jetson hub (port 8090) |
| Processing | Hub converts JPEG stream to H.265 MP4, writes binary events, feeds into SSD pipeline |
| Games | Tetris, Snake, Pong, Breakout, Flappy Bird, Asteroids, Space Invaders |
Key Differences
| Dimension | MacroHard | Screen-Self-Driving |
|---|---|---|
| Video codec | H.264 | H.265 |
| Input logging | JSONL (text) | Binary 27-byte structs |
| Keycodes | macOS virtual keycodes | macOS virtual keycodes (same!) |
| Mouse tracking | No | Yes (normalized x,y + scroll) |
| Temporal model | Single frame -> 3 keys | Sliding window (4-8 frames) -> token sequence |
| Action space | ~128 key classes (multi-label) | 4,327 discrete tokens (keys + mouse + position + waits) |
| Frame size | 224x224 | 128x128 (arcade) or 384x384 (desktop) |
| FPS | 1 | 2 (configurable) |
| Target domain | General desktop | Arcade games (expanding to desktop) |
Compatibility Analysis
What Already Aligns
- Keycodes are identical. Both use macOS virtual keycodes (e.g., Space=49, ArrowLeft=123). OpenArcade's recorder.js explicitly maps browser key names to macOS keycodes. No translation needed.
- Frame processing is similar. Both decode video, resize, and normalize to float16 (C, H, W) arrays. The resize target differs, but that's a config parameter.
- Both produce memory-mapped output. frames.mmap + labels.mmap/actions.mmap with metadata JSON. PyTorch Datasets load from mmap.
- Timestamp precision is sufficient. MacroHard uses millisecond epoch; SSD uses nanosecond epoch. Both can be aligned to frame boundaries at 1-2 FPS granularity.
What Needs Bridging
1. Input format conversion (easy). MacroHard's JSONL needs conversion to SSD's binary format, or both need to feed into a common tokenizer. The simplest path: write a macrohard_to_ssd() adapter that reads keys.jsonl and outputs SSD-compatible binary events. The fields map directly:
JSONL ts (ms) * 1_000_000 → binary timestamp_ns
JSONL keycode → binary keycode (identical)
JSONL type "key_down" → binary event_type 7
JSONL type "key_up" → binary event_type 8
JSONL flags → binary flags (same macOS modifier bitmask)
mouse_x, mouse_y → 0.0, 0.0 (MacroHard doesn't track mouse)
scroll_dx, scroll_dy → 0, 0
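For illustration, a minimal sketch of that per-event conversion; the struct layout and event_type values come from the tables above, and the function name is just a placeholder:

```python
import json
import struct

# 27-byte SSD event layout from the table above: <QBHIffhh
# (ts_ns, type, keycode, flags, mouse_x, mouse_y, scroll_dx, scroll_dy)
SSD_EVENT = struct.Struct("<QBHIffhh")

# Event-type values as listed in the mapping above (verify against the SSD recorder).
EVENT_TYPES = {"key_down": 7, "key_up": 8}

def jsonl_event_to_binary(line: str) -> bytes:
    """Convert one MacroHard keys.jsonl line into one SSD binary event."""
    ev = json.loads(line)
    return SSD_EVENT.pack(
        int(ev["ts"]) * 1_000_000,   # ms epoch -> ns
        EVENT_TYPES[ev["type"]],     # key_down=7, key_up=8
        ev["keycode"],               # macOS virtual keycode, identical in both systems
        ev["flags"],                 # same macOS modifier bitmask
        0.0, 0.0,                    # MacroHard doesn't track mouse
        0, 0,                        # or scroll
    )
```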
2. Video codec (trivial). Both are standard codecs in standard containers, and PyAV decodes H.264 and H.265 through the same API. The SSD decode_video_frames() function works on any codec ffmpeg supports. No change needed.
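For reference, a minimal PyAV read loop; the real decode_video_frames() surely looks different, but the point is that the same call handles either codec:

```python
import av
import numpy as np

def decode_frames(path: str) -> list[np.ndarray]:
    """Decode any ffmpeg-supported container (H.264 or H.265 alike) into RGB arrays."""
    frames = []
    with av.open(path) as container:
        for frame in container.decode(video=0):
            frames.append(frame.to_ndarray(format="rgb24"))  # (H, W, 3) uint8
    return frames
```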
3. Action representation (the real question). MacroHard's ScreenKeyNet predicts 3 simultaneous keys as a multi-label classification. SSD's VAT predicts a variable-length sequence of action tokens. These are fundamentally different output heads. Options:
- Option A: Retokenize MacroHard data through SSD's tokenizer. Convert JSONL events to binary, run through processing/tokenizer.py. MacroHard sessions become SSD training data natively. Desktop frames + key-only tokens train the same VAT model. Mouse/position tokens would simply be absent (NOOP-padded).
- Option B: Keep separate models, shared backbone. Use a common ViT encoder pretrained on both datasets, with task-specific heads. More complex, less immediate value.
- Option C: Reduce SSD data to MacroHard's format. Strip mouse/position tokens, keep only key tokens, train ScreenKeyNet on combined data. Loses SSD's richer action space.
4. Frame rate and resolution (configurable). MacroHard records at 1 FPS, SSD at 2 FPS. Both can be resampled at processing time. Resolution is just a resize parameter. Standardizing on 2 FPS / 224x224 for all data is straightforward.
Recommendation: Option A -- Feed MacroHard Into the SSD Pipeline
The SSD pipeline is strictly more capable. Its tokenizer handles everything MacroHard captures (keystrokes) plus more (mouse, scroll, position, waits). The processing infrastructure is already deployed on the Jetson with automatic ingestion. The path of least resistance:
Implementation Plan
Step 1: MacroHard JSONL-to-Binary Adapter (~50 lines of Python)
Write macrohard_adapter.py in the SSD hub that converts a MacroHard session directory into SSD raw format:
- Read keys.jsonl, convert each event to 27-byte binary struct
- Symlink or transcode the H.264 .mp4 (PyAV handles it as-is, so symlink works)
- Generate session.json metadata with tag: "desktop" and source: "macrohard"
- Write to /ssd/ssd-data/raw/macrohard-mac/{session_id}/
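A sketch of what that adapter could look like, reusing the per-event conversion from the mapping above; the output filenames (events.bin, screen.mp4) and the exact session.json fields are assumptions about what the SSD processor expects:

```python
import json
from pathlib import Path

RAW_ROOT = Path("/ssd/ssd-data/raw/macrohard-mac")

def macrohard_to_ssd(session_dir: Path) -> Path:
    """Convert one MacroHard session directory into SSD raw format (sketch)."""
    out_dir = RAW_ROOT / session_dir.name
    out_dir.mkdir(parents=True, exist_ok=True)

    # 1. keys.jsonl -> 27-byte binary events
    #    (jsonl_event_to_binary is the helper sketched earlier in this doc)
    with open(session_dir / "keys.jsonl") as src, open(out_dir / "events.bin", "wb") as dst:
        for line in src:
            if line.strip():
                dst.write(jsonl_event_to_binary(line))

    # 2. Symlink the H.264 .mp4 as-is; PyAV decodes it without transcoding
    video_src = next(session_dir.glob("*.mp4"))
    video_dst = out_dir / "screen.mp4"
    if not video_dst.exists():
        video_dst.symlink_to(video_src.resolve())

    # 3. Session metadata so the processor can tag and filter it downstream
    (out_dir / "session.json").write_text(json.dumps({
        "session_id": session_dir.name,
        "tag": "desktop",
        "source": "macrohard",
    }, indent=2))

    return out_dir
```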
Step 2: Add /ingest/macrohard Endpoint to Hub (~30 lines)
Accept a MacroHard session upload (zip or multipart with .mp4 + keys.jsonl), run the adapter, trigger processing. Or simply run the adapter locally and use the existing /ingest endpoint.
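A rough FastAPI sketch of that endpoint; the staging path is hypothetical, and the real hub would hand the result to its existing processing queue rather than just returning it:

```python
import shutil
import zipfile
from pathlib import Path

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

# Hypothetical staging area for uploaded MacroHard sessions; the adapter symlinks
# the video, so the extracted files need to outlive the request.
STAGING = Path("/ssd/ssd-data/staging/macrohard")

@app.post("/ingest/macrohard")
def ingest_macrohard(session: UploadFile = File(...)):
    """Accept a zipped MacroHard session (.mp4 + keys.jsonl) and run the adapter."""
    session_id = Path(session.filename).stem
    session_dir = STAGING / session_id
    session_dir.mkdir(parents=True, exist_ok=True)

    zip_path = session_dir / "upload.zip"
    with open(zip_path, "wb") as f:
        shutil.copyfileobj(session.file, f)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(session_dir)
    zip_path.unlink()

    out_dir = macrohard_to_ssd(session_dir)   # Step 1 adapter
    return {"status": "ingested", "raw_dir": str(out_dir)}
```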
Step 3: Tag-Based Dataset Filtering
The SSD dataset.py already reads from /ssd/ssd-data/ready/. Add a tag filter so training can select arcade, desktop, or all. The metadata.json in each processed session carries the tag from session.json.
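Something along these lines, assuming each processed session under ready/ has a metadata.json carrying the tag; the actual Dataset class in dataset.py will differ:

```python
import json
from pathlib import Path

READY_ROOT = Path("/ssd/ssd-data/ready")

def list_sessions(tag: str | None = None) -> list[Path]:
    """Return processed session dirs, optionally filtered by their metadata tag.

    tag=None selects everything; "arcade" or "desktop" selects one domain.
    """
    sessions = []
    for meta_path in sorted(READY_ROOT.glob("*/metadata.json")):
        meta = json.loads(meta_path.read_text())
        if tag is None or meta.get("tag") == tag:
            sessions.append(meta_path.parent)
    return sessions

# e.g. desktop_only = list_sessions("desktop"); everything = list_sessions()
```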
Step 4: Adjust MacroHard Config
- Change FPS from 1 to 2 (match SSD standard)
- This is a one-line config change: python3 -m macrohard.cli config --set fps 2
What This Gets You
- Unified training data across desktop productivity and arcade gameplay
- One model architecture (VAT) that learns from both domains
- Leverage existing infra -- Jetson hub processes everything, dashboard shows all sessions
- Domain diversity improves generalization -- desktop sessions have complex UIs, arcade sessions have clear action-reward signals
- MacroHard becomes a collector feeding the SSD pipeline, same as OpenArcade
Data Flow After Unification
MacroHard (Mac) OpenArcade (Browser) Native Collector (Mac)
H.264 + JSONL JPEG + JS events H.265 + binary events
| | |
v v v
adapter.py POST /ingest/browser POST /ingest
(JSONL→binary) (JPEG→H.265, JS→binary) (already native)
| | |
+----------→ /ssd/ssd-data/raw/ ←-----------------------+
|
processor.py (Jetson)
decode → tokenize → align → mmap
|
/ssd/ssd-data/ready/
frames.mmap + actions.mmap
|
VAT training (any GPU)
Effort Estimate
| Step | Lines of Code | Complexity |
|---|---|---|
| JSONL-to-binary adapter | ~50 | Low -- direct field mapping |
| Hub endpoint (optional) | ~30 | Low -- wraps adapter |
| Tag-based filtering | ~10 | Trivial -- add filter param to dataset |
| MacroHard FPS config | 1 | Trivial |
| Total | ~90 | Afternoon of work |
Open Questions
- Should MacroHard upload to the Jetson hub automatically? Could add a cron job or post-session hook that rsyncs new sessions to the Jetson, similar to the memex sync. Or add a direct HTTP upload mode.
- Desktop frame complexity vs arcade simplicity. Desktop screens have far more visual information than a Tetris board. The 128x128 arcade resolution may lose too much desktop detail. May need 224x224 or 384x384 for desktop sessions, with the dataset handling mixed resolutions (resize at load time).
- Mouse data gap. MacroHard doesn't capture mouse events. Desktop tasks heavily involve mouse interaction. Consider adding mouse capture to MacroHard (CGEventTap already captures mouse in screen-pirate and SSD's native recorder -- the code exists). Without mouse data, desktop sessions can only train keyboard prediction.
- Should ScreenKeyNet be retired? If MacroHard data flows through SSD's VAT pipeline, the separate ScreenKeyNet model in MacroHard becomes redundant. The VAT architecture subsumes it. Could keep ScreenKeyNet as a lightweight baseline for comparison.
- Privacy tagging. Desktop recordings may contain sensitive content (passwords, emails, financial data). Need a privacy: sensitive tag in session metadata so these sessions can be excluded from any shared training or kept local-only (see the sketch after this list).
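Extending the Step 3 filter, exclusion could be as simple as checking that field; the privacy field name and value are the ones proposed above, not something the pipeline has today:

```python
import json
from pathlib import Path

def exclude_sensitive(sessions: list[Path]) -> list[Path]:
    """Drop sessions whose metadata.json carries privacy: sensitive (proposed field)."""
    kept = []
    for session_dir in sessions:
        meta = json.loads((session_dir / "metadata.json").read_text())
        if meta.get("privacy") != "sensitive":
            kept.append(session_dir)
    return kept
```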