DGX + Orin Cloudflare/Deployment Audit

Date: 2026-02-25

Scope: DGX Spark (spark), Jetson Orin Nanos (prometheus, atlas, epimetheus), top active repos, and deployment standardization to git push origin main -> host auto-pull.

Executive Summary

  • Cloudflare ingress is concentrated on Prometheus (buddy-tunnel) and Atlas (atlas-tunnel), with Prometheus acting as the edge router for many services, including DGX Spark API exposure.
  • DGX Spark GPU services were unhealthy at audit time: spark-llm, spark-image-gen, and spark-video-gen are crash-looping; spark-api is stopped.
  • Repo hygiene issues are concentrated on deploy hosts (Prometheus/Atlas) where runtime data is co-located in git worktrees; the largest drift is in openarcade, modern-news, task-processor, and screen-self-driving.
  • Deployment approach is mixed: some repos use robust 60s systemd timers, while others still rely on cron scripts or manual rsync workflows.
  • Epimetheus is provisioned (Docker, /ssd structure) but not yet configured with cloudflared/nginx service hosting.

Visual Topology

flowchart LR
  Internet((Internet)) --> CF[Cloudflare Edge]

  subgraph P[Prometheus 192.168.0.18]
    BT[buddy-tunnel]
    NGINX[nginx :80]
    Apps[apps :8001/:8082/:8091-8105/:8200/:3000]
  end

  subgraph S[DGX Spark 192.168.0.234]
    SAP[spark-api :8090]
    SLLM[spark-llm :8081]
    SIMG[spark-image-gen :8082]
    SVID[spark-video-gen :8083]
    AUDIO[voice-server :9050]
  end

  subgraph A[Atlas 192.168.0.101]
    AT[atlas-tunnel]
    AApps[buddy :8001, health :8104]
  end

  subgraph E[Epimetheus 192.168.0.202]
    ED[Docker only]
  end

  CF --> BT
  CF --> AT
  BT --> SAP
  BT --> Apps
  AT --> AApps

flowchart TD
  push[git push origin main] --> fetch[auto-pull timer or cron fetch]
  fetch --> cmp{new commit?}
  cmp -- no --> exit[exit]
  cmp -- yes --> pull[git pull --rebase origin main]
  pull --> deps[install deps if lock/requirements changed]
  deps --> restart[systemctl restart service]
  restart --> health[local health check + public check]
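
The decision path in the deploy flowchart above can be sketched as a small pure function. This is an illustrative sketch only, not the contents of the real auto-pull script: the function name, the step labels, and the set of dependency file names are all assumptions.

```python
# Hypothetical sketch of the deploy flowchart; names and file list are assumptions.
DEP_FILES = {"requirements.txt", "package-lock.json", "poetry.lock"}


def plan_deploy(local_sha: str, remote_sha: str, changed_files: list[str]) -> list[str]:
    """Return the ordered steps one polling cycle would run."""
    if local_sha == remote_sha:
        return ["exit"]  # no new commit on origin/main
    steps = ["pull"]
    # Install dependencies only when a lock/requirements file changed.
    if any(path.rsplit("/", 1)[-1] in DEP_FILES for path in changed_files):
        steps.append("install_deps")
    steps += ["restart", "health_check"]
    return steps
```

Keeping the decision separate from the git and systemctl calls makes the conditional-pull and conditional-install behaviour easy to unit test before it runs on a host.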

Cloudflare Routing Audit

Prometheus buddy-tunnel

  • Routes Spark/DGX: spark.digitalsurfacelabs.com, audio.digitalsurfacelabs.com -> 192.168.0.234:8090.
  • Routes major local services: forge, tasks, arcade, global, ramp, directory, alice, ssd, buddy, memex, news, prospector, storybook, health, pirate, soothsayer, moria.
  • Catch-all sends unknown hostnames to http://localhost:8100.

Atlas atlas-tunnel

  • Routes buddy.digitalsurfacelabs.com -> localhost:8001 and health.digitalsurfacelabs.com -> localhost:8104.
  • Catch-all is http_status:404.
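
Cloudflare Tunnel ingress rules are evaluated top to bottom, and the final rule acts as the catch-all. A minimal sketch of that matching, using the Atlas routes listed above; the list-of-tuples layout and the function name are assumptions for illustration, not the cloudflared config format.

```python
# Atlas routes as listed in this audit; the data layout is a hypothetical stand-in.
ATLAS_RULES = [
    ("buddy.digitalsurfacelabs.com", "localhost:8001"),
    ("health.digitalsurfacelabs.com", "localhost:8104"),
    (None, "http_status:404"),  # catch-all: no hostname filter
]


def resolve(hostname, rules):
    """Return the service of the first rule whose hostname matches."""
    for rule_host, service in rules:
        if rule_host is None or rule_host == hostname:
            return service
    raise ValueError("rule list has no catch-all")
```

Replacing the last rule with http://localhost:8100 reproduces the Prometheus behaviour, where an unknown hostname reaches a live local service instead of receiving a 404.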

Epimetheus

  • No /etc/cloudflared/config.yml present.
  • No externally exposed application ports detected.

Runtime Health Snapshot (critical)

Spark

  • spark-llm.service: activating (auto-restart) with exit status 1.
  • spark-image-gen.service: activating (auto-restart) with exit status 2.
  • spark-video-gen.service: activating (auto-restart) with exit status 2.
  • spark-api.service: inactive (dead) since 2026-02-21.
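
The states above can be classified mechanically from the KEY=VALUE output of `systemctl show <unit> -p ActiveState -p SubState`. A minimal sketch, assuming that output is captured as text; the function names and the category labels are hypothetical.

```python
def parse_show(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines as printed by `systemctl show`."""
    pairs = (line.split("=", 1) for line in text.splitlines() if "=" in line)
    return {key: value for key, value in pairs}


def classify(props: dict[str, str]) -> str:
    """Map ActiveState/SubState to a coarse health label (labels are assumptions)."""
    active, sub = props.get("ActiveState"), props.get("SubState")
    if active == "activating" and sub == "auto-restart":
        return "crash-looping"
    if active == "inactive" and sub == "dead":
        return "stopped"
    if active == "active" and sub == "running":
        return "healthy"
    return "unknown"
```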

Prometheus and Atlas

  • Prometheus: cloudflared/nginx and core app ports are live; mixed timer+cron orchestration is active.
  • Atlas: cloudflared/nginx/buddy/health active and externally routed.

Repo Hygiene Audit

Primary dataset: data/2026-02-25/repo_hygiene.csv.

High-risk drift clusters:

  • Prometheus /ssd/openarcade: large untracked generated game directories plus analytics db state.
  • Prometheus /ssd/modern-news: subscriber runtime state and analytics in the repo tree.
  • Prometheus /ssd/task-processor: branch drift (meme-track-development) with a large untracked set.
  • Spark /opt/training/screen-self-driving: source modifications mixed with many training artifacts.

Patterns detected:

  • Runtime data and generated artifacts are frequently inside deploy git worktrees.
  • Several repos are clean locally but dirty on host clones (host-local state leakage).
  • Some host repos are behind origin (e.g., /ssd/moria is behind by 7 commits).
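
The drift signals described above (behind/ahead counts, modified files, untracked files) can all be read from `git status --porcelain=v1 --branch`. A minimal sketch of a parser for that output; the function name and the summary keys are assumptions, and this is not the logic used to produce repo_hygiene.csv.

```python
import re


def summarize_status(porcelain: str) -> dict:
    """Count drift signals in `git status --porcelain=v1 --branch` output."""
    summary = {"branch": None, "ahead": 0, "behind": 0, "modified": 0, "untracked": 0}
    for line in porcelain.splitlines():
        if line.startswith("## "):
            # Header looks like "## main...origin/main [behind 7]".
            summary["branch"] = re.split(r"\.\.\.| ", line[3:], maxsplit=1)[0]
            for key in ("ahead", "behind"):
                match = re.search(rf"{key} (\d+)", line)
                if match:
                    summary[key] = int(match.group(1))
        elif line.startswith("??"):
            summary["untracked"] += 1  # untracked path
        elif line.strip():
            summary["modified"] += 1  # any other tracked change
    return summary
```

Running this per host clone and writing one row per repo would give the behind/ahead and dirty counts in a form that is easy to diff between audit dates.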

Last-Month Activity (interpreting "Code Two" as active pushed repos)

Window: 2026-01-25 to 2026-02-25

Top activity counts:

  • sync: 167
  • openarcade: 95
  • Screen-Self-Driving: 88
  • modern-news: 52
  • memex: 49
  • buddy: 43

Source: data/2026-02-25/local/commits-since-2026-01-25.txt.

Deployment Standardization Audit

Target standard: git push origin main -> systemd timer every 60s -> conditional pull -> conditional restart.

Current state by pattern:

  • Already aligned (good): spark-api, forge, ramp, screen-pirate, soothsayer.
  • Partially aligned (needs cleanup): moria, prospector (a cron deploy-check.sh pattern exists but is inconsistent with the timer/service template).
  • Not aligned/legacy/manual: modern-news, task-processor, some rsync-based deploy paths.

Recommended canonical deploy package for each service repo:

  • deploy/auto-pull.sh (change-detect + pull + restart + log)
  • deploy/<service>-autodeploy.service (oneshot)
  • deploy/<service>-autodeploy.timer (OnBootSec=30s, OnUnitActiveSec=60s)
  • .gitignore runtime policy (logs/, *.db-wal, *.db-shm, generated dirs, subscriber/runtime state)
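
One way to keep the service and timer units identical across repos is to generate them from a single template. The sketch below renders both files for a given service; the function name, the Description text, and the WorkingDirectory choice are assumptions, while the oneshot type and the timer intervals come from the package described above.

```python
def render_units(service: str, repo_dir: str) -> dict[str, str]:
    """Return {filename: contents} for a oneshot service and its 60s timer."""
    unit = f"{service}-autodeploy"
    service_text = (
        "[Unit]\n"
        f"Description=Auto-deploy {service} from origin/main\n\n"
        "[Service]\n"
        "Type=oneshot\n"
        f"WorkingDirectory={repo_dir}\n"
        f"ExecStart={repo_dir}/deploy/auto-pull.sh\n"
    )
    timer_text = (
        "[Unit]\n"
        f"Description=Poll origin/main for {service}\n\n"
        "[Timer]\n"
        "OnBootSec=30s\n"
        "OnUnitActiveSec=60s\n\n"
        "[Install]\n"
        "WantedBy=timers.target\n"
    )
    # A timer activates the service unit of the same name by default.
    return {f"{unit}.service": service_text, f"{unit}.timer": timer_text}
```

For example, `render_units("moria", "/ssd/moria")` would yield moria-autodeploy.service and moria-autodeploy.timer, ready to be written under deploy/ in that repo.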

Device-by-Device Improvement Plan

DGX Spark

  1. Restore Spark service health before adding new endpoints.
  2. Split training artifact output from source checkout (/opt/training/.../results out of git tree or ignored).
  3. Keep Spark API exposure through Prometheus tunnel only if intentional; otherwise move to dedicated Spark tunnel for reduced blast radius.

Prometheus

  1. Standardize all hosted repos to one timer template; retire ad hoc cron deploy scripts.
  2. Move runtime data out of repo roots (subscriber data, analytics db, generated assets).
  3. Add nightly repo hygiene job producing a local report: behind/ahead + dirty counts.

Atlas

  1. Convert config drift (settings.yaml, schedules.yaml, onboarding docs) into explicit policy:
     • either commit as canonical defaults,
     • or externalize to a host-only config path.
  2. Mirror the Prometheus deploy template for consistency.

Epimetheus

  1. Decide role: standby or active host.
  2. If active: install nginx + cloudflared + baseline deploy template.
  3. If standby: keep minimal and document failover bootstrap script.

Prioritized Actions

P0 (today)

  1. Repair DGX Spark app stack (spark-llm, spark-image-gen, spark-video-gen, spark-api) and verify end-to-end via spark.digitalsurfacelabs.com.
  2. Clean host repo drift for externally served services (openarcade, modern-news, task-processor, screen-self-driving).
  3. Fast-forward /ssd/moria to origin/main and remove host-only untracked deploy files from repo tree.

P1 (this week)

  1. Roll out canonical auto-pull timer template to all active services.
  2. Normalize .gitignore policies for runtime artifacts and generated data across active repos.
  3. Add per-service healthcheck.sh invoked after restart.
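
A post-restart health check usually needs a short retry window, because a service may refuse connections for a few seconds after `systemctl restart`. The sketch below shows only that retry logic, in Python for illustration rather than as the proposed healthcheck.sh; the function name, defaults, and the injected `probe` callable are assumptions.

```python
import time
from typing import Callable


def wait_healthy(
    probe: Callable[[], bool],
    attempts: int = 5,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe until it returns True or attempts run out."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            pass  # e.g. connection refused while the service is still starting
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

A real probe could wrap `urllib.request.urlopen` against the local port; `urllib.error.URLError` is a subclass of `OSError`, so a refused connection is treated as "not yet healthy" rather than as a crash.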

P2 (next week)

  1. Either promote Epimetheus to an active service host or write a formal standby runbook.
  2. Add centralized inventory generation (topology.json + hygiene CSV) on a scheduled cadence.
  3. Add CI guardrails to block accidental commits of runtime/generated paths.
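
The CI guardrail in item 3 can be sketched as a check of changed paths against the runtime patterns named in the .gitignore policy. This is a minimal illustration: the function name and the pattern list are assumptions, and `fnmatch` does not reproduce full gitignore semantics (for example, its `*` also matches `/`).

```python
from fnmatch import fnmatch

# Patterns taken from the runtime policy in this audit; extend per repo.
RUNTIME_PATTERNS = ["logs/*", "*.db-wal", "*.db-shm"]


def blocked_paths(staged, patterns=RUNTIME_PATTERNS):
    """Return the staged paths that match any runtime/generated pattern."""
    return [
        path
        for path in staged
        if any(fnmatch(path, pattern) for pattern in patterns)
    ]
```

A CI job could feed this the output of `git diff --name-only` for the pushed range and fail the build when the returned list is non-empty.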

Deliverables Created

  • research/scripts/collect_host_audit.sh
  • research/data/2026-02-25/topology.json
  • research/data/2026-02-25/repo_hygiene.csv
  • research/data/2026-02-25/{spark,prometheus,atlas,epimetheus}/... raw snapshots
  • research/data/2026-02-25/local/commits-since-2026-01-25.txt

Assumptions

  • "Code Two" was interpreted as your active/pushed repo work over the last month (2026-01-25..2026-02-25) based on local git and sync logs.
  • Walmart Alaska MacBook was intentionally excluded in this run.