DGX + Orin Cloudflare/Deployment Audit
Date: 2026-02-25
Scope: DGX Spark (spark), Jetson Orin Nanos (prometheus, atlas, epimetheus), top active repos, deployment standardization to git push origin main -> host auto-pull.
Executive Summary
- Cloudflare ingress is concentrated on Prometheus (buddy-tunnel) and Atlas (atlas-tunnel), with Prometheus acting as the edge router for many services, including DGX Spark API exposure.
- DGX Spark GPU services are not healthy right now: spark-llm, spark-image-gen, and spark-video-gen are crash-looping; spark-api is stopped.
- Repo hygiene issues are concentrated on deploy hosts (Prometheus/Atlas) where runtime data is co-located in git worktrees; the largest drift is in openarcade, modern-news, task-processor, and screen-self-driving.
- Deployment approach is mixed: some repos use robust 60s systemd timers, while others still rely on cron scripts or manual rsync workflows.
- Epimetheus is provisioned (Docker, /ssd structure) but not yet configured with cloudflared/nginx service hosting.
Visual Topology
flowchart LR
Internet((Internet)) --> CF[Cloudflare Edge]
subgraph P[Prometheus 192.168.0.18]
BT[buddy-tunnel]
NGINX[nginx :80]
Apps[apps :8001/:8082/:8091-8105/:8200/:3000]
end
subgraph S[DGX Spark 192.168.0.234]
SAP[spark-api :8090]
SLLM[spark-llm :8081]
SIMG[spark-image-gen :8082]
SVID[spark-video-gen :8083]
AUDIO[voice-server :9050]
end
subgraph A[Atlas 192.168.0.101]
AT[atlas-tunnel]
AApps[buddy :8001, health :8104]
end
subgraph E[Epimetheus 192.168.0.202]
ED[Docker only]
end
CF --> BT
CF --> AT
BT --> SAP
BT --> Apps
AT --> AApps
flowchart TD
push[git push origin main] --> fetch[auto-pull timer or cron fetch]
fetch --> cmp{new commit?}
cmp -- no --> exit[exit]
cmp -- yes --> pull[git pull --rebase origin main]
pull --> deps[install deps if lock/requirements changed]
deps --> restart[systemctl restart service]
restart --> health[local health check + public check]
Cloudflare Routing Audit
Prometheus buddy-tunnel
- Routes Spark/DGX: spark.digitalsurfacelabs.com, audio.digitalsurfacelabs.com -> 192.168.0.234:8090.
- Routes major local services: forge, tasks, arcade, global, ramp, directory, alice, ssd, buddy, memex, news, prospector, storybook, health, pirate, soothsayer, moria.
- Catch-all sends unknown hostnames to http://localhost:8100.
Atlas atlas-tunnel
- Routes buddy.digitalsurfacelabs.com -> localhost:8001 and health.digitalsurfacelabs.com -> localhost:8104.
- Catch-all is http_status:404.
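The routing tables above can be re-derived from each host's cloudflared config. The following is a minimal sketch, assuming the config uses the usual cloudflared layout of an ingress list with hostname and service keys; the function name list_ingress_routes is hypothetical, and the awk parsing is deliberately naive (no quoted values, no inline comments).

```shell
# Print "hostname -> service" for each ingress rule in a cloudflared config.
# A rule with no hostname is the catch-all and is shown as "*".
list_ingress_routes() {
  awk '
    /^[[:space:]]*-?[[:space:]]*hostname:/ {
      sub(/^[^:]*:[[:space:]]*/, "")   # drop the "- hostname: " prefix
      host = $0
      next
    }
    /^[[:space:]]*-?[[:space:]]*service:/ {
      sub(/^[^:]*:[[:space:]]*/, "")   # drop the "service: " prefix
      label = (host == "" ? "*" : host)
      print label " -> " $0
      host = ""
    }
  ' "$1"
}

# Usage (path as found on hosts that run cloudflared):
#   list_ingress_routes /etc/cloudflared/config.yml
```

Run on each host, the output gives a flat list that can be diffed between audits to spot routing changes.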
Epimetheus
- No /etc/cloudflared/config.yml present.
- No externally exposed application ports detected.
Runtime Health Snapshot (critical)
Spark
- spark-llm.service: activating (auto-restart) with exit status 1.
- spark-image-gen.service: activating (auto-restart) with exit status 2.
- spark-video-gen.service: activating (auto-restart) with exit status 2.
- spark-api.service: inactive (dead) since 2026-02-21.
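A snapshot like the one above can be regenerated with a small helper. This is a sketch under assumptions: the labels (healthy, crash-looping, stopped, failed) and the function names are hypothetical; the only systemd interface used is systemctl show with the ActiveState and SubState properties.

```shell
# Map a systemd ActiveState/SubState pair to a coarse health label.
classify_unit() {
  case "$1/$2" in
    active/running)          echo "healthy" ;;
    activating/auto-restart) echo "crash-looping" ;;
    inactive/dead)           echo "stopped" ;;
    failed/*)                echo "failed" ;;
    *)                       echo "other ($1/$2)" ;;
  esac
}

# Query one unit and print "unit: label".
unit_health() {
  local active sub
  active=$(systemctl show -p ActiveState --value "$1")
  sub=$(systemctl show -p SubState --value "$1")
  printf '%s: %s\n' "$1" "$(classify_unit "$active" "$sub")"
}

# Usage on Spark:
#   for u in spark-llm spark-image-gen spark-video-gen spark-api; do
#     unit_health "$u.service"
#   done
```

For any unit reported as crash-looping, `journalctl -u <unit> -n 50 --no-pager` is a reasonable next step to find the error behind the exit status.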
Prometheus and Atlas
- Prometheus: cloudflared/nginx and core app ports are live; mixed timer+cron orchestration is active.
- Atlas: cloudflared/nginx/buddy/health active and externally routed.
Repo Hygiene Audit
Primary dataset: data/2026-02-25/repo_hygiene.csv.
High-risk drift clusters:
- Prometheus /ssd/openarcade: large untracked generated game directories + analytics db state.
- Prometheus /ssd/modern-news: subscriber runtime state + analytics in repo tree.
- Prometheus /ssd/task-processor: branch drift (meme-track-development) with large untracked set.
- Spark /opt/training/screen-self-driving: source modifications mixed with many training artifacts.
Patterns detected:
- Runtime data and generated artifacts are frequently inside deploy git worktrees.
- Several repos are clean locally but dirty on host clones (host-local state leakage).
- Some host repos are behind origin (e.g., /ssd/moria behind by 7 commits).
Last-Month Activity (interpreting "Code Two" as active pushed repos)
Window: 2026-01-25 to 2026-02-25
Top activity counts:
- sync 167
- openarcade 95
- Screen-Self-Driving 88
- modern-news 52
- memex 49
- buddy 43
Source: data/2026-02-25/local/commits-since-2026-01-25.txt.
Deployment Standardization Audit
Target standard: git push origin main -> systemd timer every 60s -> conditional pull -> conditional restart.
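The target flow can be sketched as a single script. This is an illustration, not the script already deployed on any host: the function names are hypothetical, the dependency manifest names (requirements.txt, package-lock.json) are assumptions, and the install step is left as a placeholder because it differs per repo.

```shell
#!/usr/bin/env bash
# Sketch of deploy/auto-pull.sh: change-detect + pull + restart + log.

# True when the checked-out commit differs from the remote tip.
needs_deploy() {
  [ "$1" != "$2" ]
}

# Reads changed file names on stdin; true if a dependency manifest changed.
# The manifest names are assumptions; adjust per repo.
deps_changed() {
  grep -qE '(^|/)(requirements\.txt|package-lock\.json)$'
}

auto_pull() {
  local repo_dir="$1" service="$2"
  local old_sha new_sha
  git -C "$repo_dir" fetch --quiet origin main || return 1
  old_sha=$(git -C "$repo_dir" rev-parse HEAD) || return 1
  new_sha=$(git -C "$repo_dir" rev-parse origin/main) || return 1
  # No new commit: exit quietly without restarting anything.
  needs_deploy "$old_sha" "$new_sha" || return 0
  git -C "$repo_dir" pull --rebase origin main || return 1
  if git -C "$repo_dir" diff --name-only "$old_sha" "$new_sha" | deps_changed; then
    echo "dependency manifest changed; run the repo's install step here"
  fi
  systemctl restart "$service" || return 1
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) deployed $new_sha to $service"
}

# Usage (repo path and unit name are examples):
#   auto_pull /ssd/forge forge.service
```

Because the restart only happens after a successful pull of a new commit, running this every 60s is cheap when nothing has changed. On hosts that should never carry local commits, `git merge --ff-only origin/main` is a stricter alternative to the rebase pull.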
Current state by pattern:
- Already aligned (good): spark-api, forge, ramp, screen-pirate, soothsayer.
- Partially aligned (needs cleanup): moria, prospector (cron deploy-check.sh pattern exists but inconsistent with timer/service template).
- Not aligned/legacy/manual: modern-news, task-processor, some rsync-based deploy paths.
Recommended canonical deploy package for each service repo:
- deploy/auto-pull.sh (change-detect + pull + restart + log)
- deploy/<service>-autodeploy.service (oneshot)
- deploy/<service>-autodeploy.timer (OnBootSec=30s, OnUnitActiveSec=60s)
- .gitignore runtime policy (logs/, *.db-wal, *.db-shm, generated dirs, subscriber/runtime state)
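The two unit files in this package can be generated from one template. The sketch below prints them to stdout; the timer values come from the list above, while the description text, the function names, and the assumption that auto-pull.sh lives under the repo's deploy/ directory are illustrative.

```shell
# Print a oneshot service unit that runs the repo's auto-pull script.
# $1 = service name, $2 = repo checkout path
render_service_unit() {
  cat <<EOF
[Unit]
Description=Auto-deploy $1 from origin/main

[Service]
Type=oneshot
WorkingDirectory=$2
ExecStart=$2/deploy/auto-pull.sh
EOF
}

# Print the matching timer: first run 30s after boot, then every 60s.
# $1 = service name
render_timer_unit() {
  cat <<EOF
[Unit]
Description=Poll origin/main for $1 every 60s

[Timer]
OnBootSec=30s
OnUnitActiveSec=60s
Unit=$1-autodeploy.service

[Install]
WantedBy=timers.target
EOF
}

# Usage (paths are examples):
#   render_service_unit forge /ssd/forge > deploy/forge-autodeploy.service
#   render_timer_unit forge > deploy/forge-autodeploy.timer
```

Generating the units from a template keeps every service on identical timer settings, which is the point of standardizing.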
Device-by-Device Improvement Plan
DGX Spark
- Restore Spark service health before adding new endpoints.
- Split training artifact output from source checkout (/opt/training/.../results out of git tree or ignored).
- Keep Spark API exposure through Prometheus tunnel only if intentional; otherwise move to dedicated Spark tunnel for reduced blast radius.
Prometheus
- Standardize all hosted repos to one timer template; retire ad hoc cron deploy scripts.
- Move runtime data out of repo roots (subscriber data, analytics db, generated assets).
- Add nightly repo hygiene job producing a local report: behind/ahead + dirty counts.
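The nightly hygiene job could be as small as the sketch below. The CSV column order (repo, ahead, behind, dirty) is an assumption and is not taken from the existing repo_hygiene.csv; the function names are hypothetical.

```shell
# Count non-empty lines of `git status --porcelain` output read from stdin.
count_dirty() {
  awk 'NF { n++ } END { print n + 0 }'
}

# Format one CSV row: repo,ahead,behind,dirty.
# $2 is the "ahead<whitespace>behind" pair printed by git rev-list.
format_row() {
  printf '%s %s\n' "$2" "$3" |
    awk -v repo="$1" '{ printf "%s,%s,%s,%s\n", repo, $1, $2, $3 }'
}

# Produce the row for one checkout, compared against origin/main.
repo_hygiene_row() {
  local counts dirty
  git -C "$1" fetch --quiet origin main || return 1
  # Left side counts commits only on HEAD (ahead); right side counts
  # commits only on origin/main (behind).
  counts=$(git -C "$1" rev-list --left-right --count HEAD...origin/main) || return 1
  dirty=$(git -C "$1" status --porcelain | count_dirty)
  format_row "$1" "$counts" "$dirty"
}

# Usage (repo list is an example):
#   for r in /ssd/openarcade /ssd/modern-news /ssd/moria; do
#     repo_hygiene_row "$r"
#   done
```

A repo in the state reported for /ssd/moria would show 7 in the behind column, which makes regressions easy to spot when comparing nightly reports.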
Atlas
- Convert config drift (settings.yaml, schedules.yaml, onboarding docs) into explicit policy:
  - either commit as canonical defaults,
  - or externalize to host-only config path.
- Mirror Prometheus deploy template for consistency.
Epimetheus
- Decide role: standby or active host.
- If active: install nginx + cloudflared + baseline deploy template.
- If standby: keep minimal and document failover bootstrap script.
Prioritized Actions
P0 (today)
- Repair DGX Spark app stack (spark-llm, spark-image-gen, spark-video-gen, spark-api) and verify end-to-end via spark.digitalsurfacelabs.com.
- Clean host repo drift for externally served services (openarcade, modern-news, task-processor, screen-self-driving).
- Fast-forward /ssd/moria to origin/main and remove host-only untracked deploy files from repo tree.
P1 (this week)
- Roll out canonical auto-pull timer template to all active services.
- Normalize .gitignore policies for runtime artifacts and generated data across active repos.
- Add per-service healthcheck.sh invoked after restart.
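A per-service healthcheck.sh might look like the following sketch. It assumes each service answers plain HTTP on its local port and that a 2xx/3xx response means healthy; the function name, retry defaults, and the "/" path are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of healthcheck.sh: retry an HTTP probe a few times after a restart.

# $1 = URL, $2 = attempts (default 5), $3 = delay in seconds (default 2)
check_http() {
  local url="$1" attempts="${2:-5}" delay="${3:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors; -sS hides progress but shows errors.
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      return 0
    fi
    # Give the service time to come up before the next attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage (local check first, then the public hostname):
#   check_http http://localhost:8001/ \
#     && check_http https://buddy.digitalsurfacelabs.com/
```

Checking the local port before the public hostname separates service failures from tunnel or routing failures.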
P2 (next week)
- Promote Epimetheus into active service host or formal standby runbook.
- Add centralized inventory generation (topology.json + hygiene CSV) on a scheduled cadence.
- Add CI guardrails to block accidental commits of runtime/generated paths.
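One possible shape for the guardrail is a check that fails when a change set touches runtime paths. The patterns below mirror the .gitignore runtime policy from the deployment section (logs/, *.db-wal, *.db-shm); the function names are hypothetical, and how the check is wired into CI or a pre-commit hook is left open.

```shell
# Read file paths on stdin and print those that match runtime/generated patterns.
find_forbidden_paths() {
  grep -E '(^|/)logs/|\.db-wal$|\.db-shm$'
}

# Read file paths on stdin; succeed only if none are forbidden.
guard_paths() {
  local hits
  # grep exits non-zero when nothing matches, which means the change is clean.
  hits=$(find_forbidden_paths) || return 0
  printf 'Blocked runtime/generated paths:\n%s\n' "$hits" >&2
  return 1
}

# Usage as a pre-commit hook or CI step:
#   git diff --cached --name-only | guard_paths
#   git diff --name-only origin/main...HEAD | guard_paths
```

Extending the pattern list with each repo's generated directories would cover the openarcade and modern-news cases noted in the hygiene audit.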
Deliverables Created
- research/scripts/collect_host_audit.sh
- research/data/2026-02-25/topology.json
- research/data/2026-02-25/repo_hygiene.csv
- research/data/2026-02-25/{spark,prometheus,atlas,epimetheus}/... raw snapshots
- research/data/2026-02-25/local/commits-since-2026-01-25.txt
Assumptions
- "Code Two" was interpreted as your active/pushed repo work over the last month (2026-01-25..2026-02-25) based on local git and sync logs.
- Walmart Alaska MacBook was intentionally excluded in this run.