# Is sip-audio Rebuilding OpenClaw?
You have three systems that touch phone calling. OpenClaw orchestrates tasks and can invoke Twilio/Telnyx VoIP calls. MacroClaw places calls from your actual iPhone number via macOS Continuity. sip-audio drives a SIM7600G-H cellular modem on a Jetson Orin Nano, handling raw PCM audio byte-by-byte with VAD, barge-in, and sub-second latency tracing.
The honest question: is sip-audio redundant? You've already integrated OpenClaw into sip-audio (SMS routing through openclaw message send) and into Buddy (WebSocket gateway connection). You have a detailed OpenClaw + MacroClaw integration plan. Are you building the same thing three times?
The answer is no — but parts of sip-audio are duplicating work that OpenClaw does better, and other parts are irreplaceable. This article draws the line.
## The Three Systems — What Each Actually Does
| | OpenClaw | MacroClaw | sip-audio |
|---|---|---|---|
| Calling method | Twilio/Telnyx VoIP number | iPhone Continuity (your real number) | SIM7600G-H cellular modem / SIP trunk |
| Runs on | Mac (Node.js daemon, port 18789) | Mac (macOS-specific, BlackHole audio) | Jetson (headless ARM, serial ports) |
| LLM | Any (via config, Claude default) | Claude | Claude API or Ollama local |
| TTS | N/A (action-invocation model) | Chatterbox (voice clone) | Piper / F5-TTS (voice clone) |
| STT | N/A | MLX Whisper | faster-whisper |
| Orchestration | Full: cron, skills, memory, multi-channel, approval workflows | Minimal: goal .md files drive calls | CallOps API + SQLite + FastAPI dashboard |
| Hardware independent | No (needs Mac) | No (needs Mac + iPhone) | Yes (runs on Jetson with SIM card) |
| Audio handling | None (delegates to provider) | macOS audio routing | Direct PCM I/O over serial at 8kHz |
The table makes the distinction clear. These are three different things solving three different problems:
- OpenClaw decides what to do. It's the brain — task management, scheduling, multi-channel routing, memory, learning.
- MacroClaw calls as you. Your real number, your cloned voice, caller ID preserved. For when identity matters.
- sip-audio is the phone call. It owns the audio pipeline end-to-end: modem AT commands, PCM frame reading, VAD energy thresholds, barge-in detection, TTS synthesis, silence timeouts. It's a phone appliance.
## Where sip-audio Overlaps with OpenClaw
An honest accounting of what's duplicated in sip-audio's ~3,300 lines of Python.
Conversation context management — ollama_engine.py maintains a conversation_history list, appending user/assistant messages and passing them with a system prompt to Ollama's /api/chat. OpenClaw's agent runtime does this with better memory: SQLite + vector embeddings for semantic search, MEMORY.md for curated long-term facts, daily logs in memory/YYYY-MM-DD.md, and auto-compaction. sip-audio's conversation memory is ephemeral — it resets every call via reset_conversation(). OpenClaw's persists and learns across sessions.
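The pattern described above can be sketched as follows. This is a minimal illustration of per-call history handling against Ollama's /api/chat, not the actual ollama_engine.py; the class name, method names, and request shape are assumptions (only conversation_history, the system-prompt pass-through, and reset_conversation appear in the source).

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # standard local Ollama endpoint


class ConversationEngine:
    """Sketch of ephemeral per-call conversation memory (names assumed)."""

    def __init__(self, system_prompt: str, model: str = "llama3.1"):
        self.system_prompt = system_prompt
        self.model = model
        self.conversation_history: list[dict] = []

    def chat(self, user_text: str) -> str:
        # The full history plus the system prompt goes out on every turn
        self.conversation_history.append({"role": "user", "content": user_text})
        payload = {
            "model": self.model,
            "messages": [{"role": "system", "content": self.system_prompt}]
                        + self.conversation_history,
            "stream": False,
        }
        req = urllib.request.Request(OLLAMA_URL, data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            reply = json.load(resp)["message"]["content"]
        self.conversation_history.append({"role": "assistant", "content": reply})
        return reply

    def reset_conversation(self) -> None:
        # Called at call teardown: memory is ephemeral, unlike OpenClaw's store
        self.conversation_history.clear()
```

The clear() at teardown is exactly why this layer loses to OpenClaw on memory: nothing survives the call.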
Task and call orchestration — callops_store.py implements a full task management system: tasks table (id, status, owner, priority), calls table, events timeline, turns transcript, metrics, actions, artifacts. call_agent_api.py wraps this in a FastAPI server with endpoints for creating tasks (POST /v1/tasks), starting calls (POST /v1/calls/start), sending SMS/email/fax/print, and real-time WebSocket event streaming. This is a miniature OpenClaw. OpenClaw has cron jobs, skill-based task execution, approval workflows, a Canvas dashboard, and persistent learning. The CallOps task system is simpler and phone-call-specific, but it's still a task orchestrator competing with an existing one.
Channel actions — channels.py already delegates SMS to OpenClaw! The send_sms_openclaw() function shells out to openclaw message send. It also handles email (SMTP), fax (Sinch API), printing (CUPS), and letter queuing. This is the integration seam — sip-audio recognizes that OpenClaw is the messaging layer. But the pattern isn't complete: call orchestration and task management still live locally.
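The delegation seam is worth making concrete. A minimal sketch of the shell-out pattern, assuming flag names; the source only confirms the `openclaw message send` subcommand:

```python
import subprocess


def build_sms_command(to_number: str, body: str) -> list[str]:
    # Only "openclaw message send" is confirmed; the flag names are assumptions
    return ["openclaw", "message", "send", "--to", to_number, "--body", body]


def send_sms_openclaw(to_number: str, body: str) -> bool:
    """The channels.py delegation pattern: shell out to the OpenClaw CLI
    rather than talking to a carrier API directly."""
    result = subprocess.run(build_sms_command(to_number, body),
                            capture_output=True, text=True, timeout=30)
    return result.returncode == 0
```

Extending this same pattern to call reporting is the integration gap the rest of this article argues for closing.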
Dashboard and monitoring — sip-audio serves a Vue dashboard at GET / via call_agent_api.py, with real-time WebSocket updates for call events, latency metrics, and transcripts. OpenClaw has Canvas (port 18793) for agent-generated interactive HTML. Two dashboards, partially overlapping purposes.
## Where sip-audio Does NOT Overlap
The irreplaceable core — the parts no framework provides.
modem.py — Direct AT command control of the SIM7600G-H. This is 341 lines of serial port I/O that no orchestration framework will ever implement. It opens /dev/ttyUSB2 for AT commands and /dev/ttyUSB4 for PCM audio. It sends ATD+1XXXXXXXXXX; to dial, ATA to answer, AT+CPCMREG=1 to enable USB PCM streaming. It monitors for unsolicited result codes (RING, VOICE CALL: BEGIN, NO CARRIER) in a background thread. It manages TX/RX audio buffers with frame-aligned reads and writes. It sends silence frames (b'\x00' * BYTES_PER_FRAME) to keep the PCM stream alive. No framework does this. No framework will.
cellular_agent.py — Real-time PCM audio loop with VAD and barge-in. This is not action-invocation. This is a continuous audio processing loop running at 8kHz, computing energy per frame, detecting speech onset, tracking silence duration, and triggering the STT-LLM-TTS pipeline on utterance boundaries. Barge-in detection counts consecutive loud frames during TTS playback and calls modem.clear_tx() to interrupt. Post-dial delay skips carrier greeting audio. Silence timeout hangs up after prolonged quiet. LLM-driven hangup watches for [HANGUP] tags. This is the kind of code that runs in telephony engines, not in task orchestrators.
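The barge-in logic described above reduces to a small state machine over frame energies. A sketch, with thresholds assumed (the source states the mechanism, not the numbers):

```python
import array

ENERGY_THRESHOLD = 500.0  # assumed RMS threshold separating speech from line noise
BARGE_IN_FRAMES = 5       # assumed count of consecutive loud frames to interrupt TTS


def frame_rms(frame: bytes) -> float:
    """RMS energy of one 16-bit little-endian PCM frame."""
    samples = array.array("h", frame)
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


class BargeInDetector:
    """Counts consecutive loud frames during TTS playback; when the streak
    reaches BARGE_IN_FRAMES, the caller calls modem.clear_tx() to interrupt."""

    def __init__(self):
        self.loud_streak = 0

    def feed(self, frame: bytes) -> bool:
        """Returns True when the caller has barged in and TX should be cleared."""
        if frame_rms(frame) >= ENERGY_THRESHOLD:
            self.loud_streak += 1
        else:
            self.loud_streak = 0  # a quiet frame resets the streak
        return self.loud_streak >= BARGE_IN_FRAMES
```

Requiring a streak, rather than a single loud frame, is what keeps line clicks and echo from cutting off the TTS mid-sentence.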
spark_relay.py — Binary TCP protocol for GPU offload. A custom binary framing protocol ([type:1][length:4][payload]) that sends PCM utterances to a DGX Spark server and receives streamed audio chunks back. Frame types: UTTERANCE (0x01), STT_TEXT (0x02), AUDIO_CHUNK (0x03), DONE (0x04), RESET (0x05). This enables per-sentence TTS streaming — the modem starts playing audio before the full response is synthesized. OpenClaw has nothing like this; it doesn't operate at the audio-frame level.
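The framing layer is simple enough to show directly. The frame types and the [type:1][length:4][payload] layout are from the source; the big-endian length is an assumption:

```python
import struct

# Frame types from the spark_relay protocol
UTTERANCE, STT_TEXT, AUDIO_CHUNK, DONE, RESET = 0x01, 0x02, 0x03, 0x04, 0x05


def pack_frame(ftype: int, payload: bytes) -> bytes:
    """[type:1][length:4][payload]; big-endian length is an assumption."""
    return struct.pack(">BI", ftype, len(payload)) + payload


def unpack_frame(buf: bytes) -> tuple[int, bytes, bytes]:
    """Parse one complete frame; returns (type, payload, unconsumed bytes)."""
    ftype, length = struct.unpack(">BI", buf[:5])
    return ftype, buf[5:5 + length], buf[5 + length:]
```

Because each AUDIO_CHUNK is self-delimiting, the Jetson can start feeding the modem as soon as the first chunk lands, which is what makes per-sentence TTS streaming work.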
latency_trace.py — Sub-second voice quality metrics. Tracks STT duration, LLM inference time, TTS synthesis time, time-to-first-audio (TTFA), and end-to-end latency per conversation turn. Computes p50, p90, and mean across a call. Writes JSONL logs per call. These metrics matter for voice quality in ways that task-level tracking doesn't capture. A 200ms difference in TTFA is the difference between natural conversation and awkward pauses.
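A sketch of the per-call summary math, assuming nearest-rank percentiles (the source names p50, p90, and mean but not the method) and the JSONL-per-turn layout:

```python
import json
import math
import statistics


def summarize_metric(samples: list[float]) -> dict:
    """p50 / p90 / mean for one per-turn metric, e.g. TTFA in milliseconds."""
    ordered = sorted(samples)

    def pct(p: float) -> float:
        # nearest-rank percentile
        k = max(0, min(len(ordered) - 1, math.ceil(p * len(ordered)) - 1))
        return ordered[k]

    return {"p50": pct(0.50), "p90": pct(0.90), "mean": statistics.fmean(ordered)}


def append_turn(path: str, turn: dict) -> None:
    """One JSON object per line per conversation turn, matching the
    call-{mode}-{timestamp}.jsonl pattern."""
    with open(path, "a") as f:
        f.write(json.dumps(turn) + "\n")
```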
Runs headless on Jetson. No Mac required, no iPhone required. A Jetson Orin Nano with a SIM card is a phone appliance — always on, no laptop lid to close, no iPhone battery to die. It can answer calls 24/7 for voicemail, screening, or triage. It runs fully local with Ollama and Piper — it works without internet. This is a fundamentally different deployment model from "an agent running on your Mac."
The distinction that matters: OpenClaw invokes phone calls as actions. sip-audio IS the phone call. OpenClaw says "call this number." sip-audio handles what happens after the call connects — every PCM frame, every VAD decision, every barge-in, every silence timeout.
## The "Dedicated Hardware" Vision
Why sip-audio matters even if OpenClaw handles everything else.
A Jetson with a SIM card and a cellular modem is a phone appliance. It doesn't need your Mac to be open. It doesn't need your iPhone to be nearby. It doesn't need WiFi. It has its own cellular connection, its own compute, its own voice models. It is the phone.
This creates capabilities that OpenClaw and MacroClaw can't match:
- 24/7 availability — Answer calls at 3am without your laptop running. Screen spam calls. Triage voicemail. Greet callers with a custom voice.
- Fully local inference — Ollama (llama3.1) + Piper TTS + faster-whisper STT, all on-device. No API keys, no internet dependency, no per-minute billing. Private by architecture.
- Hardware control — Signal strength monitoring, SIM management, PCM audio routing, modem power states. These are physical-world interfaces that software frameworks can't abstract away.
- Execution layer — This is what OpenClaw or MacroClaw delegates to. OpenClaw decides "call the dentist at 2pm." sip-audio makes the call, handles the conversation, and returns a transcript.
The modem code, the audio loop, the latency tracing, the Spark relay — these are the moat. No one else is building this stack on a Jetson with a cellular modem. The orchestration layer on top of it is the commodity part.
## The Markdown Computer Observation
Your system already operates as a markdown computer — you just haven't named it that.
- CLAUDE.md / MEMORY.md / SYNC.md — Instructions, persistent memory, multi-instance coordination. Claude Code reads these on every session start.
- OpenClaw uses SKILL.md files for contextual playbooks, MEMORY.md for curated long-term facts, and daily logs in memory/YYYY-MM-DD.md for session history.
- sip-audio generates transcript.md, latency.md, and report.md per call via callops_store.py's render_call_markdown() and render_task_markdown().
- MacroClaw uses goal .md files to drive phone calls — the conversation objective is a markdown document.
- The OpenClaw + MacroClaw integration plan itself lives at ~/dev/sync/research/openclaw-macroclaw-integration.md.
Markdown files are the instruction set, the memory, the coordination protocol, and the output format. This is already the markdown computer. The question is whether to formalize it into a framework or keep it organic.
Recommendation: keep it organic. The power is in the simplicity. A markdown file is readable by humans, parseable by LLMs, diffable by git, and editable by any text editor. The moment you add a schema layer or a custom DSL, you lose the universality. Every tool in the stack — Claude Code, OpenClaw, sip-audio, Buddy — already speaks markdown natively. Don't fix what isn't broken.
## The Logging and Observability Gap
Both systems log extensively, but separately.
OpenClaw: Persistent memory (SQLite + vector embeddings), daily logs (memory/YYYY-MM-DD.md), session history, skill execution logs. Task-level tracking — what was decided, what actions were taken, what the outcomes were.
sip-audio: Latency JSONL per call (call-{mode}-{timestamp}.jsonl), SQLite event timeline (events table with call_id, event_type, stage, status, payload), conversation turns with timestamps and confidence scores, per-turn metrics (STT/LLM/TTS/TTFA/E2E), markdown reports and transcripts. Call-level tracking — what was said, how long each step took, what went wrong.
The gap: No unified view. If OpenClaw triggers a call via sip-audio, OpenClaw knows it asked for a call but doesn't know the latency profile, the barge-in count, or the exact transcript timing. sip-audio knows every PCM frame but doesn't know the broader task context — why the call was made, what task it serves, what the next step should be.
Recommendation: sip-audio should POST call outcomes to OpenClaw's gateway (or write to a shared location that OpenClaw reads). The natural seam is the CallOps API — after a call completes, render_call_markdown() already generates a transcript and latency report. Push those to OpenClaw's memory so it can reason about call quality, learn from outcomes, and adjust future task execution. The channels.py pattern (shelling out to openclaw message send) already exists for SMS — extend it to call reporting.
## Recommended Architecture
Three layers, clearly separated.
```
┌─────────────────────────────────────────────────────────┐
│ LAYER 1: ORCHESTRATION (OpenClaw)                       │
│                                                         │
│ Decides WHAT to do:                                     │
│ - Which calls to make, which emails to send             │
│ - Task list, schedules, cron jobs, approval workflows   │
│ - Persistent memory — learns preferences over time      │
│ - Multi-channel: WhatsApp, iMessage, email, SMS         │
│ - Skills: markdown playbooks for repeatable tasks       │
└──────────────────────────┬──────────────────────────────┘
                           │
              ┌────────────┴────────────┐
              ▼                         ▼
┌──────────────────────┐  ┌──────────────────────────────┐
│ LAYER 2a: MacroClaw  │  │ LAYER 2b: sip-audio          │
│                      │  │                              │
│ When caller ID       │  │ When hardware independence   │
│ matters (your real   │  │ matters (Jetson, 24/7,       │
│ iPhone number)       │  │ local-only, no Mac needed)   │
│                      │  │                              │
│ Returns: transcript  │  │ Returns: transcript          │
│ + outcome to L1      │  │ + outcome + latency to L1    │
└──────────────────────┘  └──────────────┬───────────────┘
                                         │
                          ┌──────────────┴───────────────┐
                          │ LAYER 3: HARDWARE            │
                          │ (sip-audio only)             │
                          │                              │
                          │ SIM7600G-H modem (AT cmds)   │
                          │ PCM audio I/O (8kHz serial)  │
                          │ Piper TTS / faster-whisper   │
                          │ Spark GPU offload (TCP)      │
                          │ Latency tracing (JSONL)      │
                          └──────────────────────────────┘
```
## Concrete Next Steps
1. Wire sip-audio as an OpenClaw skill. Create skills/cellular-call/SKILL.md that POSTs to the CallOps API (POST /v1/calls/start). OpenClaw decides when to call; sip-audio handles the call. The skill reads back GET /v1/calls/{id}/transcript and GET /v1/calls/{id}/metrics to report the outcome. This is a clean integration — sip-audio's API already exists, and OpenClaw's skill system is purpose-built for this.
2. Keep building sip-audio's voice layer. The modem driver, the audio loop, the VAD, the barge-in logic, the Spark relay, the latency tracing — this is the moat. Invest here. Improve voice quality, reduce TTFA, add wake-word detection, build better echo cancellation. This code has no equivalent anywhere.
3. Stop building sip-audio's orchestration layer. The CallOps task management system (callops_store.py tasks table, call_agent_api.py task endpoints) duplicates OpenClaw's stronger version. Keep the call and metrics tracking — that's call-level instrumentation that OpenClaw doesn't do. But stop building task-level orchestration features (task creation, status management, report generation) and let OpenClaw own that.
4. Unify logging. After a call completes, sip-audio should push the transcript and latency summary to OpenClaw's memory. Simplest path: the CallOps API already renders markdown reports — have the OpenClaw skill read those and store them in OpenClaw's memory. sip-audio call outcomes flow back to OpenClaw so it can learn and adjust.
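Step 1 above can be sketched end to end. The three endpoint paths are from the source; the host, port, and request body fields are assumptions:

```python
import json
import urllib.request

CALLOPS = "http://jetson.local:8000"  # assumed host/port for the CallOps FastAPI server


def call_endpoints(base: str, call_id: str = "") -> dict[str, str]:
    # The three paths appear in the CallOps API; everything else is illustrative
    return {
        "start": f"{base}/v1/calls/start",
        "transcript": f"{base}/v1/calls/{call_id}/transcript",
        "metrics": f"{base}/v1/calls/{call_id}/metrics",
    }


def run_call_skill(number: str, goal: str) -> dict:
    """The round trip a skills/cellular-call/SKILL.md playbook would drive:
    start the call, then read back transcript and metrics for OpenClaw's memory."""
    body = json.dumps({"to": number, "goal": goal}).encode()
    req = urllib.request.Request(call_endpoints(CALLOPS)["start"], data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        call = json.load(resp)
    eps = call_endpoints(CALLOPS, call["id"])
    with urllib.request.urlopen(eps["transcript"], timeout=10) as resp:
        transcript = json.load(resp)
    with urllib.request.urlopen(eps["metrics"], timeout=10) as resp:
        metrics = json.load(resp)
    return {"call_id": call["id"], "transcript": transcript, "metrics": metrics}
```

OpenClaw stays a pure client of the CallOps API here; sip-audio owns everything below the HTTP boundary.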
## Answering the Concerns
"Am I reinventing the wheel?" — Yes, at the orchestration layer. The CallOps task management is a less capable version of OpenClaw's task loop. No, at the voice/hardware layer. Nobody has a SIM7600G-H modem driver with PCM audio streaming and real-time VAD in a framework you can install.
"Flexibility and control?" — sip-audio as an OpenClaw skill preserves full control. You own the implementation end-to-end. OpenClaw just triggers it via HTTP — the same way channels.py already delegates SMS to openclaw message send. The integration is a function call, not a framework lock-in.
"Do I understand the stack?" — You understand the voice stack deeply. modem.py is 341 lines of AT command handling you wrote. cellular_agent.py is 482 lines of real-time audio processing. spark_relay.py is a custom binary protocol. You understand every byte. OpenClaw's orchestration is well-documented, you're already using it for SMS, and Buddy already has a WebSocket integration with its gateway.
"Am I overbuilding?" — The CallOps API task management endpoints are the redundant part. The modem/audio/latency code is the essential part. The dashboard is a nice-to-have that could eventually be replaced by OpenClaw's Canvas or coexist as the call-specific monitoring view.
"What about logging?" — Better together. OpenClaw for task-level tracking (what was the goal, was it achieved, what's next). sip-audio for call-level metrics (TTFA, barge-in count, silence timeouts, per-turn latency). Push call summaries up to OpenClaw so it has the full picture.
"What about the hardware?" — No framework supports the SIM7600 modem. This is your moat. The Jetson + SIM card + modem + local models stack is a unique thing you've built. The orchestration on top is commodity. Invest in the moat.