Voice Pipeline
The voice pipeline turns “Hey GLaDOS” into a response played through the living room speakers — entirely on local hardware, with no cloud services in the loop. Three distinct pipelines handle different invocation contexts: the hardware satellite, the Mac terminal, and the web dashboard.
Architecture Overview
All voice processing runs on two machines: the Pi 5 (Caroline) hosts the Wyoming STT/TTS/wake-word containers and orchestrates the HA Assist pipeline; Atlas (the M4 Pro) runs Ollama for LLM inference. nightwatch (the AMD GPU machine) provides specialty TTS backends on demand, woken via Wake-on-LAN when needed.
[Satellite hardware]
  |
[openWakeWord: "Hey GLaDOS"]
  |
[Whisper: speech to text]
  |
[n8n webhook → Ollama on Atlas]
  |
[Piper TTS: text to speech]
  |
[Sonos speakers: audio output]
The Three Pipelines
Pipeline 1: The Satellite (Primary)
The Satellite1 is a custom ESPHome device with a microphone array that listens continuously for wake words. It runs server-side wake word detection, meaning the raw audio stream is forwarded to the Pi’s openWakeWord container rather than running detection on-device. This keeps the hardware simple and the models upgradeable.
Once the wake word fires:
- The audio stream goes to Whisper (faster-whisper small-int8, English) for transcription
- The transcript reaches Home Assistant’s Assist pipeline, which routes it through the m_agent custom component to n8n via a local webhook
- n8n calls Ollama on Atlas for response generation
- The response goes to Piper TTS for synthesis
- Audio is returned to the Satellite; since it has no built-in speaker, playback routes to the nearest Sonos
The satellite is on the IoT network VLAN, isolated from the main LAN. The wake word detection, STT, and TTS containers listen only on loopback ports; HA reaches them via localhost because it runs in host network mode.
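The loopback-only binding can be expressed directly in docker-compose. The fragment below is a minimal sketch, not the actual contents of pi/docker-compose.ha.yml; it assumes the standard rhasspy images and their default Wyoming ports (10300 whisper, 10200 piper, 10400 openwakeword):

```yaml
services:
  wyoming-whisper:
    image: rhasspy/wyoming-whisper
    # Bound to loopback only: HA (running in host network mode)
    # reaches it via localhost; the rest of the LAN cannot.
    ports:
      - "127.0.0.1:10300:10300"
    command: --model small-int8 --language en
  wyoming-piper:
    image: rhasspy/wyoming-piper
    ports:
      - "127.0.0.1:10200:10200"
    command: --voice en_US-lessac-medium
  wyoming-openwakeword:
    image: rhasspy/wyoming-openwakeword
    ports:
      - "127.0.0.1:10400:10400"
```

Prefixing the host port with `127.0.0.1:` is what restricts each container to loopback; omitting it would publish the port on all interfaces.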
Pipeline 2: The Terminal (Mac)
scripts/glados-say.sh is a command-line script that sends text to any of the voice backends on nightwatch and plays the audio locally. It selects the backend by name and logs the interaction to the dashboard API.
| Backend | Technology | Approximate Latency |
|---|---|---|
| glados | Forward Tacotron + HiFiGAN (Wyoming) | ~1s |
| kokoro | Kokoro-82M (OpenAI-compat HTTP) | ~0.2s |
| xtts | XTTS v2, GLaDOS fine-tune (Wyoming) | 2-5s |
| m | Chatterbox Turbo (Judi Dench voice, Wyoming) | varies |
| peter | Peter Griffin RVC v2 (HTTP) | varies |
| peter2 | Peter Griffin GPT-SoVITS (HTTP) | varies |
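Selecting a backend by name reduces to a simple dispatch table. The sketch below illustrates the idea; the function name, host ports, and endpoint paths are assumptions for illustration, not the actual contents of scripts/glados-say.sh:

```shell
#!/bin/sh
# Map a backend name to its endpoint on nightwatch (ports hypothetical).
backend_endpoint() {
  case "$1" in
    glados) echo "tcp://nightwatch:10201" ;;                  # Wyoming
    kokoro) echo "http://nightwatch:8880/v1/audio/speech" ;;  # OpenAI-compat HTTP
    xtts)   echo "tcp://nightwatch:10202" ;;                  # Wyoming
    m)      echo "tcp://nightwatch:10203" ;;                  # Wyoming
    peter|peter2) echo "http://nightwatch:8001/tts" ;;        # plain HTTP
    *)      echo "unknown backend: $1" >&2; return 1 ;;
  esac
}

backend_endpoint kokoro
```

Wyoming backends speak a TCP event protocol, so the real script would hand those endpoints to a Wyoming client rather than curl; the HTTP backends can be driven with curl directly.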
Pipeline 3: The Dashboard (Web)
The dashboard’s /chat page uses the Web Speech API for voice input and useTTS for synthesis output. Transcribed speech goes to n8n, which routes to Ollama or Claude depending on the request, and the response plays back in-browser via Web Audio.
Voice Components on the Pi
Section titled “Voice Components on the Pi”Five containers make up the on-Pi voice stack, co-deployed with Home Assistant.
| Container | Image | Role |
|---|---|---|
| wyoming-whisper | rhasspy/wyoming-whisper | STT — faster-whisper small-int8 (loopback only) |
| wyoming-piper | rhasspy/wyoming-piper | TTS — Piper en_US-lessac-medium (loopback only) |
| wyoming-openwakeword | rhasspy/wyoming-openwakeword | Wake word detection, TFLite (loopback only) |
| homeassistant | ghcr.io/home-assistant/home-assistant | Pipeline orchestrator (port 8123) |
| esphome | ghcr.io/esphome/esphome | Satellite firmware management |
Wake Word Models
Custom wake word models are TFLite format, trained on nightwatch’s AMD GPU using tools/wake-words/train_all.sh. They live in ha-data/openwakeword-custom/.
| Model | Type |
|---|---|
| hey_glados | Custom (primary active wake word) |
| glados | Custom |
| claude | Custom |
| hudson | Custom |
| maude / hey_maude | Custom |
| jarvis | Community |
| computer | Community |
| ok_computer | Community |
| okay_nabu, hey_jarvis, hey_mycroft, alexa, hey_rhasspy | Built-in (always available) |
The Satellite Hardware
The Satellite1 is a FutureProofHomes ESPHome device with a microphone array. It connects to the IoT VLAN and communicates with the Pi over the Wyoming protocol.
Key properties:
- Wake word processing: server-side (audio streamed to Pi; no on-device inference)
- Speaker: none built in — all TTS audio routes to Sonos
- Active wake word: hey_glados
- Firmware config: ha-config/esphome/satellite1-voice-patch.yaml and device config
- OTA flashing: via ESPHome dashboard (on-demand only, not always running)
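In ESPHome terms, server-side wake word detection means the device streams microphone audio to HA's Assist pipeline instead of running a model locally. The fragment below is an illustrative sketch of what such a patch can look like, not the actual contents of satellite1-voice-patch.yaml; the microphone id and tuning values are assumptions:

```yaml
# Stream raw audio to HA's Assist pipeline so the Pi's openWakeWord
# container performs detection; no wake word model runs on-device.
voice_assistant:
  microphone: mic_array        # hypothetical microphone id
  use_wake_word: true          # detection happens server-side in the pipeline
  noise_suppression_level: 2
  auto_gain: 31dBFS
```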
The ESPHome repository also manages three BLE proxy devices (bathroom and kitchen) and three MTR1 presence and temperature sensors (bedroom, garage, living room).
nightwatch: On-Demand TTS
nightwatch (the AMD Radeon 7900 XTX machine) hosts all the specialty TTS backends. It is not running continuously — a Wake-on-LAN automation pre-wakes it when the satellite detects a wake word, with a 12-second budget from suspend to service-ready.
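The pre-wake itself is just a Wake-on-LAN magic packet: 6 bytes of 0xFF followed by the target MAC address repeated 16 times, 102 bytes in total. A self-contained sketch (the MAC below is a placeholder, not nightwatch's real address):

```shell
#!/bin/sh
# Build a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the
# target MAC repeated 16 times (102 bytes total).
# Placeholder MAC aa:bb:cc:dd:ee:ff, written as octal printf escapes.
mac='\252\273\314\335\356\377'
{
  printf '\377\377\377\377\377\377'
  i=0
  while [ "$i" -lt 16 ]; do printf "$mac"; i=$((i+1)); done
} > /tmp/wol.pkt
wc -c < /tmp/wol.pkt   # 102 for a well-formed packet
# Sending is one more line, e.g. with bash:
#   cat /tmp/wol.pkt > /dev/udp/255.255.255.255/9
```

In practice HA's built-in wake_on_lan integration does this; the sketch just shows why the packet is cheap enough to fire on every wake word within a tight latency budget.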
Idle management:
- An activity monitor script stamps input_datetime.nightwatch_last_active every 60 seconds via a systemd timer
- If idle for 5 minutes with no session active, HA fires the idle shutdown automation
- A nightwatch_keep_alive boolean overrides the idle timeout when needed
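The idle-shutdown decision reduces to a timestamp comparison plus the keep-alive override. A minimal sketch of that logic (the function name and argument shape are hypothetical, not taken from activity-monitor.sh):

```shell
#!/bin/sh
# Decide whether nightwatch should power down: idle for at least
# 300 seconds (5 minutes) and the keep-alive override is off.
should_shutdown() {
  last_active=$1   # epoch seconds of last stamped activity
  now=$2           # current epoch seconds
  keep_alive=$3    # "on" or "off"
  [ "$keep_alive" = "on" ] && return 1
  [ $((now - last_active)) -ge 300 ]
}

if should_shutdown 1000 1400 off; then echo "shutdown"; else echo "stay up"; fi
```

In the real setup HA evaluates this condition, not the shell script; the sketch just makes the 5-minute threshold and the override precedence explicit.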
Voice Interaction Archival
Every satellite voice interaction is automatically archived. The push_voice_interaction HA automation sends the dialogue to an n8n webhook, which writes it to two PostgreSQL tables: voice_interactions (session metadata) and dialogue (the full transcript). This creates a permanent record of all voice interactions for analysis and memory retrieval.
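Such a webhook receives a small JSON payload per interaction. The field names below are assumptions about what an automation like this might send, not the actual push_voice_interaction schema:

```shell
#!/bin/sh
# Sketch of the JSON body an archival automation might POST to n8n.
# Field names are illustrative; real values would need JSON escaping.
build_payload() {
  session="$1"; transcript="$2"; response="$3"
  printf '{"session_id":"%s","transcript":"%s","response":"%s"}\n' \
    "$session" "$transcript" "$response"
}

build_payload abc123 "turn on the lights" "Done."
```

On the n8n side, one workflow node would insert the session row into voice_interactions and another the turn-by-turn text into dialogue.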
The M Voice Project
The M voice is a custom voice clone built with Chatterbox Turbo, targeting a Judi Dench-inspired voice for a personalized assistant experience. A dataset of 1,137 audio clips is prepared and ready. The M voice backend is already deployed on nightwatch and accessible from glados-say.sh. Full integration into the satellite pipeline is the next milestone.
Key Configuration Files
| File | Purpose |
|---|---|
| pi/docker-compose.ha.yml | Pi HA stack: HA + all Wyoming containers |
| docker-compose.voice.yml | Mac dev voice stack (Wyoming only) |
| scripts/glados-say.sh | Terminal TTS script, all backends |
| ha-config/custom_components/m_agent/ | Custom n8n-routing conversation agent |
| ha-config/esphome/satellite1-voice-patch.yaml | Server-side wake word patch for Satellite1 |
| nightwatch/scripts/activity-monitor.sh | Idle detection for nightwatch power management |
| tools/wake-words/train_all.sh | Wake word model training runner |