Monitoring & Observability
The observability stack is the full LGTM pattern: Loki (logs), Grafana (visualization and alerting), Tempo (distributed traces), and Prometheus (metrics). Everything runs as Docker containers on the production host alongside the services it monitors. Configuration is fully provisioned from the repository — dashboards, datasources, alert rules, and contact points are all version-controlled.
Container Inventory (12 containers)
| Container | Image | Role |
|---|---|---|
| grafana | grafana/grafana | Visualization, dashboards, alerting UI |
| loki | grafana/loki | Log aggregation and querying |
| prometheus | prom/prometheus | Metrics TSDB and scrape engine |
| tempo | grafana/tempo | Distributed tracing (OTLP receiver) |
| alloy | grafana/alloy | Log and syslog collector (replaced Promtail) |
| cadvisor | gcr.io/cadvisor/cadvisor | Per-container CPU, memory, network, OOM metrics |
| node-exporter | prom/node-exporter | Production host system metrics |
| postgres-exporter | prometheuscommunity/postgres-exporter | PostgreSQL metrics |
| blackbox-exporter | prom/blackbox-exporter | HTTP/HTTPS/TCP endpoint probing |
| otel-collector | otel/opentelemetry-collector-contrib | OTLP fan-out (Claude Code telemetry) |
| unpoller | ghcr.io/unpoller/unpoller | UniFi network metrics |
| docker-socket-proxy-grafana | tecnativa/docker-socket-proxy | Read-only Docker API proxy for Alloy/cAdvisor |
All 12 containers run in net-monitoring, which is isolated from the application and data networks. Grafana and Prometheus are also on net-frontend so Caddy can proxy their web UIs.
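The network split described above can be sketched as a Compose fragment (service names match the inventory; the `external` network declarations are an assumption about how the stack is wired):

```yaml
services:
  grafana:
    image: grafana/grafana
    networks:
      - net-monitoring
      - net-frontend      # exposed so Caddy can proxy the web UI
  loki:
    image: grafana/loki
    networks:
      - net-monitoring    # monitoring-only; not reachable from app networks

networks:
  net-monitoring:
    external: true
  net-frontend:
    external: true
```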
Data Flow
Containers (stdout/stderr) └─> Alloy (Docker socket discovery via proxy, log tailing) └─> Loki → Grafana
Containers (cgroup stats via /sys) └─> cAdvisor └─> Prometheus (scrapes every 30s)
Production host (proc/sys mounts) └─> node-exporter └─> Prometheus (scrapes every 15s)
PostgreSQL └─> postgres-exporter (read-only role) └─> Prometheus (scrapes every 15s)
UniFi Controller API └─> unpoller (polls every 30s) └─> Prometheus
Remote hosts (nightwatch GPU node, inference host) └─> node-exporter, AMD GPU exporter └─> Prometheus (scrapes every 30s)
Claude Code (OTLP over gRPC or HTTP) └─> otel-collector ├─> Prometheus (metrics) ├─> Tempo (traces) └─> Loki (logs)
UniFi syslog └─> Alloy (UDP 514) └─> Loki
Prometheus storage: local TSDB, 2-year (730-day) / 50GB max retention.
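The Claude Code fan-out in the flows above corresponds to an OTEL collector pipeline config along these lines (a minimal sketch; exporter names and ports are assumptions, and newer collector versions may replace the `loki` exporter with `otlphttp`):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:              # exposes a /metrics endpoint for Prometheus to scrape
    endpoint: 0.0.0.0:8889
  otlp/tempo:              # forwards traces to Tempo's OTLP receiver
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:                    # pushes logs to Loki (contrib exporter)
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      exporters: [loki]
```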
Prometheus Scrape Jobs (17)
| Job | Target | Interval |
|---|---|---|
| prometheus | Prometheus self | 15s |
| node-exporter-pi | Production host metrics | 15s |
| node-exporter-nightwatch | GPU node host metrics | 30s |
| node-exporter-atlas | Inference host metrics | 30s |
| cadvisor | Container metrics | 30s (timeout 25s) |
| postgres-exporter | PostgreSQL metrics | 15s |
| caddy | Caddy metrics (admin API) | 15s |
| grafana | Grafana metrics | 15s |
| alloy | Alloy collector metrics | 15s |
| tempo | Tempo metrics | 15s |
| amd-gpu | GPU node AMD GPU metrics | 30s |
| unpoller | UniFi network metrics | 30s |
| blackbox-http | 5 internal service probes | 5m |
| blackbox-http-auth | Auth-gated MCP endpoint probe (accepts 401) | 5m |
| blackbox-https | External endpoint probe (Cloudflare path) | 5m |
| loki | Loki metrics | 15s |
| otel-collector | OTEL collector metrics | 15s |
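Two representative entries from the table above, in prometheus.yml form (target hostnames and ports are assumptions, not taken from the repo):

```yaml
global:
  scrape_interval: 15s        # default; slower jobs override per-job

scrape_configs:
  - job_name: node-exporter-pi
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: cadvisor
    scrape_interval: 30s
    scrape_timeout: 25s       # cAdvisor on ARM can be slow to respond
    static_configs:
      - targets: ['cadvisor:8080']
```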
Blackbox Probing
The blackbox exporter probes internal services to verify reachability from within the Docker network — separate from whether the container itself reports healthy.
Probe modules:
| Module | Use case |
|---|---|
| http_2xx | Unauthenticated internal services (expects 2xx) |
| https_2xx_3xx | External Cloudflare endpoint (allows Authelia redirects) |
| http_2xx_or_401 | Auth-gated services with no public health path (accepts 401 as “reachable”) |
| https_2xx | TLS-verified probes |
| tcp_connect | Raw port connectivity check |
Probed targets include: dashboard API, Home Assistant, n8n, Grafana, Authelia, and the MCP postgres proxy endpoint.
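The http_2xx_or_401 module is the interesting one: blackbox exporter treats any status outside `valid_status_codes` as a probe failure, so 401 is explicitly whitelisted. A sketch of what blackbox.yml likely contains (status lists and timeouts are assumptions):

```yaml
modules:
  http_2xx:
    prober: http
    timeout: 10s

  http_2xx_or_401:
    prober: http
    http:
      valid_status_codes: [200, 204, 401]   # 401 still proves the endpoint is reachable

  https_2xx_3xx:
    prober: http
    http:
      valid_status_codes: [200, 301, 302]   # Authelia redirects are acceptable
      fail_if_not_ssl: true

  tcp_connect:
    prober: tcp
```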
Alert Rules (17 rules across 3 groups)
Infrastructure (eval every 1 minute)
| Rule | Condition | Severity |
|---|---|---|
| Disk Space Critical | Root filesystem > 90% for 10 minutes | critical |
| Container Stopped | Time since last seen > 300s for 5 minutes | critical |
| High Memory Usage | Available memory < 10% for 5 minutes | critical |
| PostgreSQL Connections High | Active connections > 80 for 5 minutes | warning |
| Sustained High CPU | CPU > 85% for 15 minutes | warning |
| Loki Log Ingestion Stopped | Ingestion rate = 0 for 10 minutes | warning |
| Memory Pressure Critical | Available memory < 500MB for 5 minutes | critical |
| Swap Usage High | Swap > 80% for 5 minutes | warning |
| Container OOM Killed | OOM event increase in last 5 minutes | critical |
| Container Restart Loop | More than 3 restarts in 10 minutes | critical |
| PostgreSQL Down | pg_up < 1 for 1 minute | critical |
| Disk Space Warning | Root filesystem > 85% for 10 minutes | warning |
| Container Docker Unhealthy | Docker health state == 0 for 5 minutes | critical |
Container scope for stopped/OOM/restart/unhealthy rules: matches chris-os-* and homeassistant and wyoming-* containers. Excluded from stopped alert: Alloy (expected to run continuously) and Piston (on-demand code execution only).
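The container-scoped rules above can be sketched in Prometheus rule-file form for readability (the actual rules are Grafana-provisioned; metric names follow cAdvisor conventions, and the label matchers are illustrative):

```yaml
groups:
  - name: infrastructure
    interval: 1m
    rules:
      - alert: ContainerStopped
        # time since any metric was last seen from a monitored container
        expr: (time() - container_last_seen{name=~"chris-os-.*|homeassistant|wyoming-.*"}) > 300
        for: 5m
        labels:
          severity: critical

      - alert: ContainerOOMKilled
        # any OOM kill event in the last 5 minutes
        expr: increase(container_oom_events_total{name=~"chris-os-.*|homeassistant|wyoming-.*"}[5m]) > 0
        labels:
          severity: critical
```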
SSL and Endpoints (eval every 5 minutes)
| Rule | Condition | Severity |
|---|---|---|
| SSL Certificate Expiry Warning | Days remaining < 14 for 10 minutes | warning |
| SSL Certificate Expiry Critical | Days remaining < 3 for 5 minutes | critical |
| Endpoint Unreachable | Probe success < 1 for 5 minutes | critical |
Endpoint probes use noDataState: Alerting — if the blackbox exporter itself stops reporting, the rule fires rather than resolving silently.
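In Grafana's provisioned alert-rule format, that failure mode is controlled by two fields on the rule itself (abridged sketch; the uid and condition ref are illustrative):

```yaml
# Abridged Grafana-provisioned rule: noDataState: Alerting makes a
# missing blackbox series fire instead of silently resolving.
- uid: endpoint-unreachable     # illustrative uid
  title: Endpoint Unreachable
  condition: C
  for: 5m
  noDataState: Alerting         # no probe data -> alert
  execErrState: Alerting        # query errors also alert
  labels:
    severity: critical
```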
Backup Freshness (eval every 30 minutes)
| Rule | Condition | Severity |
|---|---|---|
| Offsite R2 Backup Stale | Backup age > 26 hours for 30 minutes | critical |
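The 26-hour threshold gives a nightly backup a 2-hour grace window before alerting. The underlying expression presumably compares a last-success timestamp against the current time; the metric name here is an assumption, not taken from the repo:

```yaml
# Sketch only — r2_backup_last_success_timestamp_seconds is a
# hypothetical metric name for the offsite backup job.
expr: (time() - r2_backup_last_success_timestamp_seconds) > (26 * 3600)
for: 30m
```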
Notification Routing
Two contact points:
| Contact Point | Channel | Delivery |
|---|---|---|
| pushover | Pushover mobile push | All devices, normal priority |
| discord-alerts | Discord webhook | #vital-apparatus channel (“Aperture Science Monitoring”) |
Routing policy:
- Default receiver: pushover
- critical severity: pushover (group wait 10s), then discord-alerts (dual-path, continue=true)
- warning severity: pushover only (group wait 1 minute)
- Repeat intervals: critical 4h (Pushover) / 8h (Discord), warning 4h
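As a Grafana notification-policy provisioning file, that routing looks roughly like this (abridged; matcher syntax varies slightly between Grafana versions):

```yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: pushover            # default receiver
    routes:
      - receiver: pushover
        object_matchers:
          - ['severity', '=', 'critical']
        group_wait: 10s
        repeat_interval: 4h
        continue: true            # fall through so the Discord route also fires
      - receiver: discord-alerts
        object_matchers:
          - ['severity', '=', 'critical']
        repeat_interval: 8h
      - receiver: pushover
        object_matchers:
          - ['severity', '=', 'warning']
        group_wait: 1m
        repeat_interval: 4h
```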
Dashboards (10 provisioned)
All dashboards are provisioned from grafana/dashboards/ and load automatically on container startup. They live in the chris-os folder in Grafana.
| Dashboard | Contents |
|---|---|
| system-overview | Production host CPU, memory, disk, network |
| docker-containers | Per-container cAdvisor metrics |
| postgresql | Query latency, connections, table sizes |
| n8n-workflows | n8n execution metrics |
| pipeline-health | Data pipeline health |
| voice-pipeline | Voice pipeline metrics |
| health-wellness | Health and wellness data |
| brewery | Inkbird temperature, TP-Link power monitoring (uses HA PostgreSQL datasource) |
| claude-code | Claude Code OTLP telemetry |
| glados-telemetry | GLaDOS framework telemetry |
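File-based dashboard provisioning uses a provider definition like the following (the container-side path is an assumption about the volume mount):

```yaml
apiVersion: 1
providers:
  - name: chris-os
    folder: chris-os              # Grafana folder the dashboards land in
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30     # re-scan cadence for changed JSON files
    options:
      path: /etc/grafana/provisioning/dashboards   # assumed mount path
```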
Datasources (5 provisioned)
| Name | Type | Target |
|---|---|---|
| Prometheus | prometheus | Prometheus (default, 15s scrape) |
| Loki | loki | Loki log aggregation |
| PostgreSQL | grafana-postgresql-datasource | Primary database, read-only role |
| HomeAssistant | grafana-postgresql-datasource | Home Assistant database, HA role |
| Tempo | tempo | Tempo trace backend |
Grafana authentication uses Authelia OIDC directly (client_id: grafana, PKCE S256). Members of the admins group receive the Grafana Admin role. Grafana’s own database uses PostgreSQL (not SQLite).
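The OIDC and database settings map onto Grafana environment variables roughly as follows (a sketch; the Authelia URLs and the role-mapping expression are assumptions):

```yaml
environment:
  GF_AUTH_GENERIC_OAUTH_ENABLED: "true"
  GF_AUTH_GENERIC_OAUTH_CLIENT_ID: grafana
  GF_AUTH_GENERIC_OAUTH_USE_PKCE: "true"          # PKCE S256
  GF_AUTH_GENERIC_OAUTH_AUTH_URL: https://auth.example.com/api/oidc/authorization
  GF_AUTH_GENERIC_OAUTH_TOKEN_URL: https://auth.example.com/api/oidc/token
  GF_AUTH_GENERIC_OAUTH_API_URL: https://auth.example.com/api/oidc/userinfo
  # map the Authelia admins group to Grafana Admin (illustrative JMESPath)
  GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH: "contains(groups[*], 'admins') && 'GrafanaAdmin' || 'Viewer'"
  GF_AUTH_GENERIC_OAUTH_ALLOW_ASSIGN_GRAFANA_ADMIN: "true"
  GF_DATABASE_TYPE: postgres                      # Grafana's own DB on PostgreSQL
```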
Key Configuration Files
| File | Purpose |
|---|---|
| grafana/docker-compose.grafana.yml | Compose stack for the full monitoring group |
| grafana/alloy-config.alloy | Alloy log collection and routing config |
| grafana/prometheus.yml | Prometheus scrape config |
| grafana/alerts.yml | Alert rule definitions (provisioned) |
| grafana/blackbox.yml | Blackbox exporter probe module definitions |
| grafana/otel/collector-config.yml | OTLP collector fan-out config |
| grafana/dashboards/*.json | Dashboard definitions (provisioned) |
Notable Behaviors
Container “stopped” alerting: Uses last_over_time and a threshold pipeline rather than absence detection. The rule tracks the most recent timestamp any metric was seen from a container. noDataState: Alerting ensures alerts fire even if cAdvisor stops reporting entirely.
Loki WAL: A prior crash corrupted the Loki WAL. Recovery required wiping the WAL directory. min_ready_duration: 0s is set in Loki config — a value of 15s (the default) causes an infinite readiness loop on this version.
cAdvisor ARM overlay: cAdvisor disk scanning on ARM is slow (2-8 minutes). --docker_only=true flag limits cAdvisor to Docker containers only, avoiding host filesystem scanning.
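In the Compose file, that flag sits on the cAdvisor service command line; a sketch (the housekeeping interval is an assumption chosen to match the 30s scrape cadence):

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor
  command:
    - --docker_only=true            # skip slow host filesystem scanning on ARM
    - --housekeeping_interval=30s   # assumed; aligns stats collection with scrapes
```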
OTLP from Claude Code: Claude Code emits OTLP telemetry to the collector. Metrics flow to Prometheus, traces to Tempo, logs to Loki. The claude-code and glados-telemetry dashboards visualize this data.