Monitoring & Observability
The observability stack is the full LGTM pattern: Loki (logs), Grafana (visualization and alerting), Tempo (distributed traces), and Prometheus (metrics). Everything runs as Docker containers on the production host alongside the services it monitors. Configuration is fully provisioned from the repository — dashboards, datasources, alert rules, and contact points are all version-controlled.
Container Inventory (12 containers)
| Container | Image | Role |
|---|---|---|
| grafana | grafana/grafana | Visualization, dashboards, alerting UI |
| loki | grafana/loki | Log aggregation and querying |
| prometheus | prom/prometheus | Metrics TSDB and scrape engine |
| tempo | grafana/tempo | Distributed tracing (OTLP receiver) |
| alloy | grafana/alloy | Log and syslog collector (replaced Promtail) |
| cadvisor | gcr.io/cadvisor/cadvisor | Per-container CPU, memory, network, OOM metrics |
| node-exporter | prom/node-exporter | Production host system metrics |
| postgres-exporter | prometheuscommunity/postgres-exporter | PostgreSQL metrics |
| blackbox-exporter | prom/blackbox-exporter | HTTP/HTTPS/TCP endpoint probing |
| otel-collector | otel/opentelemetry-collector-contrib | OTLP fan-out (Claude Code telemetry) |
| unpoller | ghcr.io/unpoller/unpoller | UniFi network metrics |
| docker-socket-proxy-grafana | tecnativa/docker-socket-proxy | Read-only Docker API proxy for Alloy/cAdvisor |
All 12 containers run in net-monitoring, which is isolated from the application and data networks. Grafana and Prometheus are also on net-frontend so Caddy can proxy their web UIs.
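The network split described above can be sketched as a Compose fragment (service names match the inventory; the `external` network declarations are an assumption about how the stack is wired):

```yaml
services:
  grafana:
    image: grafana/grafana
    networks:
      - net-monitoring
      - net-frontend      # exposed so Caddy can proxy the web UI
  loki:
    image: grafana/loki
    networks:
      - net-monitoring    # monitoring-only; not reachable from app networks

networks:
  net-monitoring:
    external: true
  net-frontend:
    external: true
```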
Data Flow
Containers (stdout/stderr) └─> Alloy (Docker socket discovery via proxy, log tailing) └─> Loki → Grafana
Containers (cgroup stats via /sys) └─> cAdvisor └─> Prometheus (scrapes every 30s)
Production host (proc/sys mounts) └─> node-exporter └─> Prometheus (scrapes every 15s)
PostgreSQL └─> postgres-exporter (read-only role) └─> Prometheus (scrapes every 15s)
UniFi Controller API └─> unpoller (polls every 30s) └─> Prometheus
Remote hosts (nightwatch GPU node, inference host) └─> node-exporter, AMD GPU exporter └─> Prometheus (scrapes every 30s)
Claude Code (OTLP over gRPC or HTTP) └─> otel-collector ├─> Prometheus (metrics) ├─> Tempo (traces) └─> Loki (logs)
UniFi syslog └─> Alloy (UDP 514) └─> Loki
Prometheus storage: local TSDB, 2-year (730-day) / 50GB max retention.
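The Claude Code fan-out in the flows above corresponds to an OTEL collector pipeline config along these lines (a minimal sketch; exporter names and ports are assumptions, and newer collector versions may replace the `loki` exporter with `otlphttp`):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  prometheus:              # exposes a /metrics endpoint for Prometheus to scrape
    endpoint: 0.0.0.0:8889
  otlp/tempo:              # forwards traces to Tempo's OTLP receiver
    endpoint: tempo:4317
    tls:
      insecure: true
  loki:                    # pushes logs to Loki (contrib exporter)
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      exporters: [loki]
```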
Prometheus Scrape Jobs (17)
| Job | Target | Interval |
|---|---|---|
| prometheus | Prometheus self | 15s |
| node-exporter-pi | Production host metrics | 15s |
| node-exporter-nightwatch | GPU node host metrics | 30s |
| node-exporter-atlas | Inference host metrics | 30s |
| cadvisor | Container metrics | 30s (timeout 25s) |
| postgres-exporter | PostgreSQL metrics | 15s |
| caddy | Caddy metrics (admin API) | 15s |
| grafana | Grafana metrics | 15s |
| alloy | Alloy collector metrics | 15s |
| tempo | Tempo metrics | 15s |
| amd-gpu | GPU node AMD GPU metrics | 30s |
| unpoller | UniFi network metrics | 30s |
| blackbox-http | 5 internal service probes | 5m |
| blackbox-http-auth | Auth-gated MCP endpoint probe (accepts 401) | 5m |
| blackbox-https | External endpoint probe (Cloudflare path) | 5m |
| loki | Loki metrics | 15s |
| otel-collector | OTEL collector metrics | 15s |
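Two representative entries from the table above, in prometheus.yml form (target hostnames and ports are assumptions, not taken from the repo):

```yaml
global:
  scrape_interval: 15s        # default; slower jobs override per-job

scrape_configs:
  - job_name: node-exporter-pi
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: cadvisor
    scrape_interval: 30s
    scrape_timeout: 25s       # cAdvisor on ARM can be slow to respond
    static_configs:
      - targets: ['cadvisor:8080']
```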
Blackbox Probing
The blackbox exporter probes internal services to verify reachability from within the Docker network — separate from whether the container itself reports healthy.
Probe modules:
| Module | Use case |
|---|---|
| http_2xx | Unauthenticated internal services (expects 2xx) |
| https_2xx_3xx | External Cloudflare endpoint (allows Authelia redirects) |
| http_2xx_or_401 | Auth-gated services with no public health path (accepts 401 as “reachable”) |
| https_2xx | TLS-verified probes |
| tcp_connect | Raw port connectivity check |
Probed targets include: dashboard API, Home Assistant, n8n, Grafana, Authelia, and the MCP postgres proxy endpoint.
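The http_2xx_or_401 module is the interesting one: blackbox exporter treats any status outside `valid_status_codes` as a probe failure, so 401 is explicitly whitelisted. A sketch of what blackbox.yml likely contains (status lists and timeouts are assumptions):

```yaml
modules:
  http_2xx:
    prober: http
    timeout: 10s

  http_2xx_or_401:
    prober: http
    http:
      valid_status_codes: [200, 204, 401]   # 401 still proves the endpoint is reachable

  https_2xx_3xx:
    prober: http
    http:
      valid_status_codes: [200, 301, 302]   # Authelia redirects are acceptable
      fail_if_not_ssl: true

  tcp_connect:
    prober: tcp
```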
Alert Rules (17 rules across 3 groups)
Infrastructure (eval every 1 minute)
| Rule | Condition | Severity |
|---|---|---|
| Disk Space Critical | Root filesystem > 90% for 10 minutes | critical |
| Container Stopped | Time since last seen > 300s for 5 minutes | critical |
| High Memory Usage | Available memory < 10% for 5 minutes | critical |
| PostgreSQL Connections High | Active connections > 80 for 5 minutes | warning |
| Sustained High CPU | CPU > 85% for 15 minutes | warning |
| Loki Log Ingestion Stopped | Ingestion rate = 0 for 10 minutes | warning |
| Memory Pressure Critical | Available memory < 500MB for 5 minutes | critical |
| Swap Usage High | Swap > 80% for 5 minutes | warning |
| Container OOM Killed | OOM event increase in last 5 minutes | critical |
| Container Restart Loop | More than 3 restarts in 10 minutes | critical |
| PostgreSQL Down | pg_up < 1 for 1 minute | critical |
| Disk Space Warning | Root filesystem > 85% for 10 minutes | warning |
| Container Docker Unhealthy | Docker health state == 0 for 5 minutes | critical |
Container scope for stopped/OOM/restart/unhealthy rules: matches chris-os-* and homeassistant and wyoming-* containers. Excluded from stopped alert: Alloy (expected to run continuously) and Piston (on-demand code execution only).
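The container-scoped rules above can be sketched in Prometheus rule-file form for readability (the actual rules are Grafana-provisioned; metric names follow cAdvisor conventions, and the label matchers are illustrative):

```yaml
groups:
  - name: infrastructure
    interval: 1m
    rules:
      - alert: ContainerStopped
        # time since any metric was last seen from a monitored container
        expr: (time() - container_last_seen{name=~"chris-os-.*|homeassistant|wyoming-.*"}) > 300
        for: 5m
        labels:
          severity: critical

      - alert: ContainerOOMKilled
        # any OOM kill event in the last 5 minutes
        expr: increase(container_oom_events_total{name=~"chris-os-.*|homeassistant|wyoming-.*"}[5m]) > 0
        labels:
          severity: critical
```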
SSL and Endpoints (eval every 5 minutes)
| Rule | Condition | Severity |
|---|---|---|
| SSL Certificate Expiry Warning | Days remaining < 14 for 10 minutes | warning |
| SSL Certificate Expiry Critical | Days remaining < 3 for 5 minutes | critical |
| Endpoint Unreachable | Probe success < 1 for 5 minutes | critical |
Endpoint probes use noDataState: Alerting — if the blackbox exporter itself stops reporting, the rule fires rather than resolving silently.
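In Grafana's provisioned alert-rule format, that failure mode is controlled by two fields on the rule itself (abridged sketch; the uid and condition ref are illustrative):

```yaml
# Abridged Grafana-provisioned rule: noDataState: Alerting makes a
# missing blackbox series fire instead of silently resolving.
- uid: endpoint-unreachable     # illustrative uid
  title: Endpoint Unreachable
  condition: C
  for: 5m
  noDataState: Alerting         # no probe data -> alert
  execErrState: Alerting        # query errors also alert
  labels:
    severity: critical
```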
Backup Freshness (eval every 30 minutes)
| Rule | Condition | Severity |
|---|---|---|
| Offsite R2 Backup Stale | Backup age > 26 hours for 30 minutes | critical |
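The 26-hour threshold gives a nightly backup a 2-hour grace window before alerting. The underlying expression presumably compares a last-success timestamp against the current time; the metric name here is an assumption, not taken from the repo:

```yaml
# Sketch only — r2_backup_last_success_timestamp_seconds is a
# hypothetical metric name for the offsite backup job.
expr: (time() - r2_backup_last_success_timestamp_seconds) > (26 * 3600)
for: 30m
```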
Notification Routing
Two contact points:
| Contact Point | Channel | Delivery |
|---|---|---|
| pushover | Pushover mobile push | All devices, normal priority |
| discord-alerts | Discord webhook | #vital-apparatus channel (“Aperture Science Monitoring”) |
Routing policy:
- Default receiver: pushover
- critical severity: pushover (group wait 10s), then discord-alerts (dual-path, continue=true)
- warning severity: pushover only (group wait 1 minute)
- Repeat intervals: critical 4h (Pushover) / 8h (Discord), warning 4h
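As a Grafana notification-policy provisioning file, that routing looks roughly like this (abridged; matcher syntax varies slightly between Grafana versions):

```yaml
apiVersion: 1
policies:
  - orgId: 1
    receiver: pushover            # default receiver
    routes:
      - receiver: pushover
        object_matchers:
          - ['severity', '=', 'critical']
        group_wait: 10s
        repeat_interval: 4h
        continue: true            # fall through so the Discord route also fires
      - receiver: discord-alerts
        object_matchers:
          - ['severity', '=', 'critical']
        repeat_interval: 8h
      - receiver: pushover
        object_matchers:
          - ['severity', '=', 'warning']
        group_wait: 1m
        repeat_interval: 4h
```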
Dashboards (10 provisioned)
All dashboards are provisioned from grafana/dashboards/ and load automatically on container startup. They live in the chris-os folder in Grafana.
| Dashboard | Contents |
|---|---|
| system-overview | Production host CPU, memory, disk, network |
| docker-containers | Per-container cAdvisor metrics |
| postgresql | Query latency, connections, table sizes |
| n8n-workflows | n8n execution metrics |
| pipeline-health | Data pipeline health |
| voice-pipeline | Voice pipeline metrics |
| health-wellness | Health and wellness data |
| brewery | Inkbird temperature, TP-Link power monitoring (uses HA PostgreSQL datasource) |
| claude-code | Claude Code OTLP telemetry |
| glados-telemetry | GLaDOS framework telemetry |
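File-based dashboard provisioning uses a provider definition like the following (the container-side path is an assumption about the volume mount):

```yaml
apiVersion: 1
providers:
  - name: chris-os
    folder: chris-os              # Grafana folder the dashboards land in
    type: file
    disableDeletion: true
    updateIntervalSeconds: 30     # re-scan cadence for changed JSON files
    options:
      path: /etc/grafana/provisioning/dashboards   # assumed mount path
```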
Datasources (5 provisioned)
| Name | Type | Target |
|---|---|---|
| Prometheus | prometheus | Prometheus (default, 15s scrape) |
| Loki | loki | Loki log aggregation |
| PostgreSQL | grafana-postgresql-datasource | Primary database, read-only role |
| HomeAssistant | grafana-postgresql-datasource | Home Assistant database, HA role |
| Tempo | tempo | Tempo trace backend |
Grafana authentication uses Authelia OIDC directly (client_id: grafana, PKCE S256). Members of the admins group receive the Grafana Admin role. Grafana’s own database uses PostgreSQL (not SQLite).
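The OIDC and database settings map onto Grafana environment variables roughly as follows (a sketch; the Authelia URLs and the role-mapping expression are assumptions):

```yaml
environment:
  GF_AUTH_GENERIC_OAUTH_ENABLED: "true"
  GF_AUTH_GENERIC_OAUTH_CLIENT_ID: grafana
  GF_AUTH_GENERIC_OAUTH_USE_PKCE: "true"          # PKCE S256
  GF_AUTH_GENERIC_OAUTH_AUTH_URL: https://auth.example.com/api/oidc/authorization
  GF_AUTH_GENERIC_OAUTH_TOKEN_URL: https://auth.example.com/api/oidc/token
  GF_AUTH_GENERIC_OAUTH_API_URL: https://auth.example.com/api/oidc/userinfo
  # map the Authelia admins group to Grafana Admin (illustrative JMESPath)
  GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH: "contains(groups[*], 'admins') && 'GrafanaAdmin' || 'Viewer'"
  GF_AUTH_GENERIC_OAUTH_ALLOW_ASSIGN_GRAFANA_ADMIN: "true"
  GF_DATABASE_TYPE: postgres                      # Grafana's own DB on PostgreSQL
```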
Key Configuration Files
| File | Purpose |
|---|---|
| grafana/docker-compose.grafana.yml | Compose stack for the full monitoring group |
| grafana/alloy-config.alloy | Alloy log collection and routing config |
| grafana/prometheus.yml | Prometheus scrape config |
| grafana/alerts.yml | Alert rule definitions (provisioned) |
| grafana/blackbox.yml | Blackbox exporter probe module definitions |
| grafana/otel/collector-config.yml | OTLP collector fan-out config |
| grafana/dashboards/*.json | Dashboard definitions (provisioned) |
Notable Behaviors
Container “stopped” alerting: Uses last_over_time and a threshold pipeline rather than absence detection. The rule tracks the most recent timestamp any metric was seen from a container. noDataState: Alerting ensures alerts fire even if cAdvisor stops reporting entirely.
Loki WAL: A prior crash corrupted the Loki WAL. Recovery required wiping the WAL directory. min_ready_duration: 0s is set in Loki config — a value of 15s (the default) causes an infinite readiness loop on this version.
cAdvisor ARM overlay: cAdvisor disk scanning on ARM is slow (2-8 minutes). --docker_only=true flag limits cAdvisor to Docker containers only, avoiding host filesystem scanning.
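In the Compose file, that flag sits on the cAdvisor service command line; a sketch (the housekeeping interval is an assumption chosen to match the 30s scrape cadence):

```yaml
cadvisor:
  image: gcr.io/cadvisor/cadvisor
  command:
    - --docker_only=true            # skip slow host filesystem scanning on ARM
    - --housekeeping_interval=30s   # assumed; aligns stats collection with scrapes
```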
OTLP from Claude Code: Claude Code emits OTLP telemetry to the collector. Metrics flow to Prometheus, traces to Tempo, logs to Loki. The claude-code and glados-telemetry dashboards visualize this data.