
Observability

The backend writes structured JSON logs and exposes Prometheus metrics. The frontend ships page-view + error events to a telemetry endpoint that fans out to the same log + metric stream.

Everything is metadata only. No request payloads, no PII unless the user puts it in a route or an error message themselves.

Logs

Rotated JSONL on the painscaler_data volume:

/data/logs/painscaler.log
/data/logs/painscaler-2026-04-15T10-22-31.000.log.gz
...

Errors mirror to stderr regardless of log level, so docker logs painscaler-api always surfaces them.

Tunable env vars

var               default      meaning
LOG_DIR           /data/logs   log directory
LOG_LEVEL         info         debug / info / warn / error
LOG_MAX_SIZE_MB   50           rotate when current file exceeds this
LOG_MAX_BACKUPS   10           keep this many rotated files
LOG_MAX_AGE_DAYS  30           delete rotated files older than this
LOG_COMPRESS      true         gzip rotated files

Per-request log shape

Every HTTP request produces one record after completion:

{
  "time": "2026-04-16T20:11:42.331Z",
  "level": "INFO",
  "msg": "http request",
  "service": "painscaler",
  "version": "0.5.0",
  "commit": "4a57559",
  "request_id": "5f9e...",
  "route": "/api/v1/segment/:segmentID/policies",
  "method": "GET",
  "status": 200,
  "duration_ms": 12,
  "bytes_out": 4218,
  "client_ip": "10.0.1.42",
  "user_agent": "Mozilla/5.0 ...",
  "user": "alice"
}

route is c.FullPath() (the Gin route template), so path params do not explode cardinality in log aggregators or Prometheus labels.

Common queries

# The distroless image has no shell or jq, so copy the log file out once
# and run the queries locally.
docker compose cp painscaler-api:/data/logs/painscaler.log ./painscaler.log

# All errors
jq -c 'select(.level=="ERROR")' painscaler.log

# Top routes by request count
jq -r 'select(.msg=="http request") | .route' painscaler.log | \
  sort | uniq -c | sort -rn | head

# Slow requests (naive threshold, not a true p95)
jq -r 'select(.msg=="http request" and .duration_ms > 500) | [.route, .duration_ms] | @tsv' painscaler.log

# Browser-side errors only
jq -c 'select(.source=="frontend" and .type=="error")' painscaler.log

Metrics

Metrics are exposed at http://painscaler-api:8080/metrics and reachable only from inside the compose network - Caddy does not proxy /metrics.

Metric                                    Type        Labels
painscaler_http_requests_total            counter     route, method, status
painscaler_http_request_duration_seconds  histogram   route, method
painscaler_frontend_events_total          counter     type (page_view, error)
painscaler_build_info                     gauge (=1)  version, commit, date

Routes use the Gin template (/api/v1/segment/:segmentID/policies), so cardinality is bounded by the route count, not by your tenant’s segment count.
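Because duration is a histogram with a bounded route label, the usual quantile queries work; for example, an approximate p95 per route over the last five minutes (standard PromQL against the histogram's _bucket series):

histogram_quantile(0.95,
  sum by (route, le) (rate(painscaler_http_request_duration_seconds_bucket[5m])))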

Adding a Prometheus container

Drop this into deploy/docker-compose.yml:

prometheus:
  image: prom/prometheus
  expose: ["9090"]
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
  networks: [painscaler]

with deploy/prometheus.yml:

scrape_configs:
  - job_name: painscaler
    static_configs:
      - targets: ["painscaler-api:8080"]

Then expose Prometheus through Caddy if you want the UI from outside.
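One hedged way to do that: a dedicated Caddyfile site block on its own subdomain (the hostname below is an assumption), which sidesteps path-prefix issues in the Prometheus UI:

prometheus.example.com {
    reverse_proxy prometheus:9090
}

If you must serve it under a sub-path instead, Prometheus needs to be told its prefix via its --web.external-url flag.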

Frontend telemetry

The browser buffers events and POSTs them to /api/v1/telemetry. Two event types right now:

  • page_view - fired on every route change in the SPA
  • error - fired by the React ErrorBoundary when a render throws

Buffering rules:

  • Flushed every 30 seconds via fetch.
  • Flushed on visibilitychange (tab hidden) via navigator.sendBeacon.
  • Flushed on pagehide via sendBeacon.
  • Flushed immediately if the buffer hits 100 events.

Failures are dropped silently. We do not loop on telemetry errors.

Server side

POST /api/v1/telemetry walks the batch, emits one slog line per event with source=frontend, and increments painscaler_frontend_events_total{type=...}. The Remote-User header (when present and trusted) is attached to each log line so you can attribute browser errors to specific users.

Batch size is capped at 100 events. Larger batches are truncated.

Correlating browser to backend

The server sets X-Request-Id on every response, so each backend call the browser makes has a request_id in the backend logs. The browser does not yet propagate that ID back into its telemetry events (that is on the roadmap), so for now correlation is by route + time.

Why JSONL plus Prometheus, not OpenTelemetry

Two reasons. First, the on-disk JSONL is the system of record - it survives Prometheus going down. Second, OTel adds operational complexity that this project does not need yet. If the need appears, the metrics package is only about 50 lines and straightforward to replace.