
Observability

The backend writes structured JSON logs and exposes Prometheus metrics. The frontend ships page-view + error events to a telemetry endpoint that fans out to the same log + metric stream.

Everything is metadata only. No request payloads, no PII unless the user puts it in a route or an error message themselves.

Logs

Rotated JSONL on the painscaler_data volume:

/data/logs/painscaler.log
/data/logs/painscaler-2026-04-15T10-22-31.000.log.gz
...

Errors mirror to stderr regardless of log level, so docker logs painscaler-api always surfaces them.

Tunable env vars

var               default      meaning
LOG_DIR           /data/logs   log directory
LOG_LEVEL         info         debug / info / warn / error
LOG_MAX_SIZE_MB   50           rotate when current file exceeds this
LOG_MAX_BACKUPS   10           keep this many rotated files
LOG_MAX_AGE_DAYS  30           delete rotated files older than this
LOG_COMPRESS      true         gzip rotated files

Per-request log shape

Every HTTP request produces one record after completion:

{
  "time": "2026-04-16T20:11:42.331Z",
  "level": "INFO",
  "msg": "http request",
  "service": "painscaler",
  "version": "0.5.0",
  "commit": "4a57559",
  "request_id": "5f9e...",
  "route": "/api/v1/segment/:segmentID/policies",
  "method": "GET",
  "status": 200,
  "duration_ms": 12,
  "bytes_out": 4218,
  "client_ip": "10.0.1.42",
  "user_agent": "Mozilla/5.0 ...",
  "user": "alice"
}

route is c.FullPath() (the Gin route template), so path params do not explode cardinality in log aggregators or Prometheus labels.

Common queries

# The distroless image has no shell or jq, so copy the log file out once
# and run the queries locally.
docker compose cp painscaler-api:/data/logs/painscaler.log ./painscaler.log

# All errors
jq -c 'select(.level=="ERROR")' painscaler.log

# Top routes by request count
jq -r 'select(.msg=="http request") | .route' painscaler.log | \
  sort | uniq -c | sort -rn | head

# Slow requests (naive threshold, not a true p95)
jq -r 'select(.msg=="http request" and .duration_ms > 500) | [.route, .duration_ms] | @tsv' painscaler.log

# Browser-side errors only
jq -c 'select(.source=="frontend" and .type=="error")' painscaler.log

Metrics

Metrics are exposed at http://painscaler-api:8080/metrics and reachable only from inside the compose network - Caddy does not proxy /metrics.

Metric                                    Type        Labels
painscaler_http_requests_total            counter     route, method, status
painscaler_http_request_duration_seconds  histogram   route, method
painscaler_frontend_events_total          counter     type (page_view, error)
painscaler_build_info                     gauge (=1)  version, commit, date

Routes use the Gin template (/api/v1/segment/:segmentID/policies), so cardinality is bounded by the route count, not by your tenant’s segment count.
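Because duration is a histogram with a bounded route label, the usual quantile queries work; for example, an approximate p95 per route over the last five minutes (standard PromQL against the histogram's _bucket series):

histogram_quantile(0.95,
  sum by (route, le) (rate(painscaler_http_request_duration_seconds_bucket[5m])))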

Adding a Prometheus container

Drop this into deploy/docker-compose.yml:

prometheus:
  image: prom/prometheus
  expose: ["9090"]
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
  networks: [painscaler]

with deploy/prometheus.yml:

scrape_configs:
  - job_name: painscaler
    static_configs:
      - targets: ["painscaler-api:8080"]

Then expose Prometheus through Caddy if you want the UI from outside.
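One hedged way to do that: a dedicated Caddyfile site block on its own subdomain (the hostname below is an assumption), which sidesteps path-prefix issues in the Prometheus UI:

prometheus.example.com {
    reverse_proxy prometheus:9090
}

If you must serve it under a sub-path instead, Prometheus needs to be told its prefix via its --web.external-url flag.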

Frontend telemetry

The browser buffers events and POSTs them to /api/v1/telemetry. Two event types right now:

  • page_view - fired on every route change in the SPA
  • error - fired by the React ErrorBoundary when a render throws

Buffering rules:

  • Flushed every 30 seconds via fetch.
  • Flushed on visibilitychange (tab hidden) via navigator.sendBeacon.
  • Flushed on pagehide via sendBeacon.
  • Flushed immediately if the buffer hits 100 events.

Failures are dropped silently. We do not loop on telemetry errors.

Server side

POST /api/v1/telemetry walks the batch, emits one slog line per event with source=frontend, and increments painscaler_frontend_events_total{type=...}. The Remote-User header (when present and trusted) is attached to each log line so you can attribute browser errors to specific users.

Batch size is capped at 100 events. Larger batches are truncated.

Correlating browser to backend

The server sets X-Request-Id on every response, so each backend call the browser makes has a request_id in the backend logs. The browser does not yet propagate that ID back into its telemetry events (that is on the roadmap), so for now correlation is by route + time.

Why JSONL plus Prometheus, not OpenTelemetry

Two reasons. First, the on-disk JSONL is the system of record - it survives Prometheus going down. Second, OTel adds operational complexity that this project does not need yet. If the need appears, the metrics package is only about 50 lines and straightforward to replace.