# Observability
The backend writes structured JSON logs and exposes Prometheus metrics. The frontend ships page-view + error events to a telemetry endpoint that fans out to the same log + metric stream.
Everything is metadata only. No request payloads, no PII unless the user puts it in a route or an error message themselves.
## Logs

Rotated JSONL on the `painscaler_data` volume:

```
/data/logs/painscaler.log
/data/logs/painscaler-2026-04-15T10-22-31.000.log.gz
...
```

Errors mirror to stderr regardless of log level, so `docker logs painscaler-api` always surfaces them.
### Tunable env vars

| var | default | meaning |
|---|---|---|
| `LOG_DIR` | `/data/logs` | log directory |
| `LOG_LEVEL` | `info` | `debug` / `info` / `warn` / `error` |
| `LOG_MAX_SIZE_MB` | `50` | rotate when the current file exceeds this |
| `LOG_MAX_BACKUPS` | `10` | keep this many rotated files |
| `LOG_MAX_AGE_DAYS` | `30` | delete rotated files older than this |
| `LOG_COMPRESS` | `true` | gzip rotated files |
### Per-request log shape

Every HTTP request produces one record after completion:

```json
{
  "time": "2026-04-16T20:11:42.331Z",
  "level": "INFO",
  "msg": "http request",
  "service": "painscaler",
  "version": "0.5.0",
  "commit": "4a57559",
  "request_id": "5f9e...",
  "route": "/api/v1/segment/:segmentID/policies",
  "method": "GET",
  "status": 200,
  "duration_ms": 12,
  "bytes_out": 4218,
  "client_ip": "10.0.1.42",
  "user_agent": "Mozilla/5.0 ...",
  "user": "alice"
}
```

`route` is `c.FullPath()` (the Gin route template), so path params do not explode the cardinality on log aggregators or Prometheus labels.
### Common queries

```sh
# All errors
docker compose cp painscaler-api:/data/logs/painscaler.log - | \
  jq -c 'select(.level=="ERROR")'

# Top routes by request count
docker compose cp painscaler-api:/data/logs/painscaler.log - | \
  jq -r 'select(.msg=="http request") | .route' | \
  sort | uniq -c | sort -rn | head

# Slow requests (p95-ish, naive)
docker compose cp painscaler-api:/data/logs/painscaler.log - | \
  jq -r 'select(.msg=="http request" and .duration_ms > 500) | [.route, .duration_ms] | @tsv'

# Browser-side errors only
docker compose cp painscaler-api:/data/logs/painscaler.log - | \
  jq -c 'select(.source=="frontend" and .type=="error")'
```

(The distroless image has no `jq`. Copy the file out and pipe locally.)
## Metrics

Served at `http://painscaler-api:8080/metrics`, scraped from inside the compose network only; Caddy does not proxy `/metrics`.
| Metric | Type | Labels |
|---|---|---|
| `painscaler_http_requests_total` | counter | `route`, `method`, `status` |
| `painscaler_http_request_duration_seconds` | histogram | `route`, `method` |
| `painscaler_frontend_events_total` | counter | `type` (`page_view`, `error`) |
| `painscaler_build_info` | gauge (always 1) | `version`, `commit`, `date` |
Routes use the Gin template (`/api/v1/segment/:segmentID/policies`), so cardinality is bounded by the route count, not by your tenant's segment count.
### Adding a Prometheus container

Drop this into `deploy/docker-compose.yml`:

```yaml
prometheus:
  image: prom/prometheus
  expose: ["9090"]
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
  networks: [painscaler]
```

with `deploy/prometheus.yml`:

```yaml
scrape_configs:
  - job_name: painscaler
    static_configs:
      - targets: ["painscaler-api:8080"]
```

Then expose Prometheus through Caddy if you want the UI from outside.
## Frontend telemetry

The browser buffers events and POSTs them to `/api/v1/telemetry`. Two event types right now:

- `page_view` - fired on every route change in the SPA
- `error` - fired by the React `ErrorBoundary` when a render throws
Buffering rules:

- Flushed every 30 seconds via `fetch`.
- Flushed on `visibilitychange` (tab hidden) via `navigator.sendBeacon`.
- Flushed on `pagehide` via `sendBeacon`.
- Flushed immediately if the buffer hits 100 events.
Failures are dropped silently. We do not loop on telemetry errors.
### Server side

`POST /api/v1/telemetry` walks the batch, emits one slog line per event with `source=frontend`, and increments `painscaler_frontend_events_total{type=...}`.

The `Remote-User` header (when present and trusted) is attached to each log line so you can attribute browser errors to specific users.

Batch size is capped at 100 events. Larger batches are truncated.
### Correlating browser to backend

Both sides log the same `request_id` for any backend call the browser made: the server sets `X-Request-Id` on the response, but the browser does not yet propagate it back into telemetry events (that is on the roadmap). Until that lands, correlation is by route + time.
## Why JSONL plus Prometheus, not OpenTelemetry

Two reasons. First, the on-disk JSONL is the system of record; it survives Prometheus going down. Second, OTel adds operational complexity that this project does not need yet. If the use case appears, the metrics package is a 50-line replacement.