Observability (Prometheus + structured logs)
gocdnext emits Prometheus metrics out of the box and writes
structured JSON logs via slog. Wiring them into your stack is
one scrape config (or one Helm flag if you run kube-prometheus-stack).
OpenTelemetry trace export is on the roadmap
but not yet wired into the binary.
Prometheus
What’s exposed
/metrics on the HTTP listener (default :8153). The Helm
chart’s service exposes this on the http port.
    # HELP gocdnext_jobs_scheduled_total Total jobs the scheduler dispatched.
    # TYPE gocdnext_jobs_scheduled_total counter
    gocdnext_jobs_scheduled_total{pipeline="<uuid>",project="<uuid>"} 142

    # HELP gocdnext_jobs_running Jobs currently in flight on this replica.
    # TYPE gocdnext_jobs_running gauge
    gocdnext_jobs_running 3

    # HELP gocdnext_job_duration_seconds Wall-clock job duration by status.
    # TYPE gocdnext_job_duration_seconds histogram
    gocdnext_job_duration_seconds_bucket{status="success",le="10"} 41
    …

    # HELP gocdnext_queue_depth Jobs/runs in non-terminal status.
    # TYPE gocdnext_queue_depth gauge
    gocdnext_queue_depth{stage_status="queued"} 0
    gocdnext_queue_depth{stage_status="pending"} 2

    # HELP gocdnext_agents_online Agents with an active session on this replica.
    # TYPE gocdnext_agents_online gauge
    gocdnext_agents_online 4

    # HELP gocdnext_log_archive_jobs_total Cold-archive job outcomes by result.
    # TYPE gocdnext_log_archive_jobs_total counter
    gocdnext_log_archive_jobs_total{result="success"} 18
    gocdnext_log_archive_jobs_total{result="skipped"} 3

    # HELP gocdnext_retention_dropped_log_partitions_total log_lines partitions dropped by retention sweeper.
    # TYPE gocdnext_retention_dropped_log_partitions_total counter
    gocdnext_retention_dropped_log_partitions_total 6

    # HELP gocdnext_webhook_deliveries_total Inbound webhook deliveries by provider and outcome.
    # TYPE gocdnext_webhook_deliveries_total counter
    gocdnext_webhook_deliveries_total{provider="github",outcome="accepted"} 412

Plus the standard Go runtime metrics (go_*, process_*).
Scrape config
    - job_name: gocdnext
      metrics_path: /metrics
      scrape_interval: 30s
      kubernetes_sd_configs:
        - role: service
          namespaces: { names: [gocdnext] }
      relabel_configs:
        - source_labels: [__meta_kubernetes_service_name]
          action: keep
          regex: gocdnext-server
        - source_labels: [__meta_kubernetes_service_port_name]
          action: keep
          regex: http

Or, if you run kube-prometheus-stack, flip the chart’s
server.serviceMonitor.enabled flag and Helm will render the
ServiceMonitor for you:
    server:
      serviceMonitor:
        enabled: true
        interval: 30s
        # Match the release: label your Prometheus instance selects on.
        labels:
          release: kube-prometheus-stack

Useful alerts
    - alert: GocdnextHighQueueDepth
      expr: sum(gocdnext_queue_depth{stage_status="queued"}) > 10
      for: 5m
      annotations:
        summary: "Run queue stuck above 10 for 5+ minutes"

    - alert: GocdnextJobFailureSpike
      expr: |
        sum(rate(gocdnext_job_duration_seconds_count{status="failed"}[10m]))
          / sum(rate(gocdnext_job_duration_seconds_count[10m])) > 0.3
      for: 15m
      annotations:
        summary: "30%+ of jobs failed in the last 15 minutes"

    - alert: GocdnextNoAgents
      expr: sum(gocdnext_agents_online) == 0
      for: 2m
      annotations:
        summary: "No agents online; runs are queueing"

    - alert: GocdnextLogArchiveFailing
      expr: |
        increase(gocdnext_log_archive_jobs_total{result="failed"}[1h]) > 5
      for: 15m
      annotations:
        summary: "Cold archive failed 5+ times in the last hour"

OpenTelemetry traces (roadmap)
OTel trace export is not yet wired in 0.2.0. The platform
already stamps trace_id / span_id slots in its slog handler so
that switching on tracing later doesn’t require touching the call
sites — but the OTLP exporter is not initialised. Track progress
on #otel-traces
or wait for the release notes to mention it.
If you need request-flow visibility today, the structured logs
below carry run_id, job_id, agent_id, pipeline — those
correlate the same flows traces would, just without the waterfall
view.
Logs
The platform emits structured JSON logs to stdout via slog:
    {
      "time": "2026-04-28T13:00:00Z",
      "level": "INFO",
      "msg": "agent job result",
      "run_id": "...",
      "job_id": "...",
      "job_name": "compile",
      "status": "success",
      "exit_code": 0,
      "trace_id": "...",
      "span_id": "..."
    }

trace_id and span_id are stamped automatically when OTel is
configured, so your log backend can correlate logs with traces once trace export ships.
Field consistency
Every relevant span/log carries:
- pipeline (string, slug)
- job (string, name)
- agent_id (UUID)
- run_id (UUID)
- trace_id / span_id (when OTel is on)
Search across logs/traces with these labels and you get the full picture of any run.
Dashboards
A starter Grafana dashboard ships in
docs/grafana/gocdnext.json.
It covers:
- Jobs in flight + agents online + queue stat tiles
- Dispatch rate by pipeline
- Completion rate by outcome (stacked)
- p50 / p95 / p99 job duration
- Webhook deliveries (provider × outcome)
- Log archive outcomes
- Daily partition drops + server RSS
Import via Dashboards → New → Import and paste the JSON; pick your Prometheus datasource on the variables panel and you’re done.
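
If you want the same aggregates outside the dashboard, for example in ad-hoc queries or your own panels, the stat tiles and duration percentiles listed above can be approximated with recording rules along these lines. This is only a sketch: the group name and the recorded series names are hypothetical, the 5m rate window is an assumption, and the shipped dashboard may use different queries.

```yaml
groups:
  - name: gocdnext-aggregates            # hypothetical group name
    rules:
      # Cluster-wide totals: the gauges are per-replica, so sum them.
      - record: gocdnext:jobs_running:sum
        expr: sum(gocdnext_jobs_running)
      - record: gocdnext:agents_online:sum
        expr: sum(gocdnext_agents_online)
      # p95 job duration across all statuses, over an assumed 5m window.
      - record: gocdnext:job_duration_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(gocdnext_job_duration_seconds_bucket[5m]))
          )
```

Swapping 0.95 for 0.5 or 0.99 gives the other two percentiles.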
Common pitfalls
- Cardinality blowup: don’t add commit_sha as a label on metrics. Every commit becomes a unique time series and Prometheus storage explodes. The platform’s built-in series keep cardinality bounded (no commit_sha, no per-pipeline labels on histograms), so be careful when adding your own.
- Per-replica gauges: gocdnext_jobs_running and gocdnext_agents_online are process-local. Use sum() across replicas for the cluster total, not max().
- /readyz vs /healthz: wire /readyz to the readiness probe (it pings the DB and returns 503 when Postgres is down) and /healthz to the liveness probe (always 200, just proves the process is alive). Wiring them backwards traps a starting replica in CrashLoopBackOff, because a liveness probe pointed at /readyz restarts the pod whenever the endpoint returns 503.
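
The probe wiring from the /readyz vs /healthz pitfall could look roughly like this in a container spec. The paths and the default port 8153 come from this page; the timing and threshold values are illustrative assumptions, not chart defaults.

```yaml
# Fragment of a Deployment container spec (timings are assumed values).
readinessProbe:
  httpGet:
    path: /readyz        # returns 503 while Postgres is unreachable
    port: 8153           # default HTTP listener
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /healthz       # always 200 while the process is alive
    port: 8153
  periodSeconds: 10
  failureThreshold: 3
```

With this arrangement a Postgres outage takes the replica out of the Service endpoints without restarting it, and only a hung or dead process triggers a restart.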