Trunk-based release flow
A complete shape (four pipelines, four humans involved end-to-end) that delivers safety without the parallel-branch overhead of git-flow. Teams moving off git-flow can adopt these files mostly as-is without giving up safety.
The trick: merging to main is NOT deploying to production.
The model below has four mandatory stops between main and prod.
Every stop has a clear gate. Nothing is automatic where it
shouldn’t be.
Why this exists
git-flow’s develop + release/* + hotfix/* branches give the
illusion of safety: code parks somewhere “not main” until a
release manager decides it’s ready. The cost is hidden — every
day of “not on main” is a day of merge conflicts, drift, and
lost context. Trunk-based collapses those into one branch (main)
and moves the safety into the pipeline shape instead.
This page is the cookbook and the mental-model document. The four pipelines are the technical answer; the section Selling this to skeptics is what you bring to the team meeting.
The mental model
PR opened  → pr.yaml       (lint + security + tests + build + AI review)
             ↓ humans approve the PR
push main  → main.yaml     (CI + main-<sha> image + release-candidate notes)
             ↓ release manager clicks "Run latest" on release.yaml with TAG=vX.Y.Z
manual     → release.yaml  (validate + push tag + build + scan + sign + deploy stage + smoke)
             ↓ release-approvers click Approve on prod.yaml with TAG=vX.Y.Z (quorum 2)
manual     → prod.yaml     (preflight + deploy prod + smoke + auto-rollback)

Four pipelines, one production system. Every step records a run in the gocdnext audit log; every approval records who clicked.
Why this recipe ships one pipeline for “cut tag” + “deploy
stage” (and what changed in v0.10.0): the release flow is one
manually-triggered pipeline that runs end to end — the release
manager passes TAG=v1.2.3 at trigger time; the pipeline cuts
the tag, builds the image with that tag, scans + signs +
publishes, and finally deploys to stage. ~15 minutes wall-clock
on a typical project; one human decision (“cut release”); one
button press. Since v0.10.0 gocdnext also supports
event: [tag] + CI_TAG_NAME, so this CAN be split into a slim
release.yaml (cuts the tag) + a tag.yaml that auto-fires on
the resulting tag push to do the build/scan/sign/deploy — see
Variant: split release + tag.yaml
for that shape. Pick whichever fits your team; both are valid.
Production-readiness prerequisites
The four-pipeline shape below moves safety from branches to pipelines, but it has five hard prerequisites that aren’t optional if you’re going to claim “this is production-ready.” Without them, the shape is provenance-for-audit, not enforcement — a cosign signature nobody verifies, a digest nobody pins, a rollback that breaks on the first DB migration, a deploy that takes 100% blast radius before the first smoke probe fires, an image whose contents nobody recorded.
Each prereq is one knob the operator has to set up ONCE on the cluster (Kyverno install + policies, Helm digest values, migration shape review, Argo Rollouts install, Syft + cosign attest in the release pipeline). They’re cluster-level or release-pipeline-level, not per-application, so the cost amortises across every project. Walking past them gives you the provenance for an audit conversation; meeting them gives you the enforcement that an audit conversation is asking about.
1. Cluster verifies signatures at admission
The release pipeline signs the image with cosign. Nothing the pipeline does makes the signature load-bearing unless the cluster checks it at admission time. Without verification, signing is a sticker on a glass jar — the lid still opens.
Install Kyverno (or sigstore policy-controller, or Connaisseur — pick one; we use Kyverno here because the ClusterPolicy DSL is the most operator-readable) and apply a ClusterPolicy that fails-closed for any image referenced in the production namespace that lacks a valid cosign signature anchored to your team’s identity:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cosign-signature
spec:
  validationFailureAction: Enforce
  webhookConfiguration:
    failurePolicy: Fail   # fail-closed: missing webhook = deny
  rules:
    - name: verify-image-signature
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [prod, prod-system]
      verifyImages:
        - imageReferences:
            - "ghcr.io/my-org/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <your cosign public key>
                      -----END PUBLIC KEY-----
          mutateDigest: true   # rewrite tag→digest so the verified bytes are what actually runs
          required: true

Two things this gives you:
- mutateDigest: true rewrites every image reference to the image: …@sha256:abc form before the container starts. Even if the deploying pipeline used a tag, the cluster runs the digest. That closes the mutable-tag-window gap (next prereq).
- required: true + failurePolicy: Fail means a fresh pod with NO signature fails admission. Critical for hotfixes that bypass the normal release flow: the cluster still refuses them.
Verify the policy is live before the first prod deploy goes through it:
kubectl run --rm -it --image=ghcr.io/my-org/scratch:no-sig --restart=Never test-deny -n prod
# Error from server: admission webhook "...kyverno..." denied the request: image not signed

If that command succeeds, the policy isn’t enforcing:
re-check validationFailureAction: Enforce and the namespace
selector. A trunk-based release flow without this check is a
trunk-based release flow that pretends to verify provenance.
2. Deploy by digest, not tag
OCI tags are mutable. Even with signature verification, if the
deploy references :v1.2.3 and someone repushes that tag
between stage smoke and prod deploy (another release pipeline,
a manual docker push, a registry mirror replay), the
production cluster runs a different binary than the one stage
tested + signed. The cosign signature is anchored to the
digest, not the tag — verification of the new digest may
even succeed (different bytes, but still a signed image
from your registry).
Fix: resolve the tag to a digest the moment the image lands in
the registry, and propagate the digest all the way to prod. The
Helm chart’s image.digest value (or whatever your chart names
it) is what guarantees “what was tested is what runs.”
In release.yaml’s build job, after the push:
build-and-push:
  stage: build
  uses: ghcr.io/klinux/gocdnext-plugin-buildx@v1
  outputs:
    digest: IMAGE_DIGEST   # `crane digest` against the just-pushed tag
  with:
    image: ghcr.io/my-org/app:${TAG}
    push: "true"
    # The plugin runs `crane digest <image>:<tag>` after push and
    # captures the sha256:… line into IMAGE_DIGEST. Subsequent jobs
    # consume it via outputs.

Then prod.yaml’s deploy passes image.digest (not image.tag):
deploy-prod:
  stage: deploy
  needs: [preflight]
  secrets: [PROD_KUBECONFIG]
  uses: ghcr.io/klinux/gocdnext-plugin-helm@v1
  with:
    kubeconfig: ${{ PROD_KUBECONFIG }}
    command: |
      upgrade --install my-app charts/my-app
      --namespace prod
      --set image.repository=ghcr.io/my-org/app
      --set image.digest=${{ needs.preflight.outputs.digest }}
      --wait --timeout 10m

And the preflight resolves the digest from the upstream release
run’s outputs (instead of just verifying the tag exists). The
check-pipeline-run plugin returns outputs.IMAGE_DIGEST from
the matched release run; if the digest doesn’t match what was
just queried with crane digest …:${TAG}, that’s a mutable-tag
event between release and prod — fail loud.
preflight:
  stage: preflight
  needs: [approve-prod]
  secrets: [GOCDNEXT_API_TOKEN]
  uses: ghcr.io/klinux/gocdnext-plugin-check-pipeline-run@v1
  outputs:
    digest: IMAGE_DIGEST
  with:
    api-url: https://gocdnext.example.com
    api-token: ${{ GOCDNEXT_API_TOKEN }}
    project: my-org
    pipeline: release
    tag: ${TAG}
    expected-status: success
    require-digest-match: true   # plugin re-runs `crane digest` and aborts on mismatch
    max-age: 7d

After this prereq, “what was tested is what runs” is enforceable
end-to-end. The digest flows through release.yaml outputs →
preflight check → Helm value → Kyverno’s mutateDigest →
container.
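The mismatch check described above can be sketched in plain shell. This is a minimal illustration, not the check-pipeline-run plugin's implementation: the function name `digests_match` and its exit codes are hypothetical, and it assumes the recorded digest (from the release run's outputs) and the current digest (from `crane digest`) are both available as strings.

```shell
# digests_match RECORDED CURRENT
# Hypothetical helper: compares the digest recorded by the release run
# with the digest the tag resolves to right now.
# Returns 0 on match, 1 on mismatch, 2 if either value is not a sha256 digest.
digests_match() {
  recorded=$1
  current=$2
  for d in "$recorded" "$current"; do
    hex=${d#sha256:}
    # Reject anything that is not "sha256:" followed by 64 characters.
    if [ "$hex" = "$d" ] || [ "${#hex}" -ne 64 ]; then
      echo "not a sha256 digest: $d" >&2
      return 2
    fi
    # Reject any character outside lowercase hex.
    case "$hex" in
      *[!0-9a-f]*) echo "not a sha256 digest: $d" >&2; return 2 ;;
    esac
  done
  if [ "$recorded" != "$current" ]; then
    echo "tag moved: recorded $recorded, registry now serves $current" >&2
    return 1
  fi
  return 0
}
```

A preflight script could call it as `digests_match "$IMAGE_DIGEST" "$(crane digest "ghcr.io/my-org/app:${TAG}")" || exit 1`. Validating the format first keeps a stray tag string or an empty variable from being reported as a plain mismatch.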
3. Migrations are forward-only and backward-compatible (expand/contract)
“Rollback is one click” only holds for stateless apps. helm rollback reverts the Helm release; it does NOT revert a database
migration. Trunk-based with frequent deploys requires
expand/contract migrations as a HARD prerequisite — without
them, every prod deploy with a DDL change is a one-way door, and
“rollback” is a lie.
The full operational guide — the prohibition table, lock hygiene, pipeline patterns with the flyway/liquibase/goose plugins — lives in Database migrations in pipelines.
The contract:
- Forward-only. No goose down, no flyway undo. Rollback fixes forward via a corrective migration (fix(db): roll back column rename).
- Backward-compatible. A migration deploys BEFORE the code that depends on it. Old code (the version we’d rollback to) must still work against the new schema.
- Expand/contract for renames + drops. Three deploys:
  - Expand: add new column, dual-write from app code, leave old column in place. Both schemas work for both code versions.
  - Backfill + cut over: stop reading the old column, drop dual-write. Old code still works (reads from the column it knows).
  - Contract: drop the old column. Only safe AFTER you’re confident you won’t roll back across the cut-over.
For destructive changes (column drop, table drop, NOT NULL without default), the expand/contract dance MUST cross at least ONE release boundary — typically a week — so the rollback target is always a deploy with the old column still present.
Document this in your repo so it’s NOT tribal:
# Migration contract
- Forward-only. No `goose down`. Use corrective migrations.
- Every migration must work against the PREVIOUS deploy's code.
- Column drops + renames use expand/contract over ≥ 2 releases.
- Destructive DDL inside a deploy = manual gate. Tag the PR `migration:destructive`; release approvers verify the expand/contract sequence before approving.

The fourth pipeline (prod.yaml) doesn’t enforce this: there’s
no automated way to check “is this migration backward-compatible?”.
It enforces the gate: a PR labeled
migration:destructive triggers a higher approval quorum via
quorum_by_label (destructive: 3 instead of the default 2).
That makes the operator think twice; the contract above does
the rest.
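Backward compatibility cannot be checked mechanically, but a PR job can at least flag the obviously destructive statements so the `migration:destructive` label is not forgotten. The sketch below is a heuristic under stated assumptions: the function name `has_destructive_ddl` is hypothetical, migrations are assumed to be plain `.sql` files, and the pattern only covers drops, renames, and `SET NOT NULL`. It will miss other destructive shapes (for example `ADD COLUMN ... NOT NULL` without a default) and can false-positive on SQL comments.

```shell
# has_destructive_ddl FILE...
# Hypothetical helper: prints every line that looks like destructive DDL.
# Returns 0 if at least one such line is found, 1 if none.
has_destructive_ddl() {
  grep -Ein \
    'drop[[:space:]]+(table|column)|rename[[:space:]]+(column|to)|set[[:space:]]+not[[:space:]]+null' \
    "$@"
}
```

A PR step could run `has_destructive_ddl migrations/*.sql` and fail when it matches on a PR that lacks the label. The human review of the expand/contract sequence stays the actual control; this only makes skipping it harder to do by accident.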
4. Progressive delivery on prod (canary, not big-bang)
helm upgrade is a big-bang swap: the new ReplicaSet scales up,
the old scales down, and 100% of prod traffic is on the new
image within minutes. If the new image has a bug that only shows
up under real prod load (a thundering-herd race, a memory leak
that takes 90 seconds to manifest, a hot-path SQL query on a
prod-sized table the stage cluster doesn’t have), the smoke
probe at the end of the deploy will catch it — after the bug has
already burned every active user’s session.
Smoke catches dead apps. Progressive delivery catches degraded but alive apps before everyone sees them. Trunk-based commits to “many small deploys”, and the math works only if each deploy is cheap to fail.
Install Argo Rollouts (or Flagger if you already run a
service mesh) and replace the prod Deployment with a Rollout
resource. A canary strategy that ramps 10% → 25% → 50% → 100%
with a pause + analysis after each partial step gives you three
opportunities to halt and auto-rollback per deploy, each on a
slice of traffic small enough that the user-visible blast
radius is bounded.
# Operator runs once per cluster:
#   kubectl create namespace argo-rollouts
#   helm install argo-rollouts argo/argo-rollouts -n argo-rollouts
# Then enforces the resource shape via a ClusterPolicy:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: prod-must-use-rollout
spec:
  validationFailureAction: Enforce
  rules:
    - name: forbid-deployment-in-prod
      match:
        any:
          - resources:
              kinds: [Deployment]
              namespaces: [prod]
      validate:
        message: "prod workloads must be Rollouts (argo-rollouts), not Deployments"
        deny: {}

The Helm chart for the application then renders a Rollout
instead of a Deployment when --set rollout.enabled=true:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: {{ .Values.replicas }}
  strategy:
    canary:
      # Each step pauses for analysis before advancing. If any
      # AnalysisRun comes back Failed, Argo Rollouts halts and
      # auto-aborts → previous ReplicaSet stays at 100%.
      steps:
        - setWeight: 10
        - pause: { duration: 30s }
        - analysis:
            templates: [{ templateName: error-rate }]
        - setWeight: 25
        - pause: { duration: 60s }
        - analysis:
            templates: [{ templateName: error-rate }]
        - setWeight: 50
        - pause: { duration: 60s }
        - analysis:
            templates: [{ templateName: error-rate }]
        - setWeight: 100
  selector:
    matchLabels: { app: my-app }
  template:
    metadata:
      labels: { app: my-app }   # must match spec.selector
    spec:
      containers:
        - name: my-app
          image: "ghcr.io/my-org/my-app@{{ .Values.image.digest }}"

The AnalysisTemplate evaluates a Prometheus query (or any other metrics provider Argo supports). The standard shape is “error rate over the last 60s must be < 1%”:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  metrics:
    - name: error-rate
      interval: 30s
      successCondition: result[0] < 0.01   # < 1% error rate
      failureLimit: 0                      # no failed sample tolerated: one failure → abort
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{namespace="prod",app="my-app",status=~"5.."}[1m]))
            /
            sum(rate(http_requests_total{namespace="prod",app="my-app"}[1m]))

prod.yaml’s deploy then becomes a helm upgrade that creates
the Rollout, followed by a watch step that polls the Rollout
status until it reaches Healthy (full ramp completed) or
Degraded (analysis aborted, traffic on the old ReplicaSet):
deploy-prod:
  stage: deploy
  needs: [preflight]
  secrets: [PROD_KUBECONFIG]
  uses: ghcr.io/klinux/gocdnext-plugin-helm@v1
  with:
    kubeconfig: ${{ PROD_KUBECONFIG }}
    # --set rollout.enabled=true => chart renders Rollout, not Deployment.
    # helm returns AFTER the Rollout object is applied; it does NOT
    # wait for the canary to complete (that's the watch job below).
    command: |
      upgrade --install my-app charts/my-app
      --namespace prod
      --set image.digest=${{ needs.preflight.outputs.digest }}
      --set rollout.enabled=true
      --wait --timeout 5m
watch-canary:
  stage: deploy
  needs: [deploy-prod]
  secrets: [PROD_KUBECONFIG]
  image: alpine:3.20
  script:
    - apk add --no-cache curl
    - |
      curl -fsSL -o /usr/local/bin/kubectl-argo-rollouts \
        https://github.com/argoproj/argo-rollouts/releases/download/v1.7.2/kubectl-argo-rollouts-linux-amd64
      chmod +x /usr/local/bin/kubectl-argo-rollouts
      # `status` blocks until the Rollout reaches a terminal state or
      # the timeout expires. Exit code is 0 if Healthy, non-zero if
      # Degraded or timed out. The plugin binary is called directly
      # because this image ships no kubectl; it assumes the job's
      # kubeconfig is already wired from PROD_KUBECONFIG.
      # Combined with `prod.yaml`'s auto-rollback job (next stage),
      # this turns a canary abort into a structured pipeline failure.
      kubectl-argo-rollouts status my-app -n prod --timeout 600s

Two halts the operator never had with helm upgrade alone:
- Pre-prod halts stay in stage (stage smoke catches dead apps).
- In-prod halts stay at ≤ 10% traffic share for ≥ 30s (Rollout analysis catches degraded apps before they ramp).
Walking past this prereq is honest if your business genuinely accepts 100% blast radius per deploy (low-traffic internal tools, SLOs that don’t budget for canary). For anything user-facing with non-trivial traffic, installing Argo Rollouts is a 10-minute operator action that pays for itself the first time a flaky new image only hits 10% of prod instead of 100%.
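The AnalysisTemplate's success condition (error rate strictly below 1%) is simple arithmetic, and it can help to dry-run it against numbers from a past incident before trusting it in prod. The sketch below is an illustration only: `error_rate_ok` is a hypothetical helper, not part of Argo Rollouts, and treating zero traffic as a failure is an assumption made here (with no requests there is no evidence the canary is healthy).

```shell
# error_rate_ok ERRORS TOTAL [THRESHOLD]
# Hypothetical helper mirroring "error rate < threshold".
# Returns 0 when ERRORS/TOTAL is strictly below THRESHOLD (default 0.01).
# Returns 1 otherwise, including when TOTAL is zero (assumption: no traffic = no pass).
error_rate_ok() {
  awk -v e="$1" -v t="$2" -v lim="${3:-0.01}" 'BEGIN {
    if (t + 0 <= 0) exit 1          # no requests observed
    if ((e / t) < lim) exit 0       # below threshold: healthy sample
    exit 1
  }'
}
```

For example, `error_rate_ok 5 1000` passes (0.5%) while `error_rate_ok 20 1000` fails (2%). The zero-traffic branch is worth deciding on deliberately: a 10% canary on a quiet service may see no requests inside a one-minute window.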
5. SBOM + SLSA attestation
A cosign signature proves “the team’s identity signed these bytes.” It does NOT answer the two questions an auditor (or your own incident response on the day a transitive dep gets a CVE) actually asks: what’s INSIDE the image (SBOM) and how was it built (SLSA provenance). Both belong on the same attestation chain as the signature, and cosign verifies them the same way.
The pattern is three additions to release.yaml:
- Syft generates a CycloneDX SBOM from the just-pushed digest. CycloneDX over SPDX because the toolchain (Trivy, Grype, Dependency-Track) consumes CycloneDX natively, so it is the same SBOM the scanner you already run reads.
- cosign attest --predicate sbom.cdx.json attaches the SBOM to the digest as an in-toto attestation. The gocdnext/cosign@v1 plugin already supports action: attest with predicate: + predicate-type:, so no new plugin is needed.
- SLSA L3 in-toto provenance generated by the build job itself. The build environment (image digest of the builder, ref/SHA of the source, parameters) is recorded as a second predicate. cosign attests it the same way.
stages: [validate, tag, build, scan, sign, attest, promote, stage]
#                                          ^^^^^^ new stage between sign and promote

jobs:
  # ... build, trivy-image, cosign-sign as before ...
  # ---------- generate CycloneDX SBOM --------------------------
  # Syft inspects the pushed manifest list and walks every
  # platform's layers. Output is one CycloneDX JSON document
  # that lists every OS package + language dep. Costs ~20s for
  # a typical Go service.
  sbom:
    stage: attest
    needs: [cosign-sign]
    docker: true
    secrets: [GHCR_USERNAME, GHCR_TOKEN]
    image: anchore/syft:v1.18.0
    script:
      - |
        echo "$GHCR_TOKEN" | docker login ghcr.io -u "$GHCR_USERNAME" --password-stdin
        # Pin by digest: same bytes the signature anchors to.
        # `-o cyclonedx-json=<path>` writes the doc straight to a
        # workspace path the attest job reads.
        syft "ghcr.io/my-org/my-app@${{ needs.build.outputs.digest }}" \
          -o cyclonedx-json=dist/sbom.cdx.json
    artifacts:
      paths: [dist/sbom.cdx.json]
  # ---------- attach SBOM as a cosign attestation --------------
  # `cosign attest --predicate ...` uploads a signed in-toto
  # statement to the registry. Verification path (Kyverno
  # `verifyImages.attestations:`) checks BOTH the cosign
  # signature AND the SBOM attestation's predicate type before
  # admitting a pod. Same key as the image signature, so the
  # auditor walks one trust anchor, not two.
  attest-sbom:
    stage: attest
    needs: [sbom]
    needs_artifacts:
      - from_job: sbom
        paths: [dist/sbom.cdx.json]
    secrets: [COSIGN_PRIVATE_KEY, COSIGN_PASSWORD, GHCR_USERNAME, GHCR_TOKEN]
    uses: ghcr.io/klinux/gocdnext-plugin-cosign@v1
    with:
      image: ghcr.io/my-org/my-app@${{ needs.build.outputs.digest }}
      action: attest
      predicate: dist/sbom.cdx.json
      predicate-type: cyclonedx
      key-content: ${{ COSIGN_PRIVATE_KEY }}
      key-password: ${{ COSIGN_PASSWORD }}
      registry: ghcr.io
      username: ${{ GHCR_USERNAME }}
      password: ${{ GHCR_TOKEN }}
  # ---------- SLSA L3 provenance --------------------------------
  # The provenance predicate records: builder identity (the
  # gocdnext run's URL + run id), source repo + commit, build
  # invocation (image, args, env). slsa-github-generator is the
  # standard tool when you run on GitHub Actions; on gocdnext we
  # render the predicate ourselves from CI_* variables since the
  # agent IS the builder. The shape below maps to SLSA v1.0
  # `https://slsa.dev/provenance/v1`. Promote to L3 by running
  # this job on an isolated builder profile (separate runner
  # pool, no shared workspace with build); see
  # [runner profile scheduling](/gocdnext/docs/concepts/runner-profiles/).
  slsa-provenance:
    stage: attest
    needs: [build]
    image: alpine:3.20
    script:
      - apk add --no-cache jq
      - |
        mkdir -p dist
        jq -n \
          --arg digest "${{ needs.build.outputs.digest }}" \
          --arg ref "$CI_COMMIT_SHA" \
          --arg repo "$CI_PROJECT_URL" \
          --arg run "$CI_RUN_URL" \
          --arg tag "$TAG" \
          '{
            "_type": "https://in-toto.io/Statement/v1",
            "predicateType": "https://slsa.dev/provenance/v1",
            "subject": [{
              "name": "ghcr.io/my-org/my-app",
              "digest": {"sha256": ($digest | ltrimstr("sha256:"))}
            }],
            "predicate": {
              "buildDefinition": {
                "buildType": "https://gocdnext.example.com/builds/buildx@v1",
                "externalParameters": {"tag": $tag, "source": $repo, "ref": $ref}
              },
              "runDetails": {
                "builder": {"id": "https://gocdnext.example.com/builders/gocdnext-agent"},
                "metadata": {"invocationId": $run}
              }
            }
          }' > dist/slsa.provenance.json
    artifacts:
      paths: [dist/slsa.provenance.json]
  attest-slsa:
    stage: attest
    needs: [slsa-provenance]
    needs_artifacts:
      - from_job: slsa-provenance
        paths: [dist/slsa.provenance.json]
    secrets: [COSIGN_PRIVATE_KEY, COSIGN_PASSWORD, GHCR_USERNAME, GHCR_TOKEN]
    uses: ghcr.io/klinux/gocdnext-plugin-cosign@v1
    with:
      image: ghcr.io/my-org/my-app@${{ needs.build.outputs.digest }}
      action: attest
      predicate: dist/slsa.provenance.json
      # cosign's type name for the v1 predicate; plain `slsaprovenance`
      # maps to v0.2 and would not match the policy's
      # https://slsa.dev/provenance/v1 type. Assumes the plugin passes
      # this value through to cosign unchanged.
      predicate-type: slsaprovenance1
      key-content: ${{ COSIGN_PRIVATE_KEY }}
      key-password: ${{ COSIGN_PASSWORD }}
      registry: ghcr.io
      username: ${{ GHCR_USERNAME }}
      password: ${{ GHCR_TOKEN }}

Wire promote’s needs: to wait for [attest-sbom, attest-slsa]
in addition to cosign-sign. The final tag still only lands
when scan + sign + both attestations are in place.
On the cluster side, extend the Kyverno ClusterPolicy from Prerequisite 1 to require both attestations alongside the signature:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-sbom-and-slsa
spec:
  validationFailureAction: Enforce
  rules:
    - name: verify-attestations
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [prod, prod-system]
      verifyImages:
        - imageReferences:
            - "ghcr.io/my-org/*"
          attestations:
            # Both predicates MUST be present and signed by the
            # same key as the image signature. Missing attestation
            # = admission denied, same as missing signature.
            - type: https://cyclonedx.org/bom
              attestors:
                - entries:
                    - keys: { publicKeys: |- ... }   # same key as Prereq 1
            - type: https://slsa.dev/provenance/v1
              attestors:
                - entries:
                    - keys: { publicKeys: |- ... }
              conditions:
                - all:
                    # Pin which builder is allowed. Closes the
                    # "valid SLSA predicate but signed by a
                    # rogue builder" gap.
                    - key: "{{ predicate.runDetails.builder.id }}"
                      operator: Equals
                      value: "https://gocdnext.example.com/builders/gocdnext-agent"
          required: true

You don’t need an SBOM + SLSA chain to ship; most teams adopt this at the same time they get their first SOC 2 or FedRAMP question about software supply chain. The cost is two extra attest jobs + ~30s wall-clock. The benefit is that the same cosign verification step that catches an unsigned image now also catches an image whose contents aren’t recorded, which closes the “signed but anonymous” gap the auditor will flag.
What follows assumes these prerequisites are in place
The remainder of this page (branching, the four pipelines, hotfix flow, rollback) treats the prerequisites above as done. If your cluster doesn’t verify signatures, you can still adopt the four pipelines for everything else they give you (PR hygiene, stage smoke, audit trail, quorum gates) — but “production-ready” is a softer claim. Promote it to the actual claim once the checks are in place. Prereq 5 (SBOM + SLSA) is the one most teams defer until an auditor asks; the first four are non-negotiable from day one.
Branching contract
DO
- Keep PRs small: < 400 LOC ideal, < 800 LOC max.
- Keep branches short: < 2 days of life.
- Conventional Commits in merges: feat(scope):, fix(scope):, chore:, docs:, refactor:. Put BREAKING CHANGE: in the body to force major.
- Squash-merge to main (linear history; squash title becomes the CHANGELOG entry).
- Hotfixes use the same flow with a hotfix label.
DON’T
- Long-lived branches (release/*, develop, staging).
- Force-push to main.
- Merge without PR approval + green pr.yaml.
- Tag from CLI (always via release.yaml).
- Skip stages “because it’s small”.
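The Conventional Commits rule in the DO list is what lets the release manager pick the next `TAG` without reading every diff: `feat` implies a minor bump, `fix` a patch bump, and `BREAKING CHANGE:` in the body a major bump. A minimal sketch of that mapping follows. The function name `bump_kind` is hypothetical, and treating every other type (`chore`, `docs`, `refactor`) as "no bump" is an assumption, not something this flow prescribes.

```shell
# bump_kind TITLE [BODY]
# Hypothetical helper: classifies a squash-merge commit by the semver bump it implies.
# Prints one of: major, minor, patch, none.
bump_kind() {
  title=$1
  body=${2:-}
  # A BREAKING CHANGE footer in the body wins over whatever the title says.
  case "$body" in
    *"BREAKING CHANGE:"*) echo major; return 0 ;;
  esac
  case "$title" in
    feat:*|feat\(*\):*) echo minor ;;
    fix:*|fix\(*\):*)   echo patch ;;
    *)                  echo none ;;   # assumption: other types do not bump
  esac
}
```

Running it over the squash titles since the last tag and taking the highest result gives a suggested bump; the release manager still makes the call when triggering release.yaml.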
GitHub branch protection on main
In repo settings → Branches → main:
- ✓ Require pull request before merging
- ✓ Require approvals: 1
- ✓ Dismiss stale approvals when new commits push
- ✓ Require status checks to pass. Required: pr / build (the last blocking job of pr.yaml; the check name comes from the gocdnext webhook → status API integration)
- ✓ Require linear history
- ✓ Restrict who can push tags (only the gocdnext service account that runs release.yaml)
Recommended CONTRIBUTING.md
# Contributing
## Workflow
1. Branch off main: `git checkout -b feat/my-thing`
2. Open a PR early (draft is fine).
3. `pr.yaml` runs automatically on every push to the PR: lint, security scans, tests, build, AI review.
4. Wait for 1 reviewer + green `pr.yaml`.
5. Squash-merge with a Conventional Commits title:
   - `feat(scope): description` → minor bump
   - `fix(scope): description` → patch bump
   - `BREAKING CHANGE:` in body → major bump
## Releasing
Release manager opens gocdnext → `release` pipeline → Run latest,
passing `TAG=vX.Y.Z` at trigger time. The pipeline pushes the git
tag, builds the image, scans, signs, and deploys to stage in one
go (~15 min wall-clock).
## Hotfixes
Same flow as features. Add the `hotfix` label on the PR; the
prod approval gate reads this label via `quorum_by_label`
(shipped in v0.13.0) to drop `required:` from 2 to 1.

Separation of duties
“quorum 2” without a separation-of-duties rule is a number, not a control. By default both gocdnext’s approval block and GitHub’s branch protection treat “anyone with approve rights” as fungible — the PR author can theoretically approve their own change at the PR stage (if they have a co-maintainer hat) AND count as one of the two prod approvers. An auditor will ask. The answer “we trust the team” is not the answer they want.
Two pieces have to line up: CODEOWNERS on the repo (so PR review
goes to the right humans) and approver_groups on the prod gate
(so the PR author can’t self-promote). Neither is automatic.
CODEOWNERS
Repo-level file that tells GitHub which humans own which paths. Combined with branch protection’s “Require review from Code Owners” rule, a PR can’t merge until at least one CODEOWNER who isn’t the author has approved.
# Default reviewers: every PR needs at least one from this group.
* @my-org/platform-maintainers

# Risk-weighted paths go to a narrower group. The last matching
# pattern wins, so these replace the default for the path rather
# than stacking on top of it.
/server/internal/auth/ @my-org/security-team
/migrations/ @my-org/db-team
/charts/ @my-org/sre-team
/.gocdnext/ @my-org/release-engineering

# Operator-facing tooling that ships secrets handling.
/server/internal/secrets/ @my-org/security-team

Then in repo settings → Branches → main, enable:
- Require review from Code Owners. Stacks on top of the “Require approvals: 1” rule from the branching contract — at least one of those approvals must be from a CODEOWNER who isn’t the PR author. GitHub enforces “author can’t be the CODEOWNER who counts” natively.
- Restrict who can dismiss pull request reviews. Without this, a CODEOWNER’s stale approval doesn’t get re-requested when the author pushes — defeats the rule.
Approver groups exclude the PR author
prod.yaml’s approval block uses approver_groups: [release-approvers]
with required: 2. The membership of release-approvers is
defined in gocdnext under Settings → Groups. The contract that
makes the quorum meaningful is the PR author is never one of
the two approvers, even if they’re in the group.
gocdnext enforces this via the exclude_pr_author: true flag on
the approval block (since v0.13.0):
approve-prod:
  stage: gate
  approval:
    description: |
      Promote ${TAG} to production?
      Check that stage smoke results are green for this tag.
    approver_groups: [release-approvers]
    required: 2
    # PR author is read from the run's cause_detail.pr_author
    # (same snapshot used by quorum_by_label). If the run was
    # cause=tag rather than cause=pull_request (i.e. the operator
    # cut a tag manually from the dashboard without a PR), the
    # exclusion is a no-op (no author to exclude) and the rule
    # falls back to "no actor counts twice", which is the default
    # anyway. The gate stays at required: 2.
    exclude_pr_author: true
    quorum_by_label:
      hotfix: 1

The release-approvers group definition lives in the gocdnext
admin UI (Settings → Groups → release-approvers). Constrain
membership to humans who aren’t typical PR authors for
production code — usually SRE / release-engineering, not
application developers. The point of the separation is that an
application developer can ship code to prod, but only by getting
two humans whose JOB is “verify this is safe” to click approve.
If your team is small and the same humans both write code AND own release decisions, the rule degrades gracefully: it forbids “author approves own deploy” but leaves “co-maintainer + another co-maintainer” as a valid quorum. Tighten further only when audit requirements demand a hard role split — usually the trigger is SOC 2 or PCI scope.
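The quorum rule itself is small enough to state as code: count distinct approvers, drop the PR author, compare against `required`. The sketch below illustrates that rule only. It is not gocdnext's implementation, and the function name `quorum_met` and its argument order are hypothetical.

```shell
# quorum_met REQUIRED PR_AUTHOR APPROVER...
# Hypothetical helper: returns 0 when at least REQUIRED distinct approvers,
# none of them the PR author, have approved. An empty PR_AUTHOR (a run with
# no PR behind it) excludes nobody, so only "no actor counts twice" applies.
quorum_met() {
  required=$1
  author=$2
  shift 2
  # One approver per line → drop the author → dedupe → count non-empty lines.
  count=$(printf '%s\n' "$@" | grep -v -x -F -e "$author" | sort -u | grep -c . || true)
  [ "$count" -ge "$required" ]
}
```

With `required: 2`, `quorum_met 2 alice bob carol` passes, while `quorum_met 2 alice alice bob` does not, because the author's click is discarded. The hotfix override from `quorum_by_label` corresponds to calling it with `1` instead of `2`.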
Why this matters even when you trust the team
Trust isn’t auditable. The control is. The pattern an auditor asks for (“the author of a change cannot approve their own production deployment”) is enforced by gocdnext + GitHub together, not by team convention. Document the rule, configure the enforcement, and the auditor’s question is answered with a config screenshot instead of an explanation.
Pipeline 1: pr.yaml
Triggers on PR open / sync. Validates everything that human review can’t catch with eyes alone.
name: pr

when:
  event: [pull_request]

stages: [lint, security, test, build, review]

jobs:
  # ---------- lint (parallel, fail fast) -----------------------
  lint-go:
    stage: lint
    uses: ghcr.io/klinux/gocdnext-plugin-golangci-lint@v1
    cache:
      - key: golangci-{{ hash "go.sum" }}
        paths: [.go-mod, .go-cache, .golangci-cache]
    with:
      args: ./...
  # ---------- security (parallel) ------------------------------
  gitleaks:
    stage: security
    uses: ghcr.io/klinux/gocdnext-plugin-gitleaks@v1
    with:
      scan-mode: dir
      exit-code: "1"
      allowlist-paths: "docs/, test/fixtures/"

  trivy-fs:
    stage: security
    uses: ghcr.io/klinux/gocdnext-plugin-trivy@v1
    cache:
      - key: trivy-db
        paths: [.cache/trivy]
    with:
      scan-type: fs
      target: .
      severity: HIGH,CRITICAL
      exit-code: "1"
      ignore-unfixed: "true"

  sonar-pr:
    stage: security
    secrets: [SONAR_TOKEN]
    uses: ghcr.io/klinux/gocdnext-plugin-sonar@v1
    cache:
      - key: sonar-{{ hash "**/pom.xml" }}
        paths: [.sonar-cache, .m2-repo]
    with:
      host-url: https://sonarcloud.io
      organization: my-org
      project-key: my-org_my-app
      pull-request-key: ${CI_PULL_REQUEST_KEY}
      pull-request-branch: ${CI_PULL_REQUEST_BRANCH}
      pull-request-base: ${CI_PULL_REQUEST_BASE}
      wait-for-quality-gate: "true"
      quality-gate-timeout: "600"
  # ---------- test (after lint + security) ---------------------
  unit:
    stage: test
    needs: [lint-go]
    uses: ghcr.io/klinux/gocdnext-plugin-go@v1
    cache:
      - key: go-{{ hash "go.sum" }}
        paths: [.go-mod, .go-cache]
    with:
      command: test -race ./...
    test_reports: ["**/*-junit.xml"]

  integration:
    stage: test
    needs: [unit]
    docker: true   # testcontainers
    uses: ghcr.io/klinux/gocdnext-plugin-go@v1
    cache:
      - key: go-{{ hash "go.sum" }}
        paths: [.go-mod, .go-cache]
    with:
      command: test -race -tags=integration ./...
    test_reports: ["**/*-junit.xml"]

  # ---------- build (no push, only validates) ------------------
  build:
    stage: build
    docker: true
    needs: [integration, trivy-fs, sonar-pr, gitleaks]
    uses: ghcr.io/klinux/gocdnext-plugin-buildx@v1
    with:
      image: ghcr.io/my-org/my-app
      tags: "pr-${CI_PULL_REQUEST_KEY}"
      push: "false"   # PR doesn't publish
      cache: registry
  # ---------- AI review (comments on the PR) -------------------
  ai-review:
    stage: review
    needs: [build]
    secrets: [ANTHROPIC_API_KEY, GITHUB_TOKEN]
    uses: ghcr.io/klinux/gocdnext-plugin-ai-review@v1
    artifacts:
      optional: [.ai-review]
    with:
      provider: claude
      mode: pr-comment
      repo: my-org/my-app
      pr-number: ${CI_PULL_REQUEST_KEY}
      severity-threshold: warning
      # Advisory mode for the first month; flip fail-on-error to
      # "true" once the team has calibrated what severities mean.
      fail-on-error: "false"

notifications:
  - on: failure
    uses: ghcr.io/klinux/gocdnext-plugin-slack@v1
    secrets: [SLACK_WEBHOOK]
    with:
      webhook: ${{ SLACK_WEBHOOK }}
      channel: "#ci-alerts"
      template: |
        :x: PR #${CI_PULL_REQUEST_KEY} ${CI_PIPELINE_ID} failed
        ${CI_COMMIT_SHORT_SHA} on ${CI_PULL_REQUEST_BRANCH}

Notes:
- The implicit project material auto-fires on PR when the pipeline has when.event: [pull_request] at the top level (the webhook handler at server/internal/webhook/pull_request.go checks the material’s events list, which is derived from TriggerEvents).
- CI_PULL_REQUEST_KEY, CI_PULL_REQUEST_BRANCH, CI_PULL_REQUEST_BASE, _TITLE, _AUTHOR, _URL are injected server-side from the webhook payload (since v0.9.0, issue #9). No operator wiring needed; PR runs see them automatically. Push / manual runs skip them silently. See environment variables.
- Sonar’s wait-for-quality-gate: "true" blocks PR merge if the gate fails; this is the strict gate. Pair with gocdnext/ai-review’s fail-on-error: "true" for double gating once you trust both.
Pipeline 2: main.yaml
Triggers on push to main (post-merge). Re-runs the full CI suite
(defense against the rare flaky PR run), publishes a main-<sha>
image for preview environments, and generates release-candidate
notes the team sees in Slack so the release manager knows what’s
sitting in main vs the last tag.
Crucial: main.yaml does NOT tag. Tagging is a conscious
decision via release.yaml. The main-<sha> image is for
preview / dev environments — never prod.
name: main
when:
  event: [push]
  branch: [main]
stages: [test, build, candidate]
jobs: test: stage: test uses: ghcr.io/klinux/gocdnext-plugin-go@v1 cache: - key: go-{{ hash "go.sum" }} paths: [.go-mod, .go-cache] with: command: test -race ./... test_reports: ["**/*-junit.xml"]
build: stage: build docker: true needs: [test] secrets: [GHCR_USERNAME, GHCR_TOKEN] uses: ghcr.io/klinux/gocdnext-plugin-buildx@v1 with: image: ghcr.io/my-org/my-app tags: | main-${CI_COMMIT_SHORT_SHA} main-latest push: "true" registry: ghcr.io username: ${{ GHCR_USERNAME }} password: ${{ GHCR_TOKEN }}
candidate-notes: stage: candidate needs: [build] uses: ghcr.io/klinux/gocdnext-plugin-release-notes@v1 with: format: conventional output: dist/CANDIDATE.md heading: "## Candidate since last tag" artifacts: paths: [dist/CANDIDATE.md]
  notify-candidate:
    stage: candidate
    needs: [candidate-notes]
    needs_artifacts:
      - from_job: candidate-notes
        paths: [dist/CANDIDATE.md]
    secrets: [SLACK_WEBHOOK]
    image: alpine:3.20
    script:
      - apk add --no-cache curl jq
      - |
        NOTES=$(cat dist/CANDIDATE.md)
        PAYLOAD=$(jq -n --arg text "$NOTES" \
          '{text: ":package: Release candidate ready (main is green)", attachments: [{text: $text}]}')
        curl -fsSL -X POST -H "Content-Type: application/json" \
          -d "$PAYLOAD" "$SLACK_WEBHOOK"
For a git-flow developer, this is the killer argument: “Look, I merged to main and the
main-<sha> image appeared, but it did NOT go to prod. The
release-candidate notes say what is on main since the last
release; the release manager decides when to cut.”
Pipeline 3: release.yaml
Triggers manually. The release manager passes TAG=vX.Y.Z at
trigger time. The pipeline validates → pushes the git tag →
builds the image → scans → signs by digest → deploys to stage →
smoke-tests stage. One pipeline does the full release-candidate
chain end-to-end (~15 min wall-clock on a typical project).
One pipeline vs. split release + tag (your call since v0.10.0)
This recipe uses one pipeline for cut-tag + build/scan/sign/
deploy because it’s the simpler shape. As of v0.10.0 gocdnext
ships event: [tag] + CI_TAG_NAME / CI_TAG_MESSAGE /
CI_TAG_AUTHOR (the git ref target SHA arrives via the existing
CI_COMMIT_SHA), so you CAN split this into a release.yaml
that only cuts the tag and a tag.yaml that auto-fires the
build/scan/sign/deploy when the tag push lands —
see the Variant: split release + tag.yaml
section below for the shape. Pick the single-pipeline form if
you want one button push to cut + deploy stage; pick the split
form if you want tag pushes from any source (CLI, GitHub UI,
another automation) to drive the build automatically.
Why scan-before-publish via candidate tag + crane copy
Earlier cuts of this doc published directly to ${TAG} and ran
the scan after, with a runbook that said “manually delete the
tag if Trivy fails”. The published-but-unscanned window was
~30s and the recovery step was a human action on a live registry
ref — fine until the day someone forgets, or a downstream
automation pulls in that window. Phase 2 of the release
hardening (issue #13) closes the window structurally: the final
tag does NOT exist until scan + sign both pass.
Multi-arch is what made the earlier doc hesitate. Buildx
publishes the multi-platform manifest list directly to the
registry — a docker tag + docker push retag from a local
daemon doesn’t preserve the manifest list. The fix is not
to retag locally; it’s crane copy. crane operates at the
registry API level: it copies the manifest list verbatim from
one tag ref to another. Zero blobs over the wire, microseconds,
multi-arch safe.
The shape:
- Build publishes pre-${TAG} (the candidate). Manifest list lives in the registry as soon as build finishes.
- Trivy scans @${digest}. Digest is immutable: same bytes regardless of what tag points at them.
- Cosign signs @${digest}. The signature is anchored to bytes, not a tag, so it applies equally to any tag that later points at this digest.
- crane copy <pre-tag> <final-tag>. Registry-side manifest rewrite. Final tag now exists, signed, scanned.
- Optional cleanup: drop the pre-${TAG} tag once prod has consumed the digest. (Most operators leave it for ~7 days as a forensic anchor; it costs nothing since blobs are shared.)
If crane is unavailable in your agent image, the two drop-in
substitutes are skopeo copy and docker buildx imagetools create --tag <final> <pre>. All three are registry-side ops.
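The three tools can sit behind one wrapper so the promote job does not care which one the agent image ships. A sketch only: the `promote_image` function and the `PROMOTE_TOOL` / `DRY_RUN` variables are hypothetical names, and the commands are the ones named above (the skopeo form adds `--all` so every platform in the manifest list is copied, and needs the `docker://` transport prefix):

```shell
# Hypothetical wrapper: copy SRC (a digest ref) to DST (the final tag).
# PROMOTE_TOOL selects the tool; DRY_RUN=1 prints the command instead of
# running it, which is handy for checking refs before touching a registry.
promote_image() (
  src=$1
  dst=$2
  tool=${PROMOTE_TOOL:-crane}
  case "$tool" in
    crane)      set -- crane copy "$src" "$dst" ;;
    skopeo)     set -- skopeo copy --all "docker://$src" "docker://$dst" ;;
    imagetools) set -- docker buildx imagetools create --tag "$dst" "$src" ;;
    *) echo "unknown PROMOTE_TOOL: $tool" >&2; exit 2 ;;
  esac
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
)
```

Passing the source as `repo@sha256:...` rather than `repo:pre-${TAG}` keeps the promote step pinned to the bytes that were scanned and signed, whichever tool runs it.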
The cost: one extra job (promote) and one extra package install
inside it (the crane binary, a ~15MB Alpine package). The
benefit: an attacker can’t pull an unscanned tag because the
unscanned tag never had the production name in the first place.
name: release
when:
  event: [manual]
# Operator passes at trigger time. Pipeline refuses to run if
# empty or if the tag already exists.
variables:
  TAG: ""
stages: [validate, tag, build, scan, sign, promote, stage]
jobs: # ---------- validate inputs + repo state --------------------- validate: stage: validate image: alpine:3.20 script: - apk add --no-cache git - | if [ -z "$TAG" ]; then echo "❌ TAG variable required at trigger time (e.g. v1.2.3)" exit 1 fi case "$TAG" in v[0-9]*) ;; *) echo "❌ TAG must start with 'v' and a digit (got: $TAG)" exit 1 ;; esac if git rev-parse "refs/tags/$TAG" >/dev/null 2>&1; then echo "❌ tag $TAG already exists" exit 1 fi LAST=$(git describe --tags --abbrev=0 2>/dev/null || echo "v0.0.0") COUNT=$(git rev-list "$LAST..HEAD" --count) if [ "$COUNT" -eq 0 ]; then echo "❌ No commits since $LAST — nothing to release" exit 1 fi echo "✓ ready to cut $TAG ($COUNT commits since $LAST)"
# ---------- confirm main.yaml passed green on this SHA ------- # The validate job above checks "are there commits?" — it does # NOT check "did main.yaml succeed on the SHA we're about to # tag?". Without this second check, an operator can cut a tag # off a SHA whose main-branch CI never passed (smoke broke, # integration test broke, somebody force-merged past branch # protection). The release pipeline would then build + sign + # promote a binary that has no green CI behind it. # # `check-pipeline-run` (the same plugin prod.yaml uses for # preflight) queries the gocdnext REST API and matches by # `revision:` — the SHA the release pipeline checked out. # Refuses to advance unless that revision has a terminal- # success run of `main`. max-age is generous (14d) because a # SHA stays green forever; this is a "did it pass?" check, not # a freshness check (validate above already caught the "is the # tip past the last tag?" question). verify-main-green: stage: validate needs: [validate] secrets: [GOCDNEXT_API_TOKEN] uses: ghcr.io/klinux/gocdnext-plugin-check-pipeline-run@v1 with: api-url: https://gocdnext.example.com api-token: ${{ GOCDNEXT_API_TOKEN }} project: my-org pipeline: main revision: ${CI_COMMIT_SHA} expected-status: success max-age: 14d
# ---------- generate release notes + push tag ---------------- release-notes: stage: tag needs: [validate, verify-main-green] uses: ghcr.io/klinux/gocdnext-plugin-release-notes@v1 with: format: conventional output: dist/RELEASE_NOTES.md heading: "## ${TAG}" artifacts: paths: [dist/RELEASE_NOTES.md]
create-tag: stage: tag needs: [release-notes] needs_artifacts: - from_job: release-notes paths: [dist/RELEASE_NOTES.md] secrets: [GH_RELEASE_TOKEN] image: alpine:3.20 script: - apk add --no-cache git - | git config user.email "ci-release@my-org.example.com" git config user.name "release-bot" git tag -a "$TAG" -F dist/RELEASE_NOTES.md
# GIT_ASKPASS pattern: the token never lands on argv (vs # `git remote set-url ...token...`) and never persists in # `.git/config` (vs embedding in the remote URL). The # helper is a tiny shell script that prints credentials # from inherited env; the trap wipes it on exit. ASKPASS=$(mktemp) chmod 700 "$ASKPASS" trap 'rm -f "$ASKPASS"' EXIT cat > "$ASKPASS" <<'EOF' #!/bin/sh case "$1" in Username*) echo "x-access-token" ;; Password*) echo "$GH_RELEASE_TOKEN" ;; esac EOF chmod +x "$ASKPASS"
# Assumes `origin` is the https://github.com/my-org/my-app.git # URL the agent cloned (no embedded creds). If your agent # uses SSH, replace with an ssh-key-based push instead. GIT_ASKPASS="$ASKPASS" git push origin "$TAG" echo "==> Pushed tag $TAG"
# ---------- build immutable artefact (CANDIDATE tag) --------- # v0.14.x recipe: scan-before-publish via candidate tag. Pre- # v0.14.x this section published directly to ${TAG} and the # scan ran after — the failure mode was "scan finds a HIGH CVE, # the operator now has to remember to manually delete the tag # that's already live in the registry". The candidate-then- # promote shape below makes the FINAL tag exist ONLY after # scan + sign both pass. # # Multi-arch is the wrinkle: buildx publishes the manifest list # directly to the registry (it can't be retagged locally because # a docker daemon doesn't hold a multi-arch manifest natively). # The workaround is to push to a candidate tag, scan + sign # against the immutable digest (works regardless of tag), then # `crane copy` the manifest list to the final tag. crane copy is # a registry-side manifest rewrite — zero blob upload, microseconds. build: stage: build needs: [create-tag] docker: true secrets: [GHCR_USERNAME, GHCR_TOKEN] uses: ghcr.io/klinux/gocdnext-plugin-buildx@v1 # `digest:` output is the sha256:… of the built+pushed manifest; # downstream jobs (scan, sign, promote, prod's preflight) all # consume this so the operation runs against an immutable # digest. See [Prerequisite 2: Deploy by digest, not tag](#2-deploy-by-digest-not-tag). outputs: digest: IMAGE_DIGEST with: image: ghcr.io/my-org/my-app # CANDIDATE tag (pre-${TAG}). Only operators with explicit # knowledge of this convention can `docker pull` it; the # public ${TAG} doesn't exist until after promote below. # Drops the historical `:latest` push entirely — a moving # tag fights the immutability claim the rest of the pipeline # makes (see Phase 1 Prerequisite 2). tags: pre-${TAG} platforms: linux/amd64,linux/arm64 push: "true" registry: ghcr.io username: ${{ GHCR_USERNAME }} password: ${{ GHCR_TOKEN }} cache: registry
# ---------- scan published image ----------------------------- trivy-image: stage: scan docker: true needs: [build] cache: - key: trivy-db paths: [.cache/trivy] uses: ghcr.io/klinux/gocdnext-plugin-trivy@v1 with: scan-type: image # Scan the DIGEST emitted by build, not the tag — defends # against the (unlikely but real) race where someone pushes # over the tag between build and scan. The bytes that get # signed below are the SAME bytes scanned here. target: ghcr.io/my-org/my-app@${{ needs.build.outputs.digest }} severity: HIGH,CRITICAL exit-code: "1" # run-fail = operator deletes the tag
# ---------- sign via plugin's key-content input -------------- # The gocdnext/cosign plugin accepts `key-content:` which pipes # the private-key bytes straight from `secrets:` — the plugin's # entrypoint writes them to a 0600 mktemp file and a `trap` # wipes it on exit. The key never persists in the artifact # backend (S3/GCS/filesystem) and (since v0.7.x) never lands on # the docker run argv either — the agent's Docker engine # propagates env values via cmd.Env + `docker run -e KEY` # (name-only on argv) so secret-bearing values are invisible # to `ps auxww` on the host. # # No `docker: true` — cosign signs via the registry API; it # doesn't need the host docker socket. Keeping it off reduces # the blast radius of the job that holds the private key. # # Registry creds are required: `cosign sign` uploads the # signature manifest, which on a private registry (like # private GHCR) needs auth. cosign-sign: stage: sign needs: [trivy-image] secrets: [COSIGN_PRIVATE_KEY, COSIGN_PASSWORD, GHCR_USERNAME, GHCR_TOKEN] uses: ghcr.io/klinux/gocdnext-plugin-cosign@v1 with: # Pin by digest, NOT tag. Cosign anchors the signature to # the manifest digest regardless of how you invoke it, but # passing the digest explicitly closes the # tag-resolved-twice race: build pushed digest A under # candidate, someone races a push of digest B, scan ran # on A, cosign would sign B if it re-resolves the tag # here. Reusing `needs.build.outputs.digest` ensures sign # sees the same bytes that scan blessed. image: ghcr.io/my-org/my-app@${{ needs.build.outputs.digest }} action: sign key-content: ${{ COSIGN_PRIVATE_KEY }} key-password: ${{ COSIGN_PASSWORD }} registry: ghcr.io username: ${{ GHCR_USERNAME }} password: ${{ GHCR_TOKEN }}
# ---------- promote candidate → final tag -------------------- # crane copy is a REGISTRY-SIDE manifest rewrite. The blobs # already exist (build pushed them); copy creates a new tag # ref pointing at the same digest. Microseconds + zero bytes # over the wire. Multi-arch safe because it copies the manifest # LIST verbatim — no per-platform retag dance needed. # # After this job lands, the registry has: # ghcr.io/my-org/my-app:pre-${TAG} (candidate — operators can keep or delete) # ghcr.io/my-org/my-app:${TAG} (final, scanned + signed) # ghcr.io/my-org/my-app@sha256:… (digest, what prod actually pins) # # The cosign signature, anchored to the digest, applies to both # tag references — they point at the same bytes. promote: stage: promote needs: [cosign-sign] docker: true secrets: [GHCR_USERNAME, GHCR_TOKEN] image: alpine:3.20 script: - apk add --no-cache crane - | # crane authenticates via Docker-config; the plugin-style # `docker login` invocation works inside the agent's daemon. echo "$GHCR_TOKEN" | docker login ghcr.io -u "$GHCR_USERNAME" --password-stdin crane copy \ ghcr.io/my-org/my-app@${{ needs.build.outputs.digest }} \ ghcr.io/my-org/my-app:${TAG}
# OPTIONAL: delete the candidate tag once the final tag has # landed and prod's preflight has consumed the digest. Many # operators keep the candidate around for ~7 days as a forensic # anchor (the candidate tag references the same digest, so # debugging "what did we ship" works against either). Adding a # scheduled cleanup pipeline is left as an exercise — the # important thing is that the candidate tag is NOT what prod # references; only the final + digest are load-bearing.
github-release: stage: sign needs: [cosign-sign] secrets: [GH_RELEASE_TOKEN] uses: ghcr.io/klinux/gocdnext-plugin-github-release@v1 with: tag: ${TAG} title: "Release ${TAG}" token: ${{ GH_RELEASE_TOKEN }} generate-notes: "true"
# ---------- auto-deploy to stage ----------------------------- deploy-stage: stage: stage needs: [promote, build] secrets: [STAGE_KUBECONFIG] uses: ghcr.io/klinux/gocdnext-plugin-helm@v1 with: # helm plugin auto-detects kubeconfig as inline YAML, # base64-encoded YAML, or workspace-relative path. kubeconfig: ${{ STAGE_KUBECONFIG }} # Set image.digest, not image.tag. The chart resolves to # `image: my-app@${digest}` — same posture prod uses, so # stage exercises the digest-pinned path before prod does. # See [Prerequisite 2: Deploy by digest, not tag](#2-deploy-by-digest-not-tag). command: | upgrade --install my-app charts/my-app --namespace stage --set image.digest=${{ needs.build.outputs.digest }} --wait --timeout 5m
smoke-stage: stage: stage needs: [deploy-stage] image: alpine:3.20 script: - apk add --no-cache curl - | for url in $(cat ./tests/smoke-urls.txt); do if ! curl -fsSL "$url"; then echo "❌ smoke failed on $url" exit 1 fi done echo "✓ stage smoke passed for ${TAG}"
notifications:
  - on: success
    uses: ghcr.io/klinux/gocdnext-plugin-slack@v1
    secrets: [SLACK_WEBHOOK]
    with:
      webhook: ${{ SLACK_WEBHOOK }}
      channel: "#deploys"
      template: |
        :white_check_mark: ${TAG} deployed to STAGE. Run prod deploy manually when ready.
Important: prod is not touched. The Slack message says explicitly “Run prod deploy manually when ready”. That’s the contract: published image = release candidate, not production truth.
Verifying the SHA passed main.yaml
The verify-main-green job in the validate stage above is the
quietest of the gates and the easiest to omit, so it’s worth
naming the failure mode it closes. The earlier validate job
checks only “does the tag already exist?” and “is there at least
one commit since the previous tag?”. Neither answers the
question that actually matters at release time: did the SHA
we’re about to tag run main.yaml green?
The combination that bites teams:
- An emergency revert lands on main while main.yaml is failing.
- The release manager opens release.yaml against the HEAD of main (which is now the revert commit), passes a tag, clicks Run.
- validate passes — the tag doesn’t exist, there’s a commit since last release.
- The build job runs against a SHA whose CI never finished: unit/integration/smoke/perf all skipped, because main.yaml is still mid-run or already red.
- A tag goes out, an image gets signed, stage deploys. The first evidence anyone has that the SHA was bad is when stage smoke catches it (sometimes) or when prod catches it (worse).
The gocdnext/check-pipeline-run@v1 plugin closes this by
querying the gocdnext API for a terminal-success main run
matching revision: ${CI_COMMIT_SHA} — the exact SHA the release
pipeline checked out. Same plugin prod.yaml uses for its
preflight against release; same shape, different upstream
pipeline. If main.yaml didn’t finish green on this SHA, the
release pipeline halts before producing the tag. The operator
fixes main first (corrective merge, hotfix, whatever), waits for
main.yaml to land green on the new HEAD, then re-runs release.
The max-age: 14d value is permissive on purpose: a SHA’s
green-ness doesn’t expire — main.yaml passed once, the bytes
haven’t changed since. The freshness gate is the validate
job’s “are there new commits since last tag?” check, which runs
right before this one. Set max-age lower (e.g. 48h) only if
your team explicitly wants to force a re-run of main.yaml when
holding a tag for more than two days.
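The 14d versus 48h trade-off is plain arithmetic on the age of the green run. The sketch below only illustrates the comparison; `duration_to_seconds` and `run_is_fresh` are hypothetical helpers, not the plugin's implementation, and only the `h` and `d` suffixes used on this page are handled:

```shell
# Hypothetical helper: convert a max-age style duration ("48h", "14d")
# to seconds. Anything else is rejected.
duration_to_seconds() {
  number=${1%[hd]}
  case "$number" in
    ''|*[!0-9]*) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
  case "$1" in
    *h) echo $(( number * 3600 )) ;;
    *d) echo $(( number * 86400 )) ;;
    *)  echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# A run is inside the window when its age (in seconds) does not exceed
# the max-age limit. Returns 2 when the limit itself cannot be parsed.
run_is_fresh() {
  limit=$(duration_to_seconds "$2") || return 2
  [ "$1" -le "$limit" ]
}
```

A main.yaml run that went green three days ago (259200 seconds) passes `run_is_fresh 259200 14d` but fails `run_is_fresh 259200 48h`, which is exactly the forced re-run the lower setting buys.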
Keyless cosign (alternative when OIDC is available)
The recipe above uses a key-based sign because most teams have a PEM key + passphrase but not Sigstore OIDC. When your platform DOES have OIDC (Fulcio + Rekor on the public network, or a private Sigstore deployment), prefer keyless — no key to manage, no rotation. Drop the secrets + key inputs:
  cosign-sign:
    stage: sign
    needs: [trivy-image]
    secrets: [GHCR_USERNAME, GHCR_TOKEN]
    uses: ghcr.io/klinux/gocdnext-plugin-cosign@v1
    with:
      # Still pin by digest: the final ${TAG} does not exist in the
      # registry until promote runs, which is after this job.
      image: ghcr.io/my-org/my-app@${{ needs.build.outputs.digest }}
      action: sign
      # No key or key-content → plugin signs keyless via Fulcio.
      registry: ghcr.io
      username: ${{ GHCR_USERNAME }}
      password: ${{ GHCR_TOKEN }}
This requires a workload-identity setup gocdnext doesn’t ship today (the agent’s pod needs a projected SA token + the registry needs to accept Fulcio certificates). Use it if your platform team has this; otherwise the key-content pattern above is the realistic path.
Cleaner downstream chains with native job outputs (v0.11+)
Since v0.11.0 gocdnext supports
structured job outputs.
The recipes above use the legacy “downstream is image: + script: +
source .gocdnext/foo.env” pattern because that’s what worked on
older agents; for new pipelines, declare outputs: on the producer
and reference ${{ needs.X.outputs.Y }} directly in any consumer’s
with: or env:. That lets the cosign-sign step (and every other
step that needs a runtime value from a prior job) stay as a clean
uses: plugin invocation instead of inlining shell.
gocdnext/semver-bump@v1 and gocdnext/image-copy@v1 already
write both the legacy workspace file AND $GOCDNEXT_OUTPUT_FILE,
so the migration is a one-line declaration on the producer + a
one-token substitution on the consumer.
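For a script job that still has to serve both old and new consumers, the dual-write can be sketched in a few lines of shell. This is a sketch under a loud assumption: it treats `$GOCDNEXT_OUTPUT_FILE` as accepting one `KEY=value` per line, which this page does not confirm, so check the structured job outputs reference first. The `emit_output` helper and the `LEGACY_ENV_FILE` variable are hypothetical names:

```shell
# Hypothetical helper: record KEY=value in the legacy workspace file and,
# when the agent provides it, in $GOCDNEXT_OUTPUT_FILE as well.
# ASSUMPTION: the output file takes one KEY=value pair per line.
emit_output() {
  key=$1
  value=$2
  legacy_file=${LEGACY_ENV_FILE:-.gocdnext/outputs.env}
  mkdir -p "$(dirname "$legacy_file")"
  printf '%s=%s\n' "$key" "$value" >> "$legacy_file"
  if [ -n "${GOCDNEXT_OUTPUT_FILE:-}" ]; then
    printf '%s=%s\n' "$key" "$value" >> "$GOCDNEXT_OUTPUT_FILE"
  fi
}
```

On an older agent that does not set `GOCDNEXT_OUTPUT_FILE`, the helper degrades to the legacy file only, so the same script works on both sides of the migration.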
Variant: split release + tag.yaml
Since v0.10.0 gocdnext routes event: [tag] webhooks and emits
CI_TAG_NAME / CI_TAG_MESSAGE / CI_TAG_AUTHOR (the git ref
target SHA is exposed via the existing CI_COMMIT_SHA — that’s
the git SHA the tag points at, NOT an OCI image digest, so use
CI_TAG_NAME for image references the registry will resolve).
This lets release.yaml do just the tag cut, and a separate
tag.yaml auto-fires on tag push to do the build/scan/sign/
deploy. The split is cleaner because the build pipeline doesn’t
need a manual TAG variable — it reads ${CI_TAG_NAME} from the
webhook payload — and any tag pushed via CLI or GitHub UI fires
the same build, not just tags from the Run-latest dashboard.
name: release
when:
  event: [manual]
variables:
  TAG: ""
stages: [validate, tag]
jobs:
  validate:
    stage: validate
    image: alpine:3.20
    script:
      - apk add --no-cache git
      - |
        # Same validate body as the single-pipeline variant above.
        ...
  create-tag:
    stage: tag
    needs: [validate]
    secrets: [GH_RELEASE_TOKEN]
    image: alpine:3.20
    script:
      - apk add --no-cache git
      - |
        # Same GIT_ASKPASS push as the single-pipeline variant.
        # The tag push fires the GitHub webhook → tag.yaml takes over.
        ...

name: tag
when:
  event: [tag]
stages: [build, scan, sign, promote, stage]
jobs: build: stage: build docker: true secrets: [GHCR_USERNAME, GHCR_TOKEN] uses: ghcr.io/klinux/gocdnext-plugin-buildx@v1 outputs: digest: IMAGE_DIGEST with: image: ghcr.io/my-org/my-app # Candidate tag only — same scan-before-publish posture # as the single-pipeline form. No `latest`: moving tags # fight the immutability claim Kyverno enforces at admission. tags: pre-${CI_TAG_NAME} platforms: linux/amd64,linux/arm64 push: "true" registry: ghcr.io username: ${{ GHCR_USERNAME }} password: ${{ GHCR_TOKEN }} cache: registry
trivy-image: stage: scan docker: true needs: [build] uses: ghcr.io/klinux/gocdnext-plugin-trivy@v1 with: scan-type: image # Pin by digest — same bytes the build pushed, immune to # any concurrent push racing the candidate tag. target: ghcr.io/my-org/my-app@${{ needs.build.outputs.digest }} severity: HIGH,CRITICAL exit-code: "1"
cosign-sign: stage: sign needs: [trivy-image] secrets: [COSIGN_PRIVATE_KEY, COSIGN_PASSWORD, GHCR_USERNAME, GHCR_TOKEN] uses: ghcr.io/klinux/gocdnext-plugin-cosign@v1 with: image: ghcr.io/my-org/my-app@${{ needs.build.outputs.digest }} action: sign key-content: ${{ COSIGN_PRIVATE_KEY }} key-password: ${{ COSIGN_PASSWORD }} registry: ghcr.io username: ${{ GHCR_USERNAME }} password: ${{ GHCR_TOKEN }}
promote: stage: promote needs: [cosign-sign] docker: true secrets: [GHCR_USERNAME, GHCR_TOKEN] image: alpine:3.20 script: - apk add --no-cache crane - | echo "$GHCR_TOKEN" | docker login ghcr.io -u "$GHCR_USERNAME" --password-stdin crane copy \ ghcr.io/my-org/my-app@${{ needs.build.outputs.digest }} \ ghcr.io/my-org/my-app:${CI_TAG_NAME}
  deploy-stage:
    stage: stage
    needs: [promote, build]
    secrets: [STAGE_KUBECONFIG]
    uses: ghcr.io/klinux/gocdnext-plugin-helm@v1
    with:
      kubeconfig: ${{ STAGE_KUBECONFIG }}
      command: |
        upgrade --install my-app charts/my-app
        --namespace stage
        --set image.digest=${{ needs.build.outputs.digest }}
        --wait --timeout 5m
How it routes: tag pushes match by URL (a tag points at a SHA
that may not be on any branch, so branch-based routing can’t
work). Every git material on this repo with events: [tag]
fires; materials without tag in their list are silently
skipped. The implicit project material auto-inherits its events
from the pipeline-level when.event:, so the tag.yaml above
needs no materials: block.
prod.yaml is unchanged — it still triggers manually with
TAG=vX.Y.Z passed at trigger time, since prod promotion is a
human gate by design (not a webhook follow-on).
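One consequence of the split form is that tag.yaml fires for any tag pushed from any source, so the format check that release.yaml's validate job performs is worth repeating at the top of tag.yaml. A minimal sketch, assuming a hypothetical `is_release_tag` helper and a strict vMAJOR.MINOR.PATCH convention (stricter than the `v[0-9]*` pattern the single-pipeline validate job uses):

```shell
# Hypothetical guard for the first job of tag.yaml: accept only tags of
# the form vMAJOR.MINOR.PATCH so stray tags do not start a release build.
is_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Typical use inside a script step:
#   is_release_tag "${CI_TAG_NAME:-}" || { echo "skipping non-release tag"; exit 1; }
```

Whether a rejected tag should fail the run or exit quietly is a team decision; failing loudly makes an accidental tag push visible in the audit log.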
Pipeline 4: prod.yaml
Triggers manually. The only pipeline that can touch production.
Approval block with approver_groups: [release-approvers] and
required: 2 ensures no single actor can ship to prod alone.
name: prod
when:
  event: [manual]
variables:
  # Release-approver picks the tag at trigger time.
  TAG: v0.0.0
stages: [gate, preflight, deploy, verify]
jobs: approve-prod: stage: gate approval: description: | Promote ${TAG} to production? Check that stage smoke results are green for this tag. approver_groups: [release-approvers] required: 2 # quorum 2 — default for normal promotions quorum_by_label: # v0.13.0+ — PR-label-driven override hotfix: 1 # `hotfix` label on the originating PR → quorum 1 # add `breaking-change: 3` if your team prefers stricter for risky tags
preflight: stage: preflight needs: [approve-prod] # Since v0.12.0 this is a typed plugin call instead of inline # curl + jq. Confirms a terminal-success run of `release` # exists for ${TAG} via the gocdnext REST API AND that the # resolved digest still matches what release.yaml signed # (`require-digest-match: true` since v0.14.x — re-runs # `crane digest` against the current registry state and # aborts if the tag has been re-pushed since release ran). # Fails the gate loud (exit 1) when either check fails — the # prod deploy chain stays red. The digest flows out as an # output for the deploy job below. secrets: [GOCDNEXT_API_TOKEN] uses: ghcr.io/klinux/gocdnext-plugin-check-pipeline-run@v1 outputs: digest: IMAGE_DIGEST run_url: RUN_URL with: api-url: https://gocdnext.example.com api-token: ${{ GOCDNEXT_API_TOKEN }} project: my-org pipeline: release tag: ${TAG} expected-status: success require-digest-match: true max-age: 7d artifacts: paths: [".gocdnext/preflight.env"]
  deploy-prod:
    stage: deploy
    needs: [preflight]
    # `secrets: [PROD_KUBECONFIG]` is the legacy posture: a long-lived
    # kubeconfig sitting in gocdnext's secret store. Acceptable when
    # the cluster isn't on a cloud with OIDC federation; otherwise
    # prefer the workload-identity variant. See
    # [Workload identity (recommended for new deployments)](#workload-identity-recommended-for-new-deployments).
    secrets: [PROD_KUBECONFIG]
    uses: ghcr.io/klinux/gocdnext-plugin-helm@v1
    with:
      kubeconfig: ${{ PROD_KUBECONFIG }}
      # Pin by DIGEST (not tag) so the bytes the cluster runs are
      # exactly the bytes release.yaml signed. See
      # [Prerequisite 2: Deploy by digest, not tag](#2-deploy-by-digest-not-tag)
      # for why this isn't optional. Kyverno's `mutateDigest`
      # would catch a tag-only deploy too, but pinning here gives
      # us a deterministic Helm release shape (revision N always
      # references the same bytes) which makes auto-rollback
      # behave predictably.
      #
      # `--set rollout.enabled=true` renders the chart's Rollout
      # template (Argo Rollouts) instead of a Deployment. See
      # [Prerequisite 4: Progressive delivery](#4-progressive-delivery-on-prod-canary-not-big-bang).
      # `--wait` here only blocks until the Rollout object is
      # applied, NOT until the canary ramp completes. The
      # canary progression is observed by `watch-canary` below.
      #
      # image.repository must match the repository release.yaml
      # pushed to (ghcr.io/my-org/my-app), or the digest won't resolve.
      command: |
        upgrade --install my-app charts/my-app
        --namespace prod
        --set image.repository=ghcr.io/my-org/my-app
        --set image.digest=${{ needs.preflight.outputs.digest }}
        --set rollout.enabled=true
        --wait --timeout 5m
  watch-canary:
    stage: deploy
    needs: [deploy-prod]
    secrets: [PROD_KUBECONFIG]
    image: alpine:3.20
    script:
      - apk add --no-cache bash curl
      - |
        export KUBECONFIG=$(mktemp)
        chmod 600 "$KUBECONFIG"
        printf '%s' "$PROD_KUBECONFIG" > "$KUBECONFIG"
        trap 'rm -f "$KUBECONFIG"' EXIT
        curl -fsSL -o /usr/local/bin/kubectl-argo-rollouts \
          https://github.com/argoproj/argo-rollouts/releases/download/v1.7.2/kubectl-argo-rollouts-linux-amd64
        chmod +x /usr/local/bin/kubectl-argo-rollouts
        # `status` watches by default and blocks until a terminal
        # status or the timeout. Exit code 0 = Healthy (canary fully
        # promoted). Non-zero = Degraded (analysis aborted; Argo
        # Rollouts has already shifted traffic back to the old
        # ReplicaSet). The pipeline failure surfaces in the failure
        # notification block. `get rollout --watch` only renders the
        # rollout tree and does not report health through its exit
        # code, so it cannot gate the pipeline.
        # `--timeout 600s` is the upper bound on the FULL ramp
        # (10→25→50→100 with pauses + analysis between).
        kubectl-argo-rollouts status my-app -n prod --timeout 600s
smoke-prod: stage: verify needs: [watch-canary] secrets: [PROD_KUBECONFIG, SLACK_WEBHOOK] image: alpine:3.20 script: # apk's helm package (community repo on Alpine 3.19+) gives # us `helm rollback` from a tiny image without bundling a # custom one. Same as bash/curl/openssl pattern. - apk add --no-cache bash curl jq helm openssl - | # Write kubeconfig to a private file the rollback can use. # Inline secret value goes through stdin → file, never argv. export KUBECONFIG=$(mktemp) chmod 600 "$KUBECONFIG" printf '%s' "$PROD_KUBECONFIG" > "$KUBECONFIG" trap 'rm -f "$KUBECONFIG"' EXIT
# At this point the canary is at 100% (watch-canary returned # 0). Smoke is the LAST defence — catches issues that # weren't visible in the canary error-rate signal (e.g. # contract-level regressions where the response is HTTP # 200 but the body is wrong, or first-time-only paths # that only fire on a fresh user session). if ! ./tests/smoke-prod.sh; then echo "❌ prod smoke failed at 100% — full helm rollback" # `helm rollback <release> 0` rolls back to the prior # revision (0 = "one before current"). More reliable # than trying to compute "previous tag" under # concurrent partial deploys. helm rollback my-app 0 -n prod --wait --timeout 10m curl -fsSL -X POST "$SLACK_WEBHOOK" \ -H "Content-Type: application/json" \ -d "{\"text\":\":fire: prod rollback (smoke failed on ${TAG})\"}" exit 1 fi echo "✓ prod smoke passed for ${TAG}"
notifications:
  - on: success
    uses: ghcr.io/klinux/gocdnext-plugin-slack@v1
    secrets: [SLACK_WEBHOOK]
    with:
      webhook: ${{ SLACK_WEBHOOK }}
      channel: "#deploys"
      template: ":rocket: ${TAG} live in PROD"
  - on: failure
    uses: ghcr.io/klinux/gocdnext-plugin-slack@v1
    secrets: [SLACK_WEBHOOK]
    with:
      webhook: ${{ SLACK_WEBHOOK }}
      channel: "#oncall"
      template: |
        :fire: PROD deploy ${TAG} failed; auto-rollback may have triggered. Investigate: ${CI_RUN_ID}
The deploy stage now gives you two halt mechanisms before human attention is required, ordered by blast radius:
- Canary analysis (Argo Rollouts): aborts during ramp when the Prometheus error rate crosses 1% on the canary slice. Traffic shifts back to the old ReplicaSet automatically; watch-canary exits non-zero; the pipeline fails before smoke ever runs. Max user impact: 10% of traffic for ~30s.
- Full smoke + helm rollback: runs ONLY after canary reached 100%. Catches regressions invisible to the canary’s error-rate signal (semantic regressions, first-time-only code paths). When it triggers, the full deploy is on prod already, the same helm rollback 0 story the pre-v0.14 doc described. Max user impact: 100% of traffic until rollback completes (usually < 60s).
Why the auto-rollback path is non-negotiable for trunk-based:
both halts together remove the “what if it breaks?” fear. The
previous Helm revision is always one helm rollback away, and
most failures never reach that point because the canary catches
them at 10%.
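The second halt (smoke, then rollback) reduces to a small control-flow pattern that can be exercised locally before it guards production. A sketch, assuming hypothetical `SMOKE_CMD` and `ROLLBACK_CMD` variables so the commands can be swapped for stubs; the defaults mirror the commands used in smoke-prod above:

```shell
# Hypothetical wrapper around the smoke-prod logic.
# Return codes: 0 = smoke passed, 1 = smoke failed and rollback succeeded,
# 2 = smoke failed and the rollback ALSO failed (needs a human right away).
smoke_or_rollback() {
  smoke_cmd=${SMOKE_CMD:-./tests/smoke-prod.sh}
  rollback_cmd=${ROLLBACK_CMD:-helm rollback my-app 0 -n prod --wait --timeout 10m}
  if sh -c "$smoke_cmd"; then
    echo "smoke passed"
    return 0
  fi
  echo "smoke failed, rolling back" >&2
  if ! sh -c "$rollback_cmd"; then
    echo "rollback failed as well, page a human" >&2
    return 2
  fi
  return 1
}
```

Separating "rolled back cleanly" from "rollback failed" is a design choice of this sketch, not something the pipeline above does; it lets the failure notification say which of the two situations on-call is walking into.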
Workload identity (recommended for new deployments)
secrets: [PROD_KUBECONFIG] in the deploy job above works, but
it bakes a long-lived credential into gocdnext’s secret store
that grants helm upgrade rights against the production cluster.
That secret has the same blast radius as a stolen ~/.kube/config
— anyone who can read the gocdnext secret (DB compromise, backup
extraction, a malicious plugin running in the same agent) gets
prod-cluster admin. Rotation requires a coordinated update
across every project that uses the secret. For 2026-era
deployments, this is the credential posture the auditor will
flag.
Workload identity federation removes the long-lived credential entirely. The agent exchanges its OIDC token (proof “this is run N of pipeline X on project Y, signed by gocdnext”) for a short-lived cloud token (5-minute STS credential), then uses that token to mint a kubeconfig that’s valid only for the duration of the deploy. Nothing long-lived sits anywhere.
The pattern is platform-specific in the details, identical in shape:
- The cloud IAM trusts gocdnext’s OIDC issuer as a federated identity provider.
- A trust policy on the prod-deploy role restricts which subjects can assume it, typically project:my-org + pipeline:prod + ref:refs/heads/main, so the role can’t be assumed from a fork PR or a different pipeline.
- The agent presents its OIDC token at deploy time; the cloud returns a short-lived cluster-access token.
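When a trust policy rejects a run, the quickest check is to look at the claims the token actually carries. The sketch below decodes the payload segment of a JWT with standard tools; `jwt_payload` is a hypothetical helper, it does NOT verify the signature, and the claim names in the comment are the ones the trust policy above refers to:

```shell
# Hypothetical debugging helper: print the JSON payload of a JWT.
# Inspection only: the signature is not checked.
jwt_payload() {
  # Take the middle segment of header.payload.signature and map the
  # base64url alphabet back to standard base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url encoding strips.
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Expect claims along the lines of:
#   {"project":"my-org","pipeline":"prod","ref":"refs/heads/main"}
```

If the decoded `project`, `pipeline`, or `ref` differs from what the trust policy expects, the cloud side is behaving correctly and the fix belongs in the pipeline or the policy, not in the credentials.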
GKE + Workload Identity Federation (full example)
The path with the most operator value today, because GKE makes the kubeconfig minting clean:
```hcl
resource "google_iam_workload_identity_pool" "gocdnext" {
  workload_identity_pool_id = "gocdnext-pool"
  display_name              = "gocdnext federated identities"
}

resource "google_iam_workload_identity_pool_provider" "gocdnext_oidc" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.gocdnext.workload_identity_pool_id
  workload_identity_pool_provider_id = "gocdnext-oidc"
  display_name                       = "gocdnext OIDC"

  oidc {
    issuer_uri = "https://gocdnext.example.com"
  }

  # Map the OIDC claims gocdnext puts in the token to attributes
  # the trust policy can match against. project/pipeline/ref are
  # what the cloud uses to scope WHICH gocdnext runs may assume
  # this role.
  attribute_mapping = {
    "google.subject"     = "assertion.sub"
    "attribute.project"  = "assertion.project"
    "attribute.pipeline" = "assertion.pipeline"
    "attribute.ref"      = "assertion.ref"
  }

  # CRITICAL: without an attribute_condition, ANY gocdnext run
  # could assume any role this pool trusts. Lock to a single
  # project + pipeline + ref before the binding is created.
  attribute_condition = <<EOT
    attribute.project == "my-org" &&
    attribute.pipeline == "prod" &&
    attribute.ref == "refs/heads/main"
EOT
}

resource "google_service_account" "prod_deployer" {
  account_id   = "gocdnext-prod-deployer"
  display_name = "gocdnext prod-deployer (federated)"
}

resource "google_service_account_iam_member" "wif_binding" {
  service_account_id = google_service_account.prod_deployer.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/projects/PROJECT_NUM/locations/global/workloadIdentityPools/gocdnext-pool/attribute.project/my-org"
}

# GKE-side: bind the SA to the cluster-access role you actually
# need. `container.developer` is the minimum for `helm upgrade`
# in a single namespace; never bind cluster-admin.
resource "google_project_iam_member" "gke_access" {
  project = "my-prod-project"
  role    = "roles/container.developer"
  member  = "serviceAccount:${google_service_account.prod_deployer.email}"
}
```

Then prod.yaml’s deploy job changes shape: no PROD_KUBECONFIG
secret, instead an OIDC token claim that the gcloud plugin
exchanges via gcloud auth login --cred-file:
```yaml
deploy-prod:
  stage: deploy
  needs: [preflight]
  # No PROD_KUBECONFIG secret; that's the entire point. The only
  # secret-bearing input is the OIDC token gocdnext mints
  # automatically; nothing long-lived to compromise or rotate.
  oidc:
    audience: "//iam.googleapis.com/projects/PROJECT_NUM/locations/global/workloadIdentityPools/gocdnext-pool/providers/gocdnext-oidc"
  uses: ghcr.io/klinux/gocdnext-plugin-gcloud@v1
  with:
    # `credentials-json:` left empty triggers WIF mode (per plugin
    # docs). The plugin reads GOOGLE_APPLICATION_CREDENTIALS, which
    # points at a credential-config file the agent renders from the
    # `oidc:` block above. STS exchange happens inside the plugin
    # container; the short-lived token lives in env for ~5 minutes
    # and disappears with the container.
    command: |
      container clusters get-credentials prod-cluster \
        --region us-east1 --project my-prod-project &&
      helm upgrade --install my-app charts/my-app \
        --namespace prod \
        --set image.digest=${{ needs.preflight.outputs.digest }} \
        --set rollout.enabled=true \
        --wait --timeout 5m
    project: my-prod-project
```

AWS — IAM Roles for Service Accounts (IRSA) + STS AssumeRoleWithWebIdentity
Same shape, different control plane. The trust policy attaches
to an IAM role; gocdnext’s OIDC token gets exchanged via
sts:AssumeRoleWithWebIdentity for a short-lived AWS session.
The aws-cli plugin handles the exchange when role-arn: is set
and access-key: / secret-key: are empty. EKS clusters then
authenticate the deploy via the role’s mapped Kubernetes RBAC
(aws-auth ConfigMap).
The trust policy on the role restricts sub to the same
project/pipeline/ref tuple GCP locks down; without that, any
gocdnext run can assume it. Request the session with
--duration-seconds 900 (15 minutes, the shortest duration STS accepts for web-identity sessions).
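As a rough illustration of the trust policy described above, the sketch below builds the JSON document in Python. The Federated principal and the issuer-prefixed sub / aud condition keys are standard IAM syntax for OIDC providers. The account ID, the issuer host, the subject string (taken from the "typically" example earlier on this page) and the audience value are all placeholder assumptions; check what gocdnext actually puts in its token before using any of them.

```python
import json


def web_identity_trust_policy(account_id: str, issuer_host: str, subject: str, audience: str) -> dict:
    """Build an IAM role trust policy for an external OIDC issuer.

    Only the document structure is standard IAM; every value passed
    in by the caller below is a placeholder.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{issuer_host}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # Exact match on both claims; without the sub
                    # condition any run from the issuer could assume the role.
                    "StringEquals": {
                        f"{issuer_host}:sub": subject,
                        f"{issuer_host}:aud": audience,
                    }
                },
            }
        ],
    }


policy = web_identity_trust_policy(
    account_id="123456789012",  # placeholder account
    issuer_host="gocdnext.example.com",  # placeholder issuer
    subject="project:my-org+pipeline:prod+ref:refs/heads/main",  # assumed sub format
    audience="sts.amazonaws.com",  # assumed audience value
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the role's trust relationship once the placeholders are replaced.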
Azure — Workload Identity Federation on AKS
The AKS path uses Azure AD’s federated identity credentials.
The trust policy lists allowed subject claims (same tuple).
The az-cli plugin (or azure/login@v2 equivalent) handles the
OIDC → AAD token exchange; az aks get-credentials then mints
the kubeconfig. The token is short-lived here as well (~1h max).
Legacy: kubeconfig secret (still supported)
If your cluster isn’t on a cloud with OIDC federation, or your
platform team hasn’t wired the trust policy yet, the
secrets: [PROD_KUBECONFIG] form shown in Pipeline 4: prod.yaml above remains valid. Treat it as the
fallback, not the recommended posture. Rotate the kubeconfig at
least quarterly; bind it to a service account with the
narrowest possible RBAC (a single namespace’s
Role/RoleBinding, never cluster-admin); audit who has
read access on the gocdnext secret store. When the platform
team can adopt WIF, prefer it.
Hotfix flow
Same four files, two differences:
- The PR has the hotfix label. The reviewer policy can be relaxed (e.g. branch protection rule “Allow specific actors to bypass” for the on-call user).
- prod.yaml’s approval gate reads the hotfix label natively via quorum_by_label (shipped v0.13.0). The override is a snapshot at run materialisation: the PR’s labels are read once when the run is created, with no pre-step or GitHub-API round-trip. See Approval gates.
The path through the pipelines is identical: PR → main → tag → stage → prod. Hotfix is faster, not less safe: code review (human + AI), security scans, immutable builds, and the stage smoke test all still run.
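The label-driven quorum can be pictured as a small lookup. This sketch assumes the behaviour the roadmap notes on this page describe, where the largest matching override is applied and the default is used when no label matches. The function name and the quorum of 3 for migration:destructive are hypothetical; only the hotfix value of 1 and the default of 2 come from this page.

```python
from typing import Dict, List


def effective_quorum(default: int, overrides: Dict[str, int], pr_labels: List[str]) -> int:
    """Pick the approval quorum for a run from its PR-label snapshot.

    Among the labels that have an override, the largest value wins;
    with no matching label the default quorum applies.
    """
    matching = [overrides[label] for label in pr_labels if label in overrides]
    return max(matching) if matching else default


overrides = {
    "hotfix": 1,                # from this page: hotfix loosens 2 -> 1
    "migration:destructive": 3, # hypothetical value for the higher-quorum gate
}

print(effective_quorum(2, overrides, []))                                   # default applies
print(effective_quorum(2, overrides, ["hotfix"]))                           # hotfix override
print(effective_quorum(2, overrides, ["hotfix", "migration:destructive"]))  # largest match wins
```

Taking the largest match means a hotfix label cannot be used to dilute a stricter gate that another label on the same PR demands.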
Rollback
Three mechanisms, ordered by when they fire. Most failures stop at #1; the rest split across #2 and #3.
1. Canary auto-abort (Argo Rollouts, during deploy). Analysis crosses the error-rate threshold mid-ramp; Rollouts shifts traffic back to the old ReplicaSet automatically. watch-canary exits non-zero; the pipeline fails. The workload is already on the old version, so no rollback step is needed; just close the loop (see Recovering from a canary abort).
2. Auto-rollback via helm rollback (used by smoke-prod on failure, post-canary). Catches regressions invisible to the canary’s error-rate signal: semantic regressions, first-time-only paths. At this point the deploy is at 100%, so the rollback transition has the full blast radius until it completes.
3. Manual rollback: re-trigger prod.yaml with the previous good TAG. For regressions caught past the smoke window, or when chart values changed between deploys.
Pre-v0.14 versions of this doc described #2 and #3 without picking one — operators ended up mixing them, leaving the cluster in different states (Helm revision points at chart values from N-1, re-deploy uses values at HEAD with image of N-1, drift). The default is #2 (auto, within the smoke window); #3 is the explicit escape valve.
When rollback is safe (default path)
A rollback is safe when all of these hold:
- The migration contract from Prerequisite 3 is followed: every migration between current and target is backward-compatible, no destructive DDL crossed the rollback boundary.
- The signature verification + digest pinning prereqs are in place: Kyverno’s mutateDigest ensures the rolled-back container is the bytes that were signed for the target tag, not a re-tagged drift.
- The target tag is one that ran green through release.yaml AND its prod.yaml deploy in the past (i.e., it’s a tag that has already been to prod successfully). Tags that only passed stage haven’t proven themselves under prod load.
When all three hold, rollback is one of the flavours below. When they don’t, jump to Fix-forward.
Recovering from a canary abort
When Argo Rollouts auto-aborts mid-ramp, traffic is already on
the old ReplicaSet (status.stableRS still points at the
old version). The Rollout resource itself, however, references
the new image: Argo Rollouts holds the new ReplicaSet at 0
replicas pending operator action. Two paths:
- Discard the new revision (the common case: you want the pipeline failure to mean “this tag is dead”):

  ```shell
  kubectl-argo-rollouts undo my-app -n prod
  ```

  undo rewrites the Rollout spec back to the previous revision’s image. The 0-replica new ReplicaSet is GC’d. Cluster state matches “this deploy never happened” except for the failed Argo Experiment/AnalysisRun resources, which Argo keeps for forensic reference.
- Force the canary through anyway (rare: the analysis threshold was too tight and the new image is actually fine):

  ```shell
  kubectl-argo-rollouts promote my-app -n prod
  ```

  promote skips remaining analysis steps and ramps to 100%. Use ONLY when you’ve verified externally (manual smoke, prometheus query at the canary slice) that the abort was a false positive. Re-trigger the on-call analysis-threshold review before the next release.
The on-call playbook should default to undo and treat
promote as a documented exception. Either way, log the
decision in the incident channel — the canary signal is only
as trustworthy as the operator’s discipline about NOT promoting
through it.
Auto-rollback (Helm-revision based)
When smoke-prod fails immediately after a deploy, the same job
calls helm rollback my-app 0 -n prod — Helm steps back to the
prior revision. This is what prod.yaml does today; nothing
human-driven is needed. The cluster state matches the previous
revision EXACTLY (same chart values, same image, same secrets)
because Helm tracks the revision atomically.
Use this for: smoke-test failures, immediate post-deploy regressions caught by automation. Window: the first ~5 minutes after the deploy lands.
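The smoke-then-rollback behaviour can be sketched as a single function. This is a Python model of the idea, not the job prod.yaml actually runs: smoke_or_rollback and its parameters are hypothetical, and the command runner is injectable so the example can be exercised without a cluster. The helm rollback arguments are the ones quoted above.

```python
import subprocess
from typing import Callable, List, Sequence


def smoke_or_rollback(
    smoke_ok: Callable[[], bool],
    run: Callable[..., object] = subprocess.run,
    release: str = "my-app",
    namespace: str = "prod",
) -> bool:
    """Run the smoke probe; on failure, step Helm back one revision.

    Returns True when the deploy is kept, False when it was rolled back.
    """
    if smoke_ok():
        return True
    # Revision 0 tells Helm to roll back to the previous release revision.
    run(["helm", "rollback", release, "0", "-n", namespace], check=True)
    return False


# Dry run with a fake runner that records the command instead of executing it.
calls: List[Sequence[str]] = []
kept = smoke_or_rollback(lambda: False, run=lambda cmd, check: calls.append(cmd))
print(kept, calls)
```

With the default runner, check=True makes a failed helm rollback raise, so the job fails loudly rather than leaving the bad revision in place.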
Manual rollback (re-trigger prod.yaml with the previous tag)
After the smoke-test window closes (auto-rollback won’t fire),
or when the regression was caught by humans hours/days later,
re-trigger prod.yaml with the previous good tag:
```text
Dashboard → prod pipeline → Run latest
  TAG: v0.6.4 (the previous known-good)
  → approval gate (quorum 2, same as forward)
  → preflight (resolves digest from release.yaml)
  → helm upgrade --set image.digest=<v0.6.4 digest>
  → smoke
```

Yes, “shit’s broken right now” deploys still need 2 approvers. Intentional: panic deploys are how incidents compound. The on-call playbook should include a list of known-good tags so the approval takes < 60 seconds.
Use this for: regressions caught past the smoke window, when auto-rollback didn’t fire, or when chart values changed between the deploys (Helm revision rollback would also revert the values change, which may not be what you want).
When rollback is NOT safe — fix-forward
If the deploy you’d roll back across includes any of these, don’t roll back — fix forward:
- A migration that’s not backward-compatible (column drop without expand/contract, NOT NULL on existing data without default, destructive DDL marked by the PR’s migration:destructive label that bypassed the higher-quorum gate).
- A secret rotation that depended on the new code reading the rotated value (rolling back means old code with new secret = auth failures).
- A feature-flag default change that the previous code can’t parse (also expand/contract territory; if you cross-deploy flag updates, leave the old default working).
Fix-forward = open a hotfix PR with the corrective change, ride
it through pr.yaml → main.yaml → release.yaml →
prod.yaml like any other deploy. The hotfix label loosens
quorum on prod from 2 → 1 (per quorum_by_label); everything
else stays the same. ~30 minutes wall-clock for the round-trip.
The on-call playbook should call this out explicitly: the default is rollback; the exception is fix-forward; the question that decides is “does the migration / secret / flag survive a backward jump?” That single question is the most important rollback skill an on-call rotation can build.
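The deciding question can be written down as a small check that an on-call engineer could walk through. This is a sketch only: DeployDelta, its field names and decide are hypothetical and simply mirror the three hazards listed in this section. It encodes rollback as the default and fix-forward as the exception.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DeployDelta:
    """What changed between the current deploy and the rollback target.

    Field names are hypothetical; each one maps to a hazard in the list above.
    """
    incompatible_migration: bool = False
    secret_rotation_needs_new_code: bool = False
    flag_default_old_code_cannot_parse: bool = False


def decide(delta: DeployDelta) -> Tuple[str, List[str]]:
    """Return ("rollback" or "fix-forward", reasons for fixing forward)."""
    reasons: List[str] = []
    if delta.incompatible_migration:
        reasons.append("migration is not backward-compatible")
    if delta.secret_rotation_needs_new_code:
        reasons.append("old code cannot read the rotated secret")
    if delta.flag_default_old_code_cannot_parse:
        reasons.append("old code cannot parse the new flag default")
    # Rollback stays the default; any single hazard flips the decision.
    return ("fix-forward", reasons) if reasons else ("rollback", reasons)


print(decide(DeployDelta()))
print(decide(DeployDelta(incompatible_migration=True)))
```

Returning the reasons alongside the decision gives the incident channel something concrete to log.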
Migration plan
For a team currently on git-flow, adopt incrementally so each phase proves safety before the next lands:
Week 1-2: pr.yaml only
- Land the file, hook on PR open/sync.
- Team continues developing on develop if they want.
- Devs see lint/security/test/AI-review results on every PR.
- They start trusting the CI.
Week 3-4: main.yaml
- Land the main pipeline.
- Switch default branch to main; mark develop deprecated.
- Devs see main stays green; merges don’t break anything (prod isn’t connected yet).
- Release-candidate notes start showing up in Slack.
Week 5: release.yaml
- Move release management from manual git tag + manual build scripts to running the release pipeline. Operator passes TAG=vX.Y.Z at trigger time; the pipeline does git tag + build + scan + sign + stage deploy + smoke in one go.
- Team sees the value of stage smoke before prod.
Week 6+: prod.yaml
- Add the prod pipeline with quorum 2.
- The “release-manager-only-can-deploy” flow goes away.
- Delete the develop branch.
- Update CONTRIBUTING.md to point at this page.
The point of phasing is psychological, not technical. Devs who were scared of trunk-based see each step prove safety. After the final phase they’re asking “can we ship more often?”.
Selling this to skeptics
If you have devs who think git-flow is the safe option, the talking points that work:
- “Merging to main is NOT deploying to prod.” Show the 4-stop chain. The mental model fits on one slide.
- “Two humans approve every prod deploy.” That’s already more than your git-flow has — most git-flow shops have “release manager + tags” with no second-approval contract.
- “Rollback is one click, and most failures never need one.” Canary auto-abort (Argo Rollouts) catches degraded deploys at ≤ 10% traffic before they ramp. When the smoke probe at 100% does catch something, helm rollback is one command. Re-running prod.yaml with the previous tag is the manual path for incidents found hours later, and it is quicker than any merge-conflict-prone git-flow rollback.
- “Branches > 800 LOC have 3-5× more bugs in production.” Short PRs are the actual safety net, not long-lived release branches.
- “AI + Sonar + Trivy + Gitleaks + tests + 1 human review on every PR.” Each change gets more eyes than git-flow normally provides.
- “Hotfix is the same path, faster.” Show that no shortcut skips review.
Limitations + roadmap
The model above is mostly ready to apply against gocdnext as shipped. The deliberate gaps:
- PR vars: ✅ shipped in v0.9.0 (issue #9). PR runs see CI_CAUSE=pull_request plus CI_PULL_REQUEST_KEY / _BRANCH / _BASE / _TITLE / _AUTHOR / _URL materialised server-side from the webhook payload. The pr.yaml recipe above uses them directly, with no variables: workaround.
- Tag-push event + tag env vars: ✅ shipped in v0.10.0. Tag pushes now route via event: [tag] (URL-only matching since tags don’t carry a base branch); the scheduler injects CI_TAG_NAME / CI_TAG_MESSAGE / CI_TAG_AUTHOR for any run with cause == "tag". The git ref target SHA is on the existing CI_COMMIT_SHA (same value, no duplicate var). The Variant: split release + tag.yaml section above shows the cleaner shape this enables. The single-pipeline form remains in the recipe for teams that prefer one button push.
- Multi-arch scan-before-publish: ✅ shipped in v0.10.0 (plugin) and adopted in this recipe in v0.14.x. The gocdnext/image-copy@v1 plugin and the inline crane copy used in the release.yaml promote job both work the same way: build to a candidate tag, scan + sign the immutable digest, then promote (registry-side manifest rewrite) to the final tag. Multi-arch is preserved end-to-end because the manifest list is copied verbatim, with no local retag and no manifest list rebuild. Teams that prefer the gocdnext/image-copy@v1 plugin over a shell-out to crane can substitute it in the promote job: same effect, and it picks crane / skopeo / buildx-imagetools as the backend at the operator’s discretion.
- Per-tag preflight via API: ✅ shipped in v0.12.0. The gocdnext/check-pipeline-run@v1 plugin queries the gocdnext REST API and confirms a target pipeline produced a terminal-success run matching the operator’s filter (tag:, revision:). Replaces the inline curl + jq in prod.yaml with a typed plugin + structured outputs (run id, URL, revision, counter, finished_at). Supports a poll mode for prod chains queued immediately after the release pipeline.
- Semver bump as plugin: ✅ shipped in v0.10.0. The gocdnext/semver-bump@v1 plugin auto-computes the next tag from Conventional Commits since the prior tag (major on feat!: / BREAKING CHANGE:, minor on feat:, patch otherwise). Writes a shell-sourceable .gocdnext/semver.env that downstream create-tag jobs source. Combined with event: [tag] + CI_TAG_NAME above, the release flow is now “click Run on release.yaml → semver-bump → create-tag → push; tag webhook auto-fires tag.yaml” with no operator-typed TAG variable anywhere.
- Cosign by content key: ✅ shipped in this release. The gocdnext/cosign@v1 plugin now accepts key-content:, which writes the bytes to a 0600 mktemp inside the plugin’s container and trap-wipes on exit. The recipe uses this; no artifact persistence, no shell hack with the official cosign image (which is distroless + has no shell).
- PR-label-driven approval quorum: ✅ shipped in v0.13.0 via quorum_by_label on the approval: block. Reads PR labels from the run’s cause_detail.pr_labels (snapshot at materialisation) and applies the largest matching override. GitHub-only at v0.13.0 (#11 / #12 for GitLab MR / Bitbucket PR).
- Coverage tab + HTML report preview: see issue #8. Today coverage = artifacts.optional:, and the Sonar Quality Gate covers the gate.
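The bump rule quoted in the semver item above (major on feat!: / BREAKING CHANGE:, minor on feat:, patch otherwise) can be sketched as follows. This is not the gocdnext/semver-bump@v1 implementation: next_version is a hypothetical function that applies only the rule as stated on this page, and it makes no attempt to handle scoped prefixes such as feat(api): or pre-release tags.

```python
import re
from typing import Iterable


def next_version(prior_tag: str, commit_messages: Iterable[str]) -> str:
    """Compute the next vX.Y.Z tag from commit messages since prior_tag."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", prior_tag)
    if match is None:
        raise ValueError(f"not a vX.Y.Z tag: {prior_tag!r}")
    major, minor, patch = (int(part) for part in match.groups())
    messages = list(commit_messages)

    # Breaking change anywhere in the range: bump major, reset the rest.
    if any(m.startswith("feat!:") or "BREAKING CHANGE:" in m for m in messages):
        return f"v{major + 1}.0.0"
    # At least one feature: bump minor, reset patch.
    if any(m.startswith("feat:") for m in messages):
        return f"v{major}.{minor + 1}.0"
    # Everything else (fixes, chores, docs) is a patch bump.
    return f"v{major}.{minor}.{patch + 1}"


print(next_version("v1.2.3", ["fix: handle empty payload"]))
print(next_version("v1.2.3", ["fix: typo", "feat: add retry flag"]))
print(next_version("v1.2.3", ["fix: drop v1 API\n\nBREAKING CHANGE: v1 routes removed"]))
```

Checking for the breaking marker first matters: a range that contains both feat: and BREAKING CHANGE: must produce a major bump, not a minor one.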
See also
- gocdnext/ai-review plugin — Claude + OpenAI code review.
- gocdnext/sonar plugin — SonarQube + SonarCloud Quality Gate.
- Approval gates — approver_groups + required (quorum).
- Services lifecycle — sibling service containers (postgres, redis, …) for integration tests.
- Materials — implicit project material + when.event: triggers.