Auto-Reverting Migrations on Health-Check Failure

The deploy succeeded, the migration applied, and four minutes later error rate is at 8% and p95 latency has doubled. By the time a human reads the alert, opens the dashboard, and decides to roll back, the incident is already ten minutes old. Automated reversal on health-check failure closes that gap: the pipeline watches a defined set of signals after the deploy and, if they breach the budget within the observation window, restores the previous application image or flips a kill-switch flag on its own — in seconds, not minutes. The critical constraint is that this reversal must never touch the schema. The migration ran forward and is additive; the right rollback restores the old code, not the old table. This page wires the health-check gate, defines the signals and budgets that should trigger it, and explains why the automated path flips behavior rather than reversing DDL.

Reversal restores the previous image or flips a flag; the additive schema stays in place because the old code was written to tolerate it.

Symptom / Error Signatures

The signals that should trigger an auto-revert are golden-path metrics, observed in a tight window right after the deploy. Three qualify. Error rate crossing a budget: an HTTP 5xx ratio above 1%, or a spike in application exceptions in the new image’s logs. Latency regression: p95 or p99 climbing past a ceiling that the previous release held comfortably under. Saturation: connection-pool exhaustion, or queue depth growth that the new code introduced. Each is a numeric breach, not a crash, which is exactly why a human notices late and a machine should notice immediately.
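The three signal classes above reduce to numeric comparisons, which a minimal sketch can make concrete. The function name `breached_signals`, its argument order, and the pool figures are hypothetical; the 1% and 250 ms thresholds are the ones used elsewhere on this page, and a real gate would read the values from a metrics backend.

```shell
# Hypothetical helper, not part of any real tool.
# Usage: breached_signals ERROR_RATIO P95_MS POOL_IN_USE POOL_MAX
# Prints one line per breached signal; returns 0 if anything breached, 1 otherwise.
breached_signals() {
  awk -v err="$1" -v p95="$2" -v used="$3" -v max="$4" '
    BEGIN {
      n = 0
      if (err + 0 > 0.01)      { print "error-rate";  n++ }  # 5xx ratio above 1%
      if (p95 + 0 > 250)       { print "p95-latency"; n++ }  # assumed 250 ms ceiling
      if (used + 0 >= max + 0) { print "saturation";  n++ }  # pool fully in use
      exit (n > 0 ? 0 : 1)
    }'
}
```

Called as `breached_signals 0.08 400 10 20`, it names the error-rate and latency breaches and returns success, which a caller can treat as "revert now".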

Distinguish two failure shapes, because they call for different reversals. A code regression — the new image is slow or throws — is fixed by restoring the previous image. A behavioral regression behind a flag — the new query path the migration enabled is misbehaving — is fixed by flipping the flag off without changing the deployed image at all. The health-check gate must know which lever to pull.

The anti-signature to watch for is the schema-blaming reflex: an instinct to “roll back the migration” when the migration applied cleanly and the regression is in code or query behavior. Reverting the DDL would be both unnecessary and dangerous. The expanded schema is backward compatible by design, so the old code runs against it perfectly.

Root Cause Analysis

Automated rollback is safe only because the deploy was structured to make code and schema independently reversible. That structure is expand-and-contract: the migration adds the new shape (expand) and runs ahead of the code, leaving the old shape intact. Because the old code was written against the old shape and the new shape is purely additive, the previously deployed image runs correctly against the new schema. That single property — backward-compatible expansion — is what lets the pipeline revert code without reverting DDL.

This is why the automated path must not reverse the schema. A DROP COLUMN or DROP TABLE triggered by a health-check failure would destroy data the expand phase produced, and it would do so under incident conditions when judgment is worst. The reversible unit is the application image (restore the previous container) or the behavior flag (turn off the new path). The schema is a one-way ratchet during the incident; it is contracted later, deliberately, once the new code is proven — never automatically.

The choice of lever follows from how the change was rolled out:

Rollout mechanism	What auto-revert does	Why it is non-destructive
New image, no flag	Restore previous image (Deployment rollback / blue-green flip)	Old image tolerates the expanded schema
Behavior behind a flag	Flip the flag off	New schema stays; readers fall back to old path
Dual-write enabled	Disable the new write path	Both tables remain; reconcile later

Feature-flagged rollouts give the fastest and least disruptive lever: a flag flip takes effect as soon as the flag service propagates it and restarts no running process, which is the basis of using feature flags to toggle schema changes safely. The image restore is the fallback when no flag guards the change.
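The mapping from rollout mechanism to lever can be written as a small dispatch function. This is a sketch under assumptions: the name `choose_lever` and the mechanism and action labels are hypothetical, chosen only to mirror the table above.

```shell
# Hypothetical dispatch: maps a rollout mechanism to its revert action.
# There is deliberately no branch that emits a schema change.
choose_lever() {
  case "$1" in
    flag)       echo "flip-flag-off" ;;
    dual-write) echo "disable-new-write-path" ;;
    image)      echo "restore-previous-image" ;;
    *)          echo "unknown rollout mechanism: $1" >&2; return 2 ;;
  esac
}
```

Because every branch names a code or flag action, a request to "revert the schema" falls through to the error case instead of silently succeeding.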

Immediate Mitigation

If a deploy is breaching SLOs right now and no automation exists yet, perform the safe reversal by hand — restore code or flip a flag, never drop schema:

Flip the kill-switch flag if the new behavior is flag-guarded. This is instant and reverses nothing structural.

# Ops shell · flag service API · no DB or deploy change · instantaneous
# Context: disable the new path; old code path resumes against the same schema.
./bin/flags set checkout_v2 --off --reason "p95 breach after deploy"

Otherwise restore the previous application image through your orchestrator’s native rollback, which leaves the schema untouched.

# Ops shell · Kubernetes · reverts the Deployment's pod template only
# Context: previous image runs correctly against the expanded, backward-compatible schema.
kubectl rollout undo deployment/checkout   # with no --to-revision, reverts to the previous revision
kubectl rollout status deployment/checkout --timeout=120s

Confirm recovery against the same health signals before declaring the incident resolved — error rate back under budget, p95 back to baseline.
Do not touch the schema. Leave the expanded columns and tables in place; they are harmless to the restored code and reversing them risks data loss. The reasoning is detailed in writing safe down migrations for automated rollback.

Permanent Fix / Long-Term Pattern

The durable control is a post-deploy verification gate whose non-zero exit triggers an image restore or flag flip — and never a schema change.

# Argo Rollouts-style analysis gate — fail triggers automatic rollback of the image
# Context: observes SLOs for a window; schema is never part of the rollback action.
analysis:
  metrics:
    - name: error-rate
      interval: 30s
      count: 10                       # observe for 5 minutes
      successCondition: "result < 0.01"   # < 1% 5xx
      failureLimit: 2                 # run fails once more than 2 measurements fail (not necessarily consecutive)
    - name: p95-latency-ms
      interval: 30s
      count: 10
      successCondition: "result < 250"
      failureLimit: 2
# On failure, Argo aborts the rollout and reverts to the previous ReplicaSet (old image).

#!/usr/bin/env bash
# bin/post-deploy-gate.sh: generic gate; on a breach it reverts, then exits non-zero
# Context: reverts code/flag only; the schema is left expanded.
set -euo pipefail
if ! ./bin/healthcheck --error-rate-max 1% --p95-latency-max 250ms --window 5m; then
  echo "Health breach: reverting code, leaving schema expanded" >&2
  # Flag flip first; fall back to image restore when no flag is set or the flip fails.
  # ${RELEASE_FLAG:-} keeps `set -u` from aborting before the fallback can run.
  if [[ -z "${RELEASE_FLAG:-}" ]] || ! ./bin/flags set "$RELEASE_FLAG" --off; then
    kubectl rollout undo "deployment/$SERVICE"
  fi
  exit 1   # marks the deploy failed; no DROP is ever issued
fi

Three design choices make this reliable. First, the observation window must be long enough to catch a regression that ramps with traffic but short enough to bound damage — five minutes of 30-second samples is a common balance. Second, the failure budget should require a sustained breach (two or more consecutive bad samples) so a single noisy data point does not trigger a spurious revert. Third, the revert action is parameterized by lever — flag flip first, image restore second — and the schema is explicitly out of scope. Couple the flag lifecycle to the schema change deliberately, as in coupling schema changes to feature flags, so the kill switch is always available. This gate is the verification half of the rollback automation contract.
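The sustained-breach rule can be sketched as a scan for a run of consecutive bad samples. The function name `sustained_breach` and its argument order are hypothetical; samples are assumed to arrive oldest first, one per 30-second interval, so ten samples cover 10 × 30 s = 300 s, the five-minute window.

```shell
# Hypothetical helper: sustained_breach THRESHOLD REQUIRED SAMPLE...
# Returns 0 (breach) if REQUIRED consecutive samples exceed THRESHOLD, else 1.
sustained_breach() {
  sb_threshold=$1
  sb_required=$2
  shift 2
  printf '%s\n' "$@" | awk -v t="$sb_threshold" -v r="$sb_required" '
    { run = ($1 + 0 > t + 0) ? run + 1 : 0 }  # a good sample resets the streak
    run >= r { found = 1; exit }              # streak long enough: stop scanning
    END { exit (found ? 0 : 1) }'
}
```

With a 1% budget and two required samples, `sustained_breach 0.01 2 0.002 0.03 0.004 0.02` returns 1 because the two bad samples are isolated, while `sustained_breach 0.01 2 0.002 0.03 0.04` returns 0.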

Verification Checklist

The gate observes concrete SLOs (error rate, p95 latency) for a defined post-deploy window.
A breach requires a sustained signal (multiple consecutive bad samples), not one outlier.
The revert action restores the previous image or flips a flag — it never issues DROP or TRUNCATE.
The schema is explicitly left expanded after an auto-revert.
Flag-guarded changes flip the flag first; only un-flagged changes fall back to image restore.
Recovery is confirmed against the same signals before the incident is closed.
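The "never issues DROP or TRUNCATE" item can be enforced mechanically in CI with a crude text scan over the revert scripts. This is a sketch, assuming the revert path lives in shell scripts passed as arguments; the function name `assert_no_destructive_sql` is hypothetical, and a keyword scan is no substitute for restricting the deploy role's database privileges.

```shell
# Hypothetical CI check: assert_no_destructive_sql FILE...
# Returns 1 if any file contains DROP or TRUNCATE as a whole word outside a comment.
assert_no_destructive_sql() {
  ands_status=0
  for ands_file in "$@"; do
    # Strip shell-style comments so a remark like "no DROP is issued" does not trip the check.
    if sed 's/#.*$//' "$ands_file" |
       grep -n -i -E '(^|[^[:alnum:]_])(drop|truncate)([^[:alnum:]_]|$)'; then
      echo "destructive SQL keyword in $ands_file" >&2
      ands_status=1
    fi
  done
  return "$ands_status"
}
```

Running it against the gate script and any hooks it calls turns the checklist item into a failing build rather than a review convention.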

Frequently Asked Questions

Why not roll back the migration when health checks fail? Because the migration is additive and the previous code was written to tolerate it, so reverting the schema is unnecessary — and a DROP triggered automatically during an incident risks destroying data the expand phase produced. The fast, safe reversal restores the previous image or flips a flag and leaves the schema in place.

How long should the health-check observation window be? Long enough to catch a regression that grows with traffic, typically around five minutes of frequent samples, but bounded so a bad deploy does not run unobserved. Require several consecutive breaching samples before triggering, so transient noise does not cause a spurious rollback.

What if there is no feature flag guarding the change? Fall back to restoring the previous application image through your orchestrator’s native rollback, which reverts only the pod template and never the database. For future changes, guard behavioral switches with a flag so the fastest, least disruptive lever is always available.