Migration Pipeline Gating
Gating is the prepare-phase discipline that decides, before a single byte of DDL reaches production, whether a migration is allowed to advance. It belongs to the moment a pull request is opened and the moment a deploy is requested — the two points where a human is about to hand a schema change to the machine. A gate is a required, blocking status check: it inspects the proposed migration, runs a fixed set of safety assertions, and returns a hard pass or fail that the merge queue and the deploy job both honor. This page serves the platform and backend engineers who own the pipeline, the DBAs who must guarantee production survives every merge, and the reviewers who would otherwise have to remember every safety rule by hand. Gating exists so they don’t have to: the rules live in code, run on every change, and cannot be waved through.
The four checks that earn their place as blocking gates are backward-compatibility diffing, checksum verification, destructive-DDL detection, and lock-budget assertion. Each catches a distinct class of outage, and each is cheap enough to run on every pull request. This page sits under CI/CD & Migration Automation, which frames the whole pipeline as the unit of safety; gating is its first and cheapest defense. The expensive proof — applying the migration to a real snapshot and measuring its actual lock duration — belongs to Automated Migration Testing, and the contract for what happens when a gated deploy still goes wrong belongs to Rollback Automation.
Concept & Mechanism
A gate is a process, not a suggestion. Mechanically it is a CI job that exits non-zero on failure, registered with the host as a required status check so the merge queue or branch protection rule refuses to advance a head commit until the job is green. The job’s exit code is the entire contract; everything else is the assertion logic that produces it.
The four assertions divide along what they inspect. Backward-compatibility diffing compares the proposed schema against the live production schema and rejects any change that the currently-running application code cannot tolerate — a dropped column, a renamed column, a narrowed type, a new NOT NULL without a default. The mechanism is a structural diff, not a textual one: the tool reads the live catalog (via a read-only role) and the target model, computes the migration that would reconcile them, and classifies each step. This is the gate that enforces the same contract the Expand and Contract Methodology applies by hand — deploy additively, never break the version still in flight.
Checksum verification answers a different question: has anyone applied a migration to production outside the pipeline? Migration tools record a hash of each applied script in a history table (flyway_schema_history, Prisma’s _prisma_migrations, Liquibase’s DATABASECHANGELOG). The gate recomputes the hash of each repository script and compares it to the stored value. A mismatch means the script changed after it was applied, or a hand-edited hotfix bypassed the pipeline. This drift is the failure mode the Schema Version Control Basics cluster is built to prevent at the source.
Destructive-DDL detection is a static scan of the migration text for statements that lose data or rewrite tables under a heavy lock — DROP COLUMN, DROP TABLE, TRUNCATE, DELETE without a guard, or a type change that forces a full rewrite. On PostgreSQL the rewrite risk hides in ALTER COLUMN ... TYPE; on MySQL 8.0 it hides in any ALTER that the engine cannot run with ALGORITHM=INPLACE, LOCK=NONE. The scan does not decide intent — it flags the pattern and forces an explicit, reviewed override annotation, so a destructive change is never accidental.
Lock-budget assertion is the one gate that needs to actually touch a database. A static scan can tell you CREATE INDEX will lock, but only an execution against a representative table tells you for how long. The lightweight version of this check runs in the gate; the full version, with a restored production-size snapshot, lives in Automated Migration Testing. The distinction that drives both is engine behavior: PostgreSQL’s CREATE INDEX CONCURRENTLY and MySQL’s ALGORITHM=INPLACE builds avoid the exclusive lock, while their naive counterparts hold an ACCESS EXCLUSIVE or metadata lock for the duration of the build.
Prerequisites & Decision Criteria
Gating is worth wiring when migrations are frequent enough that human review misses things, and when the blast radius of a bad migration is an outage rather than an inconvenience. The checklist below decides whether each gate applies to your setup.
flyway_schema_history,_prisma_migrations, orDATABASECHANGELOG).rules/pathsfilter can target, so the gate runs only when a migration changed.ACCESS EXCLUSIVElock beyond 200 ms) so the assertion has a number to test against.
| Gate | What it reads | Blocks on | Cost to run |
|---|---|---|---|
| Backward-compat diff | Live catalog + target model | Drop, rename, narrow, new NOT NULL |
Low — one read-only query |
| Checksum verification | History table + repo scripts | Hash mismatch vs applied script | Low — string compare |
| Destructive-DDL scan | Migration text only | DROP/TRUNCATE/rewrite pattern |
Trivial — static parse |
| Lock-budget assertion | Ephemeral DB + migration | Lock held beyond budget | Medium — needs a DB |
If you adopt only one gate, make it backward-compatibility diffing — it catches the highest-severity, hardest-to-spot class of failure. If you adopt two, add checksum verification, because hand-applied drift silently invalidates every other check.
Step-by-Step Procedure
The procedure wires all four gates as a single required status check. Build it incrementally and turn each gate to blocking only after it runs clean on a few real pull requests.
1. Provision a read-only diff role. The backward-compatibility gate must read the live schema without any write capability.
-- PostgreSQL · run as a superuser/owner once · grants are persistent, review before production
CREATE ROLE ci_diff LOGIN PASSWORD 'rotate-me';
GRANT CONNECT ON DATABASE app_production TO ci_diff;
GRANT USAGE ON SCHEMA public TO ci_diff;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO ci_diff;
-- No INSERT/UPDATE/DELETE/CREATE — the gate reads the catalog, never writes.
Verify before proceeding: connect as ci_diff and confirm CREATE TABLE is denied.
2. Run the backward-compatibility diff and classify it. Generate the reconciling migration and fail on any breaking step.
# Context: CI job, read-only PROD_READONLY_URL, no production writes; non-zero exit blocks the PR.
npx prisma migrate diff \
--from-url "$PROD_READONLY_URL" \
--to-schema-datamodel prisma/schema.prisma \
--script > /tmp/diff.sql
./scripts/assert-no-destructive-ddl.sh /tmp/diff.sql # exits 1 on DROP/RENAME/narrowing
Verify before proceeding: feed the script a deliberately breaking diff (a DROP COLUMN) and confirm it exits non-zero.
3. Verify checksums against the live history table. Catch out-of-band drift before any new migration deploys.
# Context: CI job, read-only history access; fails when production diverges from the repo.
# Flyway exits 1 on a checksum mismatch between the repo script and the applied row.
flyway -url="$PROD_READONLY_JDBC" -user="$CI_DIFF_USER" validate
Verify before proceeding: edit an already-applied migration file locally and confirm validate reports the mismatch.
4. Scan for destructive DDL. A static parse is enough; it never connects to a database.
# Context: pure static analysis on migration files; no DB connection; fast and deterministic.
# Flags DROP COLUMN, DROP TABLE, TRUNCATE, unguarded DELETE, and rewriting ALTER ... TYPE.
./bin/assert-safe-down --dir migrations/ --require-override-label
5. Assert the lock budget on an ephemeral database. Apply the migration to a throwaway container and measure.
# .github/workflows/migration-gate.yml — runs on every PR that touches migrations/
# Context: ephemeral Postgres service container; disposable; never touches production.
- name: Lock-budget assertion
run: |
./bin/migrate up --database "$CI_DB_URL"
./bin/assert-lock-budget --database "$CI_DB_URL" --max-exclusive-ms 200
6. Register the job as a required check. Wiring it as required is what converts the assertions into a gate.
# GitHub branch protection (via API or settings) — the job name must be required.
# Context: configuration, not runtime; without this, a red check is only advisory.
required_status_checks:
strict: true
contexts: [ "migration-gate" ] # exact job name; merge is blocked until green
Verify before proceeding: open a pull request with a deliberately unsafe migration and confirm the merge button is disabled.
Verification & Observability
After wiring the gates, confirm they fire on real conditions and that you can see why a gate failed. The lock-budget gate in particular needs introspection into what the migration actually held.
On PostgreSQL, inspect the locks a migration acquires while it runs against the ephemeral database:
-- PostgreSQL · run in a second session against the CI database while the migration runs
-- Context: read-only diagnostics; safe anywhere; shows the exact lock mode and target.
SELECT a.pid, l.mode, l.relation::regclass AS target, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.mode = 'AccessExclusiveLock' AND l.granted;
On MySQL 8.0, confirm an ALTER ran online rather than copying the table:
-- MySQL 8.0 · run against the CI database during the migration
-- Context: read-only; the metadata_locks view requires performance_schema enabled.
SELECT object_name, lock_type, lock_status
FROM performance_schema.metadata_locks
WHERE object_schema = DATABASE() AND lock_status = 'GRANTED';
Treat the gate results as observable pipeline data, not just pass/fail noise. Confirm the following hold:
DROP COLUMNis blocked by the destructive-DDL scan, not merely warned.NOT NULLcolumn without a default is blocked by the backward-compat diff.migrations/changed, keeping unrelated pull requests fast.
The lock-budget assertion is intentionally the lighter cousin of the full snapshot test. When a migration passes the gate but you want certainty against production-scale data and replication, promote it to the deeper proof described in Automated Migration Testing.
Rollback Path
Gating runs before anything reaches production, so its “rollback” is reverting the gate itself — backing out a check that turned out to be too strict or flaky and blocked legitimate work. This must be reversible without surgery on production.
Roll back a gate by demoting it from required to advisory, never by deleting the assertion. Demotion keeps the signal visible while unblocking the queue:
# GitHub branch protection — temporary demotion while you fix a flaky gate
# Context: configuration change; reversible; the check still runs and reports.
required_status_checks:
strict: true
contexts: [ ] # migration-gate removed from required, still runs and reports
The safe conditions for demoting a gate: the failure is a false positive (the gate flagged a change that is provably safe), or the gate is non-deterministic (it passes and fails on identical input, usually because the ephemeral database state varies). In both cases, demote, fix the assertion, prove it deterministic on three identical runs, then re-promote.
Never roll back a gate by adding a blanket override that disables it for all future migrations — that converts a temporary fix into a permanent hole. Use the per-change override label instead, which forces a human approval on exactly the one migration that needs it. If a migration has already merged and deployed because a gate was wrongly demoted, the recovery is no longer a gating concern; it moves to Rollback Automation, which restores the previous application image while leaving the schema expanded.
Common Errors & Fixes
ERROR: permission denied for table ... during the diff step. The ci_diff role lacks SELECT on a table added after the grant was issued. Root cause: GRANT SELECT ON ALL TABLES is not retroactive to future tables. Fix by adding ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO ci_diff; so new tables are readable automatically.
Validate failed: Migration checksum mismatch for version 42. A migration file was edited after it was applied to production. Root cause: someone changed an already-run script instead of writing a new one. Fix by reverting the file to its applied form and expressing the intended change as a new migration; never edit applied history, a rule the Schema Version Control Basics cluster makes concrete.
The gate passes locally but fails in CI with a different diff. The local diff ran against a stale or seeded database, not the live production catalog. Root cause: the diff source differs between environments. Fix by always diffing against $PROD_READONLY_URL in CI and pinning the same migration-tool version locally and in the pipeline.
Destructive-DDL scan flags a legitimate, reviewed DROP. A contract phase of an expand-and-contract change genuinely needs to drop a now-unused column. Root cause: the scan cannot read intent. Fix by attaching the override label and recording the approval, so the drop is explicit and traceable rather than silently allowed — the proper end of the Expand and Contract Methodology lifecycle.
Lock-budget assertion is green in CI but the migration stalls in production. The ephemeral database held a fraction of production’s rows, so the index build was instant in CI and slow in production. Root cause: the gate’s database was not representative. Fix by promoting lock measurement to a restored snapshot, as covered in Automated Migration Testing.
Child Page Index
This section drills into the two gates engineers most often have to wire from scratch. Blocking Deploys on Failed Migration Dry Runs covers turning the dry-run apply into a hard, blocking deploy gate — running the migration against a disposable database, capturing the failure, and wiring the non-zero exit so a broken script can never reach production. Enforcing Backward-Compatibility Checks in Pull Requests covers the diff-and-classify gate in depth — reading the live catalog, computing the reconciling migration, and failing the pull request on any change the running application cannot tolerate.
For the layer beneath these gates, return to CI/CD & Migration Automation for the full pipeline contract, and read Automated Migration Testing for the production-scale proofs that confirm what the gates only sample.
Frequently Asked Questions
Should gating run on every pull request or only on the release branch? On every pull request that touches a migration. Catching a destructive or backward-incompatible change at review time is far cheaper than catching it at release, when the change is entangled with other work and the merge queue is already moving. Filter the job by path so pull requests that touch no migration stay fast and the gate runs only when it has something to inspect.
How is the lock-budget gate different from the snapshot test? The gate is a fast, lightweight sample: it applies the migration to a small ephemeral database to catch obvious locking mistakes and exits in seconds. The snapshot test in Automated Migration Testing restores a production-size database to measure the actual lock and replication-lag behavior under realistic data volume. Use the gate for fast feedback on every pull request and the snapshot test for certainty before a high-risk deploy.
What do I do when a gate blocks a migration I know is safe? Use the per-change override label, never a global disable. The label forces an explicit human approval recorded against that one migration, so the gate stays strict for everything else. If you find yourself overriding the same gate repeatedly, the assertion is miscalibrated — tighten or relax the rule rather than routing around it.