Automated Migration Testing

Migration testing is the proof step that sits between the cheap gates of the prepare phase and the irreversible act of applying DDL to production. Its job is to answer a question no static check can: what will this migration actually do to a database that looks like production? That means applying the migration to a restored, production-like snapshot, measuring how long it holds locks, measuring how far it pushes replication lag, and failing the build when either exceeds budget. This page serves the platform engineers who build the test harness, the DBAs who set the lock and lag budgets, and the application engineers who write migrations and need fast, trustworthy feedback that a change is safe before they ship it.

Where Migration Pipeline Gating samples safety with fast static checks and a small ephemeral database, testing buys certainty by paying the cost of realistic data volume and a real replica. The two phases are complementary: the gate runs on every pull request in seconds; the full snapshot test runs on the high-risk migrations where you need to know the lock duration to the millisecond. This page sits under CI/CD & Migration Automation, and what it proves safe is what Rollback Automation is the contingency for when production behaves unlike even the best snapshot.

The migration test harness A production snapshot is restored into an ephemeral database with a replica, the migration is applied under synthetic write load, and the harness asserts both the lock-duration budget and the replication-lag budget before reporting pass or fail. Proving a Migration on a Production-Like Snapshot Production snapshot restore Synthetic write load Apply migration primary + replica ephemeral container Assert lock duration budget Assert replica lag budget A red budget assertion fails the build before the migration can reach production.
Restore a realistic snapshot, apply the migration under load on a primary-and-replica pair, then assert both budgets before the change is allowed to ship.

Concept & Mechanism

A migration test is an experiment with a controlled variable. The variable is the migration; the controlled environment is a database whose data shape, row counts, and index sizes resemble production closely enough that the migration’s locking and lag behavior transfers. The output is two measurements compared against two budgets: how long the migration held a blocking lock, and how far it pushed a replica behind. If either breaks budget, the test fails and the migration does not ship.

The environment is built from a restored snapshot, not a seeded fixture. A handful of seeded rows tells you the migration is syntactically valid; it tells you nothing about how long CREATE INDEX runs against fifty million rows or whether ALTER TABLE triggers a full copy. The harness restores a recent production backup — anonymized so no real personal data lands in CI — into an ephemeral container, runs the migration, and tears the container down. Because the snapshot carries real row counts and data distribution, the lock the migration takes is the lock production will take.

Lock measurement depends on engine behavior. On PostgreSQL, CREATE INDEX CONCURRENTLY builds without an ACCESS EXCLUSIVE lock, while a plain CREATE INDEX holds one for the whole build; ALTER TABLE ... ADD COLUMN with a constant default is metadata-only on PG 11+, while ALTER COLUMN ... TYPE rewrites the table under an exclusive lock. On MySQL 8.0, the test must confirm the ALTER ran with ALGORITHM=INPLACE, LOCK=NONE; a silent fall back to ALGORITHM=COPY rebuilds the table and blocks writes. The harness captures the lock mode and duration directly from the engine, so the assertion tests the real behavior rather than a guess. This is the engine split the Transactional vs Non-Transactional Databases cluster details.

Replication-lag testing requires a replica in the test topology. A migration that is instant on the primary can still saturate a replica if it generates a flood of WAL or binlog the follower must replay. The harness attaches a streaming replica to the ephemeral primary, applies the migration, and samples the lag. This is the same budget Backfill Optimization tunes against — a backfill batch size that keeps the primary fast but pushes the slowest replica past its budget is a failure the test must catch before production does.

Locking-regression detection is the test made permanent. A migration that was online last quarter can become a table rewrite when someone changes a column type, and a CI test that asserts the lock budget on every migration turns that regression into a red build instead of an outage. The harness keeps the lock measurement in the test suite, so the safety property is re-verified on every change rather than checked once and forgotten.

Prerequisites & Decision Criteria

Snapshot testing costs more than gating — it needs a restore pipeline, an anonymizer, and CI runners large enough to hold production-scale data. The checklist decides whether a given migration earns the full test or only the lightweight gate.

Migration class Lightweight gate suffices Full snapshot test warranted
Add nullable column, constant default Yes Only if the table is huge
CREATE INDEX on a large table No Yes — measure real build time and lock
ALTER COLUMN ... TYPE (potential rewrite) No Yes — confirm online vs copy
Backfill touching millions of rows No Yes — measure replica lag under load
Drop an unused column (contract phase) Gate + override Optional

Reserve the full snapshot test for migrations whose lock or lag behavior depends on data volume. For trivial, metadata-only changes the gate’s fast sample is enough, and forcing every change through a slow restore only trains engineers to ignore the test. The principle of representative data parity is the same one the Environment Parity Strategies cluster develops for staging.

Step-by-Step Procedure

This procedure builds a reusable test job that restores a snapshot, applies the migration under load, and asserts both budgets. Stand it up once and parameterize it per migration.

1. Restore an anonymized snapshot into an ephemeral database. The restore is the foundation; everything downstream measures against it.

# Context: CI job, ephemeral container, disposable; the snapshot is pre-anonymized — no real PII in CI.
./bin/restore-snapshot --source s3://backups/prod/latest.dump --into "$CI_DB_URL"
./bin/anonymize --database "$CI_DB_URL" --rules config/anonymize.yml   # belt-and-suspenders

Verify before proceeding: confirm row counts in the restored database are within an order of magnitude of production for the affected tables.

2. Attach a replica so lag can be measured. Without a follower there is no lag to assert against.

# Context: ephemeral primary + replica pair, both disposable; never connected to production.
./bin/spawn-replica --primary "$CI_DB_URL" --replica "$CI_REPLICA_URL" --wait-for-sync

3. Start synthetic write load. Test the migration under concurrency, the condition that actually causes lock waits.

# Context: background load generator against the ephemeral primary; bounded; killed at teardown.
./bin/loadgen --database "$CI_DB_URL" --writes-per-second 500 --tables orders,line_items &
LOAD_PID=$!

4. Apply the migration while sampling locks. Capture the lock mode and duration as the migration runs.

-- PostgreSQL · run as the migration role against the ephemeral primary · sampled by a second session
-- Context: ephemeral DB only; CREATE INDEX CONCURRENTLY must run OUTSIDE a transaction.
SET lock_timeout = '5s';
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_region ON orders (region_code);

Verify before proceeding: the lock sampler should report no AccessExclusiveLock on orders for a concurrent build.

5. Assert the lock-duration budget. Fail the build if the migration held a blocking lock too long.

# Context: reads the sampled lock log from step 4; non-zero exit fails the build.
./bin/assert-lock-budget --log /tmp/locks.json --max-exclusive-ms 200

6. Assert the replication-lag budget. Compare the peak lag observed on the replica to the budget.

# Context: queries the ephemeral replica's lag metric; non-zero exit fails the build.
./bin/assert-lag-budget --replica "$CI_REPLICA_URL" --max-lag-seconds 2
kill "$LOAD_PID"   # stop synthetic load; teardown destroys both databases

Verify before proceeding: introduce a deliberate plain CREATE INDEX and confirm the lock-budget assertion fails, proving the test has teeth.

Verification & Observability

The test is only trustworthy if you can see the lock and lag it measured. Capture the engine’s own view of what happened during the migration.

On PostgreSQL, sample the active blocking locks from a second session while the migration runs:

-- PostgreSQL · run in a parallel session against the ephemeral primary
-- Context: read-only diagnostics; safe anywhere; identifies the exact lock mode and waiting queries.
SELECT a.pid, a.state, l.mode, l.relation::regclass AS target, now() - a.query_start AS held
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.relation = 'orders'::regclass AND l.granted
ORDER BY held DESC;

On MySQL 8.0, confirm the ALTER chose an online algorithm and read the lag from the replica:

-- MySQL 8.0 · first against the primary, then against the replica
-- Context: read-only; SHOW ENGINE INNODB STATUS needs PROCESS privilege; lag is in Seconds_Behind_Source.
SHOW ENGINE INNODB STATUS;          -- inspect the TRANSACTIONS section for the ALTER's lock waits
SHOW REPLICA STATUS;                -- read Seconds_Behind_Source on the replica

Confirm the harness behaves as designed:

  • CREATE INDEX is substituted for the concurrent form.

Treat a failing budget as data, not noise: the job log should record the measured lock duration and peak lag so an engineer can see by how much a migration missed budget and tune accordingly. When the measurements are clean, the migration is cleared for the deploy contract in CI/CD & Migration Automation.

Rollback Path

A migration test runs entirely on disposable infrastructure, so its rollback is reverting a test that became unreliable — a flaky budget assertion or a snapshot pipeline that broke and is failing every build for the wrong reason. The reversal must not silently disable the safety property.

When a test goes flaky, demote it from blocking to reporting while you stabilize it, and keep the measurement visible:

# CI config — temporary demotion of an unstable test, never deletion
# Context: configuration change; reversible; the test still runs and records its measurements.
migration_snapshot_test:
  allow_failure: true   # reports lock/lag but does not block while you fix flakiness

The safe conditions for demoting the test: the snapshot restore itself is failing (an infrastructure problem unrelated to the migration), or the budget assertion is non-deterministic because the ephemeral environment’s resources vary run to run. Stabilize by pinning runner size, restoring a fixed snapshot version, and confirming three identical runs produce the same measurement before re-promoting to blocking.

Never roll back a flaky test by widening the budget to make it pass — that hides the regression the test exists to catch. If a migration that the test cleared still misbehaves in production, the recovery is no longer a testing concern; it moves to Rollback Automation, which restores the previous application image while leaving the expanded schema in place, since the additive migration itself is safe to keep.

Common Errors & Fixes

The test passes in CI but the migration locks in production. The snapshot was too small, so an index build that was instant on a thousand rows took minutes on fifty million. Root cause: non-representative data volume. Fix by restoring full-size or a statistically faithful sample, and asserting the row count is representative before trusting the result.

ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block. The harness wrapped the migration step in a transaction, but PostgreSQL forbids concurrent index builds inside one. Root cause: the test runner auto-wrapped statements. Fix by running concurrent builds as standalone statements outside any transaction, the same constraint the Idempotent Script Design cluster bakes into its scripts.

Replication-lag assertion never fires even on a heavy migration. No replica is actually attached, or it is synced before load starts and never stressed. Root cause: the test topology has no real follower under load. Fix by spawning a streaming replica and applying synthetic write load during the migration so the lag is real.

MySQL ALTER is slow in the test but the gate said it was online. The engine silently fell back to ALGORITHM=COPY because the requested INPLACE algorithm was unsupported for that change. Root cause: a static check cannot detect the fallback; only execution does. Fix by asserting the chosen algorithm explicitly with ALTER TABLE ... ALGORITHM=INPLACE, LOCK=NONE so the migration fails fast rather than copying.

Anonymized snapshot loses the data skew that caused the lock. Over-aggressive anonymization flattened a skewed column to a single value, so an index build that struggles with real distribution looked trivial. Root cause: anonymization changed the data shape the lock depends on. Fix by preserving cardinality and distribution during anonymization, a parity concern the Environment Parity Strategies cluster addresses.

Child Page Index

This section breaks into the two proofs engineers most often have to build. Testing Migrations Against Production-Like Snapshots covers the restore-and-apply harness end to end — sourcing a backup, anonymizing it without destroying data skew, restoring it into an ephemeral container, and applying the migration under realistic volume. Catching Table Lock Regressions in Migration Tests covers making lock measurement a permanent part of the test suite — sampling the engine’s lock view, asserting the duration budget, and turning a previously-online migration that became a table rewrite into a red build.

For the gates that run before these tests, return to Migration Pipeline Gating, and for the full pipeline this proof step lives inside, read CI/CD & Migration Automation.

Frequently Asked Questions

Do I need a full-size production snapshot, or is a sample enough? A sample is enough as long as it preserves the row counts, data distribution, and index sizes that drive lock and lag behavior on the tables your migration touches. A tiny seeded fixture is not enough — it proves syntax, not safety. When you cannot restore full size, take a statistically faithful sample of the affected tables and assert the row count is representative before trusting the measurement.

How do I keep real personal data out of CI when testing against production snapshots? Anonymize during the restore pipeline, before the data lands in the CI database, and preserve cardinality and distribution so the anonymized data still reproduces the migration’s locking behavior. Strip or hash personal columns but keep the data shape intact; flattening a skewed column to a constant can hide the exact lock you are trying to measure.

Why test replication lag if the migration is instant on the primary? Because a change that is metadata-only on the primary can still generate a flood of WAL or binlog that a replica must replay, pushing it past its lag budget and breaking read-after-write guarantees. Backfills are the classic case. Attaching a replica and measuring lag under load is the only way to catch this before production, and it is the same budget that governs batch sizing in backfill work.