Database Migration Fundamentals & Tool Selection
A schema migration is the riskiest line of code most teams ship, because it runs once, against live state, and a mistake is rarely a clean exception — it is a held lock, a half-rewritten table, or a replica drifting away from its primary. This section establishes the foundations every later technique depends on: what makes a migration safe, how PostgreSQL and MySQL 8.0 differ in the guarantees they offer, and which tooling enforces those guarantees instead of hiding them. It serves backend and platform engineers who write the DDL, DBAs who must keep the production database available through every release, and DevOps teams wiring migration steps into a deployment pipeline.
The discipline here is foundational rather than advanced. Before you can run an expand-and-contract methodology on a high-traffic table or wire a migration pipeline gate that blocks an unsafe deploy, you need migrations that are deterministic, re-runnable, and lock-aware. Get the fundamentals right and the harder patterns become mechanical; get them wrong and no amount of automation downstream will save a deploy that rewrites a 200-million-row table under an ACCESS EXCLUSIVE lock at peak traffic.
Core Principles
Four invariants hold across every safe migration, regardless of engine or tool. The rest of this section is application of them.
Operational safety supersedes deployment velocity. A migration that ships an hour late and holds no exclusive lock is a success; one that ships on time and rewrites a hot table is an outage. Every schema change must remain backward and forward compatible during the rolling window where application version N and version N+1 both run against the live database.
State is tracked deterministically, never inferred. Every node must apply an identical, ordered sequence of DDL, recorded in a version ledger with checksums. This is the contract that schema version control basics establishes — without it, a migration that “looks applied” can silently diverge between primary and replica.
Every script is idempotent and re-runnable. Pipelines retry, and a step that fails after a partial apply must be safe to run again. Following idempotent script design with IF NOT EXISTS guards and conflict-tolerant writes means a half-applied migration converges to the correct state on its next run instead of throwing duplicate column and blocking the deploy.
Rollback is a forward contract, not a DROP. Define the reversal path before execution begins. Destructive operations — DROP COLUMN, a restrictive ALTER TABLE ... SET NOT NULL — never run under load without a verified, non-destructive reversal. Reversal disables the new code path and leaves the expanded schema in place; it does not destroy data the backfill produced.
Phase-by-phase Overview
Every migration moves through four forward phases, each with a single job and a gate that must be green before the next begins.
Prepare — generate the migration, lint it, and assert backward compatibility against the live schema on an environment that mirrors production. Surfacing lock contention and index build time here, before merge, is the entire purpose of environment parity strategies.
# Shell · CI dry-run gate · read-only DB role, no production writes
# Fails the PR check if the generated DDL diverges from the expected diff.
flyway -url=jdbc:postgresql://staging-db:5432/app \
-user=deploy -password="$DB_PASS" \
-dryRunOutput=/tmp/migration-dry-run.sql migrate
./scripts/assert-no-destructive-ddl.sh /tmp/migration-dry-run.sql
Deploy — apply the migration as a discrete, additive, forward-only step that runs before the new application image rolls out. Cap how long any statement may block with an explicit lock_timeout.
-- PostgreSQL · run as the migration role · must run at a low-write window
-- lock_timeout caps how long the ALTER may wait before failing fast.
BEGIN;
SET LOCAL lock_timeout = '3s';
ALTER TABLE orders ADD COLUMN IF NOT EXISTS fulfillment_status VARCHAR(50) DEFAULT 'pending';
COMMIT;
-- CREATE INDEX CONCURRENTLY must run OUTSIDE this transaction, as its own step.
Backfill — populate the new column in throttled, idempotent batches after the schema is live, halting if the slowest replica falls behind. Long transactions cause replication lag and lock escalation, so each batch commits independently.
# Python · post-deploy worker · safe to re-run · halts if replica lag is high
# Run as a separate job, NOT inline with the deploy step.
import psycopg2
from psycopg2.extras import execute_batch
def backfill_chunk(cur, last_id, batch_size=1000):
cur.execute(
"SELECT id FROM orders WHERE fulfillment_status IS NULL "
"AND id > %s ORDER BY id LIMIT %s",
(last_id, batch_size),
)
rows = cur.fetchall()
if not rows:
return None
execute_batch(
cur,
"UPDATE orders SET fulfillment_status = 'legacy_pending' WHERE id = %s",
[(r[0],) for r in rows],
)
return rows[-1][0] # cursor: last processed id
conn = psycopg2.connect(dsn)
last_id = 0
with conn.cursor() as cur:
while last_id is not None:
last_id = backfill_chunk(cur, last_id)
conn.commit() # commit per chunk to bound replication lag
conn.close()
Validate — gate promotion on schema-aware health checks and a zero-drift assertion; only after convergence do you schedule the contract phase. Reversal here is forward-only, renaming for audit rather than dropping.
-- PostgreSQL · run as the migration role · safe under load (metadata-only rename)
-- Application code must already have stopped writing this column.
BEGIN;
SET LOCAL lock_timeout = '3s';
ALTER TABLE orders RENAME COLUMN old_status TO _deprecated_old_status;
COMMIT;
-- Schedule the DROP for a later release, after confirming zero references.
Tool & Database Matrix
The engine decides what your tooling can promise. The matrix below drives how aggressive each phase can be, and the deeper split lives in transactional vs non-transactional databases.
| Capability | PostgreSQL | MySQL 8.0 (InnoDB) | Consequence for the migration |
|---|---|---|---|
| Transactional DDL | Yes (except CREATE INDEX CONCURRENTLY) |
No — each DDL forces an implicit commit | On MySQL you cannot wrap a multi-statement migration in one transaction; each step needs its own recovery plan |
| Online column add | Metadata-only, no rewrite (PG 11+) | ALGORITHM=INSTANT (8.0.12+) |
Assert the add is metadata-only, not a full table rewrite |
| Online index build | CREATE INDEX CONCURRENTLY |
ALGORITHM=INPLACE, LOCK=NONE |
Measure lock duration against a budget before promoting |
| Lock-wait control | SET lock_timeout |
innodb_lock_wait_timeout / lock_wait_timeout |
Cap how long a migration may block before it fails fast |
| Failed-step recovery | Statement rolls back atomically | Partial DDL may be left applied | MySQL scripts must be idempotent and re-entrant to recover |
| Auto-generated vs raw SQL | ORM hides CONCURRENTLY, lock class |
ORM hides ALGORITHM/LOCK choice |
Manage structural DDL as versioned raw SQL; reserve ORMs for data access |
The practical takeaway: a pipeline that assumes PostgreSQL’s transactional DDL will leave a MySQL database half-migrated on the first failed step. Pick the tool that exposes lock behavior rather than the one that hides it — the trade-offs are laid out in the migration tool comparison.
CI/CD Integration Pattern
The cheapest gate catches the most outages: a required, blocking status check that refuses to merge a destructive or non-online migration. Everything downstream assumes this check passed.
# .gitlab-ci.yml — migration safety as a required, blocking stage
# Context: runs against a throwaway snapshot DB; never touches production.
migration_safety:
stage: verify
rules:
- changes: [ "migrations/**/*" ] # only when a migration changed
script:
- ./bin/restore-snapshot --into ci_db # production-like state
- ./bin/migrate up --database ci_db # applies cleanly?
- ./bin/assert-lock-budget --database ci_db --max-exclusive-ms 200
- ./bin/assert-safe-down --dir migrations/ # no DROP/TRUNCATE/DELETE
allow_failure: false # blocks the merge train
Wire this so a red result cannot be overridden by a merge. The full set of pull-request and pre-deploy checks — checksum verification, backward-compatibility diffs, lock-budget assertions — is built out in migration pipeline gating.
Failure Modes & Rollback Contract
Migrations fail in a small set of characteristic ways. Naming each is how you build the gate that catches it.
- Checksum drift — an applied migration differs from the repository version. Root cause: a hotfix run by hand outside the pipeline.
- Lock timeout — the
lock_timeoutfires because a long-running query holds the table. Root cause: the migration ran during peak traffic, not a low-write window. - Full-table rewrite — an “add column” silently copies the whole table. Root cause: a non-
INSTANTdefault on MySQL, or avolatiledefault on older PostgreSQL. - Replication lag spike — an unthrottled backfill outruns the replica’s apply rate. Root cause: batch size tuned for the primary, not the slowest follower.
- Backward-incompatible deploy — new code ships expecting a column the gate failed to flag. Root cause: a rename disguised as an add-plus-drop across two migrations.
- Destructive rollback — an automated
downrunsDROP COLUMNand loses data the backfill produced. Root cause: treating rollback as schema reversal instead of path disablement.
The contract that prevents the last two: deploys are additive and forward-only, and reversal restores the previous application image while leaving the schema expanded. Prefer a forward migration over a destructive REVERT so the audit trail stays intact.
What This Section Covers
This section is five guides, one per fundamental. Environment parity strategies covers making staging mirror production closely enough that lock contention and index-build time surface before merge, including seeding anonymized production data so volume distributions match. Idempotent script design covers writing DDL and data backfills that converge to the same state no matter how many times a retrying pipeline runs them. Migration tool comparison weighs runners like Flyway and Liquibase on checksum verification, transactional behavior, and how much lock detail they expose. Schema version control basics covers ordering, branching, and conflict resolution so two engineers merging migrations never produce a non-deterministic apply order. Transactional vs non-transactional databases covers the implicit-commit behavior of MySQL DDL and how to make each step independently recoverable when atomic rollback is unavailable.
Frequently Asked Questions
Should structural schema changes go through an ORM’s auto-generated migrations or raw SQL?
Use the ORM for application data access, but manage structural DDL as versioned, reviewed raw SQL. Auto-generated migrations accelerate development at the cost of hiding the two things that decide whether a deploy is safe: the lock class of each statement (ALGORITHM=INPLACE versus a COPY rewrite) and whether the engine commits implicitly. When the lock behavior is invisible, you cannot gate on it.
Why prefer a forward migration over a down/REVERT script for rollback?
Because a forward, additive reversal preserves data and audit history, while a destructive down can lose data the backfill produced and leaves a gap in the version ledger. Rename a deprecated column to a _deprecated_ prefix and schedule the DROP for a later release once you have confirmed zero references, rather than dropping it inline during an incident.
How long must a migration stay backward compatible? For the entire rolling-deploy window where the previous application version is still serving traffic — in practice at least one full release cycle, often two to four weeks if you run canaries. During that window both the old and new schema shapes must be queryable, which is why additive changes deploy first and destructive contraction comes last.