Expand and Contract Methodology

The Expand and Contract Methodology is the operational standard for executing zero-downtime database migrations. By separating additive and destructive schema changes into discrete, reversible phases, teams avoid long-held table locks and application downtime. This workflow builds on established Zero-Downtime Schema Evolution Patterns to preserve backward compatibility across deployment cycles.

Phase 1: Expand (Additive Schema Changes)

Introduce new columns, tables, or indexes while preserving legacy structures. The database must simultaneously support old and new application versions.

Execution Protocol: Wrap DDL in explicit transactions where supported. Add new columns as nullable (or with explicit defaults) so legacy writes cannot fail. Defer constraint validation on large tables to avoid full-table scans.

BEGIN;
ALTER TABLE orders ADD COLUMN fulfillment_status VARCHAR(50) DEFAULT NULL;
-- PostgreSQL: NOT VALID skips the full-table scan at ADD CONSTRAINT time;
-- the check still applies to subsequent INSERTs and UPDATEs
ALTER TABLE orders ADD CONSTRAINT chk_fulfillment
  CHECK (fulfillment_status IN ('pending', 'shipped', 'delivered')) NOT VALID;
COMMIT;

Dry-Run & Validation

  • Schema diff: pg_dump --schema-only --no-owner -f expand_phase.sql
  • Migration tool dry-run (Flyway Teams): flyway migrate -dryRunOutput=expand_phase.sql
  • Query plan baseline: EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE fulfillment_status = 'pending';

Environment Context

Environment | Execution Policy | Validation Scope
Dev | Immediate DDL application | ORM mapping sync, unit test coverage
Staging | Load-test with production data volume | Connection pool saturation, index bloat simulation
Prod | Deploy during low-write windows | Replication lag < 500 ms, zero write errors

Compatibility Window: V1 (legacy) and V2 (new) application binaries must coexist. Active from Phase 1 deployment until the Phase 2 traffic shift completes.
Forward Path: Deploy V2 application code that reads and writes the new column.
Rollback Path: Revert to the V1 application binary. If the DDL itself must be reverted, execute ALTER TABLE orders DROP COLUMN fulfillment_status; only after confirming zero writes to the column post-deploy.
For high-throughput systems, consult Implementing Expand-Contract for High-Traffic Tables to configure connection pooling and query routing before applying DDL.
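During the compatibility window, V2 code must tolerate rows the Phase 2 backfill has not yet reached. A minimal sketch in Python; the row shape, function name, and fallback value are illustrative assumptions, not part of any specific codebase:

```python
# Hypothetical V2 read path: tolerate rows the backfill has not reached yet.
LEGACY_DEFAULT = "pending"  # assumption: unmapped legacy rows read as pending


def effective_status(row: dict) -> str:
    """Return fulfillment_status, falling back to a legacy default
    when the new column is still NULL for this row."""
    status = row.get("fulfillment_status")
    return status if status is not None else LEGACY_DEFAULT
```

Keeping the fallback in one function makes it trivial to delete during Phase 3, when every row is guaranteed to carry a value.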

Phase 2: Data Synchronization & Traffic Routing

Backfill historical records, establish parallel write paths, and incrementally shift read traffic.

Execution Protocol: Implement dual-write logic at the application layer. Backfill using chunked, idempotent queries to prevent long-running transactions and lock escalation.

-- Chunked backfill (PostgreSQL; DO blocks are PostgreSQL-only).
-- Run outside an explicit transaction so the per-batch COMMIT is
-- permitted (PostgreSQL 11+).
DO $$
DECLARE
  batch_size INT := 1000;
  last_id BIGINT := 0;
  max_id BIGINT;
BEGIN
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM orders;
  WHILE last_id < max_id LOOP
    -- 'pending' satisfies chk_fulfillment, which is enforced on UPDATE
    -- even while the constraint is NOT VALID
    UPDATE orders
    SET fulfillment_status = 'pending'
    WHERE id > last_id AND id <= last_id + batch_size
      AND fulfillment_status IS NULL;

    COMMIT; -- commit per batch to release locks
    last_id := last_id + batch_size;
  END LOOP;
END $$;
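The dual-write logic named in the execution protocol can be sketched at the application layer. This is an illustration under assumptions, not a specific driver API: `db.begin`, `db.execute`, and the V1-to-V2 status mapping are hypothetical.

```python
def map_legacy(legacy_status: str) -> str:
    """Map a hypothetical V1 status vocabulary onto the new
    fulfillment_status values allowed by chk_fulfillment."""
    mapping = {"open": "pending", "sent": "shipped", "done": "delivered"}
    return mapping.get(legacy_status, "pending")


def write_order_status(db, order_id: int, legacy_status: str) -> None:
    """Dual-write: update the legacy column (read by V1) and the new
    column (read by V2) in one transaction, so neither read path can
    observe a half-written row."""
    with db.begin():  # assumption: context manager opens a transaction
        db.execute(
            "UPDATE orders SET legacy_status = %s, fulfillment_status = %s "
            "WHERE id = %s",
            (legacy_status, map_legacy(legacy_status), order_id),
        )
```

Writing both columns in a single UPDATE keeps the dual-write atomic at the row level, which is what makes the Phase 2 checksum comparison meaningful.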

Dry-Run & Validation

  • Checksum validation (deterministic 5% sample): SELECT COUNT(*), MD5(STRING_AGG(fulfillment_status::text, '' ORDER BY id)) FROM orders WHERE id % 20 = 0;
  • Feature flag toggle test: curl -X POST /api/config/flags/dual-write-orders -d '{"enabled": true}'

Environment Context

Environment | Execution Policy | Validation Scope
Dev | Mock traffic routing | Feature flag toggle latency, dual-write mock
Staging | Chaos testing, network partition simulation | Replication lag, conflict resolution logic
Prod | Gradual rollout (5% → 25% → 50% → 100%) | Error rate < 0.1%, p99 latency stable

Compatibility Window: Both legacy and new read/write endpoints remain active; the traffic split is governed by feature flags. Duration: until 100% read routing is verified stable.
Forward Path: Shift 100% of read traffic to the new schema, then disable legacy read paths.
Rollback Path: Toggle feature flags back to legacy endpoints immediately, halt backfill jobs, and revert the application to V1.
Implement Dual-Write Synchronization to guarantee consistency during the transition window. Use Feature Flag Rollouts to shift read traffic to the new schema gradually while monitoring replication lag and error rates.
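One common way to implement the gradual 5% → 25% → 50% → 100% read shift is deterministic hash bucketing, so the same caller is routed consistently as the percentage ramps. A sketch, assuming a per-user identifier and a flag store that supplies `rollout_pct` (neither is specified by this runbook):

```python
import hashlib


def reads_new_schema(user_id: str, rollout_pct: int) -> bool:
    """Bucket each caller into 0-99 via a stable hash. A user routed
    to the new schema at 25% stays on it at 50% and 100%, so ramping
    up never flips users back and forth between read paths."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_pct
```

Because the bucket is a pure function of the user id, the routing decision needs no shared state and behaves identically across application instances.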

Phase 3: Contract (Destructive Cleanup)

Decommission legacy columns, tables, and synchronization logic only after verifying zero anomalies.

Execution Protocol: Drop columns and tables during maintenance windows. Remove dual-write application logic. Validate deferred constraints before finalizing.

BEGIN;
-- Drop legacy column after confirming zero references
ALTER TABLE orders DROP COLUMN legacy_status;
-- Finalize constraint validation
ALTER TABLE orders VALIDATE CONSTRAINT chk_fulfillment;
COMMIT;

Dry-Run & Validation

  • Lock contention simulation (MySQL, Percona Toolkit): pt-online-schema-change --dry-run --alter "DROP COLUMN legacy_status" D=prod_db,t=orders
  • Query plan regression check: EXPLAIN (ANALYZE) SELECT * FROM orders WHERE legacy_status IS NULL; confirm no such queries remain in the codebase before dropping the column

Environment Context

Environment | Execution Policy | Validation Scope
Dev | Immediate cleanup | ORM schema sync, migration manifest update
Staging | Full schema diff validation | Index fragmentation, storage reclamation
Prod | Scheduled maintenance window | Lock queue depth < 10, zero deadlocks

Compatibility Window: V1 application must be fully decommissioned, with zero legacy code references in production. Duration: 48-hour observation post-contract.
Forward Path: Finalize schema removal and archive the migration manifest.
Rollback Path: Halt DROP execution, restore from a point-in-time recovery (PITR) snapshot, re-add columns via the Phase 1 scripts, and re-enable the legacy application version.
For multi-terabyte tables, apply Handling Large Table Alterations Without Locks using batched cleanup routines.

Safety Checks & Environment Parity

Maintain strict parity between staging and production. Every migration script must be reversible.

Pre-Flight Validation Matrix

Check | Command/Tool | Pass Criteria
Foreign Key Integrity | SELECT * FROM information_schema.table_constraints WHERE constraint_type = 'FOREIGN KEY'; | Zero orphaned references
Index Bloat | pg_stat_user_indexes (PostgreSQL) / SHOW INDEX (MySQL) | Bloat ratio < 1.5x
Query Plan Regression | EXPLAIN (ANALYZE) vs. baseline | No sequential scans on indexed columns
Schema Diff | schema-diff staging prod | 100% parity before Phase 3
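The matrix above can be enforced mechanically rather than by checklist. A sketch of a gate function, assuming the metric values are collected by the listed queries and tools; the metric key names are hypothetical:

```python
def preflight_gate(metrics: dict) -> list:
    """Return the names of failed pre-flight checks; promotion
    proceeds only when the returned list is empty. Thresholds
    mirror the validation matrix."""
    failures = []
    if metrics.get("orphaned_fk_refs", 0) != 0:
        failures.append("foreign_key_integrity")
    if metrics.get("index_bloat_ratio", 0.0) >= 1.5:
        failures.append("index_bloat")
    if metrics.get("seq_scans_on_indexed", 0) > 0:
        failures.append("query_plan_regression")
    if metrics.get("schema_parity_pct", 0.0) < 100.0:
        failures.append("schema_diff")
    return failures
```

Returning the full failure list, rather than failing fast, lets the pipeline report every blocking condition in one run.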

Environment Parity Enforcement

  • Dev: Ephemeral databases, rapid iteration. No strict promotion gates.
  • Staging: Production topology replica. Automated load testing. Mandatory approval gates between phases.
  • Prod: Immutable deployment pipeline. Pre-flight checks block promotion if thresholds fail. Replicate production topology in staging to validate migration scripts under identical load profiles.

Rollback Paths & Incident Response

Define explicit rollback triggers for each phase. Never skip validation gates.

Incident Decision Matrix

Phase | Trigger | Immediate Action | Recovery Path
Phase 1 | Legacy write failures (> 0.5% error rate) | Halt deployment, revert app | Drop new column, restore V1 binary
Phase 2 | Replication lag > 2 s or checksum mismatch | Disable dual-write, route to legacy | Re-run backfill with reduced batch size
Phase 3 | Lock contention > 30 s or deadlock spike | Abort transaction, pause cleanup | PITR restore, re-add legacy column, extend observation window
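For illustration, the decision matrix can be encoded as automated rollback triggers. The metric keys below are hypothetical names for signals the matrix assumes you already monitor:

```python
# Thresholds copied from the incident decision matrix.
TRIGGERS = {
    "phase1": lambda m: m.get("legacy_write_error_rate", 0.0) > 0.005,
    "phase2": lambda m: (
        m.get("replication_lag_s", 0.0) > 2.0
        or m.get("checksum_mismatch", False)
    ),
    "phase3": lambda m: (
        m.get("lock_wait_s", 0.0) > 30.0
        or m.get("deadlock_spike", False)
    ),
}


def should_rollback(phase: str, metrics: dict) -> bool:
    """True when the given phase's rollback trigger fires."""
    return TRIGGERS[phase](metrics)
```

Wiring this into alerting turns "never skip validation gates" from policy into an enforced invariant: the pager fires on the same thresholds that define the rollback path.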

Maintain automated PITR snapshots and versioned migration manifests to guarantee reversible deployments. Enforce mandatory 48-hour monitoring windows post-contract to catch delayed batch jobs or cached queries.