Implementing Expand/Contract for High-Traffic Tables

You need to change the shape of a table doing thousands of transactions per second, and the naive ALTER TABLE you ran in staging just hung in production behind an AccessExclusive lock while P99 latency blew past 500 ms and connections queued until the pool exhausted. On a hot table, the danger is never the final schema — it is the transition, where a single blocking statement or a code deploy that races ahead of the schema takes the service down. This page is the runbook for splitting one risky change into the expand-then-contract sequence that keeps every intermediate state readable and writable by both the old and new application code. It assumes PostgreSQL 12+ or MySQL 8.0+ with connection pooling (PgBouncer/ProxySQL) and feature-flag infrastructure in place.

Expand/contract is the master pattern these tactics serve, sitting alongside the Dual-Write Synchronization that keeps both columns consistent and the throttled jobs in Backfill Optimization that fill the new column without stalling replication.

Symptom / Error Signatures

A high-traffic table mid-change shows up as lock-driven latency and version-skew errors:

  • PostgreSQL: ERROR: canceling statement due to lock timeout (SQLSTATE 55P03) while an ALTER TABLE waits behind live DML.
  • MySQL: ERROR 1213: Deadlock found when trying to get lock; try restarting transaction during overlapping DDL and writes.
  • ORM layer: ColumnNotFoundException or mapping mismatch when new code reads a column the running release does not yet write — or old code hits a column the deploy already dropped.
  • Application: P99 latency spikes above 500 ms exactly at the ALTER TABLE timestamp.
  • Replication: follower lag past 10 s caused by a long-running DDL or an unthrottled backfill transaction.

The first thing to determine is whether DDL is blocked or blocking:

-- PostgreSQL · read-only · find the blocker behind a waiting ALTER
SELECT blocked.pid AS blocked, blocker.pid AS blocker,
       now() - blocker.query_start AS held_for, blocker.query
FROM   pg_stat_activity blocked
JOIN   pg_stat_activity blocker
       ON blocker.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE  blocked.query LIKE '%ALTER TABLE%';

Root Cause Analysis

Traditional DDL takes an AccessExclusive lock (PostgreSQL) or a metadata lock (MySQL) on the target table. On an idle table that lock is invisible; on a table doing thousands of TPS, the ALTER queues behind in-flight transactions, then every subsequent query queues behind the ALTER, and the connection pool fills with waiters until new requests are refused. Even online-capable DDL stalls if it cannot acquire the brief exclusive lock it needs at the start.

The second failure class is version skew. If the schema change and the code that depends on it ship together, there is always a window where some application instances run the old code against the new schema, or the new code against the old schema. A drop that lands before the last reader is redeployed throws ColumnNotFoundException; an add the new code requires before it deploys throws the same. Expand/contract removes both windows by ordering every step so that the schema is always a superset of what any running code needs — the discipline laid out in the Expand and Contract Methodology overview. The single hardest case, removing a column that is currently NOT NULL, has its own sequence in safely removing a NOT NULL column with expand-contract.

Expand → Backfill → Contract timeline Three phases left to right: expand adds a nullable column and deploys dual-write code, backfill populates historical rows in throttled batches, and contract switches reads and drops the legacy column. The schema stays compatible with both old and new code throughout. Expand → Backfill → Contract 1. Expand ADD COLUMN NULL deploy dual-write 2. Backfill throttled batches read legacy 3. Contract switch reads, NOT NULL DROP legacy The schema is a superset of every running release's needs at all times. No step takes a lasting exclusive lock; reads only switch after backfill verifies parity.
Each phase is independently deployable and reversible; the legacy column is dropped only after reads have moved and parity is confirmed.

Immediate Mitigation

If a change is hung and the table is timing out right now, restore availability first.

  1. Terminate the blocking session. Cancel the stuck DDL or the long transaction in front of it.

    -- PostgreSQL · requires pg_signal_backend or superuser · cancel the waiting ALTER
    SELECT pg_cancel_backend(pid)
    FROM   pg_stat_activity
    WHERE  state = 'active' AND query LIKE '%ALTER TABLE%';
    
    -- MySQL · requires CONNECTION_ADMIN/SUPER · kill the blocking statement
    KILL QUERY <process_id>;
  2. Shed write pressure. Trip circuit breakers to degrade to cached responses or route reads to a replica while the primary recovers.

  3. Revert routing, not schema. Flip the feature flag back to legacy-schema reads/writes — this is instant and needs no DDL.

  4. Stop the backfill cleanly. Send SIGTERM to migration workers and confirm the current batch committed or rolled back before they exit, so no partial range is left ambiguous.

  5. Flush pooled metadata. Reload PgBouncer (RELOAD) or restart ProxySQL to clear stale connections holding metadata locks and to drain the request queue.

Permanent Fix / Long-Term Pattern

Run the change as three independently deployed, individually reversible phases.

Phase 1 — Expand (additive, lock-safe)

Add the new column nullable so the ALTER is metadata-only, then deploy code that writes both fields and still reads the legacy one.

-- PostgreSQL · metadata-only on PG 11+ · run with a short lock_timeout to fail fast
SET lock_timeout = '3s';
ALTER TABLE high_traffic_table ADD COLUMN IF NOT EXISTS new_field VARCHAR(255) NULL;

-- MySQL 8.0 · request the non-blocking algorithm explicitly; it errors rather than blocking
ALTER TABLE high_traffic_table ADD COLUMN new_field VARCHAR(255) NULL, ALGORITHM=INSTANT;

Rollback is a clean reverse: ALTER TABLE high_traffic_table DROP COLUMN IF EXISTS new_field;, revert the deploy, reset the flag to legacy-only.

Phase 2 — Backfill and synchronize

Populate historical rows in throttled, idempotent batches while dual-write keeps new rows current. Keep each batch small and watch lag; the sizing rules live in Backfill Optimization, and keeping both copies in step is the job of Dual-Write Synchronization.

-- PostgreSQL · read-only · monitor lock waits while the backfill runs
SELECT pid, wait_event_type, wait_event, query
FROM   pg_stat_activity
WHERE  state = 'active' AND query LIKE '%UPDATE high_traffic_table%';

-- MySQL · read-only · watch replica lag during backfill (8.0.22+ column name)
SHOW REPLICA STATUS\G   -- check Seconds_Behind_Source

Phase 3 — Contract (verify, switch, drop)

Confirm parity, switch reads to the new column via the flag, then enforce the constraint and drop the legacy column in separate short transactions.

-- PostgreSQL · read-only parity check · must return 0 before switching reads
SELECT count(*) AS mismatch
FROM   high_traffic_table
WHERE  (legacy_field IS DISTINCT FROM new_field);

-- PostgreSQL · step 1: enforce NOT NULL only after backfill verified complete
ALTER TABLE high_traffic_table ALTER COLUMN new_field SET NOT NULL;

-- PostgreSQL · step 2: drop legacy in its own short transaction, off-peak window
BEGIN;
ALTER TABLE high_traffic_table DROP COLUMN legacy_field;
COMMIT;

Dropping a column that was NOT NULL and still read by some path is the one case that needs extra ordering — handle it via safely removing a NOT NULL column with expand-contract rather than the plain drop above.

Verification Checklist

  • ALTER completed as a metadata-only change with no measurable P99 latency spike.
  • legacy_field and new_field before reads switch.
  • information_schema.columns shows legacy_field absent and new_field with the target NOT NULL constraint.

Frequently Asked Questions

Why add the column nullable instead of with a default? On a hot table you want the ALTER to be metadata-only so it never rewrites rows under an exclusive lock. A nullable add is instant on PostgreSQL 11+ and ALGORITHM=INSTANT on MySQL 8.0.12+; a non-constant default that forces a rewrite can lock the table for the whole rewrite. Add nullable, backfill, then enforce the constraint in the contract phase.

Can I run expand and contract in the same deploy? No. The entire point is to separate them so the schema is always compatible with every running release. Expand, then let dual-write and the verified backfill run, then deploy the read switch, and only after that contract. Collapsing the phases reintroduces the version-skew window that throws ColumnNotFoundException.

How do I verify parity before switching reads? Run a mismatch query that returns zero — comparing legacy_field IS DISTINCT FROM new_field across the table — and confirm the backfill left no NULL rows. Switch reads only when that count is zero and replica lag is at baseline so replicas serve the same data the primary does.