Configuring Read Replicas for Seamless Schema Updates

Operational Intent: Validate replica schema synchronization and dynamically adjust query routing thresholds during active DDL execution to prevent column mismatch and schema version errors in production read workloads.

Symptom

  • ERROR: column "new_column_name" does not exist (PostgreSQL) or 1054 Unknown column (MySQL)
  • ORM stack traces: ActiveRecord::StatementInvalid: PG::UndefinedColumn during read-only operations
  • Connection pooler metrics: replica_lag_seconds spikes >5.0s immediately following ALTER TABLE execution
  • Proxy routing logs: ERROR: schema mismatch on replica node, triggering automatic failover to primary and subsequent connection exhaustion
  • Client-side: 504 Gateway Timeout on read endpoints while write endpoints remain responsive

Root Cause

Asynchronous replication mechanics collide with non-backward-compatible DDL. Schema changes apply instantly to the primary but propagate asynchronously to replicas. During this replication gap, standard Read/Write Splitting Tactics route queries to nodes operating on stale schema definitions. Connection poolers (PgBouncer, ProxySQL, HAProxy, Envoy) lack native schema-aware routing, treating all replicas as functionally identical regardless of DDL sync state. This creates a deterministic race condition where read queries hit outdated metadata, triggering fatal SQL exceptions and cascading connection pool exhaustion.

Immediate Mitigation

Execution Context: Application Layer & Proxy Configuration

  1. Force Primary Routing: Temporarily disable replica routing for the affected schema/table. Route 100% of read/write traffic to the primary until replication stabilizes.
  2. Deploy Circuit Breaker: Catch UndefinedColumn/UnknownColumn exceptions at the application layer and retry the query against the primary endpoint.
  3. Halt Deployment Pipeline: Pause automated pipelines and pending DDL batches. Do not proceed until replication_lag_seconds drops below 1.0s across all replica nodes.
  4. ORM Strict Fallback: Enable strict read-only mode fallback to the primary database for the duration of the migration window.

Production-Safe Diagnostic Queries:

-- PostgreSQL: Check replication lag & WAL apply status
SELECT client_addr, state, sent_lag, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- MySQL/MariaDB: Check slave status & lag
SHOW SLAVE STATUS\G
-- Critical fields: Seconds_Behind_Master, Relay_Log_Space, Slave_IO_Running, Slave_SQL_Running

Explicit Rollback Command: If schema mismatch triggers cascading failures, revert routing immediately before attempting DDL rollback:

# Proxy/PgBouncer: Force primary-only routing via hot-reload
proxyadmin -u admin -p admin -h localhost -e "UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostgroup_id=10;"
-- DDL Rollback (PostgreSQL/MySQL compatible)
ALTER TABLE your_table DROP COLUMN IF EXISTS new_column_name;

Permanent Resolution

Integrate these controls into your broader Zero-Downtime Schema Evolution Patterns to eliminate deployment-induced outages and guarantee consistent query behavior across distributed data stores.

  1. Expand & Contract Methodology: Phase DDL into backward-compatible steps: ADD COLUMN (NULL/DEFAULT) → deploy app code → backfill data → enforce constraints → DROP legacy column.
  2. Lag-Aware Proxy Routing: Configure ProxySQL/HAProxy health checks to monitor pg_stat_replication or SHOW SLAVE STATUS. Automatically exclude replicas with lag > 2s from the read pool during active migrations.
  3. Schema Parity Validation Sidecar: Deploy a lightweight service polling information_schema.columns on replicas. Update service discovery endpoints only when column_count and constraint_definitions match the primary.
  4. Feature-Gated Read Routing: Inject schema version headers into queries or use feature flags to gate read traffic until all replicas report parity via a lightweight validation query.
  5. Online Schema Change Tools: Standardize execution during low-traffic windows using pt-online-schema-change (MySQL) or pg_repack equivalents to minimize lock contention and replication delay.

Validation

Execute the following checklist before, during, and after rollout:

  • Pre-DDL: Verify replication_lag_seconds consistently under 1.0s before initiating DDL execution.
  • Schema Safety: Confirm all new columns are added as NULL or with DEFAULT values to maintain backward compatibility.
  • Failover Test: In staging, artificially inject 5s+ replication delay. Verify read routing fails over to primary without connection exhaustion.
  • Query Planner Stability: Run EXPLAIN (ANALYZE, BUFFERS) on critical read queries post-migration across all nodes to detect plan regressions.
  • Log Audit: Monitor connection pooler logs to confirm zero schema_mismatch errors during the rollout window.

Post-Migration Verification Query:

-- Verify column parity across primary and replicas
SELECT table_schema, table_name, column_name, ordinal_position, data_type
FROM information_schema.columns
WHERE table_name = 'your_table'
ORDER BY ordinal_position;
-- Cross-check output between primary and replica endpoints. Mismatches indicate incomplete replication.