Configuring Read Replicas for Seamless Schema Updates
Operational Intent: Validate replica schema synchronization and dynamically adjust query routing thresholds during active DDL execution to prevent column mismatch and schema version errors in production read workloads.
Symptom
ERROR: column "new_column_name" does not exist(PostgreSQL) or1054 Unknown column(MySQL)- ORM stack traces:
ActiveRecord::StatementInvalid: PG::UndefinedColumnduring read-only operations - Connection pooler metrics:
replica_lag_secondsspikes >5.0s immediately followingALTER TABLEexecution - Proxy routing logs:
ERROR: schema mismatch on replica node, triggering automatic failover to primary and subsequent connection exhaustion - Client-side: 504 Gateway Timeout on read endpoints while write endpoints remain responsive
Root Cause
Asynchronous replication mechanics collide with non-backward-compatible DDL. Schema changes apply instantly to the primary but propagate asynchronously to replicas. During this replication gap, standard Read/Write Splitting Tactics route queries to nodes operating on stale schema definitions. Connection poolers (PgBouncer, ProxySQL, HAProxy, Envoy) lack native schema-aware routing, treating all replicas as functionally identical regardless of DDL sync state. This creates a deterministic race condition where read queries hit outdated metadata, triggering fatal SQL exceptions and cascading connection pool exhaustion.
Immediate Mitigation
Execution Context: Application Layer & Proxy Configuration
- Force Primary Routing: Temporarily disable replica routing for the affected schema/table. Route 100% of read/write traffic to the primary until replication stabilizes.
- Deploy Circuit Breaker: Catch
UndefinedColumn/UnknownColumnexceptions at the application layer and retry the query against the primary endpoint. - Halt Deployment Pipeline: Pause automated pipelines and pending DDL batches. Do not proceed until
replication_lag_secondsdrops below 1.0s across all replica nodes. - ORM Strict Fallback: Enable strict read-only mode fallback to the primary database for the duration of the migration window.
Production-Safe Diagnostic Queries:
-- PostgreSQL: Check replication lag & WAL apply status
SELECT client_addr, state, sent_lag, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
-- MySQL/MariaDB: Check slave status & lag
SHOW SLAVE STATUS\G
-- Critical fields: Seconds_Behind_Master, Relay_Log_Space, Slave_IO_Running, Slave_SQL_Running
Explicit Rollback Command: If schema mismatch triggers cascading failures, revert routing immediately before attempting DDL rollback:
# Proxy/PgBouncer: Force primary-only routing via hot-reload
proxyadmin -u admin -p admin -h localhost -e "UPDATE mysql_servers SET status='OFFLINE_SOFT' WHERE hostgroup_id=10;"
-- DDL Rollback (PostgreSQL/MySQL compatible)
ALTER TABLE your_table DROP COLUMN IF EXISTS new_column_name;
Permanent Resolution
Integrate these controls into your broader Zero-Downtime Schema Evolution Patterns to eliminate deployment-induced outages and guarantee consistent query behavior across distributed data stores.
- Expand & Contract Methodology: Phase DDL into backward-compatible steps:
ADD COLUMN(NULL/DEFAULT) → deploy app code → backfill data → enforce constraints →DROPlegacy column. - Lag-Aware Proxy Routing: Configure ProxySQL/HAProxy health checks to monitor
pg_stat_replicationorSHOW SLAVE STATUS. Automatically exclude replicas withlag > 2sfrom the read pool during active migrations. - Schema Parity Validation Sidecar: Deploy a lightweight service polling
information_schema.columnson replicas. Update service discovery endpoints only whencolumn_countandconstraint_definitionsmatch the primary. - Feature-Gated Read Routing: Inject schema version headers into queries or use feature flags to gate read traffic until all replicas report parity via a lightweight validation query.
- Online Schema Change Tools: Standardize execution during low-traffic windows using
pt-online-schema-change(MySQL) orpg_repackequivalents to minimize lock contention and replication delay.
Validation
Execute the following checklist before, during, and after rollout:
- Pre-DDL: Verify
replication_lag_secondsconsistently under 1.0s before initiating DDL execution. - Schema Safety: Confirm all new columns are added as
NULLor withDEFAULTvalues to maintain backward compatibility. - Failover Test: In staging, artificially inject 5s+ replication delay. Verify read routing fails over to primary without connection exhaustion.
- Query Planner Stability: Run
EXPLAIN (ANALYZE, BUFFERS)on critical read queries post-migration across all nodes to detect plan regressions. - Log Audit: Monitor connection pooler logs to confirm zero
schema_mismatcherrors during the rollout window.
Post-Migration Verification Query:
-- Verify column parity across primary and replicas
SELECT table_schema, table_name, column_name, ordinal_position, data_type
FROM information_schema.columns
WHERE table_name = 'your_table'
ORDER BY ordinal_position;
-- Cross-check output between primary and replica endpoints. Mismatches indicate incomplete replication.