Coupling Schema Changes to Feature Flags and Removing Both

Six months after a successful launch, an audit finds a feature flag named enable_orders_v2_total permanently set to true, and a database column orders.legacy_total that nothing reads anymore but every write path still populates. Neither is harmful in isolation; together they are the residue of a rollout that ramped the flag, shipped the feature, and then stopped — never completing the teardown. The flag is now dead config that every engineer must mentally skip past; the column is dead weight on every INSERT and every backup. The discipline that prevents this is not the rollout itself — gating a new column’s read path behind a flag is straightforward — but the teardown order: ramp to 100%, confirm zero fallback, remove the flag, and only then contract the schema. Skip a step and you are left with an orphan flag, an orphan column, or both.

This page is about the removal half of the lifecycle, which is where the orphans accumulate. It assumes you already know how to gate a change behind a flag — covered in Using Feature Flags to Toggle Schema Changes Safely — and focuses on closing the loop without breakage.

Symptom / Error Signatures

Orphans are quiet; you find them by their traces, not by an error at the moment they are created.

  • A flag in the flag service that has been at 100% (or 0%) for weeks with no scheduled removal, and whose name references a feature that is fully launched.
  • Dead branches in code: an if (flags.enable_x) whose else arm is unreachable in production because the flag never flips back.
  • A column that appears in INSERT/UPDATE statements but in no SELECT — written but never read — visible by grepping the codebase or by query-log analysis.
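  • The reverse: a column still read by a fallback path that someone believed was dead. When the old column is dropped too early, the fallback read fails with a missing-column error (column "legacy_total" does not exist); when NOT NULL is added to the new column while the old path is still writing rows without it, the migration fails with a NULL-related error (column "new_total" contains null values).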
  • Migration history shows an ADD COLUMN with no matching DROP COLUMN months later, or a flag-gated read with no corresponding cleanup commit.

The error case — dropping a column the fallback still reads — is the one that causes an incident. The orphan case just causes slow rot, but both come from the same missing discipline.
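The "written but never read" signature can be approximated mechanically. The sketch below is a minimal illustration, assuming you can export logged statements as plain strings; the sample log, the function names, and the keyword-matching approach are all assumptions made for the example, not part of any particular tool.

```python
import re

# Hypothetical sample of logged statements. A real list might be exported
# from a query log or an ORM logger; the format here is an assumption.
QUERY_LOG = [
    "INSERT INTO orders (id, legacy_total, new_total) VALUES ($1, $2, $3)",
    "UPDATE orders SET legacy_total = $1, new_total = $2 WHERE id = $3",
    "SELECT id, new_total FROM orders WHERE id = $1",
]


def column_usage(statements, columns):
    """Count statements that mention each column, split by read vs. write.

    Crude by design: a statement is a "read" if it starts with SELECT and a
    "write" if it starts with INSERT or UPDATE. It does not parse SQL, so
    SELECT * and columns used in an UPDATE's WHERE clause are misclassified.
    """
    usage = {col: {"read": 0, "write": 0} for col in columns}
    for sql in statements:
        parts = sql.split(None, 1)
        if not parts:
            continue  # skip empty log lines
        verb = parts[0].upper()
        if verb == "SELECT":
            kind = "read"
        elif verb in ("INSERT", "UPDATE"):
            kind = "write"
        else:
            continue
        for col in columns:
            if re.search(rf"\b{re.escape(col)}\b", sql, flags=re.IGNORECASE):
                usage[col][kind] += 1
    return usage


def written_but_never_read(statements, columns):
    """Return columns that appear in writes but in no SELECT."""
    usage = column_usage(statements, columns)
    return sorted(c for c, u in usage.items() if u["write"] > 0 and u["read"] == 0)


candidates = written_but_never_read(QUERY_LOG, ["legacy_total", "new_total"])
```

A column reported here is only a candidate orphan. Because the match is textual, a hit should prompt a manual check rather than a drop.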

Root Cause Analysis

A flag-coupled schema change has two independent pieces of state that must be torn down in the right relationship: the flag (which decides whether code reads the new path) and the schema (the new column, and the old column it replaces). The new column’s read path is gated by the flag so that you can roll the feature forward and back without a deploy. That coupling is what makes the rollout safe — but it also means the schema cannot be contracted while the flag still has any path that depends on the old shape.

The orphan-column case happens when the flag reaches 100% and the team moves on: the feature works, so nobody removes the flag, and because the flag still exists the old column “might be needed for fallback” and is never dropped. The orphan-flag case is the same inertia from the other side: the column is dropped, the new path is the only path, but the now-constant flag is left in the code because removing it is unglamorous.

The incident case happens when teardown runs out of order — dropping the old column while the flag is below 100% or while some fallback read still references it. At that point a request that falls through to the old path hits a column that no longer exists, or a NOT NULL constraint added to the new column rejects a row the old path was still writing.

Teardown step done out of order                    | Result
Drop old column before flag is 100%                | Fallback reads hit a missing column → errors
Remove flag before new column is fully populated   | Reads return NULL for unbackfilled rows
Ramp to 100% but never remove flag                 | Orphan flag: dead config forever
Remove flag but never drop old column              | Orphan column: written but never read

The correct sequence is strictly ordered, and it is the read-side mirror of the additive deploy order in the Expand and Contract Methodology: you expand, you migrate reads behind the flag, you prove the new path, then you contract — and the flag must be retired between “prove” and “contract.”

Figure: Flag-and-schema teardown order. A left-to-right timeline of the four teardown steps: 1. Ramp flag to 100%; 2. Confirm zero fallback; 3. Remove flag and dead branch; 4. Contract (drop old column). Each step gates the next; the schema is contracted only after the flag is gone.
The flag is always removed before the schema contracts, so no fallback path can ever reference a column that has been dropped.
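The ordering can also be written down as a small guard that reports the next permitted step for a given state. This is a minimal sketch under stated assumptions: the TeardownState fields and the step labels are hypothetical names chosen for illustration, and in practice each field would be filled from your flag service, metrics, and deployed code.

```python
from dataclasses import dataclass


@dataclass
class TeardownState:
    """Hypothetical snapshot of one flag-coupled schema change."""
    flag_percent: int          # current rollout percentage of the flag
    fallback_hits: int         # fallback-branch count over the observation window
    flag_exists: bool          # flag still present in code or in the flag service
    old_column_written: bool   # some write path still populates the old column
    old_column_exists: bool    # old column is still in the schema


def next_step(state: TeardownState) -> str:
    """Return the only step that is safe to take from this state."""
    if state.flag_exists:
        if state.flag_percent < 100:
            return "ramp flag to 100%"
        if state.fallback_hits > 0:
            return "wait for zero fallback"
        return "remove flag and dead branch"
    if not state.old_column_exists:
        return "done"
    if state.old_column_written:
        return "stop writing old column"
    return "drop old column"


def can_drop_old_column(state: TeardownState) -> bool:
    """The drop is allowed only when every earlier step is complete."""
    return next_step(state) == "drop old column"
```

For example, a state with the flag at 50% returns "ramp flag to 100%", and can_drop_old_column stays False until the flag is gone and writes to the old column have stopped.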

Immediate Mitigation

If you have found an orphan, or a teardown is mid-flight and you need to make it safe, proceed by which orphan you have.

  1. For an orphan column still being written: first prove nothing reads it. Grep the codebase and inspect the query log for any SELECT of the column before you touch it.
-- PostgreSQL · run as a read-only role · find rows that still diverge old vs new
-- Safe anytime; tells you whether the old column carries data the new one lacks.
SELECT count(*) AS divergent
FROM   orders
WHERE  legacy_total IS DISTINCT FROM new_total;
  2. Stop writing the orphan column in a deploy before dropping it: remove it from the write path, ship, confirm writes have stopped, then drop. Dropping a column that code still writes to causes immediate INSERT failures.

  3. For an orphan flag at a constant value: delete the losing branch in code and ship that first, so the flag has no behavioral effect, then remove the flag from the flag service. Removing the service entry while code still queries it can raise a “flag not found” error or silently fall back to a default that may differ from the value you relied on.

  4. For a half-done teardown that dropped the column too early: if reads are now failing, the fastest recovery is to re-add the column as nullable and re-enable the fallback path, restoring the prior working state, then redo the teardown in the correct order.

-- PostgreSQL · run on primary as migration role · additive, restores the fallback
-- Re-add nullable so the old read path works again; backfill if the data is needed.
ALTER TABLE orders ADD COLUMN IF NOT EXISTS legacy_total NUMERIC;
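The orphan-flag ordering above matters because many flag clients return a caller-supplied default for unknown flags. The sketch below illustrates that failure mode with a hypothetical in-memory client; InMemoryFlags and its methods are invented for this example and do not describe any real SDK, and the row values are made up.

```python
class InMemoryFlags:
    """Hypothetical stand-in for a flag client; real SDKs differ."""

    def __init__(self, values):
        self._values = dict(values)

    def enabled(self, name, default=False):
        # Assumption: an unknown flag silently resolves to the default.
        return self._values.get(name, default)

    def delete(self, name):
        self._values.pop(name, None)


def read_total_gated(flags, row):
    """Code that still branches on the flag."""
    if flags.enabled("enable_orders_v2_total"):
        return row["new_total"]
    return row["legacy_total"]


def read_total_unconditional(row):
    """Code after the losing branch has been deleted."""
    return row["new_total"]


row = {"new_total": 120, "legacy_total": 100}   # illustrative values
flags = InMemoryFlags({"enable_orders_v2_total": True})

before = read_total_gated(flags, row)    # flag is on: new path
flags.delete("enable_orders_v2_total")   # service entry removed first (wrong order)
after = read_total_gated(flags, row)     # default False: back on the legacy path
```

With the gated code still deployed, deleting the service entry moves reads back to the legacy column without any deploy. Shipping read_total_unconditional first makes the later deletion a no-op.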

Permanent Fix / Long-Term Pattern

Treat the flag and the schema as a single coupled unit with a four-step teardown that is recorded as work, not left to inertia.

Step 1 — Ramp the flag to 100%. Move all traffic onto the new column’s read path. Hold here long enough to cover your slowest cohort and any cached sessions.

Step 2 — Confirm zero fallback. Instrument the old path so you can prove it is taking zero traffic, not assume it. A counter on the else branch that has read zero over a full traffic cycle is the gate.

# Instrument the fallback branch so "zero fallback" is measured, not assumed.
# Context: application code; emit a metric whenever the old path is taken.
if flags.enabled("enable_orders_v2_total", user):
    total = row["new_total"]
else:
    metrics.increment("orders.total.legacy_fallback")   # must read 0 before teardown
    total = row["legacy_total"]
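The gate itself can be expressed as a check over exported counter samples. This is a minimal sketch, assuming the metric is available as (timestamp, count) pairs; the function name, the one-week window (borrowed from the "fallback at zero for one week" trigger), and the one-day tolerance are assumptions for illustration.

```python
from datetime import datetime, timedelta


def fallback_gate_passes(samples, now, window=timedelta(days=7),
                         tolerance=timedelta(days=1)):
    """Decide whether the legacy fallback counter has been zero for a full window.

    samples: iterable of (timestamp, count) pairs for the fallback metric.
    Returns False when data is missing, because an absent metric is not
    evidence of zero traffic.
    """
    start = now - window
    in_window = [(ts, n) for ts, n in samples if start <= ts <= now]
    if not in_window:
        return False
    # Require the oldest sample to sit near the start of the window, so a
    # metric that only began reporting recently cannot pass the gate.
    oldest = min(ts for ts, _ in in_window)
    covers_window = (oldest - start) <= tolerance
    return covers_window and all(n == 0 for _, n in in_window)
```

Treating missing data as a failure is a deliberate choice here: if the counter was never deployed, or stopped reporting, the gate stays closed until real zeros accumulate.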

Step 3 — Remove the flag. With fallback proven dead, delete the conditional and the losing branch, ship that change, then delete the flag from the flag service. The code is now unconditional on the new column, and the flag is gone from both code and config as part of the same removal ticket, leaving no orphan.

Step 4 — Contract the schema. Only now drop the old column, after removing it from every write path and confirming that writes have stopped. Because nothing reads, writes, or branches on it, the drop is a pure contraction.

-- PostgreSQL · run on primary as migration role · final contraction, low-write window
-- Safe only after the flag is removed and no code path references legacy_total.
ALTER TABLE orders DROP COLUMN IF EXISTS legacy_total;
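Before running the drop, the "no code path references legacy_total" condition can be checked with a quick scan of the source tree. The sketch below is one way to do that, assuming the application code lives in .py and .sql files under a single root; the function name and the suffix list are assumptions, and a textual scan cannot see column names built dynamically at runtime.

```python
import re
from pathlib import Path


def find_references(root, identifier, suffixes=(".py", ".sql")):
    """Return (path, line_number, line) for each whole-word match of identifier."""
    pattern = re.compile(rf"\b{re.escape(identifier)}\b")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

An empty result for find_references(".", "legacy_total") supports the drop but does not prove it safe on its own; historical migration files will also match and need to be excluded by hand, and the query log remains the stronger evidence.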

If the old column was NOT NULL, the contraction has extra steps — you must drop the constraint and stop writes in the right order, which is the specific subject of Safely Removing a NOT NULL Column With Expand-Contract. The unifying idea, drawn from the parent Feature Flag Rollouts section, is that a flag is temporary scaffolding: it has a creation ticket and a removal ticket from day one, and the removal ticket explicitly orders the schema contraction after the flag deletion so neither is ever left behind.

Verification Checklist

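  • The flag is removed from both code and the flag service before the old column is dropped.
  • The old column is removed from every write path (INSERT/UPDATE paths) before the DROP COLUMN runs.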

Frequently Asked Questions

Can I drop the old column at the same time I remove the flag? Not in the same change set. Remove the flag and its branch first and ship that, so production is running unconditionally on the new column with no path that references the old one. Only after that deploy is live and stable do you drop the column. Doing both in one change set risks a request in flight against the old code hitting a column that no longer exists.

How long should the flag stay at 100% before I tear it down? Long enough to cover your slowest-refreshing clients and any cached sessions — typically at least one full traffic cycle, often a few days for consumer apps with long-lived sessions. The objective metric is that the instrumented fallback branch reads zero over that window; calendar time is just a proxy for “everyone has moved on.”

What stops a flag from becoming permanent dead config? A removal ticket created alongside the rollout, with an explicit owner and a trigger condition (“fallback at zero for one week”). Flags without a scheduled end date become orphans by default. The flag service itself should surface flags that have been at a constant value past a threshold so they cannot hide indefinitely.
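If your flag service does not surface constant flags on its own, a periodic report can approximate it. The sketch below assumes you can export each flag's current percentage and the time it last changed; the function name, the data shape, and the 30-day threshold are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def stale_flags(flags, now, threshold=timedelta(days=30)):
    """Return names of flags pinned at 0% or 100% for at least `threshold`.

    flags: mapping of flag name to (percent, last_changed) tuples.
    """
    return sorted(
        name
        for name, (percent, last_changed) in flags.items()
        if percent in (0, 100) and now - last_changed >= threshold
    )
```

Each name returned is a candidate for a removal ticket; a flag that is mid-ramp or that changed recently is left alone.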

The old column is NOT NULL and code still writes it. How do I drop it? Carefully and in order: drop the NOT NULL constraint first, remove the column from all write paths and confirm writes have stopped, verify no readers remain, then DROP COLUMN. The full procedure with PostgreSQL and MySQL specifics is covered in the guide on removing a NOT NULL column with expand-contract.