There are projects you never forget.
For me, it was the night (actually, many nights) we moved over a billion records from our old database to a new one…
without a single second of downtime.
We were moving critical financial data: payments, orders, ledgers. If we messed up, customers would lose money, dashboards would break, and trust would vanish overnight.
I learned system design the hard way: through database internals, tough trade-offs, and the human stress baked into every decision.
Why We Had To Do It
Our old database had served us faithfully. But scale changes everything.
- Queries that once took milliseconds now dragged on for seconds.
- Batch jobs (like settlements) sometimes ran for hours.
- We had already scaled vertically (bigger hardware) and horizontally (read replicas), but it wasn’t enough.
- Schema rigidity meant every new feature felt like surgery.
We crossed 1 billion records, and the database was groaning. The business was growing, and downtime was unacceptable. We had no choice: migrate.
But how do you move that much data without shutting the lights off?
Step 1: Bulk Migration of Old Data
We started with the “cold” data: the old transactions that were no longer being updated.
What was happening inside the DB?
- If you try to export a billion rows, the DB will grab huge buffers, maybe spill to disk, and choke.
- Every row insert into the new DB updates indexes, which slows everything down.
- Foreign key constraints mean every insert must check other tables.
What we did instead:
- Split the table into chunks by primary key ranges (e.g., IDs 1–5M, 5M–10M).
- Disabled secondary indexes and constraints during load.
- Ran multiple workers in parallel.
- After each chunk, we ran a checksum to make sure the data matched exactly.
This wasn’t glamorous, but it worked.
Lesson: For huge migrations, don’t think “one giant dump.” Think chunks + parallelism + idempotency.
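The chunking idea above can be sketched in a few lines. This is a toy illustration, not our actual tooling: it uses in-memory SQLite as a stand-in for both databases, and the `payments` table, column names, and tiny chunk size are all invented for the example.

```python
# Minimal sketch: copy a table in primary-key chunks, verify each chunk
# with a checksum, and make each chunk idempotent so reruns are safe.
import hashlib
import sqlite3

CHUNK_SIZE = 2  # in production this would be millions of rows

def chunk_checksum(conn, lo, hi):
    """Order-stable checksum of all rows with lo <= id < hi."""
    rows = conn.execute(
        "SELECT id, amount FROM payments WHERE id >= ? AND id < ? ORDER BY id",
        (lo, hi),
    ).fetchall()
    return hashlib.sha256(repr(rows).encode()).hexdigest()

def migrate_chunk(src, dst, lo, hi):
    rows = src.execute(
        "SELECT id, amount FROM payments WHERE id >= ? AND id < ? ORDER BY id",
        (lo, hi),
    ).fetchall()
    # INSERT OR REPLACE makes the chunk idempotent: re-running it is safe.
    dst.executemany("INSERT OR REPLACE INTO payments VALUES (?, ?)", rows)
    dst.commit()
    # Verify the chunk landed byte-for-byte identical before moving on.
    assert chunk_checksum(src, lo, hi) == chunk_checksum(dst, lo, hi)

src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for conn in (src, dst):
    conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount REAL)")
src.executemany("INSERT INTO payments VALUES (?, ?)",
                [(i, i * 1.5) for i in range(1, 7)])

max_id = src.execute("SELECT MAX(id) FROM payments").fetchone()[0]
for lo in range(1, max_id + 1, CHUNK_SIZE):
    migrate_chunk(src, dst, lo, lo + CHUNK_SIZE)

print(dst.execute("SELECT COUNT(*) FROM payments").fetchone()[0])  # 6
```

Because each chunk is independent and idempotent, the workers can run in parallel, and a crashed worker simply re-runs its range.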
Step 2: Dual Writes for Live Traffic
Copying old data was easy. The real challenge was keeping up with new traffic.
Imagine this: while we were copying the old rows, new payments were still coming in at thousands per second. If we didn’t capture those, the new DB would always lag behind.
What we did:
- Modified the app to do dual writes: every new insert went to both old and new DBs.
- If the new DB write failed, the event was pushed into a Kafka retry queue.
- A consumer retried until it succeeded.
- We made writes idempotent by tagging them with unique IDs. (So if a retry happened twice, it wouldn’t duplicate the row.)
Why this worked:
In relational databases like Postgres/MySQL, every insert first goes into a WAL (Write-Ahead Log). We piggybacked on that idea: writes were guaranteed in at least one place, and we retried until both DBs matched.
Lesson: Dual writes = cheap distributed transaction. Retry queues save you from partial failures.
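A stripped-down sketch of the pattern: the dicts stand in for the old and new databases, a `deque` stands in for the Kafka retry topic, and the flaky-DB class and event shape are invented to simulate a failed write.

```python
# Toy sketch of dual writes with an idempotency key and a retry queue.
from collections import deque

old_db, new_db = {}, {}
retry_queue = deque()

class FlakyNewDB:
    """Simulates a new DB whose first write attempt fails."""
    def __init__(self):
        self.calls = 0
    def write(self, event):
        self.calls += 1
        if self.calls == 1:
            raise ConnectionError("new DB unavailable")
        # Keyed by event ID, so a replayed event never duplicates a row.
        new_db[event["id"]] = event["amount"]

flaky = FlakyNewDB()

def dual_write(event):
    old_db[event["id"]] = event["amount"]  # old DB stays the source of truth
    try:
        flaky.write(event)
    except ConnectionError:
        retry_queue.append(event)  # park the event for the retry consumer

def retry_consumer():
    while retry_queue:
        event = retry_queue.popleft()
        try:
            flaky.write(event)
        except ConnectionError:
            retry_queue.append(event)  # keep retrying until it sticks

dual_write({"id": "txn-1", "amount": 100})  # new-DB write fails, gets queued
dual_write({"id": "txn-2", "amount": 250})  # succeeds on both
retry_consumer()                            # drains the queue
print(old_db == new_db)  # True
```

The unique event ID is what makes the retries safe: replaying `txn-1` twice overwrites the same key instead of inserting a second row.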
Step 3: Shadow Reads (Secret Testing in Production)
Now we had old + new DBs in sync.
But… could we trust the new one?
This was our secret weapon: shadow reads.
- Customers still read from the old DB.
- But in the background, every query was silently duplicated against the new DB.
- We compared the results.
What we found:
- Timezones behaved differently (`TIMESTAMP WITHOUT TZ` vs `WITH TZ`).
- Some `NULL`s turned into default values in the new DB.
- Sorting differed because collations (UTF-8 vs Latin1) weren’t identical.
None of these issues would show up in tests. Only real traffic exposed them. Shadow reads gave us weeks to fix these quirks without a single customer noticing.
Lesson: Shadow traffic isn’t optional. It’s the only way to catch query planner and encoding mismatches before customers do.
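The shadow-read loop is simple in outline. The sketch below uses toy lookup functions in place of real queries; the `NULL`-to-default quirk it catches mirrors the kind of mismatch we actually saw.

```python
# Sketch of shadow reads: serve from the old store, silently run the same
# query against the new store, and count mismatches instead of surfacing them.
mismatch_count = 0

def query_old(key):
    return {"a": "2021-01-01T00:00:00Z", "b": None}.get(key)

def query_new(key):
    # The new DB quietly turned a NULL into a default value.
    return {"a": "2021-01-01T00:00:00Z", "b": ""}.get(key)

def read(key):
    global mismatch_count
    result = query_old(key)   # customers always get this answer
    shadow = query_new(key)   # duplicated in the background
    if shadow != result:
        mismatch_count += 1   # logged and investigated, never shown to users
    return result

assert read("a") == "2021-01-01T00:00:00Z"
assert read("b") is None      # the customer still sees the old behavior
print(mismatch_count)  # 1
```

In production the comparison would run off the request path (async, sampled) so the shadow query never adds latency for the customer.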
Step 4: The Cutover (Nerve-Wracking Night)
Cutover day was like preparing for battle.
The risks:
- The new DB’s buffer pool (cache) was cold. First queries could slam the disk.
- Indexes might not be fully warmed up.
- Background tasks (like autovacuum/compaction) could spike I/O.
Our plan:
- Pre-warmed the DB by running synthetic queries to load indexes and cache.
- Picked 4:30 AM (lowest traffic).
- Flipped a feature flag: reads started going to the new DB.
- Kept dual writes ON as a safety net.
The first 10 minutes…
We sat staring at Grafana dashboards like they were EKG monitors.
- Latency? Normal.
- Error rate? Flat.
- Business metrics (payments, refunds)? All green.
No one celebrated. We were too scared.
But after 24 hours, when things stayed quiet, we finally smiled.
Lesson: Cutover isn’t just a switch. It’s cache warming + rollback safety + obsessive monitoring.
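Pre-warming is just replaying the hottest lookups before the flag flips. A toy sketch, with SQLite standing in for the new DB and a hard-coded `hot_keys` list standing in for the IDs mined from query logs:

```python
# Sketch of cache pre-warming: issue the most frequent point lookups against
# the new DB so its buffer pool and indexes are hot before real traffic lands.
import sqlite3

new_db = sqlite3.connect(":memory:")
new_db.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount REAL)")
new_db.executemany("INSERT INTO payments VALUES (?, ?)",
                   [(i, float(i)) for i in range(1, 101)])

# Hypothetical: the most-read IDs from last week's query log.
hot_keys = [7, 42, 99]

warmed = 0
for key in hot_keys:
    # Each lookup pulls the index path and the row's page into cache.
    row = new_db.execute("SELECT amount FROM payments WHERE id = ?", (key,)).fetchone()
    if row is not None:
        warmed += 1

print(warmed)  # 3
```

The same idea extends to range scans and the handful of heavy reporting queries: run them once, cold, before customers ever do.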
Step 5: Observability (Our Lifeline)
If you ask me what saved us, it wasn’t clever SQL. It was observability.
We tracked:
- Replication lag (seconds behind master).
- Deadlocks in the new DB.
- Cache hit ratio (had to stay >95%).
- Mismatch counters from shadow reads.
- Business KPIs (orders/min, revenue flow).
Without these dashboards, we’d have been flying blind.
Lesson: Migrations are monitoring problems disguised as data problems.
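The cache-hit-ratio check from the list above is worth spelling out. In Postgres the raw counters would come from `pg_stat_database` (`blks_hit`, `blks_read`); here they are hard-coded sample numbers for illustration:

```python
# Sketch of the >95% cache-hit-ratio check from our dashboard.
def cache_hit_ratio(blks_hit, blks_read):
    total = blks_hit + blks_read
    return blks_hit / total if total else 1.0

def check(blks_hit, blks_read, threshold=0.95):
    ratio = cache_hit_ratio(blks_hit, blks_read)
    return "ok" if ratio >= threshold else f"ALERT: hit ratio {ratio:.2%}"

print(check(990_000, 10_000))   # ok   (99% of reads served from memory)
print(check(800_000, 200_000))  # ALERT: hit ratio 80.00%
```

Replication lag and shadow-read mismatch counters got the same treatment: a number, a threshold, and an alert wired to a human.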
Trade-Offs We Faced
1. Big Bang vs Phased:
- Big bang was faster, but rollback was impossible.
- Phased was slower, but reversible.
→ We picked phased.
2. ETL vs Dual Writes:
- ETL was simpler, but missed live traffic.
- Dual writes were harder, but safer.
→ We picked dual writes.
3. Indexes During Migration:
- Building indexes during load slowed everything.
- Loading first, then indexing later was faster.
→ We delayed index creation.
The Human Side
- Business teams kept asking: “Why can’t you just do this over the weekend?”
- DBAs lost sleep worrying about hidden corruption.
- Developers stared at Grafana dashboards like parents watching a sick child’s heartbeat monitor.
When the cutover finally worked, we didn’t cheer. We just sat in silence, staring at green graphs.
And then someone cracked a joke:
“So… who’s writing the postmortem if this fails tomorrow?”
We laughed. Nervously.
Final Lessons (If You’re Designing for This)
- Design every job to be idempotent. You will rerun things.
- Disable indexes & constraints during bulk loads. Rebuild later.
- Use dual writes + unique IDs to avoid duplication.
- Run shadow reads to catch planner/encoding quirks.
- Warm caches before cutover, or your latency graph will look like a heart attack.
- Monitor internals (WAL, cache, deadlocks), not just app metrics.
- Keep a rollback plan. Confidence comes from knowing you can undo.
Closing Thought
We didn’t just move a billion records. We learned that migrations are not database problems; they’re system design problems.
You don’t migrate a billion rows at once. You migrate them one safe batch at a time, one WAL entry at a time, one checksum at a time.
That’s the secret.
Because when you treat migrations like building a distributed system, zero downtime isn’t luck. It’s design.