🏗️ How Quora Migrated a Decade of Redshift to Trino (and Lived to Tell the Tale)

Lessons in Migrating a Core Data System Without Breaking Everything

Apr 26, 2025

∙ Paid

🚨 TL;DR

Over a decade, Redshift became tightly woven into Quora’s data systems, powering dashboards, A/B testing, and ad-hoc analytics. But as scale grew, so did pain: spiraling storage costs, constant firefighting, and architectural complexity.

In 2022, they committed to migrating everything to Trino. The journey involved auto-converting thousands of SQL queries, shadow-validating results at scale, optimizing joins manually, and ultimately cutting 50% of their costs.

This edition breaks down why Quora abandoned Redshift, how they pulled off the migration, and what you should watch out for when modernizing critical systems.

📖 What Will We Dive Into Today?

Why Redshift was great—until it wasn’t
The storage and maintenance traps at scale
How Quora auto-converted 2,000+ queries to Trino
How to Plan Migrations Without Meltdowns (Paid)
How to Rebuild Joins, Buckets, and Indexes for Speed (Paid)
How to Future-Proof Your Data Architecture (Paid)

🧱 Why Redshift Worked (At First)

When Quora started scaling, Redshift seemed perfect:

🔥 Fast small queries: Ad-hoc analysis without heavy tuning
⚡ Simple setup: Managed service, SQL native
📈 Powerful for OLAP: Dashboards, product metrics, A/B tests
✏️ Relational features: Easy updates/deletes compared to S3/Hive

But these early wins hid ticking time bombs.

💣 What Went Wrong

1. Complexity Explosion
Managing Redshift and Spark and S3/HDFS meant:

Constant data duplication
Different teams using different SQL dialects
Schema changes triggering cascading failures
Tooling and monitoring fragmentation

2. Storage Death Spiral
Using Redshift DS instances (compute + storage bundled) created an ugly trap:

Storage shortages despite unused compute
Only way to add storage? Buy more nodes (and waste money)
Constant "VACUUM" jobs needed to free space
Split-table strategies needed for long-term retention

3. Cost Overruns
Scaling Redshift required constant node additions, emergency provisioning, and manual maintenance—all of which drained engineering time and budget.

🚀 Why Trino?

Quora needed a query engine that could:

Read data straight from S3 Hive tables
Handle analytical SQL workloads
Be open-source, extensible, and self-hosted
Integrate smoothly into their Spark/S3 stack

Trino fit the bill. It was already in limited use—and it was fast, flexible, and free from storage lock-in.

🛠️ How They Migrated (Without Losing Their Minds)

🔍 Step 1: Full Query Observability

Custom Redshift query logging captured:

Full SQL text
Table dependencies
Query runtime
Call location (Python module + function)

No "unknown unknowns" left behind.

To learn more about observability read our previous edition “The Engineer’s Guide to Observability: Making Metrics, Logs, and Traces Work for You”.

The Engineer’s Guide to Observability: Making Metrics, Logs, and Traces Work for You

Byte-Sized Design

Jan 19

Read full story

📋 Step 2: Double-Write All Tables

Every Redshift table was mirrored to a Hive equivalent via automated copy jobs.
This guaranteed no missing data during the migration—and decoupled data prep from query rewriting.

⚙️ Step 3: Automate SQL Conversion (Twice)

First attempt: Regex-based rewrites (DATEDIFF → DATE_DIFF, etc.)
Reality: Regex melted on complex queries (nested casts, lateral aliases).

Pivot: Adopted SQLGlot, an open-source SQL syntax tree converter:

Parsed full query trees
Handled deep rewrites safely
Added custom rules for quirks like Trino’s stricter type enforcement

🧪 Step 4: Shadow Pipelines for Validation

Every migrated query ran side-by-side with the original Redshift version.
Results were auto-compared, tolerating minor float rounding drift.
Alerts for any non-trivial data mismatches, triggering manual investigation.

We’ve talked about this in our previous edition “Modernizing Legacy Systems Without Breaking Production”. Give it a read!

Modernizing Legacy Systems Without Breaking Production

Byte-Sized Design

Jan 5

Read full story

⚡ What Surprised Them

1. Query Performance Cliff
Some migrated queries ran 10x slower in Trino—because of bad join planning.

Root Cause: No table statistics = bad cost-based optimization.

Fix:

Run ANALYZE on big tables to generate Trino stats
Rewrite JOIN orders manually where needed
Bucket Hive tables and define sorted_by indexes for faster access

2. Updates and Deletes Got Harder
Redshift = easy in-place table updates.
Trino on S3 = full partition rewrites required.
Result? Longer write times and higher downtime risk.

3. Autoscaling Was Mandatory
To hit their SLA for daily A/B tests, Quora had to triple Trino cluster size—temporarily.
They built Airflow triggers to pre-scale clusters before peak jobs, then shrink afterward.
Saving costs without missing deadlines.

🧠 What Quora Learned (And So Should You)

Multiple storage systems will kill your agility. Consolidate early.
Storage/computation coupling (like Redshift DS nodes) is a silent budget killer.
Regex will always fail at serious query rewriting. Use syntax trees.
Auto-validation at scale saves you from silent data corruption.
Table stats are life or death for modern query optimizers.
Assume in-place updates may not exist in your future system. Design accordingly.

🧠 How to Plan Migrations Without Meltdowns

Most migrations fail because the transition wasn't survivable.
Here’s how to design a migration process that doesn’t blow up your team or your customers.

Keep reading with a 7-day free trial

Subscribe to Byte-Sized Design to keep reading this post and get 7 days of free access to the full post archives.