Byte-Sized Design

Byte-Sized Design

Share this post

Byte-Sized Design
Byte-Sized Design
🏗️ How Quora Migrated a Decade of Redshift to Trino (and Lived to Tell the Tale)

🏗️ How Quora Migrated a Decade of Redshift to Trino (and Lived to Tell the Tale)

Lessons in Migrating a Core Data System Without Breaking Everything

Byte-Sized Design's avatar
Byte-Sized Design
Apr 26, 2025
∙ Paid
12

Share this post

Byte-Sized Design
Byte-Sized Design
🏗️ How Quora Migrated a Decade of Redshift to Trino (and Lived to Tell the Tale)
3
Share

🚨 TL;DR

Over a decade, Redshift became tightly woven into Quora’s data systems, powering dashboards, A/B testing, and ad-hoc analytics. But as scale grew, so did pain: spiraling storage costs, constant firefighting, and architectural complexity.

In 2022, they committed to migrating everything to Trino. The journey involved auto-converting thousands of SQL queries, shadow-validating results at scale, optimizing joins manually, and ultimately cutting 50% of their costs.

This edition breaks down why Quora abandoned Redshift, how they pulled off the migration, and what you should watch out for when modernizing critical systems.


📖 What Will We Dive Into Today?

  • Why Redshift was great—until it wasn’t

  • The storage and maintenance traps at scale

  • How Quora auto-converted 2,000+ queries to Trino

  • How to Plan Migrations Without Meltdowns (Paid)

  • How to Rebuild Joins, Buckets, and Indexes for Speed (Paid)

  • How to Future-Proof Your Data Architecture (Paid)


🧱 Why Redshift Worked (At First)

When Quora started scaling, Redshift seemed perfect:

  • 🔥 Fast small queries: Ad-hoc analysis without heavy tuning

  • ⚡ Simple setup: Managed service, SQL native

  • 📈 Powerful for OLAP: Dashboards, product metrics, A/B tests

  • ✏️ Relational features: Easy updates/deletes compared to S3/Hive

But these early wins hid ticking time bombs.


💣 What Went Wrong

1. Complexity Explosion
Managing Redshift and Spark and S3/HDFS meant:

  • Constant data duplication

  • Different teams using different SQL dialects

  • Schema changes triggering cascading failures

  • Tooling and monitoring fragmentation

2. Storage Death Spiral
Using Redshift DS instances (compute + storage bundled) created an ugly trap:

  • Storage shortages despite unused compute

  • Only way to add storage? Buy more nodes (and waste money)

  • Constant "VACUUM" jobs needed to free space

  • Split-table strategies needed for long-term retention

3. Cost Overruns
Scaling Redshift required constant node additions, emergency provisioning, and manual maintenance—all of which drained engineering time and budget.


🚀 Why Trino?

Quora needed a query engine that could:

  • Read data straight from S3 Hive tables

  • Handle analytical SQL workloads

  • Be open-source, extensible, and self-hosted

  • Integrate smoothly into their Spark/S3 stack

Trino fit the bill. It was already in limited use—and it was fast, flexible, and free from storage lock-in.


🛠️ How They Migrated (Without Losing Their Minds)

🔍 Step 1: Full Query Observability

Custom Redshift query logging captured:

  • Full SQL text

  • Table dependencies

  • Query runtime

  • Call location (Python module + function)

No "unknown unknowns" left behind.

To learn more about observability read our previous edition “The Engineer’s Guide to Observability: Making Metrics, Logs, and Traces Work for You”.

The Engineer’s Guide to Observability: Making Metrics, Logs, and Traces Work for You

The Engineer’s Guide to Observability: Making Metrics, Logs, and Traces Work for You

Byte-Sized Design
·
Jan 19
Read full story

📋 Step 2: Double-Write All Tables

Every Redshift table was mirrored to a Hive equivalent via automated copy jobs.
This guaranteed no missing data during the migration—and decoupled data prep from query rewriting.

⚙️ Step 3: Automate SQL Conversion (Twice)

  • First attempt: Regex-based rewrites (DATEDIFF → DATE_DIFF, etc.)

  • Reality: Regex melted on complex queries (nested casts, lateral aliases).

Pivot: Adopted SQLGlot, an open-source SQL syntax tree converter:

  • Parsed full query trees

  • Handled deep rewrites safely

  • Added custom rules for quirks like Trino’s stricter type enforcement

🧪 Step 4: Shadow Pipelines for Validation

  • Every migrated query ran side-by-side with the original Redshift version.

  • Results were auto-compared, tolerating minor float rounding drift.

  • Alerts for any non-trivial data mismatches, triggering manual investigation.

We’ve talked about this in our previous edition “Modernizing Legacy Systems Without Breaking Production”. Give it a read!

Modernizing Legacy Systems Without Breaking Production

Modernizing Legacy Systems Without Breaking Production

Byte-Sized Design
·
Jan 5
Read full story

⚡ What Surprised Them

1. Query Performance Cliff
Some migrated queries ran 10x slower in Trino—because of bad join planning.

Root Cause: No table statistics = bad cost-based optimization.

Fix:

  • Run ANALYZE on big tables to generate Trino stats

  • Rewrite JOIN orders manually where needed

  • Bucket Hive tables and define sorted_by indexes for faster access

2. Updates and Deletes Got Harder
Redshift = easy in-place table updates.
Trino on S3 = full partition rewrites required.
Result? Longer write times and higher downtime risk.

3. Autoscaling Was Mandatory
To hit their SLA for daily A/B tests, Quora had to triple Trino cluster size—temporarily.
They built Airflow triggers to pre-scale clusters before peak jobs, then shrink afterward.
Saving costs without missing deadlines.


🧠 What Quora Learned (And So Should You)

  • Multiple storage systems will kill your agility. Consolidate early.

  • Storage/computation coupling (like Redshift DS nodes) is a silent budget killer.

  • Regex will always fail at serious query rewriting. Use syntax trees.

  • Auto-validation at scale saves you from silent data corruption.

  • Table stats are life or death for modern query optimizers.

  • Assume in-place updates may not exist in your future system. Design accordingly.


🧠 How to Plan Migrations Without Meltdowns

Most migrations fail because the transition wasn't survivable.
Here’s how to design a migration process that doesn’t blow up your team or your customers.

Keep reading with a 7-day free trial

Subscribe to Byte-Sized Design to keep reading this post and get 7 days of free access to the full post archives.

Already a paid subscriber? Sign in
© 2025 Byte-Sized Design
Privacy ∙ Terms ∙ Collection notice
Start writingGet the app
Substack is the home for great culture

Share