
How Stripe built real-time billing analytics that actually works

Byte-Sized Design
Dec 09, 2025

TL;DR

Stripe’s batch-based billing analytics worked fine when updates could wait 24 hours. By 2024, customers demanded real-time visibility into MRR (monthly recurring revenue), churn, and conversions, because in fast-moving markets, yesterday’s data loses deals today.

The problem? Subscriptions are stateful nightmares. Every $20 payment needs context from months of history. Batch processing couldn’t scale to sub-hour latency. Preaggregated queries were fast but couldn’t incorporate live data. And letting customers change metric definitions meant reprocessing years of history without breaking real-time ingestion.

The fix? Event-driven streaming with Apache Flink. A new query engine for Apache Pinot that aggregates on the fly. And a dual-mode system that recalculates history while streaming live updates, without the dashboard ever going dark.

🚨 The Breaking Points

Batch Processing Hit a Wall

The old system recalculated subscription state by replaying every event from the beginning of time. Want to know if that June payment was on-time? Re-analyze January through June. For every subscription. Every 24 hours.

This worked until customers started asking: “Why can’t I see this trial conversion that just happened?” Because the batch job won’t run for another 18 hours, that’s why.
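To see why that hurts, here’s a minimal sketch of the full-replay pattern (the types and event names are invented for illustration, not Stripe’s actual code). Every nightly run folds over the entire event history just to recover current state:

```java
import java.time.Instant;
import java.util.List;

// Invented types for illustration; not Stripe's schema.
record SubscriptionEvent(Instant at, String type, long amountCents) {}

record SubscriptionState(long mrrCents, boolean active) {
    static SubscriptionState initial() { return new SubscriptionState(0, false); }
}

final class BatchReplay {
    // Rebuilds state by folding over the ENTIRE history. Checking whether a
    // June payment was on time means re-reading January through June, for
    // every subscription, on every nightly run.
    static SubscriptionState replay(List<SubscriptionEvent> fullHistory) {
        SubscriptionState state = SubscriptionState.initial();
        for (SubscriptionEvent e : fullHistory) {
            state = apply(state, e); // cost grows with total history, not with new events
        }
        return state;
    }

    static SubscriptionState apply(SubscriptionState s, SubscriptionEvent e) {
        return switch (e.type()) {
            case "activate" -> new SubscriptionState(s.mrrCents() + e.amountCents(), true);
            case "cancel"   -> new SubscriptionState(0, false);
            default         -> s; // payments, trials, etc. elided in this sketch
        };
    }
}
```

The cost of each run grows with total history, not with what changed since yesterday. That’s the wall.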

Preaggregation Made Queries Fast But Data Stale

Apache Pinot delivered sub-second dashboard queries by precomputing MRR over time in offline batch jobs. Fast responses, but baked-in staleness. Real-time streaming meant throwing out preaggregation, which risked slow, unresponsive queries that would make the dashboard unusable.
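To make the tension concrete, here’s a hedged sketch of the two query paths (hypothetical names, not Pinot’s actual API): a precomputed rollup answers instantly but only knows about the last batch run, while aggregating raw events at query time is always fresh but scales with event volume.

```java
import java.time.YearMonth;
import java.util.List;
import java.util.Map;

// Hypothetical types; not Pinot's API.
record PaymentEvent(YearMonth month, long amountCents) {}

final class MrrQueryPaths {
    // Path 1: preaggregated. The offline batch job already wrote one number
    // per month, so the dashboard does a lookup. Sub-second, but frozen at
    // the last batch run; the trial that converted an hour ago is invisible.
    static long precomputedMrr(Map<YearMonth, Long> offlineRollup, YearMonth m) {
        return offlineRollup.getOrDefault(m, 0L);
    }

    // Path 2: on the fly. Aggregate raw events at query time. Always fresh,
    // but the cost now scales with event volume per query, which is exactly
    // what threatened to make the dashboard unusable.
    static long onTheFlyMrr(List<PaymentEvent> rawEvents, YearMonth m) {
        return rawEvents.stream()
                .filter(e -> e.month().equals(m))
                .mapToLong(PaymentEvent::amountCents)
                .sum();
    }
}
```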

Custom Metric Definitions Created a Consistency Nightmare

Customers could tweak MRR formulas (exclude coupons, adjust trial periods, etc.). Great for flexibility. Terrible for streaming systems. Change a definition? Now you need to:

  1. Reprocess 8 years of historical data (hours of computation)

  2. Keep streaming new events using the old definition (can’t stop the world)

  3. Somehow merge them without showing Frankenstein data in the dashboard

There was no playbook for this.
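One plausible shape for that merge, sketched below with invented names (this is not Stripe’s published implementation): version every metric definition, let the backfill recompute history under the new version while live ingestion keeps writing under the old one, and flip readers atomically once the backfill catches up to the live high-watermark.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not Stripe's implementation. Every recomputed row is
// tagged with a definition version; queries pin to exactly one version, so
// the dashboard never mixes old- and new-definition numbers.
final class DefinitionCutover {
    record Served(int definitionVersion, Instant backfilledThrough) {}

    private final AtomicReference<Served> served =
            new AtomicReference<>(new Served(1, Instant.EPOCH));

    // The backfill job reports progress as it recomputes history under the
    // new definition, while live ingestion keeps writing under the old one.
    void onBackfillProgress(int newVersion, Instant backfilledThrough, Instant liveHighWatermark) {
        // Flip only once recomputed history has caught up with the live
        // stream; one atomic swap moves every reader at the same instant.
        if (!backfilledThrough.isBefore(liveHighWatermark)) {
            served.set(new Served(newVersion, backfilledThrough));
        }
    }

    // Dashboard queries ask which version to read. Before the flip: old
    // definition, fully consistent. After: new definition, fully consistent.
    int versionForQuery() {
        return served.get().definitionVersion();
    }
}
```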

🔍 Root Causes

1. Stateful Data Modeled with Stateless Batch Jobs

Subscriptions have memory. Payments build on each other. But the analytics system pretended each batch was independent—forcing full history replays to reconstruct state.
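The TL;DR names Apache Flink as the remedy, and keyed state is the idiom that fixes this root cause: each subscription keeps its running state in Flink’s checkpointed state backend, so each event becomes an O(1) incremental update instead of a trigger for full-history replay. A minimal sketch (the event and output types are invented):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Invented event/output types for the sketch.
class BillingEvent {
    public String subscriptionId;
    public long mrrDeltaCents; // +$20 payment, -$20 cancellation, etc.
}

class MrrUpdate {
    public String subscriptionId;
    public long mrrCents;
    public MrrUpdate(String id, long mrr) { subscriptionId = id; mrrCents = mrr; }
}

// Keyed by subscriptionId: Flink keeps one state value per subscription,
// checkpointed and restored automatically, so state survives restarts
// without replaying history from the beginning of time.
public class SubscriptionMrrFn
        extends KeyedProcessFunction<String, BillingEvent, MrrUpdate> {

    private transient ValueState<Long> mrrCents;

    @Override
    public void open(Configuration parameters) {
        mrrCents = getRuntimeContext().getState(
                new ValueStateDescriptor<>("mrrCents", Long.class));
    }

    @Override
    public void processElement(BillingEvent event, Context ctx, Collector<MrrUpdate> out)
            throws Exception {
        Long current = mrrCents.value();  // months of context, one state read
        long updated = (current == null ? 0L : current) + event.mrrDeltaCents;
        mrrCents.update(updated);         // O(1) per event, no replay
        out.collect(new MrrUpdate(ctx.getCurrentKey(), updated));
    }
}
```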

2. OLAP Optimization Assumed Offline Preparation

Pinot’s speed came from precomputed aggregations. Remove that step for real-time data, and suddenly you’re doing complex windowed aggregations at query time—something the original engine couldn’t handle.

3. No Strategy for Incremental Schema Evolution

Metric definition changes were treated as “reindex everything from scratch” events. No concept of applying changes incrementally while preserving consistency.

🧠 The Solution Architecture
