How Stripe built real-time billing analytics that actually works
TL;DR
Stripe’s batch-based billing analytics worked fine when updates could wait 24 hours. By 2024, customers demanded real-time visibility into MRR, churn, and conversions because in fast-moving markets, yesterday’s data loses deals today.
The problem? Subscriptions are stateful nightmares. Every $20 payment needs context from months of history. Batch processing couldn’t scale to sub-hour latency. Preaggregated queries were fast but couldn’t incorporate live data. And letting customers change metric definitions meant reprocessing years of history without breaking real-time ingestion.
The fix? Event-driven streaming with Apache Flink. A brand-new Apache Pinot query engine that aggregates on the fly. And a dual-mode system that recalculates history while streaming live updates without the dashboard ever going dark.
🚨 The Breaking Points
Batch Processing Hit a Wall
The old system recalculated subscription state by replaying every event from the beginning of time. Want to know if that June payment was on time? Re-analyze January through June. For every subscription. Every 24 hours.
This worked until customers started asking: “Why can’t I see this trial conversion that just happened?” Because the batch job won’t run for another 18 hours, that’s why.
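To see the cost concretely, here's a minimal sketch of that replay pattern; the event shape and fields are hypothetical, not Stripe's actual pipeline:

```python
from dataclasses import dataclass

@dataclass
class PaymentEvent:
    subscription_id: str
    month: int           # months since the subscription started
    amount_cents: int
    paid_on_time: bool

def nightly_batch_state(events: list[PaymentEvent]) -> dict[str, dict]:
    """Rebuild every subscription's state from scratch, as a stateless
    batch job must: replay the full history on every run."""
    state: dict[str, dict] = {}
    for e in sorted(events, key=lambda e: e.month):  # January through June, again
        s = state.setdefault(e.subscription_id, {"mrr_cents": 0, "late_payments": 0})
        s["mrr_cents"] = e.amount_cents
        if not e.paid_on_time:
            s["late_payments"] += 1
    return state  # O(total history) of work, repeated every 24 hours
```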
Preaggregation Made Queries Fast But Data Stale
Apache Pinot delivered sub-second dashboard queries by precomputing MRR over time in offline batch jobs. Fast responses, but baked-in staleness. Real-time streaming meant throwing out preaggregation, which risked slow, unresponsive queries that would make the dashboard unusable.
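The tradeoff fits in a few lines. In this sketch (illustrative data structures, not Pinot's storage model), the precomputed path answers instantly but only knows what the last batch run wrote, while the query-time path is always current but rescans raw events on every request:

```python
from collections import defaultdict

# Precomputed path: an offline batch job already rolled MRR up by day.
# Queries are cheap lookups, but only as fresh as the last batch run.
precomputed_mrr_by_day = {"2024-06-01": 4_200_000, "2024-06-02": 4_350_000}

def query_precomputed(day: str) -> int:
    return precomputed_mrr_by_day.get(day, 0)  # fast, but stale

# On-the-fly path: aggregate raw events at query time. Always current,
# but now the engine does the heavy lifting on every query.
def query_on_the_fly(raw_events: list[tuple[str, int]], day: str) -> int:
    totals: dict[str, int] = defaultdict(int)
    for event_day, amount_cents in raw_events:
        totals[event_day] += amount_cents
    return totals[day]  # cost grows with raw event volume
```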
Custom Metric Definitions Created a Consistency Nightmare
Customers could tweak MRR formulas (exclude coupons, adjust trial periods, etc.). Great for flexibility. Terrible for streaming systems. Change a definition? Now you need to:
Reprocess 8 years of historical data (hours of computation)
Keep streaming new events using the old definition (can’t stop the world)
Somehow merge them without showing Frankenstein data in the dashboard
There was no playbook for this.
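One shape such a playbook can take, sketched here with hypothetical names and a deliberately simplified cutover rule, is a dual-mode view: the live stream keeps serving the old definition while a backfill rebuilds history under the new one, and reads flip only once the backfill's watermark reaches the point where live coverage began:

```python
class DualModeMetric:
    """Sketch of a dual-mode view: the live stream keeps the old
    definition's view current while a backfill rebuilds history under
    the new definition. Deliberately simplified; not Stripe's code."""

    def __init__(self):
        self.live_view = {}        # old definition, fed by the stream
        self.backfill_view = {}    # new definition, rebuilt from history
        self.stream_start = None   # where new-definition live coverage begins
        self.backfill_watermark = None

    def on_live_event(self, ts, key, old_value, new_value):
        if self.stream_start is None:
            self.stream_start = ts
        self.live_view[key] = old_value      # keep the old view serving
        self.backfill_view[key] = new_value  # apply the new definition going forward

    def on_backfill_event(self, ts, key, new_value):
        # The backfill replays history oldest-first; it must not clobber
        # keys the live stream already wrote under the new definition.
        self.backfill_view.setdefault(key, new_value)
        self.backfill_watermark = ts

    def read(self, key):
        caught_up = (self.stream_start is not None
                     and self.backfill_watermark is not None
                     and self.backfill_watermark >= self.stream_start)
        # Serve exactly one consistent definition, never a blend.
        return (self.backfill_view if caught_up else self.live_view).get(key)
```

The property worth keeping from this sketch is the last line: a read always comes from exactly one definition's view, so the dashboard never shows that Frankenstein mix.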
🔍 Root Causes
1. Stateful Data Modeled with Stateless Batch Jobs
Subscriptions have memory. Payments build on each other. But the analytics system pretended each batch was independent—forcing full history replays to reconstruct state.
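The fix direction falls out of the diagnosis: key the state by subscription and update it event by event, the way a keyed stateful stream operator (Flink's keyed state, for instance) does. The sketch below is plain Python standing in for that idea, not Flink's API:

```python
class SubscriptionState:
    """Per-subscription memory, held the way a keyed stateful stream
    operator keeps state per key. Plain Python purely for illustration."""

    def __init__(self):
        self.mrr_cents = 0
        self.late_payments = 0

    def apply(self, amount_cents: int, paid_on_time: bool) -> None:
        self.mrr_cents = amount_cents
        if not paid_on_time:
            self.late_payments += 1

states: dict[str, SubscriptionState] = {}

def on_event(subscription_id: str, amount_cents: int, paid_on_time: bool) -> None:
    # One event in, one O(1) state update out: no full-history replay.
    states.setdefault(subscription_id, SubscriptionState()).apply(amount_cents, paid_on_time)
```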
2. OLAP Optimization Assumed Offline Preparation
Pinot’s speed came from precomputed aggregations. Remove that step for real-time data, and suddenly you’re doing complex windowed aggregations at query time—something the original engine couldn’t handle.
3. No Strategy for Incremental Schema Evolution
Metric definition changes were treated as “reindex everything from scratch” events. No concept of applying changes incrementally while preserving consistency.
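A common remedy, sketched below with illustrative fields rather than Stripe's actual schema, is to treat the metric definition itself as versioned data: changing it then mints a new version to recompute against, instead of triggering an unstructured reindex-everything event:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MrrDefinition:
    """A customer-tunable metric definition captured as versioned data."""
    version: int
    exclude_coupons: bool   # ignore coupon discounts when counting MRR
    count_trials: bool      # include trialing subscriptions in MRR

def mrr_contribution(defn: MrrDefinition, amount_cents: int,
                     coupon_cents: int, is_trial: bool) -> int:
    """Evaluate one subscription's MRR contribution under a given
    definition version; a definition change becomes a new version to
    backfill against, not a rewrite of the pipeline."""
    if is_trial and not defn.count_trials:
        return 0
    return amount_cents if defn.exclude_coupons else amount_cents - coupon_cents
```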
🧠 The Solution Architecture