The Hidden Cost of Streaming Pipelines Nobody Talks About
Everyone in data engineering is obsessed with real time.
Kafka. Flink. Event-driven architectures. Millisecond latency. Live dashboards. It's the direction every conference talk points, every job description asks for, every architecture diagram proudly features.
And I bought into it completely.
About a year into my data engineering career, our product team came to us with a request: customers wanted to see their order status update in real time. Our existing batch pipeline ran at 2am every night, customers were calling support asking where their orders were.
Reasonable ask. So we rebuilt the pipeline as a streaming system.
Six months later, I had learned more about the real cost of streaming than any blog post or conference talk had ever prepared me for.
This is that story β and the honest breakdown I wish someone had given me before I started.
What Nobody Tells You Before You Go Streamingβ
The conversation around streaming pipelines almost always focuses on the benefits:
-
- Real-time insights
-
- Millisecond latency
-
- Event-driven reactions
-
- Live personalization
-
- Fraud detection before transactions complete
All of that is true. Streaming genuinely enables things batch never could.
But the conversation stops there. Nobody walks you through what you're signing up for on the other side of that decision.
Let me do that.
What We Had Before (And Why It Worked)β
Our original order pipeline was batch. It ran every night at 2am via Azure Data Factory, pulled 24 hours of orders from our SQL database, ran a Spark transformation job, and wrote clean Delta tables to ADLS Gen2.
Every night at 2am:
β
ADF Pipeline triggers
β
Pull all orders from the last 24 hours
β
Spark: clean β deduplicate β join product catalog
β
Write to Silver layer (Delta table on ADLS Gen2)
β
Aggregate into Gold layer
β
Power BI refreshes β customers see updated status
It ran in 45 minutes. Our Spark cluster spun up, did its job, and shut down. We paid for 45 minutes of compute per day. The pipeline was simple, debuggable, and recoverable, if something broke, we fixed it and replayed from Bronze.
The only problem: customers saw data that was 6 to 30 hours old depending on when they ordered.
For most use cases, that's fine. For order status, it wasn't.
Hidden Cost #1 - Infrastructure That Never Sleepsβ
The first thing that surprised me about our streaming pipeline was the infrastructure bill.
Our batch Spark cluster ran 45 minutes a day. Our Kafka + Flink setup runs every minute of every day - 24 hours, 7 days a week, whether there are 10 events per second or 10,000.
Streaming infrastructure requires 24/7 uptime. You can't spin it down overnight to save money. You can't schedule it during off-peak hours. The pipeline is always on, always consuming resources, always incurring cost.
For our team, the monthly compute cost for the streaming pipeline was roughly 4x what the equivalent batch job cost and that was before accounting for the additional engineering time to maintain it.
The question to ask before going streaming: Is the business value of real-time data worth 4x the infrastructure cost? Sometimes the answer is yes. Often it isn't.
Hidden Cost #2 - Late-Arriving Data Will Break Your Logicβ
In a batch pipeline, late data is not a problem. If an event arrives 3 hours late, it's in the next batch. The pipeline processes it, life goes on.
In a streaming pipeline, late-arriving data is one of the hardest problems in distributed systems.
Events can arrive out of order due to network delays, retries, or clock skew between services. Your Flink job is processing event #1,000 when event #987 suddenly arrives 45 seconds late. What do you do?
The answer involves watermarking, telling your stream processor "wait X seconds after the event time before closing a window, to account for late arrivals." But choosing the right watermark is a balance:
- Too short: you miss late-arriving events, your aggregations are wrong
- Too long: you hold state in memory longer, increasing latency and memory pressure
We got this wrong twice before landing on a configuration that worked. Both times, our order counts were silently off by 1-3%, small enough to look like noise, large enough to cause problems in financial reconciliation.
Late data problem illustrated:
Event time: 10:00 10:01 10:02 10:03 10:04
Arrived at: 10:00 10:01 10:04 10:03 10:05
β
event #3 arrived 2 minutes late
β already missed the 10:02 window
β your aggregate is wrong
In batch, this doesn't exist as a problem. In streaming, it's a constant engineering challenge.
Hidden Cost #3 - Exactly-Once Is Harder Than It Soundsβ
In a batch pipeline, if a job fails, you rerun it. Simple.
In a streaming pipeline, failure recovery is a distributed systems problem.
Say your Flink job crashes halfway through processing a Kafka topic partition. When it restarts, does it reprocess events it already processed? Does it skip events it hadn't gotten to yet? Does it process some events twice?
This is the exactly-once semantics problem ensuring each event is processed exactly once, not zero times, not twice.
Getting this right requires:
- Kafka consumer offset management
- Flink checkpoint configuration
- Idempotent writes to your output store
- Careful handling of state during recovery
We had a production incident in our first month where a Flink job restart caused approximately 2,000 order events to be processed twice. Customers received duplicate status update notifications. Support tickets spiked. It took us two days to identify, patch, and backfill the correct state.
A batch job that fails leaves your data unchanged. You fix it, rerun, done. A streaming job that fails mid-stream can leave your data in a partially-updated state that's genuinely difficult to reason about.
Hidden Cost #4 - Testing Is a Different Disciplineβ
Testing a batch pipeline is relatively straightforward. You have a dataset, you run the transformation, you check the output. Deterministic, reproducible, easy to validate.
Testing a streaming pipeline requires simulating event streams with realistic timing, ordering, and volume. You need to test:
- What happens when events arrive out of order?
- What happens when a consumer crashes and restarts?
- What happens when Kafka lag builds up during a traffic spike?
- What happens when an upstream service sends a malformed event?
We discovered most of our edge cases in production, not in testing. Not because we were careless, but because accurately simulating a live event stream in a test environment is genuinely difficult.
Our batch pipeline had a test suite that ran in 8 minutes. Our streaming pipeline's test suite took 40 minutes and still missed three production bugs in the first month.
Hidden Cost #5 - Your Team Needs Streaming-Specific Skillsβ
This one is easy to underestimate.
Batch data engineering skills - Spark, SQL, dbt, ADF are well-understood, well-documented, and widely held. If someone on your team leaves, finding a replacement with those skills is manageable.
Streaming-specific skills Kafka internals, Flink state management, watermarking strategies, consumer group management, exactly-once configuration are genuinely harder to find and take longer to develop.
When we hit our first major Flink issue (a state backend misconfiguration causing memory pressure under load), our team spent three days debugging something that an experienced Flink engineer would have spotted in 20 minutes. We didn't have one. We learned on the job, which is fine but it was expensive learning.
Before committing to a streaming architecture, ask: does your team have the skills to maintain it? And if not, what's the cost of developing those skills or hiring them?
So When Is Streaming Actually Worth It?β
None of this means streaming is wrong. It means streaming has a real cost that should be weighed against a real business need.
Streaming is worth it when the business problem genuinely cannot tolerate batch latency. Here's a clear test:
Reach for streaming when:
- Fraud needs to be detected before a transaction completes β batch latency means the fraud already happened
- A customer's app needs to reflect a change within seconds of it occurring
- A system needs to react to an event automatically β alerts, triggers, automated responses
- You're processing IoT sensor data where stale readings are dangerous, not just inconvenient
Stick with batch when:
- You're building monthly reports, financial summaries, or historical analyses
- Your stakeholders check dashboards in the morning, not the second
- Your transformations involve complex aggregations over large historical datasets
- Your team is small and operational simplicity matters more than latency
The tech industry is currently obsessed with "real-time," which has led many organizations to over-engineer their stacks implementing complex stream-processing frameworks where a simple batch job would have sufficed. A well-built batch pipeline is more reliable, cheaper, and easier to maintain than a poorly-justified streaming one.
The Architecture That Actually Works: Bothβ
Here's what I'd tell myself before starting that project:
You probably need both, not either/or.
Our final architecture uses batch for everything that can tolerate it, and streaming only for the specific cases that genuinely can't:
Streaming layer (Kafka + Flink):
Order events β real-time status updates (Cassandra)
Fraud signals β real-time alerts (notification service)
Batch layer (Spark + ADF):
Nightly order aggregations β Silver β Gold (Power BI)
Monthly revenue reports (finance team)
ML training datasets (data science team)

The streaming layer handles the 5% of use cases where seconds matter. The batch layer handles the 95% where they don't , more reliably, more cheaply, with less operational overhead.
Microsoft Fabric is built around exactly this pattern, Eventstreams for real-time ingestion, ADF Pipelines and Spark Notebooks for batch transformation, both writing to the same OneLake. You don't have to choose one architecture. You choose the right tool for each use case within the same platform.
The Honest Summaryβ
| Batch | Streaming | |
|---|---|---|
| Infrastructure cost | Low - runs on schedule | High - always on |
| Latency | Minutes to hours | Milliseconds to seconds |
| Late data | Not a problem | Significant engineering challenge |
| Failure recovery | Fix and rerun | Complex - risk of duplicates or data loss |
| Testing | Straightforward | Requires stream simulation |
| Team skills needed | Spark, SQL, ADF | Kafka, Flink, state management |
| Best for | Analytics, reporting, ML | Fraud detection, live status, alerts |
| Operational complexity | Low | High |
Streaming pipelines are powerful. They enable product experiences that batch simply can't deliver.
But they come with real costs - infrastructure that never sleeps, late-data handling that never stops being tricky, failure recovery that's genuinely hard to get right, and a skills requirement that's easy to underestimate.
The next time someone on your team says "we should make this real time", ask the question first:
How long can the business actually wait for this data?
If the honest answer is "overnight is fine" β keep the batch job. It's not boring. It's the right call.
References & Further Readingβ
- Databricks - Batch vs Streaming
- Apache Flink - Watermarks and Late Data
- Apache Kafka Documentation
- Microsoft Fabric - Real-Time Intelligence
- RecodeHive - How Netflix Handles Millions of Events Every Minute
- RecodeHive - Medallion Architecture Explained
- RecodeHive - Microsoft Fabric: One Platform, One Lake
About the Authorβ
I'm Aditya Singh Rathore, a Data Engineer passionate about building modern, scalable data platforms. I write about data engineering, Azure, and real-world pipeline design on RecodeHive, turning hard-won lessons into content anyone can learn from.
π© Have you been burned by a streaming pipeline that didn't need to be? Drop it in the comments.