Why was Azure Service Bus only part of the damage?
A Reddit post described a system that generated $79,847 in charges over a single weekend.
The root cause appeared simple: a retry loop that fired every 50ms against Azure Service Bus.
The post claimed 847 million operations. The number sounded shocking. The math did not.
A closer look reveals something more interesting and common in cloud systems.
Service Bus was not the main cost driver. It was the spark that triggered a much larger cost cascade.
Using actual Azure Service Bus pricing for 2025, let’s break down:
- What could have happened?
- Where did the money likely go?
- Which safeguards were missing?
The bug that started it all: 50 millisecond retry loops
At the center of the incident was a retry loop with no upper limit. One service instance retrying every 50 milliseconds produces:
- 20 operations per second
- 1.73 million operations per day
- 5.19 million operations over three days
To achieve 847 million operations, the system required approximately 163 parallel instances running continuously.
That alone suggests this was not a single bug looping in isolation. There was more to it.
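Those numbers are easy to sanity-check. A quick sketch, using only the figures reported in the post:

```python
# Back-of-the-envelope check of the retry volume (all inputs from the post).
RETRY_INTERVAL_S = 0.05                     # one retry every 50 ms
ops_per_second = 1 / RETRY_INTERVAL_S       # 20 ops/s per instance
ops_per_day = ops_per_second * 86_400       # 1,728,000 ops/day
ops_per_weekend = ops_per_day * 3           # ~5.18M over three days

REPORTED_OPS = 847_000_000
instances_needed = REPORTED_OPS / ops_per_weekend

print(f"{ops_per_second:.0f} ops/s, {ops_per_day:,.0f} ops/day per instance")
print(f"~{instances_needed:.0f} parallel instances needed")  # ~163
```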
What Service Bus actually costs in 2025
Before blaming the messaging layer, it helps to look at pricing.
Azure Service Bus pricing
| Tier | Base cost | Operations cost |
| --- | --- | --- |
| Standard | $0.0135/hour | First 13M: free; 13M–100M: $0.80/M; 100M–2.5B: $0.50/M; >2.5B: $0.40/M |
| Premium | $0.977/hour per MU | ~$13 per million effective (capacity-based, no free tier) |
Now let’s apply that pricing to the reported volume.
Standard tier costs
847 million operations on the Standard tier
- First 13M operations: $0
- Next 87M operations: $69.60
- Remaining 747M operations: $373.50
- Base cost for 87 hours: $1.17
Total Service Bus cost: ~$444
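The tiered math above can be reproduced with a small helper. The prices come from the table, not from any Azure SDK; this is a sketch of the arithmetic, not a billing API:

```python
# Standard-tier Service Bus cost: tiered per-million pricing plus hourly base charge.
def standard_tier_cost(ops: int, hours: float) -> float:
    tiers = [                    # (upper bound in ops, $ per million ops)
        (13_000_000, 0.00),      # first 13M free
        (100_000_000, 0.80),     # 13M-100M
        (2_500_000_000, 0.50),   # 100M-2.5B
        (float("inf"), 0.40),    # beyond 2.5B
    ]
    cost, lower = 0.0, 0
    for upper, rate in tiers:
        if ops > lower:
            billable = min(ops, upper) - lower
            cost += billable / 1_000_000 * rate
        lower = upper
    return cost + 0.0135 * hours  # hourly base charge

print(f"${standard_tier_cost(847_000_000, 87):,.2f}")  # ~$444
```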
Service Bus alone could not explain an $80k bill.
That leads to the next question: what if this was Premium?
Premium tier changes the picture, but not enough
Premium pricing is capacity-based rather than per request.
There is no free tier, but throughput is much higher.
Assuming a realistic production setup with 10 Messaging Units:
- Messaging Units (MU): 10 × $0.977 per hour × 87 hours ≈ $850
- Estimated operation cost: 847M × ~$13 per million ≈ $11,011
Total Premium Service Bus cost: ~$11,860
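The same back-of-the-envelope check works for Premium. Note the post's total only adds up if the effective rate is about $13 per million operations (847M × $13/M ≈ $11,011), so that is the assumed rate used here:

```python
# Premium sketch: capacity-based MU charge plus an assumed ~$13/million effective rate.
MU_HOURLY = 0.977
mus, hours = 10, 87

mu_cost = mus * MU_HOURLY * hours               # ~$850
op_cost = 847_000_000 / 1_000_000 * 13          # ~$11,011
total = mu_cost + op_cost                       # ~$11,861

print(f"MU: ${mu_cost:,.0f}, ops: ${op_cost:,.0f}, total: ${total:,.0f}")
```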
Even in the worst-case scenario, Service Bus explains 10 to 15 percent of the bill.
The rest came from somewhere else.
Where the other $68,000 likely came from
Once retries escape the messaging layer, they rarely stay contained.
Each failed message often triggers other services: compute, logging, storage, and network traffic. That is where costs most likely grow:
| Service | Estimated cost | Trigger |
| --- | --- | --- |
| AKS compute | ~$2,175 | Stuck in retry loops |
| Azure Functions | ~$20,000 | Repeated failed executions |
| Storage | ~$5,000 | Queued messages and checkpoints |
| Network egress | ~$10,000 | Cross-region traffic |
| Database | ~$15,000 | Failed writes and retries |
| Logs and monitoring | ~$10,000 | Debug-level logging explosion |
| Service Bus | $444 to $11,860 | Message operations |
Total: roughly $63,000 to $74,000 depending on the Service Bus tier, in the same ballpark as the reported $79,847 bill.
Nothing was broken. Everything worked exactly as designed.
Why was the volume even possible?
At first glance, 847 million operations sounds excessive.
Service Bus limits explain why it happened.
Throughput limits by tier
| Tier | Throughput | Realistic for payments |
| --- | --- | --- |
| Standard | ~1,000 messages per second per namespace | No |
| Premium | ~5,000 messages per second per MU | Yes |
With Premium and enough messaging units, this volume is feasible.
There is another multiplier that many teams miss.
One retry is not one operation
A single failed retry can include:
- Send
- Peek or lock
- Abandon
- Dead letter
- Logging
In some scenarios, each retry counts as five to six billable operations.
847 million operations could therefore have come from around 140 million retry attempts.
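Working backwards from the reported volume, assuming a worst-case six operations per retry:

```python
# Each failed retry can produce several billable operations (send, lock,
# abandon, dead-letter, logging), so the retry count is smaller than the op count.
OPS_PER_RETRY = 6                 # assumed worst case from the list above
reported_ops = 847_000_000

retries = reported_ops / OPS_PER_RETRY
print(f"~{retries / 1e6:.0f} million retry attempts")  # ~141 million
```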
The real failure was layered, not singular
This incident was not caused by a single mistake. It was caused by several missing safeguards aligning.
| Layer | What failed | Result |
| --- | --- | --- |
| Code | No retry limit | Infinite loops |
| Monitoring | Only success metrics | Failures invisible |
| Alerts | Budget notifications only | No early stop |
| Quotas | No spending caps | No automatic brake |
| Architecture | No circuit breakers | Bug cascaded system-wide |
Any one of these could have limited the blast radius.
Five safeguards that would have limited the damage
None of these is exotic. Most are boring. That is why they matter.
- Cap retry attempts. Set a maximum number of retries before dead-lettering.
- Enforce cost boundaries. Use budgets with automated actions (such as alerts or stopping the service) when limits are exceeded.
- Monitor failure signals. Track abandon rates and dead-letter queue growth, not only successful messages.
- Add circuit breakers. Pause message processing when failure rates spike and allow cooldown time.
- Watch cost anomalies. Treat sudden cost changes as operational signals.
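The first and fourth safeguards fit in a few dozen lines. This is a minimal sketch, not the Azure SDK: `process` and `dead_letter` stand in for whatever handler and dead-letter call your messaging client provides, and the thresholds are illustrative:

```python
import time

MAX_ATTEMPTS = 5            # safeguard 1: cap retries before dead-lettering
FAILURE_THRESHOLD = 0.5     # safeguard 4: trip the breaker at a 50% failure rate
COOLDOWN_S = 30             # pause before processing resumes

class CircuitBreaker:
    """Pauses processing when the recent failure rate spikes."""
    def __init__(self, window: int = 100):
        self.results: list[bool] = []
        self.window = window
        self.open_until = 0.0

    def allow(self) -> bool:
        return time.monotonic() >= self.open_until

    def record(self, success: bool) -> None:
        self.results = (self.results + [success])[-self.window:]
        failures = self.results.count(False)
        if len(self.results) >= 10 and failures / len(self.results) > FAILURE_THRESHOLD:
            self.open_until = time.monotonic() + COOLDOWN_S  # open the breaker
            self.results.clear()

def handle(message, process, dead_letter, breaker: CircuitBreaker) -> bool:
    """Retry with a hard cap; dead-letter instead of looping forever."""
    for _ in range(MAX_ATTEMPTS):
        if not breaker.allow():
            return False            # breaker open: stop burning operations
        try:
            process(message)
            breaker.record(True)
            return True
        except Exception:
            breaker.record(False)
    dead_letter(message)            # give up after MAX_ATTEMPTS
    return False
```

With a hard cap, the worst case per message is six attempts and a dead-letter entry, not an unbounded 50ms loop.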
Final takeaway
Service Bus did not cause an $80k bill alone. The lack of guardrails allowed the spread.
Retry bugs happen in every distributed system. The bug alone did not write the invoice; the missing limits let it run that far.
Payment systems operate at this scale every day. The difference between a minor incident and a weekend disaster is usually one missing limit.
- The Service Bus cost was a small part of the bill
- Compute, storage, logs, and network did most of the damage
- The real enabler was a lack of visibility
The retry bug was obvious in hindsight. The missing safeguards were the real problem.
Sources: Azure Service Bus Pricing, Reddit Original Post, Service Bus Quotas