Why was Azure Service Bus only part of the damage?
A Reddit post described a system that generated $79,847 in charges over a single weekend.
The root cause appeared simple: a retry loop that fired every 50ms against Azure Service Bus.
The post claimed 847 million operations. The number sounded shocking. The math did not.
A closer look reveals something more interesting and common in cloud systems.
Service Bus was not the main cost driver. It was the spark that triggered a much larger cost cascade.
Using actual Azure Service Bus pricing for 2025, let’s break down:
- What could have happened?
- Where did the money likely go?
- Which safeguards were missing?
The bug that started it all: 50 millisecond retry loops
At the center of the incident was a retry loop with no upper limit. One service instance retrying every 50 milliseconds produces:
- 20 operations per second
- 1.73 million operations per day
- 5.19 million operations over three days
To achieve 847 million operations, the system required approximately 163 parallel instances running continuously.
That alone suggests this was not a single bug looping in isolation. There was more to it.
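Those numbers are easy to sanity-check. A quick sketch, using only the figures reported in the post:

```python
# Back-of-the-envelope check of the retry volume (all inputs from the post).
RETRY_INTERVAL_S = 0.05                     # one retry every 50 ms
ops_per_second = 1 / RETRY_INTERVAL_S       # 20 ops/s per instance
ops_per_day = ops_per_second * 86_400       # 1,728,000 ops/day
ops_per_weekend = ops_per_day * 3           # ~5.18M over three days

REPORTED_OPS = 847_000_000
instances_needed = REPORTED_OPS / ops_per_weekend

print(f"{ops_per_second:.0f} ops/s, {ops_per_day:,.0f} ops/day per instance")
print(f"~{instances_needed:.0f} parallel instances needed")  # ~163
```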
What Service Bus actually costs in 2025
Before blaming the messaging layer, it helps to look at pricing.
Azure Service Bus pricing
| Tier | Base cost | Operations cost |
| --- | --- | --- |
| Standard | $0.0135/hour | First 13M: free; 13M–100M: $0.80/M; 100M–2.5B: $0.50/M; >2.5B: $0.40/M |
| Premium | $0.977/hour per MU | ~$13 per million effective (capacity-based, no free tier) |
Now let’s apply that pricing to the reported volume.
Standard tier costs
847 million operations on the Standard tier
- First 13M operations: $0
- Next 87M operations: $69.60
- Remaining 747M operations: $373.50
- Base cost for 87 hours: $1.17
Total Service Bus cost: ~$444
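The tiered math above can be reproduced with a small helper. The prices come from the table, not from any Azure SDK; this is a sketch of the arithmetic, not a billing API:

```python
# Standard-tier Service Bus cost: tiered per-million pricing plus hourly base charge.
def standard_tier_cost(ops: int, hours: float) -> float:
    tiers = [                    # (upper bound in ops, $ per million ops)
        (13_000_000, 0.00),      # first 13M free
        (100_000_000, 0.80),     # 13M-100M
        (2_500_000_000, 0.50),   # 100M-2.5B
        (float("inf"), 0.40),    # beyond 2.5B
    ]
    cost, lower = 0.0, 0
    for upper, rate in tiers:
        if ops > lower:
            billable = min(ops, upper) - lower
            cost += billable / 1_000_000 * rate
        lower = upper
    return cost + 0.0135 * hours  # hourly base charge

print(f"${standard_tier_cost(847_000_000, 87):,.2f}")  # ~$444
```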
Service Bus alone could not explain an $80k bill.
That leads to the next question: what if this was Premium?
Premium tier changes the picture, but not enough
Premium pricing is capacity-based rather than per request.
There is no free tier, but throughput is much higher.
Assuming a realistic production setup with 10 Messaging Units:
- Messaging Units (MU): 10 × $0.977 per hour × 87 hours ≈ $850
- Estimated operation cost: 847M × ~$13 per million ≈ $11,011
Total Premium Service Bus cost: ~$11,860
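The same back-of-the-envelope check works for Premium. Note the post's total only adds up if the effective rate is about $13 per million operations (847M × $13/M ≈ $11,011), so that is the assumed rate used here:

```python
# Premium sketch: capacity-based MU charge plus an assumed ~$13/million effective rate.
MU_HOURLY = 0.977
mus, hours = 10, 87

mu_cost = mus * MU_HOURLY * hours               # ~$850
op_cost = 847_000_000 / 1_000_000 * 13          # ~$11,011
total = mu_cost + op_cost                       # ~$11,861

print(f"MU: ${mu_cost:,.0f}, ops: ${op_cost:,.0f}, total: ${total:,.0f}")
```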
Even in the worst-case scenario, Service Bus explains 10 to 15 percent of the bill.
The rest came from somewhere else.
Where the other $68,000 likely came from
Once retries escape the messaging layer, they rarely stay contained.
Each failed message often triggers other services: compute, logging, storage, and network traffic. That is where costs most likely grow:
| Service | Estimated cost | Trigger |
| --- | --- | --- |
| AKS compute | ~$2,175 | Stuck in retry loops |
| Azure Functions | ~$20,000 | Repeated failed executions |
| Storage | ~$5,000 | Queued messages and checkpoints |
| Network egress | ~$10,000 | Cross-region traffic |
| Database | ~$15,000 | Failed writes and retries |
| Logs and monitoring | ~$10,000 | Debug-level logging explosion |
| Service Bus | $444 to $11,860 | Message operations |
Total: roughly $63,000 to $74,000 depending on the Service Bus tier, in the same ballpark as the reported $79,847 bill.
Nothing was broken. Everything worked exactly as designed.
Why was the volume even possible?
At first glance, 847 million operations sounds excessive.
Service Bus limits explain why it happened.
Throughput limits by tier
| Tier | Throughput | Realistic for payments |
| --- | --- | --- |
| Standard | ~1,000 messages per second per namespace | No |
| Premium | ~5,000 messages per second per MU | Yes |
With Premium and enough messaging units, this volume is feasible.
There is another multiplier that many teams miss.
One retry is not one operation
A single failed retry can include:
- Send
- Peek or lock
- Abandon
- Dead letter
- Logging
In some scenarios, each retry counts as five to six billable operations.
847 million operations could therefore have come from around 140 million retry attempts.
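Working backwards from the reported volume, assuming a worst-case six operations per retry:

```python
# Each failed retry can produce several billable operations (send, lock,
# abandon, dead-letter, logging), so the retry count is smaller than the op count.
OPS_PER_RETRY = 6                 # assumed worst case from the list above
reported_ops = 847_000_000

retries = reported_ops / OPS_PER_RETRY
print(f"~{retries / 1e6:.0f} million retry attempts")  # ~141 million
```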
The real failure was layered, not singular
This incident was not caused by a single mistake. It was caused by several missing safeguards aligning.
| Layer | What failed | Result |
| --- | --- | --- |
| Code | No retry limit | Infinite loops |
| Monitoring | Only success metrics | Failures invisible |
| Alerts | Budget notifications only | No early stop |
| Quotas | No spending caps | No automatic brake |
| Architecture | No circuit breakers | Bug cascaded system-wide |
Any one of these could have limited the blast radius.
Five safeguards that would have limited the damage
None of these is exotic. Most are boring. That is why they matter.
- Cap retry attempts. Set a maximum number of retries before dead-lettering.
- Enforce cost boundaries. Use budgets with automated actions (such as alerts or stopping the service) when limits are exceeded.
- Monitor failure signals. Track abandon rates and dead-letter queue growth, not only successful messages.
- Add circuit breakers. Pause message processing when failure rates spike and allow cooldown time.
- Watch cost anomalies. Treat sudden cost changes as operational signals.
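The first and fourth safeguards fit in a few dozen lines. This is a minimal sketch, not the Azure SDK: `process` and `dead_letter` stand in for whatever handler and dead-letter call your messaging client provides, and the thresholds are illustrative:

```python
import time

MAX_ATTEMPTS = 5            # safeguard 1: cap retries before dead-lettering
FAILURE_THRESHOLD = 0.5     # safeguard 4: trip the breaker at a 50% failure rate
COOLDOWN_S = 30             # pause before processing resumes

class CircuitBreaker:
    """Pauses processing when the recent failure rate spikes."""
    def __init__(self, window: int = 100):
        self.results: list[bool] = []
        self.window = window
        self.open_until = 0.0

    def allow(self) -> bool:
        return time.monotonic() >= self.open_until

    def record(self, success: bool) -> None:
        self.results = (self.results + [success])[-self.window:]
        failures = self.results.count(False)
        if len(self.results) >= 10 and failures / len(self.results) > FAILURE_THRESHOLD:
            self.open_until = time.monotonic() + COOLDOWN_S  # open the breaker
            self.results.clear()

def handle(message, process, dead_letter, breaker: CircuitBreaker) -> bool:
    """Retry with a hard cap; dead-letter instead of looping forever."""
    for _ in range(MAX_ATTEMPTS):
        if not breaker.allow():
            return False            # breaker open: stop burning operations
        try:
            process(message)
            breaker.record(True)
            return True
        except Exception:
            breaker.record(False)
    dead_letter(message)            # give up after MAX_ATTEMPTS
    return False
```

With a hard cap, the worst case per message is six attempts and a dead-letter entry, not an unbounded 50ms loop.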
Final takeaway
Service Bus did not cause an $80k bill alone. The lack of guardrails allowed the spread.
Retry bugs happen in every distributed system. The bug alone did not write the invoice; the missing limits let it run that far.
Payment systems operate at this scale every day. The difference between a minor incident and a weekend disaster is usually one missing limit.
- The Service Bus cost was a small part of the bill
- Compute, storage, logs, and network did most of the damage
- The real enabler was a lack of visibility
The retry bug was obvious in hindsight. The missing safeguards were the real problem.
Sources: Azure Service Bus Pricing, Reddit Original Post, Service Bus Quotas