You may or may not have heard of this statistic from Google:
53% of mobile site visits leave a page that takes longer than three seconds to load
With that in mind, it’s important to make our site as responsive and quick as possible.
We recently rewrote one of our APIs, Saved Items, from scratch to pave the way for new functionality and to improve performance by utilising new technologies such as .NET Core 2.
Fast forward to the end of development
The API has been rewritten. Functionally it’s performing as expected — we’ve got acceptance, integration, and performance tests in place. It looks like we’re good to go!
We release our API contract to the relevant teams that will consume it and they get to work on switching their calls to us from the old endpoints to the new endpoints.
Our mobile applications were the first to have a working implementation and once any functional anomalies were ironed out, we rolled out these application updates as beta versions to our customers.
And so the traffic starts cranking up on our new API. Scary stuff as it’s brand new. We’ve tested what we can, but we all know how environment configurations can behave differently. This was our first live traffic with a significant amount of load behind it.
But it’s all looking smooth. Until some time later, we start getting alerts. Uh oh.
We’re seeing some errors and dependency call response duration spikes in our logs.
Then it would calm down and our performance would return to normal.
Then maybe the next day, some errors and dependency call duration spikes would happen again.
We were in a random cycle of good performance, bad performance. There wasn’t any clear correlation as to what was happening. We were a bit stumped.
The impact to our customers was minimal — it was only our beta app users, but this was definitely blocking us from releasing to all of our customers. We had to fix it quickly, but first, we had to find out what was causing these problems.
After some brainstorming as a team, we had some ideas.
Thread blocking code
One common cause of slow, bottlenecked code is thread pool starvation — where the thread pool is so busy and exhausted that no threads are available to pick up new requests. These were the symptoms we were suffering from, so it was the best place to start.
Our application followed asynchronous best practices to the best of our knowledge — unless we’d missed some calls. But the code-base isn’t small. How would we find them?
After some research, we found the Ben.BlockingDetector library (github.com/benaadams/Ben.BlockingDetector) — blocking detection for ASP.NET Core.
And we can’t thank it enough!
To add Blocking Detection to your API, follow these steps:
- Install via NuGet:
Install-Package Ben.BlockingDetector -Version 0.0.3
- Call a single method on your application builder.
- There is no step three — it’s that easy.
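The single method in question is the library’s middleware registration. A minimal sketch, assuming the standard ASP.NET Core 2 Startup template (everything in the pipeline besides the one registration call is illustrative):

```csharp
using Ben.Diagnostics;
using Microsoft.AspNetCore.Builder;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        // Logs a warning with a stack trace whenever a thread-blocking
        // call (e.g. sync-over-async) happens while serving a request.
        app.UseBlockingDetection();

        app.UseMvc();
    }
}
```

Register it early in the pipeline so that blocking calls made by the middleware and handlers after it are observed.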
We pushed the API with this library included to our testing environments and ran our regression suite of tests against it.
We monitored the logs and started to see some stack traces.
Blocking method has been invoked and blocked, this can lead to threadpool starvation at …. blah blah
Following these stack traces, we found a number of blocking calls which had an async alternative. While yours (if you have any) will more than likely be different, these were ours:
Redis database connection
We use Redis caching through the StackExchange client package.
On the initial connection, we were calling the synchronous connect method. We think this was one of the biggest culprits. If it connected quickly and successfully — great, but if it was having issues, it may have sat there for, say, 30 seconds, blocking the thread. We changed this to the asynchronous alternative and awaited the call wherever we used it.
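With the StackExchange.Redis client, the change amounts to swapping Connect for ConnectAsync. A sketch (the connection string and wrapper shape are illustrative, not our production code):

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

public static class RedisConnection
{
    // Before: ConnectionMultiplexer.Connect("localhost:6379") held the
    // calling thread until the connection succeeded or gave up, which
    // under network issues could take many seconds.
    // After: the thread returns to the pool while the connection is
    // being established.
    public static Task<ConnectionMultiplexer> CreateAsync() =>
        ConnectionMultiplexer.ConnectAsync("localhost:6379");
}
```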
This is also why we didn’t catch it in our performance tests: we didn’t have any Redis connection issues while we were testing. Then in our live environment we did, for whatever reason, and it was very damaging to our application.
Redis bulk data management
As well with Redis, we work with collections and multiple records at once, and so we add to or delete from Redis in bulk. Previously we were bulking our requests like this:
batch.Execute() was another culprit that was blocking. Batching wasn’t doing us any favours, so we switched to immediately executing and awaiting all the steps as tasks.
Notice here that we aren’t creating a batch object; we’re creating records straight from our database connection. Notice also that we aren’t awaiting inside a foreach loop, because that’s slow — it waits for each record to complete before continuing. Instead, we kick off all the tasks first and await them all afterwards — this way they run in parallel.
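As a sketch using the StackExchange.Redis client (the record shape, key names, and StringSet operation are illustrative stand-ins for our real bulk writes), the before and after look roughly like this:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using StackExchange.Redis;

public class SavedItemsStore
{
    // Before: queue the operations on an IBatch, then call Execute() —
    // the call the blocking detector flagged.
    public void AddAll(IDatabase db, IDictionary<string, string> records)
    {
        var batch = db.CreateBatch();
        var tasks = records.Select(r => batch.StringSetAsync(r.Key, r.Value)).ToList();
        batch.Execute();            // flagged as blocking
        Task.WhenAll(tasks).Wait(); // also blocks the thread
    }

    // After: fire every operation straight from the database object,
    // then await them all together so they run in parallel.
    public Task AddAllAsync(IDatabase db, IDictionary<string, string> records)
    {
        var tasks = records.Select(r => db.StringSetAsync(r.Key, r.Value));
        return Task.WhenAll(tasks);
    }
}
```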
Console logging
And lastly, we use the Serilog library, which was also flagged as blocking. We were writing logs to two different places, and one of these was the console standard out.
We don’t look at the console output in production — we have another source for our logs that we’re able to query more easily. And after reading some interesting articles about production logging, we found discussion that console output is slow.
So with these three factors combined:
- It’s slow
- We didn’t read it
- It was a potentially blocking action
We disabled writing to the console for production.
The isDevelopmentEnvironment boolean is set based on the value of our environment variable.
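A sketch of the resulting Serilog setup (the file sink stands in for our real, queryable production sink; names and wiring are illustrative):

```csharp
using Serilog;

public static class Logging
{
    public static void Configure(bool isDevelopmentEnvironment)
    {
        var configuration = new LoggerConfiguration()
            .WriteTo.File("logs/api.log"); // stand-in for our queryable production sink

        // Console standard out is slow, potentially blocking, and unread
        // in production, so only enable it for development.
        if (isDevelopmentEnvironment)
        {
            configuration = configuration.WriteTo.Console();
        }

        Log.Logger = configuration.CreateLogger();
    }
}
```

The file and console sinks are separate NuGet packages (Serilog.Sinks.File and Serilog.Sinks.Console).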
We noticed a few other blocking stack traces, but these were outside of our control, e.g. JSON serialisation.
It may be worth asking any libraries’ authors about these so that they can investigate.
Cancelling redundant tasks
While we were performing this cleanup, we took a look at our classes responsible for making these dependency calls that were spiking in response duration randomly, to see if there was anything here that we’d missed.
Now, we use a popular library called Polly — it helps you easily control tasks such as timeouts, retrying failed code, and other powerful functionality.
Polly, for these dependency calls, was set up to:
- Timeout after 30 seconds
- Retry three times if the request fails (this can include failing due to the timeout)
- If it failed after three times, throw an exception back to our application so that we can return an error on our API response
So if it’s timing out after 30 seconds, we shouldn’t see that request taking any longer, right?
In our monitoring, we saw some requests taking over 10 minutes!
We did a form of chaos testing where we forced a request to be slow by using a middleman proxy. After some testing and debugging with this configured delay, we were behaving as expected — we were definitely timing out and returning an error due to the call failing within our configured timeout duration.
Back to the Polly documentation, and this important piece of information stuck out:
Polly will not risk the state of your application by unilaterally terminating threads.
So even though Polly had dealt with timing out that request and resuming our application, the request was still sitting on a thread somewhere in the background, running until it received a response itself. This was unwanted and unnecessary behaviour: that dependency call was no longer tied to an incoming request to our API, so even if it did eventually return successfully, it wouldn’t return to the request that initiated it. It was wasting resources — a completely wasted connection.
Polly states this about connections:
Waiting forever (having no timeout) is a bad design strategy: it specifically leads to the blocking up of threads or connections (itself often a cause of further failure)
If a dependency call times out, we needed to proactively cancel it to prevent it from running in the background and wasting our application’s resources.
The change was surprisingly simple, as Polly has an implementation to easily deal with cancellations.
The changes required:
- When declaring our Polly timeout policy, we must set the TimeoutStrategy to Optimistic (Optimistic should only be used for code that honours a CancellationToken).
- When executing our request through the Polly policy, we should use the overloaded method that takes a cancellation token.
The timeout policy declaration is a single line.
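A sketch of that declaration — the 30 seconds matches the configuration described above, and TimeoutStrategy.Optimistic relies on the executed code honouring its CancellationToken:

```csharp
using System;
using Polly;
using Polly.Timeout;

// Optimistic: on timeout, Polly signals the CancellationToken it hands
// to the executed delegate instead of terminating the thread.
var timeoutPolicy = Policy.TimeoutAsync(TimeSpan.FromSeconds(30), TimeoutStrategy.Optimistic);
```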
And our request changed to use the overload that accepts a cancellation token. CancellationToken.None can also be your own token.
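A sketch of the before and after, with an HttpClient call standing in for our real dependency call (URLs and type names are illustrative):

```csharp
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Polly;

public class DependencyClient
{
    private readonly HttpClient _client = new HttpClient();

    // Before: the delegate took no token, so a timed-out request kept
    // running in the background.
    public Task<HttpResponseMessage> GetBefore(IAsyncPolicy policy) =>
        policy.ExecuteAsync(() => _client.GetAsync("https://example.com/resource"));

    // After: pass Polly's linked token into the call so the underlying
    // request is cancelled when the timeout fires.
    public Task<HttpResponseMessage> GetAfter(IAsyncPolicy policy) =>
        policy.ExecuteAsync(
            ct => _client.GetAsync("https://example.com/resource", ct),
            CancellationToken.None); // CancellationToken.None can be your own token
}
```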
To explain what’s happening:
- We’re passing a CancellationToken (or none) into the Polly policy, and the policy passes us another CancellationToken in a lambda. This is important, as this is the token we should pass into our executable code.
- Internally, Polly creates its own CancellationTokenSource, links it to the CancellationToken that you pass in, and then passes that newly created token back to you.
- If you cancel your original token, the request will cancel (because they’re linked). If the request times out, Polly has its own CancellationTokenSource available to cancel the token, and the request will also cancel.
Since we released these changes, we haven’t seen any dependency calls taking over 30 seconds, meaning they’re successfully being cancelled by Polly and no connections are being unnecessarily wasted.
We’re seeing much less spiking in general as well, most likely because we’re no longer making blocking calls — threads are now available to pick up work, whereas before they weren’t always.
This is great — as before we were randomly receiving alerts every single day. Performance has definitely improved and our application monitoring is proof of this.
We’ve left BlockingDetection in our application, turned on only for our test environments. This means that going forward we can be confident we won’t introduce any more blocking calls, as long as we keep monitoring the logs and raise and fix what we find.
Async calls must be used wherever possible to avoid thread pool starvation.
As for long-running dependency calls, we now know that we need to provide cancellation support where appropriate, with the implementation in place of timing out and cancelling those tokens.
These best practices help ensure that we have a performant application and that we shouldn’t run into these issues again.