Testing races with a slow Decorator

Delaying database interactions for test purposes.
In chapter 12 in Code That Fits in Your Head, I cover a typical race condition and how to test for it. The book comes with a pedagogical explanation of the problem, including a diagram in the style of Designing Data-Intensive Applications. In short, the problem occurs when two or more clients are competing for the last remaining seats in a particular time slot.
In my two-day workshop based on the book, I also cover this scenario. The goal is to show how to write automated tests for this kind of non-deterministic behaviour. In the book, and in the workshop, my approach is to rely on the law of large numbers. An automated test attempts to trigger the race condition by trying 'enough' times. A timeout on the test assumes that if the test does not trigger the condition in the allotted time window, then the bug is addressed.
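To make the 'try enough times' idea concrete, here's a minimal sketch of such a retry loop. It isn't the test from the book; the RepeatUntilTimeout class and its ObservesRace method are hypothetical names that I made up for illustration. The attempt delegate stands in for one round of competing requests, and is assumed to return true if it caught the system overbooking.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper, not part of the book's code base.
public static class RepeatUntilTimeout
{
    // Runs 'attempt' over and over until it reports that the race was
    // observed, or until the time budget is spent.
    // Returns true if the race was observed within the budget.
    public static async Task<bool> ObservesRace(
        Func<Task<bool>> attempt,
        TimeSpan budget)
    {
        if (attempt is null)
            throw new ArgumentNullException(nameof(attempt));

        var stopwatch = Stopwatch.StartNew();
        while (stopwatch.Elapsed < budget)
        {
            if (await attempt().ConfigureAwait(false))
                return true;
        }

        // Running out of time only means that the race was not seen,
        // not that it cannot happen.
        return false;
    }
}
```

A test built on a loop like this passes when ObservesRace returns false, which is exactly why it has to spend the entire time budget whenever the bug is absent.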
At one of my workshops, one participant told me of a more efficient and elegant way to test for this. I wish I could remember exactly at which workshop it was, and who the gentleman was, but alas, it escapes me.
Reproducing the condition
How do you deterministically reproduce non-deterministic behaviour? The default answer is almost tautological. You can't, since it's non-deterministic.
The irony, however, is that in the workshop, I deterministically demonstrate the problem. The problem, in short, is that in order to decide whether or not to accept a reservation request, the system first reads data from its database, runs a fairly complex piece of decision logic, and finally writes the reservation to the database - if it decides to accept it, based on what it read. When competing processes vie for the last remaining seats, a race may occur where both (or all) base their decision on the same data, so they all come to the conclusion that they still have enough remaining capacity. Again, refer to the book, and its accompanying code base, for the details.
How do I demonstrate this condition in the workshop? I go into the Controller code and insert a temporary, human-scale delay after reading from the database, but before making the decision:
    var reservations = await Repository.ReadReservations(r.At);
    await Task.Delay(TimeSpan.FromSeconds(10));
    if (!MaitreD.WillAccept(DateTime.Now, reservations, r))
        return NoTables500InternalServerError();
    await Repository.Create(restaurant.Id, reservation);
Then I open two windows, from which I, within a couple of seconds of each other, try to make competing reservations. When the bug is present, both reservations are accepted, although, according to business rules, only one should be.
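If you don't have the book's code base at hand, the same read-decide-write race can be reproduced in miniature. The following is only a sketch under my own assumptions: TinyRestaurant is a hypothetical in-memory stand-in, not a class from the book, and it deliberately lacks any transaction control.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical in-memory stand-in for the reservation system.
public sealed class TinyRestaurant
{
    private readonly object gate = new object();
    private readonly List<int> quantities = new List<int>();
    private readonly int capacity;
    private readonly TimeSpan delay;

    public TinyRestaurant(int capacity, TimeSpan delay)
    {
        this.capacity = capacity;
        this.delay = delay;
    }

    public int SeatsBooked
    {
        get { lock (gate) { return quantities.Sum(); } }
    }

    public async Task<bool> TryReserve(int quantity)
    {
        // 1. Read the current state.
        int booked;
        lock (gate) { booked = quantities.Sum(); }

        // 2. Artificial pause between reading and deciding.
        await Task.Delay(delay).ConfigureAwait(false);

        // 3. Decide, based on data that may be stale by now.
        if (booked + quantity > capacity)
            return false;

        // 4. Write. The locks protect the list itself, but not the
        //    read-decide-write sequence as a whole.
        lock (gate) { quantities.Add(quantity); }
        return true;
    }
}
```

Start two TryReserve(10) calls against a TinyRestaurant with a capacity of ten, without awaiting the first before starting the second, and both read zero booked seats before either writes. With a delay of a few hundred milliseconds, both calls are accepted in practice.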
So that's how to deterministically demonstrate the problem. Just insert a long enough delay.
We can't, however, leave such delays in the production code, so I never even considered that this simple technique could be used for automated testing.
Slowing things down with a Decorator
That's until my workshop participant told me his trick: Why don't you slow down the database interactions for test-purposes only? At first, I thought he had in mind some nasty compiler pragmas or environment hacks, but no. Why don't you use a Decorator to slow things down?
Indeed, why not?
Fortunately, all database interaction already takes place behind an IReservationsRepository interface. Adding a test-only, delaying Decorator is straightforward.
    public sealed class SlowReservationsRepository : IReservationsRepository
    {
        private readonly TimeSpan halfDelay;

        public SlowReservationsRepository(
            TimeSpan delay,
            IReservationsRepository inner)
        {
            Delay = delay;
            halfDelay = delay / 2;
            Inner = inner;
        }

        public TimeSpan Delay { get; }
        public IReservationsRepository Inner { get; }

        public async Task Create(int restaurantId, Reservation reservation)
        {
            await Task.Delay(halfDelay);
            await Inner.Create(restaurantId, reservation);
            await Task.Delay(halfDelay);
        }

        public async Task Delete(int restaurantId, Guid id)
        {
            await Task.Delay(halfDelay);
            await Inner.Delete(restaurantId, id);
            await Task.Delay(halfDelay);
        }

        public async Task<Reservation?> ReadReservation(
            int restaurantId,
            Guid id)
        {
            await Task.Delay(halfDelay);
            var result = await Inner.ReadReservation(restaurantId, id);
            await Task.Delay(halfDelay);
            return result;
        }

        public async Task<IReadOnlyCollection<Reservation>> ReadReservations(
            int restaurantId,
            DateTime min,
            DateTime max)
        {
            await Task.Delay(halfDelay);
            var result = await Inner.ReadReservations(restaurantId, min, max);
            await Task.Delay(halfDelay);
            return result;
        }

        public async Task Update(int restaurantId, Reservation reservation)
        {
            await Task.Delay(halfDelay);
            await Inner.Update(restaurantId, reservation);
            await Task.Delay(halfDelay);
        }
    }
This one uniformly slows down all operations. I arbitrarily decided to split the Delay in half, in order to apply half of it before each action, and the other half after. Honestly, I didn't mull this over too much; I just thought that if I did it that way, I wouldn't have to speculate whether it would make a difference if the delay happened before or after the action in question.
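Since every method in the Decorator repeats the same delay-act-delay pattern, you could factor it out. This is just a sketch of one way to do that, not what the code base actually does; the Delayed class and its Do and Get methods are hypothetical names.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper that captures the 'half before, half after' pattern.
public static class Delayed
{
    // For actions that return no value, such as Create, Delete, and Update.
    public static async Task Do(TimeSpan halfDelay, Func<Task> action)
    {
        await Task.Delay(halfDelay).ConfigureAwait(false);
        await action().ConfigureAwait(false);
        await Task.Delay(halfDelay).ConfigureAwait(false);
    }

    // For queries that return a value, such as ReadReservations.
    public static async Task<T> Get<T>(TimeSpan halfDelay, Func<Task<T>> query)
    {
        await Task.Delay(halfDelay).ConfigureAwait(false);
        var result = await query().ConfigureAwait(false);
        await Task.Delay(halfDelay).ConfigureAwait(false);
        return result;
    }
}
```

With such a helper, a Decorator method body could shrink to a single expression like Delayed.Get(halfDelay, () => Inner.ReadReservation(restaurantId, id)). Whether that's an improvement over five short, explicit methods is a matter of taste.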
Slowing down tests
I added a few helper methods to the RestaurantService class that inherits from WebApplicationFactory<Startup>, mainly to enable decoration of the injected Repository. With those changes, I could now rewrite my test like this:
    [Fact]
    public async Task NoOverbookingRace()
    {
        var date = DateTime.Now.Date.AddDays(1).AddHours(18.5);
        using var service = RestaurantService.CreateWith(repo =>
            new SlowReservationsRepository(
                TimeSpan.FromMilliseconds(100),
                repo));

        var task1 = service.PostReservation(new ReservationDtoBuilder()
            .WithDate(date)
            .WithQuantity(10)
            .Build());
        var task2 = service.PostReservation(new ReservationDtoBuilder()
            .WithDate(date)
            .WithQuantity(10)
            .Build());
        var actual = await Task.WhenAll(task1, task2);

        Assert.Single(
            actual,
            msg => msg.StatusCode == HttpStatusCode.InternalServerError);
        var ok = Assert.Single(actual, msg => msg.IsSuccessStatusCode);
        // Check that the reservation was actually created:
        var resp = await service.GetReservation(ok.Headers.Location);
        resp.EnsureSuccessStatusCode();
        var reservation = await resp.ParseJsonContent<ReservationDto>();
        Assert.Equal(10, reservation.Quantity);
    }
The restaurant being tested has a maximum capacity of ten guests, so while it can accommodate either of the two requests, it can't make room for both.
For this example, I arbitrarily chose to configure the Decorator with a 100-millisecond delay. Every interaction with the database caused by that test gets a built-in 100-millisecond delay. 50 ms before each action, and 50 ms after.
The test starts both tasks, task1 and task2, without awaiting them. This allows them to run concurrently. After starting both tasks, the test awaits both of them with Task.WhenAll.
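If the start-now, await-later pattern is unfamiliar, this stand-alone sketch may help. It has nothing to do with the restaurant code base; ConcurrentStart and Work are hypothetical names, and Task.Delay stands in for the slow database round trip.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical demonstration of starting two tasks before awaiting either.
public static class ConcurrentStart
{
    public static async Task<(int[] Results, string[] Log)> Run()
    {
        var log = new ConcurrentQueue<string>();

        async Task<int> Work(string name, int result)
        {
            // Runs synchronously on the caller's thread up to the first await.
            log.Enqueue(name + " started");
            await Task.Delay(100).ConfigureAwait(false);
            log.Enqueue(name + " finished");
            return result;
        }

        // Both calls return a running Task; nothing is awaited yet.
        var task1 = Work("task1", 1);
        var task2 = Work("task2", 2);

        // WhenAll returns the results in the order the tasks were passed,
        // regardless of which task finished first.
        var results = await Task.WhenAll(task1, task2).ConfigureAwait(false);
        return (results, log.ToArray());
    }
}
```

Both 'started' entries end up in the log before either 'finished' entry, because each call only yields once it reaches its first await. Had the code awaited task1 before calling Work a second time, the two would have run one after the other.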
The assertion phase of the test is more involved than you may be used to seeing. The reason is that it deals with more than one possible failure scenario.
The first two assertions (Assert.Single) deal with the complete absence of transaction control in the application. In that case, both POST requests succeed, which they aren't supposed to. If the system works properly, it should accept one request and reject the other.
The rest of the assertions check that the successful reservation was actually created. That's another failure scenario.
The way I chose to deal with the race condition is standard in .NET. I used a TransactionScope. This is a peculiar and, in my opinion, questionable API that enables you to start a transaction anywhere in your code, and then complete it when you're done. In the code base that accompanies Code That Fits in Your Head, it looks like this:
    private async Task<ActionResult> TryCreate(Restaurant restaurant, Reservation reservation)
    {
        using var scope = new TransactionScope(TransactionScopeAsyncFlowOption.Enabled);

        var reservations = await Repository
            .ReadReservations(restaurant.Id, reservation.At)
            .ConfigureAwait(false);
        var now = Clock.GetCurrentDateTime();
        if (!restaurant.MaitreD.WillAccept(now, reservations, reservation))
            return NoTables500InternalServerError();

        await Repository.Create(restaurant.Id, reservation).ConfigureAwait(false);

        scope.Complete();

        return Reservation201Created(restaurant.Id, reservation);
    }
Notice the scope.Complete() statement towards the end.
What happens if someone forgets to call scope.Complete()?
In that case, the thread that wins the race returns 201 Created, but when the scope goes out of scope, it's disposed of. If Complete() wasn't called, the transaction is rolled back, but the HTTP response code remains 201. Thus, the two assertions that inspect the response codes aren't enough to catch this particular kind of defect.
Instead, the test subsequently queries the System Under Test to verify that the resource was, indeed, created.
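You can observe the rollback behaviour in isolation, without a database. The following sketch subscribes to the ambient transaction's TransactionCompleted event and reports the final status. ScopeDemo is a hypothetical name, and since nothing is enlisted in the transaction, this only illustrates the API, not what a real database does.

```csharp
using System;
using System.Transactions;

// Hypothetical demonstration of what disposing a TransactionScope does.
public static class ScopeDemo
{
    public static TransactionStatus Run(bool callComplete)
    {
        var status = TransactionStatus.Active;

        using (var scope = new TransactionScope())
        {
            var current = Transaction.Current;
            if (current is null)
                throw new InvalidOperationException("No ambient transaction.");

            // Capture the final status once the transaction ends.
            current.TransactionCompleted += (sender, e) =>
            {
                if (e.Transaction != null)
                    status = e.Transaction.TransactionInformation.Status;
            };

            if (callComplete)
                scope.Complete();
        } // Disposing the scope commits or rolls back.

        return status;
    }
}
```

Calling ScopeDemo.Run(true) should report Committed, while ScopeDemo.Run(false) should report Aborted. No exception is thrown in the latter case, which is why a forgotten Complete() call is so easy to overlook.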
Wait time
The original test shown in the book times out after 30 seconds if it can't produce the race condition. Compared to that, the refactored test shown here is fast. Even so, we may fear that it spends too much time doing nothing. How much time might that be?
The TryCreate helper method shown above is the only part of a POST request that interacts with the Repository. As you can see, it calls it twice: Once to read, and once to write, if it decides to do that. With a 100 ms delay, that's 200 ms.
And while the test issues two POST requests, they run in parallel. That's the whole point. It means that they'll still run in approximately 200 ms.
The test then issues a GET request to verify that the resource was created. That triggers another database read, which takes another 100 ms.
That's 300 ms in all. Given that these tests are part of a second-level test suite, and not your default developer test suite, that may be good enough.
Still, that's the POST scenario. I also wrote a test that checks for a race condition when doing PUT requests, and it performs more work.
    [Fact]
    public async Task NoOverbookingPutRace()
    {
        var date = DateTime.Now.Date.AddDays(1).AddHours(18.5);
        using var service = RestaurantService.CreateWith(repo =>
            new SlowReservationsRepository(
                TimeSpan.FromMilliseconds(100),
                repo));
        var (address1, dto1) = await service.PostReservation(date, 4);
        var (address2, dto2) = await service.PostReservation(date, 4);

        dto1.Quantity += 2;
        dto2.Quantity += 2;
        var task1 = service.PutReservation(address1, dto1);
        var task2 = service.PutReservation(address2, dto2);
        var actual = await Task.WhenAll(task1, task2);

        Assert.Single(
            actual,
            msg => msg.StatusCode == HttpStatusCode.InternalServerError);
        var ok = Assert.Single(actual, msg => msg.IsSuccessStatusCode);
        // Check that the reservations now have consistent values:
        var client = service.CreateClient();
        var resp1 = await client.GetAsync(address1);
        var resp2 = await client.GetAsync(address2);
        resp1.EnsureSuccessStatusCode();
        resp2.EnsureSuccessStatusCode();
        var body1 = await resp1.ParseJsonContent<ReservationDto>();
        var body2 = await resp2.ParseJsonContent<ReservationDto>();
        Assert.Single(new[] { body1.Quantity, body2.Quantity }, 6);
        Assert.Single(new[] { body1.Quantity, body2.Quantity }, 4);
    }
This test first has to create two reservations in a nice, sequential manner. Then it attempts to perform two concurrent updates, and finally it tests that all is as it should be: That both reservations still exist, but only one had its Quantity increased to 6.
This test first makes two POST requests, nicely serialized so as to avoid a race condition. That's 400 ms.
Each PUT request triggers three Repository actions, for a total of 300 ms (since they run in parallel).
Finally, the test issues two GET requests for verification, for another 2 times 100 ms. Now that I'm writing this, I realize that I could also have parallelized these two calls, but as you read on, you'll see why that's not necessary.
In all, this test waits for 900 ms. That's almost a second.
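If you'd like to double-check that arithmetic, here's a small sketch that spells it out. The WaitTime class and its members are hypothetical; the only inputs are the figures stated above: two Repository calls per POST, three per PUT, one per GET, and a 100 ms delay per call. Steps that run one after the other add up, whereas concurrent branches only cost as much as the slowest of them.

```csharp
using System;
using System.Linq;

// Hypothetical helper for tallying artificial wait time in a test.
public static class WaitTime
{
    // Steps that run one after another add up.
    public static TimeSpan InSequence(params TimeSpan[] steps) =>
        steps.Aggregate(TimeSpan.Zero, (acc, step) => acc + step);

    // Concurrent branches take as long as the slowest branch.
    public static TimeSpan Concurrently(params TimeSpan[] branches) =>
        branches.Max();

    // Two concurrent POST requests, then one GET.
    public static TimeSpan PostTest(TimeSpan delay)
    {
        var post = delay * 2; // one read, one write
        var get = delay;      // one read
        return InSequence(Concurrently(post, post), get);
    }

    // Two sequential POST requests, two concurrent PUT requests, two GETs.
    public static TimeSpan PutTest(TimeSpan delay)
    {
        var post = delay * 2;
        var put = delay * 3;  // three Repository actions
        var get = delay;
        return InSequence(post, post, Concurrently(put, put), get, get);
    }
}
```

With a 100 ms delay, PostTest yields 300 ms and PutTest yields 900 ms, matching the figures above.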
Can we improve on that?
Decreasing unnecessary wait time
In the latter example, the 300 ms wait time for the parallel PUT requests is necessary to trigger the race condition, but the rest of the test's actions don't need slowing down. We can remove the unwarranted wait time by setting up two services: One slow, and one normal.
To be honest, I could have modelled this by just instantiating two service objects, but why do something as pedestrian as that when you can turn RestaurantService into a monomorphic functor?
    internal RestaurantService Select(
        Func<IReservationsRepository, IReservationsRepository> selector)
    {
        if (selector is null)
            throw new ArgumentNullException(nameof(selector));

        return new RestaurantService(selector(repository));
    }
Granted, this is verging on the frivolous, but when writing code for a blog post, I think I'm allowed a little fun.
In any case, this now enables me to rewrite the test like this:
    [Fact]
    public async Task NoOverbookingRace()
    {
        var date = DateTime.Now.Date.AddDays(1).AddHours(18.5);
        using var service = new RestaurantService();
        using var slowService =
            from repo in service
            select new SlowReservationsRepository(TimeSpan.FromMilliseconds(100), repo);

        var task1 = slowService.PostReservation(new ReservationDtoBuilder()
            .WithDate(date)
            .WithQuantity(10)
            .Build());
        var task2 = slowService.PostReservation(new ReservationDtoBuilder()
            .WithDate(date)
            .WithQuantity(10)
            .Build());
        var actual = await Task.WhenAll(task1, task2);

        Assert.Single(
            actual,
            msg => msg.StatusCode == HttpStatusCode.InternalServerError);
        var ok = Assert.Single(actual, msg => msg.IsSuccessStatusCode);
        // Check that the reservation was actually created:
        var resp = await service.GetReservation(ok.Headers.Location);
        resp.EnsureSuccessStatusCode();
        var reservation = await resp.ParseJsonContent<ReservationDto>();
        Assert.Equal(10, reservation.Quantity);
    }
Notice how only the parallel execution of task1 and task2 runs on the slow system. The rest runs as fast as it can. It's as if the client was hitting two different servers that just happen to connect to the same database. Now the test only waits for the 200 ms described above. The PUT test, likewise, only idles for 300 ms instead of 900 ms.
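If you wonder why the query syntax in the test compiles at all, it's because the C# compiler translates a query expression into method calls, and accepts any type with a suitably shaped Select method. No interface is required. Here's a stripped-down sketch with a hypothetical Box class that, like RestaurantService, only maps its content to the same type:

```csharp
using System;

// Hypothetical monomorphic functor: Select maps int to int, nothing else.
public sealed class Box
{
    public Box(int value)
    {
        Value = value;
    }

    public int Value { get; }

    public Box Select(Func<int, int> selector)
    {
        if (selector is null)
            throw new ArgumentNullException(nameof(selector));

        return new Box(selector(Value));
    }
}

public static class BoxDemo
{
    // The compiler rewrites this to box.Select(x => x + 1).
    public static Box Increment(Box box) =>
        from x in box
        select x + 1;
}
```

BoxDemo.Increment(new Box(41)).Value is 42. The same translation turns the query expression in the test into a call to the Select method on RestaurantService.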
Near-deterministic tests
Does this deterministically reproduce the race condition? In practice, it may move us close enough, but theoretically the race is still on. With the increased wait time, it's now much more unlikely that the race condition does not happen, but it still could.
Imagine that task1 queries the Repository. Just as it's received a response, but before task2 starts its query, execution is paused, perhaps because of garbage collection. Once the program resumes, task1 runs to completion before task2 reads from the database. In that case, task2 ends up making the right decision, rejecting the reservation. Even if no transaction control were in place.
This may not be a particularly realistic scenario, but I suppose it could happen if the computer is stressed in general. Even so, you might decide to make such false-negative scenarios even more unlikely by increasing the delay time. Of course, the downside is that tests take even longer to run.
Another potential problem is that there's no guarantee that task1 and task2 run in parallel. Even if the test doesn't await any of the tasks, both start executing immediately. There's an (unlikely) chance that task1 completes before task2 starts. Again, I don't consider this likely, but I suppose it could happen because of thread starvation, generation 2 garbage collection, the disk running full, etc. The point is that the test shown here is still playing the odds, even if the odds are really good.
Conclusion
Instead of running a scenario 'enough' times that reproducing a race condition is likely, you can increase the odds to near-certainty by slowing down the race. In this example, the race involves a database, but you might also encounter race conditions internally in multi-threaded code. I'm not insisting that the technique described in this article applies universally, but if you can slow down certain interactions in the right way, you may be able to reproduce problems as automated tests.
If you've ever troubleshot a race condition, you've probably tried inserting sleeps into the code in various places to understand the problem. As described above, a single, strategically-placed Task.Delay may be all you need to reproduce a problem. What escaped me for a long time, however, was that I could cleanly insert such pauses into production code. Until my workshop participant suggested using a Decorator.
A delaying Decorator slows interactions with the database down sufficiently to reproduce the race condition as an automated test.