Transactions: the limits of isolation

At Toushay, we ran into a tricky issue. Our customers were noticing that, very rarely, their ‘digital signage’ playlists were being subtly corrupted. After our customers had been moving items around in the list, things would go awry. Sometimes an image would be lost, other times one would be duplicated. It happened only a few times, and in each case we were able to manually correct it. But it was very painful for them when it did.

Clearly something was wrong with how we were handling this data.

The problem was that it wasn’t reproducible on a developer’s machine, it didn’t show up in our integration tests, and occurred only very rarely in production.

I was tasked with figuring out the source of the problem and solving it once and for all. At first, everything worked okay on my machine and I couldn’t reproduce it. On a hunch that it may only occur on a database under high load, I tried hammering my local database with as much load as I could manage with this command:

mysqlslap -a --auto-generate-sql-execute-number=10000 --concurrency 140

That command spins up 140 MySQL clients, and each one hits the database with 10,000 random queries as fast as they can. That keeps the server busy for several minutes with much higher load than our production servers will ever see.

With that command running, I was able to reproduce the problem fairly consistently: duplicated images, missing images, the whole bit. Clearly, the extra database load was causing the different requests to step on each other’s toes. But how?

The code to move an item in the list basically works like this:

  1. Read in the list of images.
  2. Check that the item we’re moving is in the list and at the position we expect. This is a sanity check to ensure the UI isn’t running on stale data.
  3. Move the item to the new position.
  4. Save the list to the database.

This block of code is run within a database transaction (I checked to be sure), which should isolate it from other transactions. Yet, the behaviour could only be caused by two requests colliding with each other something like this:

Request A Request B
Read in the list of images.
Check that the item we're moving is in the list and at the position we expect.
Read in the list of images.
Check that the item we're moving is in the list and at the position we expect.
Move the item.
Save the list to the database.
Move the item.
Save the list to the database.

But how can this be? Transactions are one of the key features of any relational data store and, as we all learned in school, transactions provide ACID guarantees. Shouldn’t the Isolation guarantee prevent two transactions from interfering with each other like that?

By digging deeper, I realized what was going on. One key fact that is not well known about transactions, or is easily forgotten, is this: by default, databases do not provide full Isolation guarantees.

How isolated is isolated?

What exactly the Isolation guarantee in the SQL standard means based on what they call “read phenomena”. There are three types of phenomena:

  • Dirty reads: If another transaction writes, but does not commit, during your transaction, is it possible that you will see their data?
  • Non-repeatable reads: If you read the same row twice, is it possible that you might get different data the second time?
  • Phantom reads: If you read a collection of rows twice, is it possible that different rows will be returned the second time?

In the SQL standard, there are four levels of transactional isolation based on which of these phenomena they prevent (from weakest to strongest):

  • Read Uncommitted: A transaction can see the results of another transaction’s INSERT or UPDATE queries before they are committed. This allows all read phenomena.
  • Read Committed: A transaction is guaranteed to only see rows that are committed. This prevents dirty reads but allows the other phenomena.
  • Repeatable Read: A transaction is gets a snapshot of all rows it reads during the transaction. This prevents dirty reads and non-repeatable reads, but allows phantom reads.
  • Serializable: A transaction gets a “range lock” any time it queries a collection of rows, guaranteeing that it sees a snapshot of that data. This prevents phantom reads as well as dirty reads and non-repeatable reads.

The default isolation level, at least for PostgreSQL, MySQL and Oracle, is Read Committed and so that’s what we were using in our application. This means that, unless you change the isolation level, you cannot assume that because you read in some data that someone can’t change it underneath you before you commit.

The reason they don’t use a higher default level is that both Repeatable Read and Serializable both introduce cases where a transaction cannot be completed and must be aborted (race conditions). In these cases, the DBMS aborts the transaction, rolls it back, and returns an error.

When this happens, the application has to know how to retry the transaction. In some cases it may be difficult to retry a transaction if it had side effects on the application’s side: talking to external services or other data stores. In these cases, the application would have to use something like a two-phase commit protocol, or else manually roll back the side effects.

The solution

The corruption we were seeing was the result of non-repeatable reads and phantom reads: we were reading a range of images, doing some checks on them, then modifying that range and writing them out. In between the reads and writes, there’s a window where another transaction could sneak in and insert or remove an image, or rearrange the list. Any of those cases could cause corruption of the list when you make your changes and overwrite the new changes.

There are two solutions to this: either you use row locks (probably a shared lock, but possibly exclusive) to prevent any concurrent access to the rows, or you use stricter transaction isolation. In our case, we need Serializable isolation to prevent the phenomena we were seeing.

The problem with both of these is that you take a performance hit. We checked what impact enabling Serializable isolation globally (by setting the MySQL server variable tx_isolation = SERIALIZABLE) would have, and the results were… mixed. In some cases, the effect was negligible; in other cases, it raised the response time by up to 10x. And in all cases, it made the response time much more variable. So switching to Serializable everywhere was a no-go.

We also didn’t want to use row locks: it’s easy to miss a case where a lock is necessary, and need to be used properly to prevent deadlocks. They’re also often stricter than we need, which may incur a greater performance hit than is necessary.

In the end, we switched to Serializable isolation on the particular operations where the stricter isolation is necessary. In all the rest, we continue to use the default Read Committed isolation.

Lessons learned

This was a very subtle and interesting bug. It came about from a fairly serious misunderstanding of how transactions work. But it’s an easy one to make: it’s natural to think of transactions as some sort of magic that protects you from concurrency issues. Going into this, my own mental model was that transactions gave, basically, Serializable isolation by default. I’m sure I’m not alone there, which is why I wrote this blog post.

Here the key lesson you can take away from this: be sure to understand your database’s documentation in detail. All the information in this post is available in both the MySQL, and PostgreSQL documentation.

It also showed us that there’s a gap in our testing methodology: the tests are run on an unloaded database, and transaction isolation issues don’t appear until you have multiple competing transactions. When you’re doing integration testing, consider running mysqlslap, or similar, in the background. This should increase the chances that isolation problems will occasionally cause your tests to fail.

comments powered by Disqus