Friday, April 6, 2012

Fun With Deadlocks

We recently had a customer in St. Louis report that the system would occasionally freeze when they tried to scan in the same job from multiple workstations.  My first instinct when I heard about this was that it was probably the "deadly embrace" problem, better known as a deadlock, which is what we call it when two threads are each waiting for a lock held by the other.

Let's say Thread 1 has an exclusive lock on RED and Thread 2 has an exclusive lock on BLUE.  If Thread 1 tries to acquire a lock on BLUE, it will stop and wait for Thread 2 to release the BLUE lock.  If Thread 2 is, in turn, trying to acquire a lock on RED, it too will wait for Thread 1 to release the RED lock.  Since neither thread can proceed until the other releases a lock, both threads are stuck, or in nerd parlance, deadlocked.
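If you want to see this in action, here's a tiny, contrived Java example (nothing to do with Flex code, just RED and BLUE as plain lock objects) that will deadlock almost every time you run it, because the two threads grab the locks in opposite order:

    public class DeadlockDemo {

        private static final Object RED = new Object();
        private static final Object BLUE = new Object();

        public static void main(String[] args) {
            Thread thread1 = new Thread(new Runnable() {
                public void run() {
                    synchronized (RED) {               // Thread 1 grabs RED first
                        sleepQuietly(100);             // give Thread 2 time to grab BLUE
                        synchronized (BLUE) {          // ...then waits forever for BLUE
                            System.out.println("Thread 1 has both locks");
                        }
                    }
                }
            });

            Thread thread2 = new Thread(new Runnable() {
                public void run() {
                    synchronized (BLUE) {              // Thread 2 grabs BLUE first
                        sleepQuietly(100);             // give Thread 1 time to grab RED
                        synchronized (RED) {           // ...then waits forever for RED
                            System.out.println("Thread 2 has both locks");
                        }
                    }
                }
            });

            thread1.start();
            thread2.start();
        }

        private static void sleepQuietly(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

Run it and neither thread ever prints anything; the JVM just sits there, which is exactly what a deadlocked server feels like from the outside.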

Duplicating Deadlocks

Deadlocks are rare, even in poorly designed software (unless that software is under heavy concurrent load), because they usually rely on a rare confluence of timing between the threads, and often on specific quirks in the underlying dataset.  Thread 1 has to ask for a lock before Thread 2 releases it.  More threads and heavy usage can increase the probability of generating a deadlock, but whenever I suspect a deadlock, I always dread the tedious process of recreating it on my developer workstation.

We developed a test utility some time back that was designed to help us understand how a large number of concurrent users impacts scan response time or performance.  Finding deadlocks wasn't exactly what we built it for, but the tool did allow us to simulate high load and many concurrent users, so it seemed worth a try.

I downloaded the St. Louis customer's database and ran a load test simulating 5 concurrent users with a 1-5 second delay between scans.  The test locked up at around 17 percent complete.  I tried it again.  This time it locked up at around 7 percent.  Every time we ran the load test, it would get off to a good start and eventually lock up.  When you're trying to recreate a bug, this is a good thing.
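For the curious, the guts of a harness like that boil down to something like the sketch below.  This isn't our actual test utility; performScan() is a placeholder for whatever posts a scan to the server, but the shape is the same: a small thread pool, each worker scanning in a loop with a random 1-5 second pause.

    import java.util.Random;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Bare-bones sketch of a concurrent scan load test -- not our actual utility.
    public class ScanLoadTest {

        private static final int USERS = 5;             // simulated workstations
        private static final int SCANS_PER_USER = 100;  // scans each "user" performs

        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(USERS);

            for (int u = 0; u < USERS; u++) {
                final int userId = u;
                pool.submit(new Runnable() {
                    public void run() {
                        Random random = new Random();
                        for (int i = 0; i < SCANS_PER_USER; i++) {
                            performScan(userId, i);
                            try {
                                // random 1-5 second pause between scans
                                Thread.sleep(1000 + random.nextInt(4000));
                            } catch (InterruptedException e) {
                                Thread.currentThread().interrupt();
                                return;
                            }
                        }
                    }
                });
            }

            pool.shutdown();
            // If the server-side scan processing deadlocks, this wait never completes.
            pool.awaitTermination(30, TimeUnit.MINUTES);
        }

        private static void performScan(int userId, int scanNumber) {
            // Placeholder: in the real utility this posts a scan to the server.
            System.out.println("user " + userId + " scan " + scanNumber);
        }
    }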

Here's a screenshot from the load test utility:

Deadlock Test
In this screenshot, you can see that this particular test run failed right around 39%.  (You might also note from one of the open browser tabs that I'm hosting a party this weekend for my brother-in-law, who's moving to Austria.)  The lock-up happened every time, not always at the same percentage of completion, but it always happened eventually.  A consistently reproducible deadlock is a rare thing.  This was a good point to start from.

The next step was to confirm that the freezing was, in fact, a deadlock and to figure out which threads were deadlocking each other and from where in the code.  I hooked the process up to the YourKit profiler and did another test run.  Soon after the test run hit the wall, YourKit popped up a deadlock warning with copies of the call stacks from both threads.
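If you don't have a profiler handy, by the way, the JDK can give you a rough version of the same information.  A thread dump from jstack will flag Java-level deadlocks, and the standard ThreadMXBean will do it programmatically.  A small sketch of the latter:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Sketch of a JDK-only deadlock check, roughly what a profiler does for you.
    public class DeadlockDetector {

        public static void dumpDeadlocks() {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();

            // Returns the ids of threads stuck in a cycle waiting on monitors or
            // java.util.concurrent locks, or null if there are none.
            long[] deadlockedIds = threads.findDeadlockedThreads();
            if (deadlockedIds == null) {
                System.out.println("No deadlocks detected.");
                return;
            }

            // Pull the stack trace and lock details for each deadlocked thread.
            ThreadInfo[] infos = threads.getThreadInfo(deadlockedIds, true, true);
            for (ThreadInfo info : infos) {
                System.out.println(info);  // ThreadInfo.toString() includes the stack
            }
        }

        public static void main(String[] args) {
            dumpDeadlocks();
        }
    }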

I didn't think to grab a screenshot during the failed test run, but here's a screenshot from the YourKit thread monitor during a successful test:


You can see little red blocks in the thread status lines that show small periods where a thread is blocked waiting for a lock, but the blocks are short and eventually the thread gets back to business.  Occasional short waits for a lock are fine; deadlocks are not.

In our failed test run, the profiler told us that one deadlocked thread was in the warehouse facade's updateLineItems() method and the other was stuck in updateLineItemProgress().
 
These methods work together to generate new manifest line items, update the completion status of any line items impacted by a scan, and set audit trail information.  It makes sense that these functions are subject to locks, because two threads trying to update the same line item at the same time could lead to inconsistent data (more on that later).  But since updateLineItemProgress() is only called by updateLineItems(), it was possible to pull all the lock acquisition required by updateLineItemProgress() up into updateLineItems().  I made this change and also moved the lock acquisition logic into a synchronized code block as shown below:

            // Acquire every lock needed for the line item update in one place,
            // so no other thread can interleave its own acquisitions between them.
            synchronized (elementLockId) {

                // Lock the scanned element's line item first.
                getConcurrencyService().acquireLock(elementLockId);

                // If the scan affects a manifest, lock that as well before proceeding.
                if (request.getManifest() != null) {
                    getConcurrencyService().acquireLock(manifestLockId);
                }
            }

The synchronized block ensures that, in situations where multiple locks are needed, the acquisition logic is somewhat transactional.  No thread has to worry about an edge-case timing issue creating a deadlock during lock acquisition.  I don't think this particular change by itself would have fixed the customer's issue, but it seemed like a logical housekeeping task while I was in the code.
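The more general rule at work here, for what it's worth, is that threads which always acquire their locks in the same order can't form the circular wait from the RED/BLUE example.  Here's an illustration of that ordering rule using plain ReentrantLocks rather than our ConcurrencyService:

    import java.util.concurrent.locks.ReentrantLock;

    // Illustration of deadlock avoidance via a consistent lock ordering.
    // Every thread that needs both locks takes them lowest-id first, so the
    // circular wait from the RED/BLUE example can never form.
    public class OrderedLocking {

        static class IdentifiedLock {
            final long id;                          // stable ordering key
            final ReentrantLock lock = new ReentrantLock();
            IdentifiedLock(long id) { this.id = id; }
        }

        // Acquire two locks in a globally consistent order (smallest id first).
        static void lockBoth(IdentifiedLock a, IdentifiedLock b) {
            IdentifiedLock first  = (a.id <= b.id) ? a : b;
            IdentifiedLock second = (a.id <= b.id) ? b : a;
            first.lock.lock();
            second.lock.lock();
        }

        // Release in the reverse order of acquisition.
        static void unlockBoth(IdentifiedLock a, IdentifiedLock b) {
            IdentifiedLock first  = (a.id <= b.id) ? a : b;
            IdentifiedLock second = (a.id <= b.id) ? b : a;
            second.lock.unlock();
            first.lock.unlock();
        }
    }

Our change works a little differently (it funnels all the acquisitions through one guarded block), but the goal is the same: no interleaving of lock acquisitions that can produce a cycle.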

After making this change, I did another test run and the test completed without a hitch.  I varied the conditions a bit and made the test progressively more aggressive, increasing the factors that might induce a deadlock.  And while performance degraded (as expected), the deadlock issue never returned.


Concurrency Versus Liveness

All this effort to clean up locking issues raises the question of why we bother with locks in the first place.  To answer that, consider what might happen when scanning out barn doors.  Suppose there is one line item on the pull sheet for, say, 50 barn doors, two workstations are scanning them out, and both workstations record 25 barn doors at exactly the same time.  With no server-side locking for line item updates, you might physically scan out all 50 barn doors, but because the two data updates happened at the same time, one progress update would overwrite the other, and the line item would show a progress of 25/50 instead of 50/50.  This would be bad.
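This is the classic lost update race.  Stripped of the warehouse details, it looks something like the sketch below (a hypothetical ProgressCounter, not Flex code): both threads read the same starting value, add their own quantity, and one write clobbers the other.

    // Hypothetical illustration of the lost-update race described above.
    // Two threads each add 25 to the same progress counter without locking;
    // run it a few times and the final total will often be 25 instead of 50.
    public class LostUpdateDemo {

        static class ProgressCounter {
            private int scanned = 0;

            void addUnsafe(int quantity) {
                int current = scanned;          // both threads may read 0 here
                sleepQuietly(50);               // widen the race window for the demo
                scanned = current + quantity;   // the second writer overwrites the first
            }

            int getScanned() {
                return scanned;
            }
        }

        public static void main(String[] args) throws InterruptedException {
            final ProgressCounter counter = new ProgressCounter();

            Thread workstation1 = new Thread(new Runnable() {
                public void run() { counter.addUnsafe(25); }
            });
            Thread workstation2 = new Thread(new Runnable() {
                public void run() { counter.addUnsafe(25); }
            });

            workstation1.start();
            workstation2.start();
            workstation1.join();
            workstation2.join();

            System.out.println("Progress: " + counter.getScanned() + "/50");
        }

        private static void sleepQuietly(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }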

We call this a concurrency issue.  The simplest way to fix it in our system would have been to add the synchronized modifier to the processScan() method.  This would have made concurrency issues all but impossible, because only one thread would be permitted in that method at a time.  Scan 2 would have to wait until Scan 1 completed before being processed.  And this would apply not only to scanning the same job from multiple workstations, but to all scanning, period.  Only one scan at a time would be permitted to go through.
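In code, that heavy-handed fix is essentially a one-keyword change.  The class below is a simplified stand-in, not the real scan service, but it shows the idea:

    // The heavy-handed fix: one global lock around the whole scan pipeline.
    // Types and logic here are simplified stand-ins; the point is the keyword.
    public class ScanService {

        // synchronized on the service instance: only one scan at a time,
        // even for scans against completely unrelated pull sheets.
        public synchronized String processScan(String barcode) {
            return "processed " + barcode;  // placeholder for the real work
        }
    }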

As you may be able to intuit, this would make things very slow, and trade off a concurrency issue for a performance -- or liveness -- issue.  Writing multi-threaded software is all about balancing these two potential issues.  Living with concurrency glitches for the sake of speed is never acceptable, and living with slow performance for the sake of ultra-conservative concurrency restrictions usually isn't acceptable either.

The scan processing code in Flex has many different components, only a fraction of which pose concurrency risks.  Checking the availability of an orphan scan or checking for prompt conditions, for example, pose none.  Blocking other threads while doing things that have no impact on the data is unnecessary and would carry a severe performance cost.  In fact, most of the effort we've put into improving scan performance comes down to locking only when necessary.

And the only place in the scan process where we have potential concurrency issues is when updating the line item data.  So, when the process enters that part of the code, we acquire the necessary locks and release them as soon as the data update is complete.  And we don't lock all threads across the board, only threads trying to update the same manifest or pull sheet.  Workstations working on different pull sheets should experience no major slowdown due to lock-induced request serialization.  (In reality, this isn't 100% true because MySQL has its own system of page and table locks, but Flex-owned locks would not block each other.)
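One common way to get that kind of per-manifest scoping is to hand out a separate lock object for each manifest or pull sheet id.  The sketch below uses a ConcurrentHashMap of ReentrantLocks; it's not how our ConcurrencyService is actually implemented, just an illustration of why scans against different pull sheets don't contend with each other:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReentrantLock;

    // Illustration of fine-grained, per-manifest locking: scans against the same
    // manifest serialize, scans against different manifests run in parallel.
    public class ManifestLockRegistry {

        private final ConcurrentMap<String, ReentrantLock> locks =
                new ConcurrentHashMap<String, ReentrantLock>();

        // Look up (or lazily create) the lock for a given manifest id.
        private ReentrantLock lockFor(String manifestId) {
            ReentrantLock lock = locks.get(manifestId);
            if (lock == null) {
                ReentrantLock fresh = new ReentrantLock();
                ReentrantLock existing = locks.putIfAbsent(manifestId, fresh);
                lock = (existing != null) ? existing : fresh;
            }
            return lock;
        }

        // Run the line item update while holding only that manifest's lock.
        public void updateLineItems(String manifestId, Runnable update) {
            ReentrantLock lock = lockFor(manifestId);
            lock.lock();
            try {
                update.run();
            } finally {
                lock.unlock();    // always release, even if the update throws
            }
        }
    }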

Next Steps

We feel reasonably confident that we've nailed the issue, but due to the uncertain nature of deadlocks, there's no way of knowing 100% until we've tested the fix in the real world.  We'll be building a release candidate soon and conducting a test in the St. Louis customer's warehouse to see if we "really" fixed it.

Update:

The St. Louis customer has tested the fix in their warehouse and reports all is well. 
