Tuesday, April 24, 2012

Grinding and Slogging

One of the things I miss about being on the road (other than fall arrest harnesses and my trusty Gerber) is the feeling of "doneness" you get when a show is loaded out and the trucks are padlocked.  There's no sliding scale of what it means to be finished: the show is over and loaded out.  The end.  You can head off to the hotel bar and kick back with your fellow road dogs and celebrate a job well done.

In software development, you never really get that sense of fait accompli.  Every release of the software is an accomplishment, but it also represents a list of things you wish you could have done or could have done better.  It takes a lot of stamina to write software, and even more to finish writing a software package once you've started.  In my view, this grinding and slogging at the never-ending end of a software project is responsible for much of the burnout and high turnover you see in software development.

Whenever I sit down and look at Flex, especially when I'm working with customer data on customer issues, I often see only the flaws and all the things that still need work.  If something bugs you as a Flex user, you can rest assured it bugs me ten times more.  But the wishlist is long and time is short.  It can seem daunting.  When I become mindful of that, I try to remember something a much older and wiser software engineer once told me when I was just starting out.  This conversation took place in an office building on French Broad Avenue in Asheville, North Carolina.

He told me that the last mile is always the hardest.  Building new software and new features is fun.  What's not fun is correcting mistakes and ironing out the details.  Good software, he said, is built by people who have the stamina to stick it out when it's not fun anymore, when the work turns to tuning, fixing and correcting wrong assumptions that infected your design.  Often this kind of work is done under duress, with time pressure, customer pressure, internal pressure.  But if you have the guts to stick with it, to slog through and put in the grunt work it takes to really "finish" or come close, you'll always come out on top, because you will have pushed past the point where most of your competitors just gave up.

I think you really see this in Flex's warehouse process, because the warehouse module has been reworked many times to incorporate edge cases and workflow exceptions that comparable systems have yet to contemplate.

The last few weeks at Flex, at least for me (Suman and Roger have been working on new labor and audit trail features, respectively), have been spent down in the mire of difficult bug fixes and performance tuning.  It's been tough because estimating the time it takes to fix a problem before you understand the problem is dubious business.  Performance tuning is especially tough because you have to make sure the cure isn't worse than the disease.  It's work no one really planned on doing in the first place, it usually represents a slip in the schedule, and so this work, work that's notoriously hard to pin down in terms of man-hours, is usually done in a hasty atmosphere under pressure.  It involves long hours and often serious lapses in nutrition or personal hygiene.  You often feel that it's during these times, as a developer, that you're most deserving of praise.  But I can tell you from experience that it's the exact time when you're least likely to get it.

What keeps my spirits up when I have these kinds of weeks is that I know our competitors don't have the stomach for this kind of work.  But it's this kind of work, the digging in and cleaning out the cobwebs, that ultimately separates good software from great software.

Friday, April 13, 2012

Rethinking Line Item Deletion

In the midst of troubleshooting some issues with subrental paperwork generation, we decided to tackle a related issue that's been neglected over the last year or so: how line item deletes are handled during paperwork syncs (pull sheet and PO generation), and how line item deletes are handled in general.


Visibility

The Flex workflow, where quotes are used to generate other types of documents, can be a source of confusion.  A quote line item has associations with other line items in other documents, but when you go to delete a line item, you don't necessarily get a heads-up on what impact the delete will have on those other documents.  This was a frequent source of unexplained weirdness.

Another view of the same basic problem relates to syncs.  For example, say you delete line items from a quote and want those deletes to propagate down to the pull sheet.  Most Flex users know that deletes don't get synced.  This was a design choice to prevent accidental deletion, although we've always known we needed a better solution for this problem.

No Deletion Without Confirmation

The new approach to deletes can be summed up as "no deletion without confirmation," meaning that deletes during document syncs, instead of being ignored (or done automatically), should be presented to the user for confirmation.  We also think manual deletes should show a confirmation dialog that lists the links to other documents and presents the user with the opportunity to delete the whole object graph in one fell swoop, or to use the extended information to make a more informed decision about the deletion.

The Delete Confirmation Dialog

To implement this new way of handling line item deletes, we've replaced the existing validation process (which just did a quick relationship check) with a new dialog box.



This dialog will show users the line items they selected, with any related line items displayed in a tree structure.  From here, the user can make a judgement call about whether to stick with the original plan, delete the pull sheet line items as well, or forget the whole thing.

This dialog will also pop up on occasion when updating subrental paperwork or invoking a workflow action - if the automation process detects line items that might need to be deleted to make the paperwork consistent.


Generic Deletion

Making this work on the back end presented a number of new challenges, because our existing facades for supporting client-initiated delete operations have line item delete logic specific to the type of line item.  For example, the logic used to process a quote line item delete is slightly different from the logic for pull sheet deletes.  This creates a problem if you're trying to delete both quote and pull sheet line items in a single batch: the system needs a way to switch the delete logic used for each line item.

To make this work, we had to correct a few architectural problems.  First, much of the type-specific logic was in the facade layer of the application, which is the server endpoint the Flash client uses for communication.  From a purist's perspective, this is part of the view layer of the application, and all business logic belongs in the service layer.  We had to move all logic related to processing the actual delete operation out of the existing facade delete methods and down into the service layer, while retaining in the facade all logic related to validating the request and serializing the results for the client.

We also needed a generic endpoint for line item deletes that would determine which delete logic needed to be used.  To handle this, we added a new method to our base ProjectElementService interface, as shown below:

public interface ProjectElementService extends KennewickService {
    ...
    public Collection<ProjectElementLineItem> deleteLineItem(String id);
}
This is a pretty simple method whose signature should look straightforward.  The tricky part is in the base service implementation, shown in the next code sample:

    @Override
    public Collection<ProjectElementLineItem> deleteLineItem(String id) {

        /*
         * Delegates to the appropriate service with class specific logic.
         */
        ProjectElementLineItem lineItem = getLineItemService().findById(id);

        // Lock on the parent project element so concurrent edits to the same
        // document can't interleave with the delete.
        String lockId = lineItem.getProjectElement().getObjectIdentifier();
        try {
            getConcurrencyService().acquireLock(lockId);

            // Resolve the type-specific service (quote, pull sheet, etc.) from the
            // Spring context via the element definition's extension, then delegate.
            ProjectElementDefinition def = lineItem.getProjectElement().getElementDefinition();
            ProjectElementService service =
                    (ProjectElementService) context.getBean(def.getExtension().getInstanceServiceId());

            return service.deleteLineItem(id);
        }
        finally {
            getConcurrencyService().releaseLock(lockId);
        }
    }
This takes advantage of the extension concept already built into Flex, where we can always get a reference to the Spring Bean ID of the service used to process project elements of a specific type.  For example, if this method were being invoked for a quote line item, the object returned from the Spring application context using the extension's instance service ID would be of type FinancialDocumentService (which is a subinterface of ProjectElementService).  The deleteLineItem() method is overridden on that service (and for all other project element classes with line item support) to include the delete logic specific to quote line items.
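To make that concrete, here's a rough sketch of what one of those overrides could look like.  The class name, helper methods and cascade behavior below are illustrative assumptions, not the actual Flex source; the point is simply that each project element type supplies its own delete logic by overriding the base method:

public class FinancialDocumentServiceImpl extends BaseProjectElementService
        implements FinancialDocumentService {

    @Override
    public Collection<ProjectElementLineItem> deleteLineItem(String id) {

        // Illustrative quote-specific logic: detach the line from its parent so
        // totals roll up correctly, then delete it and return what was removed.
        FinancialDocumentLineItem line =
                (FinancialDocumentLineItem) getLineItemService().findById(id);

        if (line.getParentLineItem() != null
                && line.getParentLineItem().getChildLineItems() != null) {
            line.getParentLineItem().getChildLineItems().remove(line);
        }
        getLineItemService().delete(line);   // hypothetical delete helper

        return Collections.<ProjectElementLineItem>singletonList(line);
    }
}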

Delta Processing

When a quote or pull sheet is changed, the server sends back a packet of information to the Flash client to reflect the change.  We call this a delta.  The delta information is unique to the type of document.  Quote deltas are different from pull sheet deltas.  They aren't interchangeable. 

When you delete line items from a quote, and are presented with the option to also delete pull sheet line items or lines from other financial documents (like rental POs), the delta sent back to the client should only contain information for line items that actually belonged to the quote you were originally working with.  Otherwise, the delta would contain superfluous information and might even cause a fatal error.  To work around this issue, we added some filtering logic to the facade.  Here's a snippet from the facade method used to process quote line item deletes:

        //this first loop delegates to the generic deletion logic now moved to the service layer
        for (String id : lineIds) {
            deleteLines.addAll(getElementService().deleteLineItem(id));
        }

        //this loop iterates over all line items processed in the deletion (including cascaded deletes)
        for (ProjectElementLineItem deleteLine : deleteLines) {
            deleteLine = (ProjectElementLineItem) ProxyUtils.getProxyTarget(deleteLine);

            /*
             * The first two conditionals ensure that each line item is a quote line and that it
             * belongs to the quote.  Otherwise, we skip it and leave it out of the delta result.
             */
            if (deleteLine instanceof FinancialDocumentLineItem) {
                FinancialDocumentLineItem docLine = (FinancialDocumentLineItem) deleteLine;
                if (docLine.getProjectElement().getObjectIdentifier().equals(documentId)) {
                    if ((docLine.getParentLineItem() != null)
                            && (docLine.getParentLineItem().getChildLineItems() != null)) {

                        docLine.getParentLineItem().getChildLineItems().remove(deleteLine);
                        getDocumentService().processFullRecalc(doc);

                        results.add(toDeleteDTO(docLine, collab));
                        FinancialDocumentLineItemDTO parentLineDTO = toDTO(
                                docLine.getRootLineItem(), doc, collab, false);
                        parentLineDTO.setTriggersRecalc(true);
                        results.add(parentLineDTO);

                    } else {
                        results.add(toDeleteDTO(docLine, collab));
                    }
                }
            }
        }

We have similar code in the pull sheet facade for managing the serialization/delta processing quirks presented there.

Progress So Far

This new dialog will be in version 4.4.33 of Flex.  Implementation for line item deletes is nearly complete, and adding delete confirmation support to workflow and automation profiles will start on Monday.

Friday, April 6, 2012

Fun With Deadlocks

We recently had a customer in St. Louis report that the system would occasionally freeze when trying to scan in the same job from multiple workstations.  My first instinct when I heard about this was that it was probably the "deadly embrace" problem, or a deadlock, which is what we call it when two threads are each waiting for a lock held by the other.

Let's say Thread 1 has an exclusive lock on RED and Thread 2 has an exclusive lock on BLUE.  If Thread 1 tries to acquire a lock on BLUE, the thread will stop and wait for Thread 2 to release the BLUE lock.  If Thread 2 is trying to acquire a lock on RED, it, too, will wait for Thread 1 to release the RED lock.  Since neither thread can proceed without the other releasing a lock, both threads are stuck, or in nerd parlance, deadlocked.
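If you want to see the deadly embrace in miniature, here's a minimal, standalone Java sketch (not Flex code) of the same RED/BLUE scenario.  Both threads grab their first lock, pause briefly, then wait forever for the lock the other thread is holding:

public class DeadlockDemo {

    private static final Object RED = new Object();
    private static final Object BLUE = new Object();

    public static void main(String[] args) {
        Thread thread1 = new Thread(new Runnable() {
            public void run() {
                synchronized (RED) {             // Thread 1 holds RED...
                    pause();
                    synchronized (BLUE) {        // ...then waits forever for BLUE
                        System.out.println("Thread 1 got both locks");
                    }
                }
            }
        });
        Thread thread2 = new Thread(new Runnable() {
            public void run() {
                synchronized (BLUE) {            // Thread 2 holds BLUE...
                    pause();
                    synchronized (RED) {         // ...then waits forever for RED
                        System.out.println("Thread 2 got both locks");
                    }
                }
            }
        });
        thread1.start();
        thread2.start();   // neither message ever prints; both threads are deadlocked
    }

    private static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException e) { }
    }
}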

Duplicating Deadlocks

Deadlocks are rare, even in poorly designed software (unless that software is under a heavy concurrent load), because they usually rely on a rare confluence of timing events between the threads, often combined with specific quirks in the underlying dataset.  Thread 1 has to ask for a lock before Thread 2 releases it.  More threads and heavy usage can increase the probability of generating a deadlock, but whenever I suspect a deadlock, I always dread the tedious process of recreating it on my developer workstation.

We developed a test utility some time back that was designed to help us understand how a large number of concurrent users impacts scan response time or performance.  Finding deadlocks wasn't exactly what we built it for, but the tool did allow us to simulate high load and many concurrent users, so it seemed worth a try.

I downloaded the St. Louis customer's database and ran a load test simulating 5 concurrent users and a 1-5 second delay between scans.  The test locked up around 17 percent.  I tried it again.  This time it locked up around 7 percent.  Every time we ran the load test, it would get off to a good start and eventually lock up.  When you're trying to recreate a bug, this is a good thing.

Here's a screenshot from the load test utility:

Deadlock Test
In this screenshot, you can see that this particular test run failed right around 39%.  (You might also note from one of the open browser tabs that I'm hosting a party this weekend for my brother-in-law, who's moving to Austria.)  This lock-up happened every time, not always at the same percentage of completeness, but eventually.  A consistently reproducible deadlock is a rare thing.  This was a good point to start from.

The next step was to confirm that the freezing was, in fact, a deadlock and to figure out which threads were deadlocking each other and from where in the code.  I hooked the process up to the YourKit profiler and did another test run.  When the test run hit the wall, YourKit soon popped up a deadlock warning with copies of the call stacks from both threads.

I didn't think to grab a screenshot during the failed test run, but here's a screenshot from the YourKit thread monitor during a successful test:


You can see little red blocks in the thread status lines that show small periods where a thread is blocked waiting for a lock, but the blocks are short and eventually the thread gets back to business.  Occasional locks are good; deadlocks are not.

In our failed test run, the profiler told us that one deadlocked thread was in the warehouse facade's updateLineItems() method and the other was stuck in updateLineItemProgress().
 
These methods work together to generate new manifest line items, update the completion status of any line items impacted by a scan, and set audit trail information.  It makes sense that these functions are subject to locks, because two threads trying to update the same line item at the same time could lead to inconsistent data (more on that later).  But since updateLineItemProgress() is called by updateLineItems(), it was possible to pull all the lock acquisition required by updateLineItemProgress() up into updateLineItems().  I made this change and also moved the lock acquisition logic into a synchronized code block as shown below:

            // All locks needed for the update are acquired in one place, inside a
            // synchronized block, so two threads can't interleave their lock
            // acquisitions and deadlock each other.
            synchronized (elementLockId) {

                getConcurrencyService().acquireLock(elementLockId);

                if (request.getManifest() != null) {
                    getConcurrencyService().acquireLock(manifestLockId);
                }
            }

The synchronized block ensures that, in situations where multiple locks are needed, the lock acquisition logic is effectively atomic.  No thread has to worry about an edge-case thread timing issue creating a deadlock during lock acquisition.  I don't think this change by itself would have fixed the customer issue, but it seemed like a logical housekeeping task while in the code.

After making this change, I did another test run and the test completed without a hitch.  I varied the conditions a bit and made the test progressively more aggressive, increasing the factors that might induce a deadlock.  And while performance degraded (as expected), the deadlock issue never returned.


Concurrency Versus Liveness

All this effort to clean up locking issues raises the question of why we bother with locks in the first place.  To answer that question, consider what might happen when scanning out barn doors, for example.  If there is one line item on the pull sheet for, say, 50 barn doors, two workstations are scanning out barn doors, and both workstations scan 25 barn doors at exactly the same time -- with no server side locking for line item updates -- you might scan out 50 barn doors, but because the data updates happened at exactly the same time, one progress update would override the other and only show a progress of 25/50 instead of 50/50.  This would be bad.
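Here's a minimal, standalone sketch of that lost update (plain Java, not Flex code).  Both "workstations" read the same starting count, so one update clobbers the other:

public class LostUpdateDemo {

    private static int scannedCount = 0;   // progress on the 50 barn door line item

    public static void main(String[] args) throws InterruptedException {
        Runnable scan25 = new Runnable() {
            public void run() {
                int current = scannedCount;                  // both threads read 0
                try { Thread.sleep(50); } catch (InterruptedException e) { }
                scannedCount = current + 25;                 // both write 25; one update is lost
            }
        };

        Thread station1 = new Thread(scan25);
        Thread station2 = new Thread(scan25);
        station1.start();
        station2.start();
        station1.join();
        station2.join();

        // Almost always prints 25/50 instead of 50/50 -- the inconsistency the locks prevent.
        System.out.println("Progress: " + scannedCount + "/50");
    }
}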

We call this a concurrency issue.  The simplest way to fix it in our system would have been to add the synchronized modifier to the processScan() method.  This would have made concurrency issues all but impossible, because only one thread would be permitted in that method at a time.  Scan 2 would have to wait until Scan 1 completed before being processed.  And this would apply not only to scanning the same job from multiple workstations, but to all scanning, period.  Only one scan at a time would be permitted to go through.
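In sketch form, that coarse-grained approach would look something like this (the parameter and return types are placeholders, not the real method signature):

    // One global lock around the entire scan pipeline: correct, but every scan in
    // the system now waits in line behind every other scan, regardless of which
    // pull sheet it belongs to.
    public synchronized ScanResponse processScan(ScanRequest request) {
        return handleScan(request);   // hypothetical helper doing the actual work
    }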

As you may be able to intuit, this would make things very slow, and trade off a concurrency issue for a performance -- or liveness -- issue.  Writing multi-threaded software is all about balancing these two potential issues.  Living with concurrency glitches for the sake of speed is never acceptable, and living with slow performance for the sake of ultra-conservative concurrency restrictions usually isn't acceptable either.

The scan processing code in Flex has many different components, only a fraction of which pose concurrency risks.  Checking the availability of an orphan scan, for example, or checking for prompt conditions, poses no such risk.  Blocking other threads while doing things that have no impact on the data is unnecessary and would impose severe performance penalties.  In fact, most of the effort we've put into improving scan performance relates to locking only when necessary.

And the only place in the scan process where we have potential concurrency issues is when updating the line item data.  So, when the process enters that part of the code, we acquire the necessary locks and release them as soon as the data update is complete.  And we don't lock all threads across the board, only threads trying to update the same manifest or pull sheet.  Workstations working on different pull sheets should experience no major slowdown due to lock-induced request serialization.  (In reality, this isn't 100% true, because MySQL has its own system of page and table locks, but Flex-owned locks would not block each other.)
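In sketch form (the method and helper names below are assumptions, not the actual Flex code), the narrow lock scope looks something like this:

    // Fine-grained version: only the line item update runs under a lock, and the
    // lock is keyed to the specific pull sheet (manifest), so scans against other
    // pull sheets are never blocked by this one.
    private void updateLineItemData(ScanRequest request) {

        // Availability checks, prompt conditions, etc. happen before this point,
        // outside any lock, because they don't touch shared data.
        String manifestLockId = request.getManifest().getObjectIdentifier();
        try {
            getConcurrencyService().acquireLock(manifestLockId);   // lock only this pull sheet
            applyProgressUpdate(request);                          // hypothetical helper: the actual data update
        }
        finally {
            getConcurrencyService().releaseLock(manifestLockId);   // release as soon as the update is done
        }
    }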

Next Steps

We feel reasonably confident that we've nailed the issue, but due to the uncertain nature of deadlocks, there's no way of knowing 100% until we've tested the fix in the real world.  We'll be building a release candidate soon and conducting a test in the St. Louis customer's warehouse to see if we "really" fixed it.

Update:

The St. Louis customer has tested the fix in their warehouse and reports all is well.