Tuesday, January 31, 2012

The Availability Cache


Having done software development in other industries ranging from insurance to manufacturing to nuclear security, I'm always amazed at how well the rental industry stacks up against them in terms of complexity and uniqueness.  In many respects, the rental industry is like any other: we make customer contacts, market our wares, provide quotes, take orders, fill orders and ship orders.  Pretty standard business, well within the capability of SAP, Oracle Fusion, or QuickBooks, right?  Not quite.  There is one little wrinkle that makes the rental business different.  The stuff we ship comes back.  That one caveat renders the big ERP packages a poor fit for our industry and borderline useless in many cases.

The number one benefit anyone purchasing rental software hopes to derive from it is a clear way of managing availability.  Rental software is only as good as its availability math, and the availability math is complicated.  For a single location, without purchasing or subrentals factored in, it's not so bad.  You can just run a database query over a date window, find the point in the range where utilization peaks, and subtract that peak from the quantity owned.  You can also compute the same peak with a running total technique.
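
To make the single-location case concrete, here's a minimal sketch of the running total technique using plain java.util collections and a hypothetical Booking class (start date, end date, quantity): sweep the booking boundaries in date order, track the running utilization, and subtract the peak from the quantity owned.

    // Clamp each booking to the query window, record +qty at its (clamped) start and
    // -qty at its (clamped) end, then sweep the dates in order keeping a running total.
    // The peak of that running total is the worst-case utilization in the window.
    public int availableInWindow(int quantityOwned, List<Booking> bookings, Date from, Date to) {
        TreeMap<Date, Integer> deltas = new TreeMap<Date, Integer>();
        for (Booking b : bookings) {
            if (b.getEnd().before(from) || b.getStart().after(to)) {
                continue; // booking falls entirely outside the window
            }
            Date start = b.getStart().before(from) ? from : b.getStart();
            Date end = b.getEnd().after(to) ? to : b.getEnd();
            addDelta(deltas, start, b.getQuantity());
            addDelta(deltas, end, -b.getQuantity());
        }
        int running = 0;
        int peak = 0;
        for (int delta : deltas.values()) {  // TreeMap iterates in date order
            running += delta;
            peak = Math.max(peak, running);
        }
        return quantityOwned - peak;
    }

    private void addDelta(TreeMap<Date, Integer> deltas, Date when, int quantity) {
        Integer current = deltas.get(when);
        deltas.put(when, current == null ? quantity : current + quantity);
    }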

This is complicated enough, but it's woefully oversimplified given the realities of the average rental house.  In that reality, equipment is transferred between shops, often directly from show to show without returning to the shop.  Equipment breaks and must be taken out of service, which can have ripple effects on future jobs.  Amps can live inside cases and still be part of inventory, although for the purposes of availability they should be invisible unless the customer wants the whole amp rack -- or not, as with our new storage container feature.  Items tracked by serial number still need to be factored into the math when querying availability for the parent item.  In some cases, an item isn't really an item at all, but a composite or kit whose availability is the minimum availability of its component parts.  And so on.  We have a giant pile of special edge cases we account for in the Flex availability algorithm.  These are but a few.
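
The kit case in particular reduces to a one-liner.  Here's a tiny sketch, with hypothetical names, of the rule as described above:

    // A kit is only as available as its scarcest component.
    public int kitAvailability(List<String> componentItemIds, Map<String, Integer> availableByItem) {
        if (componentItemIds.isEmpty()) {
            return 0;
        }
        int kits = Integer.MAX_VALUE;
        for (String itemId : componentItemIds) {
            Integer available = availableByItem.get(itemId);
            kits = Math.min(kits, available == null ? 0 : available);
        }
        return kits;
    }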

Then there's the matter of what you, as the user, will do with the availability math.  Most of us don't say 'no' when a customer inquires about availability and gear is out.  We say yes and try to cover the shortage with a subrental or by purchasing the equipment, maybe transferring it from another shop.  Any system that attempts to give a true picture of availability has to account for all of this.

At Flex, as we added these layers of detail to our availability calculation, we noticed that it took longer and longer to run.  Any single calculation is pretty fast, but when hundreds of them have to be done for larger quotes or pull sheets, it can take quite a while for them all to complete.

Last week we set out to address this issue.

The Developer's Toolkit

When I first started working in professional software shops I hoped to learn some neat technical tricks for improving performance, some secret sauce the pros use to make things zip.  Over time, as I learned these tricks, particularly those available to us who write database-driven business software, I was a bit disappointed.

In most database applications, not surprisingly, the database is the bottleneck.  Scrutinizing the code for Big O improvements is fine, but tuning, in my experience, always comes down to how the database is queried and when.  With this in mind, the performance tuner's list of options is limited:

  • Do something only at the exact moment you need to do it (lazy fetching)
  • Do something before you need to do it (eager fetching)
  • When you do something, preserve it so you don't have to do it again (caching)
You can also combine eager fetching and caching by loading commonly used data or resources when the system starts up.  When you launch Photoshop for example -- and the splash screen pops up and you wait for it to open -- this is what it's doing.
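
As a simplified illustration of the first two options, here's roughly what the choice looks like with JPA/Hibernate mapping annotations.  These aren't Flex's actual mappings, just the general mechanism:

    @Entity
    public class LineItem {

        // Lazy: downstream lines aren't loaded until something actually touches them.
        @OneToMany(mappedBy = "upstreamLine", fetch = FetchType.LAZY)
        private List<LineItem> downstreamLines;

        // Eager: the resource comes back in the same fetch as the line item itself.
        @ManyToOne(fetch = FetchType.EAGER)
        private Resource resource;
    }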

That's our toolkit. 

Profiling

The first rule of performance tuning in software is not so different from carpentry: measure first, fix once.  This stems from the tendency of developers to make assumptions about what aspect of the code might be slow and spend fruitless hours iterating over something that may look inefficient, but in practice contributes very little to the overall slowdown.

At Flex we use a profiling tool called YourKit.  When we started profiling the availability speed, using data from the customer that brought the issue to our attention, the profiler allowed us to quickly identify some big (and easily fixed) bottlenecks, usually by switching something from eager fetching to lazy fetching or vice versa.

The one nut we could not crack by tuning data fetch strategies was something we call the double conflict filter.  In situations where a quote is used to generate a pull sheet and the scan process creates a manifest, control of availability is passed down the line to what we call downstream elements.  The double conflict filter ensures that if a quote, a pull sheet and a manifest are all present, a given item is not counted more than once.  This is a recursive operation.

When we fired up the test availability scenario in our profiler, we saw that the real bottleneck appeared to be the hasDownstreamMatch() method, as shown in this screenshot:



The other, longer-running methods shown in the profiler all wrap each other; it's almost like a reverse view of the call stack, so we know that the hasDownstreamMatch() method is the longest-running single component of the higher-level methods.

The hasDownstreamMatch() method is a recursive method a line item can use to determine if there's a matching line item downstream in the workflow.  The method checks all downstream lines for the matching criteria (resource types, status, etc.), and all the downstream lines for their downstream matches -- and so on.  The downstream line items used by the calculation are lazy fetched, meaning they are retrieved on demand.  Since the method is recursive, each recursion triggers its own lazy fetch operation, making this a somewhat atypical variant of the N+1 select problem, where N is the depth of the tree if we're lucky.  In the worst case N could grow much larger -- exponentially, if the downstream graph branches heavily -- though in practice the downstream line item graphs don't branch out much and N stays close to the tree depth.
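
In rough shape, with the matching criteria reduced to a placeholder, the method looks something like this.  This is a reconstruction for illustration, not the actual Flex source:

    public boolean hasDownstreamMatch(LineItem line, MatchCriteria criteria) {
        // getDownstreamLines() is mapped lazily, so every level of recursion
        // fires another select against the database.
        for (LineItem downstream : line.getDownstreamLines()) {
            if (criteria.matches(downstream)) {
                return true;
            }
            if (hasDownstreamMatch(downstream, criteria)) {
                return true;
            }
        }
        return false;
    }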

It seemed like a perfect opportunity to take a case where lazy fetching causes an N+1 select problem and turn that N+1 into constant time with an eagerly fetched join.  The problem is this would require something called a recursive join, and recursive joins aren't supported by Hibernate (our O/R mapping framework).  Some databases do support recursive joins, but not MySQL.  Even so, the typical tree depth never runs much higher than four or five, so we could potentially join a table to itself a fixed number of times and get close enough.  We tried this.  It replaced several fast queries with one slow query, and bypassing Hibernate meant hydrating the ids returned by the query back into objects ourselves.  The net effect was that the performance bottleneck simply moved elsewhere, and the join query itself was slow -- not surprising, since we were pushing our luck with MySQL in the first place.
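
For the curious, the fixed-depth experiment looked roughly like the following: join the line item table to itself once per workflow level and hydrate the ids by hand.  The table and column names here are illustrative, not the real schema:

    String sql =
        "SELECT l1.id, l2.id, l3.id, l4.id "
      + "FROM line_item l1 "
      + "LEFT JOIN line_item l2 ON l2.upstream_id = l1.id "  // pull sheet lines
      + "LEFT JOIN line_item l3 ON l3.upstream_id = l2.id "  // manifest lines
      + "LEFT JOIN line_item l4 ON l4.upstream_id = l3.id "  // one level of slack
      + "WHERE l1.id IN (:lineIds)";

    List<Object[]> rows = session.createSQLQuery(sql)
            .setParameterList("lineIds", lineIds)
            .list();
    // Each id then has to be hydrated back into a LineItem by hand, since Hibernate
    // never sees these rows as mapped entities.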

So we found ourselves with a bottleneck that could not be fixed with tuning fetch strategies.  It was time to contemplate caching.


Caching

We cache things all over the place in Flex.  If you check out the Flex Admin Console, you can peek under the hood and see information about what information Flex is caching and even clear the caches if you want to.

The difference is we tend to cache things that hardly ever change, like project element definitions and configuration settings.  There's no point going back to the database every time we need a configuration setting that might change once every four years.  We've also recently introduced a second-level Hibernate cache, which helps Hibernate do its internal work faster.

We'd always thought about caching availability information -- after all, there's no point recalculating something if the underlying data used to calculate it hasn't changed -- but something about this always made us nervous.  Caching availability is one thing, but what about keeping the cache up to date?  Stale availability data is unacceptable.  The concern for us was always missing a condition that should trigger a cache eviction.

Yet, there was no other way around our recursive join issue.  Caching availability had to be on the table.  We knew right away the real trick to making an availability cache fast and reliable was the right cache eviction strategy.

Cache Eviction

Cache eviction refers to the process of throwing stale data out of the cache.  For the availability cache, it means determining which events would render which availability calculations invalid.  All the caveats and edge cases that make the calculation slow enough to warrant a cache also make the cache eviction events complicated.  We found ourselves facing a situation where just determining when to evict something from the cache might be slow.  I believe Joseph Heller coined a phrase for situations like that.

Then we hit upon an idea: what if instead of trying to build the perfect, surgical cache eviction system, we opted for a simple approach that erred on the side of caution and evicted all data that might be stale?  This would mean evicting some data from the cache that doesn't need to be evicted, but it would ensure that all stale data goes along with it while still preserving most of the benefits of an availability cache.

Our strategy became a simple one: whenever any information about a given resource changes, evict all cache entries that reference that resource.  (Resource is the generic term in our architecture for schedulable things like contacts, facilities, serial numbers and inventory items.)

If a serial number is added, evict all cache entries that reference the serial number and its parent item (because adding a serial number changes the on-hand quantity).  Whenever a line item is modified or created, evict all cache entries that reference whatever resource the line item referred to.  Notice I said evict all entries for the resource, not just the entry for the line item.  That means if you have a quote with a VL-2500 on it, and you change the quantity from 1 to 2, all other line items that reference the VL-2500 will have their cache entries evicted.  Same thing if you change a quote's date range or locations; the cache entries for all items referenced on the quote need to be purged from the cache, not just the cached records for the quote you changed.  This should come as no surprise.  After all, rental availability is about how different jobs impact each other; no job exists in a vacuum.
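
A minimal sketch of the rule, with hypothetical event handlers and a hypothetical cache service, just to show its shape:

    // Adding a serial number changes the on-hand quantity, so both the serial number
    // and its parent item get their cache entries thrown out.
    public void onSerialNumberAdded(SerialNumber serial) {
        cache.evictEntriesReferencing(serial.getId());
        cache.evictEntriesReferencing(serial.getParentItemId());
    }

    // A line item change evicts by the resource it points at, so every other line
    // referencing the same item is forced to recalculate as well.
    public void onLineItemChanged(LineItem line) {
        cache.evictEntriesReferencing(line.getResourceId());
    }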


Implementation

First off, we opted for a database cache, with the potential for an in-memory cache to come along later.  (High Availability Flex will have a distributed cache and we'll tackle the problem then.)  We started, rather naively, by adding a column to the line item table for the cached value; the cache eviction process would simply set this value to null.  Needless to say, we bypassed Hibernate and went with straight SQL.

We built the whole cache architecture around this approach, hooked in all the eviction events and began testing it.  The problem we ran into in beta testing was MySQL table locks.  The line item table is the most active table in Flex.  Cache eviction queries were failing intermittently or locking other operations out.  This manifested itself in beta testing with occasional scan failures.  Needless to say, this approach wasn't working.

Luckily, the solution wasn't all that terrifying.  We just moved the cache to its own table with no foreign key relationships (although we did add indexes in place of foreign keys for speed).  The reason we didn't do this before is that adding a cache entry for a line item for which no cache information has ever been stored requires two queries: one to determine if the entry is present and another to either insert or update it.  We got around this by skipping the select query and using the number of updated rows returned by the standard update query to determine if an insert is necessary.  If the number of updated rows is zero, we know we need to run an insert query to create the cache entry.  This has worked well so far.
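
Sketched out with hypothetical table and column names (jdbc here stands in for whatever executes the SQL), the pattern is simply: try the update first, and only insert when nothing was updated.

    int updated = jdbc.update(
        "UPDATE availability_cache SET available_qty = ?, expires_at = ? "
      + "WHERE line_item_id = ? AND resource_id = ?",
        quantity, expiry, lineId, resourceId);

    if (updated == 0) {
        // No cache row existed for this line/resource pair, so create one.
        jdbc.update(
            "INSERT INTO availability_cache (line_item_id, resource_id, available_qty, expires_at) "
          + "VALUES (?, ?, ?, ?)",
            lineId, resourceId, quantity, expiry);
    }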

Cache Eviction Paranoia

Even so, all those queries that failed due to table locks during beta testing left us a little wary of just trusting that every eviction query would magically succeed now that we had a brand new table just for the availability cache.   If adding an entry to the cache failed, no biggie; we'll just have to recalculate it next time.  If a cache eviction query fails, however, that could mean users getting stale availability numbers.  So we built a rather elaborate system for retrying failed cache evictions. 

To begin with, a cache eviction triggering event can tell the cache management service whether or not the eviction must happen before the user action that triggered it completes.  A line item quantity change is a good example of this.  The cache must be evicted immediately to trigger a recalculation.  In other situations, such as changing a non-serialized quantity or adding a serial number, it might be okay if the cache eviction doesn't happen for a second or two.

However the cache eviction is initiated, either in the user's thread or in another thread, if it fails, it must be reattempted; and when this happens we want to give the system some time to breathe.  Trying the exact same thing the millisecond after it failed is likely to produce the same result, and a cascade of deadlocks if things get really out of hand.

So we built a retry queue and made it a persistent JMS queue.  This means that if the first cache eviction attempt fails, information about the cache eviction request is serialized as XML and placed on a retry queue.  A process listens to the queue and, after a respectable delay (breathing room), unpacks the message and tries the eviction again.  If it fails again, it goes back on the queue until the max retry threshold is reached.  (Which is 10, by the way.)  The idea here is to stretch the retry attempts out over a fairly long period and increase the chance that whatever intermittent condition caused the first eviction attempt to fail will have cleared up.
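
A hedged sketch of the retry listener follows.  The JMS API calls are real; the message property name, request format and collaborators are illustrative, not Flex's actual code.

    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    public class EvictionRetryListener implements MessageListener {

        private static final int MAX_RETRIES = 10;

        private CacheEvictionService evictionService;  // performs the actual eviction (hypothetical)
        private RetryQueueSender retrySender;          // re-enqueues the request after a delay (hypothetical)

        @Override
        public void onMessage(Message message) {
            try {
                String xml = ((TextMessage) message).getText();   // the serialized eviction request
                int attempt = message.getIntProperty("attempt");  // illustrative property name
                try {
                    evictionService.evict(xml);
                } catch (Exception evictionFailedAgain) {
                    if (attempt < MAX_RETRIES) {
                        // Back on the queue; the delay gives the database room to breathe.
                        retrySender.sendLater(xml, attempt + 1);
                    }
                }
            } catch (JMSException e) {
                // The message itself is unreadable; nothing sensible to retry.
            }
        }
    }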


Testing Cache Eviction

Testing when and if the cache was evicted presented a unique problem for Courtney, our one-woman QA team.  The availability numbers in quotes look the same to end users whether they were retrieved from cache or not.  Courtney needed an easy way to tell which numbers came from cache and which numbers were recalculated on demand.

Flex already has a number of test utilities and hooks built into the client, but you have to put the client in debug mode to access them.  How do you do that?  You add &debug=true to the query string.  When you do this and load up a quote, you'll be able to tell if the availability numbers you're looking at came from cache or not by hovering over the number; the tooltip will say either 'Cached' or 'Recalc'.

(Note that if you double click on the availability numbers and open the conflict dialog, this will trigger a recalculation and update the cache -- because the conflict dialog requires detailed information that isn't stored in the cache.)

Asynchronous Cache Updates

Because large cache evictions have the potential, in theory, to hold on to table locks for a few seconds, we didn't want recalculations that trigger cache updates to have to deal with this possible lock contention.  The solution was to hand the cache update off to a separate thread and let the user go on with their life.  This means that even if the cache eviction from hell locks the cache table for five seconds or more, the waiting is done by a background thread and not by the user.  We also decided to serialize cache updates to control the number of simultaneous update threads, thereby minimizing the number of concurrent threads requesting table locks.  Here's a snippet of code that deals with cache updates:

        if (threaded) {
            // Hand the cache write off to the executor so the caller doesn't have to
            // wait out any lock contention on the cache table.
            executor.execute(new Runnable() {
                @Override
                public void run() {
                    doCache(resourceId, lineId, availability, compositeAvailability, expiry);
                }
            });
        }
        else {
            // Synchronous path: write the cache entry in the caller's thread.
            doCache(resourceId, lineId, availability, compositeAvailability, expiry);
        }

To anyone familiar with Java concurrency, this is very basic stuff.  We have a method elsewhere in the service class that does the work and we simply declare an inline Runnable and throw it on the Executor, which manages the threads for us.
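
One simple way to get the serialization mentioned above is to back the executor with a single worker thread via the standard java.util.concurrent factory; whether Flex configures it exactly this way is an implementation detail.

    // One worker thread: cache writes happen one at a time, so at most one
    // background thread is ever waiting on the cache table's locks.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();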

Precalculation

With Chris away doing trade shows, there was no one around to stop us from gilding the lily, so we took the cache one step further and created a background agent to precalculate availability.  You can see the settings for this agent (in 4.4.17 and up) under Projects > Availability Cache. 

What the precalculation agent does is wait until the server is idle, make an educated guess about which quotes or jobs are likely to be viewed by a user, and precalculate the availability for those jobs.

There are two bits of logic that make this work.  The first is determining when the server is idle; we wouldn't want to waste CPU on speculative calculations while users are on the system.  The other is the guesswork used to determine which calculations are most useful to perform, because it would be pointless to precalculate everything.

For the first part, determining when the CPU is idle, we present the user with three important settings.  The first is Max CPU Threshold, or the percentage of CPU utilization above which precalculation should not be running.  The second setting is Agent Check Interval and this determines how often the agent should check the CPU load.  This is used in conjunction with the Min Quiet Period setting to determine when to start the precalculation agent.  For example, if you have a CPU threshold of 20%, an agent check interval of 5 minutes and a quiet period of 3, this means that the agent will check the CPU every five minutes and if the CPU is below 20% three times in a row, the precalculation will start.  In practice, this would mean that precalculation would start approximately fifteen minutes after the users go home for the day.  When the CPU goes back up above the max threshold, the process will stop.
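
A sketch of the idle check, with hypothetical names: sample the CPU load every check interval and start precalculating only after enough consecutive quiet samples.

    void runAgent() throws InterruptedException {
        int quietSamples = 0;
        while (running) {
            if (systemMonitor.currentCpuPercent() < maxCpuThreshold) {
                quietSamples++;                  // another quiet sample toward the streak
            } else {
                quietSamples = 0;                // a busy sample resets the streak
                precalculator.stop();            // and pauses any precalculation in progress
            }
            if (quietSamples >= minQuietPeriod) {
                precalculator.start();           // e.g. 20% threshold, 5 minute interval, quiet period of 3
            }
            Thread.sleep(checkIntervalMillis);   // the Agent Check Interval
        }
    }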

The next bit of logic, guessing which jobs to precalculate, is done by first looking at the recent items for all users.  Any job a user has recently looked at will get calculated first.  Once all these are done, jobs starting within a certain number of days of the present will be calculated.  The default is 30 days, but this can be adjusted by changing the read-ahead period.  The read-ahead period can be changed to months -- or even years ahead if you can afford the electricity.
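
The selection logic, again sketched with hypothetical names, amounts to recently viewed jobs first, then anything starting within the read-ahead window.

    List<Job> jobsToPrecalculate(Date now) {
        // Recently viewed jobs go first; they're the most likely to be opened again soon.
        List<Job> jobs = new ArrayList<Job>(recentItemService.recentlyViewedJobs());

        // Then anything starting within the read-ahead window (30 days by default).
        Date horizon = addDays(now, readAheadDays);
        for (Job job : jobRepository.startingBetween(now, horizon)) {
            if (!jobs.contains(job)) {
                jobs.add(job);
            }
        }
        return jobs;
    }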

In Conclusion

The availability cache has cleared testing and will be deployed as part of the Flex 4.4.17 release, which is due to be pushed this week.  In testing we've noticed a significant increase in the speed and overall usability of the system.  It was a fun project to work on and we hope Flex users will find it a welcome addition, even if they never really know it's there.
