Friday, May 10, 2013

Fixing The Double Conflict Filter

The main priority right now at Flex is making things fast.  There are things we did when we originally designed Flex (and Shoptick before it) that we knew were suboptimal, but at the time we felt going all the way would divert resources away from more pressing concerns and more importantly, would have been wildly optimistic about our company's future prospects.

It would have been akin to a company that makes irrigation valves overdesigning their products on the off chance they might be used in a nuclear power plant.

We never expected Flex to get so big so quickly and to have customers with as much concurrent throughput as we have now.  We'd always hoped to get there, but when you're starting a company you have to focus on the here and now, on the needs of the customers you have - and keep your goals realistic.

This is why we chose Hibernate as an ORM framework instead of building our own micro-optimized persistence layer.  Hibernate is ubiquitous, fast to develop in, and although not as performant as rolling your own, usually good enough.

Whatever our reasons were initially, the landscape has changed and the time for hand optimized persistence and caching code has come.

Faster Availability

Much of the work we've done so far on system performance has related to the scan process.  We've fine tuned much of the scan bottlenecks (the .14 and .15 releases include a lot of scan performance improvements) and now our attention turns to availability calculations.

The Flex availability calculation process is very complicated, but there are two main phases governed by separate modular components: The Conflict Engine and The Availability Engine.

The Conflict Engine's job is to retrieve and process line items from the database that might be relevant to an availability calculation.  The Availability Engine then takes the output from The Conflict Engine and applies all the ship, return, container, subrental, transfer, and expendable return logic to produce a final result.

The purpose of this design was to isolate the I/O intensive part of the calculation in one place (The Conflict Engine) and leave The Availability Engine to focus on relatively high speed in memory computation.

We've known for some time that the bottleneck in availability performance is the conflict engine and learned over time that the database query used to retrieve line items is fast.  The work Hibernate does to turn that query into line item objects however - is not.

Another main bottleneck in the The Conflict Engine is what we call The Double Conflict filter.  This filter's job is to remove related line item entries from the conflict result, else an item might get counted more than once.  Consider the following graph of a typical line item relationship in Flex:

This shows a pretty conventional process where a line item is placed on a quote, the pull sheet is generated, and as the show is scanned out, two manifest line items are created with the specific serial numbers.

But there are four line items in the system referencing the console for a total quantity of 6 conflicts - when only two consoles are actually in use.  In Flex, we address this problem by assuming that only the line items furthest downstream in the workflow are in control of availability.  This deals with the double conflict problem, but is also intended to handle the problem of the plan diverging from reality.

What would happen in this situation where the L1 made a judgement call in the warehouse and only decided to take one console or maybe decided to take a lower end console as a spare?  If we went solely off the plan, other shows would show a shortage and you might end up subrenting a piece of gear that you had sitting on the shelf the whole time.

The performance problem with this approach is that the current algorithm uses recursive I/O to crawl down the downstream object graph, necessitating a database hit for each level of the graph.  This is slow.

Bypassing Hibernate

The first step (already completed) is to bypass hibernate in the conflict engine.  We did this by replacing the fully mapped line item objects with a simple light weight DTO object that only contains the line item fields relevant to an availability calculation. (Fields like location, ship date, return date, etc.)

Instead of running a database query to pull back a list of line item ids and feeding those ids into hibernate for hydration, all the fields come back in one query and get copied directly onto the DTO object.  This was pretty straightforward.

Adjacency Matrix

The next step is to reform the double conflict filter by getting rid of the recursive IO needed to retrieve the graph. (I/O buried inside Hibernate, I might add).  To accomplish this, we're introducing a persistent adjacency matrix to represent the upstream/downstream relationships.  We also decorate this adjacency matrix with the status of the downstream line item and whether or not the status is conflict creating, which saves yet another Hibernate lookup.

Each line item has a reference to the adjacency matrix used to represent upstream/downstream line items and we can retrieve all the relationships in a single query - and store them in a small and easy-to-cache object.  The caching will further reduce database I/O.

Local Caching / Distributed Cache Version Control

We're also introducing a new cache version feature that will be needed for high availability Flex.  We learned a few months ago when we first started playing around with memcached that serializing our domain objects for the trip over the wire to memcached and back was slow.  Not as slow as not caching at all, but still less than ideal.  It would also necessitate a large cache and since we'll be using Amazon's Elasticache, this comes with a pricetag.

What we decided to do was stick with an in memory cache, but use memcached to help us know when an object cached in memory was stale (modified by another server in the cluster).  We do this by giving cacheable objects the ability to implement an interface whose single method returns an SHA1 hash that represents the version or state of the object.  It could be a message digest based on the properties of the object, or, when generating a string to base the digest on would be too expensive, it could simply be a unique (and persistent) hash that changes when the object is mutated (like a git commit).

The cache lookup code will pull the object it has in the local cache and compare it's version hash to the one in memcached for the same cache key (and do this in separate parallel threads). If they match, all is well, and the cached object is returned.  If they don't match, then the in memory object is no good and the cache lookup will return null.

A side effect of this approach is that it could lead to memory churning (lots of space related cache evictions) if our high availability architecture uses round robin load balancing.  We'll try to optimize the architecture by using hostname affinity for load balancing, session affinity as a last resort.  Unlike most "sticky session" based load balancing, this approach will just be a performance optimization and shouldn't impact reliability or failover.  We don't rely on http sessions, so losing session variables won't hurt the user at all.

Wrap It Up, Kid

This work is currently slated for release in Flex 4.6.16, although given the risk and magnitude of the change, that version number could slip to .17 or .18.

The single slowest part of Flex has always been availability calculations and this batch of enhancements so far have been very encouraging.  We haven't done a formal comparison yet with unoptimized versions of Flex, but on my machine the availability calculations appear to be virtually instantaneous.  Here's hoping the fix holds up in regression testing and the performance boost we're seeing so far holds up, too.

No comments:

Post a Comment