Wednesday, December 26, 2012

Everything You Always Wanted To Know About Hibernate Performance Tuning But Were Afraid To Ask

One of the things I've learned over years of developing software is that there's no such thing as a black box.  If you pick a framework or library off the shelf and just hope it works forever, that you'll never have to learn much about it, you're due for a day of reckoning.  The more complicated the problem the library solves, the sooner that day of reckoning will come.

In my opinion, there are few technologies more prone to black box syndrome than Hibernate.  If you browse forum posts and discussion threads on Hibernate, most of the posts are from people who just refuse to look inside the black box.  It's possible to get up and running on Hibernate without learning much - that is true.  But to use it in the real world, at web scale (whatever that means), requires a much deeper understanding of it - how it fetches data, how it caches data, etc.  Otherwise, your software will be dirt slow.

We've tinkered with Hibernate performance over the years, but always shied away from blocking off a few weeks to bear down and really dial it in.  Over the last few weeks, we took one of our high volume customer systems as a test case and worked through six key use cases; through a fairly agonizing and time consuming process, we were able to achieve 20X-30X performance boosts.  We were hoping for 100X, and there are still some things we can do (unrelated to Hibernate) that could possibly get us to the next order of magnitude, but we'll have to defer those for another day.

On XML And Religion

One area unrelated to performance tuning, but nonetheless a big philosophical debate within the community of Hibernate users, is whether to define the object relational mappings via JPA annotations or an external XML file.  The cool kids these days seem to prefer Annotations. Come to think of it, the cool kids seem to hate XML for just about anything these days.  And of course, anyone, especially someone who works in technology, who has absolute love or hatred for any technology, isn't practicing science or engineering - they're practicing religion.  This is a well known tendency of developers, which is why you'll often hear technology subjects referred to as "religious issues". 

The alternative to being a religious developer is being someone who evaluates technologies in a less satisfying, but more practical way.  XML's a good example.  You hear a lot of developers these days slam XML outright, preferring JSON or plain text files.  My take is that XML is bad for data, but good for configuration.  XML is self documenting which makes it easy to configure servers or object/relational mappings via XML, but it's not the sort of thing I'd want to send over the wire when speed or bandwidth are factors.  In those situations, JSON or a bitpacked format like AMF is more appropriate.  Right tool for the job.


I think this religious hatred of XML, even in situations like configuration where size isn't a factor, has driven a lot of young developers to go the annotations route with Hibernate.  One of the problems with religious zeal is it tends to blind people to all other considerations.  For example, there are all kinds of reasons why it's bad to co-mingle a system's domain model with the persistence mechanism, but all these reasons get smothered by XML hatred.

For example, suppose you believe, as I do, that a domain object or entity shouldn't know about the table used to persist it.  Why do I believe this?  It doesn't involve a bearded prophet.  It's because you might change your mind one day about the persistence mechanism or a customer might force a change on you.  All the persistence code needs to be contained in a single layer that can be swapped out if need be.  The annotation camp will usually say "nobody will ever really change the database" or "JPA is the persistence abstraction".  Maybe.  But you might change the ORM.  You might upgrade it (as we just did.)  You might decide to switch from HQL queries to criteria objects.  It's nice to know that all the code you'd need to change might be in a set of related DAO classes instead of scattered all over the application.  An even more practical example is that caching and indexes might be slightly different for different deployments of the application.  Being able to swap in a different set of Hibernate mapping files for a self-hosted deployment versus a cloud hosted deployment is of real value to us now, not in the abstract.  You can mix annotations with XML files, but I tend to think this creates confusion.

So we prefer XML configuration over Hibernate/JPA annotations at Flex because it decouples the persistence details from the domain.  I will concede, however, that if I were working on an internal application at a corporate IT shop where only one instance of the application will ever be run, the prospect of ever needing a plug-and-play persistence layer is pretty remote.  In that case, putting all the mapping configuration in annotations might make the code easier to understand.  I actually used this approach at the company I worked at prior to Flex.  Like most technology choices, XML vs Annotations is a choice that has to be informed by rational considerations specific to the environment and the project.  I hate XML is not a rational consideration (though you're still free to hate it.)

How Hibernate Works

With the philosophical debate out of the way, we can move on to the technical issues of making Hibernate fast.  We'll begin with a brief description of what Hibernate is and how it works - which should come as a great relief to those readers who've muddled through all the jargon to get this far in the post.

Most Java software (and most modern software) is based on an Object Oriented design model.  This means we represent "nouns" or data as objects, or more formally "classes".  For example, in Flex we have classes like ScanRecord, InventoryItem, SerialNumber and so on.  There are actually hundreds of different classes in Flex.  We do this because it's much easier and logical to work with objects than it is to work directly with database queries and recordsets.  Contrary to what a lot of Database Administrators might think, we developers do this because it's mundane and repetitive to work directly with databases, not because we're afraid of SQL.

Hibernate's job is to translate these domain objects/classes into SQL for us, and in so doing remove much of the drudgery of database interaction so we can focus on higher order thinking skills like business logic and user interfaces.  To make this work, we configure Hibernate (via Annotations or XML files) to know which tables go with which classes.  Then we can use Hibernate provided classes to save objects, delete them and query them.  Hibernate generates the SQL and takes care of the details.
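
To make that concrete, here's a minimal sketch of the everyday pattern, using the ShippingMethod class mapped later in this post (session handling is simplified and the accessor names are my assumptions, not verbatim Flex code):

    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();

    ShippingMethod method = new ShippingMethod();
    method.setName("Two Day Air");
    session.save(method);   // Hibernate generates the INSERT for us

    ShippingMethod loaded = (ShippingMethod)
            session.get(ShippingMethod.class, method.getObjectIdentifier());
    // In this same session the object comes straight back from the session cache;
    // in a fresh session Hibernate would generate the SELECT.

    tx.commit();
    session.close();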

To make this fast Hibernate provides several different caching mechanisms.  There's the query cache, which enables Hibernate to intelligently decide when a query doesn't need to be rerun.  There's a session cache, which caches objects for a brief time in the Hibernate session (typically a session has the lifespan of a single HTTP request), and a second level cache, which can cache objects between sessions.

If you examine a commonly used method on the Hibernate session like byId(), which retrieves an object instance by its identifier or primary key, you'll see that Hibernate checks for an instance of the object in the session cache and the second level cache before resorting to running a database query.

Let's assume that we all have a working knowledge of Hibernate now and dive into some optimization tips we uncovered over the last few weeks.

The Session Cache

You get the session cache for free.  There's nothing to configure or turn on, and it's just a hash table holding a real reference to the persistent object (as opposed to the second level cache, which caches a serialized representation of the object.)

In essence, if a session attempts to retrieve an object twice, it will only result in one database hit.

The session cache is faster than the second level cache because it doesn't have to deserialize or "hydrate" objects.
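
A quick sketch of what this buys you in practice (assuming a mapped ShippingMethod class and an id already in hand):

    Session session = sessionFactory.openSession();

    // The first lookup may hit the second level cache or the database.
    ShippingMethod first = (ShippingMethod) session.get(ShippingMethod.class, id);

    // The second lookup is served straight from the session cache.
    ShippingMethod second = (ShippingMethod) session.get(ShippingMethod.class, id);

    assert first == second;  // same object reference - no hydration, no query

    session.close();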

The Second Level Cache

The second level cache is used to cache objects with a lifespan longer than the session.  Any meaningful performance tuning will usually involve extensive use of the second level cache.  In order to use the SLC, you have to configure an external cache provider like EHCache or SwarmCache (we use EHCache) that plugs into Hibernate and handles all the details of sizing, eviction, disk overflow, etc.
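
For reference, the stock property based wiring looks something like this sketch (Hibernate 4 property names; we actually wire our cache provider through Spring, as the Hibernate 4 post further down describes, so treat this as the vanilla approach rather than ours):

    Configuration cfg = new Configuration();
    cfg.setProperty("hibernate.cache.use_second_level_cache", "true");
    cfg.setProperty("hibernate.cache.region.factory_class",
            "org.hibernate.cache.ehcache.EhCacheRegionFactory");
    cfg.setProperty("hibernate.cache.use_query_cache", "true");  // for the query cache, covered below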

You also have to tell Hibernate which classes to cache.  In our XML based approach, that requires adding a cache tag to each class we want to cache, like this:

    <class name="ShippingMethod" table="st_biz_shipping_method">
        <cache usage="nonstrict-read-write"/>
        <id column="id" length="36" name="objectIdentifier" type="string">
            <generator class="alto-uuid"/>
        </id>
        <property column="method_name" length="128" name="name"/>
        <property column="method_code" length="16" name="code"/>
        <property column="method_type" length="16" name="type"/>
        <property column="min_days" name="minimumDays"/>
        <property column="max_days" name="maximumDays"/>
        <many-to-one column="cost_rule_set_id" name="costRuleSet"/>
        <many-to-one column="waybill_template_id" name="waybillPrintTemplate"/>
        <set name="disabledPricingModels" table="st_biz_rc_disabled_pricing_models">
            <cache usage="nonstrict-read-write"/>
            <key column="inventory_item_id"/>
            <many-to-many class="com.shoptick.bizops.domain.PricingModel" column="pricing_model_id"/>
        </set>
                  
    </class>

This example will ensure that ShippingMethods get cached in the second level cache - and just ShippingMethods.  A common misconception about the second level cache relates to what happens to the object graph under an object that's been cached.

Let's clear up the confusion.  Caching a class only caches instances of that class.  It will not cache any associated classes.  In this example, costRuleSet and the waybillPrintTemplate values will not be cached.  The ID of the related object will be cached, but not the object itself.

This is actually a really good design from the standpoint of concurrency.  We don't have to worry about dozens of objects in the cache that all refer to the same related object, and that each cached instance might have a stale version of the related object.

Under the hood, when Hibernate chooses to place an object in the second level cache, it takes the object (and just the object, for the class with caching enabled) and serializes it to a simple string based format that represents simple data types: strings, numbers, dates, etc.  If one of the properties refers to another object, that object's identifier is stored in the cache - but not the object.

When an object is loaded from the cache, Hibernate instantiates an instance of the class and sets all its properties using the serialized values stored in the cache.  This is called hydration.  If one of those values is an identifier for another object, that object will be loaded using the normal three step process: session cache, second level cache, database.  This happens recursively until the whole object graph is loaded (assuming the object graph is eager fetched.)

List, map, set and other collection style associations are not cached by default.  In the previous example, you'll see that we have a cache tag as part of the set declaration.  In this case, a separate cache is used just to store collections.  But note that collection caches are just mappings of ids to ids.  The key for the cache entry will be the parent object's identifier and the value will simply be a list or set of ids.  All that's cached is the association, not the referenced objects at either end of the association.

Caching Modes

There are three cache modes supported by Hibernate: read-only, read-write and nonstrict-read-write.  These modes relate to how caches are locked and used to enforce concurrency. 

Read-only is the fastest, since there is no locking overhead.  The downside is that objects cached as read-only will not get refreshed if an instance of the object is changed.  Read-write is the slowest because every read operation takes out a lock to prohibit a write operation.  It also uses locks to prevent concurrent write operations.  The nonstrict version of read-write assumes there won't be concurrent write operations and, as a result, has much lower overhead in terms of lock synchronization.  We use nonstrict-read-write extensively for configuration data and read-write caches for data subject to frequent changes like line items.  We don't use read-only caches at all.

Lazy Fetching and Preloading

Most systems that start to make extensive use of the second level cache will inevitably need a preloader that initializes the cache with frequently referenced objects.  In that case it makes good sense to switch many objects - and the portions of their associated object graphs likely to be frequently needed - to eager fetching, so that the entire object graph gets preloaded and not just the top level object.

We ended up setting a large number of properties to eager load that we ordinarily wouldn't have, because we noticed a little quirk where lazy loaded associations were missing the cache.  For example, if getPricingModel() were set to lazy load and the associated pricing model was in the cache, the getter would go back out to the database anyway.  We think this could be the fault of the Hibernate Transaction Manager provided by Spring, but we aren't sure.  In short, our advice is to eager fetch if the associated object class is also cached.  A sketch of the preloader idea follows.
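
Our preloader boils down to something like this (the method name and class list are illustrative): walk the heavily used classes at startup so their instances - and, with eager fetching, the graphs beneath them - land in the second level cache before the first user shows up.

    public void preloadCaches(SessionFactory sessionFactory) {
        Session session = sessionFactory.openSession();
        try {
            // Pulling every instance through a session pushes the instances
            // (and their eager fetched associations) into the second level
            // cache configured for those classes.
            session.createQuery("from ShippingMethod").list();
            session.createQuery("from PricingModel").list();
            // ...one line per frequently referenced configuration class
        } finally {
            session.close();
        }
    }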

The Join Fetch Problem

Let's consider a many-to-one association on a class like the one configured here:

  <many-to-one column="customer_id" name="customer"/>

If you want to load the parent object and the customer object, you can either run one query or two.  You can run a single query that joins the parent class table and the customer table or you can run two separate queries: one against the parent class table and a second query against the customer table.  Hibernate lets you choose which way to do this using a fetch strategy.  The snippet below shows the two options in context.

    <many-to-one column="customer_id" name="customer" lazy="false" fetch="join"/>
    <many-to-one column="customer_id" name="customer" lazy="false" fetch="select"/>

In a normal use case, where neither the parent object nor the child object is cached, one query is generally better than two, so the default fetch mode is join.

But once you move to a configuration where the child object is likely to be cached, a fetch strategy of join can cause problems.  Think about it.  The purpose of the cache is to avoid reading information from the database unless absolutely necessary, especially information that doesn't change that often, like product descriptions or customer contact information.  Assuming the parent object is not cached and we're definitely going to need a query to fetch the parent object, we have no way of knowing what the id of the associated customer object is without first running a query.  So Hibernate runs that query with a join that brings back all the child object's fields along with the parent's.  In short, if you use a fetch strategy of join, you negate the benefit of using the cache because Hibernate will end up hitting the cached object's table anyway.

The solution is counter-intuitive: use a fetch strategy of select.  For the optimizing mind this seems scary because a select strategy means two database queries instead of one.  If neither object were in the second level cache, that would be true.  But if the associated object is highly likely to be in the cache, only one query gets run, because the first query brings back the id (the id lives on the parent object's table).  Once that id is in hand, Hibernate can check the caches for it before resorting to running a query.

If you want to guarantee a second level cache hit for many-to-one associations, make the properties eager fetch with a fetch strategy of select.

The Query Cache

One of the more common mistakes when configuring Hibernate is to enable the query cache and do nothing else.  This doesn't work.  You have to tell Hibernate in the code (or through saved queries) which queries or criteria objects can be cached.

A lot of developers are gun-shy about caching queries because of concurrency fears.  This is a valid concern when something other than Hibernate writes to a table being queried, but if all the I/O that updates the database is controlled by Hibernate, it's okay to be very aggressive with query caching.

Hibernate is smart enough to figure out when a cached query should be evicted.  It does this by checking the query cache every time an object is updated or deleted and evicting all queries that reference one or more of the updated tables.

If some other process writes to the tables, query caching can be dangerous (so can caching in general).  Otherwise, it's pretty useful, but you must manually tell Hibernate which queries to cache by calling setCacheable() on the query or criteria object.
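
In code, the opt-in looks like this sketch (the HQL and parameter value are hypothetical):

    List methods = session.createQuery("from ShippingMethod where type = :type")
            .setParameter("type", "GROUND")  // hypothetical lookup value
            .setCacheable(true)              // without this, the query cache is never consulted
            .list();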


Use byId() Instead of load() Or Queries

One of the really dumb things we found in the code is that we were using queries to retrieve objects by identifier.  We did this to enable additional HQL to be added to queries for soft deletes, etc.  It's a dumb idea because it ensures that in simple situations where someone calls findById() on a service or DAO, an SQL query gets executed, even if the object is cached.  One could hope that Hibernate would be smart enough to notice that the only field in the where clause is the object's identifier and check the cache before running a query, but it doesn't work that way.

The solution is to use the byId() method on session instead of running a query.  This will ensure that Hibernate checks the cache first.  There's also a method on session called load() that seems semantically equivalent to byId().  The difference is that load() will never return null: if no object matching the id exists, it returns an uninitialized proxy that blows up the moment you touch it.  This is usually not what you want.  Stick to byId().
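
Here's the shape of the change, as a sketch:

    // Old approach: always runs SQL, even when the object is sitting in the cache.
    ShippingMethod method = (ShippingMethod) session
            .createQuery("from ShippingMethod where objectIdentifier = :id")
            .setParameter("id", id)
            .uniqueResult();

    // New approach: checks the session cache and second level cache first,
    // and returns null if no matching row exists.
    ShippingMethod cached = (ShippingMethod) session.byId(ShippingMethod.class).load(id);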

Use Natural Keys

Under the hood, we're big believers in surrogate keys, meaning primary keys that are random, immutable and have no meaning at all in business terms.  We use 36 character UUIDs as primary keys instead of sequential integers.  If you need to generate sequential numbers, you need locks, and when you cluster the database, you need expensive locks enforced with network I/O.

Every domain object in Flex, and therefore every database table, has a UUID as a primary key.  But sometimes you need to look up an object by an alternate identifier, or a natural key.  In computer science lingo, a natural key is a unique identifier that has some kind of meaning to the user.  Examples would be social security numbers, bar codes, user ids, or job numbers.

The most common natural key lookups in Flex are user ids for logins and barcodes for inventory.  In the previous version of Flex, barcode lookups were done using an HQL query, albeit a simple one.  This has the same disadvantage as using a query to retrieve an object by its surrogate key: even if the object you're looking for is cached, Hibernate runs a query anyway.  In the case of natural keys, it doesn't know which id is associated with a given barcode, so it has to go to the database, and by default, when Hibernate goes to the database, it gets the whole object, even if it's cached.

To work around this issue, you can configure one property of an object as a natural key.  Here's the real configuration we use to support barcode lookups:

    <class name="ManagedResource" table="st_biz_managed_resource" lazy="false">
        <cache usage="read-write" include="all"/>
        <id column="id" length="36" name="objectIdentifier" type="string">
            <generator class="alto-uuid"/>
        </id>
        <natural-id mutable="true">
            <property column="bar_code_id" length="64" name="barCodeId" not-null="false"/>
        </natural-id>
This example flags the barCodeId property as a natural key and enables you to perform a natural key lookup and take full advantage of the cache - as shown in this example:

return session.bySimpleNaturalId(InventoryItem.class).load(barcode);

This bypasses the normal query approach and saves a query - in theory.  In reality, it results in a simpler query and, in time, no queries.  When you invoke bySimpleNaturalId(), the first thing Hibernate does is try to determine which primary key matches the given natural id.  There is a natural id cache, and this is checked first.  If there's no id in the natural id cache, Hibernate will run a very simple query like this:

         select id from inventory_item where bar_code = ?

This just gets the primary key, and if the object isn't cached, the system will run a second query to retrieve the object.  Otherwise, the object will be retrieved from cache.  The second time the same natural key is looked up, the natural-key to primary-key mapping will be in cache and there won't be any database queries at all.

Hibernate's natural key feature is really just a special technique for improving cache efficiency.  If you're not using the cache or have an object that you're not caching, natural keys can actually slow things down because you'll always get two queries with a natural key lookup: one to get the natural key to primary key mapping and another for the main object lookup.

Key Generation

Another good way to get a speed boost, especially if you define speed as reducing the number of database hits, is to use an in-process mechanism for generating new primary keys.  This is virtually impossible to do with sequential integers or other database driven key generation mechanisms.

Prior to 4.6, we used MySQL's built in GUID generator to generate primary keys, even though we weren't using sequential integers.  This meant that every insert operation required an extra select query to get a new primary key.

As part of this release we developed our own pure Java UUID generator and configured Hibernate to use it.  We released the UUID generator as part of the open source multi-tenancy project we've launched here: http://code.google.com/p/flex-alto/
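
The core of such a generator is small.  This sketch shows the general shape against Hibernate 4's IdentifierGenerator interface - the class and package names here are made up for illustration; our real "alto-uuid" generator (referenced in the mapping snippets above) lives in the flex-alto project:

    package com.example.id;  // illustrative, not the actual flex-alto package

    import java.io.Serializable;
    import java.util.UUID;

    import org.hibernate.HibernateException;
    import org.hibernate.engine.spi.SessionImplementor;
    import org.hibernate.id.IdentifierGenerator;

    public class PureJavaUuidGenerator implements IdentifierGenerator {
        public Serializable generate(SessionImplementor session, Object object)
                throws HibernateException {
            // Generated entirely in-process: no extra SELECT against the
            // database just to mint a new 36 character primary key.
            return UUID.randomUUID().toString();
        }
    }

Wiring it in is just a matter of pointing the generator element in the mapping at the class name.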


In Conclusion

This long post hopefully chronicles some of the lessons we've learned in the process of getting Hibernate up to high speed.  Like so many things in performance tuning software, the solutions are rarely clever or exotic.  You either do things before you need to (preloading), wait until the last minute (lazy-fetching) or do them only once (non-preloaded caching).

The next version of Flex will be 4.6.0 and will include these performance improvements.  We're in the process of doing QA rework and should be in regression testing by the end of the week.  We're planning a short beta to test the memory footprint of the bigger cache in production and will do a wider release after that hurdle is cleared.

Wednesday, December 5, 2012

Hibernate 4

As part of our new "flight plan" for the end of the year, we decided to take this opportunity to upgrade Hibernate, the object/relational mapping tool responsible for most of the database interaction in Flex.

Upgrading our technology stack is fairly common at Flex.  We frequently upgrade Spring and tons of smaller libraries like apache-commons, and we recently upgraded Jasper Reports.  Upgrading Hibernate, however, is not something to be undertaken lightly.  It's very rare that you can upgrade Hibernate without doing some kind of refactoring or finding strange and serious regressions.

We Fear Change

Part of this is cultural.  The folks at Hibernate and JBoss have a possibly undeserved reputation for not respecting backwards compatibility.  Gavin King, the original developer, had a paternalistic attitude about how Hibernate should be used and expressed this attitude through a notoriously abrasive personality.  Steve Ebersole, the current lead for Hibernate, is a bit more diplomatic and it seems that Hibernate now has a kinder and gentler culture.  Another reason Hibernate may have struggled with reverse compatibility issues is that the problem they're solving - mapping complex object graphs to two dimensional database tables - is incredibly difficult.  With so much abstraction and propeller-head algorithms required to build something like Hibernate, I'm not surprised in the least that abstractions leak and have to be revised as things move forward.

I would never even attempt to develop something like Hibernate and I have a tremendous amount of respect for what the Hibernate people have accomplished.  Not only did they build a great and useful ORM tool that's become the de facto standard - they forced the entire Java establishment, including the folks at Sun and now Oracle, to redefine Java persistence.  JPA is Hibernate and wouldn't exist without it - make no mistake.  Were it not for Gavin (love him or hate him) and Hibernate, Sun would no doubt still be trying to cram Entity Beans down our throats.

I mention this "upgrade friction" not to bag on Hibernate, but to highlight the fact that you don't upgrade Hibernate when you have a project that's up and running unless you expect to realize some kind of tangible benefit from it.  For us, we liked the new service based approach in Hibernate 4 because it seemed more Spring friendly (but not quite, as we shall see) and we liked the new caching architecture that got introduced over several iterations of Hibernate 3.  But the big thing for us was multi-tenancy support.  The big project on the horizon for us at Flex is converting the architecture to a one-instance/many-customers approach.  I didn't even know the lingo for what we were about to do was "multi-tenancy" until I started reading release notes for Hibernate 4.  That sealed the deal.  However painful it might be to upgrade Hibernate, it had to be less painful than developing our own JDBC layer multi-tenancy system (which we may still have to do, but the likelihood is far less now.)

Can't We All Just Get Along

We, like many people who use Hibernate, also use Spring.  Hibernate folks, in their talks and blogs, tend to downplay a Spring/Hibernate stack, even suggesting in an oblique way that Spring/Hibernate is an uncommon architecture that kind of annoys them.  Of course, anybody with their eyes open in the world of Java architecture knows that Spring/Hibernate architectures are incredibly common, perhaps the most common Java technology stack these days.  It wouldn't surprise me if the crowd at JBoss conferences doesn't reflect this reality, but it's a reality nonetheless.

The Spring/Hibernate feud, to whatever extent it really exists, reminds me of an interview I once saw with Stephen Morrissey, known to his fans as just "Morrissey" from the Smiths.  In this interview Morrissey talks about how he saw a fan of his in the airport wearing a Cure T-Shirt.  Morrissey gave the fan a talking to and goes on to talk about how much he hates The Cure, which was a shock to me.  It's very rare to find a CD collection with Morrissey that doesn't also include The Cure.  Another analogy might be the South Park / Family Guy feud.  It's disappointing because fans of one are usually fans of the other and it also reflects a certain cluelessness on the part of the belligerents.  If you think you can convince a typical Morrissey fan to pick a side and abandon The Cure, you don't have a clue, you don't understand your fans.  Likewise, every time Hibernate takes a dig at Spring, they're taking a dig at their own users.  We're going to use both together as long as there is a Spring and a Hibernate - and it's time they embraced the idea.  I'd like to see the Venn diagram of Spring and Hibernate contributors.  If they don't touch, that's a problem.  (And maybe they do.  I haven't checked.)

Square Pegs and Round Holes

With all this intrigue and background well established, let's talk about the actual upgrade process.  A key issue for us was the ability to define a RegionFactory (Hibernate's cache abstraction) in Spring and inject it into the Hibernate SessionFactory.  We had no problem doing this in Hibernate 3 using one of Spring's factory beans.

There was no facility for doing this in Spring's SessionFactoryBean for Hibernate 4.  I read that Hibernate 4 permits services to be swapped out using the ServiceRegistry, so I assumed that Spring's Hibernate 4 factory bean was just a little stale.  I subclassed it to support service injection and used the ServiceRegistry as documented to inject our RegionFactory.

Problem is, it didn't work.  We'd get an error during startup warning us to use Hibernate's preferred method of defining a region factory - which is by classname as a configuration property.  You can't inject dependencies into things you define by class name, which is why we don't want to do it that way.

I poked around in the Hibernate source for several hours and found this little gem in a class called CacheImpl, which is the class that Hibernate uses to implement their cache integration.

public CacheImpl(SessionFactoryImplementor sessionFactory) {
    this.sessionFactory = sessionFactory;
    this.settings = sessionFactory.getSettings();
    //todo should get this from service registry
    this.regionFactory = settings.getRegionFactory();
    regionFactory.start( settings, sessionFactory.getProperties() );
    ....
}

So, there it is: proof positive that you can't inject a RegionFactory in Hibernate no matter how hard you try.  The todo comment does suggest, though, that the folks at Hibernate are aware of the issue and will fix it relatively soon.

But we needed a fix now so I spent several more hours trying to find something I could subclass or swap out that would allow us to sneak in our RegionFactory, but through a combination of default scoped interfaces and final classes, it was all for naught. In the end, we had to fork Hibernate and change that one line of code ourselves. As it turned out, it wasn't just one line of code - several different places where RegionFactory is looked up had to be changed. I thought about submitting a patch to Hibernate, but my fix might be simplistic given other ways Hibernate is used, so I figured it was better to wait.

So, for now we have a fork of Hibernate.  We'll switch back to a standard version as soon as they fix the issue.

Sessions

The other major issue we've seen with Hibernate 4 - and in this case I think the issue concerns both Hibernate and some of the code Spring provides to work with Hibernate - is that sessions aren't always readily available.  We no longer have a "get a session and create one if one doesn't exist" method of getting sessions.  We use the HibernateTransactionManager provided by Spring, and the workaround so far has been to make facade or service methods that don't need transactions (things that Spring generates transactional proxies for in our architecture) transactional anyway.  We also use the Spring OpenSessionInView filter, which handles this for all browser initiated requests, leaving the session issue relegated to scheduled tasks, JMS queue consumers and startup code.  Personally, I think the Spring HibernateTransactionManager needs to be tweaked to handle this issue, but that's just my take on it.  I don't know the inner workings of Hibernate or Spring well enough to know if that's really the solution.  It's something we've been able to easily work around, so no biggie.
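
As a sketch, the workaround amounts to this (the service and method names are invented for illustration):

    import org.springframework.stereotype.Service;
    import org.springframework.transaction.annotation.Transactional;

    @Service
    public class NightlyMaintenanceService {

        // This method doesn't strictly need a transaction, but the annotation
        // makes Spring's HibernateTransactionManager bind a session to the
        // thread before any Hibernate call runs - which is the whole point.
        @Transactional
        public void recalculateStatistics() {
            // ...Hibernate work that relies on a current session...
        }
    }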

Our Region Factory

Since we went through so much trouble to inject our own RegionFactory, it's reasonable to ask why we'd bother.  The answer is that we're moving toward a custom distributed cache that's a hybrid of EhCache and Memcached.  Some caches are small and static enough that simply having a local in memory cache (provided via EhCache) makes perfect sense.  In other cases, we need to move the memory footprint of the cache to another server.  We need a lot of flexibility and most of the current cache implementations impose a kind of all or nothing proposition.  We're not even 100% sure how this will work yet.  More to come on that front....
 

Friday, November 30, 2012

A New Flight Plan

Flex is slow, much slower than it should be or could be.  We've known this for a while and we've also known the solution is ramping up the use of a second level cache.  The reason we've held off on implementing a solution is that we planned to move to a distributed cache when we moved Flex to the new high availability architecture.  We let the problem fester for a bit because most of our customers aren't big enough or high volume enough to really notice the problem - and we knew we would be fixing it as part of the high availability redesign.

But for the customers that are impacted by it, the impact is severe, very severe and it got to a point where pride of authorship just wouldn't let the problem go untreated anymore.  So, today, we decided to throw out the flight plan and dedicate a solid two weeks of development time to nothing but performance tuning.

The Crux

Flex uses an object/relational mapping tool called Hibernate.  Hibernate is never fast unless you're doing something very simple or you use the second level cache, which is a cache that stores data in between requests.  Hibernate is widely used and odds are you interact with a system that uses it every day, even if you never log in to Flex.  That would not be true if Hibernate were slow as an absolute.  It's possible to make Hibernate very fast, but the trade off is memory.

We do use the Hibernate Second Level cache, just not as extensively as we could.  The reason for this is that each cloud instance of Flex is currently allocated 1024 MB of heap space.  Without a distributed or remote cache, any data that we add to the second level cache would have to come out of this heap space or we'd have to reallocate the amount of heap space we use per cloud instance, which would increase our hosting costs and the monthly subscription fee along with it.  This is the classic trade off between fast and cheap in action.

The high availability architecture we have planned for the first quarter of next year would change the fundamental deployment architecture of Flex such that cloud instances share a large pool of memory instead of each install having a small 1024 MB sandbox.  Unfortunately, under high availability that memory can't really be used for write sensitive caching because of concurrency.  Our high availability architecture is also intended to provide redundancy such that a single server failure would not take a customer system down.  The new Flex will run a minimum of two servers behind a load balancer with many more servers in use during peak load times.  With a conventional in-memory second level cache, if you updated a piece of information on server A, a message would have to be sent to server B to ensure that the cached information is invalidated or updated - otherwise you could end up with stale data.

Using a distributed cache or remote cache like memcached is an increasingly common way to address this issue and we planned to implement this (along with several other scalability features) in the first quarter of 2013.

Advancing The Timeline

Unfortunately, we've been persuaded this week that we can't really wait that long to get a performance boost.  We don't have time to completely rearchitect Flex and introduce all the aspects of high availability we have planned - we need to stay on the current 1024 MB heap space deployment architecture for the time being.  But we also need to ramp up the second level cache.  In order to do this without blowing through the 1024 MB of heap space, we've decided to introduce a remote or distributed second level cache now, before the next major release of Flex.

Roger's already started some of this work by adding database analysis features to the code.  (The first rule of performance tuning is to measure performance first.)  We have to ensure non-regression (at least for N+1 select problems), so we'll add metric based exceptions to QA environments.  In English, this means that every time Flex interacts with the server, we'll count the number of times the database is hit for each interaction, and if it exceeds a certain number - like 25 for starters - the server will throw an exception.  This will help us identify where to focus our tuning efforts and quickly catch it in QA if a change reintroduces an N+1 select issue.  A rough sketch of the idea follows.
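
One way to implement the guard is with Hibernate's Statistics API (the wiring and threshold here are illustrative, and the counts are only reliable per-interaction in a single-threaded QA run):

    Statistics stats = sessionFactory.getStatistics();
    stats.setStatisticsEnabled(true);

    long before = stats.getQueryExecutionCount() + stats.getEntityLoadCount();

    // ...handle one client/server interaction...

    long hits = (stats.getQueryExecutionCount() + stats.getEntityLoadCount()) - before;
    if (hits > 25) {  // starter threshold; enabled only in QA environments
        throw new IllegalStateException("Too many database hits in one interaction: " + hits);
    }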

We'll add the same validation to our test automation system and ensure that the number of database queries per transaction is below a certain threshold - otherwise the test will fail.

Ensuring Flexibility

The current plan is to use Amazon's ElastiCache service as our second level cache.  Today we'll test this and evaluate whether or not it will be a good fit for us.  We also need to ensure that self-hosted or dedicated instances - where the 1024 MB memory issue isn't a factor - can bypass a remote cache and just use local memory.  And we want to be able to toggle between a remote cache and an in-memory cache using a simple JNDI property injection.

Final Thoughts

The toughest thing for the engineering team here at Flex is choosing how to spend our time.  We have a rapidly growing customer base with a diverse range of opinions on what our development priorities should be.  We have a massive backlog of new features folks would like to see implemented and when we work on arcane system or architectural issues, it always feels like we're stealing time from something more tangible that a customer has specifically asked for.  We estimate that only 10% of our customers are using Flex in a way that makes the performance issues a factor in their day to day operations.  So, by focusing on performance now one can make the argument that we're giving 90% of our customers short shrift by delaying work on Fast Track issues or new features.

But, sometimes you just have to stop and pay the piper.  I think that's where we're at.  I think all our customers will benefit from the performance boost - although some may prefer a slower system to one without multi-session event planning, for example.

I for one think it's smart to stop occasionally and take a hard look at performance, giving yourself enough time to take an uncompromising view of the issue.  I'm hesitant to mention the performance targets we've set for the next few weeks in fear that we might not be able to reach them, but they are aggressive.  The goal is not just to make Flex a little faster, but orders of magnitude faster.  The good news is we know what's slow and the techniques for speeding those things up are well known.  They just take memory - either locally or in a remote cache.



Wednesday, September 26, 2012

Time Zones and Checkstyle

After the release of 4.5 and the first maintenance release (4.5.4), we started on the long delayed and much needed project of upgrading our internal build and management tools.  We upgraded the version of Ubuntu we use for the development server and a whole host of tools we depend on, including Jira, Nexus, Bamboo, Confluence and Subversion.  (The move to Git is scheduled for next year.)

Elastic Build Agents

We also changed the way we do our continuous integration builds.  Previously, the builds would run on the same machine as the continuous integration server (Bamboo).  The downside of this approach is that the builds would drain resources from that machine, slowing it down (which made things slow for us - not customers, who are on different servers).  We've also wanted to start running performance statistics on regression tests in order to catch sudden deviations from previous performance patterns.  This is hard to do accurately in anything but a cleanroom environment - so having a dedicated build server made the most sense.

On the other hand, we felt that running a dedicated build server might be wasteful because, on average, we're only running builds for 3-4 hours per day, which means that the rest of the day the server would sit idle.

Luckily, the new version of Bamboo has been designed to work with EC2 and by defining a server image for running builds, Bamboo will automatically spin up a new build server when builds are in the queue and drop the instance when the build queue is empty.  After some trial and error to find the right build server image, we got this up and running and now all builds are running on dynamically instantiated or elastic build servers.

The Final Four

Once we got the kinks worked out of the new build environment, all modules were building and all tests were passing - except four availability tests.  These tests worked fine on the old servers and were suddenly failing in the new environment.  We felt it was some kind of environment issue and weren't overly concerned about it, but we don't screw around with availability math and so we weren't about to proceed with development until we knew for sure, until all tests were green.

After examining the specific tests, I first suspected that Hibernate wasn't flushing sessions and therefore the conflict query might have been getting stale data.  I tried to deal with this by creating a new AOP aspect that would intercept the test fixture setup and assertion methods for the functional test suite - and force each one to use its own Hibernate session.  This mirrors how the code being tested runs in real life - and we probably should have done it a long time ago.

The new session management aspect worked fine - except that it didn't fix the four failing tests.

Time Zones

That brings us to about 3:30 AM on Tuesday morning.  Just as my head hits the pillow, I start to wonder if maybe running the build servers on Coordinated Universal Time (UTC) might have something to do with it.

The next morning (five hours later), I get up and check all the test code for spots where the Java Calendar object is used without setting a time zone.  I also go back to our localization code and revert a change I had made to fix a failing test during the build server changeover.  Here it becomes obvious that running the build servers on UTC is the issue - or rather, has unmasked the issue.

After working most of yesterday trying to find code where un-localized dates and times might be in play, it became obvious that we're dealing with a large issue and that lots of spots in the code will need to be touched.

In essence, we need to make sure that every instance in the code of...

Calendar cal = Calendar.getInstance();

...becomes...

Calendar cal = Calendar.getInstance(timezone, locale);

Not only that, but we need to make sure this mistake can never get made again.

Checkstyle

It's similar to what we did with number and date formats way back when we purged a great many of the America-centric assumptions about date and number parsing from the code.  To prevent us from reverting to bad habits, we added a custom module to checkstyle that caused a build failure if java's built-in date and number parsing classes were used directly instead of our newly created localization service.  This worked well and has kept peace for some time.

So, this morning I added a new checkstyle module that applies the same philosophy to Java Calendars.  It scans the code for every spot where calendars are instantiated.  If the developer uses the no-argument factory method instead of the one that takes a timezone and locale, the build will fail.  I also added checks for TimeZone.getDefault() because UTC is never a valid timezone in Flex user space.  And I wrapped it up by making sure nobody ever calls setTimeZone() on an instantiated calendar - because that could undo all the good things done by instantiating the calendars with time zones.

I did this by knocking together a quick little module that uses Checkstyle's parser architecture.  The relevant bit is shown here:

    protected void visitMethod(DetailAST token) {
        // Walk the method call AST: invocation target (e.g. "Calendar"), then method name.
        DetailAST methodCall = token.getFirstChild();
        DetailAST invocationTarget = methodCall.getFirstChild();
        DetailAST methodName = invocationTarget.getNextSibling();

        if (invocationTarget.getText().equals("Calendar")) {
            // Flag Calendar.getInstance() calls that don't pass a TimeZone and Locale.
            if (methodName.getText().equals("getInstance")) {
                DetailAST argList = methodCall.getNextSibling();
                if (argList.getChildCount() != 3) {
                    log(token.getLineNo(), "Unsafe Calendar instantiation.  Use Calendar.getInstance(TimeZone, Locale) instead.");
                }
            }
        }
        else if (invocationTarget.getText().equals("TimeZone")) {
            // UTC is never a valid time zone in Flex user space.
            if (methodName.getText().equals("getDefault")) {
                log(token.getLineNo(), "Default Time Zone Instantiation.  Use LocalizationService.getSystemTimeZone() instead.");
            }
        }
        else if (invocationTarget.getText().length() >= 3 && invocationTarget.getText().toLowerCase().endsWith("cal")) {
            // Crude but effective: assume variables ending in "cal" are Calendars.
            if (methodName.getText().equals("setTimeZone")) {
                log(token.getLineNo(), "Time zone override in calendar.  Initialize calendar using desired time zone instead of setting later.");
            }
        }
    }

It's not fancy, but it does the job.

With this new module in place, the build server will flag all the spots in our code that need to be changed and we'll simply iterate over the code until all modules pass the checkstyle test.

The Bug

Switching the build server environment from Mountain Time to UTC revealed a pretty serious availability bug - one that we think impacts calendars (as opposed to quotes and pull sheets) because the time zone glitch may have impaired the system's ability to determine when days start and end - potentially turning your days into British days since all cloud servers also run on UTC.

It's never fun to find a glitch in availability or calendar math, but this is exactly why we do regression testing - to quickly discover any negative ripple effects from code or environment changes.  The hope at this point is that once all the unlocalized references to calendars are cleaned up, that the final four tests will pass.

Going forward the build environment should prevent further regressions and the new checkstyle module should prevent the return of bad localization habits.

Monday, September 17, 2012

The Wonderful World of Auto-Scans

All in all, the release of 4.5 last week went pretty smoothly.  As much QA as we do -- which is not much for software in general, but quite a bit for the rental software industry -- we always worry that our testing will miss a serious show stopping issue and we'll have to slam in an emergency fix for something.

There were some issues, but nothing quite that serious.  Mostly dinky edge case null pointers that we were able to fix quickly.  The most serious problem that arose after the release of 4.5 pertained to what we call autoscans, or the automatic scanning of a container's contents when the container is scanned.

Inside Contents

At a basic level, autoscans are pretty simple.  For example, when you scan an amp rack, it stands to reason that the rack's component amps and any accessories or power cables usually stored inside the rack also get scanned.  Flex has supported that model of autoscans for a long time.

Where things fell short were situations where the contents of a container change during the prep process.  The most obvious example is free pick containers, though anything can be tweaked or changed during the prep process.

We added the ability to take this into account when autoscanning.  This would mean that if you had a free pick container out on a show, that when the show came back, scanning the container would effectively scan all of its contents back in.

The only problem here is that contents configured for an inventory item and contents specified as child line items are not really the same thing.  We wrote code to analyze the data and try to guess whether the user wants the configured contents or child line items to be scanned, or both.

Shortly after the release of Flex 4.5 we discovered that this code often guesses wrong and effectively scanned the same item twice - once for the contents and another time for the child line items.  When the parent item is a serialized item, you end up with a separate scan operation for each serialized unit, which could exacerbate the problem.

No More Guesswork

We looked at this issue of autoscan processing and decided that it just wasn't possible to make educated guesses about configured contents versus child line items that could fit every situation.

So, we chucked the old yes/no autoscan flag some of you may have seen in the equipment list element configuration screens and replaced it with a set of three Autoscan Modes: All Contents, Permanent Contents, and Child Line Items.

This means that you can configure how you want the autoscan process to work for each warehouse mode.  By default, any situation where autoscans were enabled will be set to All Contents, which mirrors the system's behavior prior to 4.5.  It is recommended that anyone who wants line item autoscans change the autoscan mode for manifest returns to Child Line Items.  If you use a two stage check out process (with a prep and ship scan), you might also consider setting the autoscan mode for ship scans to Child Line Items as well.

One Last Guess

The only place in the system where some measure of autoscan guess work is still in play relates to when gear is flowed from one show to another.  When you flow gear, the system will default to Child Line Items mode if you have any autoscan mode configured.  If you have an autoscan mode of None, there will be no autoscans, even if you are flowing gear from show to show.

This fix, along with a number of fixes to clear some of the annoying error messages some of you may have been seeing, is in QA now and will deploy as part of Flex 4.5.3.  You won't have to wait months for this one.  This version will regression test today and should be deployed tomorrow or Wednesday night depending on how regression testing goes.

Tuesday, September 11, 2012

With No Further Ado, Here's Flex 4.5

Almost four months in the making, Flex 4.5 has just been cleared by QA for general release and we've scheduled a full push of the new version just after 12:00 AM Pacific Time on Wednesday, September 12th. 

That being said, the release is done and ready.  Some beta testers got the release last night and anyone who'd like to get it a few hours early should feel free to contact us at support@flexrentalsolutions.com.

The Big Picture

Flex 4.5 includes a large number of new features, enhancements and bug fixes as detailed in the release notes here.  This release began as a simple release designed to add tiered pricing rules in order to support the complex labor calculations common in our industry.  This aspect of the release was completed fairly early on in the process.  What derailed the schedule - and the usual frequency of our releases - was a redesign of the calendar system.

We had a customer fast track issue asking if certain columns could be added to the daybook screen.  We could have done it as a one off hack for this customer and the specific column they wanted.  Instead, we decided to redesign the system such that any field could be added to the Daybook.  In general, we felt the technical architecture of the calendar was outdated and inflexible, so we pulled it apart and put it back together the right way.  In addition, the old filter tree Shoptick users may remember had been ripped out when the Ajax calendar was moved to Flash.  We redesigned it and brought it back.  Chris is preparing a video tutorial on the new calendar system with more detail.  I'll update this post when it's ready.  It's also been discussed at length in the support forums here.

At a high level the new calendar system isn't wildly different from the old system - it merely adds more personalization and flexibility.  We take an old Flex concept of Calendar Templates and better integrate it with default calendars.  You can now configure any number of different calendar templates, determine whether or not the list (daybook), traditional calendar, or Gantt views are considered relevant for a particular template.  You can use the filter tree to determine what types of elements and statuses are shown on a calendar - and you can also determine what fields are shown in the list or daybook view.  You can even change the name of the daybook view to something else if you have different lingo.

Other Stuff

A significant amount of time was spent in the last month or so fine tuning the scan and availability process, particularly for non-serialized items and subrentals.  Any fixes related to pricing math and availability usually take a little longer because we build regression tests for these kinds of fixes (to make sure what gets fixed stays fixed).

You can also now cross scan items from one list to another without manually scanning each item in and back out.  This should cut scan time in half in certain warehouses.  This is the first stage in a series of improvements planned to support alternate warehouse workflows.  We'll soon be introducing something to the scanning process called Concurrency Mode.  Right now there's only one concurrency mode: Real Time.  In the coming months we'll be introducing two others for fast paced warehouse environments.

One of my favorite enhancements in this release involves reworking the way availability is calculated for suggestions.  Some customers with lots of suggestions were reporting long wait times as availability was calculated.  We redesigned this to be a lazy loading process on a per suggestion type basis.  We also added the availability meatball to the suggestion dialog so the way availability is displayed remains consistent.

Those of you who frequently interact with administrative screens may notice that many of the admin console menu options have been moved into the workbench.  This is part of an ongoing process to phase out Struts, as Roger Diller - the developer heading up this effort - noted in his blog post dated August 31.  Struts is an older MVC framework introduced with Shoptick E.  As we get ready to move toward a REST/JSON oriented architecture for our mobile back end, we felt it was wise to start clearing out the cobwebs so we don't need to run two MVC frameworks side by side.  For the curious, we're probably going to run the REST back end we'll use to support iPhone/iPad devices on Spring-MVC.

A little Easter Egg that's come out of this refactoring relates to how the Performance Monitor was moved from HTML to Flash.  We introduced the first glimpse of our new dashboard architecture.  To access it, put the workbench in debug mode by adding ?debug=true to the URL and go to Flex > Performance Monitor.  This was a nice bit of initiative on Roger's part and I think it turned out great.  Can't wait to see how it looks when we bring back the dashboard.  (Remember Flex/Shoptick had the industry's first dashboard way back in 2007.)

Coming Up Next

The calendar redesign put us in the weeds and a large number of Fast Track projects have stacked up, so the next few releases will be dedicated to custom development for Fast Track customers, much of which consists of small tweaks and enhancements.  We'll also be monitoring customer logs for error reports, and I would anticipate fairly frequent maintenance releases over the coming weeks.  It's been several months without a release, which I want to emphasize is not really how we prefer to do things.  We prefer small, frequent, incremental releases and until the next big redesign project, we're planning on getting back to that way of doing business.

If the upcoming part of the development roadmap has any overall themes, they would include adding scheduling and planning tools to the labor/crew list section, adding scan concurrency modes for fast paced warehouse environments, and multi-session event planning.  We're also planning a big upgrade of our internal development tools and a redesign of the security architecture we use to administer and support customer systems.

I'm also sneaking in support for lighting paperwork.  Since I worked myself into a desk job a decade ago supporting people who do what I used to do, I've missed being out in the field and I volunteer with a local community theatre as a lighting designer a few times a year to stay current.  I really don't want to buy Lightwright just for two shows a year, so with no offense intended to John McKernon, I'll be introducing configuration options that will enable lighting paperwork to be incorporated into pull sheets - with custom reports for Instrument Schedules, Channel Hookups, etc.  Our customers may find this useful, but it's really just for me at this point.  One of the perks of being the software developer is that you can occasionally slip something in for yourself.

In Conclusion

I'm proud of what we've accomplished in this release.  Roger, Suman and Courtney have worked incredibly hard to get it ready and tested.  The calendar redesign did blow our schedule and made it tough to set expectations about when the release would be ready.  For those who wondered what was taking so long - it was the calendar and subrental testing.  In the future we'll introduce code branching for major rework so that critical bug fixes aren't delayed by unrelated work, and we're also introducing some process changes to ensure we get back to the tight, frequent release schedules our customers had come to expect prior to 4.5.

Thanks to our growing customer base for their support thus far and their patience.  If any problems arise tomorrow morning, please let us know.  If you'd prefer to evaluate the release prior to deployment, please contact support and we'll take you off the automatic push list and arrange a private beta site.

Friday, August 31, 2012

Phasing Out Struts

While Jeff & Courtney have been focused on getting the upcoming 4.5 release ready for production, Suman & I have been making a push to get all the existing Struts stuff (i.e. the Admin Console) moved over to Flash. This large task has been pushed off for a while because there have always been more important things to work on and this stuff has always been accessible from the Admin Console. However, we are trying to get ourselves positioned for mobile development, and Step 1 towards that goal is to remove old frameworks in preparation for a new web framework. This migration work is also low risk while we're trying to get a release out, since it doesn't affect the crucial areas of the system.

As part of this migration work, we have introduced an exciting new Dashboard component from Flexicious. I expect we will have a bigger use for this component in the future, but for now it is being used for the Performance Monitor, which is viewable in debug mode at Flex > Performance Monitor.

As of this writing we have everything migrated in terms of creatable, updatable, & deletable stuff. This includes all system settings, workflow, etc. There are some other Struts things remaining, such as report generation and the status page, but for all intents & purposes the admin console is no longer needed. We view this as a major step forward for the system, since a big part of moving forward is getting rid of the old.



Wednesday, August 15, 2012

The Meatball

We've been working a lot lately on something we refer to internally as "the meatball".  Some of you may have heard us use this expression in presentations or on support calls.  Others may not have heard this lingo before.  The meatball, as we define it, refers to the little square you see in quotes or pull sheets (and now the suggestion dialog) that turns red or green depending on whether or not you have a shortage.

Here's a little screen shot showing our meatball in action as a reminder.


Now, it might be logical at this point to ask why we'd call this a meatball - seeing as there's nothing round or meaty about it.  As fate would have it, I was flirting with general aviation at the time I developed the first version of the meatball.  I'd taken a few flying lessons (I've since given it up) and learned about something called the Precision Approach Path Indicator, or PAPI.

If you've ever flown and noticed those four little red lights off to the right of the runway, that's the PAPI.  It's a visual glide slope indicator that lets the pilot know if he's coming in too low or too high.  Now, it turns out the Navy has a souped up version of the PAPI on every aircraft carrier, with extra information about horizontal position and some visual ways of communicating with the pilot under radio silence.  They call this the Optical Landing System, or the "meatball" for short.  We borrowed this expression from the Navy and call our red/green indicator the meatball as well.

Qualitative Availability

The purpose of the meatball is to go beyond providing numeric information about equipment shortages and also provide some clues about what the numbers mean.  The work we've done lately on the meatball involved revising the contexts in which the meatball should be red or green.  For example, when you're on a quote or a pull sheet, the meatball is green if the availability is zero.  This is because the item you've dropped on the quote or pull sheet has already been taken out of availability.  Zero indicates that you have just enough to do the job.  Though you may have used all available inventory, you haven't created a shortage just yet, so availability is green.

This created a problem when we decided to add the meatball to the suggestion dialog.  Suggestions appear in a different context.  Suggestions are a list of items that you may or may not elect to add to a quote.  You haven't done it yet.  In this context, selecting an item with zero availability will create a shortage where none existed before, so showing zeros as green in the suggestion dialog would be misleading.  You'd select an item, thinking you could add it because of the green meatball, and as soon as you added it, the item would appear on the quote with a red meatball and a quantity of -1.

To deal with this semantic problem, we modified the meatball to switch between using less than zero or less than one to determine when a shortage exists.  It all depends on context.
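
For the code-curious, the rule boils down to a context-dependent threshold.  A minimal sketch of the idea, with invented names standing in for our actual classes:

        public enum AvailabilityContext { QUOTE_OR_PULL_SHEET, SUGGESTION }

        //hypothetical sketch - on a quote, zero means "just enough" and stays
        //green; in the suggestion dialog, zero means adding one more item would
        //create a shortage, so the threshold shifts from zero to one
        public static boolean isShortage(double availability, AvailabilityContext context) {
            double threshold = (context == AvailabilityContext.SUGGESTION) ? 1 : 0;
            return availability < threshold;
        }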

Another issue with the meatball involved packages and kits offered as suggestions.  When you add kits and packages to quotes, their contents go on the quote with them, and you can see at a glance when a shortage of one of the components creates an availability problem for the kit as a whole.  You can swap a component out for a suitable replacement or remove it entirely.  In the suggestion dialog, you don't have that same visibility into the availability status of each package component.

When calculating availability for the suggestion dialog, the server inspects the contents and uses the item with the lowest availability (in conjunction with the quantities specified for the kit) and returns the kit's availability accordingly.  In essence, when you see the availability for a package, you're seeing the availability of the scarcest resource in the package.  And this is all well and good when the availability is zero or greater - it indicates how many of those packages you can make with available inventory.  But what if the availability is less than zero?

Let's assume you have a package with a DMX cable component and that you have -5 DMX cables available.  Let's further assume that every other component of that package is available and plentiful.  Barring any special tweaks to the meatball, the availability of the package would show up as -5 with a red meatball.  The red meatball is good.  We want it to be red, but in the context of package availability, what does the -5 really mean?  Since packages are virtual - they aren't finite items - what does it mean to have less than zero of something that doesn't really exist in the first place?  Our thought is that interpreting what this -5 number means would create more confusion, and that if there isn't enough inventory available to make a kit, just turning the meatball red to indicate a shortage is enough.  So, with this modification in place, a positive availability will be shown with a green indicator along with the actual number - since showing you how many kits you can make makes sense semantically.  But there's not much sense in showing you how many kits you can't make, so in this case we just turn the meatball red with no quantity - since that quantity would confuse more than it would illuminate.
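
If you're curious how that plays out in code, here's a rough sketch of the server side calculation and the display rule - the method and parameter names are invented for the example:

        //availability[i] is the stock level of component i and qtyPerKit[i] is
        //how many of component i a single kit consumes
        static Double kitAvailability(double[] availability, int[] qtyPerKit) {
            double worst = Double.MAX_VALUE;
            for (int i = 0; i < availability.length; i++) {
                //the scarcest component limits how many complete kits we can build
                worst = Math.min(worst, Math.floor(availability[i] / qtyPerKit[i]));
            }
            //a negative count is meaningless for a virtual item, so return no
            //quantity at all and let the red meatball alone signal the shortage
            return (worst < 0) ? null : worst;
        }

In the DMX cable example, the -5 cables would win as the scarcest resource, the method would return no quantity, and the suggestion dialog would simply show a red meatball.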

So, there you have it: a little inside baseball on our weird in-house lingo and some insight into how we try to make the meatball not just display availability numbers, but explain them - analyzing them in a way that saves you the hassle and potential errors of interpreting the numbers yourself.

Sunday, July 1, 2012

Storm Cloud

This weekend a major storm hit the Washington, DC area and knocked out power for over a million people.  It also knocked out power to Amazon's Ashburn, Virginia data center - the us-east-1d availability zone.  Since Flex runs on the Amazon cloud, I thought it might be a good idea to explain how the Amazon outage impacted Flex.

Luckily, Flex fared much better in the outage than better known companies like Netflix and Instagram - both of which sustained major service interruptions.  Only one of our customers experienced any downtime during the outage, and this was for less than ten minutes.  This customer was on a dedicated server and is one of only two customers whose systems are located on the east coast.  We have no support tickets on file for the outage, likely because it occurred on Friday night, after normal working hours.

All other customers are hosted in Amazon's Dublin, Ireland and Oregon data centers: Dublin for all European and African customers and Oregon for everyone else.  As a result, none of those customers were impacted by the weather event in Virginia.



So it looks like we got off easy this time.  When you see future reports about cloud outages, there's unlikely to be any impact on Flex or your systems unless the outage is in Oregon or Dublin.  We'll also be moving Pacific Rim customers to Amazon's new data center in Sydney, Australia once construction of that facility is complete.

Common Sense

A lot of media reports about the outage are speculating about whether this outage and other recent outages like it might spell the end of the cloud, as if they exposed some fatal flaw in the technology.  As with anything in technology, there is no such thing as a magic bullet.  Without a fault tolerant software architecture that properly leverages cloud technology, there's no practical difference between cloud hosting and traditional server co-location.  The cloud is just a room full of servers, like any other data center, and subject to the same risks.

Having a truly fault-tolerant architecture means building in some measure of geographic redundancy, such that a total failure in one physical location will not bring the whole system down.  At Flex, we're working on a new high-availability architecture intended to make Flex more fault-tolerant, even with respect to catastrophic events like the one that happened this weekend.  For the time being, we are still subject to single-point-of-failure issues, especially if a whole Amazon availability zone goes down. We could use load balancers to split load between multiple availability zones now and make our system resistant to single availability zone outages, but not without adding significant cost.

We do have tools and procedures that can facilitate rapid recovery, but there would still be downtime for some customers if we lost a whole availability zone in Oregon, for example.  Over the coming months, you may hear some talk about Flex Version 5.0, or Flex Alto as we've started calling it.  This project is aimed at changing the way Flex uses computing resources like CPU and memory in order to make Flex faster, more reliable, and more fault tolerant without increasing our prices to offset extra hardware costs.

So, in short, this particular outage didn't really impact us, but the next one could, and the challenge for us is to make our technology resilient enough to withstand a major outage without raising prices.  In the meantime, if anyone wants to move to a fault tolerant architecture before Flex Alto is released in 2013, contact Chris for pricing on a dedicated load balanced cluster.


Monday, June 18, 2012

Time Slicing Organized Labor

Flex has supported multiple pricing models for some time, along with the ability to make one pricing model a multiple of another.  But as we started taking a second look at Flex's labor functionality, we realized we'd need to introduce more flexible ways of calculating pricing without requiring customers to use the overly technical embedded rules engine.

We needed something with the flexibility and power of a full blown rules engine, but simple enough that non-technical users could reasonably understand and maintain the pricing logic.  We came up with something we call Tiered Pricing - where the pricing logic can be broken into any number of independent tiers.  Each tier has rules that determine when it will be in force for a given calculation and what values will be used in the pricing calculation.

The screen shot below shows a sample pricing tier:


On the left side are the rules that govern when this tier is matched or selected; on the right are the rules for how the tier's charge is calculated.

This example shows a tier that, as part of a set of labor pricing tiers, calculates holiday overtime.

The matching criteria determine what counts as holiday overtime, and the formula criteria determine what the pricing engine will do once it determines the matching criteria have been met.  If you look at the Multiplier on the right hand side, the formula criteria tell us that holiday overtime will be charged at double time, and the Time Quantity Offset tells the system not to count the first eight hours.

(What happens with the first 8 hours?  That's a different tier: Holiday Straight Time.)
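
In other words, once a tier matches, the formula side reduces to simple arithmetic.  Here's a rough sketch of the idea, with invented parameter names standing in for the real pricing engine:

        //rough sketch - the parameter names are invented, not the actual engine
        static double tierCharge(double baseHourlyRate, double hoursWorked,
                                 double multiplier, double timeQuantityOffset) {
            //a negative offset skips that many leading hours, e.g. -8 means this
            //tier only bills the time beyond the first eight hours
            double billableHours = Math.max(0, hoursWorked + timeQuantityOffset);
            return baseHourlyRate * multiplier * billableHours;
        }

For a ten hour holiday call at a $50 base rate, tierCharge(50, 10, 2.0, -8) would bill the two overtime hours at double time: $200.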

Less Abstract, Please

It might be easier to understand pricing tiers if we consider a few examples of how they might be used.  Since the impetus behind designing pricing tiers was handling the complex labor pricing rules we often see in the entertainment and production industry, let's start with those.

For example, it's common for local labor unions like IATSE to have rules about minimum hours, penalties for early or late call times, and higher rates for working on holidays or weekends.

Suppose a local stagehands' union has a four hour minimum call time and charges time and a half for working before 6 AM.  Then assume you need stagehands for a load in scheduled from 5 AM to 8 AM.  This is only three hours, so it should trigger the minimum requirement, and the call starts before 6 AM, so that first hour should be charged at time and a half.  For this example to work, you'd need two tiers: one matching the after hours times and one matching normal working hours, as sketched below.  On the calculation side, both tiers would probably need a four hour minimum.
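
Here's a hypothetical way those two tiers might look if you wrote them out in code - the Tier constructor is invented for the example, and we'll assume straight time runs from 6 AM to 6 PM:

        //invented constructor: name, effective start time, effective end time,
        //multiplier, minimum hours
        Tier afterHours = new Tier("After Hours",   new LocalTime(0, 0), new LocalTime(6, 0),  1.5, 4);
        Tier straight   = new Tier("Straight Time", new LocalTime(6, 0), new LocalTime(18, 0), 1.0, 4);
        //a 5 AM - 8 AM call slices into 1 hour @ 1.5x and 2 hours @ 1x; the four
        //hour minimum then pads the final tier with one extra straight time hour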

Using our tiered system, when you add this call to a quote, you'd see the following:


In this example, you can see that the first hour is billed at time and a half and the next two hours are billed at straight time.  The system also detected the four hour minimum and added an extra hour to the last tier.

Time Slicing

As it turned out, getting this simple example to behave as expected was quite a challenge.  We designed our pricing tier system around matching the call time information to a pricing tier.  Time based matching in the default mode is based on the start time of the call.  Without some extra elbow grease, the previous example would have matched the after hours tier - since the call starts within that tier's matching criteria - and not the straight time tier.  This would have collapsed our separate After Hours and Normal tiers into just one instance of the After Hours tier, with all four hours placed in that tier and billed at that rate.  Some union rules are based on when the call starts, and this would have been the appropriate way to handle them.  Most situations, however, require the system to transition to a different set of pricing tiers once the call shifts into normal working hours.  To implement this functionality, we created an alternate mode of matching pricing tiers called time slicing.

The diagram above shows an example of a very long call, typical (and perhaps a little optimistic) for stagehands working a tour's load-in through load-out - from when the riggers show up early in the morning to when the show is back on the trucks.  To properly price the labor for a long call like this, we're looking at four different pricing tiers: after hours for the early call, straight time, normal overtime, and after hours overtime.

To make this work, our time slicing code has to figure out what times of day trigger changes in the pricing rules, divide the call up into slices corresponding with those times, and run the matching logic separately for each slice.  It has to somehow keep track of the cumulative time while doing all this so that overtime rules get triggered correctly, even if the hours that contribute to triggering overtime come from different time slices.

The algorithm we ended up implementing looks like this:

        if (model.getTimeSlicing()) {
            boolean lastTimeSlice = false;
            float unprocessedTime = params.getTimeQuantity();
            float processedTime = 0f;
            PricingCalculationParameters batchParams = (PricingCalculationParameters)params.clone(); //clone so the caller's params are untouched
            int reliefValve = 0; //safety valve to prevent a runaway loop
            Map<String, PricingTierCalculationResult> resultMap = new LinkedHashMap<String, PricingTierCalculationResult>();
            while ( (unprocessedTime > 0) && (++reliefValve < 1000) ) {
               
                //first matching attempt is really just to help determine the time slice
                //once the slice duration is established, we rematch
               
                Collection<PricingTierCalculationResult> calcResults = matchTiers(tiers, null, batchParams);
                DateTime startTime = new DateTime(batchParams.getStartDate());
                DateTime endTime = new DateTime(batchParams.getEndDate());
                DateTime nextEndTime = null;
                DateTime candEndTime = null;
                for (PricingTierCalculationResult cand : calcResults) {
                    if (!cand.isMatched()) {
                        continue;
                    }
                    if (cand.getTier().getEndEffectiveTime() != null) {
                        LocalTime time = new LocalTime(cand.getTier().getEndEffectiveTime());
                        candEndTime = new DateTime(startTime.getYear(), startTime.getMonthOfYear(), startTime.getDayOfMonth(), time.getHourOfDay(), time.getMinuteOfHour());
                        if (candEndTime.isAfter(startTime)) {
                            cand.setAdjustedTimeQuantity(dateDiffInHours(startTime, candEndTime));
                            if ( (nextEndTime == null) || candEndTime.isBefore(nextEndTime)) {
                                nextEndTime = candEndTime;
                            }
                        }
                    }
                }
               
                if (nextEndTime == null) {
                    //set the next end time to midnight - this allows us to check for holidays, etc
                    nextEndTime = new DateTime(startTime.getYear(), startTime.getMonthOfYear(), startTime.getDayOfMonth(), 0, 0);
                    nextEndTime = nextEndTime.plusDays(1);
                }
               
                //this means we're on the last time slice and we're done
               
                if (nextEndTime.isAfter(endTime)) {
                    nextEndTime = endTime;
                    lastTimeSlice = true;
                }
               
                double sliceDuration = dateDiffInHours(startTime, nextEndTime);
               
                processedTime += sliceDuration;
                unprocessedTime -= sliceDuration;
               
                //rematch with the cumulative processed time so tiers that depend
                //on total elapsed hours (like overtime) trigger correctly
                batchParams.setTimeQuantity(new Float(processedTime));
                calcResults = matchTiers(tiers, null, batchParams);
                           
                if (unprocessedTime <= 0) {
                    lastTimeSlice = true;
                }
               
                //adjust time quantities for slice duration
                PricingTierCalculationResult existingResult = null;
                for (PricingTierCalculationResult cand : calcResults) {
                    cand.setAdjustedTimeQuantity(sliceDuration);
                    cand.setTimeQuantity(sliceDuration);
                    cand.setIgnoreTimeMinimums(true);
                    existingResult = resultMap.get(cand.getTier().getObjectIdentifier());
                    if (existingResult != null) {
                        cand.setTimeQuantity(cand.getTimeQuantity() + existingResult.getTimeQuantity());
                        cand.setAdjustedTimeQuantity(cand.getTimeQuantity());
                    }
                                       
                    double previousProcessedTime = processedTime - sliceDuration;
                    //skip tiers whose maximum time quantity was already exhausted by earlier slices
                    if ( (cand.getTier().getMaxTimeQty() != null) && (cand.getTier().getMaxTimeQty() > 0) && (cand.getTier().getMaxTimeQty() < previousProcessedTime) ) {
                        continue;
                    }
                    //once earlier slices have consumed the offset hours, stop applying the offset
                    if ( (cand.getTier().getTimeQuantityOffset() < 0) && (Math.abs(cand.getTier().getTimeQuantityOffset()) < previousProcessedTime)) {
                        cand.setIgnoreTimeOffset(true);
                    }
                    resultMap.put(cand.getTier().getObjectIdentifier(), cand);
                    //minimums are only enforced on the last matching time slice
                    if (lastTimeSlice && cand.getTier().getMinTimeQty() != null) {
                        if (  processedTime < cand.getTier().getMinTimeQty()) {
                            cand.setTimeQuantity(cand.getTier().getMinTimeQty().doubleValue() - processedTime + sliceDuration);
                            cand.setAdjustedTimeQuantity(cand.getTier().getMinTimeQty().doubleValue() - processedTime + sliceDuration);
                        }
                    }
                   
                }
               
               
                if (lastTimeSlice) {
                    break;
                }
               
               
               
                //advance the parameter window to the start of the next slice
                batchParams = (PricingCalculationParameters)batchParams.clone();
                batchParams.setStartDate(nextEndTime.toDate());
                batchParams.setTimeQuantity(unprocessedTime);
            }
           
           
            results = resultMap.values();
           
        }
        else {
            matchTiers(tiers, results, params);
        }
In a nutshell, this code starts by determining the total length of the call, initializes a variable called unprocessedTime with that value, and then enters a while loop that keeps executing until unprocessedTime reaches zero (the reliefValve counter is just a guard against infinite loops).

In each loop iteration the matcher is executed twice: first to determine the length of the current time slice, and a second time with the calculated slice duration as an input value.

The length of the time slice is determined by selecting the earliest of any effective end times configured for the matched tiers.  A hard break is also placed at midnight so the time slicing logic can detect transitions between holidays or weekends.  The next end time is treated as the end of the current slice and compared with the start time to determine the slice's duration; that duration is then added to the cumulative processed time and fed back into the matching algorithm.  This weeds out situations where overtime tiers may have matched for the total call duration, but not for a shorter time slice.

Once the results come back from the second matcher invocation, some logic is used to bypass standard minimum calculations (you don't want each time slice triggering minimums; the call has to be considered as a whole).  You can see a branch where minimums are enforced only when the loop is on the last time slice.  We also consolidate tiers that are matched in multiple time slices to prevent confusing clutter in the finished quote.

In Summary

So there you have it: a sneak preview of our new labor pricing system and a look at one of the more interesting bits of code required to make everything work.  We still have some polishing to do, but the new labor system is essentially finished and drops to QA this week.  The next major release of Flex 4 will include this new functionality.  After that, we'll veer off into high speed asynchronous scanning modes, and then it's back to Phase 2 of labor, where the emphasis will shift from financials to scheduling and conflict resolution.