Wednesday, December 26, 2012

Everything You Always Wanted To Know About Hibernate Performance Tuning But Were Afraid To Ask

One of the things I've learned over the years developing software is that there's no such thing as a black box.  If you pick a framework or library off the shelf and just hope it works forever, that you'll never have to learn much about it, you're due for a day of reckoning.  The more complicated the problem the library solves, the sooner that day of reckoning will come.

In my opinion, there are few technologies more prone to black box syndrome than Hibernate.  If you browse forum posts and discussion threads on Hibernate, most of the posts are from people who just refuse to look inside the black box.  It's possible to get up and running on Hibernate without learning much - that is true.  But using it in the real world, at web scale (whatever that means), requires a much deeper understanding of it - how it fetches data, how it caches data, etc.  Otherwise, your software will be dirt slow.

We've tinkered with Hibernate performance over the years, but always shied away from blocking off a few weeks to bear down and really dial it in.  Over the last few weeks, we took one of our high volume customer systems as a test case, worked through six key use cases, and through a fairly agonizing and time consuming process were able to achieve 20X-30X performance boosts.  We were hoping for 100X, and there are still some things we can do (unrelated to Hibernate) that could possibly get us to the next order of magnitude, but we'll have to defer those for another day.

On XML And Religion

One area unrelated to performance tuning, but nonetheless a big philosophical debate within the community of Hibernate users, is whether to define the object relational mappings via JPA annotations or an external XML file.  The cool kids these days seem to prefer Annotations. Come to think of it, the cool kids seem to hate XML for just about anything these days.  And of course, anyone, especially someone who works in technology, who has absolute love or hatred for any technology, isn't practicing science or engineering - they're practicing religion.  This is a well known tendency of developers, which is why you'll often hear technology subjects referred to as "religious issues". 

The alternative to being a religious developer is being someone who evaluates technologies in a less satisfying, but more practical way.  XML's a good example.  You hear a lot of developers these days slam XML outright, preferring JSON or plain text files.  My take is that XML is bad for data, but good for configuration.  XML is self documenting which makes it easy to configure servers or object/relational mappings via XML, but it's not the sort of thing I'd want to send over the wire when speed or bandwidth are factors.  In those situations, JSON or a bitpacked format like AMF is more appropriate.  Right tool for the job.


I think this religious hatred of XML, even in situations like configuration where size isn't a factor, has driven a lot of young developers to go the annotations route with Hibernate.  One of the problems with religious zeal is it tends to blind people to all other considerations.  For example, there are all kinds of reasons why it's bad to co-mingle a system's domain model with the persistence mechanism, but all these reasons get smothered by XML hatred.

For example, suppose you believe, as I do, that a domain object or entity shouldn't know about the table used to persist it.  Why do I believe this?  It doesn't involve a bearded prophet.  It's because you might change your mind one day about the persistence mechanism or a customer might force a change on you.  All the persistence code needs to be contained in a single layer that can be swapped out if need be.  The annotation camp will usually say "nobody will ever really change the database" or "JPA is the persistence abstraction".  Maybe.  But you might change the ORM.  You might upgrade it (as we just did.)  You might decide to switch from HQL queries to criteria objects.  It's nice to know that all the code you'd need to change might be in a set of related DAO classes instead of scattered all over the application.  An even more practical example is that caching and indexes might be slightly different for different deployments of the application.  Being able to swap in a different set of Hibernate mapping files for a self-hosted deployment versus a cloud hosted deployment is of real value to us now, not in the abstract.  You can mix annotations with XML files, but I tend to think this creates confusion.

So we prefer XML configuration over Hibernate/JPA annotations at Flex because it decouples the persistence details from the domain.  I will concede, however, that if I were working on an internal application at a corporate IT shop where only one instance of the application will ever be run, the prospect of ever needing a plug-and-play persistence layer is pretty remote.  In that case, putting all the mapping configuration in annotations might make the code easier to understand.  I actually used this approach at the company I worked at prior to Flex.  Like most technology choices, XML vs Annotations is a choice that has to be informed by rational considerations specific to the environment and the project.  "I hate XML" is not a rational consideration (though you're still free to hate it.)

How Hibernate Works

With the philosophical debate out of the way, we can move on to the technical issues of making Hibernate fast.  We'll begin with a brief description of what Hibernate is and how it works - which should come as a great relief to those readers who've muddled through all the jargon to get this far in the post.

Most Java software (and most modern software) is based on an Object Oriented design model.  This means we represent "nouns" or data as objects, or more formally "classes".  For example, in Flex we have classes like ScanRecord, InventoryItem, SerialNumber and so on.  There are actually hundreds of different classes in Flex.  We do this because it's much easier and more logical to work with objects than it is to work directly with database queries and recordsets.  Contrary to what a lot of Database Administrators might think, we developers do this because it's mundane and repetitive to work directly with databases, not because we're afraid of SQL.

Hibernate's job is to translate these domain objects/classes into SQL for us, and in so doing remove much of the drudgery of database interaction so we can focus on higher order thinking skills like business logic and user interfaces.  To make this work, we configure Hibernate (via Annotations or XML files) to know which tables go with which classes.  Then we can use Hibernate provided classes to save objects, delete them and query them.  Hibernate generates the SQL and takes care of the details.
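
As a rough sketch of what that looks like in practice (the ShippingMethod class and its setters here are stand-ins based on the mapping shown later in this post, and the SessionFactory is assumed to already be built):

    // Minimal sketch: persisting and querying a mapped class through the Hibernate Session.
    // Hibernate generates the INSERT and SELECT for us.
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    try {
        ShippingMethod method = new ShippingMethod();
        method.setName("Overnight");
        method.setCode("ON");
        session.save(method);                  // INSERT into st_biz_shipping_method

        List<?> fast = session
            .createQuery("from ShippingMethod where minimumDays <= :days")
            .setParameter("days", 1)
            .list();                           // SELECT generated by Hibernate
        tx.commit();
    } finally {
        session.close();
    }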

To make this fast, Hibernate provides several different caching mechanisms.  There's the query cache, which enables Hibernate to intelligently decide when a query doesn't need to be rerun.  There's a session cache, which caches objects for a brief time in the Hibernate session (typically a session has the lifespan of a single HTTP request), and a second level cache, which can cache objects between sessions.
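
For context, all of this is switched on with a handful of standard Hibernate settings.  A minimal sketch, shown programmatically and assuming Hibernate 4.x with the EHCache integration on the classpath:

    // Sketch: enabling the second level cache and the query cache.
    // The region factory class name is the EHCache one for Hibernate 4.x;
    // it differs in older and newer versions.
    Configuration cfg = new Configuration().configure();
    cfg.setProperty("hibernate.cache.use_second_level_cache", "true");
    cfg.setProperty("hibernate.cache.use_query_cache", "true");
    cfg.setProperty("hibernate.cache.region.factory_class",
            "org.hibernate.cache.ehcache.EhCacheRegionFactory");
    SessionFactory sessionFactory = cfg.buildSessionFactory();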

If you examine a commonly used method on the Hibernate session like byId(), which retrieves an object instance by its identifier or primary key, you'll see that Hibernate checks for an instance of the object in the session cache and the second level cache before resorting to running a database query.

Let's assume that we all have a working knowledge of Hibernate now and dive into some optimization tips we uncovered over the last few weeks.

The Session Cache

You get the session cache for free.  There's nothing to configure or turn on, and it's just a hash table holding real references to the persistent objects (as opposed to the second level cache, which caches a serialized representation of the object.)

In essence, if a session attempts to retrieve an object twice, it will only result in one database hit.

The session cache is faster than the second level cache because it doesn't have to deserialize or "hydrate" objects.
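
A quick illustration of what that means (id here is the object's UUID primary key; ShippingMethod is just a stand-in):

    // Sketch: two lookups of the same id in one session cause a single SELECT.
    // The second call is answered from the session cache with the same instance.
    ShippingMethod first = (ShippingMethod) session.byId(ShippingMethod.class).load(id);
    ShippingMethod second = (ShippingMethod) session.byId(ShippingMethod.class).load(id);
    // first == second: one database hit, one in-memory object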

The Second Level Cache

The second level cache is used to cache objects with a lifespan longer than the session.  Any meaningful performance tuning will usually involve extensive use of the second level cache.  In order to use it, you have to configure an external cache provider like EHCache or SwarmCache (we use EHCache) that plugs into Hibernate and handles all the details of sizing, eviction, disk overflow, etc.

You also have to tell Hibernate which classes to cache.  In our XML based approach, that requires adding a cache tag to each class we want to cache, like this:

    <class name="ShippingMethod" table="st_biz_shipping_method">
        <cache usage="nonstrict-read-write"/>
        <id column="id" length="36" name="objectIdentifier" type="string">
            <generator class="alto-uuid"/>
        </id>
        <property column="method_name" length="128" name="name"/>
        <property column="method_code" length="16" name="code"/>
        <property column="method_type" length="16" name="type"/>
        <property column="min_days" name="minimumDays"/>
        <property column="max_days" name="maximumDays"/>
        <many-to-one column="cost_rule_set_id" name="costRuleSet"/>
        <many-to-one column="waybill_template_id" name="waybillPrintTemplate"/>
        <set name="disabledPricingModels" table="st_biz_rc_disabled_pricing_models">
            <cache usage="nonstrict-read-write"/>
            <key column="inventory_item_id"/>
            <many-to-many class="com.shoptick.bizops.domain.PricingModel" column="pricing_model_id"/>
        </set>
                  
    </class>

This example will ensure that ShippingMethods get cached in the second level cache - and just ShippingMethods.  A common misconception about the second level cache relates to what happens to the object graph under an object that's been cached.

Let's clear up the confusion.  Caching a class only caches instances of that class.  It will not cache any associated classes.  In this example, costRuleSet and the waybillPrintTemplate values will not be cached.  The ID of the related object will be cached, but not the object itself.

This is actually a really good design from the standpoint of concurrency.  We don't have to worry about dozens of objects in the cache that all refer to the same related object, and that each cached instance might have a stale version of the related object.

Under the hood, when Hibernate chooses to place an object in the second level cache, it takes the instance (and just the instance of the class for which caching is enabled) and serializes it to a simple string based format that represents simple data types: strings, numbers, dates, etc.  If one of the properties refers to another object, that object's identifier is stored in the cache - but not the object.

When an object is loaded from the cache, Hibernate instantiates an instance of the class and sets all its properties using the serialized values stored in the cache.  This is called hydration.  If one of those values is an identifier for another object, that object will be loaded using the normal three step process: session cache, second level cache, database.  This happens recursively until the whole object graph is loaded (assuming the object graph is eager fetched.)

List, map, set and other collection style associations are not cached by default.  In the previous example, you'll see that we have a cache tag as part of the set declaration.  In this case, a separate cache is used just to store collections.  But note that collection caches are just mappings of ids to ids.  The key for the cache will be the parent object's identifier and the values will simply be a list or set of ids.  All that's cached is the association, not the referenced objects at either end of the association.

Caching Modes

There are three cache modes supported by Hibernate: read-only, read-write and nonstrict-read-write.  These modes relate to how caches are locked and used to enforce concurrency. 

Read-only is the fastest, since there is no locking overhead.  The downside is that objects cached as read-only will not get refreshed if an instance of the object is changed.  Read-write is the slowest because every read operation takes out a lock to prohibit a write operation.  It also uses locks to prevent concurrent write operations.  The nonstrict version of read-write assumes there won't be concurrent write operations and, as a result, has a much lower overhead in terms of lock synchronization.  We use this version extensively for configuration data and read-write caches for data subject to frequent changes like line items.  We don't use read-only caches at all.

Lazy Fetching and Preloading

Most systems that start to make extensive use of the second level cache will inevitably need a preloader that initializes the cache with frequently referenced objects.  In this case, it makes good sense to switch frequently needed objects - and the portions of their associated object graphs likely to be needed along with them - to eager fetching, so that the entire object graph gets preloaded and not just the top level object.

We ended up setting a large number of properties to eager load that we ordinarily wouldn't have, because we noticed a little quirk where lazy loaded associations were missing the cache.  For example, if getPricingModel() were set to lazy load and the associated pricing model was in the cache, the getter would go back out to the database anyway.  We think this could be the fault of the Hibernate Transaction Manager provided by Spring, but we aren't sure.  In short, our advice is to eager fetch if the associated object class is also cached.
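
Our preloader is specific to Flex, but the general shape is just a startup routine that walks a list of frequently used classes and loads them once so they land in the caches.  A minimal sketch (the class list is hypothetical, not our actual code):

    // Sketch of a cache preloader.  Loading everything once at startup pushes
    // the instances, and any eagerly fetched associations, into the second
    // level cache; setCacheable(true) also primes the query cache.
    public void preloadCaches(SessionFactory sessionFactory) {
        Class<?>[] hotClasses = { ShippingMethod.class, PricingModel.class };
        Session session = sessionFactory.openSession();
        try {
            for (Class<?> type : hotClasses) {
                session.createCriteria(type)
                       .setCacheable(true)
                       .list();
            }
        } finally {
            session.close();
        }
    }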

The Join Fetch Problem

Let's consider a many-to-one association on a class like the one configured here:

  <many-to-one column="customer_id" name="customer"/>

If you want to load the parent object and the customer object, you can either run one query or two.  You can run a single query that joins the parent class table and the customer table or you can run two separate queries: one against the parent class table and a second query against the customer table.  Hibernate lets you choose which way to do this using a fetch strategy.  The snippet below shows the two options in context.

    <many-to-one column="customer_id" name="customer" lazy="false" fetch="join"/>
   <many-to-one column="customer_id" name="customer" lazy="false" fetch="select"/>

In a normal use case, where neither the parent object nor the child object is cached, one query is generally better than two, so the default fetch mode is join.

But once you move to a configuration where the child object is likely to be cached, a fetch strategy of join can cause problems.  Think about it.  The purpose of the cache is to avoid reading information from the database unless absolutely necessary, especially information that doesn't change that often, like product descriptions or customer contact information.  Assuming the parent object is not cached and we're definitely going to need a query to fetch the parent object, we have no way of knowing what the id of the associated customer object is without first running a query.  So Hibernate runs that query with a join that brings back all the child object's fields along with the parent's.  In short, if you use a fetch strategy of join, you negate the benefit of using the cache because Hibernate will end up hitting the cached object's table anyway.

The solution is counter-intuitive: use a fetch strategy of select.  For the optimizing mind this seems scary because a select strategy means two database queries instead of one.  If neither object were in the second level cache, that would be true.  But if the associated object is highly likely to be in the cache, only one query gets run, because the id of the associated object lives on the parent object's table and comes back with the first query.  Once that id is in hand, Hibernate can check the caches for it before resorting to running a query.

If you want to guarantee a second level cache hit for many-to-one associations, make the properties eager fetch with a fetch strategy of select.

The Query Cache

One of the more common mistakes when configuring Hibernate is to enable the query cache and do nothing else.  This doesn't work.  You have to tell Hibernate in the code (or through saved queries) which queries or criteria objects can be cached.

A lot of developers are gun-shy about caching queries because of concurrency fears.  This is valid when something other than Hibernate writes to a table being queried, but if all the I/O that can update the database is controlled by Hibernate, it's okay to be very aggressive with query caching.

Hibernate is smart enough to figure out when a cached query should be evicted.  It does this by checking the query cache every time an object is updated or deleted and evicting all queries that reference one or more of the updated tables.

If some other process writes to the tables, query caching can be dangerous (as can caching in general).  Otherwise, it's pretty useful, but you must manually tell Hibernate which queries to cache by calling setCacheable() on the query or criteria object.
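
In practice that's one extra call per query or criteria object, along these lines (the HQL and parameter value are made up for illustration):

    // Sketch: opting a query and a criteria object into the query cache.
    List<?> ground = session
        .createQuery("from ShippingMethod where type = :type")
        .setParameter("type", "GROUND")
        .setCacheable(true)       // without this call the query cache is never consulted
        .list();

    List<?> allMethods = session
        .createCriteria(ShippingMethod.class)
        .setCacheable(true)
        .list();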


Use byId() Instead of load() Or Queries

One of the really dumb things we found in the code is that we were using queries to retrieve objects by identifier.  We did this to enable additional HQL to be added to queries for soft deletes, etc.  It's a dumb idea and ensures that in simple situations where someone calls findById() on a service or DAO, an SQL query gets executed, even if the object is cached.  One could hope that Hibernate would be smart enough to know that the only field in the where clause is the object's identifier and check the cache before running a query, but it doesn't work that way.

The solution is to use the byId() method on the session instead of running a query.  This will ensure that Hibernate checks the cache first.  There's also a method on the session called load() that seems semantically equivalent to byId().  The difference is that load() will never return null.  It will hand you back an object (typically an uninitialized proxy) even if no object matching the id exists.  This is usually not what you want.  Stick to byId().
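
Concretely, the approaches look like this (id is the object's UUID primary key; ShippingMethod is just a stand-in):

    // Sketch: retrieving an object by primary key three ways.
    // The HQL version always runs SQL; byId() checks the session and second
    // level caches first; load() never returns null, just a possibly
    // uninitialized proxy.
    ShippingMethod viaQuery = (ShippingMethod) session
        .createQuery("from ShippingMethod where objectIdentifier = :id")
        .setParameter("id", id)
        .uniqueResult();

    ShippingMethod viaById = (ShippingMethod) session
        .byId(ShippingMethod.class).load(id);    // null if no such row

    ShippingMethod viaLoad = (ShippingMethod) session
        .load(ShippingMethod.class, id);         // never null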

Use Natural Keys

Under the hood, we're big believers in surrogate keys, meaning primary keys that are random, immutable and have no meaning at all in business terms.  We use 36 character UUIDs as primary keys instead of sequential integers.  If you need to generate sequential numbers, you need locks, and when you cluster the database you need expensive locks enforced with network I/O.

Every domain object in Flex, and therefore every database table, has a UUID as a primary key.  But sometimes you need to look up an object by an alternate identifier, or a natural key.  In computer science lingo, a natural key is a unique identifier that has some kind of meaning to the user.  Examples would be social security numbers, bar codes, user ids, or job numbers.

The most common natural key lookups in Flex are user ids for logins and barcodes for inventory.  In the previous version of Flex, barcode lookups were done using an HQL query, albeit a simple one.  This has the same disadvantages as using a query to retrieve an object by its surrogate key: even if the object you're looking for is cached, Hibernate runs a query anyway.  In the case of natural keys, it doesn't know which ID is associated with a given barcode, so it has to go to the database, and by default, when Hibernate goes to the database, it gets the whole object, even if it's cached.

To work around this issue, you can configure one property of an object as a natural key.  Here's the real configuration we use to support barcode lookups:

    <class name="ManagedResource" table="st_biz_managed_resource" lazy="false">
        <cache usage="read-write" include="all"/>
        <id column="id" length="36" name="objectIdentifier" type="string">
            <generator class="alto-uuid"/>
        </id>
        <natural-id mutable="true">
            <property column="bar_code_id" length="64" name="barCodeId" not-null="false"/>
        </natural-id>
    </class>

This example flags the barCodeId property as a natural key and enables you to perform a natural key lookup that takes full advantage of the cache - as shown in this example:

return session.bySimpleNaturalId(InventoryItem.class).load(barcode);

This bypasses the normal query approach and saves a query - in theory.  In reality, it results in a simpler query at first and, in time, no queries at all.  When you invoke bySimpleNaturalId(), the first thing Hibernate does is try to determine which primary key matches the given natural id.  There is a natural id cache and this is checked first.  If there's no id in the natural key cache, Hibernate will run a very simple query like this:

         select id from inventory_item where bar_code = ?

This just gets the primary key, and if the object isn't cached, the system will run a second query to retrieve the object.  Otherwise, the object will be retrieved from the cache.  The second time the same natural key is looked up, the natural-key to primary-key mapping will already be in the cache and there won't be any database queries at all.

Hibernate's natural key feature is really just a special technique for improving cache efficiency.  If you're not using the cache or have an object that you're not caching, natural keys can actually slow things down because you'll always get two queries with a natural key lookup: one to get the natural key to primary key mapping and another for the main object lookup.

Key Generation

Another good way to get a speed boost, especially if you define speed as reducing the number of database hits, is to use an in-process mechanism for generating new primary keys.  This is virtually impossible to do with sequential integers or other database driven key generation mechanisms.

Prior to 4.6, we used MySQL's built-in GUID generator to generate primary keys, even though we weren't using sequential integers.  This meant that every insert operation required an extra select query to get a new primary key.

As part of this release we developed our own pure Java UUID generator and configured Hibernate to use it.  We released the UUID generator as part of the open source multi-tenancy project we've launched here: http://code.google.com/p/flex-alto/
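
The generator in flex-alto does more than this, but the core idea is just an implementation of Hibernate's IdentifierGenerator interface that never leaves the JVM.  A bare-bones sketch (not the actual flex-alto code), wired into the mapping the same way the alto-uuid generator is in the examples above:

    // Sketch: an in-process UUID generator for Hibernate 4.x.
    // Because the key is minted inside the JVM, inserts no longer need an
    // extra round trip to the database just to obtain a primary key.
    import java.io.Serializable;
    import java.util.UUID;
    import org.hibernate.engine.spi.SessionImplementor;
    import org.hibernate.id.IdentifierGenerator;

    public class InProcessUuidGenerator implements IdentifierGenerator {
        @Override
        public Serializable generate(SessionImplementor session, Object object) {
            return UUID.randomUUID().toString();   // 36 character string UUID
        }
    }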


In Conclusion

This long post hopefully chronicles some of the lessons we've learned in the process of getting Hibernate up to high speed.  Like so many things in performance tuning software, the solutions are rarely clever or exotic.  You either do things before you need to (preloading), wait until the last minute (lazy-fetching) or do them only once (non-preloaded caching).

The next version of Flex will be 4.6.0 and will include these performance improvements.  We're in the process of doing QA rework and should be in regression testing by the end of the week.  We're planning a short beta to test the memory footprint of the bigger cache in production and will do a wider release after that hurdle is cleared.
