Wednesday, February 27, 2013

Growing Pains and Performance Measurement

There's no point in hiding it - we're in the throes of growing pains, trying to handle ongoing feature development while modifying Flex to work efficiently at scale.

Right now we're putting the finishing touches on the next release of Flex, which takes into account some lessons learned upgrading Hibernate and performance tuning the system for large customers.  Some of our ideas worked, some didn't, and some only work if we can throw large amounts of memory at the problem - which we can't.  We decided to stop the performance experiments and get all the new features released, then resume work on experimental performance improvements once we initiate code branching and establish an ongoing performance measurement regime.

Facing Growth

Flex is growing like wildfire, and while that's a good problem to have, it's still a problem. From a technical standpoint, the key issue is that Flex was never designed to make efficient use of a cloud environment. Our current architecture involves running 11 customers on each cloud server and allocating each customer instance around 2 GB of memory between heap space and permanent generation space. Building out an auto-scaling, fault-tolerant, multi-tenant architecture when we first started would have been wildly optimistic in terms of customer growth and would have diverted resources from developing the new features our customers actually care about. But now it's essential.

Most customers don't make use of all this memory and it goes idle while others strain to fit their operations into the 2 GB footprint.   We've always known this was an opportunity - if we could re-architect Flex to optimize memory allocation based on real usage, we'd be able to make sure the higher volume customers got the memory they needed without wasting it on the customers that don't.

One way would have been to maintain the separate-instance-per-customer model, but add some analysis tools that tweak the memory allocation based on usage. But with customer growth knocking on the door of 300, and over 30 production servers in use between our Virginia, Oregon, Ireland, and Sydney data centers, we're at a scale where even that optimization would only be a band-aid. It wouldn't address issues like fault tolerance, and it wouldn't give us the ability to isolate resource-intensive tasks like report generation on their own servers (where, for example, their memory use won't impact bar code scan response times).

We've established a series of near term and intermediate term objectives for the coming year, with the two most critical being performance improvements for high volume systems and shorter maintenance release cycles.  Our longer term goal of moving Flex to a fault-tolerant, multi-tenant architecture has to be considered now because technology choices made to solve short term problems must take it into account.

Faster Releases

We've suffered from some issues related to code branching lately, or rather, a lack thereof.  We've always thought we were small enough to work out of a trunk or master branch instead of forking the code in parallel branches.  We were small enough to avoid branching at one time, but no longer.

The problem is that simple bug fixes and feature enhancements get delayed by more elaborate or experimental work that takes longer to shake out. Our first order of business once this release gets out and stable is to move the entire codebase from Subversion to Git. Once it's moved, all major work on the code will be carried out in a "major-version" branch, with the master branch reserved for maintenance work.

The net effect of this change will be more frequent releases with minor tweaks and bug fixes out the door sooner.  We're about to start working on a large number of experimental performance improvements and architecture changes - exactly the kind of thing that tends to delay a release.  We'll move our code to parallel minor/major branches before work starts on any of that.

Performance Measurement

Our number one priority right now is improving performance.  We've made some major gains on that front as part of this release, but it's still not where we want it to be.  We want performance improvements measured in orders of magnitude.  4-6 times faster is good; 100-1000 times faster is better.  To achieve this, we'll have to do experimental things like bypass Hibernate, perhaps even consider non-relational databases like BigTable, Cassandra or MongoDB.

We also learned over the last few months that making something fast in one area can make it slow in another. We need a good set of performance benchmarks so we know what works and what ripple effects it has.

To achieve this, we revived an old idea about Regression Testing Performance and started an open source project on GitHub called Perforate. (You can view or download the source code there.) Perforate integrates with the TestNG framework we use for unit and integration testing, tracking the historical running time for each test and failing it if the test exceeds the historical mean by a configurable margin (the default is three standard deviations).
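As a rough illustration of the check described above, here is a minimal sketch in Java. It is not Perforate's actual code or API: the class and method names are hypothetical, and it assumes run times are kept as a plain list of milliseconds and that the sample (n - 1) standard deviation is used.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a "mean + k standard deviations" check.
// Not Perforate's real implementation; all names are illustrative.
final class RuntimeRegressionCheck {

    /** Arithmetic mean of the historical run times, in milliseconds. */
    static double mean(List<Double> samples) {
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.size();
    }

    /** Sample standard deviation (n - 1 denominator); 0 with fewer than two samples. */
    static double stdDev(List<Double> samples) {
        if (samples.size() < 2) {
            return 0.0;
        }
        double m = mean(samples);
        double sumSq = 0.0;
        for (double s : samples) {
            sumSq += (s - m) * (s - m);
        }
        return Math.sqrt(sumSq / (samples.size() - 1));
    }

    /** True when the latest run exceeds mean + k * stddev of the history. */
    static boolean isRegression(List<Double> historyMs, double latestMs, double k) {
        if (historyMs.size() < 2) {
            return false; // assumption: too little history to judge, so never fail
        }
        return latestMs > mean(historyMs) + k * stdDev(historyMs);
    }

    public static void main(String[] args) {
        List<Double> history = Arrays.asList(100.0, 102.0, 98.0, 101.0, 99.0);
        // Mean is 100 ms, sample stddev is about 1.58 ms, so the limit is about 104.7 ms.
        System.out.println(isRegression(history, 103.0, 3.0)); // false
        System.out.println(isRegression(history, 150.0, 3.0)); // true
    }
}
```

A test with very stable history gets a tight limit, while a noisy test gets a looser one, which is the appeal of a deviation-based threshold over a fixed time budget.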

Roger Diller took this a step further and cooked up an internal user interface we use to analyze performance data.  We've posted a few screen grabs below:

This is a great tool for the weeks ahead, because we can make a code or architecture change we think will have a good impact on performance and just look at the graphs for sharp drops to know that it worked - or sharp spikes to know our plan backfired.

Our goal over the coming weeks - once we've branched the code - is to make as many of those curves have sharp drop offs as possible.

The Options

We've learned the hard way that Ted Neward was right when he called Object-Relational Mapping the Vietnam of Computer Science. Hibernate can make mundane things easier, but often with an overhead cost that's unacceptable. We think Hibernate is the major barrier right now to blindingly fast performance. We've tinkered with it enough to make 4.6 faster than 4.5, but we think we've gone about as far as we can go with it.

Hibernate isn't going away, but we're going to start phasing it out in certain areas by going with straight JDBC for line items and scan records.  We're also looking at storing an adjacency matrix for the upstream/downstream line item hierarchy so it can all be pulled back in one query instead of recursive queries.  We think this will address the majority of our problems, but we're also looking at ways to optimize resources and project elements.
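To make the hierarchy idea concrete, here is a minimal in-memory sketch in Java. It rests on an assumption about what the stored structure would look like: every upstream/downstream pair is precomputed (often called a closure table), so fetching everything below an item becomes a single lookup rather than a recursive walk. The class and method names are hypothetical, and a real version would persist the pairs in a table and read them back over JDBC.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: names and structure are illustrative, not Flex code.
final class LineItemClosure {

    /**
     * Expands direct parent links (child id -> parent id) into every
     * (upstream, downstream) pair, keyed by the upstream id.
     * Root items simply have no entry in parentOf.
     */
    static Map<Long, Set<Long>> build(Map<Long, Long> parentOf) {
        Map<Long, Set<Long>> downstream = new HashMap<>();
        for (Long item : parentOf.keySet()) {
            Long ancestor = parentOf.get(item);
            int hops = 0;
            // Walk up the chain once at build time, recording each ancestor.
            while (ancestor != null) {
                downstream.computeIfAbsent(ancestor, k -> new TreeSet<>()).add(item);
                ancestor = parentOf.get(ancestor);
                if (++hops > parentOf.size()) {
                    throw new IllegalStateException("cycle in line item hierarchy");
                }
            }
        }
        return downstream;
    }

    /** Everything below the given item, found with one lookup and no recursion. */
    static Set<Long> downstreamOf(Map<Long, Set<Long>> closure, long id) {
        return closure.getOrDefault(id, Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<Long, Long> parentOf = new HashMap<>();
        parentOf.put(2L, 1L); // item 2 sits under item 1
        parentOf.put(3L, 1L); // item 3 sits under item 1
        parentOf.put(4L, 2L); // item 4 sits under item 2
        Map<Long, Set<Long>> closure = build(parentOf);
        System.out.println(downstreamOf(closure, 1L)); // [2, 3, 4]
        System.out.println(downstreamOf(closure, 2L)); // [4]
    }
}
```

The trade-off is that the pairs have to be rewritten whenever a line item is moved, so writes cost more in exchange for cheaper reads.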

The scan log and line items are simple - they don't come with the same complicated object graph structure that project elements and resources do.  For example, Inventory Items, a resource, have quantity records, contents, OOC records and serial numbers.  Going straight JDBC for objects with complicated graphs might be a boost over Hibernate, but for the seismic boost we want, we might have to look at other options, like NoSQL databases.  As of now, MongoDB looks like a good candidate for optimizing project elements and resources.

We also need to improve the speed of search execution and indexing. Our existing search is a little kludgy and could benefit from a major overhaul. We're currently looking at moving our entire search process to Apache Lucene, the same search technology that powers Twitter and a number of other marquee sites. For self-hosted customers, we expect to use an embedded Lucene engine. For the cloud, we'll offload search indexing and execution to a central server using Apache Solr.

Of course, all of these options are experimental. They might work; they might not. There's really no way to know for sure until we try them. But to do that, we need to isolate experimental code in its own branch so as not to interfere with routine maintenance and Fast Track work. We also need a reliable way of measuring performance to accurately gauge our successes and failures.
