Friday, November 30, 2012

A New Flight Plan

Flex is slow, much slower than it should or could be.  We've known this for a while, and we've also known the solution: ramping up our use of the second-level cache.  The reason we've held off on implementing it is that we planned to move to a distributed cache when we moved Flex to the new high availability architecture.  We let the problem fester for a bit because most of our customers aren't big enough or high-volume enough to really notice it - and we knew we would be fixing it as part of the high availability redesign.

But for the customers who are impacted, the impact is severe, very severe, and it got to the point where pride of authorship just wouldn't let the problem go untreated anymore.  So today we decided to throw out the flight plan and dedicate a solid two weeks of development time to nothing but performance tuning.

The Crux

Flex uses an object/relational mapping tool called Hibernate.  Hibernate is never fast unless you're doing something very simple or you use the second-level cache, a cache that holds data in between requests.  Hibernate is widely used, and odds are you interact with a system that uses it every day, even if you never log in to Flex.  That wouldn't be true if Hibernate were slow in absolute terms.  It's possible to make Hibernate very fast, but the trade-off is memory.

We do use the Hibernate second-level cache, just not as extensively as we could.  The reason is that each cloud instance of Flex is currently allocated 1024 MB of heap space.  Without a distributed or remote cache, any data we add to the second-level cache has to come out of that heap space, or we'd have to increase the heap allocation per cloud instance, which would increase our hosting costs and the monthly subscription fee along with it.  This is the classic trade-off between fast and cheap in action.
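
For the curious, opting an entity into the second-level cache is mostly a mapping and configuration exercise.  Here's a minimal sketch assuming Hibernate annotations with Ehcache as the region factory; the entity and its fields are made up for illustration and aren't actual Flex mappings.

    // Illustrative only - not a Flex entity.  Also requires, in the Hibernate properties:
    //   hibernate.cache.use_second_level_cache=true
    //   hibernate.cache.region.factory_class=org.hibernate.cache.ehcache.EhCacheRegionFactory
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import org.hibernate.annotations.Cache;
    import org.hibernate.annotations.CacheConcurrencyStrategy;

    @Entity
    @Cache(usage = CacheConcurrencyStrategy.READ_WRITE)  // cache loaded instances between requests
    public class InventoryItem {

        @Id
        private Long id;

        private String name;

        // getters and setters omitted for brevity
    }

Every entity cached this way saves database round trips on subsequent loads, but each cached instance is memory that has to live somewhere - which is exactly the constraint described above.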

The high availability architecture we have planned for the first quarter of next year would change the fundamental deployment architecture of Flex so that cloud instances share a large pool of memory instead of each install having a small 1024 MB sandbox.  Unfortunately, under high availability that memory can't really be used for write-sensitive caching because of concurrency.  Our high availability architecture is also intended to provide redundancy, so that a single server failure would not take a customer system down.  The new Flex will run a minimum of two servers behind a load balancer, with many more servers in use during peak load times.  With a conventional in-memory second-level cache, if you updated a piece of information on server A, a message would have to be sent to server B to ensure that the cached information is invalidated or updated - otherwise you could end up with stale data.

Using a distributed or remote cache like memcached is an increasingly common way to address this issue, and we planned to implement it (along with several other scalability features) in the first quarter of 2013.
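
To make "remote cache" a bit more concrete, here's a tiny standalone example using the spymemcached client.  This isn't how the Hibernate region factory integration actually plugs in, and the host name and key are placeholders - it just shows the basic idea of parking data on a cache server that every application server can reach.

    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    public class MemcachedSmokeTest {

        public static void main(String[] args) throws Exception {
            // Connect to a memcached (or ElastiCache) node; the host is a placeholder.
            MemcachedClient client = new MemcachedClient(
                    new InetSocketAddress("cache.example.internal", 11211));

            // Store a value for up to an hour, then read it back.
            client.set("quote:42", 3600, "serialized quote data");
            Object cached = client.get("quote:42");
            System.out.println("cached = " + cached);

            client.shutdown();
        }
    }

Because the cache lives outside the application servers, server A and server B both see the same entry, and an update from either one is immediately visible to the other - no invalidation messages between servers required.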

Advancing The Timeline

Unfortunately, we've been persuaded this week that we can't wait that long for a performance boost.  We don't have time to completely rearchitect Flex and introduce all the aspects of high availability we have planned - we need to stay on the current 1024 MB heap space deployment architecture for the time being.  But we also need to ramp up the second-level cache.  In order to do this without blowing through the 1024 MB of heap space, we've decided to introduce a remote or distributed second-level cache now, before the next major release of Flex.

Roger's already started some of this work by adding database analysis features to the code.  (The first rule of performance tuning is to measure performance first.)  To guard against regressions (at least for N+1 select problems), we'll also add metric-based exceptions to QA environments.  In plain English, this means that every time Flex interacts with the server, we'll count the number of times the database is hit for that interaction, and if it exceeds a certain number - say 25 for starters - the server will throw an exception.  This will help us identify where to focus our tuning efforts and quickly catch it in QA if a change reintroduces an N+1 select issue.
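
As a rough illustration of the metric-based exception idea - a sketch only, not the code Roger is actually writing, and the class name and wiring are hypothetical - a servlet filter can snapshot Hibernate's statement count before and after each request and blow up in QA when the delta is too high:

    import java.io.IOException;
    import javax.servlet.*;
    import org.hibernate.SessionFactory;
    import org.hibernate.stat.Statistics;

    // Requires hibernate.generate_statistics=true so Hibernate tracks statement counts.
    public class QueryBudgetFilter implements Filter {

        private static final int QUERY_LIMIT = 25;     // starting threshold from the post
        private final SessionFactory sessionFactory;   // assume this is injected

        public QueryBudgetFilter(SessionFactory sessionFactory) {
            this.sessionFactory = sessionFactory;
        }

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            Statistics stats = sessionFactory.getStatistics();
            long before = stats.getPrepareStatementCount();

            chain.doFilter(req, res);

            // Statistics are global to the SessionFactory, so under concurrent requests
            // this count is approximate - good enough for a QA tripwire.
            long statements = stats.getPrepareStatementCount() - before;
            if (statements > QUERY_LIMIT) {
                throw new ServletException("Request issued " + statements
                        + " SQL statements (limit " + QUERY_LIMIT + ")");
            }
        }

        public void init(FilterConfig config) { }
        public void destroy() { }
    }

A filter like this would only be registered in QA environments, so production traffic never pays for the check or risks the exception.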

We'll add the same validation to our test automation system and ensure that the number of database queries per transaction is below a certain threshold - otherwise the test will fail.
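
The shape of that check in a test might look something like this - again just a sketch; the session factory comes from the test fixture, and the entity is the hypothetical one from the earlier sketch:

    import static org.junit.Assert.assertTrue;

    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.stat.Statistics;
    import org.junit.Test;

    public class QueryBudgetTest {

        // Wired up by the real test harness; shown as plain fields to keep the sketch short.
        private SessionFactory sessionFactory;
        private Long itemId;

        @Test
        public void loadingAnItemStaysUnderTheQueryBudget() {
            Statistics stats = sessionFactory.getStatistics();
            stats.clear();

            Session session = sessionFactory.openSession();
            try {
                session.get(InventoryItem.class, itemId);   // the transaction under test
            } finally {
                session.close();
            }

            long statements = stats.getPrepareStatementCount();
            assertTrue("Expected at most 25 queries but saw " + statements,
                       statements <= 25);
        }
    }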

Ensuring Flexibility

The current plan is to use Amazon's ElastiCache service as our second-level cache.  Today we'll test it and evaluate whether or not it's a good fit for us.  We also need to ensure that self-hosted or dedicated instances - where the 1024 MB memory limit isn't a factor - can bypass a remote cache and just use local memory.  And we want to be able to toggle between a remote cache and an in-memory cache with a simple JNDI property injection.
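
Something along these lines is what we have in mind for the toggle.  This is a sketch only: the JNDI name, the memcached region factory class, and the wiring are illustrative, not settled.

    import javax.naming.InitialContext;
    import org.hibernate.SessionFactory;
    import org.hibernate.cfg.Configuration;

    public class CacheAwareSessionFactoryBuilder {

        public SessionFactory build() throws Exception {
            // "remote" or "local"; the JNDI name here is illustrative.
            String cacheMode = (String) new InitialContext()
                    .lookup("java:comp/env/flex/secondLevelCacheMode");

            Configuration cfg = new Configuration().configure();
            cfg.setProperty("hibernate.cache.use_second_level_cache", "true");

            if ("remote".equals(cacheMode)) {
                // Stand-in name for a memcached/ElastiCache-backed region factory.
                cfg.setProperty("hibernate.cache.region.factory_class",
                        "com.example.cache.MemcachedRegionFactory");
            } else {
                // Dedicated and self-hosted installs keep a plain in-memory cache.
                cfg.setProperty("hibernate.cache.region.factory_class",
                        "org.hibernate.cache.ehcache.EhCacheRegionFactory");
            }
            return cfg.buildSessionFactory();
        }
    }

The point is that the application code never knows or cares which cache it's talking to; one property set at deployment time decides which region factory Hibernate sees.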

Final Thoughts

The toughest thing for the engineering team here at Flex is choosing how to spend our time.  We have a rapidly growing customer base with a diverse range of opinions on what our development priorities should be.  We have a massive backlog of new features folks would like to see implemented, and when we work on arcane system or architectural issues, it always feels like we're stealing time from something more tangible that a customer has specifically asked for.  We estimate that only 10% of our customers are using Flex in a way that makes the performance issues a factor in their day-to-day operations.  So, by focusing on performance now, one could argue that we're giving 90% of our customers short shrift by delaying work on Fast Track issues or new features.

But sometimes you just have to stop and pay the piper.  I think that's where we're at.  I think all our customers will benefit from the performance boost - although some may prefer a slower system to one without multi-session event planning, for example.

I for one think it's smart to stop occasionally and take a hard look at performance, giving yourself enough time to take an uncompromising view of the issue.  I'm hesitant to mention the performance targets we've set for the next few weeks for fear that we might not be able to reach them, but they are aggressive.  The goal is not just to make Flex a little faster, but orders of magnitude faster.  The good news is we know what's slow, and the techniques for speeding those things up are well known.  They just take memory - either locally or in a remote cache.


