Monday, November 7, 2016

Making Flex Faster

Speed remains one of the top engineering priorities at Flex, even as we shift more and more effort toward the Flex 5 HTML5 rewrite. We recognize that as awesome as it will be to one day be free of the old Flash-based Flex 4 platform, our users still need to get their work done between now and then.

We have been making small, incremental speed improvements in most releases throughout this year, but it is difficult to really move the needle. We have also worked on memory leak issues, but those tend to be quite a bit easier. Normally with a memory leak you take a memory snapshot from a customer who is struggling, and with a little detective work in a tool like YourKit, you can usually find what is hogging the memory and fix it fairly easily. Speed is a completely different ballgame: it's hard to get good visibility, and the fixes are much more difficult and time-consuming.

For us, speed issues revolve almost entirely around database I/O (input/output). Most of the time it's one of the following types of problems.

  • N+1 SQL queries. Queries in a loop, where you get one database trip per iteration.
  • Large SQL queries. Just sheer size, like many MBs of SQL output.
  • Bad SQL queries. E.g. missing indexes, bad joins, etc.
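
To make the first item concrete, here is a minimal sketch in plain Java. It is not Flex code: FakeItemTable is a hypothetical in-memory stand-in for a database table that only counts round trips, and findById / findByIds are made-up names. The "+1" in N+1 is the initial query that loads the list; the "N" is the per-row query inside the loop, which is what the sketch counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a database table; it only counts round trips. */
class FakeItemTable {
    private final Map<Integer, String> rows = new TreeMap<>();
    int roundTrips = 0;

    FakeItemTable(int rowCount) {
        for (int id = 1; id <= rowCount; id++) {
            rows.put(id, "item-" + id);
        }
    }

    /** Behaves like "SELECT ... WHERE id = ?": one trip per call. */
    String findById(int id) {
        roundTrips++;
        return rows.get(id);
    }

    /** Behaves like "SELECT ... WHERE id IN (...)": one trip for the whole list. */
    List<String> findByIds(List<Integer> ids) {
        roundTrips++;
        List<String> found = new ArrayList<>();
        for (int id : ids) {
            found.add(rows.get(id));
        }
        return found;
    }
}

class NPlusOneDemo {
    /** Query inside a loop: one round trip per id. */
    static int tripsWithLoop(FakeItemTable table, List<Integer> ids) {
        int before = table.roundTrips;
        for (int id : ids) {
            table.findById(id);
        }
        return table.roundTrips - before;
    }

    /** Same rows, fetched with a single batched query. */
    static int tripsWithBatch(FakeItemTable table, List<Integer> ids) {
        int before = table.roundTrips;
        table.findByIds(ids);
        return table.roundTrips - before;
    }

    public static void main(String[] args) {
        FakeItemTable table = new FakeItemTable(1000);
        List<Integer> ids = new ArrayList<>();
        for (int id = 1; id <= 1000; id++) {
            ids.add(id);
        }
        System.out.println("loop:  " + tripsWithLoop(table, ids));  // prints "loop:  1000"
        System.out.println("batch: " + tripsWithBatch(table, ids)); // prints "batch: 1"
    }
}
```

Each real round trip also pays network latency, which is why the loop version hurts so much more in production than a counter suggests.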

Focused Speed Work

In light of all this, back in late September we spent at least a week of focused developer time trying to get some speed relief in targeted high-use areas of Flex. We started by having the support team compile a list of the top slow areas reported by customers. Development then took that detailed information and began trying to understand what was really happening in those slow areas.

Even though we have known about these problems for some time, the fixes are not so straightforward. The reasons are a bit hard to explain, but the short answer is that we have a huge domain model (think of all the many tables & fields in the database) and use an ORM (Object-Relational Mapping) tool called Hibernate inside our Java application that maps database tables & columns to Java objects.

Hibernate is a "great" tool when you start because it allows a developer to rapidly add new tables & fields, and the SQL for them is generated automatically when you ask the database for something (such as a Quote or Inventory Item). However, this accumulates over time and gets completely out of hand. You might be after just one or two fields from a table for your business logic, but Hibernate will fetch everything because it has no idea what you are really after.

You end up with tons of overhead, with the database getting absolutely hammered by the sheer volume of SQL (e.g., I've seen 30MB or more of SQL generated for a single line item edit action). Most of that is of no use and just gets garbage collected inside the Java application.

So again, we've known for a while what the general problems are. The hard part is getting the right kind of visibility and knowing what to change to fix them.

The Breakthrough

The breakthrough back in September was bringing in a tool called P6Spy. It's an open source tool you can plug in without the application even knowing about it. Basically, it intercepts all of the raw SQL that is sent to the database. It has many configuration options, but one of the coolest settings you can enable is the application stack trace. With that enabled, in addition to seeing the raw SQL output, you can see the exact line of code inside the application that generated the SQL!
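
The idea behind the stack trace setting can be shown in a few lines of plain Java. To be clear, this is not P6Spy's code or API: SqlCallSiteRecorder, HypotheticalLineItemService, and the line_item table are all made up for illustration. The sketch just records each SQL string together with the first stack frame outside the recorder, which is the kind of pairing the tool gives you. (P6Spy's own documentation describes a stacktrace option in its spy.properties file for this; check the docs for the version you use.)

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical recorder: pairs each SQL string with the code location that issued it. */
class SqlCallSiteRecorder {
    final List<String> log = new ArrayList<>();

    /** Records the SQL plus the first stack frame that is outside this class. */
    void record(String sql) {
        String self = SqlCallSiteRecorder.class.getName();
        String callSite = "unknown";
        for (StackTraceElement frame : new Throwable().getStackTrace()) {
            if (!frame.getClassName().equals(self)) {
                callSite = frame.getClassName() + "." + frame.getMethodName()
                        + ":" + frame.getLineNumber();
                break;
            }
        }
        log.add(sql + "  <- " + callSite);
    }
}

/** Hypothetical application code that issues a query. */
class HypotheticalLineItemService {
    static void loadLineItem(SqlCallSiteRecorder recorder) {
        // A real interceptor would sit between the application and the JDBC driver;
        // here we call the recorder directly to keep the sketch self-contained.
        recorder.record("SELECT * FROM line_item WHERE id = 42");
    }

    public static void main(String[] args) {
        SqlCallSiteRecorder recorder = new SqlCallSiteRecorder();
        loadLineItem(recorder);
        for (String entry : recorder.log) {
            System.out.println(entry);
        }
    }
}
```

With thousands of statements in a log, grouping entries by call site is what makes an unexpected query-in-a-loop stand out.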

With this in place, it was like we suddenly had eyes into what was going on. We rapidly discovered some obvious issues, like unexpected N+1 selects. Usually the fixes involved caching tweaks so that a database hit wasn't needed, or moving find-by-id fetches into some kind of one-time batch fetch.
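
As a rough illustration of the caching kind of fix, here is a minimal sketch, assuming a simple in-memory map is good enough for the lifetime of one operation. FindByIdCache and the "pricing-model" loader are hypothetical names, not anything from Flex or Hibernate, and the sketch ignores invalidation, which a real cache has to deal with.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical short-lived cache in front of a find-by-id lookup. */
class FindByIdCache<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Function<K, V> loader;
    int databaseHits = 0;

    /** The loader stands in for the real database call. */
    FindByIdCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, calling the loader only on the first request per id. */
    V get(K id) {
        return cache.computeIfAbsent(id, key -> {
            databaseHits++;
            return loader.apply(key);
        });
    }
}

class FindByIdCacheDemo {
    public static void main(String[] args) {
        FindByIdCache<Integer, String> cache =
                new FindByIdCache<>(id -> "pricing-model-" + id);
        for (int i = 0; i < 1000; i++) {
            cache.get(7); // same id requested on every iteration
        }
        System.out.println(cache.databaseHits); // prints "1"
    }
}
```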

Collection Batching

We fixed those obvious ones and then moved on to other improvements. Specifically, we started doing "collection batching". Let me explain... in the application you can have a Java domain object with a collection hanging off of it (e.g. an Inventory Item has a collection of Serial Numbers), and with Hibernate those collections are lazily loaded by default. That means you could pull that item from the database, but the serial numbers won't load from the database until you actually call getSerialNumbers() on the item object.

This is fine sometimes, but what if you were looping over 1000 inventory items? Yeah, you'd be hitting the database every time you call getSerialNumbers() on an inventory item. That is what we call an N+1 select issue, and it is an absolute performance hog.

The nice thing is we discovered a little-known Hibernate setting known as "collection batching". Say you have the serial number collection as above, but you set the batch size on the collection to, say, 100. Now when you call getSerialNumbers(), Hibernate will fetch up to 100 serial number collections (for items that are already in the Hibernate session) in a single call. This means that for the 1000 inventory items, you might only get 10 hits to load all the serial numbers. The Hibernate documentation has more info on this batch size setting.

That is a factor of 100 reduction in database trips: instead of 1000 individual hits, it might be as low as 10 trips for all the serial number collections. This was a huge breakthrough, and we implemented this strategy in key document editing areas.
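
The arithmetic is easy to check with a small simulation. This is a sketch in plain Java, not Hibernate: LazyCollectionSession is a hypothetical model of a session holding items whose serial number collections have not been loaded yet, and it only counts round trips. In Hibernate itself the setting is the @BatchSize annotation (org.hibernate.annotations.BatchSize) on the collection, or the batch-size attribute in XML mappings; which of the two Flex uses is not something this sketch assumes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical model of lazy collections in one session; it only counts round trips.
 * In a Hibernate mapping the equivalent knob would look roughly like:
 *   @OneToMany @BatchSize(size = 100) Set<SerialNumber> serialNumbers;
 */
class LazyCollectionSession {
    private final Map<Integer, List<String>> loaded = new HashMap<>();
    private final Set<Integer> pending = new LinkedHashSet<>();
    private final int batchSize;
    int roundTrips = 0;

    LazyCollectionSession(int itemCount, int batchSize) {
        this.batchSize = batchSize;
        for (int id = 1; id <= itemCount; id++) {
            pending.add(id);
        }
    }

    /** First access triggers one trip that loads up to batchSize pending collections. */
    List<String> getSerialNumbers(int itemId) {
        if (!loaded.containsKey(itemId)) {
            roundTrips++;
            List<Integer> batch = new ArrayList<>();
            batch.add(itemId);
            for (int other : pending) {
                if (batch.size() >= batchSize) {
                    break;
                }
                if (other != itemId) {
                    batch.add(other);
                }
            }
            for (Integer id : batch) {
                loaded.put(id, Collections.singletonList("SN-" + id));
                pending.remove(id);
            }
        }
        return loaded.get(itemId);
    }

    /** Walks every item once and reports how many round trips that cost. */
    static int tripsFor(int itemCount, int batchSize) {
        LazyCollectionSession session = new LazyCollectionSession(itemCount, batchSize);
        for (int id = 1; id <= itemCount; id++) {
            session.getSerialNumbers(id);
        }
        return session.roundTrips;
    }

    public static void main(String[] args) {
        System.out.println(tripsFor(1000, 1));   // prints "1000" (no batching)
        System.out.println(tripsFor(1000, 100)); // prints "10"
    }
}
```

In general the best case is the item count divided by the batch size, rounded up; real numbers can be worse if the items in the session are touched in an order that leaves partial batches.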

We rolled out these speed fixes in version 4.18.2 in mid-October. We have heard directly from customers that it is faster, specifically with document editing, which is exactly what we were after.

Where do we go from here?

Monitoring is our next big step. We have set up a tool stack with InfluxDB, Grafana, and Telegraf that will collect metrics sent from Flex. We have version 4.19.0 queued up for deployment this week. With that release, Flex will begin shipping metrics to this new tool stack. We will be able to set up dashboards that give us many kinds of visibility into the application, and we will use this info to make more targeted fixes.
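
For a sense of what a shipped metric could look like, here is a minimal sketch that formats one data point in InfluxDB's line protocol (measurement, tags, fields, timestamp). How Flex actually sends its metrics is not covered here, so treat everything specific as an assumption: the LineProtocolPoint class, the flex_action measurement, and the tag and field names are all hypothetical, and the sketch handles only integer fields and basic escaping.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical formatter for one InfluxDB line protocol point (integer fields only). */
class LineProtocolPoint {
    /** Measurement names need commas and spaces escaped. */
    private static String escapeMeasurement(String s) {
        return s.replace(",", "\\,").replace(" ", "\\ ");
    }

    /** Tag keys, tag values, and field keys need commas, equals signs, and spaces escaped. */
    private static String escapeKey(String s) {
        return s.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ");
    }

    static String format(String measurement, Map<String, String> tags,
                         Map<String, Long> fields, long timestampNanos) {
        if (fields.isEmpty()) {
            throw new IllegalArgumentException("a point needs at least one field");
        }
        StringBuilder line = new StringBuilder(escapeMeasurement(measurement));
        // Sorted maps keep tag and field order stable from one call to the next.
        for (Map.Entry<String, String> tag : new TreeMap<>(tags).entrySet()) {
            line.append(',').append(escapeKey(tag.getKey()))
                .append('=').append(escapeKey(tag.getValue()));
        }
        String separator = " ";
        for (Map.Entry<String, Long> field : new TreeMap<>(fields).entrySet()) {
            // The trailing "i" marks the value as an integer.
            line.append(separator).append(escapeKey(field.getKey()))
                .append('=').append(field.getValue()).append('i');
            separator = ",";
        }
        return line.append(' ').append(timestampNanos).toString();
    }

    public static void main(String[] args) {
        Map<String, String> tags = new TreeMap<>();
        tags.put("action", "line_item_edit");
        Map<String, Long> fields = new TreeMap<>();
        fields.put("duration_ms", 850L);
        fields.put("sql_statements", 1200L);
        System.out.println(format("flex_action", tags, fields, 1478476800000000000L));
        // prints "flex_action,action=line_item_edit duration_ms=850i,sql_statements=1200i 1478476800000000000"
    }
}
```

Tagging each point with the user action is what would let a Grafana dashboard break response time and SQL volume down by area of the application.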

So we intend to rinse & repeat with targeted fixes until we get Flex 4 to reasonable speed levels in the "day to day" high-use areas. With the Flex 5 rewrite, we are contemplating a whole new approach to database access, which will be fast and be the ultimate solution to Flex 4's speed woes.

Hopefully we'll have a blog post here in the future on the new monitoring/metric tool stack and the results we get from that! Stay tuned!
