Friday, December 20, 2013

A Flex Spin Off Is In Development

Years ago I spent a few days in a backroom in Van Nuys, California, knocking together the abomination that is our barcode label printer integration.  My wife, then a teacher, was off for the summer and we were staying at the Oakwood wedged between Universal and Warner Brothers in some nebulous quarter of LA that was either Hollywood, Glendale or Burbank. Nobody's quite sure.

This particular Oakwood would turn out to be an incubator for child stars and had provided accommodation for the likes of Kirsten Dunst and Doogie Howser, MD.  I learned this somewhat after the fact - after asking the staff why there were so many kids running around with expensive haircuts and bad attitudes.

My wife decided to maximize her LA experience and signed up to be an extra.  She did an all night shoot for Californication in high heels.  Needless to say, neither of us were in a place to make good decisions that summer.  It was in this atmosphere that our raw templating approach to bar code printer integration was born.  It works, but it's about as brittle and user unfriendly as you can get.

Holding Our Nose

We've let this problem fester for a while - but lately there's been a lot of interest in modernizing barcode printer integration.  We've had character encoding issues that can't be solved without moving to ZPL2 in addition to the general hassle of working with label templates in the raw printer language.
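For context, the raw templating approach amounts to splicing user data straight into printer-language source.  Here's a minimal sketch of the idea (the label layout and field names are invented for illustration; ^CI28 is the ZPL II command that switches the printer to UTF-8, which is why the encoding problem pushes us toward ZPL2):

```python
# A naive raw-template approach to label generation: splice user data
# straight into ZPL printer-language source. Layout and field names here
# are hypothetical, not Flex's actual templates.
ZPL_TEMPLATE = (
    "^XA"                                     # start of label format
    "^CI28"                                   # ZPL II: switch to UTF-8
    "^FO50,50^A0N,30,30^FD{item_name}^FS"     # text field at x=50, y=50
    "^FO50,100^BCN,80,Y,N,N^FD{barcode}^FS"   # Code 128 barcode field
    "^XZ"                                     # end of label format
)

def render_label(item_name: str, barcode: str) -> str:
    # Brittle by design: there's no escaping, so data containing ZPL
    # control characters like ^ or ~ will corrupt the label format.
    return ZPL_TEMPLATE.format(item_name=item_name, barcode=barcode)

label = render_label("Shure SM58", "100001")
```

The brittleness is easy to see: a caret or tilde in an item name silently breaks the whole label, and every layout tweak means hand-editing printer-language source.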

We recently hired a developer (Erik) who, in a previous life, worked for Seagull Scientific on their label workflow product, Commander.  He ran point on an effort to find barcode label design/automation software we could easily integrate Flex with.

The report was not good.  Everything out there seems to be from another time.  Much like our own A/V rental software industry, it's ten to twenty years behind the times.  It's all CD-ROMs in boxes in an era when fewer and fewer computers even ship with optical drives.  It's a scenario we know all too well.

We saw little point in investing our scarce resources into integrating with software platforms that are surely destined for extinction.

In The Meantime

So it occurred to us that it's recently become possible to do a lot of powerful things in browsers without asking the user to download any software or install a plugin.  There's SVG, D3 and HTML 5's Canvas to name a few options.  Someone could - and inevitably would - build a barcode label design tool entirely in Javascript.  What if we built it first?  What if we gave it away for free as an open source library?
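As a hint of what's possible without plugins, here's a toy sketch that turns a precomputed bar/space pattern into an SVG fragment with nothing but string building (the pattern below is arbitrary, not a real symbology):

```python
def bars_to_svg(pattern: str, bar_width: int = 2, height: int = 50) -> str:
    # pattern: '1' = black bar, '0' = space, one module per character.
    # A real barcode library would first encode text into a symbology's
    # module pattern; this sketch only handles the rendering half.
    rects = []
    for i, module in enumerate(pattern):
        if module == "1":
            x = i * bar_width
            rects.append(
                f'<rect x="{x}" y="0" width="{bar_width}" '
                f'height="{height}" fill="black"/>'
            )
    width = len(pattern) * bar_width
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">' + "".join(rects) + "</svg>"
    )

svg = bars_to_svg("101100101")
```

Drop that string into a web page and the browser draws the bars - no plugin, no download, which is the whole premise of a Javascript label designer.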

We'd just need a little server application to go with it that could store the label templates, talk to the printers and expose an API for triggering print operations.  It occurred to us pretty quickly that the utility of such a system would extend well beyond equipment rental software - so adding this as yet another feature to Flex made little sense.  We decided to spin it off as a separate project, brainstormed for names, checked our list against available domain names and Label Ninja was born.


A Green Field

Work on Label Ninja will start early next year and the focus will be replacing our existing raw templating system as soon as possible. We won't shoot for the kitchen sink - the focus will be clarity and simplicity.

For us, it will be a way to gain experience developing HTML 5 applications without putting our core product at risk.  For our customers, it will make changing barcode labels and hooking up printers ten times easier.  It will also give customers a taste of what to expect when our effort to replace Flash begins in earnest - an effort that won't be a shot-for-shot remake of the Flash based Flex, but a full reevaluation of the user interface, one that pays attention to aesthetics as well as functionality, with a new emphasis on simplicity.

All Flex customers will be first in line when Label Ninja rolls off the assembly line next year.  There's no need to sign up for the mailing list because you're already on it.  But you might check out Label Ninja's coming soon page (labelninja.net) anyway, as it gives a hint of where we're headed.




Tuesday, November 12, 2013

A Self Hoster's Guide to 4.6.16

This week we start deploying version 4.6.16 of Flex.  If you haven't looked at the release notes, check them out here.

While the release notes may not signal a big release, 4.6.16 introduces two new technologies into our stack: RabbitMQ and Riak.  Another notable feature of this release is that it requires Ubuntu 13.04.  More on that later.

RabbitMQ

RabbitMQ replaces our old ActiveMQ JMS server for asynchronous messaging.  This was a major undertaking because it meant getting rid of JMS as a standard for messaging in favor of AMQP, a protocol developed by the financial industry for high speed trading, etc. 

We use asynchronous messaging throughout Flex.  We use it for QuickBooks integration, pushing data to Production Exchange, eager availability recalculations, search indexing, and many other applications.  Switching to RabbitMQ makes this messaging faster and more reliable - and is a necessary step for our high availability architecture, where asynchronous message processing will be handled by dedicated servers.
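Flex's actual messaging runs over AMQP from Java, but the fire-and-forget pattern behind all these features is language-neutral.  A minimal in-process sketch (queue and task names are invented; a dedicated broker like RabbitMQ plays the role of the queue here):

```python
import json
import queue
import threading

task_queue = queue.Queue()
processed = []

def worker():
    # Dedicated consumer: in the high availability architecture this role
    # moves to separate servers; here it's just a background thread.
    while True:
        message = task_queue.get()
        if message is None:  # shutdown sentinel
            break
        task = json.loads(message)
        processed.append(task["type"])
        task_queue.task_done()

t = threading.Thread(target=worker)
t.start()

# Producers publish and move on without waiting for the work to finish.
task_queue.put(json.dumps({"type": "reindex", "document_id": 42}))
task_queue.put(json.dumps({"type": "recalc-availability", "item_id": 7}))
task_queue.put(None)
t.join()
```

The point of the broker is that producers never block on the slow work - search indexing or an availability recalculation happens whenever a consumer gets to it.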

Riak

Roger Diller did most of the work on our new Riak search implementation and he did a great write up of Riak earlier today.  I'd only add that Riak is perhaps the best of the so-called NoSQL databases that have become popular in recent years as a way around the scalability problems of relational databases.

The new version of Flex uses Riak instead of Lucene to process project element searches and store the indexes.  Over time our use of Riak will likely increase.  We're kicking around the idea of using Riak for document storage and will likely leverage it for inventory and contact search before long.

Ubuntu 13.04

As of this release, we will require all servers to run Ubuntu 13.04.  For customers that run cloud based instances, there's nothing to worry about.  We're already running 13.04 in our Sydney, Chicago, Montreal and Roubaix clusters.

All self-hosted customers will have to upgrade to 13.04 before 4.6.16 can be deployed.  If you run Flex on a self hosted system, we'll contact you shortly to schedule a maintenance window to perform the Ubuntu upgrade and install Flex 4.6.16.  Part of this process will include upgrading MySQL and installing Riak and RabbitMQ - in addition to the basic upgrade process.  We'll also archive your Flex data and configuration prior to any OS upgrade just in case there's a problem that may require reinstalling the operating system from scratch.

Home Grown Fault Tolerance

Something we'd like system administrators to start thinking about is how they might want to deploy a high availability system on site once our software supports it.  To get started, you'd need two or more servers and some method of distributing load across them and redirecting traffic when a server fails.

The most common way of doing this is with a hardware load balancer.  Flex won't require any particular kind of load balancer, but you'll get better performance if the load balancer supports sticky sessions or server affinity tied to configurable HTTP headers or cookies (and most do).  Flex doesn't and won't use volatile user sessions (which means a server crash won't interrupt user logins), but it will use a hybrid local/remote caching system, and sticky sessions will increase the efficiency of this cache - and overall performance - by reducing cache misses.
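The affinity idea is simple enough to sketch: hash a stable cookie or header value to pick a backend, so the same client keeps landing on the same server's warm cache (server names below are made up; real load balancers implement this for you):

```python
import hashlib

SERVERS = ["app1.example.com", "app2.example.com", "app3.example.com"]

def pick_server(session_cookie: str, servers=SERVERS) -> str:
    # Deterministic hash of the cookie: same client, same backend,
    # which keeps that backend's local cache warm for this client.
    digest = hashlib.sha256(session_cookie.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]

chosen = pick_server("flex-session-abc123")
```

Since login state isn't tied to the server, losing a backend just means a client rehashes onto another one and repopulates that cache - slower for a bit, but never broken.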

You could also use a software load balancer if you have an old server lying around.  We recommend HAProxy if you go that route.

A poor man's load balancing would be to simply use round robin DNS.  This wouldn't make server crashes as painless for users as a true load balancer, but it would help break up your user load across multiple servers.

You'll also want to make sure any servers you provision for a multi-server Flex install have two network interfaces, as the second interface will be used to form a private network between the servers for Riak synchronization and database IO.

And, if you wanted to be really cool, you could deploy a four server Flex configuration - two database servers configured as master and slave, two application servers in front of those, and load balancers in front of those.  This configuration would closely mirror the hardware configuration we'll use in our next generation cloud.

If all this sounds daunting or like overkill, don't worry.  Even once Flex has been modified to run on a multi-server high availability architecture, it will still run just fine on a single server, just like it always has.


Introducing Riak

In the upcoming 4.6.16 release, we've made some major changes to the search system. In fact, we introduced a brand new external NoSQL database system to use for search. Everyone, meet Riak.

Riak is a high speed/low latency key/value database that is masterless, distributed, & fault tolerant. Let me break that down.

Masterless means that in a cluster of Riak nodes, there is no primary or master node. All nodes are equal. This means it doesn't make sense to use Riak as a single node; Riak should be run with at least three nodes per cluster. Any one of these nodes can go down without hurting anything. You can fix the node, bring it back up, and it will automatically begin participating in the cluster again. Riak uses a "gossip" protocol to spread all the data around among the nodes in the cluster.

Distributed just means the data is replicated among the nodes without any node being a single point of access or failure. If one of the nodes is not present, you can just ask another node. This reduces the need for traditional backups since the data is replicated across multiple nodes in the cluster.

Riak is fault tolerant because there is no single point of failure. If a node fails, you can just fix it and bring it back. Or, you could create a new node & add it to the cluster. If you need more performance, you can just continue to add more nodes to add capacity to the cluster.

There are a few drawbacks as well. One is that Riak is an eventually consistent database. This means you can store a value using one node but the other nodes in the cluster won't immediately have the value. It takes a bit to spread the data around in the cluster. Riak also does not have transactions, so there are no read or write locks to guarantee consistency. This means you need to use Riak with prudence. It is not suited for everything. For example, we would not want to use it to store Quote info since we need Quotes to be in a guaranteed consistent state. It is however very well suited for search documents & status report data since this info is either temporary or can be rebuilt.

The actual Riak storage interface is very simple. First you have the concept of "buckets" to add whatever namespace you need to your data. So you could have a "person" bucket to store all the data about a person. Within a bucket, everything is just a key & a value. The key is just a string, and the value can be anything. It could be plain text, XML, JSON, etc., or it could be binary data like an image or PDF file.
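A local stand-in makes the model concrete - plain dictionaries, no real Riak client (Riak itself exposes the same shape over HTTP, with buckets and keys in the URL):

```python
class ToyKeyValueStore:
    # In-memory stand-in for a bucket/key/value store. Real Riak speaks
    # HTTP or Protocol Buffers and replicates values across nodes; the
    # interface shape is the same: bucket + key in, opaque value out.
    def __init__(self):
        self._buckets = {}

    def put(self, bucket: str, key: str, value) -> None:
        self._buckets.setdefault(bucket, {})[key] = value

    def get(self, bucket: str, key: str):
        # Returns None when the bucket or key doesn't exist.
        return self._buckets.get(bucket, {}).get(key)

store = ToyKeyValueStore()
store.put("person", "42", '{"name": "Ada"}')     # JSON text value
store.put("label", "logo.pdf", b"%PDF-1.4 ...")  # binary value
```

Note the values: one bucket holds JSON text, another holds raw PDF bytes, and the store doesn't care - which is exactly the flexibility described above.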

We are using, and will be using, Riak for several things. Currently we use it to store customer status data that comes in from our Nagios monitoring system. This creates quite a lot of data, since a status report comes in from each customer every 5 minutes. Riak has no trouble handling the volume since it can write very fast.

The second use is for search. Riak has a built in full text search engine called Riak Search. Once the 4.6.16 release goes out, the search documents will be generated and stored into Riak. We took our existing High Speed Search Interface and created a Riak implementation. The search document is a JSON value. All the JSON fields are indexed automatically by Riak and we can search on any combination of the JSON fields.
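The index-every-field behavior is easy to picture: every field of the JSON document becomes a searchable term, and a query is just a combination of field/value constraints. A toy sketch (this is the shape of the idea, not the Riak Search API, and the documents are invented):

```python
def matches(document: dict, **criteria) -> bool:
    # True when every requested field/value pair appears in the document.
    return all(document.get(field) == value
               for field, value in criteria.items())

def search(documents: list, **criteria) -> list:
    # Query on any combination of JSON fields, mimicking how Riak Search
    # indexes each field of a stored JSON document.
    return [d for d in documents if matches(d, **criteria)]

docs = [
    {"id": "1", "type": "quote", "client": "Acme", "status": "open"},
    {"id": "2", "type": "quote", "client": "Acme", "status": "closed"},
    {"id": "3", "type": "pull-sheet", "client": "Acme", "status": "open"},
]

hits = search(docs, client="Acme", status="open")
```

The real engine does this with inverted indexes rather than a linear scan, but the query model - exact matches on any field combination - is the same.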

We believe that Riak will create a more robust search system that will make it easier to find your data. Our prior search implementation was based on Lucene, which seemed too search-engine-like (fuzzy searches, ranking, etc.) for our purposes. Riak Search is more straightforward and should just find the data you are looking for.

Friday, October 25, 2013

IOS Full Throttle and the Open Source Office Mates

Flex is growing like crazy.  We're trying to simultaneously add new features like ROI and Multi Session while refactoring our core system to run in a multi-tenant fault tolerant architecture.  We're also trying to minimize the amount of front end work we do in Flash since we plan to start work on a complete HTML 5 rewrite of the front end shortly.  With all this stuff going on, the can that keeps getting kicked down the road is IOS apps.

We've done a proof of concept for iPad/iPhone apps, but that's about as far as it's gone.

We've always maintained that the lack of Flash compatibility on IOS is a blessing in disguise because it forces us to address the mobile user experience directly instead of just letting Flex be a web app that runs in Safari (although it will be eventually).  A mobile app should be designed expressly for the form factor of a mobile device: for thumbs, not fingers.

We've made a lot of decisions about our approach to IOS: native instead of web based, device specific HMAC authentication, bar code scanning via the camera or bluetooth scanner, etc.  But what we haven't had is the time to really work on it given all the other things going on.
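The HMAC idea is worth a sketch: each device holds a secret key, requests are signed with a keyed hash, and the server recomputes the signature to verify the sender. The string-to-sign layout below is invented for illustration (a real scheme would also bind a timestamp or nonce to defeat replay attacks):

```python
import hashlib
import hmac

def sign_request(device_secret: bytes, method: str, path: str,
                 body: bytes) -> str:
    # Hypothetical string-to-sign: method, path, and body joined by
    # newlines. Real schemes also include a timestamp or nonce.
    message = method.encode() + b"\n" + path.encode() + b"\n" + body
    return hmac.new(device_secret, message, hashlib.sha256).hexdigest()

def verify_request(device_secret: bytes, method: str, path: str,
                   body: bytes, signature: str) -> bool:
    expected = sign_request(device_secret, method, path, body)
    # Constant-time comparison avoids leaking the signature via timing.
    return hmac.compare_digest(expected, signature)

secret = b"per-device-secret"
sig = sign_request(secret, "POST", "/api/scan", b'{"barcode": "100001"}')
```

Because each device gets its own secret, a compromised or lost device can be revoked without touching any other device's credentials.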

Erik The Viking

That's why I'm pleased to announce that on November 1, Erik Ralston will be joining our team as a Lead Software Development Engineer - and focused (almost) exclusively on mobile applications.  He'll be joining us on site in our new engineering office in Richland, Washington and he comes to us by way of Danish CMS consultancy Addisolv.   Prior to that he worked on super secret projects for Pacific Northwest National Laboratory and did a post college stint at Seagull Scientific working on their popular BarTender label printing software. And the guy has an actual Computer Science degree, so I look forward to having long discussions with him about graph theory, finite state automata and logical algebra to make up for my lack of technical education (My major may have been technical theatre, but that doesn't mean I have a technical degree).  He's also one of the organizers of Tri-Cities Startup Weekend and active in a whole slew of technology meetup groups around town.  He's a good guy to even know and we're thrilled that he'd sully his reputation by deigning to write mobile apps for us roadies full time.

The Benefits of Elbow Abrasion

Now Erik has been contracting with us for a bit already.  He's one of a number of great developers and designers we've met since opening up The Flex Code Space, which is what we call our new engineering office.  Our new space has seven private developer offices and a cozy lounge / conference space for meetings.  We didn't need all the offices right away, so we decided to rent private offices out to freelance developers and designers until we grow into our space.  The offices got snapped up almost immediately, and the arrangement has made our office a regular stop on the Tri-Cities coworking circuit.  We've hosted meetings of the infamous Doctype Society, hackathons, technical and business talks put on by Room2Think and even the odd play reading.

It's been great for Flex to have regular interaction with all these smart creative people.  We met Erik and a number of other people who've made their brains available for picking.  We've also enlisted their help in our mobile effort and the open source initiatives we've launched to lay the technical foundation for our high availability architecture.

Open Source Roll Call

We've launched two open source projects in recent months and contracted with some of our office mates to help with them.

blockd

One of these projects is blockd, an open source lock server intended to help us maintain concurrency for barcode scans and other atomic operations once we start running Flex on multiple load balanced servers.  Source and some basic documentation are on GitHub here.
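The concept behind a lock server can be sketched in-process: named locks that serialize atomic operations like a barcode scan, so two workers can't both commit the same scan. blockd's real value is doing this over the network so multiple app servers share the locks; this is just the idea in miniature:

```python
import threading

class ToyLockManager:
    # In-process stand-in for a lock server: one named lock per resource.
    # blockd provides the same semantics over the network, so several
    # load balanced app servers can coordinate.
    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def lock_for(self, name: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

manager = ToyLockManager()
scanned = []

def scan(barcode: str):
    with manager.lock_for(barcode):
        # Critical section: check-then-record is atomic per barcode,
        # so a double scan of the same item can't be double-counted.
        if barcode not in scanned:
            scanned.append(barcode)

threads = [threading.Thread(target=scan, args=("100001",)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Per-resource locks matter: scans of different barcodes proceed in parallel, and only scans of the same item queue up behind each other.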

Erik wrote the server and node client while another office mate, a Java veteran named Brian Manley, is working on the Java client we'll eventually use in Flex.

alto

The other big open source project we're sponsoring is called alto, a library intended to provide Java developers with a pile of tools for writing multi-tenant high availability Java web applications.   I've done a lot of work on alto myself and recently enlisted Brian's help writing a multi-tenant JDBC driver.
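One core trick in a multi-tenant driver is routing: resolve the current tenant to its own database before handing out a connection. A sketch of that routing idea (tenant names and DSNs are invented, and alto's actual driver is Java/JDBC; this only illustrates the lookup step):

```python
class ToyMultiTenantDriver:
    # Maps each tenant to its own database DSN. A real multi-tenant JDBC
    # driver resolves the tenant from request context and returns a live
    # connection; here we just return the DSN it would route to.
    def __init__(self, dsns: dict):
        self._dsns = dsns

    def connect(self, tenant: str) -> str:
        try:
            return self._dsns[tenant]
        except KeyError:
            # Fail loudly: silently falling through to a default database
            # would be a cross-tenant data leak waiting to happen.
            raise LookupError(f"unknown tenant: {tenant}") from None

driver = ToyMultiTenantDriver({
    "acme-av": "mysql://db1.internal/acme_av",
    "stage-co": "mysql://db2.internal/stage_co",
})
```

The rest of the application code never names a database directly - it asks for a connection in the current tenant's context, which is what makes consolidating many customers onto shared hardware safe.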

The source and documentation for alto is available on Github here.

Now we're kicking around the idea of writing an Atlassian Crowd client in Erlang and adding Crowd support to RabbitMQ.  That's a pretty obscure one, even for the open source world, but it would make our own lives a little easier as we roll RabbitMQ out across our install base.

Flex is pleased to be sponsoring all these open source projects but our motives are pragmatic.  These tools solve problems for us and keeping them open source holds out the possibility that they might evolve and grow, perhaps spreading out the maintenance burden should the projects become successful.  The chief benefit of giving away our tools is that the people who use them might improve them for us.

But success in the open source world requires its own kind of marketing, and while we'd be thrilled if the projects we're sponsoring gain widespread adoption, we aren't going to actively chase open source glory - not just yet.

The Long Neglected User Experience

The Code Space office mate with the most aesthetically coherent office is predictably a designer, and a rather good one named Doug Waltman.  In addition to his freelance work, he's been working with Flex on UI/UX designs for our IOS apps.  Here's a little preview:


We've never really had the resources before now to incorporate graphic and user experience design into our development process.  Having Doug down the hall is going to be a big help to us as Erik works through all the UI problems we'll have to solve for the small screens that run IOS.  He'll also be a big help as we tackle the HTML 5 rewrite.

Having a usability advocate and someone with a real sense of design in the loop, even if only on a freelance/contract basis, makes us very excited about the IOS apps and future versions of Flex.

Who knows, maybe our web site and this blog might even look presentable some day.

Honorable Mention

We have a few more office mates who have contributed nothing whatsoever to Flex, but since they're good company and their checks clear, I'd hate for them to feel left out.  And those maligned souls would be Nate and Joe, both masters of Javascript Ninjitsu, 3D printing and indoor agriculture.

I typically start to roll my eyes when people throw around words like "community" (because I think it's a word authoritarians use to mask their control), but I have to reluctantly admit that I might be part of one now.  It's only a matter of time until Flex and its users start reaping the benefits.

If you need me, I'll be in the bathroom throwing up rainbows.





Tuesday, July 23, 2013

Telecommuters No More

For a long time we swore we'd never open an office, but circumstances have overtaken us.  We're now the fastest growing rental software company in the market and by some estimates the largest.  It's odd for a small group of folks who think of themselves as the scrappy underdogs to find ourselves in that situation, but there we are.

When a company passes through the startup and validate-the-market phase of the business, old assumptions about roles and processes have to mature and evolve.  The way you do business when you have three hundred customers is quite a bit different than when you had only twelve.

I for one think we've missed a lot of opportunities over the last year.  We've let old problems fester for too long and allowed ourselves to get so mired down in the day to day that we've neglected some strategic planning.  This summer has been spent working through some of those issues and we're ready to make a few announcements.

A New Cloud

We've recently inked a deal with C7 Data Centers in Utah to move our entire North American server infrastructure from the Amazon Cloud to a much faster VMWare based cloud.  This is the first of several steps toward a next generation deployment environment which will eventually include a hybrid of dedicated and virtual servers - and a multi-tenant software architecture.

Testing of the new cloud begins next week and full migration of the production network should be completed by September 1.  Customers won't have to do anything or even notice in most cases.  Things will work just as they always have, hopefully just better and faster.

Faster Availability

This week QA approved the release of Flex 4.6.15, our first major release in several months.  This release includes a major redesign of the availability engine as a first step toward addressing performance issues.  The next step in our quest for a faster Flex will involve refactoring line items to use direct JDBC instead of Hibernate.  After that, we'll turn our attention in earnest to high availability and multi-tenancy.
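The Hibernate-to-direct-JDBC motivation fits in one sketch: skip the ORM's object machinery and map rows by hand. Using SQLite as a stand-in here (Flex itself is Java on MySQL, and the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE line_item (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER)"
)
conn.executemany(
    "INSERT INTO line_item (name, qty) VALUES (?, ?)",
    [("SM58", 4), ("Truss 10ft", 12)],
)

# Direct row mapping: one query, one tuple-to-dict step. No ORM session,
# no dirty checking, no lazy-loading proxies - just the data.
rows = conn.execute(
    "SELECT id, name, qty FROM line_item ORDER BY id"
).fetchall()
line_items = [{"id": r[0], "name": r[1], "qty": r[2]} for r in rows]
```

An ORM earns its keep on complex object graphs; for hot paths that load thousands of line items, cutting out the mapping layer is where the speed comes from.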

Brick and Mortar Flex

We're also pleased to announce that our first physical office will open on August 1 in Richland, WA.  Called the Flex Code Space, this office will serve as a base for Flex's research and development efforts for the next three years.


Formerly the offices of &yet, a great local company and friends of Flex, this 1700 square foot office will allow Flex to recruit and train a new crop of software engineers to help with our nascent mobile efforts and our plans to retire Flash.

While our location in Richland, Washington may not be Silicon Valley, we are spitting distance from Pacific Northwest National Laboratory, making Richland home to the largest per capita number of PhDs in the country.  All these scientists and engineers in a small, relatively isolated rural community give Richland a culture that simultaneously values academic rigor and the simple pleasures of family life.  We're hoping to build a happy meritocracy at Flex and we think basing the engineering team in Richland gives us a good chance of doing that.

Decorate Our Office

We want our engineers and candidates to know what our customers are all about, see firsthand the cool things our clients do with our technology.  As a small part of this effort, we'd like to invite our customers to help us decorate the walls.


As you can see, the walls are pretty bare.  What we'd like to do is have our clients submit high resolution photos of their work: big outdoor rigs, arena rigs or any shot of a job you're particularly proud of - especially if it includes pictures of you or your people.  We'd also like to see Flex at work in the warehouse: maybe your Flex terminal with mountains of flight cases, truss and equipment in the background.  Maybe an action shot of you with scanner in hand, prepping the next big show.

We'll take the most visually striking images, have them printed on large canvases or framed and hung in the new office. 

So if you have an old archive of photos of your team or gigs, feel free to send over the highest resolution files you have to jeff [at] flexrentalsolutions [dot] com.  Or, if you have a nice camera on the shelf, head on out to the shop and take a few quick shots for us.






Friday, June 21, 2013

The Pitfalls Of Virtualization

This week we took a little road trip down to San Diego and helped one of our clients migrate from our cloud architecture to their own in house server.  Not many customers choose this option, preferring the convenience of the cloud, but we're always happy to support anyone who does.

I've always cautioned customers to have realistic expectations with regard to performance when migrating from the cloud to an in house server, assuming that most performance issues are in the code and not the hardware.  We decided to use this particular install as an opportunity to conduct a real test of this theory and were fairly blown away by the results.

We started by loading a large quote on the EC2/cloud based system and executing some common workflow tasks: adding line items, generating pull sheets, scanning things out and back in.  We did all this while remotely profiling the Java virtual machine. 

The Sawtooth Pattern

I wish I'd thought to grab a screen shot at the time, but while quotes were loading we saw a sawtooth pattern in CPU utilization at regular intervals.  This reflects the underlying nature of virtualization.  One physical server hosting multiple virtual servers must find a way to allocate CPU time between them, and it usually does this the same way a multi threaded operating system allocates CPU time to threads - through time slicing.  Server A can have the CPU for a few hundred milliseconds, then Server B can have it, and so on.

Real Servers

Then we ran the same tests on the real server - an 8 core HP Proliant with 32GB of memory and software RAID - and with exactly the same versions of Flex.  The results were astonishing.  The client felt as though the system was more than twice as responsive, and our measurements confirmed that the dedicated server was 2-10 times faster than the Amazon cloud, depending on what operations the user was doing.  We expected an improvement of 20%-50%, not 200% to 1000%.

Now What?

This information is too compelling not to act on.  This is the summer of speed, after all, a time when we're almost exclusively focused on increasing the speed of Flex.  We have some big software related speed improvements coming in Flex 4.6.15 that are now in testing, and we've now discovered an opportunity to couple those improvements with the speed boost that comes from moving to dedicated hardware.

We've already started the process of moving away from virtual servers in Europe, where we run two production servers that support our European and South African customers.  We're working with a data center located in Roubaix, France and have moved several EU customers from the Amazon data center in Dublin to France.

Building a New Cloud

Unfortunately, that leaves 30+ servers in the United States to support everyone else.  Moving that many servers is a much larger project.  We've opened up talks with several Tier 4 data center operators in North America about moving our entire remaining infrastructure to dedicated hardware.  We also plan to use this opportunity to beef up security, split the application and database load into separate servers, and introduce a load balancer.  Our new proposed architecture for North America is shown in the diagram below:



The final version of our new network will likely be a bit different as we incorporate feedback from networking and security consultants, but the general idea of a front side network for app servers with a backside network for database I/O will probably survive the process.

Another Look At Self Hosting

The vast majority of our customers use our cloud hosting option.  We've always supported a traditional self-hosted site license deployment model, but have never really pushed it as most customers seem to prefer a modest monthly payment to the up front costs associated with servers and software licenses. 

But our recent experience has prompted us to reevaluate our tendency to downplay self-hosting.  We're currently putting together pricing for servers with Flex preinstalled along with a few variations of on-site help getting up and running.  I believe Chris will be reaching out to the customer base soon with all the details.

A Bird In The Hand

It's never fun having problems with the software, but I'd rather know about them and know the reasons than not know, even if the solution is sometimes complicated and time consuming.  Moving to dedicated high performance hardware seems like a no-brainer given the scale we're running at these days.  We should ink a deal with the new data center by mid July and move everyone over to the new cluster shortly thereafter.

For our European and South African customers, the necessary hardware is already up and running.  We should have those customers still running on the Amazon cloud fully moved over by Monday morning.

Thursday, May 16, 2013

Code For Us

We're Hiring

We have a lot of big projects planned for the next few years at Flex: Advanced crew scheduling, multi-session event planning, the high availability cloud architecture, mobile applications and a full HTML 5 rewrite of our front end.  And we'll need help.  A hardcore Java nerd with production industry experience would be ideal.

Here are the details:

What We're Looking For

Flex Rental Solutions is seeking software engineers of all experience levels to help support a rapidly growing customer base around the world.  At Flex, you'll work on technology that powers much of the world's live entertainment and corporate events.  From dynamic cloud computing architecture to mobile applications to rich internet applications, a career at Flex provides a wide array of stimulating challenges for software engineers.

You'll receive a competitive compensation package and will work from the comfort of your own home (although occasional travel may be required for training, conferences or on-site work with customers.)

Above all, Flex looks for a rare combination of intelligence and maturity, a willingness to adapt to changing circumstances, and a focus on the customer's needs.  Our customers work in a fast paced, demanding industry and an ability to empathize with customers working under stressful conditions is essential.

Flex is currently looking for candidates with some or all the following skills and qualifications:

Education: A BS in Computer Science or equivalent experience.

Core Languages:  Java, Python, Objective-C, Javascript (client side), Actionscript, some Bash scripting

Theory: Basic knowledge of statistical methods, set and graph theory.  College Coursework that includes Discrete Mathematics is a plus.

IDEs: Eclipse, Xcode

Build Tools: Maven 3 and Gradle

Java Frameworks and APIs: Spring, Spring-MVC, Spring Security, Hibernate, JAXB, Jakarta Commons

Platforms: Linux (Ubuntu), EC2, Jetty, Memcached

Databases: MySQL, MongoDB, Cassandra

Web Technologies: HTML, CSS, Javascript, jQuery, jQuery-ui, angular.js, socket.io

Mobile Platforms: IOS and Android

Interested candidates should send a resume and cover letter to jeff [at] flexrentalsolutions [dot] com.

About Flex Rental Solutions

Flex Rental Solutions is an award winning provider of financial and inventory management solutions for the live event industry.  With an international roster of customers in the corporate a/v, concert touring, television and film industries, Flex technology manages equipment and crew scheduling for nearly 300 equipment rental and production houses worldwide.  In a few short years Flex has catapulted from newcomer to industry leader - with the industry's first web based and only cloud based ERP system for event production.

Friday, May 10, 2013

Fixing The Double Conflict Filter

The main priority right now at Flex is making things fast.  There are things we did when we originally designed Flex (and Shoptick before it) that we knew were suboptimal, but at the time we felt going all the way would divert resources from more pressing concerns - and, more importantly, would have required wildly optimistic assumptions about our company's future prospects.

It would have been akin to a company that makes irrigation valves overdesigning their products on the off chance they might be used in a nuclear power plant.

We never expected Flex to get so big so quickly and to have customers with as much concurrent throughput as we have now.  We'd always hoped to get there, but when you're starting a company you have to focus on the here and now, on the needs of the customers you have - and keep your goals realistic.

This is why we chose Hibernate as an ORM framework instead of building our own micro-optimized persistence layer.  Hibernate is ubiquitous, fast to develop in, and although not as performant as rolling your own, usually good enough.

Whatever our reasons were initially, the landscape has changed and the time for hand optimized persistence and caching code has come.

Faster Availability

Much of the work we've done so far on system performance has related to the scan process.  We've fine-tuned away many of the scan bottlenecks (the .14 and .15 releases include a lot of scan performance improvements) and now our attention turns to availability calculations.

The Flex availability calculation process is very complicated, but there are two main phases governed by separate modular components: The Conflict Engine and The Availability Engine.

The Conflict Engine's job is to retrieve and process line items from the database that might be relevant to an availability calculation.  The Availability Engine then takes the output from The Conflict Engine and applies all the ship, return, container, subrental, transfer, and expendable return logic to produce a final result.

The purpose of this design was to isolate the I/O intensive part of the calculation in one place (The Conflict Engine) and leave The Availability Engine to focus on relatively high speed in memory computation.

We've known for some time that the bottleneck in availability performance is The Conflict Engine, and we've learned over time that the database query used to retrieve line items is fast.  The work Hibernate does to turn that query into line item objects, however, is not.

Another main bottleneck in The Conflict Engine is what we call The Double Conflict Filter.  This filter's job is to remove related line item entries from the conflict result; otherwise an item might be counted more than once.  Consider the following graph of a typical line item relationship in Flex:

This shows a pretty conventional process where a line item is placed on a quote, the pull sheet is generated, and as the show is scanned out, two manifest line items are created with the specific serial numbers.

But there are four line items in the system referencing the console for a total quantity of 6 conflicts - when only two consoles are actually in use.  In Flex, we address this problem by assuming that only the line items furthest downstream in the workflow are in control of availability.  This deals with the double conflict problem, but is also intended to handle the problem of the plan diverging from reality.

What happens when the L1 makes a judgment call in the warehouse and decides to take only one console, or to take a lower end console as a spare?  If we went solely off the plan, other shows would show a shortage and you might end up subrenting a piece of gear that you had sitting on the shelf the whole time.

The performance problem with this approach is that the current algorithm uses recursive I/O to crawl down the downstream object graph, necessitating a database hit for each level of the graph.  This is slow.
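The "furthest downstream wins" rule itself is simple once the graph is in memory.  Here's a minimal sketch - class and field names are illustrative, not Flex's actual domain model - showing that only leaf line items (those with no downstream children) contribute their quantities to the conflict total:

```java
import java.util.*;

// Illustrative sketch of the double conflict filter's core rule: only the
// line items furthest downstream in the workflow control availability.
class ConflictFilter {

    static class LineItem {
        final String id;
        final int quantity;
        final List<String> downstreamIds = new ArrayList<>();

        LineItem(String id, int quantity) {
            this.id = id;
            this.quantity = quantity;
        }
    }

    // Only line items with no downstream children "control" availability;
    // upstream quote and pull sheet lines are filtered out so the same
    // console isn't counted two or three times.
    static List<LineItem> filterDoubleConflicts(Collection<LineItem> items) {
        List<LineItem> controlling = new ArrayList<>();
        for (LineItem item : items) {
            if (item.downstreamIds.isEmpty()) {
                controlling.add(item);
            }
        }
        return controlling;
    }
}
```

In the console example above, the quote line (quantity 2) and pull sheet line (quantity 2) are filtered out, leaving the two manifest lines for a conflict total of 2 instead of 6.  The hard part, as described below, is retrieving the graph without recursive I/O.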

Bypassing Hibernate

The first step (already completed) is to bypass Hibernate in the conflict engine.  We did this by replacing the fully mapped line item objects with a simple lightweight DTO that contains only the line item fields relevant to an availability calculation (fields like location, ship date, return date, etc.).

Instead of running a database query to pull back a list of line item ids and feeding those ids into Hibernate for hydration, all the fields come back in one query and get copied directly onto the DTO.  This was pretty straightforward.
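The shape of that DTO looks roughly like this - field names are illustrative, and the row is modeled here as a Map the way a JDBC RowMapper would see a ResultSet, so the sketch stands on its own without a database:

```java
import java.util.*;

// Sketch of the lightweight DTO approach: one query returns only the
// columns the availability calculation needs, and each row is copied
// straight onto a plain object with no Hibernate hydration involved.
class LineItemDto {
    long id;
    long locationId;
    Date shipDate;
    Date returnDate;
    double quantity;

    // In production this would live in a RowMapper reading a ResultSet;
    // a Map stands in for the row here.
    static LineItemDto fromRow(Map<String, Object> row) {
        LineItemDto dto = new LineItemDto();
        dto.id = (Long) row.get("id");
        dto.locationId = (Long) row.get("location_id");
        dto.shipDate = (Date) row.get("ship_date");
        dto.returnDate = (Date) row.get("return_date");
        dto.quantity = (Double) row.get("quantity");
        return dto;
    }
}
```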

Adjacency Matrix

The next step is to reform the double conflict filter by getting rid of the recursive I/O needed to retrieve the graph (I/O buried inside Hibernate, I might add).  To accomplish this, we're introducing a persistent adjacency matrix to represent the upstream/downstream relationships.  We also decorate this adjacency matrix with the status of the downstream line item and whether or not the status is conflict creating, which saves yet another Hibernate lookup.

Each line item has a reference to the adjacency matrix used to represent upstream/downstream line items and we can retrieve all the relationships in a single query - and store them in a small and easy-to-cache object.  The caching will further reduce database I/O.
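In-memory, the decorated adjacency structure might look something like the sketch below (names are illustrative).  The point is that each edge already carries the downstream item's status and a conflict-creating flag, so no per-edge lookup is needed:

```java
import java.util.*;

// Sketch of the decorated adjacency structure: every upstream/downstream
// edge carries the downstream item's status and whether that status
// creates a conflict, so the filter never goes back to the database.
class AdjacencyMatrix {

    static class Edge {
        final long downstreamId;
        final String status;            // e.g. "PREP", "OUT", "CANCELLED"
        final boolean conflictCreating;

        Edge(long downstreamId, String status, boolean conflictCreating) {
            this.downstreamId = downstreamId;
            this.status = status;
            this.conflictCreating = conflictCreating;
        }
    }

    private final Map<Long, List<Edge>> edges = new HashMap<>();

    void addEdge(long upstreamId, Edge edge) {
        edges.computeIfAbsent(upstreamId, k -> new ArrayList<>()).add(edge);
    }

    // True if availability control has moved further down the workflow,
    // i.e. some downstream line item is in a conflict-creating status.
    boolean hasConflictingDownstream(long upstreamId) {
        for (Edge e : edges.getOrDefault(upstreamId, Collections.emptyList())) {
            if (e.conflictCreating) return true;
        }
        return false;
    }
}
```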

Local Caching / Distributed Cache Version Control

We're also introducing a new cache version feature that will be needed for high availability Flex.  We learned a few months ago when we first started playing around with memcached that serializing our domain objects for the trip over the wire to memcached and back was slow.  Not as slow as not caching at all, but still less than ideal.  It would also necessitate a large cache, and since we'll be using Amazon's ElastiCache, this comes with a price tag.

What we decided to do was stick with an in memory cache, but use memcached to help us know when an object cached in memory was stale (modified by another server in the cluster).  We do this by giving cacheable objects the ability to implement an interface whose single method returns a SHA-1 hash that represents the version or state of the object.  It could be a message digest based on the properties of the object, or, when generating a string to base the digest on would be too expensive, it could simply be a unique (and persistent) hash that changes when the object is mutated (like a git commit).

The cache lookup code will pull the object it has in the local cache and compare its version hash to the one in memcached for the same cache key (and do this in separate parallel threads).  If they match, all is well, and the cached object is returned.  If they don't match, then the in memory object is no good and the cache lookup will return null.
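Here's a minimal sketch of that lookup, with memcached stubbed as a Map and the two fetches done sequentially for clarity (the production version does them in parallel threads).  The interface and class names are illustrative:

```java
import java.util.*;

// An object that can report a hash representing its current version/state.
interface VersionHashed {
    String versionHash();  // e.g. a SHA-1 digest over the object's state
}

// Sketch of version-checked local caching: the local in-memory cache keeps
// the full object, while a shared store (memcached in production, a Map
// here) keeps only the current hash. A mismatch means another server in
// the cluster changed the object, so the local copy is discarded.
class VersionedCache<T extends VersionHashed> {
    private final Map<String, T> local = new HashMap<>();
    private final Map<String, String> sharedHashes;  // stand-in for memcached

    VersionedCache(Map<String, String> sharedHashes) {
        this.sharedHashes = sharedHashes;
    }

    void put(String key, T value) {
        local.put(key, value);
        sharedHashes.put(key, value.versionHash());
    }

    // Returns the local copy only if its hash still matches the shared
    // one; otherwise evicts it and returns null, signalling a stale entry.
    T get(String key) {
        T cached = local.get(key);
        if (cached == null) return null;
        if (cached.versionHash().equals(sharedHashes.get(key))) return cached;
        local.remove(key);
        return null;
    }
}
```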

A side effect of this approach is that it could lead to memory churning (lots of space-related cache evictions) if our high availability architecture uses round robin load balancing.  We'll try to optimize the architecture by using hostname affinity for load balancing, with session affinity as a last resort.  Unlike most "sticky session" based load balancing, this approach will just be a performance optimization and shouldn't impact reliability or failover.  We don't rely on HTTP sessions, so losing session variables won't hurt the user at all.

Wrap It Up, Kid

This work is currently slated for release in Flex 4.6.16, although given the risk and magnitude of the change, that version number could slip to .17 or .18.

The single slowest part of Flex has always been availability calculations, and the results from this batch of enhancements have been very encouraging so far.  We haven't done a formal comparison with unoptimized versions of Flex yet, but on my machine the availability calculations appear to be virtually instantaneous.  Here's hoping the fix holds up in regression testing - and that the performance boost does, too.

Wednesday, May 1, 2013

Version Numbers, Part 2, Part 1.

Last night we released the first version of Flex with a fourth decimal place in the version number: 4.6.13.2.

Believe it or not, the different decimal places in version numbers do have meaning and our new numbering scheme reflects some underlying changes in how we manage source code and version numbers at Flex.

First off, we recently changed our source control system from Subversion to Git.  As part of this conversion, we restructured our projects and modules to make it easier to branch and merge source code.

Source Control

Under our new regime, we have three main branches of code we work on simultaneously: master, dev-minor and dev-major.  Master is the Git equivalent of Subversion's trunk and for us represents the maintenance or emergency branch.  Most new work happens in other branches, leaving master relatively pristine and unchanged since the last release.  If we're in the middle of coding a big new version of Flex and a severe bug suddenly crops up, we can fix the bug in the master branch and push a release without pushing all the risky new work we're doing along with it.  This allows us to respond to emergencies and push small changes much faster.

Most of the work happens in a branch called dev-minor.  We now do the majority of regular work in this branch and only when we're ready to build a release candidate does the new work get merged over to the master branch for regression testing and release.

Work we consider to be risky or potentially time consuming is done in yet another branch called dev-major.  For example, if we have to rip out parts of the availability engine and rebuild it for performance - a task with the potential to run awry and delay the schedule - we do this in the dev-major branch where it can't delay the work going on in dev-minor.

A Version For Every Branch

The code in each branch has a different version number.  The three digit version numbers we're all used to (e.g. 4.6.14) come out of dev-minor.  Once a routine version has been released, if we find we have to push an emergency or maintenance release out of master, that version has an extra digit tacked on the end like this: 4.6.13.2.  That number tells you that this version was the second emergency release after 4.6.13.

The dev-major branch will get a version like 4.7.0 or even 5.0.0, but it's possible that work done in the dev-major branch could get merged into the dev-minor branch without the version number going with it.  In fact, this happened with version 4.6.14 of Flex.  A number of experimental performance enhancements were done in dev-major and once they were complete and stabilized these changes were merged into the dev-minor branch for the next standard release.

The Point

This may seem confusing to those unfamiliar with source code version control systems, but the point of all this complexity is to make it easier for us to get our work out sooner.  The schedule is no longer at the mercy of its most complex task.  We can isolate the time consuming or risky things in their own branch and get other functionality out faster.  We can also respond faster to emergencies or minor maintenance issues.


This process is already starting to work.  Version 4.6.13.2 went out last night, but version 4.6.14 is almost ready and work on version 4.6.15 started today.  We were also able to retire our old release candidate numbering system in favor of true release candidate builds. 

Monday, April 1, 2013

The Perils of Null

Over the last week or so we've been focused on bug fixes and the next round of performance improvements - with the assumption that we'll be bypassing Hibernate for the persistent objects we use most frequently.

We started by taking one of our high volume power users and profiling the system with their data.  Right away, the profiler homed in on two major bottlenecks: the first concerned how the relationship between Serialized Containers and their Non Serialized Contents is stored (using an association class called SerialNumberContent).  The other concerned traversing the object model to resolve things like status and date inheritance.

We started by focusing on the contents issue, since it was chewing up the most CPU cycles - and was the more puzzling of the two, because there's no obvious reason why this special relationship would cause performance problems.

ORM'ectomy

We knew from profiling the system that the bottleneck for the contents issue was somewhere inside Hibernate.  So we decided to use this simple association class as a test case for migrating objects out of Hibernate down to lower-level JDBC (we're using Spring's JdbcTemplate).

The first step was to snip all the connections between SerialNumberContent and the rest of the system.  This meant removing one-to-many references from SerialNumber and InventoryItem and removing object references from the association class, replacing them with plain id fields.

Then we moved code for fetching relationships from the domain to the service layer, and these service methods would then invoke the underlying DAO.

Once we successfully isolated the association class - but before removing it from Hibernate - we fixed all the ripple effects and retested.  We noticed a roughly 35% speed boost just by isolating the object.  This was good, of course, but important only from an academic perspective because we needed a 1,000% or 10,000% speed boost.

What we learned is that Hibernate never caches a null result: if a query returns no rows, nothing gets cached, and the same query hits the database again the next time around.  In our use case, this meant that the fewer relationships between serialized containers and non-serialized items there are, the slower a system will run.  It seems counterintuitive that NOT using a feature would cause that feature to introduce performance bottlenecks, but that's exactly what happened.
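The standard fix for this class of problem is to cache a sentinel value standing in for "no rows", so absence becomes as cheap to look up as presence.  A minimal sketch (the class, method names, and instrumentation counter are illustrative, not our actual DAO):

```java
import java.util.*;

// Sketch of null-result caching: a sentinel object is stored when a query
// comes back empty, so repeated "this container has no contents" lookups
// never touch the database a second time.
class NullCachingLookup {
    private static final Object EMPTY = new Object();  // sentinel for "no rows"
    private final Map<Long, Object> cache = new HashMap<>();
    int databaseHits = 0;  // instrumentation for this sketch only

    @SuppressWarnings("unchecked")
    List<String> findByContainerId(long containerId) {
        Object cached = cache.get(containerId);
        if (cached == EMPTY) return Collections.emptyList();
        if (cached != null) return (List<String>) cached;

        List<String> fromDb = queryDatabase(containerId);
        cache.put(containerId, fromDb.isEmpty() ? EMPTY : fromDb);
        return fromDb;
    }

    // Stand-in for the real query; in our data, most serialized containers
    // have no non-serialized contents at all.
    private List<String> queryDatabase(long containerId) {
        databaseHits++;
        return Collections.emptyList();
    }
}
```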

Next, we developed a new base Data Access Object based on Spring's JdbcTemplate and created a new DAO for SerialNumberContent.  We swapped it out, debugged it, and retested.  Lo and behold, we got a 1,000%+ speed boost on that portion of the code.  The serial number content portion of the availability calculation dropped way down the list of CPU consumers in the profiler, to the point where something that had previously taken 220 seconds to run now takes 1.5 seconds.

Custom Cache

This was a great result, but we knew it was still inefficient because we were running a live database query every time we looked up a relationship.  There was no caching involved at all.  Depending on the situation, this isn't a bad thing, but we knew in this case that we were running the same queries over and over again.  The information that this association class represents doesn't change all that often - it needed to be cached.

We added some basic cache support to our base DAO class (using Ehcache) so that our basic save(), delete() and findById() operations all use the cache.  Then, in the new concrete DAO, the most frequently called methods (findByContainer() and findByContainerId()) were modified to use a custom cache.

Once this was debugged and plugged into the profiler, that 1.5 seconds dropped to 60 milliseconds - making this portion of the calculation more than 3,000 times faster than where we started.

Ramping Up

This is good news because it proves that a "theory" about how to get big performance gains works in practice.  The next step is to apply this same technique to other performance bottlenecks in Flex - of which there are still many - but one less than there used to be.

Thursday, March 21, 2013

Going Native

Right now we're in a limbo period because we're waiting to shake out the kinks from our latest release before migrating our source code repository from Subversion to Git - and after that we'll branch the code so maintenance, minor revisions and major revisions can go on simultaneously.

So for the time being we want to keep the source tree clean so we can push quick maintenance releases.  This means that aside from bug fixes for the release, we needed a little side project to keep the team busy.

Picking a Mobile Architecture

The can that always seems to get kicked down the road around here (aside from multi-session) is mobile applications.  We talk about it a lot, and it always seems like it's just around the corner, but other things always seem to take priority.

We did some ground work for mobile as part of the last release in that we introduced Spring-MVC to the server side portion of Flex as a platform for building the REST API we'll need for mobile applications (and for the inevitable HTML 5 interface).

After much debate and hand wringing, last week we started in earnest on a version of Flex for iOS devices (iPhone, iPod Touch, and iPad).  For a long time we'd discussed exactly how to go about this.

Adobe, and now the Apache Software Foundation, supports tools for building iOS and Android apps in Flex/Flash and compiling them as applications for the native platform.  The reason many people run around saying Flash is dead stems from Adobe's announcement that they would not support Flash on mobile devices - which the technology press and the echo chamber took to mean Adobe wasn't supporting Flash at all, and which became, over the last year or so, a kind of self-fulfilling reality.  Among those who don't know what HTML 5 actually is - which is most people - Flash is dead.

It may not really be dead, but its future as a platform for mobile development looks pretty bleak, even though the Apache project that inherited the work from Adobe seems committed to mobile development.  It could have a revival, but at this point it's just too risky.

Another common approach to mobile apps is developing them as web applications, usually with jQuery and a healthy dose of Safari extensions, and deploying them as iOS apps.  For the user, it launches like a dedicated app, but without all the hassle of learning Cocoa, iOS, and Objective-C.

We thought for a long time that this is what we'd end up doing at Flex.  Then we read about the disaster Facebook's web based mobile app turned into.  If they can't get a fast, reliable mobile app going with all their resources, then little old Flex Rental Solutions doesn't stand a chance.

We also need some hardware support for things like bar code scanning, so we decided to bite the bullet and go native.  The first generation of Flex iOS apps will be written in Objective-C using Xcode.  It's a learning curve, to be sure, but so far it's going well.

We were able to get from an empty project to a first screen in about a day:


Beyond Bullet Points

Once we got the basic "Hello World" screen up and running, it was time to think about some architectural considerations.  One of the things we talk about a lot here is going beyond the minimum effort required to support a feature bullet point.  A key past example is Quickbooks integration.  Most of our competition has accounting integration, but it's poorly thought out, batch oriented and borderline unusable.  We wanted a form of Quickbooks integration that actually works, so we invested a significant amount of time into fine tuning it.  Roger's done great work here.

By the same token, we don't want iPhone and iPad apps that exist just to tick off the bullet point.  They need to be well thought out and not just clones of our existing interface.  A touch based UI is completely different.  We need to start from scratch and not let preconceived notions about UIs carried over from the Flash world contaminate the mobile project.

The first consideration is login.  It's one of my pet peeves when a mobile application makes me type in my entire username and password every time I launch it.  I want it to just pop right up - or at most prompt me for some kind of pin code I can type easily with my thumbs.

With this in mind, we're designing our iOS apps to support a one-time login that generates a set of credentials unique to each device, which are then used for accessing data.  This means the app will just pop right up where you left off without forcing you to log back in.  The app will also support connections to multiple Flex servers from the same device - which might be a common use case for freelancers.

With freelancers in mind, in the coming months we'll be introducing a limited form of access to Flex where freelancers can see their schedules and accept or decline jobs on the web - without Flash and without creating user accounts for each of them.  These same freelancers will be able to connect their mobile devices to Flex using a device code that they'll receive in an invitation email.

Once connected, they should be able to view their schedules and gig requests direct from the app.

Caching Data Locally

Under the hood, we've decided to take advantage of the iOS platform's ability to store data and cache as much data as possible on the device.  This will speed things up by reducing network communication and enable users to view information even if the network connection is sketchy or dead - key for freelancers who just need to double check call times.

To support this, we're adding a version hash code to most elements of the data model.  When a piece of data is loaded on the iPhone, we'll send a refresh request to the server along with the version hash.  The server will check this hash against the one it has on file and either send an updated copy of the data or indicate that the phone's data is up to date and no refresh is needed.  For things that seldom change - like pricing models, resource types and project element definitions - this should be a valuable optimization.
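The server side of that handshake boils down to a hash comparison.  A minimal sketch (types and names are illustrative; the real version would sit behind a REST endpoint and return real domain objects rather than strings):

```java
import java.util.*;

// Sketch of the refresh handshake: the device sends the version hash of
// the data it already has; the server compares it to the current hash and
// either replies "up to date" (no payload) or ships a fresh copy.
class RefreshService {

    static class RefreshResponse {
        final boolean upToDate;
        final String payload;  // fresh copy of the data, or null

        RefreshResponse(boolean upToDate, String payload) {
            this.upToDate = upToDate;
            this.payload = payload;
        }
    }

    private final Map<String, String> currentHashes = new HashMap<>();
    private final Map<String, String> currentData = new HashMap<>();

    void store(String key, String data, String hash) {
        currentData.put(key, data);
        currentHashes.put(key, hash);
    }

    RefreshResponse refresh(String key, String clientHash) {
        String serverHash = currentHashes.get(key);
        if (serverHash != null && serverHash.equals(clientHash)) {
            return new RefreshResponse(true, null);  // nothing to send
        }
        return new RefreshResponse(false, currentData.get(key));
    }
}
```

For seldom-changing data like pricing models, most refresh calls return the tiny "up to date" response instead of a full payload, which is where the bandwidth savings come from.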

Session Controllers

We love the Model-View-Controller pattern at Flex and use it just about everywhere.  In iOS apps, you have no choice but to use it.  The controller in an iOS app starts with something called an Application Delegate - and this can be the controller abstraction for most iOS apps, because most iOS apps are simple and dedicated to one or two basic tasks.  Flex is very complicated, and we needed a way of breaking up the complexity into manageable chunks.

We also need an abstraction that hides the device specific user interface details.  For example, a user interface that might be only one screen on the iPad might be a whole nest of related screens on the smaller iPhone.  We want a single controller abstraction that encapsulates all of that.

For that we created a session controller class, the header file of which is shown here:

#import <Foundation/Foundation.h>
#import <UIKit/UIKit.h>

@class FRSAppDelegate;

@interface FRSSessionController : NSObject

- (void)launchInitialView:(FRSAppDelegate *)delegate;
- (UIViewController *)launchIPhoneView:(FRSAppDelegate *)delegate;
- (UIViewController *)launchIPadView:(FRSAppDelegate *)delegate;

@end

We have a session controller for account setup and navigation - and will likely create session controllers for warehouse functions, contacts, quotes, inventory management, etc.

In the application delegate (the root object of an iOS application) we have a method for launching a "session" and setting a reference to the active session controller where user interface code can find it.  The user interface code will use the session controller to retrieve data from the server or send updates to the server, and also as a place for storing data temporarily between screen transitions.

Scanning

Roger's role in this effort was to test different techniques for scanning bar codes on a mobile device.  There are barcode scanner attachments for iOS devices, but they're expensive and we don't think they'll be very common.  We opted instead to support Bluetooth scanners and scanning via the camera.  Using a library called RedLaser, Roger was able to successfully scan barcodes with the camera, and Bluetooth scanning has also been successfully tested on our architecture.

Wrapping Up

In spite of our detour into mobile apps, the focus at Flex right now is server side architecture and speed.  4.6 was a good step regarding speed, but until we have a scalable and fast server side infrastructure, we really can't waste time on other big projects.

We only started work on mobile apps because we were stuck in a little donut hole that prevented us from working on big server side changes.  Even so, with the architectural ground work now laid, work on mobile will continue as opportunities arise and will accelerate once we've stabilized our server side platform.  Chris should have something to tease everyone with at Infocomm and we should have something big to announce at LDI, perhaps even before then.





Friday, March 1, 2013

Next Generation Architecture

Earlier this week, I talked about some of the short term work we're doing to establish performance benchmarks and some of the first steps toward speeding things up.  I also mentioned the issues we have with our instance-per-customer deployment model.  I neglected to mention what our proposed solution is for a long term architecture that provides greater reliability, faster performance and efficient use of resources, so I thought I'd do a short post to catch everyone up on our progress toward the next generation architecture.

In A Nutshell

Over the next twelve months we'll be modifying Flex to support something called multitenancy, which means a single instance of Flex will simultaneously support hundreds of customers.  This will enable us to allocate resources based on customer load or usage.  Instead of having one server for every eleven customers, we'll have a minimum of two servers (for failover) in front of a load balancer and dynamically allocate additional server capacity in response to load.  When the load drops off (overnight, for example) we'll spin down the extra servers.

We also plan on adding some extra special purpose servers for things like report generation, caching, search and indexing and overall control of the cluster.  The diagram shows what we're thinking about doing:



Planning the hardware deployment architecture is always fun, but the bulk of the work will be in the software.  There are a lot of interesting software engineering problems we'll be tackling, and none of these problems are unique to Flex; they're common to any traditionally developed back office J2EE application migrating to a cloud based multi-tenant architecture.

The Alto Project

We've already started work on the software part of this problem by creating an open source project dedicated to providing the frameworks and utilities J2EE projects will need to move to an efficient multi-tenant cloud architecture.  You can find a summary of the engineering problems we'll be addressing on the project's wiki here:  https://github.com/alto-project/alto/wiki

You can also take a look at the project's source code as it develops and even contribute if you're so inclined.

Two Ways

A unique challenge for our architecture is that we always want to maintain support for the single instance back office model.  Just as there's now a big move to the cloud, who's to say that some years down the road there won't be a move back to self-hosted applications?  This prospect, along with the fact that we do have a fair number of self-hosted or dedicated hosting customers, means we never want our architecture to "force" anyone into the cloud.  We need an architecture that works both ways without requiring separate builds for a cloud deployed version and a self-hosted version.  Switching between the two should be configured solely through external JNDI parameters.

So, that's the vision for Flex over the next year or so and a big reason why we feel branching the code is so important.  Work on this next generation architecture is ongoing and will ramp up even more once we've addressed the Hibernate performance bottlenecks.  We need to be doing this work in a code branch that won't impact or slow down ongoing Fast Track projects or routine maintenance.

I, for one, am looking forward to the next twelve months.  I'll get to work on some pretty cool stuff.


Wednesday, February 27, 2013

Growing Pains and Performance Measurement

There's no point in hiding it - we're in the throes of growing pains and trying to simultaneously handle ongoing feature development and modify Flex to work efficiently under scale.

Right now we're putting the finishing touches on the next release of Flex, which takes into account some lessons learned upgrading Hibernate and performance tuning the system for large customers.  Some of our ideas worked, some didn't, and some only work if we can throw large amounts of memory at the problem - which we can't.  We decided to stop the performance experiments and get all the new features released, then resume work on experimental performance improvements once we initiate code branching and establish an ongoing performance measurement regime.

Facing Growth

Flex is growing like wildfire, and while that's a good problem to have, it's still a problem.  From a technical standpoint, the key issue is that Flex was never designed to make efficient use of a cloud environment.  Our current architecture involves running 11 customers on each cloud server and allocating each customer instance around 2 gigabytes of memory between heap space and permanent generation space.  Building out an auto-scaling, fault-tolerant, multi-tenant architecture when we first started would have been wildly optimistic in terms of customer growth and would have diverted resources from developing the new features our customers actually care about.  But now it's essential.

Most customers don't make use of all this memory and it goes idle while others strain to fit their operations into the 2 GB footprint.   We've always known this was an opportunity - if we could re-architect Flex to optimize memory allocation based on real usage, we'd be able to make sure the higher volume customers got the memory they needed without wasting it on the customers that don't.

One way would have been to maintain the separate-instance-per-customer model, but add some analysis tools that tweak the memory allocation based on usage.  But with customer growth knocking on the door of 300, and with over 30 production servers in use between our Virginia, Oregon, Ireland, and Sydney data centers, we're at a scale where even that optimization would only be a band-aid - and it wouldn't address issues like fault tolerance or give us the ability to isolate resource intensive tasks like report generation on their own servers (where their memory use won't impact bar code scan response times, for example).

We've established a series of near term and intermediate term objectives for the coming year, with the two most critical being performance improvements for high volume systems and shorter maintenance release cycles.  Our longer term goal of moving Flex to a fault-tolerant, multi-tenant architecture has to be considered now because technology choices made to solve short term problems must take it into account.

Faster Releases

We've suffered from some issues related to code branching lately, or rather, a lack thereof.  We've always thought we were small enough to work out of a trunk or master branch instead of forking the code in parallel branches.  We were small enough to avoid branching at one time, but no longer.

The problem is that simple bug fixes and feature enhancements get delayed by more elaborate or experimental work that takes longer to shake out.  Our first order of business once this release is out and stable is to move the entire codebase from Subversion to Git.  Once moved, all major work on the code will be carried out in a "major-version" branch, with the master branch reserved for maintenance work.

The net effect of this change will be more frequent releases with minor tweaks and bug fixes out the door sooner.  We're about to start working on a large number of experimental performance improvements and architecture changes - exactly the kind of thing that tends to delay a release.  We'll move our code to parallel minor/major branches before work starts on any of that.

Performance Measurement

Our number one priority right now is improving performance.  We've made some major gains on that front as part of this release, but it's still not where we want it to be.  We want performance improvements measured in orders of magnitude.  4-6 times faster is good; 100-1000 times faster is better.  To achieve this, we'll have to do experimental things like bypass Hibernate, perhaps even consider non-relational databases like BigTable, Cassandra or MongoDB.

We also learned over the last few months that making something fast in one area can make it slow in another.  We need a good set of performance benchmarks so we know what works and what ripple effects it has.

To achieve this, we revived an old idea about Regression Testing Performance and started an open source project on GitHub called Perforate.  (You can view or download the source code here: https://github.com/flex-rental-solutions/perforate.)  Perforate integrates with the TestNG framework we use for unit and integration testing, tracking the historical running time for each test and failing the test if its running time exceeds the historical mean by a configurable threshold (the default is three standard deviations).
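The statistical check at the heart of this is simple.  Here's a rough sketch of the idea in plain Java (the class and method names are ours for illustration, not Perforate's actual API):

```java
/**
 * Sketch of the threshold check Perforate applies to each test:
 * fail when the current run exceeds the historical mean by more
 * than a configurable number of standard deviations.
 */
public class PerformanceRegressionCheck {

    private final double deviations;  // configurable; the default is 3

    public PerformanceRegressionCheck(double deviations) {
        this.deviations = deviations;
    }

    /** Returns true if the current run time is within historical tolerance. */
    public boolean passes(long[] historicalMillis, long currentMillis) {
        if (historicalMillis.length == 0) {
            return true;  // no history yet, nothing to compare against
        }
        double sum = 0;
        for (long t : historicalMillis) {
            sum += t;
        }
        double mean = sum / historicalMillis.length;
        double squares = 0;
        for (long t : historicalMillis) {
            squares += (t - mean) * (t - mean);
        }
        double stdDev = Math.sqrt(squares / historicalMillis.length);
        return currentMillis <= mean + deviations * stdDev;
    }
}
```

Note that with a flat history (zero standard deviation) any slowdown at all fails the test, which is why the threshold needs to be configurable.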

Roger Diller took this a step further and cooked up an internal user interface we use to analyze performance data.  We've posted a few screen grabs below:



This is a great tool for the weeks ahead, because we can make a code or architecture change we think will have a good impact on performance and just look at the graphs for sharp drops to know that it worked - or sharp spikes to know our plan backfired.

Our goal over the coming weeks - once we've branched the code - is to make as many of those curves have sharp drop offs as possible.

The Options

We've learned the hard way that Ted Neward was right when he called Object-Relational Mapping the Vietnam of Computer Science.  Hibernate can make mundane things easier, but often with an overhead cost that's unacceptable.  We think Hibernate is the major barrier right now to blindingly fast performance.  We've tinkered with it enough to make 4.6 faster than 4.5, but we think we've gone about as far as we can go with it.

Hibernate isn't going away, but we're going to start phasing it out in certain areas by going with straight JDBC for line items and scan records.  We're also looking at storing an adjacency matrix for the upstream/downstream line item hierarchy so it can all be pulled back in one query instead of recursive queries.  We think this will address the majority of our problems, but we're also looking at ways to optimize resources and project elements.
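To give a feel for the adjacency approach, here's a simplified sketch of the idea: fetch all parent/child edges in a single query, then walk the hierarchy in memory.  The table and column names in the comment are hypothetical, and the real version would map real line item objects rather than ids:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LineItemHierarchy {

    // One query instead of a recursive fetch per node; table and
    // column names here are illustrative:
    //   SELECT parent_line_item_id, child_line_item_id
    //   FROM line_item_adjacency WHERE project_id = ?
    private final Map<String, List<String>> children =
            new HashMap<String, List<String>>();

    /** Called once per row returned by the adjacency query. */
    public void addEdge(String parentId, String childId) {
        List<String> list = children.get(parentId);
        if (list == null) {
            list = new ArrayList<String>();
            children.put(parentId, list);
        }
        list.add(childId);
    }

    /** All downstream line items, walked in memory with no further queries. */
    public List<String> descendantsOf(String parentId) {
        List<String> result = new ArrayList<String>();
        collect(parentId, result);
        return result;
    }

    private void collect(String parentId, List<String> result) {
        List<String> direct = children.get(parentId);
        if (direct == null) {
            return;
        }
        for (String child : direct) {
            result.add(child);
            collect(child, result);
        }
    }
}
```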

The scan log and line items are simple - they don't come with the complicated object graph structure that project elements and resources do.  For example, an Inventory Item, one kind of resource, has quantity records, contents, OOC records and serial numbers.  Going straight JDBC for objects with complicated graphs might be a boost over Hibernate, but for the seismic boost we want, we might have to look at other options, like NoSQL databases.  As of now, MongoDB looks like a good candidate for optimizing project elements and resources.

We also need to improve the speed of search execution and indexing.  Our existing search is a little kludgy and could benefit from a major overhaul.  We're currently looking at moving our entire search process to Apache Lucene, the same search technology that powers Twitter and a number of other marquee sites.  For self-hosted customers, we expect to use an embedded Lucene engine.  For the cloud, we'll offload search indexing and execution to a central server using Apache Solr.

Of course, all of these options are experimental.  They might work; they might not.  There's really no way to know for sure until we try them.  But to do that, we need to isolate experimental code in its own branch so as not to interfere with routine maintenance and Fast Track work.  We also need a reliable way of measuring performance to accurately gauge our successes and failures.

Wednesday, January 23, 2013

Toward Memcached

This week at Flex we're going through release candidates for version 4.6.X, which includes a number of big performance enhancements we recently made to improve speed for high throughput systems and companies that use large quotes.  It's taken longer than expected because we decided to be somewhat aggressive with the performance tuning, and tuning of this kind always has ripple effects.  The last week and a half or so has been consumed with fixing new issues raised by QA.  Yesterday the new version dropped to QA for regression testing.

This means the new version is feature complete, all known bugs have been fixed and the software is currently in regression testing.  This is normally the last stage in our release process, but this time we're adding a brief period of beta testing to check the new version's memory footprint in a production environment.  The reason is that one of the key strategies we used to improve performance was expanding the Hibernate Second Level cache.  This made the performance numbers much better, but a potential downside is that the increased memory usage could - in theory - exceed available heap, increase garbage collection overhead, and increase swap usage on the servers, along with all the virtual memory paging that goes with it.

A Distributed Cache

One of the unique conditions for this beta test is that we already know what we'll do if we run into memory problems on the production servers, so there's no need to sit around and wait for the issue to crop up.  We also know that the solution to this hypothetical memory issue is also something we'll have to do once Flex moves to a fault tolerant high availability architecture later this year - which is to use a distributed cache.

This means that the memory we need for caching is offloaded to another server or cluster of servers.  This solves the problem of running out of memory on a local server and it deals with the concurrency issues you'd run into on a multi-server architecture.

There are a number of options for a distributed cache in Java, notably Terracotta, but we decided to go with a much simpler option called memcached - which we'll use via Amazon's Elasticache service.

Memcached

Memcached is a dirt simple in-memory cache server originally developed for LiveJournal and later extended and refined by Facebook.  To this day memcached plays a huge role in Facebook's architecture, with over 800 memcached servers in production.  You can read more about Facebook's experience with memcached here:

Once the current release of Flex was feature complete, we decided to get a jump start on memcached integration while the release wound its way through QA.  The biggest stumbling block for integrating Flex with memcached turned out to be the lack of an existing memcached library for Hibernate 4 - we had to build it.

Flex Alto

We also had the problem of needing to support customers, particularly those who self host, who don't want or need to use memcached.   We needed an architecture that supported seamlessly switching between an on heap cache implementation and memcached or anything else that might come along.  To support this, we added a number of new features to an open source library we sponsor called alto.

(Feel free to explore the source code on github here.)

We needed three basic features to make memcached work with our system and to support interchangeable caching strategies:

  • The ability to switch caching implementations via a JNDI injected parameter.
  • A simplified caching abstraction so we code to an abstraction instead of a proprietary caching architecture.
  • Glue between that caching abstraction and Hibernate 4.

Pluggable Cache Implementations

The first one was easy.  We created a new Spring factory bean that takes a bean id as a property that can easily be injected via JNDI and returns the Hibernate Region Factory (cache implementation) that corresponds with that id when needed.  Source here.
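Stripped of the Spring and Hibernate machinery, the selection pattern amounts to a keyed lookup against registered implementations.  Here's a plain-Java sketch of the idea (all names are illustrative; the real version is a Spring factory bean returning a Hibernate RegionFactory):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative stand-in for the factory bean: implementations
 * register under an id, and the id selected via the JNDI-injected
 * property decides which one gets handed to Hibernate.
 */
public class CacheImplementationResolver {

    private final Map<String, Object> implementations =
            new HashMap<String, Object>();
    private final String selectedBeanId;  // in Flex this arrives via JNDI

    public CacheImplementationResolver(String selectedBeanId) {
        this.selectedBeanId = selectedBeanId;
    }

    public void register(String beanId, Object regionFactory) {
        implementations.put(beanId, regionFactory);
    }

    public Object resolve() {
        Object impl = implementations.get(selectedBeanId);
        if (impl == null) {
            throw new IllegalStateException(
                    "No cache implementation registered for id: " + selectedBeanId);
        }
        return impl;
    }
}
```

The payoff is that switching a server from on-heap caching to memcached becomes a one-line configuration change rather than a code change.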

Simple Caching Abstractions

Next, we wanted a simplified abstraction for the generic idea of a cache, one that came without the complexity of JSR-107.  For that we created a simple interface called AltoCache:


public interface AltoCache {
    public Object get(String region, String key);
    public void put(String region, String key, Object value);
    public boolean isCached(String region, String key);
    public void remove(String region, String key);
    public void clear(String region);
    public AltoCacheStatistics getStatistics(String region);
    public AltoCacheStatistics getStatistics();
    public void get(String region, String key, Future<Object> callback);
    public void put(String region, String key, Object value, boolean async);
    public CacheKeyGenerator getKeyGenerator();
}

As you can see from the interface, it's a pretty simple key/value pair abstraction with some asynchronous get/put support.

We then created implementations of this interface for a simple in-memory HashMap, EHCache, and memcached.
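To give a sense of what one of these looks like, here's a simplified in-memory implementation of the core operations, using a ConcurrentHashMap per region.  This is a sketch of the shape, not the actual alto class, which also covers statistics, key generation and the async variants:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Simplified in-memory cache: one concurrent map per region. */
public class InMemoryCache {

    private final ConcurrentMap<String, ConcurrentMap<String, Object>> regions =
            new ConcurrentHashMap<String, ConcurrentMap<String, Object>>();

    /** Lazily creates the backing map for a region. */
    private ConcurrentMap<String, Object> region(String region) {
        ConcurrentMap<String, Object> map = regions.get(region);
        if (map == null) {
            regions.putIfAbsent(region, new ConcurrentHashMap<String, Object>());
            map = regions.get(region);
        }
        return map;
    }

    public Object get(String region, String key) {
        return region(region).get(key);
    }

    public void put(String region, String key, Object value) {
        region(region).put(key, value);
    }

    public boolean isCached(String region, String key) {
        return region(region).containsKey(key);
    }

    public void remove(String region, String key) {
        region(region).remove(key);
    }

    public void clear(String region) {
        region(region).clear();
    }
}
```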

Hibernate 4 Integration

We then took the caching API defined by Hibernate 4 and developed an implementation targeting the AltoCache interface.  It's a complicated API, so this took quite a bit of time and testing to get up and running.

Perhaps the two biggest issues that came up during the implementation process relate to Hibernate's concept of regions and how read/write locks would work in a distributed architecture.

For the first issue, Hibernate likes to divide the cache into sections of related cache entries called regions, and it relies on this region concept to evict entire cache regions it believes are stale.  Trouble is, memcached doesn't support the idea of key namespaces or regions, so we had to use the memcached append operation to maintain a list of keys that define each region, effectively creating virtual regions.  It feels a little ugly to me, but there's no other way to handle it until memcached introduces a region concept - which, given the philosophy behind memcached, particularly how it uses key hashes to distribute keys across servers, seems unlikely.
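Here's a simplified sketch of the virtual region technique, using a plain map as a stand-in for the memcached client (the real implementation calls memcached's append operation on the key list entry, and the key naming scheme below is illustrative):

```java
import java.util.HashMap;
import java.util.Map;

/** Virtual regions over a flat key/value store like memcached. */
public class VirtualRegions {

    private final Map<String, String> store = new HashMap<String, String>();

    public void put(String region, String key, String value) {
        store.put(region + ":" + key, value);
        // Track the key in the region's key list, memcached-append style.
        String keyList = store.get("region-keys:" + region);
        store.put("region-keys:" + region,
                keyList == null ? key : keyList + "," + key);
    }

    public String get(String region, String key) {
        return store.get(region + ":" + key);
    }

    /** Evict a whole region by walking its key list - memcached has no native region concept. */
    public void evictRegion(String region) {
        String keyList = store.remove("region-keys:" + region);
        if (keyList == null) {
            return;
        }
        for (String key : keyList.split(",")) {
            store.remove(region + ":" + key);
        }
    }
}
```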

Local Locks, Distributed Locks

The final issue pertaining to mutex locks during cache reads and writes is a pretty easy problem to solve for now, since we only need local locks.  But we also know that we'll eventually need a more complicated locking mechanism that works on a multi-server architecture.

We first created a simple lock abstraction that can be plugged into our Hibernate 4 Cache implementation or used on its own...

public interface LockProvider {
    public String lock(String lockId);
    public boolean unlock(String lockId);
    public boolean isLocked(String lockId);
}

Then we created a simple implementation using Java's reentrant locks for use when locks are local in scope - and that's good enough for now.
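A minimal version of that local implementation might look like the following.  The method shapes mirror the LockProvider interface, but the details here are illustrative rather than the actual alto code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

/** Local-scope lock provider backed by Java's reentrant locks. */
public class LocalLockProvider {

    private final ConcurrentMap<String, ReentrantLock> locks =
            new ConcurrentHashMap<String, ReentrantLock>();

    public String lock(String lockId) {
        locks.putIfAbsent(lockId, new ReentrantLock());
        locks.get(lockId).lock();  // blocks until the lock is available
        return lockId;
    }

    public boolean unlock(String lockId) {
        ReentrantLock lock = locks.get(lockId);
        if (lock == null || !lock.isHeldByCurrentThread()) {
            return false;  // never locked, or held by another thread
        }
        lock.unlock();
        return true;
    }

    public boolean isLocked(String lockId) {
        ReentrantLock lock = locks.get(lockId);
        return lock != null && lock.isLocked();
    }
}
```

A distributed implementation would keep the same interface and swap the ReentrantLock map for calls to a lock server.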

But anticipating what's coming down the pike with a multi-server architecture, we stubbed in a simple telnet based lock service we called lockd.  The idea is that something with the same simple approach used for memcached, adapted for mutex locks, would be a wonderful thing.  We designed lockd to be easily embedded in an application via Spring for small clusters where the locks replicate to each node in the cluster.  Ultimately, though, we'll wait for another lock service to emerge or develop a C++ implementation that supports the protocol.  We may even fork memcached as a starting point and modify it to be a lock server.

Pragmatically, we really shouldn't be in the distributed lock service business, and we hope a propeller-head-built lock service comes along or our friends at AWS introduce one.

In Conclusion

We know it's been a long wait for the new release.  We really are close now and it should be out of QA later this week or early next week.  We'll immediately beta test it on one of our servers (we're planning on testing it on one of our European Union servers) and get a feel for whether or not memcached will be required for general deployment.

In the meantime, we've gotten a head start on memcached integration and full regression testing of Flex backed by memcached will start as soon as the current release clears QA.

After this release gets out and is running stable, we'll get back to clearing out fast track issues and finishing up our new crew scheduling system.