Thursday, May 16, 2013

Code For Us

We're Hiring

We have a lot of big projects planned for the next few years at Flex: advanced crew scheduling, multi-session event planning, a high availability cloud architecture, mobile applications and a full HTML5 rewrite of our front end.  And we'll need help.  A hardcore Java nerd with production industry experience would be ideal.

Here are the details:

What We're Looking For

Flex Rental Solutions is seeking software engineers of all experience levels to help support a rapidly growing customer base around the world.  At Flex, you'll work on technology that powers much of the world's live entertainment and corporate events.  From dynamic cloud computing architecture to mobile applications to rich internet applications, a career at Flex provides a wide array of stimulating challenges for software engineers.

You'll receive a competitive compensation package and will work from the comfort of your own home (although occasional travel may be required for training, conferences or on-site work with customers).

Above all, Flex looks for a rare combination of intelligence and maturity, a willingness to adapt to changing circumstances, and a focus on the customer's needs.  Our customers work in a fast-paced, demanding industry, and an ability to empathize with customers working under stressful conditions is essential.

Flex is currently looking for candidates with some or all of the following skills and qualifications:

Education: A BS in Computer Science or equivalent experience.

Core Languages:  Java, Python, Objective-C, JavaScript (client side), ActionScript, some Bash scripting

Theory: Basic knowledge of statistical methods, set theory and graph theory.  College coursework that includes discrete mathematics is a plus.

IDEs: Eclipse, Xcode

Build Tools: Maven 3 and Gradle

Java Frameworks and APIs: Spring, Spring MVC, Spring Security, Hibernate, JAXB, Jakarta Commons

Platforms: Linux (Ubuntu), EC2, Jetty, Memcached

Databases: MySQL, MongoDB, Cassandra

Web Technologies: HTML, CSS, JavaScript, jQuery, jQuery UI, angular.js, socket.io

Mobile Platforms: iOS and Android

Interested candidates should send a resume and cover letter to jeff [at] flexrentalsolutions [dot] com.

About Flex Rental Solutions

Flex Rental Solutions is an award-winning provider of financial and inventory management solutions for the live event industry.  With an international roster of customers in the corporate A/V, concert touring, television and film industries, Flex technology manages equipment and crew scheduling for nearly 300 equipment rental and production houses worldwide.  In a few short years Flex has catapulted from newcomer to industry leader - with the industry's first web-based, and only cloud-based, ERP system for event production.

Friday, May 10, 2013

Fixing The Double Conflict Filter

The main priority right now at Flex is making things fast.  There are things we did when we originally designed Flex (and Shoptick before it) that we knew were suboptimal, but at the time we felt that going all the way would divert resources from more pressing concerns - and, more importantly, would have meant being wildly optimistic about our company's future prospects.

It would have been akin to a company that makes irrigation valves overdesigning their products on the off chance they might be used in a nuclear power plant.

We never expected Flex to get so big so quickly and to have customers with as much concurrent throughput as we have now.  We'd always hoped to get there, but when you're starting a company you have to focus on the here and now, on the needs of the customers you have - and keep your goals realistic.

This is why we chose Hibernate as an ORM framework instead of building our own micro-optimized persistence layer.  Hibernate is ubiquitous, fast to develop in and, although not as performant as rolling your own, usually good enough.

Whatever our reasons were initially, the landscape has changed and the time for hand-optimized persistence and caching code has come.

Faster Availability

Much of the work we've done so far on system performance has related to the scan process.  We've fine-tuned away many of the scan bottlenecks (the .14 and .15 releases include a lot of scan performance improvements) and now our attention turns to availability calculations.

The Flex availability calculation process is very complicated, but there are two main phases governed by separate modular components: The Conflict Engine and The Availability Engine.

The Conflict Engine's job is to retrieve and process line items from the database that might be relevant to an availability calculation.  The Availability Engine then takes the output from The Conflict Engine and applies all the ship, return, container, subrental, transfer, and expendable return logic to produce a final result.

The purpose of this design was to isolate the I/O intensive part of the calculation in one place (The Conflict Engine) and leave The Availability Engine to focus on relatively high speed in memory computation.
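A minimal sketch of that split might look like the following. The class names are hypothetical, and the real Availability Engine applies far more logic than a simple subtraction - this only shows the shape of the in-memory phase:

```java
import java.util.List;

// Hypothetical sketch of the two-phase design. The I/O-intensive
// Conflict Engine phase (fetching candidate line items) is elided;
// only the in-memory Availability Engine phase is shown.
class LineItemConflictStub {
    final int quantity;
    LineItemConflictStub(int quantity) { this.quantity = quantity; }
}

class AvailabilityEngineSketch {
    // Pure in-memory computation: stock minus conflicting quantities.
    // (The real engine also applies ship, return, container, subrental,
    // transfer and expendable return logic.)
    static int available(int totalStock, List<LineItemConflictStub> conflicts) {
        int used = 0;
        for (LineItemConflictStub c : conflicts) used += c.quantity;
        return totalStock - used;
    }
}
```

Keeping this phase free of I/O is what makes the overall design tractable: however slow the fetch is, the math on top of it stays cheap.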

We've known for some time that the bottleneck in availability performance is The Conflict Engine, and we've learned that the database query used to retrieve line items is fast.  The work Hibernate does to turn that query's results into line item objects, however, is not.

Another main bottleneck in The Conflict Engine is what we call The Double Conflict Filter.  This filter's job is to remove related line item entries from the conflict result; otherwise an item might be counted more than once.  Consider the following graph of a typical line item relationship in Flex:

This shows a pretty conventional process where a line item is placed on a quote, the pull sheet is generated, and as the show is scanned out, two manifest line items are created with the specific serial numbers.

But there are four line items in the system referencing the console, for a total of six conflicts - when only two consoles are actually in use.  In Flex, we address this problem by assuming that only the line items furthest downstream in the workflow control availability.  This deals with the double conflict problem, but is also intended to handle the problem of the plan diverging from reality.
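The downstream-wins rule itself is simple once the relationships are in hand. Here's a rough sketch - the ids and the relationship map are made up for illustration, not Flex's actual model:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the "furthest downstream wins" rule: a line
// item counts against availability only if nothing downstream of it
// (a pull sheet or manifest line item) has taken over.
class DoubleConflictFilterSketch {
    // parent line item id -> ids of its downstream line items
    static Map<String, List<String>> downstream = new HashMap<>();

    static List<String> filter(Collection<String> lineItemIds) {
        List<String> controlling = new ArrayList<>();
        for (String id : lineItemIds) {
            List<String> children = downstream.get(id);
            if (children == null || children.isEmpty()) {
                // no downstream items: this one controls availability
                controlling.add(id);
            }
        }
        return controlling;
    }
}
```

In the console example above, the quote and pull sheet line items both have downstream children and get filtered out, leaving only the two manifest line items - two conflicts, matching the two consoles actually in use.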

What would happen in a situation where the L1 made a judgment call in the warehouse and decided to take only one console, or decided to take a lower end console as a spare?  If we went solely off the plan, other shows would show a shortage and you might end up subrenting a piece of gear that you had sitting on the shelf the whole time.

The performance problem with this approach is that the current algorithm uses recursive I/O to crawl down the downstream object graph, necessitating a database hit for each level of the graph.  This is slow.

Bypassing Hibernate

The first step (already completed) is to bypass Hibernate in the conflict engine.  We did this by replacing the fully mapped line item objects with a simple lightweight DTO that contains only the line item fields relevant to an availability calculation (fields like location, ship date, return date, etc.).

Instead of running a database query to pull back a list of line item ids and feeding those ids into Hibernate for hydration, all the fields come back in one query and get copied directly onto the DTO.  This was pretty straightforward.
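In plain JDBC terms, the approach looks roughly like this. The table, column and class names here are assumptions for illustration, not Flex's actual schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical lightweight DTO: only the fields an availability
// calculation needs, with no Hibernate mapping or lazy proxies.
class LineItemDto {
    long id;
    String locationId;
    Date shipDate;
    Date returnDate;
    double quantity;
}

class ConflictQuerySketch {
    // One query, no ORM hydration: copy columns straight onto the DTO.
    static List<LineItemDto> fetch(Connection c, String inventoryItemId) throws Exception {
        String sql = "select id, location_id, ship_date, return_date, quantity "
                   + "from line_item where inventory_item_id = ?";
        List<LineItemDto> out = new ArrayList<>();
        try (PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setString(1, inventoryItemId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    LineItemDto dto = new LineItemDto();
                    dto.id = rs.getLong("id");
                    dto.locationId = rs.getString("location_id");
                    dto.shipDate = rs.getTimestamp("ship_date");
                    dto.returnDate = rs.getTimestamp("return_date");
                    dto.quantity = rs.getDouble("quantity");
                    out.add(dto);
                }
            }
        }
        return out;
    }
}
```

The win comes from skipping the per-entity work an ORM does on hydration (proxy creation, dirty-check bookkeeping, association wiring) for objects that will only ever be read.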

Adjacency Matrix

The next step is to rework the double conflict filter by getting rid of the recursive I/O needed to retrieve the graph (I/O buried inside Hibernate, I might add).  To accomplish this, we're introducing a persistent adjacency matrix to represent the upstream/downstream relationships.  We also decorate this adjacency matrix with the status of the downstream line item and whether or not that status is conflict creating, which saves yet another Hibernate lookup.

Each line item has a reference to the adjacency matrix used to represent upstream/downstream line items and we can retrieve all the relationships in a single query - and store them in a small and easy-to-cache object.  The caching will further reduce database I/O.
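A decorated adjacency structure along these lines might look like the following sketch - the names and the boolean decoration are illustrative, not the actual Flex classes:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the decorated adjacency structure: for each
// line item we store its downstream line item ids plus whether each
// downstream item's status creates a conflict, so deciding which item
// controls availability needs no further database lookups.
class LineItemGraphSketch {
    // parent id -> (downstream id -> does its status create a conflict?)
    final Map<String, Map<String, Boolean>> downstream = new HashMap<>();

    void addEdge(String parentId, String childId, boolean childCreatesConflict) {
        downstream.computeIfAbsent(parentId, k -> new HashMap<>())
                  .put(childId, childCreatesConflict);
    }

    // A line item controls availability only if it has no
    // conflict-creating downstream line items.
    boolean controlsAvailability(String lineItemId) {
        Map<String, Boolean> children = downstream.get(lineItemId);
        if (children == null) return true;
        return !children.containsValue(true);
    }
}
```

Because the whole structure for a line item comes back in one query, the recursive crawl disappears, and the object is small enough to sit comfortably in a cache.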

Local Caching / Distributed Cache Version Control

We're also introducing a new cache version feature that will be needed for high availability Flex.  We learned a few months ago when we first started playing around with memcached that serializing our domain objects for the trip over the wire to memcached and back was slow.  Not as slow as not caching at all, but still less than ideal.  It would also necessitate a large cache, and since we'll be using Amazon's ElastiCache, this comes with a price tag.

What we decided to do was stick with an in memory cache, but use memcached to help us know when an object cached in memory was stale (modified by another server in the cluster).  We do this by giving cacheable objects the ability to implement an interface whose single method returns an SHA1 hash that represents the version or state of the object.  It could be a message digest based on the properties of the object or, when generating a string to base the digest on would be too expensive, it could simply be a unique (and persistent) hash that changes when the object is mutated (like a git commit).

The cache lookup code will pull the object it has in the local cache and compare its version hash to the one in memcached for the same cache key (doing both lookups in separate parallel threads).  If they match, all is well and the cached object is returned.  If they don't match, then the in-memory object is no good and the cache lookup will return null.
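The idea reduces to a version-hash comparison. Here's a rough single-threaded sketch, with a plain in-memory map standing in for memcached - all class and method names are hypothetical, not Flex's actual code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of the version-check idea. In the real design the version
// hashes live in memcached; a plain map stands in for it here.
interface Versioned {
    String versionHash(); // e.g. a SHA1 digest of the object's state
}

class VersionedCacheSketch {
    final Map<String, Versioned> local = new HashMap<>();
    final Map<String, String> remoteVersions = new HashMap<>(); // stands in for memcached

    // Helper for hash-based versions: SHA1 digest of a state string.
    static String sha1(String s) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-1")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : d) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    void put(String key, Versioned value) {
        local.put(key, value);
        remoteVersions.put(key, value.versionHash());
    }

    // Return the locally cached object only if its version hash still
    // matches the cluster-wide version; otherwise treat it as stale.
    Versioned get(String key) {
        Versioned cached = local.get(key);
        if (cached == null) return null;
        String remote = remoteVersions.get(key);
        return cached.versionHash().equals(remote) ? cached : null;
    }
}
```

Only a short hash string crosses the wire on each lookup, so the expensive full-object serialization (and the large remote cache it would require) goes away.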

A side effect of this approach is that it could lead to memory churning (lots of space-related cache evictions) if our high availability architecture uses round robin load balancing.  We'll try to optimize the architecture by using hostname affinity for load balancing, with session affinity as a last resort.  Unlike most "sticky session" based load balancing, this approach will just be a performance optimization and shouldn't impact reliability or failover.  We don't rely on HTTP sessions, so losing session variables won't hurt the user at all.

Wrap It Up, Kid

This work is currently slated for release in Flex 4.6.16, although given the risk and magnitude of the change, that version number could slip to .17 or .18.

The single slowest part of Flex has always been availability calculations, and this batch of enhancements has been very encouraging so far.  We haven't done a formal comparison yet with unoptimized versions of Flex, but on my machine the availability calculations appear to be virtually instantaneous.  Here's hoping the fix holds up in regression testing - and that the performance boost we're seeing does, too.

Wednesday, May 1, 2013

Version Numbers, Part 2, Part 1.

Last night we released the first version of Flex with a fourth decimal place in the version number: 4.6.13.2.

Believe it or not, the different decimal places in version numbers do have meaning and our new numbering scheme reflects some underlying changes in how we manage source code and version numbers at Flex.

First off, we recently changed our source control system from Subversion to Git.  As part of this conversion, we restructured our projects and modules to make it easier to branch and merge source code.

Source Control

Under our new regime, we have three main branches of code we work on simultaneously: master, dev-minor and dev-major.  Master is the Git equivalent of Subversion's trunk and for us represents the maintenance or emergency branch.  Most new work happens in other branches, leaving master relatively pristine and unchanged since the last release.  If we're in the middle of coding a big new version of Flex and a severe bug suddenly crops up, we can fix the bug in the master branch and push a release without pushing all the risky new work we're doing along with it.  This allows us to respond to emergencies and push small changes much faster.

Most of the work happens in a branch called dev-minor.  We now do the majority of regular work in this branch and only when we're ready to build a release candidate does the new work get merged over to the master branch for regression testing and release.

Work we consider to be risky or potentially time consuming is done in yet another branch called dev-major.  For example, if we have to rip out parts of the availability engine and rebuild it for performance - a task with the potential to run awry and delay the schedule - we do this in the dev-major branch where it can't delay the work going on in dev-minor.

A Version For Every Branch

The code in each branch has a different version number.  The three digit version numbers we're all used to (e.g. 4.6.14) come out of dev-minor.  Once a routine version has been released, if we find we have to push an emergency or maintenance release out of master, that version has an extra digit tacked on the end like this: 4.6.13.2.  That number tells you that this version was the second emergency release after 4.6.13.

The dev-major branch will get a version like 4.7.0 or even 5.0.0, but it's possible that work done in the dev-major branch could get merged into the dev-minor branch without the version number going with it.  In fact, this happened with version 4.6.14 of Flex.  A number of experimental performance enhancements were done in dev-major and once they were complete and stabilized these changes were merged into the dev-minor branch for the next standard release.

The Point

This may seem confusing to those unfamiliar with version control systems, but the point of all this complexity is to make it easier for us to get our work out sooner.  The release schedule is no longer at the mercy of its most complex task.  We can isolate the time consuming or risky things in their own branch and get other functionality out faster.  We can also respond faster to emergencies or minor maintenance issues.


This process is already starting to work.  Version 4.6.13.2 went out last night, but version 4.6.14 is almost ready and work on version 4.6.15 started today.  We were also able to retire our old release candidate numbering system in favor of true release candidate builds.