Wednesday, September 26, 2012

Time Zones and Checkstyle

After the release of 4.5 and the first maintenance release (4.5.4), we started on the long delayed and much needed project of upgrading our internal build and management tools.  We upgraded the version of Ubuntu we use for the development server and a whole host of tools we depend on, including Jira, Nexus, Bamboo, Confluence and Subversion.  (The move to GIT is scheduled for next year.)

Elastic Build Agents

We also changed the way we do our continuous integration builds.  Previously, the builds would run on the same machine as the continuous integration server (Bamboo).  The downside of this approach is that the builds would drain resources from the build server, slowing it down (which made things slow for us - not customers, which are on different servers.)  We've also wanted to start running performance statistics on regression tests in order to catch sudden deviations from previous performance patterns.  This is hard to do accurately in anything but a cleanroom environment - so having a dedicated build server made the most sense.

On the other hand, we felt that running a dedicated build server might be wasteful because, on average, we're only running builds for 3-4 hours per day, which means that the rest of the day the server would sit idle.

Luckily, the new version of Bamboo has been designed to work with EC2 and by defining a server image for running builds, Bamboo will automatically spin up a new build server when builds are in the queue and drop the instance when the build queue is empty.  After some trial and error to find the right build server image, we got this up and running and now all builds are running on dynamically instantiated or elastic build servers.

The Final Four

Once we got the kinks worked out of the new build environment, all modules were building and all tests were passing - except four availability tests.  These tests worked fine on the old servers and were suddenly failing in the new environment.  We felt it was some kind of environment issue and weren't overly concerned about it, but we don't screw around with availability math and so we weren't about to proceed with development until we knew for sure, until all tests were green.

After examining the specific tests, I first suspected that Hibernate wasn't flushing sessions and therefore the conflict query might have been getting stale data.  I tried to deal with this by creating a new AOP aspect that would intercept the test fixture setup and assertion methods for the functional test suite - and force each one to use its own Hibernate session.  This mirrors how the code being tested runs in real life - and we probably should have done it a long time ago.

The new session management aspect worked fine - except that it didn't fix the four failing tests.

Time Zones

That brings us to about 3:30 AM on Tuesday morning.  Just as my head hits the pillow, I start to wonder if maybe running the build servers on Universal Coordinated Time (UTC) might have something to do with it.

The next morning (five hours later), I get up and check all the test code for spots where the Java Calendar object is used without setting a time zone.  I also go back to our localization code and revert a change I had made to fix a failing test during the build server changeover.  Here it becomes obvious that running the build servers on UTC is the issue - or rather unmasked the issue.

Working most of yesterday trying to find code where un-localized dates and times might have been worked with, it becomes obvious that we're dealing with a large issue and lots of spots in the code that will need to be touched.

In essence, we need to make sure that every instance in the code of...

Calendar cal = Calendar.getInstance();

...becomes...

Calendar cal = Calendar.getInstance(timezone, locale);

Not only that, but we need to make sure this mistake can never get made again.

Checkstyle

It's similar to what we did with number and date formats way back when we purged a great many of the America-centric assumptions about date and number parsing from the code.  To prevent us from reverting to bad habits, we added a custom module to checkstyle that caused a build failure if java's built-in date and number parsing classes were used directly instead of our newly created localization service.  This worked well and has kept peace for some time.

So, this morning I added a new checkstyle module that applies the same philosophy to Java Calendars.  It scans the code for every spot where calendars are instantiated.  If the developer uses the no argument insantiator instead of the one that takes a timezone and locale, the build will fail.  I also added checks for TimeZone.getDefault() because UTC is never a valid timezone in Flex user space.  And I wrapped it up by making sure nobody ever calls setTimeZone() on an instantiated calendar  - because that could undo all the good things done by instantiating the calendars with time zones.

I did this by knocking together a quick little module that uses Checkstyle's parser architecture.  The relevant bit is shown here:

    protected void visitMethod(DetailAST token) {
             
       
        DetailAST methodCall = token.getFirstChild();
        DetailAST invocationTarget = methodCall.getFirstChild();
        DetailAST methodName = invocationTarget.getNextSibling();
       
       
        if (invocationTarget.getText().equals("Calendar")) {
            if (methodName.getText().equals("getInstance")) {
                DetailAST argList = methodCall.getNextSibling();
                DetailAST arg = argList.getFirstChild();
                if (argList.getChildCount() != 3) {
                    log(token.getLineNo(), "Unsafe Calendar instantiation.  Use Calender.getInstance(TimeZone, Locale) instead.");
                }
            }           
        }
        else if (invocationTarget.getText().equals("TimeZone")) {
            if (methodName.getText().equals("getDefault")) {
                log(token.getLineNo(), "Default Time Zone Instantiation.  Use LocalizationService.getSystemTimeZone() instead.");
            }
           
        }
        else if (invocationTarget.getText().length() >= 3 && invocationTarget.getText().toLowerCase().endsWith("cal")) {
            if (methodName.getText().equals("setTimeZone")) {
                log(token.getLineNo(), "Time zone override in calendar.  Initialize calendar using desired time zone instead of setting later.");
            }
        }
       
       
    }
It's not fancy, but it does the job.

With this new module in place, the build server will flag all the spots in our code that need to be changed and we'll simply iterate over the code until all modules pass the checkstyle test.

The Bug

Switching the build server environment from Mountain Time to UTC revealed a pretty serious availability bug - one that we think impacts calendars (as opposed to quotes and pull sheets) because the time zone glitch may have impaired the system's ability to determine when days start and end - potentially turning your days into British days since all cloud servers also run on UTC.

It's never fun to find a glitch in availability or calendar math, but this is exactly why we do regression testing - to quickly discover any negative ripple effects from code or environment changes.  The hope at this point is that once all the unlocalized references to calendars are cleaned up, that the final four tests will pass.

Going forward the build environment should prevent further regressions and the new checkstyle module should prevent the return of bad localization habits.

Monday, September 17, 2012

The Wonderful World of Auto-Scans

All in all, the release of 4.5 last week went pretty smooth.  As much QA as we do -- which is not much for software in general, but quite a bit for the rental software industry -- we always worry that our testing will miss a serious show stopping issue and we'll have to slam in an emergency fix for something.

There were some issues, but nothing quite that serious.  Mostly dinky edge case null pointers that we were able to fix quickly.  The most serious problem that arose after the release of 4.5 pertained to what we call autoscans, or the automatic scanning of a container's contents when the container is scanned.

Inside Contents

At a basic level, autoscans are pretty simple.  For example, when you scan an amp rack, it stands to reason that the rack's component amps and any accessories or power cables usually stored inside the rack also get scanned.  Flex has supported that model of autoscans for a long time.

Where things fell short were situations where the contents of a container change during the prep process.  The most obvious example is free pick containers, though anything can be tweaked or changed during the prep process.

We added the ability to take this into account when autoscanning.  This would mean that if you had a free pick container out on a show, that when the show came back, scanning the container would effectively scan all of its contents back in.

The only problem here is that contents configured for an inventory item and contents specified as child line items are not really the same thing.  We wrote code to analyze the data and try to guess whether the user wants the configured contents or child line items to be scanned, or both.

Shortly after the release of Flex 4.5 we discovered that this code often guesses wrong and effectively scanned the same item twice - once for the contents and another time for the child line items.  When the parent item is a serialized item, you end up with a separate scan operation for each serialized unit, which could exacerbate the problem.

No More Guesswork

We looked at this issue of autoscan processing and decided that it just wasn't possible to make educated guesses about configured contents versus child line items that could fit every situation.

So, we chucked the old yes/no autoscan flag some of you may have seen in the equipment list element configuration screens and replaced it with a set of three Autoscan Modes: All Contents, Permanent Contents, and Child Line Items.

This means that you can configure how you want the autoscan process to work for each warehouse mode.  By default any situation where autoscans were enabled will be set to All Contents, which mirrors the system's behavior prior to 4.5.  It is recommended that anyone who wants line item autoscans change the autoscan mode for manifest returns to Child Line Items.  If you use a two stage check out process (with a prep and ship scan), you might also considering setting the autoscan mode for ship scans to Child Line Items as well.

One Last Guess

The only place in the system where some measure of autoscan guess work is still in play relates to when gear is flowed from one show to another.  When you flow gear, the system will default to Child Line Items mode if you have any autoscan mode configured.  If you have an autoscan mode of None, there will be no autoscans, even if you are flowing gear from show to show.

This fix, along with a number of fixes to clear some of the annoying error messages some of you may have been seeing, is in QA now and will deploy as part of Flex 4.5.3.  You won't have to wait months for this one.  This version will regression test today and should be deployed tomorrow or Wednesday night depending on how regression testing goes.

Tuesday, September 11, 2012

With No Further Ado, Here's Flex 4.5

Almost four months in the making, Flex 4.5 has just been cleared by QA for general release and we've scheduled a full push of the new version just after 12:00 AM Pacific Time on Wednesday, September 12th. 

That being said, the release is done and ready.  Some beta testers got the release last night and anyone who'd like to get it a few hours early should feel free to contact us at support@flexrentalsolutions.com.

The Big Picture

Flex 4.5 includes a large number of new features, enhancements and bug fixes as detailed in the release notes here.  This release began as a simple release designed to added tiered pricing rules in order to support the complex labor calculations common in our industry.  This aspect of the release was completed fairly early on in the process.  What derailed the schedule, the usual frequency of our releases, was a redesign of the calendar system.

We had a customer fast track issue asking if certain columns could be added to the daybook screen.  We could have done it as a one off hack for this customer and the specific column they wanted.  Instead, we decided to redesign the system such that any field could be added to the Daybook.  In general, we felt the technical architecture of the calendar was outdated and inflexible, so we pulled it apart and put it back together the right way.  In addition, the old filter tree Shoptick users may remember had been ripped out when the Ajax calendar was moved to Flash.  We redesigned it and brought it back.  Chris is preparing a video tutorial on the new calendar system with more detail.  I'll update this post when it's ready.  It's also been discussed at length in the support forums here.

At a high level the new calendar system isn't wildly different from the old system - it merely adds more personalization and flexibility.  We take an old Flex concept of Calendar Templates and better integrate it with default calendars.  You can now configure any number of different calendar templates, determine whether or not the list (daybook), traditional calendar, or Gantt views are considered relevant for a particular template.  You can use the filter tree to determine what types of elements and statuses are shown on a calendar - and you can also determine what fields are shown in the list or daybook view.  You can even change the name of the daybook view to something else if you have different lingo.

Other Stuff

A significant amount of time was spent in the last month or so fine tuning the scan and availability process, particularly for non-serialized items and subrentals.  Any fixes related to pricing math and availability usually take a little longer because we build regression tests for these kinds of fixes (to make sure what gets fixed stays fixed).

You can also now cross scan items from one list to another without manually scanning each item in and back out.  This should cut scan time in half in certain warehouses.  This is the first stage in a series of improvements planned to support alternate warehouse workflows.  We'll soon be introducing something to the scanning process called Concurrency Mode.  Right now there's only one concurrency mode: Real Time.  In the coming months we'll be introducing two others for fast paced warehouse environments.

One of my favorite enhancements in this release involves reworking the way availability is calculated for suggestions.  Some customers with lots of suggestions were reporting long wait times as availability was calculated.  We redesigned this to be a lazy loading process on a per suggestion type basis.  We also added the availability meatball to the suggestion dialog so the way availability is displayed remains consistent.

Those of you who frequently interact with administrative screens may notice that many of the admin console menu options have been moved into the workbench.  This is part of an ongoing process to phase out Struts, as Roger Diller - the developer heading up this effort - noted in his blog post dated August 31.  Struts is an older MVC framework introduced with Shoptick E.  As we get ready to move toward a REST/JSON oriented architecture for our mobile back end, we felt it was wise to start clearing out the cobwebs so we don't need to run two MVC frameworks side by side.  For the curious, we're probably going to run the REST back end we'll use to support iPhone/iPad devices on Spring-MVC.

A little Easter Egg that's come out of this refactoring relates to how the Performance Monitor was moved from HTML to Flash.  We introduced the first glimpse of our new dashboard architecture.  To access it, put the workbench in debug mode by adding ?debug=true to the URL and goto Flex > Performance Monitor.  This was a nice bit of initiative on Roger's part and I think it turned out great.  Can't wait to see how it looks when we bring back the dashboard.  (Remember Flex/Shoptick had the industry's first dashboard way back in 2007.)

Coming Up Next

The calendar redesign put us in the weeds and a large number of Fast Track projects have stacked up, so the next few releases will be dedicated to custom development for Fast Track customers, many of which are small tweaks and enhancements.  We'll also be monitoring customer logs for error reports and I would anticipate a fairly frequent number of maintenance releases over the coming weeks.  It's been several months without a release, which I want to emphasize is not really how we prefer to do things.  We prefer small, frequent, incremental releases and until the next big redesign project, we're planning on getting back to that way of doing business.

If the upcoming part of the development roadmap has any overall themes, they would include adding scheduling and planning tools to the labor/crew list section, adding scan concurrency modes for fast paced warehouse environments, and multi-session event planning.  We're also planning a big upgrade of our internal development tools and a redesign of the security architecture we use to administer and support customer systems.

I'm also sneaking in support for lighting paperwork.  Since I worked myself into a desk job a decade ago supporting people who do what I used to do, I've missed being out in the field and I volunteer with a local community theatre as a lighting designer a few times a year to stay current.  I really don't want to buy Lightwright just for two shows a year, so with no offense intended to John McKernon, I'll be introducing configuration options that will enable lighting paperwork to be incorporated into pull sheets - with custom reports for Instrument Schedules, Channel Hookups, etc.  Our customers may find this useful, but it's really just for me at this point.  One of the perks of being the software developer is that you can occasionally slip something in for yourself.

In Conclusion

I'm proud of what we've accomplished in this release.  Roger, Suman and Courtney have worked incredibly hard to get it ready and tested.  The calendar redesign did blow our schedule and make it tough to set expectations about when the release would be ready.  For those who wondered what was taking so long - it was the calendar and subrental testing.  In the future we'll introduce code branching for major rework so that critical bug fixes aren't delayed by other unrelated work and we're also introducing some process changes to ensure that we get back to the tight, frequent release schedules our customers had come to expect prior to 4.5.

Thanks to our growing customer base for their support thus far and their patience.  If any problems arise tomorrow morning, please let us know.  If you'd prefer to evaluate the release prior to deployment, please contact support and we'll take you off the automatic push list and arrange a private beta site.