Wednesday, September 26, 2012

Time Zones and Checkstyle

After the release of 4.5 and the first maintenance release (4.5.4), we started on the long delayed and much needed project of upgrading our internal build and management tools.  We upgraded the version of Ubuntu we use for the development server and a whole host of tools we depend on, including Jira, Nexus, Bamboo, Confluence and Subversion.  (The move to GIT is scheduled for next year.)

Elastic Build Agents

We also changed the way we do our continuous integration builds.  Previously, the builds would run on the same machine as the continuous integration server (Bamboo).  The downside of this approach is that the builds would drain resources from the build server, slowing it down (which made things slow for us - not customers, which are on different servers.)  We've also wanted to start running performance statistics on regression tests in order to catch sudden deviations from previous performance patterns.  This is hard to do accurately in anything but a cleanroom environment - so having a dedicated build server made the most sense.

On the other hand, we felt that running a dedicated build server might be wasteful because, on average, we're only running builds for 3-4 hours per day, which means that the rest of the day the server would sit idle.

Luckily, the new version of Bamboo has been designed to work with EC2 and by defining a server image for running builds, Bamboo will automatically spin up a new build server when builds are in the queue and drop the instance when the build queue is empty.  After some trial and error to find the right build server image, we got this up and running and now all builds are running on dynamically instantiated or elastic build servers.

The Final Four

Once we got the kinks worked out of the new build environment, all modules were building and all tests were passing - except four availability tests.  These tests worked fine on the old servers and were suddenly failing in the new environment.  We felt it was some kind of environment issue and weren't overly concerned about it, but we don't screw around with availability math and so we weren't about to proceed with development until we knew for sure, until all tests were green.

After examining the specific tests, I first suspected that Hibernate wasn't flushing sessions and therefore the conflict query might have been getting stale data.  I tried to deal with this by creating a new AOP aspect that would intercept the test fixture setup and assertion methods for the functional test suite - and force each one to use its own Hibernate session.  This mirrors how the code being tested runs in real life - and we probably should have done it a long time ago.

The new session management aspect worked fine - except that it didn't fix the four failing tests.

Time Zones

That brings us to about 3:30 AM on Tuesday morning.  Just as my head hits the pillow, I start to wonder if maybe running the build servers on Universal Coordinated Time (UTC) might have something to do with it.

The next morning (five hours later), I get up and check all the test code for spots where the Java Calendar object is used without setting a time zone.  I also go back to our localization code and revert a change I had made to fix a failing test during the build server changeover.  Here it becomes obvious that running the build servers on UTC is the issue - or rather unmasked the issue.

Working most of yesterday trying to find code where un-localized dates and times might have been worked with, it becomes obvious that we're dealing with a large issue and lots of spots in the code that will need to be touched.

In essence, we need to make sure that every instance in the code of...

Calendar cal = Calendar.getInstance();

...becomes...

Calendar cal = Calendar.getInstance(timezone, locale);

Not only that, but we need to make sure this mistake can never get made again.

Checkstyle

It's similar to what we did with number and date formats way back when we purged a great many of the America-centric assumptions about date and number parsing from the code.  To prevent us from reverting to bad habits, we added a custom module to checkstyle that caused a build failure if java's built-in date and number parsing classes were used directly instead of our newly created localization service.  This worked well and has kept peace for some time.

So, this morning I added a new checkstyle module that applies the same philosophy to Java Calendars.  It scans the code for every spot where calendars are instantiated.  If the developer uses the no argument insantiator instead of the one that takes a timezone and locale, the build will fail.  I also added checks for TimeZone.getDefault() because UTC is never a valid timezone in Flex user space.  And I wrapped it up by making sure nobody ever calls setTimeZone() on an instantiated calendar  - because that could undo all the good things done by instantiating the calendars with time zones.

I did this by knocking together a quick little module that uses Checkstyle's parser architecture.  The relevant bit is shown here:

    protected void visitMethod(DetailAST token) {
             
       
        DetailAST methodCall = token.getFirstChild();
        DetailAST invocationTarget = methodCall.getFirstChild();
        DetailAST methodName = invocationTarget.getNextSibling();
       
       
        if (invocationTarget.getText().equals("Calendar")) {
            if (methodName.getText().equals("getInstance")) {
                DetailAST argList = methodCall.getNextSibling();
                DetailAST arg = argList.getFirstChild();
                if (argList.getChildCount() != 3) {
                    log(token.getLineNo(), "Unsafe Calendar instantiation.  Use Calender.getInstance(TimeZone, Locale) instead.");
                }
            }           
        }
        else if (invocationTarget.getText().equals("TimeZone")) {
            if (methodName.getText().equals("getDefault")) {
                log(token.getLineNo(), "Default Time Zone Instantiation.  Use LocalizationService.getSystemTimeZone() instead.");
            }
           
        }
        else if (invocationTarget.getText().length() >= 3 && invocationTarget.getText().toLowerCase().endsWith("cal")) {
            if (methodName.getText().equals("setTimeZone")) {
                log(token.getLineNo(), "Time zone override in calendar.  Initialize calendar using desired time zone instead of setting later.");
            }
        }
       
       
    }
It's not fancy, but it does the job.

With this new module in place, the build server will flag all the spots in our code that need to be changed and we'll simply iterate over the code until all modules pass the checkstyle test.

The Bug

Switching the build server environment from Mountain Time to UTC revealed a pretty serious availability bug - one that we think impacts calendars (as opposed to quotes and pull sheets) because the time zone glitch may have impaired the system's ability to determine when days start and end - potentially turning your days into British days since all cloud servers also run on UTC.

It's never fun to find a glitch in availability or calendar math, but this is exactly why we do regression testing - to quickly discover any negative ripple effects from code or environment changes.  The hope at this point is that once all the unlocalized references to calendars are cleaned up, that the final four tests will pass.

Going forward the build environment should prevent further regressions and the new checkstyle module should prevent the return of bad localization habits.

No comments:

Post a Comment