Friday, January 20, 2012

Leaks By Reference

Yesterday I started working on a strange bug reported by Go Vision in which the schedule screen locked up loading schedule data for one of their LED tiles.  For context, we're talking about this screen:

Usually when we see the Flash workbench throw up the "Loading..." modal and lock up, it means an ActionScript error occurred behind the scenes.  Usually it's a simple error, like a Null Pointer Exception.  We run the test case with our debug version of the Flash Player and it takes us right to the source.

Not in this case.  When I got around to unpacking this bug in a test environment, we saw a server side error: a MySQL Packet Size exception when executing the conflict query - which provides raw data for the availability engine to perform its calculations with.  This was very odd, because conflict/availability queries aren't that large, nowhere near large enough for MySQL packet sizes to be an issue.  We've seen packet size issues before, but only as they related to blobs stored using Flex's document elements - things we expect to be big on occasion.

It made no sense at all for a simple conflict query to be anything larger than a few KB at the most.  So rather than skirt the issue by increasing the max packet size, I set about figuring out why the query was so large in the first place.  What I found is that the query started small, but gradually grew bigger and bigger until it became too big for reasons which will soon become apparent.

But first, some background.  The availability data used to fill in the schedule isn't one availability query, but one for each day.  The facade function that coordinates assembling all the data needed by the calendar UI loops over the range of visible days on the calendar and runs an availability query for each day.  The Java code below shows the basic structure of the main loop - with some technical details removed for clarity:

ManagedResource resource = getResourceService().findById(resourceId);
Collection<ManagedResource> resources = new ArrayList<ManagedResource>();
while (currentDate.before(endDate)) {
      if (currentDate.before(now)) {
                //grey out the past
      else {  
                result = findAvailableQuantityResults(collab, resources, dates);                                      //serialize result                         
      currentDate = DateUtils.add(currentDate, Calendar.DATE, 1);

For more clarity, I've pasted in the section of the code that generates the conflict query below -- the query that was somehow exceeding MySQL's max packet size:

            StringBuffer idBuffer = new StringBuffer();
            for (ManagedResource rc : resources) {
                rcMap.put(rc.getObjectIdentifier(), rc);
            idBuffer.deleteCharAt(idBuffer.length() - 1);
            StringBuffer query = new StringBuffer();
            query.append(", L.managed_resource_id");
            query.append(" FROM st_prj_project_element_line_item AS L");
            query.append(" WHERE L.snapshot = 0 AND L.is_deleted = 0 ");
            if (confirmed) {
                query.append(" AND L.is_conflict_creating = 1 ");
            query.append(" AND L.managed_resource_id in (");
            if (startDate != null) {
                query.append(" AND ( (L.pickup_date <= ? AND L.return_date >= ?)");
                query.append(" OR (L.pickup_date >= ? AND L.pickup_date <= ?)");
                query.append(" OR (L.return_date >= ? AND L.return_date <= ?) )");
When the schedule assembly loop calls findAvailableQuantityResults(), one way or another this query gets assembled and executed.

Now, when I started examining the error in the Eclipse debugger (ironic that our Java development IDE has the same name of one our competitor's products), I started by looking at the value of the query variable when the Packet exception gets thrown.  It was so large, Eclipse couldn't even display it.  But looking at the code that concatenates the various parts into a finished query, it wasn't hard to spot the only part of the query that could have an impact on the query size: the idBuffer.  So I switched from looking at the query to the resources collection that we loop over to assemble the idBuffer (which is a list of items for which availability is being checked -- this allows us to check the availability of multiple items in one query).  Couldn't really browse through the resources collection either, but after a taxing delay, I was able to ascertain the size: over 62,000 elements, meaning somehow the resources collection passed into the findAvailableQuantityResults() function had magically grown from 1, when we initialize it just before entering the day loop, to over 62,000.

This was enough to tell me that we had a pass-by-reference style problem and that the availability engine was somehow manipulating the resources collection after it's passed in. (And I know it is, because this is how we process component availability for kits and packages.)  I setup some logging to output the size of the resources collection for each pass.  It started at 1 and grew exponentially until the max packet size exception stopped the experiment.  This means that the availability engine manipulated version of the resources collection became the input value on the next loop iteration, which, as we've seen, wreaks havoc. 

In lower level languages like C++, whether to pass a variable by value or by reference is something you think about quite often.  You can pass variables or their pointers; you must choose which.  In Java, everything's passed by reference except primitives.  Developers don't have any say in the matter.  You can't pass anything by value unless you somehow clone or copy it first.

Needless to say, that was the solution here, making a fresh copy of resources collection for each pass, so:

result = findAvailableQuantityResults(collab, resources, dates);

         result = findAvailableQuantityResults(collab, new ArrayList(resources), dates);

After this one line code change, the schedule fired right up, and for obvious reasons (O=1 is better than 0=n^n), loaded must faster to boot.

So, my thanks to Go Vision for stumbling onto the right combination of settings to make this error something big enough for us to see.  What was a functional bug fix for Go Vision became a major performance enhancement that will benefit everyone.

No comments:

Post a Comment