Tuesday, January 31, 2012

The Availability Cache


Having done software development in other industries ranging from insurance to manufacturing to nuclear security, I'm always amazed at how well the rental industry stacks up to those industries in terms of complexity and uniqueness.  In many respects, the rental industry is like any other: we make customer contacts, market our wares, provide quotes, take orders, fill orders and ship orders.  Pretty standard business well within the capability of SAP, Oracle Fusion, or QuickBooks, right?  Not quite.  There is one little wrinkle that makes the rental business different.  The stuff we ship comes back.  That one caveat renders the big ERP packages a poor fit for our industry and borderline useless in many cases.

The number one benefit anyone purchasing rental software hopes to derive from it is a clear way of managing availability.  Rental software is only as good as the availability math and the availability math is complicated.  For a single location without purchasing or subrentals factored in, it's not so bad.  You can just run a database query over a date window, find which point in the date range the utilization is the highest and subtract that number from the quantity owned.  You can also use a running total technique.
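
For the curious, here's a minimal sketch of that single-location math in Java.  The names here (Booking, quantityOwned) are illustrative stand-ins, not our actual schema:

    import java.time.LocalDate;
    import java.util.Arrays;
    import java.util.List;

    public class PeakUtilizationSketch {
        // Hypothetical stand-in for a conflict-creating line item.
        static class Booking {
            final LocalDate pickup, ret;
            final int qty;
            Booking(LocalDate pickup, LocalDate ret, int qty) {
                this.pickup = pickup; this.ret = ret; this.qty = qty;
            }
        }

        public static void main(String[] args) {
            int quantityOwned = 10;
            List<Booking> bookings = Arrays.asList(
                new Booking(LocalDate.of(2012, 2, 1), LocalDate.of(2012, 2, 5), 6),
                new Booking(LocalDate.of(2012, 2, 4), LocalDate.of(2012, 2, 8), 3));

            // Walk the window one day at a time, sum the quantities of bookings
            // that overlap each day, and track the peak; availability is what's
            // owned minus that peak.
            LocalDate start = LocalDate.of(2012, 2, 1), end = LocalDate.of(2012, 2, 10);
            int peak = 0;
            for (LocalDate day = start; !day.isAfter(end); day = day.plusDays(1)) {
                int used = 0;
                for (Booking b : bookings) {
                    if (!b.pickup.isAfter(day) && !b.ret.isBefore(day)) {
                        used += b.qty;
                    }
                }
                peak = Math.max(peak, used);
            }
            System.out.println("Available: " + (quantityOwned - peak)); // prints 1
        }
    }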

This is complicated enough, but it's woefully oversimplified given the realities of the average rental house.  In that reality equipment is transferred between shops, often directly from show to show without returning to the shop.  Equipment breaks and must be taken out of service, which can have ripple effects on future jobs.  Amps can live inside cases and still be a part of inventory, although for the purposes of availability they should be invisible unless the customer wants the whole amp rack -- or not, as with our new storage container feature.  Items tracked by serial number still need to be factored into the math when querying availability for the parent item.  In some cases, an item isn't really an item at all, but a composite or kit whose availability is the minimum availability of its component parts.  And so on.  These are but a few of the giant pile of special edge cases we account for in the Flex availability algorithm.

Then there's the matter of what you, as the user, will do with the availability math.  Most of us don't say 'no' when a customer inquires about availability and gear is out.  We say yes and try to cover the shortage with a subrental or by purchasing the equipment, maybe transferring it from another shop.  Any system that attempts to give a true picture of availability has to account for all of this.

At Flex, as we added these layers of detail to our availability calculation, we noticed it taking longer and longer.  Each individual calculation is pretty fast, but when hundreds of calculations have to run for larger quotes or pull sheets, it can take quite a while for them all to complete.

Last week we set out to address this issue.

The Developer's Toolkit

When I first started working in professional software shops I hoped to learn some neat technical tricks for improving performance, some secret sauce the pros use to make things zip.  Over time, as I learned these tricks, particularly those available to those of us who write database-driven business software, I was a bit disappointed.

In most database applications, not surprisingly, the database is the bottleneck.  Scrutinizing the code for Big O improvements is fine, but tuning, in my experience, always comes down to how the database is queried and when.  With this in mind, the performance tuner's list of options is limited:

  • Do something only at the exact moment you need to do it (lazy fetching)
  • Do something before you need to do it (eager fetching)
  • When you do something, preserve it so you don't have to do it again (caching)
You can also combine eager fetching and caching by loading commonly used data or resources when the system starts up.  When you launch Photoshop for example -- and the splash screen pops up and you wait for it to open -- this is what it's doing.

That's our toolkit. 

Profiling

The first rule of performance tuning in software is not so different from carpentry: measure first, fix once.  This stems from the tendency of developers to make assumptions about what aspect of the code might be slow and spend fruitless hours iterating over something that may look inefficient, but in practice contributes very little to the overall slowdown.

At Flex we use a profiling tool called YourKit.  When we started profiling the availability speed, using data from the customer that brought the issue to our attention, the profiler allowed us to quickly identify some big -- and easily fixed -- bottlenecks, usually by switching something from eager fetching to lazy fetching or vice versa.

The one nut we could not crack by tuning data fetch strategies was something we call the double conflict filter.  In situations where a quote is used to generate a pull sheet and the scan process creates a manifest, control of availability is passed down the line to what we call downstream elements.  The double conflict filter ensures that if a quote, a pull sheet and a manifest are all present, a given item is not counted more than once.  This is a recursive operation.

When we fired up the test availability scenario in our profiler, we saw that the real bottleneck appeared to be the hasDownstreamMatch() method, as shown in this screenshot:



The other, longer running methods shown in the profiler all wrap each other; it's almost like a reverse view of the call stack, so we know that the hasDownstreamMatch() method is the longest running single component of the higher level methods.

The hasDownstreamMatch() method is a recursive method a line item can use to determine if there's a matching line item downstream in the workflow.  The method checks all downstream lines for the matching criteria (resource types, status, etc.), then asks each of those lines for their own downstream matches -- and so on.  The downstream line items used by the calculation are lazy fetched, meaning they are retrieved on demand.  Since the method is recursive, each recursion triggers its own lazy fetch operation, making this a somewhat atypical variant of the N+1 select problem, where N is the depth of the tree if we're lucky.  Worst case, N could run much higher -- into exponential territory -- though in practice the downstream line item graphs don't branch out much and N stays close to the tree depth.
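
Stripped of the real matching criteria, the shape of the method looks roughly like this (hypothetical names; the actual implementation checks far more than this):

    // Recursive downstream-match sketch.  The catch: getDownstreamLines() is a
    // lazy Hibernate association, so every level of recursion fires its own
    // SELECT -- the N+1 pattern, with N (at best) the depth of the tree.
    boolean hasDownstreamMatch(LineItem line, MatchCriteria criteria) {
        for (LineItem downstream : line.getDownstreamLines()) { // lazy fetch here
            if (criteria.matches(downstream)) {
                return true;
            }
            if (hasDownstreamMatch(downstream, criteria)) { // and here, recursively
                return true;
            }
        }
        return false;
    }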

It seemed like a perfect opportunity to take a case where lazy fetching causes an N+1 select problem and turn that N+1 into constant time with an eagerly fetched join.  The problem is this would require something called a recursive join, and recursive joins aren't supported by Hibernate (our O/R mapping framework).  Some databases do support recursive joins, but not MySQL.  Even so, the typical tree depth never runs much higher than four or five, so we could potentially join a table to itself a fixed number of times and get close enough.  We tried this.  It replaced several fast queries with one slow query, and bypassing Hibernate meant hydrating the ids returned by the query back into objects.  The net effect was that the performance bottleneck moved elsewhere, plus the join query was slow -- not surprising since we were pushing our luck with MySQL in the first place.

So we found ourselves with a bottleneck that could not be fixed with tuning fetch strategies.  It was time to contemplate caching.


Caching

We cache things all over the place in Flex.  If you check out the Flex Admin Console, you can peek under the hood, see what Flex is caching and even clear the caches if you want to.

The difference is we tend to cache things that hardly ever change, like project element definitions and configuration settings.  There's no point going back to the database every time we need a configuration setting that might change once every four years.  We've also recently introduced a second level Hibernate cache, which helps Hibernate do its internal work faster.

We'd always thought about caching availability information -- after all, there's no point recalculating something if the underlying data used to calculate it hasn't changed -- but something about this always made us nervous.  Caching availability is one thing, but what about keeping the cache up to date?  Stale availability data is unacceptable.  The concern for us was always missing a condition that should trigger a cache eviction.

Yet, there was no other way around our recursive join issue.  Caching availability had to be on the table.  We knew right away the real trick to making an availability cache fast and reliable was the right cache eviction strategy.

Cache Eviction

Cache eviction refers to the process of throwing stale data out of the cache.  For the availability cache, it means determining which events would render which availability calculations invalid.  All the caveats and edge cases that make the calculation slow enough to warrant a cache also make the cache eviction events complicated.  We found ourselves facing a situation where just determining when to evict something from the cache might be slow.  I believe Joseph Heller coined a phrase for situations like that.

Then we hit upon an idea: what if instead of trying to build the perfect, surgical cache eviction system, we opted for a simple approach that erred on the side of caution and evicted all data that might be stale?  This would mean evicting some data from the cache that doesn't need to be evicted, but it would ensure that all stale data goes along with it while still preserving most of the benefits of an availability cache.

Our strategy became a simple one: whenever any information about a given resource changes, evict all cache entries that reference that resource.  (Resource is the generic term in our architecture for schedulable things like contacts, facilities, serial numbers and inventory items.)

If a serial number is added, evict all cache entries that reference the serial number and its parent item (because adding a serial number would change the on hand quantity).  Whenever a line item is modified or created, evict all cache entries that reference whatever resource the line item referred to.  Notice I said evict all resource entries, not just the line item.  That means if you have a quote with a VL-2500 on it, and you change the quantity from 1 to 2, all other line items that reference VL-2500 will have their cache entries evicted.  Same thing if you change a quote's date range or locations; the cache entries for all items referenced on the quote will need to get purged from the cache, not just the cached records for the quote you changed.  This should come as no surprise.  After all, rental availability is about how different jobs impact each other; no job exists in a vacuum.
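
In SQL terms the strategy is deliberately blunt.  Here's a sketch of what a single eviction amounts to -- the table and column names are invented for illustration, not our real schema:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class EvictionSketch {
        // Blunt eviction sketch (invented table/column names): when anything
        // about a resource changes, drop every cache entry that references it.
        static void evictForResource(Connection conn, String resourceId) throws SQLException {
            PreparedStatement ps = conn.prepareStatement(
                    "DELETE FROM availability_cache WHERE resource_id = ?");
            try {
                ps.setString(1, resourceId);
                ps.executeUpdate();
            } finally {
                ps.close();
            }
        }
    }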


Implementation

First off, we opted for a database cache, with the potential for an in-memory cache to come along later.  (High Availability Flex will have a distributed cache and we'll tackle the problem then.)  We started, rather naively, by adding a column to the line item table for the cached value; the cache eviction process would simply set this value to null.  Needless to say, we bypassed Hibernate and went with straight SQL.

We built the whole cache architecture around this approach, hooked in all the eviction events and began testing it.  The problem we ran into in beta testing was MySQL table locks.  The line item table is the most active table in Flex, and cache eviction queries were failing intermittently or locking other operations out.  This manifested itself as occasional scan failures.  Clearly, this approach wasn't working.

Luckily, the solution wasn't all that terrifying.  We just moved the cache to its own table with no foreign key relationships (although we did add indexes in place of foreign keys for speed).  The reason we didn't do this before is that adding a cache entry for a line item for which no cache information has ever been stored normally requires two queries: one to determine if the line is present and another to insert or update it.  We fine-tuned this by skipping the select query and using the number of updated rows returned by the standard update query to determine if an insert is necessary.  If the number of updated rows is zero, we know we need to run an insert query to create the cache entry.  This has worked well so far.
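
In rough JDBC terms, the update-first trick looks like this (a simplified sketch with invented column names; the real code lives in our DAO layer):

    // Upsert sketch: try the UPDATE first; only if it touched zero rows do we
    // INSERT.  At most two queries, and usually just one.
    static void writeCacheEntry(Connection conn, String lineItemId, int availableQty)
            throws SQLException {
        PreparedStatement update = conn.prepareStatement(
                "UPDATE availability_cache SET available_qty = ? WHERE line_item_id = ?");
        update.setInt(1, availableQty);
        update.setString(2, lineItemId);
        int rows = update.executeUpdate();
        update.close();

        if (rows == 0) {
            PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO availability_cache (line_item_id, available_qty) VALUES (?, ?)");
            insert.setString(1, lineItemId);
            insert.setInt(2, availableQty);
            insert.executeUpdate();
            insert.close();
        }
    }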

Cache Eviction Paranoia

Even so, all those queries that failed due to table locks during beta testing left us a little wary of just trusting that every eviction query would magically succeed now that we had a brand new table just for the availability cache.   If adding an entry to the cache failed, no biggie; we'll just have to recalculate it next time.  If a cache eviction query fails, however, that could mean users getting stale availability numbers.  So we built a rather elaborate system for retrying failed cache evictions. 

To begin with, a cache eviction triggering event can tell the cache management service whether or not the eviction must happen before the user transaction that triggered it completes.  A line item quantity change is a good example of this: the cache must be evicted immediately to trigger a recalculation.  In other situations, such as changing a non-serialized quantity or adding a serial number, it might be okay if the cache eviction doesn't happen for a second or two.

However the cache eviction is initiated, either in the user's thread or in another thread, if it fails, it must be reattempted; and when this happens we want to give the system some time to breathe.  Trying the exact same thing the millisecond after it failed is likely to produce the same result, and a cascade of deadlocks if things get really out of hand.

So we built a retry queue and made it a persistent JMS queue.  This means that if the first cache eviction attempt fails, information about the cache eviction request is serialized as XML and placed on a retry queue.  A process listens to the queue, and after a respectable delay (breathing room), unpacks the message and tries the eviction attempt again.  If it fails again, it goes back on the queue until the max retry threshold is reached.  (Which is 10, by the way.)  The idea here is to stretch the retry attempts out over a fairly long period and increase the chance that whatever intermittent condition that caused the first eviction attempt to fail will have cleared up.
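
A skeleton of the retry listener, assuming plain JMS -- the class names and helpers here are illustrative stand-ins, not our actual code:

    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    // Illustrative retry listener: failed evictions come off the persistent
    // queue after a delay; anything still failing after MAX_RETRIES is dropped.
    public class EvictionRetryListener implements MessageListener {
        private static final int MAX_RETRIES = 10;

        @Override
        public void onMessage(Message message) {
            try {
                String xml = ((TextMessage) message).getText();
                EvictionRequest request = EvictionRequest.fromXml(xml);
                if (!tryEviction(request) && request.attempts < MAX_RETRIES) {
                    requeueWithDelay(request); // breathing room before the next try
                }
            } catch (JMSException e) {
                // couldn't even unpack the message; log it and move on
            }
        }

        // Stand-ins for the real eviction call, requeue logic and XML payload.
        private boolean tryEviction(EvictionRequest r) { return false; }
        private void requeueWithDelay(EvictionRequest r) { }
        static class EvictionRequest {
            int attempts;
            static EvictionRequest fromXml(String xml) { return new EvictionRequest(); }
        }
    }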


Testing Cache Eviction

Testing when and if the cache was evicted presented a unique problem for Courtney, our one-woman QA team.  The availability numbers in quotes look the same to end users whether they were retrieved from cache or not.  Courtney needed an easy way to tell which numbers came from cache and which were recalculated on demand.

Flex already has a number of test utilities and hooks built into the client, but you have to put the client in debug mode to access them.  How do you do that?  You add &debug=true to the query string.  When you do this and load up a quote, you'll be able to tell if the availability numbers you're looking at came from cache or not by hovering over the number; the tooltip will say either 'Cached' or 'Recalc'.

(Note that if you double click on the availability numbers and open the conflict dialogue, this will trigger a recalculation and update the cache -- because the conflict dialog requires detailed information that isn't stored in the cache.)

Asynchronous Cache Updates

Because large cache evictions have the potential, in theory, to hold on to table locks for a few seconds, we didn't want recalculations that trigger cache updates to have to deal with this possible lock contention.  The solution was to hand the cache update off to a separate thread and let the user go on with their life.  This means that even if the cache eviction from hell locks the cache table for five seconds or more, that waiting is done by the thread and not by the user.  We also decided to serialize cache updates to control the number of simultaneous update threads, thereby minimizing the number of concurrent threads requesting table locks.  Here's a snippet of code that deals with cache updates:

    if (threaded) {
        executor.execute(new Runnable() {
            @Override
            public void run() {
                doCache(resourceId, lineId, availability, compositeAvailability, expiry);
            }
        });
    }
    else {
        doCache(resourceId, lineId, availability, compositeAvailability, expiry);
    }

To anyone familiar with Java concurrency, this is very basic stuff.  We have a method elsewhere in the service class that does the work and we simply declare an inline Runnable and throw it on the Executor, which manages the threads for us.
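
As for serializing the updates, that falls out of how the executor is constructed.  A single-threaded executor queues updates and runs them one at a time -- a sketch of the idea, though we won't claim this is our exact wiring:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // One worker thread drains cache updates in order, so at most one update
    // is contending for cache-table locks at any given time.
    ExecutorService executor = Executors.newSingleThreadExecutor();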

Precalculation

With Chris away doing trade shows, there was no one around to stop us from gilding the lily, so we took the cache one step further and created a background agent to precalculate availability.  You can see the settings for this agent (in 4.4.17 and up) under Projects > Availability Cache. 

What the precalculation agent does is wait until the server is idle, make an educated guess about which quotes or jobs are likely to be viewed by a user, and precalculate the availability for those jobs.

There are two bits of logic that make this work.  The first is determining when the server is idle; we wouldn't want to waste CPU on speculative calculations when users are on the system.  The other is the guesswork used to determine which calculations are most useful to perform, because it would be pointless to precalculate everything.

For the first part, determining when the CPU is idle, we present the user with three important settings.  The first is Max CPU Threshold, or the percentage of CPU utilization above which precalculation should not be running.  The second setting is Agent Check Interval and this determines how often the agent should check the CPU load.  This is used in conjunction with the Min Quiet Period setting to determine when to start the precalculation agent.  For example, if you have a CPU threshold of 20%, an agent check interval of 5 minutes and a quiet period of 3, this means that the agent will check the CPU every five minutes and if the CPU is below 20% three times in a row, the precalculation will start.  In practice, this would mean that precalculation would start approximately fifteen minutes after the users go home for the day.  When the CPU goes back up above the max threshold, the process will stop.
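
The quiet-period logic boils down to a small counter.  Roughly (an illustrative sketch -- the setting names mirror the UI, the rest is invented):

    // Illustrative sketch of the quiet-period check, run once per Agent Check
    // Interval.  getCpuLoadPercent(), startPrecalculation() and friends are
    // stand-ins for the real probes and agent controls.
    private int quietChecks = 0;

    void checkIdle() {
        if (getCpuLoadPercent() < maxCpuThreshold) {
            quietChecks++;
            if (quietChecks >= minQuietPeriod) {
                startPrecalculation();  // e.g. three quiet checks, five minutes apart
            }
        } else {
            quietChecks = 0;            // load spiked: reset the count...
            stopPrecalculation();       // ...and stop any running pass
        }
    }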

The next bit of logic, guessing which jobs to precalculate, is done by first looking at the recent items for all users.  Any job a user has recently looked at will get calculated first.  Once all these are done, jobs starting within a certain number of days of the present will be calculated.  The default is 30 days, but this can be adjusted by changing the read ahead period.  The read ahead period can be changed to months -- or even years ahead if you can afford the electricity.

In Conclusion

The availability cache has cleared testing and will be deployed as part of the Flex 4.4.17 release, which is due to be pushed this week.  In testing we've noticed a significant increase in the speed and overall usability of the system.  It was a fun project to work on and we hope Flex users will find it a welcome addition, even if they never really know it's there.

Monday, January 23, 2012

DNS and SOPA

In light of the brouhaha over SOPA and some recent DNS changes we made at Flex, I found myself in the position of explaining DNS to my family and others over the weekend -- and it occurred to me that DNS is one of those aspects of the Internet that very few lay people know about, and even those of us who work in technology every day only have a cursory understanding of it.

The various political perspectives on SOPA have been well covered.  I can't really add much to the debate and this is a technical blog, not a political blog.  With that in mind, I'll stop at pointing out that SOPA has a number of enforcement mechanisms subject to preliminary injunctions (meaning the claims against a web site don't have to be proven in court before the site's taken down), but one of the most alarming is the one related to DNS filtering.  Here's the relevant portion of the bill (H.R. 3261) in its current form:

A service provider shall take technically feasible and reasonable measures designed to prevent access by its subscribers located within the United States to the foreign infringing site (or portion thereof) that is subject to the order, including measures designed to prevent the domain name of the foreign infringing site (or portion thereof) from resolving to that domain name's Internet Protocol address.
This means creating a DNS blacklist.  Rather than expatiate on why this is or isn't a good idea, let me explain how DNS works first and save conclusions for later.

IP Addresses

First off, the servers on the Internet -- and all the client machines that browse it -- are assigned unique numbers called IP addresses.  You've probably seen these before.  They look like this: 66.29.188.245.  They have four numbers separated by dots.  Each number is called an octet, because each number actually represents a byte, which is eight bits -- hence 'octet'.  Since each IP address has four octets and each octet is 8 bits, full IP addresses are 32 bits.  And since each bit can only have two discrete values (0 and 1) and 2 raised to the 32nd power is around 4.3 billion, that means there can only be about 4.3 billion discrete IP addresses.

Last time I checked there were roughly 7 billion people in the world, many of whom have devices that need IP addresses: laptops, cell phones, iPads, desktop computers, video game consoles -- and increasingly, consumer electronics.  My Blu-Ray player and Roku box, for example, connect to the Internet, which means they also need IP addresses.  Then there's the server side of the equation: millions of servers in concrete bunkers, all of which also need their own unique address, as does every router, firewall, wi-fi hotspot and networking appliance in between.  A technique called Network Address Translation can be used to stretch the limits of scarce IP addresses, but the problem remains.

The telephone industry was hit with the same problem not long ago, when cell phones, pagers and fax machines started to tax the quantity of available phone numbers.  The phone companies responded by introducing more area codes.  The Internet world responded by introducing a new kind of IP address called IPv6.  (The old kind is called IPv4.)  IPv6 raises the number of bits from 32 to 128.  The total number of addresses in a 128 bit space is on the order of 10^38 -- roughly 340 undecillion -- a number so big that most of us have never even heard the word for it.  That should hold us for a while.

How all this relates to DNS is that IPv4 addresses are hard enough to remember.  Take a look at an IPv6 address: [2001:db8:85a3:8d3:1319:8a2e:370:7348]. Good luck memorizing that or writing a catchy jingle to drive customers to your web site.  Any marketing strategy that requires the consumer to understand hexadecimal might be doomed to failure.  Just sayin'.


We Need Names


IP addresses govern how network devices are identified and how information moves between them.  But we give web sites names like google.com because it's much easier to remember those names than IP addresses.

We used to map these obscure addresses to server names using a text file called hosts or hosts.txt.  This file was (and is) a simple list of IP addresses mapped to the names by which they can be referred.  If you search your computer for this file, you'll find it hidden away somewhere.  Here's the one from my computer:
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1       localhost
192.168.1.101   local-dev-db
255.255.255.255 broadcasthost


Each computer had a copy of this file.  Every computer still does in fact, though it's only used to override DNS these days.

Domain Name System

Then along came DNS, or the Domain Name System, which is, in essence, a distributed database of IP addresses and their names.  When you type in a web address, like www.flexrentalsolutions.com, your computer must first determine the IP address to connect to.  It does this using a piece of software called a resolver.  The resolver first checks the hosts file on your computer.  If the domain name isn't found there, as it likely won't be, it refers the request to your DNS server.
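
You can watch this happen from just about any language.  In Java, for example, a single call hands the whole lookup -- hosts file first, then DNS -- to the platform resolver:

    import java.net.InetAddress;

    public class ResolveDemo {
        public static void main(String[] args) throws Exception {
            // The platform resolver checks the hosts file, then the configured
            // DNS servers, all behind this one call.
            InetAddress address = InetAddress.getByName("www.flexrentalsolutions.com");
            System.out.println(address.getHostAddress()); // 66.29.188.245 as of this writing
        }
    }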

You have a DNS server?  Yes, you do.  You have several.  Usually they're set by your ISP or corporate network using a networking protocol called DHCP.  DHCP is part of the handshake that puts your computer on the Internet when you first turn it on or connect to a wi-fi hotspot.  If you want to see what your DNS servers are, you can open a command shell on Windows and type ipconfig /all; on Linux, they're listed in /etc/resolv.conf.  On Mac OS, you can take a look at the Network applet in System Preferences.  Here's what mine looks like:


As you can see, my ISP, Comcast, has assigned me the DNS servers 75.75.75.75 and 75.75.76.76.  I could change these to any value that I like, but I'm not Kevin Mitnick and SOPA isn't law yet.

So, in my case, if I type www.flexrentalsolutions.com into my web browser, my computer will contact 75.75.75.75 or 75.75.76.76 and ask these servers for the correct IP address.  Does this mean that 75.75.75.75 and 75.75.76.76 know the correct IP address?  Maybe, but probably not.  They do know how to get it however.

Because DNS is a distributed database, none of the information exists in one place.  It's scattered all over the planet in caches and zones across thousands of servers.  Let's walk through a simple request and assume (wrongly) that Comcast's DNS servers cache nothing -- not even top level domains.  To begin with, polling all those thousands of DNS servers until you find one with information is not a very practical idea.  There has to be a way to efficiently sift through them for the correct server, and that routing information is the basis of the domain name system itself; it's the reason we have .com and .org and .net.  These are called top level domains.

When my www.flexrentalsolutions.com request comes in, if the DNS server has no cached information, it will first have to determine which server has information about .com domains.  It must ask a root server -- and there are only 13 of them.  They're maintained by the Department of Commerce in conjunction with ICANN.

The root server will return a referral -- a list of NS (name server) records identifying the DNS servers that can be consulted for more information about .com domains.  Using a tool called nslookup on my Mac, here's what I got when I queried the root DNS server operated by the US Army Research Lab for www.flexrentalsolutions.com:

QUESTIONS:
    www.flexrentalsolutions.com, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  com
    nameserver = f.gtld-servers.net.
    ttl = 172800
    ->  com
    nameserver = b.gtld-servers.net.
    ttl = 172800
    ->  com
    nameserver = i.gtld-servers.net.
    ttl = 172800
    ->  com
    nameserver = d.gtld-servers.net.
    ttl = 172800
    ->  com
    nameserver = e.gtld-servers.net.
    ttl = 172800
    ->  com
    nameserver = m.gtld-servers.net.
    ttl = 172800
    ->  com
    nameserver = c.gtld-servers.net.
    ttl = 172800
    ->  com
    nameserver = k.gtld-servers.net.
    ttl = 172800
    ->  com
    nameserver = g.gtld-servers.net.
    ttl = 172800
    ->  com
    nameserver = h.gtld-servers.net.
    ttl = 172800
    ->  com
    nameserver = l.gtld-servers.net.
    ttl = 172800
    ->  com
    nameserver = a.gtld-servers.net.
    ttl = 172800
    ->  com
    nameserver = j.gtld-servers.net.
    ttl = 172800
    ADDITIONAL RECORDS:
    ->  a.gtld-servers.net
    internet address = 192.5.6.30
    ttl = 172800
    ->  b.gtld-servers.net
    internet address = 192.33.14.30
    ttl = 172800
    ->  c.gtld-servers.net
    internet address = 192.26.92.30
    ttl = 172800
    ->  d.gtld-servers.net
    internet address = 192.31.80.30
    ttl = 172800
    ->  e.gtld-servers.net
    internet address = 192.12.94.30
    ttl = 172800
    ->  f.gtld-servers.net
    internet address = 192.35.51.30
    ttl = 172800
    ->  g.gtld-servers.net
    internet address = 192.42.93.30
    ttl = 172800
    ->  h.gtld-servers.net
    internet address = 192.54.112.30
    ttl = 172800
    ->  i.gtld-servers.net
    internet address = 192.43.172.30
    ttl = 172800
    ->  j.gtld-servers.net
    internet address = 192.48.79.30
    ttl = 172800
    ->  k.gtld-servers.net
    internet address = 192.52.178.30
    ttl = 172800
    ->  l.gtld-servers.net
    internet address = 192.41.162.30
    ttl = 172800
    ->  m.gtld-servers.net
    internet address = 192.55.83.30
    ttl = 172800
    ->  a.gtld-servers.net
    has AAAA address 2001:503:a83e::2:30
    ttl = 172800
If I query the same root server just for com as opposed to the full web address, I get the same response, meaning that the root server just maintains a list of servers that are considered authoritative for the .com top level domain.  So we've now gone from the root server to the top level domain (TLD) servers for .com addresses.  I happen to know that these servers are operated by Verisign.  (For a complete list of operators for all Top Level Domains, check here.)  When you register a .com web site, you're essentially paying Verisign to add your domain to their servers or paying a third party like register.com or godaddy to do it for you.

Let's query one of these TLD servers for www.flexrentalsolutions.com and see what we get:
  QUESTIONS:
    www.flexrentalsolutions.com, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  flexrentalsolutions.com
    nameserver = ns-756.awsdns-30.net.
    ttl = 172800
    ->  flexrentalsolutions.com
    nameserver = ns-421.awsdns-52.com.
    ttl = 172800
    ->  flexrentalsolutions.com
    nameserver = ns-1070.awsdns-05.org.
    ttl = 172800
    ->  flexrentalsolutions.com
    nameserver = ns-1830.awsdns-36.co.uk.
    ttl = 172800
    ADDITIONAL RECORDS:
    ->  ns-756.awsdns-30.net
    internet address = 205.251.194.244
    ttl = 172800
    ->  ns-421.awsdns-52.com
    internet address = 205.251.193.165
    ttl = 172800

We're getting warmer.  Verisign's servers can't give us the final answer, but they get us one step closer: they return the DNS servers that are authoritative for our domain.  Let's query one of the servers listed above and see what we get:


    QUESTIONS:
    www.flexrentalsolutions.com, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  flexrentalsolutions.com
    nameserver = ns-756.awsdns-30.net.
    ttl = 172800
    ->  flexrentalsolutions.com
    nameserver = ns-421.awsdns-52.com.
    ttl = 172800
    ->  flexrentalsolutions.com
    nameserver = ns-1070.awsdns-05.org.
    ttl = 172800
    ->  flexrentalsolutions.com
    nameserver = ns-1830.awsdns-36.co.uk.
    ttl = 172800
    ADDITIONAL RECORDS:
    ->  ns-756.awsdns-30.net
    internet address = 205.251.194.244
    ttl = 172800
    ->  ns-421.awsdns-52.com
    internet address = 205.251.193.165
    ttl = 172800
This tells us a few important details.  First, it tells us that we've reached the authoritative DNS server with official information on the flexrentalsolutions.com domain, including some information related to caching.  More on that later.  Now that we know we've reached the authoritative DNS server for our domain, let's find the IP address for www.flexrentalsolutions.com. 

------------
    QUESTIONS:
    www.flexrentalsolutions.com, type = A, class = IN
    ANSWERS:
    ->  www.flexrentalsolutions.com
    internet address = 66.29.188.245
    ttl = 3600
------------
Name:    www.flexrentalsolutions.com
Address: 66.29.188.245

So, there's our answer: 66.29.188.245.  We made it in four moves, which is much simpler than randomly scanning DNS servers until we find one with information.  In reality, when I query the Comcast DNS server, it does all these steps automatically.  This is called a recursive DNS request.

Caching

In practice it's unlikely that a request would go all the way to the root DNS servers, or even the top level domain servers, because DNS servers cache as much information as they can.   They also frequently bulk transfer information for big domains (like top level domains), which renders on demand caching unnecessary.

As you scan the output in the examples above, you'll notice that each record contains something called TTL.  This stands for Time To Live and is the amount of time, in seconds, that this information should be considered valid without needing to recheck it.  If you look at the records returned by the root server, they each have a TTL of 172800, which is 48 hours.  When a DNS server checks this information, it will save a copy and refer to the copy for at least 48 hours before checking the root server for updated information.

By the same token, the top level domain server in our example returned name server records for flexrentalsolutions.com with a TTL of 172800, also 48 hours.  What this tells me as the administrator of the flexrentalsolutions.com domain is that if I ever change DNS providers, I'll need to keep the old one up for at least two days before shutting it down.

Then we come to the final result, the actual IP address record we were after in the first place.  It comes with a TTL of 3600 seconds, or one hour.  This only gives non-authoritative DNS servers permission to cache the result for 1 hour.  We use a low value like this to give us the ability to quickly change servers if we need to.

DNS Tricks

For simple web sites like ours where there's only one server, returning the same IP address with a relatively long TTL like 3600 is fine.  Big companies like Google and Facebook, however, which operate many thousands of servers in dozens of locations around the world, might need more exotic techniques.

In these cases, big companies might return a different answer for each DNS request, perhaps routing requests from the United States to US data centers and European requests to European data centers.  They could use traffic management algorithms to route traffic to data centers that have more unused capacity.  They could use a simple round robin DNS technique to randomly distribute load across multiple servers and data centers.  For these situations, it's not uncommon to see a TTL of 300 (5 minutes).

As we roll out our High Availability architecture over the next year, you'll see our TTLs drop as well, though in our case it's for rapid failover in the event of a catastrophic data center event.


Hooks For SOPA Enforcement

Since DNS is required for web surfers to find web sites, we can't really live without it.  In order to wipe a .com web site off the internet within 48 hours, all you have to do is convince Verisign to remove it from each of their 13 TLD servers, say, with something like a preliminary injunction from a judge.  That's it.  In two days, unless users are using alternative DNS systems or DNS servers have ignored the TTL setting, your web site is invisible, even if your authoritative domain server is still up.  (Full disclosure: there are other ways a blacklist could and probably would be implemented.)

Even more interesting is the fact that 13 root servers control all officially recognized domains.  The same technique could be used to blacklist an entire TLD, effectively shutting down all .com domains within 48 hours.  Since these servers are all controlled by the United States Government, this gives them a convenient Internet kill switch.  If you see the TTLs on the root servers drop from 48 hours to something shorter, you can be reasonably sure they're thinking about doing it.


Alternative DNS Systems

Paranoia about the concentration of power in the hands of the US Government and the companies that have cut deals with ICANN has led to the development of alternative DNS systems -- systems built expressly to keep blacklisted domains accessible.

After all, what keeps the authority of the root DNS servers intact is the fact that most Internet service providers stick to the 13 official root servers.  There's nothing to prevent a private party from mirroring the root servers or any of the TLD servers and running their own DNS.  There's nothing to prevent you from finding out about one of those servers and manually changing your system to use it instead of the one suggested by your ISP.  There are also browser plugins that can do the same thing -- and these are much easier for non-technical types to figure out.

And in the wake of the SOPA/PIPA scare, these services and tools are multiplying rapidly, like bacteria on an agar plate.

Chaos Theory

One of the benefits of the current system of DNS and domain name registration is that it's secure (enough).  People who register domains have to identify themselves (sort of) and control of which DNS servers are authoritative for all .com domains rests in the hands of 13 servers run by a big and legally liable company.  Right now it's a pretty open, accessible (meaning it's cheap to register domain names) and secure process.

What happens when government meddling compels hundreds of competing DNS systems to spring up and web users to switch over to them?  And once that happens, what's to prevent a private DNS operator from maliciously changing DNS entries for popular web sites?  Or to prevent a hacker from messing with a poorly protected DNS server?  Maybe you get a different wellsfargo.com from the private DNS server than from the official system.  Sound insecure?  It is.  It also makes registering a domain name much harder - because there are more servers that need to be updated.

For big companies that use DNS tricks for traffic management, it would also make their life much harder, because private DNS providers would not have access to the data used to drive the traffic management algorithms.  In fact, private DNS could be an effective way to intentionally overload and crash big web sites.

This is why companies like Google insist that SOPA will make the Internet less secure, and they're right.  It will create a DNS free for all.  The only way to prevent that is to remove the incentives that will push users to a competing system - and find non-DNS ways to execute enforcement actions against black hat web sites.  Otherwise, pirates and those who download their wares will just go around the official DNS system, and break the internet for the rest of us while they're at it.

Friday, January 20, 2012

Leaks By Reference

Yesterday I started working on a strange bug reported by Go Vision in which the schedule screen locked up loading schedule data for one of their LED tiles.  For context, we're talking about this screen:



Usually when we see the Flash workbench throw up the "Loading..." modal and lock up, it means an ActionScript error occurred behind the scenes.  Most of the time it's a simple error, like a null pointer exception.  We run the test case with our debug version of the Flash Player and it takes us right to the source.

Not in this case.  When I got around to unpacking this bug in a test environment, I saw a server-side error: a MySQL packet size exception when executing the conflict query -- which provides raw data for the availability engine to perform its calculations with.  This was very odd, because conflict/availability queries aren't that large, nowhere near large enough for MySQL packet sizes to be an issue.  We've seen packet size issues before, but only as they related to blobs stored using Flex's document elements -- things we expect to be big on occasion.

It made no sense at all for a simple conflict query to be anything larger than a few KB at the most.  So rather than skirt the issue by increasing the max packet size, I set about figuring out why the query was so large in the first place.  What I found is that the query started small, but gradually grew bigger and bigger until it became too big for reasons which will soon become apparent.

But first, some background.  The availability data used to fill in the schedule isn't one availability query, but one for each day.  The facade function that coordinates assembling all the data needed by the calendar UI loops over the range of visible days on the calendar and runs an availability query for each day.  The Java code below shows the basic structure of the main loop - with some technical details removed for clarity:

ManagedResource resource = getResourceService().findById(resourceId);
Collection<ManagedResource> resources = new ArrayList<ManagedResource>();
resources.add(resource);

while (currentDate.before(endDate)) {
    if (currentDate.before(now)) {
        // grey out the past
    }
    else {
        result = findAvailableQuantityResults(collab, resources, dates);
        // serialize result
    }
    currentDate = DateUtils.add(currentDate, Calendar.DATE, 1);
}

For more clarity, I've pasted in the section of the code that generates the conflict query below -- the query that was somehow exceeding MySQL's max packet size:

            StringBuffer idBuffer = new StringBuffer();
            for (ManagedResource rc : resources) {
                idBuffer.append("'");
                idBuffer.append(rc.getObjectIdentifier());
                idBuffer.append("',");
                rcMap.put(rc.getObjectIdentifier(), rc);
                rcHeap.add(rc);
            }
            idBuffer.deleteCharAt(idBuffer.length() - 1);
               
            StringBuffer query = new StringBuffer();
            query.append("SELECT");
            query.append(" L.id, L.managed_resource_id");
            query.append(" FROM st_prj_project_element_line_item AS L");
            query.append(" WHERE L.snapshot = 0 AND L.is_deleted = 0 ");
            if (confirmed) {
                query.append(" AND L.is_conflict_creating = 1 ");
            }
            query.append(" AND L.managed_resource_id in (");
            query.append(idBuffer.toString());
            query.append(")");
            if (startDate != null) {
                query.append(" AND ( (L.pickup_date <= ? AND L.return_date >= ?)");
                query.append(" OR (L.pickup_date >= ? AND L.pickup_date <= ?)");
                query.append(" OR (L.return_date >= ? AND L.return_date <= ?) )");
            }
When the schedule assembly loop calls findAvailableQuantityResults(), one way or another this query gets assembled and executed.

Now, when I started examining the error in the Eclipse debugger (ironic that our Java development IDE has the same name as one of our competitors' products), I started by looking at the value of the query variable when the packet exception gets thrown.  It was so large, Eclipse couldn't even display it.  But looking at the code that concatenates the various parts into a finished query, it wasn't hard to spot the only part of the query that could have an impact on the query size: the idBuffer.  So I switched from looking at the query to the resources collection we loop over to assemble the idBuffer (which is a list of items for which availability is being checked -- this allows us to check the availability of multiple items in one query).  I couldn't really browse through the resources collection either, but after a taxing delay, I was able to ascertain its size: over 62,000 elements.  Somehow the resources collection passed into the findAvailableQuantityResults() function had magically grown from 1, when we initialize it just before entering the day loop, to over 62,000.

This was enough to tell me that we had a pass-by-reference style problem and that the availability engine was somehow manipulating the resources collection after it's passed in.  (And I know it does, because this is how we process component availability for kits and packages.)  I set up some logging to output the size of the resources collection for each pass.  It started at 1 and grew exponentially until the max packet size exception stopped the experiment.  The version of the resources collection manipulated by the availability engine was becoming the input value on the next loop iteration, which, as we've seen, wreaks havoc.

In lower level languages like C++, whether to pass a variable by value or by reference is something you think about quite often.  You can pass variables or their pointers; you must choose which.  In Java, technically, object references are passed by value, but the effect is the same: every method operates on the caller's object, and only primitives are truly copied.  Developers don't have any say in the matter.  You can't pass anything by value unless you somehow clone or copy it first.
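
A five-minute demonstration of the trap, and the defensive copy that avoids it:

    import java.util.ArrayList;
    import java.util.List;

    public class ReferenceDemo {
        public static void main(String[] args) {
            List<String> resources = new ArrayList<String>();
            resources.add("VL-2500");

            mutate(resources);                        // no copy is made...
            System.out.println(resources.size());    // ...so this prints 2

            mutate(new ArrayList<String>(resources)); // defensive copy
            System.out.println(resources.size());    // still 2: caller is untouched
        }

        static void mutate(List<String> list) {
            list.add("added inside the callee");
        }
    }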

Needless to say, that was the solution here: making a fresh copy of the resources collection for each pass, so:

    result = findAvailableQuantityResults(collab, resources, dates);

became:

    result = findAvailableQuantityResults(collab, new ArrayList(resources), dates);

After this one line code change, the schedule fired right up, and for obvious reasons (O(1) beats O(n^n)), loaded much faster to boot.

So, my thanks to Go Vision for stumbling onto the right combination of settings to make this error something big enough for us to see.  What was a functional bug fix for Go Vision became a major performance enhancement that will benefit everyone.

Thursday, January 19, 2012

Regression Testing Performance?

One of the biggest things you worry about when you develop software is that a change in one area of the codebase will have ripple effects and break something else.  This is why it's smart to put aggressive regression testing in place, preferably automated regression testing done in conjunction with a continuous integration process.  This way, a pile of tests gets run automatically every time you commit source code changes to the repository.

At Flex, we use Subversion for source control and a continuous integration tool called Bamboo.  Every five minutes, Bamboo checks Subversion for new commits and triggers a full rebuild and retest of the codebase.  Here's a screenshot from Bamboo for clarification:


As you can see, we have the code broken into various modules and each module has a certain number of tests.  We also have the Functional Test Suite, which adds additional tests for pricing, availability math, etc.  We don't do a release of Flex until this screen is clear of pending builds and 100% green.

We don't catch every ripple effect -- or regression, as we call them -- but this tool catches many regressions we would otherwise miss.

But lately we've noticed a new kind of regression, one that isn't strictly functional.  The code still works as it always did, but somewhere along the line a change that seems minor to us causes a ripple effect in terms of performance.  Today we discovered that the calculation logic for total prep time was getting invoked for every bar code scan and slowing down scan response times to unacceptable levels in certain circumstances.  This of course was a ripple effect of tweaking the total prep time value so it could be retrieved in a batch HQL query, meaning that every time an equipment list was saved to the database, the total prep time was recalculated.  Doing this for every bar code scan introduced slowness in the form of an N+1 select issue (where N is the number of line items on the underlying pull sheet).

Once we identified the bottleneck, it was an easy fix.  We pulled the totalPrepTime out of Hibernate, and this reduced the test scan in our case from 5.7 seconds to 580 milliseconds.  We also found another performance slowdown in scans when containers are configured to automatically process their contents.  We call these auto scans, and the auto scans were needlessly performing orphan scan checks (which include an availability check), so for containers with lots of contents, this could result in a major slowdown.  We've added a new flag to skip orphan detection when processing a container child.  This sped things right up.

This got Chris and me thinking about how we might do a better job of catching performance regressions the same way we catch functional regressions now.  If we'd established a benchmark for scan performance in our test bed, we could trigger a test failure whenever a test run exceeds the benchmark by some predetermined value, perhaps by one standard deviation.

It'll be tricky to establish initial benchmarks, and I think I favor a statistical technique that "learns" what the typical performance characteristics are, maybe with some provisions for adding hard limits for certain test scenarios.  Otherwise it's just trial and error and you'd end up chasing a lot of false positives.
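
A back-of-the-napkin sketch of what that "learning" check might look like (purely speculative -- none of this exists in our test suite yet):

    import java.util.ArrayList;
    import java.util.List;

    // Speculative sketch: flag a run as a regression when it lands more than
    // one standard deviation above the historical mean for its scenario.
    public class PerformanceBaseline {
        private final List<Long> history = new ArrayList<Long>(); // past runs, millis

        boolean isRegression(long elapsedMillis) {
            if (history.size() < 10) {
                return false; // not enough history to judge; keep learning
            }
            double mean = 0;
            for (long t : history) mean += t;
            mean /= history.size();

            double variance = 0;
            for (long t : history) variance += (t - mean) * (t - mean);
            double stdDev = Math.sqrt(variance / history.size());

            return elapsedMillis > mean + stdDev;
        }

        void record(long elapsedMillis) {
            history.add(elapsedMillis);
        }
    }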

Adding statistical performance analysis to our test suite sounds like a lot of fun for me and a good way to prevent accidentally releasing new performance bugs in the future.  When the FastTrack schedule gets back under control, maybe I'll twist Chris's arm to let me take a crack at it.


Wednesday, January 18, 2012

Detail Block Row Model and Post Update Processors

The last big release of Flex, 4.4, concerned itself principally with replacing a UI component we call the detail block, as a necessary step toward closing the availability donut hole for kits and packages (now closed).

The detail block is the component used on quotes and pull sheets to represent line items.  The old detail block was also some of the oldest client side code in Flex.  It was spaghetti.  Obfuscated, brittle and hard to maintain.

It was nice to redesign it.  After living with the limitations of the old version for so long, we had a good idea what sort of things we'd need to change to make our lives easier going forward.   We chucked the separate pull sheet and quote detail blocks in favor of a unified component and gave the new detail block a modular, plugin style architecture for handling things like data loading, drag and drop, and context menus. 

We also abstracted the data model from the view component and built something we call a 'delta processor' to take line item changes from the server and apply them to the user interface.  This was one of the toughest parts of the architecture to get right, but this centralization of data model changes also permitted us to introduce a few new architectural features that I personally think have made things much easier for us day-to-day.  Some of the design changes turned out to be abstraction for the sake of abstraction, but not row model processors and post update processors.  These have saved our bacon on more than one occasion and made life much easier on the dev team.

Here's an MXML snippet from the quote detail block for consideration:

        <model:FinancialDocumentDataProvider>
           <model:rowModelProcessor>
                <coremodel:PluggableGridRowModelProcessor>
                    <coremodel:processors>
                        <mx:ArrayCollection>
                            <model:SubtotalRowProcessor/>
                            <model:WarningRowProcessor/>
                        </mx:ArrayCollection>
                    </coremodel:processors>
                </coremodel:PluggableGridRowModelProcessor>
            </model:rowModelProcessor>
            <model:postUpdateProcessors>
                <mx:ArrayCollection>
                    <project:AvailabilityPostUpdateProcessor batchSize="5"/>
                    <project:SuggestionPostUpdateProcessor/>
                    <coremodel:LineExpandPostUpdateProcessor/>
                    <model:RecalculationPostUpdateProcessor/>
                    <project:ResourceAvailabilitySyncPostUpdateProcessor/>
                </mx:ArrayCollection>
            </model:postUpdateProcessors>
        </model:FinancialDocumentDataProvider>
This code snippet shows a set of row model processors and a set of post update processors.  Their names should give you clues about how we use them in real life.

I mention this today because I had to fix an issue with PDF generation in which users may want a subtotal visible (not muted), but they may want to mute the footer or perhaps mute the discount if the subtotal contains a discount.  This involved going into the detail block and modifying the way subtotal rows get processed when a delta (a batch of data changes) comes back after the server saves the new line or price mute settings.

The tricky thing about subtotals is that the server and Flex's underlying data model only consider them one row, one discrete piece of data.  But to display a subtotal on the screen, you might need to display as many as four rows: a header, a total -- and a discount and total before discount if the subtotal's been discounted.

Rather than hack in some special case for this, we inserted a hook in the architecture -- row model processors -- where incoming row changes can be massaged and manipulated into the form needed in the user interface.  In the case of subtotals, this means one row can become four.  Having that subtotal row processor sitting there made today's line/price mute changes much easier.  And this is just one example of many cases where these processors have helped us take shortcuts without feeling icky afterward.  Row model processors are about to come in handy again as we get ready to add labor planning features to the quote UI.

The row model processors could just as easily have been called pre-processors because they act on delta rows before they are passed on to the live data model.  Post update processors are invoked after the update has been applied, and we use them to handle situations where a line item change might need to trigger another change elsewhere on the page.

Availability is the most obvious example.  After the line items on a quote load, the availability indicators start filling in 5 at a time.  In the code sample up above you can see the AvailabilityPostUpdateProcessor plugged into the data provider with a batch size of 5.  If we changed that number to 10, the availability indicators would start filling in 10 at a time.

Likewise, the SuggestionPostUpdateProcessor examines the delta rows to determine if the suggestion dialog needs to be raised.  The LineExpandPostUpdateProcessor remembers which nodes were expanded or collapsed from last time and sets them to their previous values.  The RecalculationPostUpdateProcessor triggers total recalculations if an edit action would change the total, as in changes to quantity, pricing or pricing models. (In reality, these recalculations happen when the quantity change is made.  The post processor just fetches the recalculated values from the server.)

The newest addition to the post update processor chain is ResourceAvailabilitySyncPostUpdateProcessor.  Our QA team of one, Courtney, noticed a while back that when the same item appeared twice on a quote, if you changed the quantity of one line, it would refresh the availability of the line you changed, but not the other line.  We solved this problem by introducing a new post update processor to scan the line items for duplicates of the item whose availability is changed by an edit, and refresh all the other lines that reference it.  In the 4.3 detail block, this kind of change would have been painful.  With the post update processor architecture, we were able to implement it in about half a day.


As we grow and the demand for FastTrack customizations ramps up, the pressure on the dev team to crank out more features in less time is building.  Arcane architectural flourishes like these may help us do it.

Tuesday, January 17, 2012

Automatic Error Reports / Client Detection

I'm usually a pretty sound sleeper, but not the night of a release.  Last night was no exception with the release of 4.4.14.  My son usually wakes up around 6 AM and as soon as my wife leaves to go get him, I check my phone for server alerts and error messages.

As most Flex users know, we have something called a quality feedback agent in the system that automatically sends the development team emails when a user encounters a popup error.  This has been a huge help to us on the dev team because the vast majority of errors go unreported to support.  We've been able to fix problems that we never would have known about otherwise.

It's fairly common to get error messages after a release.  Most of them are what we call serialization errors and look like this:

Active Panel Component: financialDocument
Active Panel Data: 09d97e22-3e10-11e1-846d-12313f02cc9b

Event Stack...
[Event type="sendErrorEvent" bubbles=false cancelable=false eventPhase=2]

 Java Exception...
[FaultEvent fault=[RPC Fault faultString="Didn't receive an acknowledgement of message" faultCode="Server.Acknowledge.Failed" faultDetail="Was expecting message '9CA3DB5D-7B88-DD87-905A-EC04433901A3' but received '886101D0-9777-45C5-852A-765E23DBB6A8'."] messageId="6DD73431-CB83-C376-2B49-EC0443E227B5" type="fault" bubbles=false cancelable=true eventPhase=2]

To the trained eye, these errors tell us that the user is using a stale version of the Flash client.  Whenever Devon or Chris tell you to clear your cache or refresh, it's usually to correct this problem.


What we look for the morning after a release is anything other than a simple stale client error.  This morning a different kind of error came in, but only from one customer.  It looked like this:

Facade Processing Exception
Report ID: ebd44295-0134-1000-0324-123140010de4
Facade Method:com.shoptick.projects.facade.impl.ProjectElementFacadeImpl.searchProjectElements()
Arguments:com.mf.roundhouse.core.system.SessionCollaborator@6978a836,9bfb850c-b117-11df-b8d5-00e08175e43e,null,true,null,false,

java.lang.RuntimeException: java.lang.IllegalStateException: No data type for node: org.hibernate.hql.ast.tree.IdentNode
 \-[IDENT] IdentNode: 'totalPrepTimeAsTimeString' {originalText=totalPrepTimeAsTimeString}

       at com.shoptick.bizops.dao.impl.ExpressionHintProcessingDaoImpl.processHibernateBatch(ExpressionHintProcessingDaoImpl.java:140)
       at com.shoptick.bizops.dao.impl.ExpressionHintProcessingDaoImpl.resolveExpressionsHints(ExpressionHintProcessingDaoImpl.java:54)
       at com.shoptick.bizops.util.PropertyExpressionUtils.resolveExpressionsHints(PropertyExpressionUtils.java:55)
       at com.shoptick.projects.facade.impl.ProjectElementFacadeImpl.searchProjectElements(ProjectElementFacadeImpl.java:2505)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:309)
       at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
       at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
       at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:80)
       at com.mf.roundhouse.core.util.FlexOmnibusProcessingAspect.processLockableJoinpoint(FlexOmnibusProcessingAspect.java:183)
       at com.mf.roundhouse.core.util.FlexOmnibusProcessingAspect.executeFacadeMonitoringAdvice(FlexOmnibusProcessingAspect.java:99)
       at sun.reflect.GeneratedMethodAccessor2361.invoke(Unknown Source)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:621)
       at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:610)
       at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:65)
       at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:161)
       at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:89)
       at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
       at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
       at $Proxy274.searchProjectElements(Unknown Source)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:309)
       at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:196)
       at $Proxy275.searchProjectElements(Unknown Source)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:309)
       at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:183)
       at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:150)
       at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:80)
       at com.mf.roundhouse.core.util.FlexOmnibusProcessingAspect.processLockableJoinpoint(FlexOmnibusProcessingAspect.java:183)
       at com.mf.roundhouse.core.util.FlexOmnibusProcessingAspect.executeFacadeMonitoringAdvice(FlexOmnibusProcessingAspect.java:99)
       at sun.reflect.GeneratedMethodAccessor2361.invoke(Unknown Source)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:621)
       at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:610)
       at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:65)
       at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:161)
       at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:89)
       at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
       at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
       at $Proxy275.searchProjectElements(Unknown Source)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:597)
       at org.granite.messaging.service.ServiceInvocationContext.invoke(ServiceInvocationContext.java:79)
       at com.mf.roundhouse.core.security.FlexSecurityService.authorize(FlexSecurityService.java:187)
       at org.granite.messaging.service.ServiceInvoker.invoke(ServiceInvoker.java:123)
       at org.granite.messaging.amf.process.AMF3RemotingMessageProcessor.process(AMF3RemotingMessageProcessor.java:59)
       at org.granite.messaging.amf.process.AMF0MessageProcessor.process(AMF0MessageProcessor.java:79)
       at org.granite.messaging.webapp.AMFMessageServlet.doPost(AMFMessageServlet.java:62)
       at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
       at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
       at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1166)
       at org.springframework.orm.hibernate3.support.OpenSessionInViewFilter.doFilterInternal(OpenSessionInViewFilter.java:198)
       at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:76)
       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
       at org.granite.messaging.webapp.AMFMessageFilter.doFilter(AMFMessageFilter.java:93)
       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
       at com.mf.roundhouse.core.servlet.ExceptionEmailFilter.doFilter(ExceptionEmailFilter.java:43)
       at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1157)
       at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:388)
       at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
       at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
       at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
       at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:418)
       at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
       at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
       at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
       at org.mortbay.jetty.Server.handle(Server.java:326)
       at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:536)
       at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:930)
       at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:747)
       at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
       at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:405)
       at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
       at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.lang.IllegalStateException: No data type for node: org.hibernate.hql.ast.tree.IdentNode
 \-[IDENT] IdentNode: 'totalPrepTimeAsTimeString' {originalText=totalPrepTimeAsTimeString}

       at org.hibernate.hql.ast.tree.SelectClause.initializeExplicitSelectClause(SelectClause.java:145)
       at org.hibernate.hql.ast.HqlSqlWalker.useSelectClause(HqlSqlWalker.java:705)
       at org.hibernate.hql.ast.HqlSqlWalker.processQuery(HqlSqlWalker.java:529)
       at org.hibernate.hql.antlr.HqlSqlBaseWalker.query(HqlSqlBaseWalker.java:645)
       at org.hibernate.hql.antlr.HqlSqlBaseWalker.selectStatement(HqlSqlBaseWalker.java:281)
       at org.hibernate.hql.antlr.HqlSqlBaseWalker.statement(HqlSqlBaseWalker.java:229)
       at org.hibernate.hql.ast.QueryTranslatorImpl.analyze(QueryTranslatorImpl.java:228)
       at org.hibernate.hql.ast.QueryTranslatorImpl.doCompile(QueryTranslatorImpl.java:160)
       at org.hibernate.hql.ast.QueryTranslatorImpl.compile(QueryTranslatorImpl.java:111)
       at org.hibernate.engine.query.HQLQueryPlan.<init>(HQLQueryPlan.java:77)
       at org.hibernate.engine.query.HQLQueryPlan.<init>(HQLQueryPlan.java:56)
       at org.hibernate.engine.query.QueryPlanCache.getHQLQueryPlan(QueryPlanCache.java:72)
       at org.hibernate.impl.AbstractSessionImpl.getHQLQueryPlan(AbstractSessionImpl.java:133)
       at org.hibernate.impl.AbstractSessionImpl.createQuery(AbstractSessionImpl.java:112)
       at org.hibernate.impl.SessionImpl.createQuery(SessionImpl.java:1623)
       at com.shoptick.bizops.dao.impl.ExpressionHintProcessingDaoImpl.processHibernateBatch(ExpressionHintProcessingDaoImpl.java:121)
       ... 89 more

We knew right away this was related to the search optimization we released last night, so some mild panic set in.  I unpacked the error, dug into it and realized it was a false alarm.  This particular customer had an invalid property expression configured for quotes.  It was an easy mistake to make, because the expression is valid for Pull Sheets and Manifests, just not for quotes.  Removing the bad expression from their configuration resolved the error.  It never surfaced before because the old, non-optimized property expression processor is designed to return a blank string when it resolves an invalid property.
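
The old behavior is easy to picture -- roughly this, sketched in Java with hypothetical names (the real logic lives in PropertyExpressionUtils):

// The old, forgiving resolver: if a configured property expression
// can't be evaluated against its target, render a blank string rather
// than letting the failure surface as a popup error.
class ForgivingExpressionResolver {
    String resolve(Object target, String expression) {
        try {
            Object value = evaluate(target, expression);
            return value == null ? "" : value.toString();
        } catch (RuntimeException e) {
            return "";  // invalid expression: blank it out, don't blow up
        }
    }

    private Object evaluate(Object target, String expression) {
        // Stand-in for real property resolution.  The optimized path
        // instead compiles expressions into an HQL select, where an
        // unknown property surfaces as the exception shown above.
        throw new UnsupportedOperationException(expression);
    }
}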

Now, back to these stale client errors.  For a long time we had bigger fish to fry and just sort of lived with them.  Recently we've taken steps to handle them more gracefully.  If the user has a stale version of the client, they should get a nice friendly message telling them to refresh, not an obscure technical error.

Users may notice that we've added client and server versions to the login page, and if you have a stale version of the client, you'll get a message warning you about it.  What we've since realized is that many of our customers leave Flex open all the time -- all night, even.  When a system update happens, the user usually just gets automatically logged out and logs back in -- all without refreshing the client.  By adding server version checks to a few key remote events (remote events are what we call actions that initiate communication with the server), we can validate the client version before the user gets that ugly technical error.  As of now, we'll probably add these checks to all events related to authentication, the process scan event and the event that retrieves the day book.
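
The check itself is trivial -- roughly this, in Java with made-up names (the real guard lives in the ActionScript client):

// Before a guarded remote event fires, compare the version the client
// was built with against the version the server reports; a mismatch
// means the user is running a stale client and should refresh.
class ClientVersionGuard {
    private final String clientVersion;

    ClientVersionGuard(String clientVersion) {
        this.clientVersion = clientVersion;
    }

    boolean allowRemoteEvent(String serverVersion) {
        if (clientVersion.equals(serverVersion)) {
            return true;  // versions match; let the event go out
        }
        promptUserToRefresh();
        return false;     // block the call rather than risk a serialization error
    }

    private void promptUserToRefresh() {
        // Stand-in for a friendly dialog in the real client.
        System.out.println("A new version is available.  Please refresh your browser.");
    }
}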

These checks will probably be in the next maintenance release, but the benefits of better client/server version detection will take a second release to manifest themselves.

Monday, January 16, 2012

Flash vs. HTML 5


Note: This article is about the Adobe Flex SDK, not Flex Rental Manager.

Adobe seems to be in a state of crisis: not long ago they announced a bold new initiative to compile Flex 4.5 to iOS and other mobile platforms, which to us signaled a long-term commitment to Flex.  Since our whole front end is built in Flex and runs on Flash, this was relevant to our interests.

Then, in a series of shocking announcements, they pulled their plans for mobile platform support and revealed they were considering donating Flex to the Apache Software Foundation, citing the upcoming release of HTML 5.  Needless to say, this led to some introspection around here and a close look at the HTML 5 draft specification.  More on that later.

Apple's never been a fan of Flash.  Steve Jobs was very vocal about it and claimed it was a waning technology.  I think that's just another example of NIH syndrome running rampant at Apple, and I think users of Google Analytics, YouTube, and FarmVille might disagree with Steve's assessment of Flash.

I hate to be disrespectful of the dead, and Flash is a lot of things, but a waning technology isn't one of them.  In fact, there are few web technologies as ubiquitous as Flash.  It's everywhere.  Microsoft even launched its own project, Silverlight, to compete with Flex/Flash.  (Our friends at Production Exchange use Silverlight.)  And as of yet, there is no suitable replacement for it, not even HTML 5.

Flash does have its problems, however.  The Flash Builder development tools have memory leaks galore and the Flash runtime could do with some serious enhancements.  Moving the Flex SDK into the open source community could work wonders for these problems and perhaps make Flex even more useful.  If the Flash Player follows the Flex SDK into the OSS world, it could even lead to embedding the Flash Player into browsers, since the Apache license allows incorporating open source software into closed source or commercial platforms.

What makes Flex useful for creating rich internet applications (like Flex Rental Manager) is that it provides a component/event style model for building user interfaces.  In many ways, it's like the long lineage of GUI APIs developers are already familiar with.  AJAX wizards will assert that you can do the same thing with HTML, the DOM and JavaScript -- and they're right, just not as easily, and not without accounting for the subtle differences between browsers in how they process markup, apply CSS styles and interpret JavaScript code.  HTML 5 won't solve any of these problems.  It just adds a new pile of requirements for browser developers to release different interpretations of.

With that being said, I do like what I've seen so far of HTML 5.  It cleans up some of the loosey-goosey formatting issues in HTML 4 and provides new APIs for 2D drawing and video.  Much of the ubiquity of Flash is due to its video support, and I do think HTML 5 will displace Flex/Flash when it comes to video and some of the simpler web-based games.

In order for HTML 5 to truly displace Flex, however, it will need an ecosystem of third party libraries offering the same kind of rich UI components that come with Flex.  jQuery and other DOM/JavaScript frameworks have provided this for a long time.  HTML 5 does nothing new but clean up the syntax and provide better low-level APIs for drawing and multimedia -- some new objects for us to manipulate using the same old DOM and the same old JavaScript.  Great for component developers, not that compelling for us.  Not yet, anyway.

Another key issue for me with moving back to AJAX is the often obfuscated nature of DOM/JavaScript.  It's not often you can open up a piece of JavaScript that does something useful and easily understand it.  Flex, on the other hand, has a nice structure with classes and inheritance and interfaces and all the rest.  The HTML/DOM/CSS/JavaScript universe seems like a mish-mash of competing philosophies at present.  It can be powerful, but it's hard to access that power without the code getting ugly and rendering the relationships between different components unclear.  I know from experience that when code isn't clear and self documenting, we're afraid to touch it.  It becomes a black box and we get skittish about maintaining it.

Again, I do think HTML 5 is a great improvement over HTML 4.  I'm excited to work with it and it will simplify traditional web development.  But Flex Rental Manager is not a traditional web site -- it's a full-screen RIA, rich even by Rich Internet Application standards.  It's not a typical use case, so judging HTML 5 by how it lends itself to our purposes isn't really a fair way to evaluate it.  We prefer Flex over AJAX because our time is limited and we can't afford to spend it building components or testing (and working around) browser idiosyncrasies.

In time, of course, these issues we're concerned about with HTML 5 will get resolved.  jQuery or libraries like it will work out the browser issues, like they already have with HTML 4, and a whole universe of third party components will spring up.  I think of HTML 5 and the other standards that complement it as a transition step from the old way to a more elegant way that may emerge down the road.  For Rich Internet Applications, I don't think it offers much beyond HTML 4 -- the power all comes from third party JavaScript libraries.

I liken HTML 5 as it applies to rich internet apps to SOAP in the web service realm.  When SOAP first hit, we all thought it was going to be the best thing since sliced bread.  Object serialization interoperability between platforms and longer lasting light bulbs.  SOAP didn't quite work out that way, did it?  We developers started working with it and it didn't take long for us to realize how much it sucked.  Then came REST.  Finally.  REST is much better: simpler, cleaner.  Then JSON came along and helped us serialize our objects in a clear, elegant way, and (gasp), without XML.

It takes time to get things right.  I see HTML 5 as a step along the way to cleaning up and simplifying web development, but just the first of many.  But it's easy for me to sit here in my glass house and throw stones at standards working groups.  It's not an easy job to mediate all the competing visions for what the web should be, for what web development should be.  But unless we're forced to do something drastic by a sudden disappearance of the Flash player, we're content to let the working groups duke it out for a few more years.  Our early adopter days are over.

What really confuses me in all this is Adobe's sudden rush to jettison Flex for a standard the W3C hasn't even approved yet.  Seems pretty rash.  Most of the reasons developers picked Flex over AJAX in the first place haven't changed and HTML 5 won't change them much either.  I think it comes down to a problem that's plagued a lot of companies in the business of making developer tools and Internet technology: their engineering teams produce something wonderful and useful, but the business folks can't figure out how to sell it.  Adobe is scapegoating HTML 5 because it can't figure out how to make money off of Flex, even though lots of big players use it.  Adobe had the same problem with Acrobat and the ubiquitous PDF.  The PDF has changed the way we share and access "printable" documents on the web, but how many of us remember paying for the right to use it?  And my beloved Sun invented Java -- and Java was undeniably a huge success in terms of developer and business adoption.  But where is Sun now?  Gone.  Absorbed into the pitiless borg that is Oracle, just like MySQL and PeopleSoft.

I guess we should be thankful that Larry Ellison has, as of yet, left Adobe alone.  I'd rather see Flex go the Apache route than get sucked up into Oracle.  Either way, Flex isn't going away any time soon, no matter what anyone says.  It's way too useful.