Wednesday, May 9, 2012

Going Barefoot

I think I've mentioned before a common expression in software development shops: "The cobbler's children have no shoes."

It means we often spend so much time working on customer projects that we neglect much-needed internal projects that would help us do our jobs better, even if customers never see the results directly.  Our internal notification and deploy system, Cluster Bomb, is an example.  Without Cluster Bomb, pushing software updates to 150+ customer systems would be error-prone and take hours.

Distributed Performance Analysis

One area we've needed to work on for some time is performance and resource analysis.  We have a performance monitoring system built into Flex, but at present it keeps its data in memory only: if a system crashes, we have no way of examining response times or resource consumption prior to the crash, which makes it hard to determine why the system crashed and correct the problem.  We've wanted to build a system that takes performance data from running systems and sends it off to a central server for indexing and analysis.

We always thought this would be a big project, until I remembered a distributed log analysis tool called Splunk from a previous job (Blue Nile).  Splunk can take log file feeds from the various servers in a cluster and centralize them for analysis.  This can range from something as simple as viewing the logs of a crashed system to aggregate statistical and trend analysis.  You can also set alerts that look for certain patterns in the log files and notify administrators when something unusual happens.  By outputting performance metrics in the application logs, we should be able to get close enough to the kind of performance analysis we were hoping for without going to the trouble of building a whole new system for it.

We also need a good archive of historical performance and resource consumption data as we plan for a more fault-tolerant, load-balanced architecture; it will help us make the right assumptions about nominal versus peak load.

Logging Performance

The good news about this approach is that it doesn't involve much coding.  We just had to modify our existing performance monitoring to put performance data in the logs.  Since we use JavaSimon for performance monitoring, this meant adding a callback for logging.  JavaSimon comes with a logging callback implementation, but we felt it was too verbose.  We just wanted to output log data whenever a timer is stopped; we don't need to know when timers are started.  We also wanted to output the performance data in both a human-readable format (ns, ms, seconds, etc.) and a machine-readable format for use in aggregate functions.

We created a new callback called PerformanceLoggingCallback and the entire class is shown below:

// Imports assume log4j and the JavaSimon 2.x package layout.
import org.apache.log4j.Logger;
import org.javasimon.CallbackSkeleton;
import org.javasimon.Split;
import org.javasimon.Stopwatch;
import org.javasimon.utils.SimonUtils;

public class PerformanceLoggingCallback extends CallbackSkeleton {

    private static Logger logger = Logger.getLogger(PerformanceLoggingCallback.class);

    @Override
    public void stopwatchStop(Split split) {

        Stopwatch watch = split.getStopwatch();

        StringBuilder sb = new StringBuilder();

        // Human-readable summary: the stopwatch's toString() includes
        // total, counter, max, and min in convenient units.
        sb.append(watch.toString());
        sb.append(", last ");
        sb.append(SimonUtils.presentNanoTime(split.runningFor()));
        sb.append(", avg ");
        sb.append(SimonUtils.presentNanoTime(watch.getTotal() / watch.getCounter()));

        // Machine-readable metrics: pipe-delimited, bracketed, all in nanoseconds.
        sb.append(" [");
        sb.append(String.valueOf(watch.getTotal()));
        sb.append("|");
        sb.append(String.valueOf(watch.getCounter()));
        sb.append("|");
        sb.append(String.valueOf(watch.getMax()));
        sb.append("|");
        sb.append(String.valueOf(watch.getMin()));
        sb.append("|");
        sb.append(String.valueOf(split.runningFor()));
        sb.append("]");

        logger.info(sb.toString());
    }
}

There's not much to it.  We just use a StringBuilder to build up a line of log output.  In practice, the log output looks like this:

STDOUT 2012-05-09 10:58:14 PerformanceLoggingCallback [INFO] Simon Stopwatch: [com.mf.roundhouse.core.service.impl.SchemaCheckRecordServiceImpl.save() INHERIT] total 232 ms, counter 58, max 44.9 ms, min 498 us, last 3.47 ms, avg 4.00 ms [231954000|58|44869000|498000|3391000]

There's no need to understand everything here.  Just note that the performance data is output first in human-readable units, usually the most convenient unit of time for each value.  Because nanoseconds and microseconds would require conversion before they could be used in analytical functions, we also output the key performance metrics at the end of the log line in raw nanoseconds.  They're pipe-delimited and bracketed so we'll be able to configure the analysis tool with a regular expression that matches and extracts the nanosecond values as fields.
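To make that concrete, here's a minimal sketch in Java of the kind of pattern such an extraction would use; the class and field names are ours for illustration only, and in practice the same regular expression would be configured in the analysis tool rather than written in code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PerfLineExtractor {

    // Matches the bracketed, pipe-delimited block at the end of each line:
    // [total|counter|max|min|last], all values in nanoseconds.
    private static final Pattern PERF_PATTERN =
            Pattern.compile("\\[(\\d+)\\|(\\d+)\\|(\\d+)\\|(\\d+)\\|(\\d+)\\]");

    public static void main(String[] args) {
        String line = "... total 232 ms, counter 58, max 44.9 ms, min 498 us, "
                + "last 3.47 ms, avg 4.00 ms [231954000|58|44869000|498000|3391000]";

        Matcher m = PERF_PATTERN.matcher(line);
        if (m.find()) {
            long totalNs = Long.parseLong(m.group(1));
            long counter = Long.parseLong(m.group(2));
            long maxNs   = Long.parseLong(m.group(3));
            long minNs   = Long.parseLong(m.group(4));
            long lastNs  = Long.parseLong(m.group(5));
            System.out.printf("total=%dns counter=%d max=%dns min=%dns last=%dns%n",
                    totalNs, counter, maxNs, minNs, lastNs);
        }
    }
}

Anchoring the pattern to the bracketed block means the human-readable portion of the line can change without breaking the field extraction.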

Finally, we had to add code to Flex that plugs the callback into the architecture, and we wanted the ability to turn it on or off on a per-instance basis.  We still aren't sure what impact performance logging will have on performance itself, and we needed a quick way to switch it off in case it becomes a problem.  To handle this, we added a little snippet to the core module's activate() method:

    public void activate() {

        // Register the logging callback with JavaSimon unless performance
        // logging has been switched off for this instance.
        if (isPerformanceLoggingEnabled()) {
            SimonManager.callback().addCallback(new PerformanceLoggingCallback());
        }
    }

The performanceLoggingEnabled field defaults to true, but can be overridden with a JNDI-injected property.
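As a rough sketch of how that kind of override could work, assuming a manual JNDI lookup with a hypothetical binding name (the actual Flex wiring may use container injection instead), the flag falls back to its default of true when nothing is bound:

import javax.naming.InitialContext;
import javax.naming.NamingException;

public class CoreModuleConfig {

    // Defaults to true; a JNDI-bound Boolean overrides it when present.
    private boolean performanceLoggingEnabled = true;

    public boolean isPerformanceLoggingEnabled() {
        try {
            // Hypothetical JNDI name; the real deployment may bind it elsewhere.
            Boolean override = (Boolean) new InitialContext()
                    .lookup("java:comp/env/flex/performanceLoggingEnabled");
            if (override != null) {
                performanceLoggingEnabled = override.booleanValue();
            }
        } catch (NamingException e) {
            // Nothing bound: keep the default.
        }
        return performanceLoggingEnabled;
    }
}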

Configuring Splunk

This performance logging will be in version 4.4.38 of Flex and later.  The next step for us is to set up data feeds from all the customer servers to the Splunk server.  We'll be working on that over the next few days.  I'll follow up with some preliminary reporting data once it becomes available.  The hope is that by using a centralized monitoring tool, we'll be able to deal with performance problems a bit more proactively rather than just waiting for customers to report issues.
