Tuesday, November 12, 2013

A Self Hoster's Guide to 4.6.16

This week we start deploying version 4.6.16 of Flex.  If you haven't looked at the release notes, check them out here.

While the release notes may not signal a big release, 4.6.16 introduces two new technologies into our stack: RabbitMQ and Riak.  Another notable feature of this release is that it requires 13.04 of Ubuntu.  More on that later.

RabbitMQ

RabbitMQ replaces our old ActiveMQ JMS server for asynchronous messaging.  This was a major undertaking because it meant getting rid of JMS as a standard for messaging in favor of AMQP, a protocol developed by the financial industry for high speed trading, etc. 

We use asynchronous messaging throughout Flex.  We use it for Quickbooks Integration, pushing data to Production Exchange, eager availability recalculations, search indexing, and many other applications.  Switching to RabbitMQ makes this messaging faster and more reliable - and is a necessary step for our high availability architecture where asynchronous message processing will be handled by dedicated servers.

Riak

Roger Diller did most of the work on our new Riak search implementation and he did a great write up of Riak earlier today.  I'd only add that Riak is perhaps the best of the so called NoSQL databases that have become popular in recent years as a way around the scalability problems of relational databases. 

The new version of Flex uses Riak instead of Lucene to process project element searches and store the indexes.  Over time our use of Riak will likely increase.  We're kicking around the idea of using Riak for document storage and will likely leverage it for inventory and contact search before long.

Ubuntu 13.04

As of this release, we will require all servers to run Ubuntu 13.04.  For customers that run cloud based instances, there's nothing to worry about.  We're already running 13.04 in our Sydney, Chicago, Montreal and Roubaix clusters.

All self-hosted customers will have to upgrade to 13.04 before 4.6.16 can be deployed.  If you run Flex on a self hosted system, we'll contact you shortly to schedule a maintenance window to perform the Ubuntu upgrade and install Flex 4.6.16.  Part of this process will include upgrading MySQL, installing Riak and RabbitMQ - in addition to basic upgrade process.  We'll also archive your Flex data and configuration prior to any OS upgrade just in case there's a problem that may require reinstalling the operating system from scratch.

Home Grown Fault Tolerance

Something we'd like system administrators to start thinking about is how they might want to deploy a high availability system on site once our software supports it.  To get started, you'd need 2 or more servers  - and some method of distributing load across the different servers and redirecting traffic when a server fails. 

The most common way of doing this is with a hardware load balancer.  Flex won't require any particular kind of load balancer, but you'll get better performance if the load balancer supports sticky sessions or server affinity that can be tied to configurable http headers or cookies (and most do).  Flex doesn't and won't use volatile user sessions (which means a server crash won't interrupt user logins), but it will use a hybrid local/remote caching system - and sticky sessions will increase the efficiency of this cache - and overall performance by reducing cache misses.

You could also use a software load balancer if you have an old server lying around.  We recommend HA Proxy if you go that route.

A poor man's load balancing would be to simply use round robin DNS.  This wouldn't make server crashes as painless for users as a true load balancer, but it would help break up your user load across multiple servers.

You'll also want to make sure any servers you provision for a multi-server Flex install have two network interfaces as the second interface will be used to form a private network between the servers for Riak synchronization and database IO.

And, if you wanted to be really cool, you could deploy a four server Flex configuration - two database servers configured as master and slave - with two applications servers in front of those - with load balancers in front of those.  This configuration would closely mirror the hardware configuration we'll use in our next generation cloud.

If all this sounds daunting or like overkill, don't worry.  Even once Flex has been modified to run on a multi-server high availability architecture, it will still run just fine on a single server, just like it always has.


Introducing Riak

In the upcoming 4.6.16 release, we've made some major changes to the search system. In fact, we introduced a brand new external NoSQL database system to use for search. Everyone, meet Riak.

Riak is a high speed/low latency key/value database that is masterless, distributed, & fault tolerant. Let me break that down.

Masterless means that in a cluster of Riak nodes, there is no primary or master node. All nodes are equal. This means it doesn't make sense to use Riak as a single node. Riak should be ran with at least 3+ nodes per cluster. Anyone of these nodes can go down without hurting anything. You can fix the node, bring it back up, and it will automatically begin participating in the cluster again. Riak uses a "gossip" protocol to spread all the data around among the nodes in the cluster.

Distributed just means the data is replicated among the nodes without any node being a single point of access or failure. If one of the nodes is not present, you can just ask another node. This eliminates the need for backups since the data is replicated on every node in the cluster.

Riak is fault tolerant because there is no single point of failure. If a node fails, you can just fix it and bring it back. Or, you could create a new node & add it to the cluster. If you need more performance, you can just continue to add more nodes to add capacity to the cluster.

There are a few drawbacks as well. One is that Riak is an eventually consistent database. This means you can store a value using one node but the other nodes in the cluster won't immediately have the value. It takes a bit to spread the data around in the cluster. Riak also does not have transactions, so there are no read or write locks to guarantee consistency. This means you need to use Riak with prudence. It is not suited for everything. For example, we would not want to use it to store Quote info since we need Quotes to be in a guaranteed consistent state. It is however very well suited for search documents & status report data since this info is either temporary or can be rebuilt.

The actual Riak storage interface is very simple. First you have the concept of "buckets" to add whatever namespace you need to your data. So you could have a "person" bucket to store all the data about a people. Within a bucket, everything is just a key & a value. The key is a just a string, and the value can be anything. It could be plain text, XML, JSON, etc or it could be binary data like an image or PDF file.

We are/will be using Riak for several things. Currently we are already using it to store customer status data that comes in from our Nagios monitoring system. This creates quite a lot of data since a status report comes in from each customer every 5 minutes. Riak has no trouble handling the volume since it can write very fast.

The second use is for search. Riak has a built in full text search engine called Riak Search. Once the 4.6.16 releases goes out, the search documents will be generated and stored into Riak. We took our existing High Speed Search Interface and created a Riak implementation. The search document is a JSON value. All the JSON fields are indexed automatically by Riak and we can search on any combination of the JSON fields.

We believe that Riak will create a more robust search system that will make it easier to find your data. Our prior search implementation was based on Lucene which seemed too search engine like (fuzzy searches, ranking, etc) for our purposes. Riak Search is more straightforward and should just find the data your are looking for.