Tuesday, November 12, 2013

Introducing Riak

In the upcoming 4.6.16 release, we've made some major changes to the search system. In fact, we introduced a brand new external NoSQL database system to use for search. Everyone, meet Riak.

Riak is a high speed/low latency key/value database that is masterless, distributed, & fault tolerant. Let me break that down.

Masterless means that in a cluster of Riak nodes, there is no primary or master node. All nodes are equal. This means it doesn't make sense to use Riak as a single node. Riak should be run with at least 3 nodes per cluster. Any one of these nodes can go down without hurting anything. You can fix the node, bring it back up, and it will automatically begin participating in the cluster again. Riak uses a "gossip" protocol to keep every node informed about cluster membership and state.

Distributed just means the data is replicated among the nodes without any node being a single point of access or failure. If one of the nodes is not present, you can just ask another node. Since each value is stored on several nodes (three copies by default), losing a single node does not lose data. Replication is not a full substitute for backups, though, because a bad write or delete gets replicated too.
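
To make the "just ask another node" idea concrete, here is a small Python sketch. It is a toy model and not the Riak client: `Node`, `NodeDown`, `read_from_cluster`, and the in-memory dictionaries are hypothetical names made up for illustration.

```python
class NodeDown(Exception):
    """Raised when a (simulated) node cannot be reached."""


class Node:
    """Toy stand-in for one cluster node holding a replica of the data."""

    def __init__(self, name, data, up=True):
        self.name = name
        self.data = dict(data)
        self.up = up

    def get(self, key):
        if not self.up:
            raise NodeDown(self.name)
        return self.data[key]


def read_from_cluster(nodes, key):
    """Try each node in turn and return the first successful answer."""
    for node in nodes:
        try:
            return node.get(key)
        except NodeDown:
            continue  # this node is unavailable, so ask the next one
    raise NodeDown("no node in the cluster could be reached")


replica = {"customer:42": "status OK"}
cluster = [
    Node("node1", replica, up=False),  # simulate a failed node
    Node("node2", replica),
    Node("node3", replica),
]
value = read_from_cluster(cluster, "customer:42")
```

Even though `node1` is down, the read still succeeds because `node2` holds the same value.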

Riak is fault tolerant because there is no single point of failure. If a node fails, you can just fix it and bring it back. Or, you could create a new node & add it to the cluster. If you need more performance, you can just continue to add more nodes to add capacity to the cluster.

There are a few drawbacks as well. One is that Riak is an eventually consistent database. This means you can store a value using one node, but the other nodes in the cluster won't immediately have the value. It takes a moment for the data to spread around the cluster. Riak also does not have transactions, so there are no read or write locks to guarantee consistency. This means you need to use Riak with prudence. It is not suited for everything. For example, we would not want to use it to store Quote info since we need Quotes to be in a guaranteed consistent state. It is, however, very well suited for search documents & status report data since this info is either temporary or can be rebuilt.
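
Eventual consistency is easier to see in a small simulation. The Python sketch below is a toy model only; `ToyCluster` and its `replicate` step are hypothetical and say nothing about how Riak actually moves data between nodes. It just shows a read on a second node missing a value until replication catches up.

```python
from collections import deque


class ToyCluster:
    """Toy model of eventual consistency: a write lands on one node first
    and reaches the other nodes only when replication runs."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}
        self.pending = deque()  # replication work not yet delivered

    def put(self, node_name, key, value):
        # The write is accepted by one node right away...
        self.nodes[node_name][key] = value
        # ...and queued for delivery to every other node.
        for other in self.nodes:
            if other != node_name:
                self.pending.append((other, key, value))

    def get(self, node_name, key):
        # Returns None (or an old value) if replication has not caught up.
        return self.nodes[node_name].get(key)

    def replicate(self):
        # Deliver all queued writes; a real cluster does this in the background.
        while self.pending:
            node_name, key, value = self.pending.popleft()
            self.nodes[node_name][key] = value


cluster = ToyCluster(["node1", "node2", "node3"])
cluster.put("node1", "report:1", "OK")
before = cluster.get("node2", "report:1")  # node2 has not seen the write yet
cluster.replicate()
after = cluster.get("node2", "report:1")   # the write has now arrived
```

Here `before` is `None` and `after` is `"OK"`. Any code that reads right after a write has to tolerate that window, which is why this model fits status reports better than Quotes.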

The actual Riak storage interface is very simple. First you have the concept of "buckets" to add whatever namespace you need to your data. So you could have a "person" bucket to store all the data about people. Within a bucket, everything is just a key & a value. The key is just a string, and the value can be anything. It could be plain text, XML, JSON, etc., or it could be binary data like an image or PDF file.
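
The bucket/key/value model can be sketched in a few lines of Python. This is an in-memory toy and not the Riak API: `ToyStore`, its `put` and `get` methods, and the sample keys are all hypothetical, chosen only to show that a bucket is a namespace and a value is opaque data with a content type.

```python
import json


class ToyStore:
    """Toy in-memory model of buckets holding key/value pairs."""

    def __init__(self):
        self.buckets = {}

    def put(self, bucket, key, value, content_type="text/plain"):
        # Values are kept as raw bytes plus a content type, so any data fits.
        if isinstance(value, str):
            value = value.encode("utf-8")
        self.buckets.setdefault(bucket, {})[key] = (content_type, value)

    def get(self, bucket, key):
        return self.buckets[bucket][key]


store = ToyStore()
person = {"name": "Ada", "city": "London"}

# A JSON value and a binary value can live side by side in the same bucket.
store.put("person", "ada", json.dumps(person), content_type="application/json")
store.put("person", "ada-photo", b"\x89PNG placeholder", content_type="image/png")

content_type, raw = store.get("person", "ada")
loaded = json.loads(raw.decode("utf-8"))
```

The store never looks inside the value. It is up to the caller to know that the "ada" key holds JSON and the "ada-photo" key holds image bytes.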

We are using, or will be using, Riak for several things. We already use it to store customer status data that comes in from our Nagios monitoring system. This creates quite a lot of data since a status report comes in from each customer every 5 minutes. Riak has no trouble handling the volume since it can write very fast.

The second use is for search. Riak has a built-in full text search engine called Riak Search. Once the 4.6.16 release goes out, the search documents will be generated and stored in Riak. We took our existing High Speed Search Interface and created a Riak implementation. The search document is a JSON value. All the JSON fields are indexed automatically by Riak and we can search on any combination of the JSON fields.
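
To illustrate what "search on any combination of the JSON fields" means, here is a toy Python index. It is not Riak Search and does not use its query syntax: `ToyFieldIndex`, the exact-match lookup, and the sample field names (`type`, `name`, `state`) are assumptions made up for this sketch.

```python
import json
from collections import defaultdict


class ToyFieldIndex:
    """Toy index: every (field, value) pair in a JSON document points to its key."""

    def __init__(self):
        self.index = defaultdict(set)

    def add(self, key, json_text):
        doc = json.loads(json_text)
        for field, value in doc.items():
            # Index each top-level field automatically, ignoring case.
            self.index[(field, str(value).lower())].add(key)

    def search(self, **criteria):
        # Return the keys of documents matching every field=value pair given.
        matches = [
            self.index.get((field, str(value).lower()), set())
            for field, value in criteria.items()
        ]
        return set.intersection(*matches) if matches else set()


idx = ToyFieldIndex()
idx.add("doc1", json.dumps({"type": "customer", "name": "Acme", "state": "TX"}))
idx.add("doc2", json.dumps({"type": "customer", "name": "Globex", "state": "TX"}))
idx.add("doc3", json.dumps({"type": "invoice", "name": "Acme", "state": "OK"}))

texas_customers = idx.search(type="customer", state="TX")
acme_customers = idx.search(type="customer", name="acme")
```

The first query returns the keys `doc1` and `doc2`, and the second returns only `doc1`. A document either matches all of the given fields or it does not, with no ranking or fuzzy matching involved.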

We believe that Riak will give us a more robust search system that will make it easier to find your data. Our prior search implementation was based on Lucene, which seemed too search-engine-like (fuzzy searches, ranking, etc.) for our purposes. Riak Search is more straightforward and should just find the data you are looking for.
