
Riak 101

Post created 2013-11-21 16:30 by Gabe Koss.

Riak is a distributed key-value store written in Erlang. Development is carried out primarily by a company called Basho Technologies.

I first became aware of Riak while searching for solutions to complex data storage issues we had started to encounter dealing with large volumes of JSON data. Because we were researching it anyway, Sam Stelfox and I volunteered to give a presentation on the database at a @btvwag meetup.

Most of this content is derived from the slides for that presentation.

NoSQL, huh?

The event where we gave this presentation was a meetup covering various NoSQL technologies. We started with the following quotes from Andy Gross:

"NoSQL marketing is confusing... Everything does everything and at a small scale everything works."

"If you're evaluating Mongo vs. Riak or Couch vs. Cassandra you don't understand either your problem or the technologies."

Andy Gross, VP of Engineering, Basho Technologies

(approximate paraphrasing, hopefully not grossly misquoted)

NoSQL is often pitched as a sort of 'magic bullet' by both marketers and engineers. These technologies are generally less mature than their conventional relational cousins, and tools with significant differences between them are routinely compared head to head for problems that neither one actually solves.

As developers it is crucial to understand our problems fully before we move on to technologies which may or may not suit our purposes.

So what is Riak then?

Riak offers quite a lot. It is a RESTful key-value store with some internal intelligence. As far as I have been able to decipher, its most significant features are the ones this article walks through: simple key-value storage over HTTP (including binary data), clustering via a consistent-hashing ring, tunable CAP trade-offs and built-in MapReduce.

Other features which I will not dig into in this article include full-text search via Solr, link walking, commit hooks for data processing and a secondary/custom indexing system.

Key-Value Heaven

Since it is HTTP based, Riak uses a URL scheme to organize keys and values into data buckets. Buckets do not have to be strictly defined up front the way tables in a relational database do.

The basic URL scheme looks like: http://riak-node-hostname:8098/riak/<bucket-name>/<key-name>

This scheme mirrors Riak's internal data model, which is basically made up of buckets (namespaces), keys (unique names within a bucket) and values (the data itself).

Simple Example

A basic PUT / GET operation to save some data into a bucket looks like this:

$ curl -XPUT "http://localhost:8098/riak/my-bucket/my-key"\
  --header "Content-Type: application/json" \
  --data '{"living_in":"the future!"}'

$ curl -XGET "http://localhost:8098/riak/my-bucket/my-key"
  {"living_in":"the future!"}

In this example a JSON blob containing '{"living_in":"the future!"}' is stored in the my-bucket bucket with the key my-key. It is then retrieved using the same bucket/key combo.
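
The rest of the basic operations look much the same. Here is a sketch of deleting a value and listing a bucket's keys, assuming the same local node. Note that listing keys requires scanning the entire keyspace, so it is expensive on a production cluster:

# Delete the value stored at my-bucket/my-key
$ curl -XDELETE "http://localhost:8098/riak/my-bucket/my-key"

# List all keys currently in my-bucket (expensive!)
$ curl -XGET "http://localhost:8098/riak/my-bucket?keys=true"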

Binary Data

The simplicity of the data model at Riak's core makes saving other file types just as easy. Here is an example of putting an image of a horse into the images bucket with the key horse.jpg:

$ curl -XPUT "http://localhost:8098/riak/images/horse.jpg"\
  --header "Content-Type: image/jpeg" \
  --data-binary @/home/user/horse_pic.jpg

$ curl -XGET "http://localhost:8098/riak/images/horse.jpg" > /home/user/new_horse.jpg

$ md5sum /home/user/horse_pic.jpg
3f9bdc9366aec1f98839f47717ada5bf horse_pic.jpg

$ md5sum /home/user/new_horse.jpg
3f9bdc9366aec1f98839f47717ada5bf new_horse.jpg

After putting in the binary data and retrieving it we see that the MD5 checksums match, which indicates that the image was stored and retrieved intact.
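
Riak also remembers the Content-Type supplied on the PUT and hands it back on retrieval. An easy way to verify this (plain HTTP, nothing Riak-specific) is a HEAD request; the response headers should include Content-Type: image/jpeg:

# Fetch just the headers for the stored image
$ curl -I "http://localhost:8098/riak/images/horse.jpg"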

It's distributed, right?

The Ring

The primary way Riak handles distributed clustering is a mechanism called 'The Ring'. Riak splits its keyspace into a fixed number of partitions (the default is 64). Data is hashed using the bucket/key combination in a repeatable way, and this hash effectively becomes the object's unique identifier, determining which partition it lands in.

The partitions are distributed evenly across all nodes of the cluster, arranged as a ring. For example, if there were four nodes, Node 1 would be assigned partitions 1, 5, 9 ... and so forth.

Data is also replicated to neighboring partitions, so each value lives on multiple redundant nodes.

Riak Ring

image source: http://docs.basho.com/shared/1.4.2/images/riak-ring.png
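
To see the ring from a running node, the HTTP stats endpoint reports the partition count and cluster membership among many other things. A sketch, assuming a stock single-node install:

$ curl -XGET "http://localhost:8098/stats" \
  --header "Accept: application/json"

# The JSON response includes fields along the lines of:
#   "ring_num_partitions": 64,
#   "ring_members": ["riak@127.0.0.1"]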

Where does it fit on the CAP spectrum?

The CAP theorem is the idea that a distributed system can only guarantee two of the following three properties: consistency (every read sees the most recent write), availability (every request gets a response) and partition tolerance (the system survives lost or delayed messages between nodes). Since any real network can partition, the practical choice is between the other two:

"If you have a system that can get a network partition you have a choice: do you want to be consistent or do you want to be available?"

Martin Fowler

Generally any sort of distributed database system will optimize for either CP or AP. Riak does not make this decision for you.

Tuning for CAP

There are several properties which can be applied to Riak system wide, per bucket or on a per-value basis: n_val (the number of replica copies to keep), r_val (how many replicas must answer before a read succeeds) and w_val (how many replicas must acknowledge before a write succeeds). This allows the Riak cluster's behavior to be tightly coupled to the type of data being stored in it.

There are also two more advanced options: dw (how many replicas must durably write to disk before a write succeeds) and rw (the quorum used for delete operations).

Optimize for Consistency and Availability

{
  n_val: "<all>",
  r_val: "1",
  w_val: "<all>"
}

Optimize for Consistency and Partition Tolerance

{
  n_val: "<high value>",
  r_val: "<n_val quorum+>",
  w_val: "<n_val quorum+>"
}

Optimize for Availability and Partition Tolerance

{
  n_val: "<high value>",
  r_val: "<low value>",
  w_val: "<low value>"
}
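
For reference, here is how these knobs actually get set over HTTP; in the API the bucket properties are spelled n_val, r and w. Bucket-level defaults go in as a props document, and the quorums can also be overridden per request as query parameters. A sketch:

# Set bucket-wide defaults: three replicas, quorum writes
$ curl -XPUT "http://localhost:8098/riak/my-bucket" \
  --header "Content-Type: application/json" \
  --data '{"props":{"n_val":3,"w":"quorum"}}'

# Relax the read quorum for a single fast read
$ curl -XGET "http://localhost:8098/riak/my-bucket/my-key?r=1"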

A few more samples

Ruby Client

Since I use a lot of Ruby, the first thing I checked out was the Ruby client. There is a client gem which can be installed via gem install riak-client.

The project source lives on GitHub. It does not seem as active as I'd like. Here is a sample anyway:

require 'rubygems'
require 'riak'

# Create a client for localhost
@client = Riak::Client.new

# Specify a bucket
@bucket = @client.bucket('sample-bucket')

# Build an object
@object = @bucket.get_or_new('sample-key')
@object.content_type = 'application/json'
@object.data = '{"sample":"data"}'
@object.store

# Access the object
@bucket.get('sample-key')
#=> #<Riak::RObject {sample-bucket,sample-key} [#<Riak::RContent [application/json]:"{\"sample\":\"data\"}">]>

Simple Map Reduce

The second thing I wanted to investigate was the MapReduce functionality.

Chains of functions (written in JavaScript or Erlang) are sent in and distributed amongst the nodes. Each node runs the query against its own keys and values, and the results are merged and returned to the client.

For a general MapReduce overview take a look at the Wikipedia article.

Start by seeding some simple sample data as follows. Note the bucket is users, the key is the user's email address and the value is simply some basic information.

$ curl -XPUT "http://localhost:8098/riak/users/btables@gmail.com"\
  --header "Content-Type: application/json" \
  --data '{
    "first_name":"Bobby",
    "last_name":"Tables"
  }'

# do this a few more times ...

A map or reduce query can then pull this data out in a new format. This one is a single JavaScript map phase:

$ curl -XPOST "http://localhost:8098/mapred"\
  --header "Content-Type: application/json" \
  --data '{
    "inputs":"users",
    "query":[{
      "map":{
        "language":"javascript",
        "source":"function(record) {
          var record_data = JSON.parse(record.values[0].data);
          return [[
              record.key,
              record_data.first_name,
              record_data.last_name
            ]];
        }"}}]}'

# Returns:
# 
# [ ["gabe@gabekoss.com",  "Gabe",   "Koss"   ],
#   ["sam@stelfox.net",    "Sam",    "Stelfox"],
#   ["btables@gmail.com",  "Bobby",  "Tables" ]  ]

Summary

Overall we ended up not using Riak. At least not yet. In general I think it is a strong system when your problem genuinely plays to its strengths.

Use Riak for...

- Large volumes of simple key-value data, whether JSON documents or binary blobs
- Systems where high availability and horizontal scale matter more than strict consistency

Don't use Riak for...

- Highly relational data or anything that needs complex ad-hoc queries
- Problems a conventional relational database already solves comfortably

Additional Resources

- The official Basho documentation: http://docs.basho.com/
