23 May
2009

Relaxing on the Couch(DB)


Over the last couple of months, I heard some buzz around CouchDB at several user groups. Listening to this podcast really got me interested, so I decided to learn more about it and find out what all the fuss is about.


Is CouchDB yet another relational database? Heck no! CouchDB is an open-source document-oriented database. It was originally created by Damien Katz and is developed in Erlang. The data stored in CouchDB cannot be accessed using traditional SQL, only through a RESTful API. CouchDB can be installed on most POSIX systems, including Linux and Mac OS. It can also be installed on Windows, but this isn’t officially supported and, according to this page on the wiki, not all IO-related features are fully functional. But don’t let that scare you away. It can be fun to spend some time with Ubuntu :-). The final goal of CouchDB is to provide a scalable, reliable and fault-tolerant document database that can run on failure-prone systems.
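To give an idea of what "through a RESTful API" means in practice, here is a minimal Python sketch that builds the HTTP request for storing a document. It assumes a CouchDB server on its default local address (http://localhost:5984); the database name "customers", the document id "chuck" and the helper build_put_document are made up for illustration. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Assumption: CouchDB is listening on its default local address.
COUCH_URL = "http://localhost:5984"


def build_put_document(db, doc_id, doc):
    """Build (but do not send) the HTTP PUT request that stores a document."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url=f"{COUCH_URL}/{db}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical database name and document id.
request = build_put_document(
    "customers", "chuck", {"FirstName": "Chuck", "LastName": "Norris"}
)
print(request.get_method(), request.full_url)
```

With a running server, passing this request to urllib.request.urlopen would actually store the document.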

So, what’s the difference between a relational database and a document-oriented database, and why would you care? As you probably know, the center of the RDBMS world is the row. It may not come as a surprise that the centerpiece of a document-oriented database is the document.

A document in a document-oriented database is completely self-contained. There are no tables, rows, foreign keys, joins and more importantly, there is no database schema at all! You can store whatever data you’d like for a particular document and use a completely different set of data for another. Once you have a set of documents stored, you can add/remove data of one or more documents without affecting others.

For example, when your application wants to store customer information like the first and last name, then these attributes can be stored in a document for every customer. As documents in CouchDB are stored as JSON, a customer document would look like this:

{ "_id" : "...", "_rev" : "...", "FirstName" : "Chuck", "LastName" : "Norris" }

Suppose your application has been in production for almost a year now and, to celebrate this fact, the shoe size of the customer must be stored as well.

Simple. You don’t have to change any schema or anything (and face those nasty DBAs). You just start saving the valuable shoe size information for every new/existing customer from now on.

{ "_id" : "...", "_rev" : "...", "FirstName" : "Chuck", "LastName" : "Norris", "ShoeSize" : "52" }

You probably want to update the already existing documents as soon as you can get a hold of this extra information, but the point is that you don’t have to.
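On the application side, this simply means that code reading customer documents should not assume the new field is there. A small Python sketch, with documents represented as plain dictionaries (the second customer and the helper shoe_size are made up for illustration):

```python
# Two customer documents: one stored before the shoe size was introduced,
# one stored afterwards. The second customer is a hypothetical example.
customers = [
    {"_id": "1", "FirstName": "Chuck", "LastName": "Norris"},
    {"_id": "2", "FirstName": "Bruce", "LastName": "Lee", "ShoeSize": "42"},
]


def shoe_size(doc):
    """Return the shoe size if the document has one, otherwise None."""
    return doc.get("ShoeSize")


# Map each document id to its shoe size (or None when it is missing).
sizes = {doc["_id"]: shoe_size(doc) for doc in customers}
print(sizes)
```

Older documents keep working untouched; they just report no shoe size until someone fills it in.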

Another major difference between a relational database and a document-oriented database is the unique identifier. A table in an RDBMS typically has a primary key in order to identify a row. The value of the primary key column is mostly generated by some sequence generator. This means that the identifier of a row is only unique for the table itself. CouchDB, on the other hand, provides a document identifier that must be unique across multiple databases. It has to be unique on a global scale because, in order to achieve high availability, CouchDB has a replication feature built in that makes it possible to replicate between multiple database instances across multiple machines. The safest way to uniquely identify a document is to provide a GUID or let CouchDB generate a unique random identifier for you.
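If you want to supply the identifier yourself, generating a GUID on the client is straightforward. A minimal Python sketch using the standard uuid module (the helper new_customer is made up for illustration):

```python
import uuid


def new_customer(first_name, last_name):
    """Create a customer document with a client-generated, random _id."""
    return {
        "_id": uuid.uuid4().hex,  # 32 hexadecimal characters
        "FirstName": first_name,
        "LastName": last_name,
    }


first = new_customer("Chuck", "Norris")
second = new_customer("Chuck", "Norris")
print(first["_id"], second["_id"])
```

Because the identifier does not come from a per-table sequence, two machines can create documents independently without handing out the same id.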

Another interesting feature of CouchDB is the way it handles versioning. But first, let’s talk a bit about how you would traditionally solve this using a relational database.

In order to achieve optimistic concurrency in an RDBMS, you have to include a version column that you update with every row change and use in the WHERE clause of the update statement involved. But what if you want to know what changed in a particular row last month? In order to store that kind of information, you have to create a separate table where you insert a snapshot of the row for every update statement that gets executed on the table (probably using a trigger). The point that I’m trying to make here is that a relational database is not going to help you with that. You have to provide all this plumbing yourself.
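For reference, the version-column approach can be sketched in a few lines of Python with an in-memory SQLite database. The table layout and the helper update_last_name are assumptions made for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, last_name TEXT, version INTEGER)"
)
conn.execute("INSERT INTO customer VALUES (1, 'Norris', 1)")


def update_last_name(conn, customer_id, new_name, expected_version):
    """Update the row only if nobody changed it since we read expected_version."""
    cursor = conn.execute(
        "UPDATE customer SET last_name = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_name, customer_id, expected_version),
    )
    # No row matched means another client got there first.
    return cursor.rowcount == 1


# Both clients read version 1; only the first update succeeds.
first_ok = update_last_name(conn, 1, "Norris Jr.", 1)
second_ok = update_last_name(conn, 1, "Norris Sr.", 1)
print(first_ok, second_ok)
```

Note that this only covers the concurrency check. Keeping the history of the row would still require the extra snapshot table.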

As you may have noticed from the customer example above, there are two extra fields, _id and _rev. Every document in CouchDB contains this extra metadata. We just discussed that the document identifier is stored in the _id field. The revision of the document is stored in the _rev field. When you create a new document, this revision field is filled in by CouchDB and returned to you. From that moment on, a change is never made to the existing document in place; instead, a new version of the entire document is created. This means that earlier revisions of a document are kept by the database, at least until the database is compacted. The revision is also used for implementing concurrency. When two clients want to make changes to the same document, the first client will be able to store its changes while the second one will receive a notification about the conflict.
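The conflict behaviour is easy to picture with a toy model. The Python sketch below is not how CouchDB is implemented; it is an in-memory stand-in that uses a plain integer counter for _rev, and the names ToyDocumentStore and ConflictError are made up for illustration. A real CouchDB server answers the second client with an HTTP 409 Conflict response instead.

```python
class ConflictError(Exception):
    """Raised when a document is saved with a stale revision."""


class ToyDocumentStore:
    """In-memory toy model of revision checking (not CouchDB's real code)."""

    def __init__(self):
        self._docs = {}

    def save(self, doc):
        current = self._docs.get(doc["_id"])
        current_rev = current["_rev"] if current else None
        # The caller must present the revision it last saw.
        if doc.get("_rev") != current_rev:
            raise ConflictError(doc["_id"])
        next_rev = 1 if current_rev is None else current_rev + 1
        # Store a whole new version of the document with the new revision.
        self._docs[doc["_id"]] = dict(doc, _rev=next_rev)
        return next_rev


store = ToyDocumentStore()
rev1 = store.save({"_id": "chuck", "LastName": "Norris"})

# Two clients both start from revision 1.
client_a = {"_id": "chuck", "_rev": rev1, "LastName": "Norris", "ShoeSize": "52"}
client_b = {"_id": "chuck", "_rev": rev1, "LastName": "Norris", "ShoeSize": "53"}

rev2 = store.save(client_a)  # first writer wins
try:
    store.save(client_b)     # second writer holds a stale revision
    conflict = False
except ConflictError:
    conflict = True
print(rev2, conflict)
```

The second client would then fetch the latest revision, reapply its change and try again.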

That’s all fine and sweet. But what’s the point of having a document database filled with all the information I want, but not having SQL at my disposal to get it back out?

CouchDB relies on views to retrieve the information you want out of your documents. These views are computed using Map/Reduce, a way of dealing with large sets of data using parallelism. This concept has its roots in functional languages such as Lisp, but was popularized by Google. You can read the paper here.
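To get a feel for the idea, here is a minimal Python sketch of a map step and a reduce step that counts customers per last name. CouchDB views are normally written in JavaScript and run by the database itself, so this is only an illustration of the concept; map_doc, reduce_values and run_view are made-up names, and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_doc(doc):
    """Map step: emit a (key, value) pair for each document of interest."""
    if "LastName" in doc:
        yield doc["LastName"], 1


def reduce_values(values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def run_view(docs):
    """Group the emitted pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}


docs = [
    {"_id": "1", "FirstName": "Chuck", "LastName": "Norris"},
    {"_id": "2", "FirstName": "Aaron", "LastName": "Norris"},
    {"_id": "3", "FirstName": "Bruce", "LastName": "Lee"},
    {"_id": "4", "Title": "not a customer"},  # skipped by the map step
]
result = run_view(docs)
print(result)
```

Since the map step looks at one document at a time, the work can be split over many documents in parallel, which is what makes the approach attractive for large data sets.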

I’m planning to write a couple of blog posts in the near future about the things I’ve learned so far and the new things I’m going to learn about CouchDB along the way. In the meantime, if you can’t wait to learn more, then check out these forthcoming books:

I’ve only read the MEAP of CouchDB in Action, which is only about 30 pages so far but looks very promising.

For my next post on CouchDB, I will try to make a summary of the steps I have gone through for installing CouchDB on a clean Ubuntu Linux installation.