The next generation of databases

Yesterday I was reading an interview with Brian Aker (MySQL director of technology), which I found via Slashdot, when something caught my attention.

On the second side of this which may actually be more exciting is the issue of–instead of the structured data world of the relational database but the semi–the semi-structured world. You look at what is being done today with CouchDB, you look at Amazon ScaleDB, to a lesser extent but to a similar extent you–not ScaleDB, SimpleDB–to a lesser extent or a similar extent Tokyo Cabinet, those databases are really kind of fascinating because those databases are redefining really how we access data and how we are going to be searching and using data. So there’s a whole world out there that’s just starting to open up in that direction.

For a while now, I have been using different flavors of metadata stores. Everything tends to work nice and dandy as long as you do not push the storage volume too far. For instance, together with Bernie Acs at NCSA, I have run experiments where we could handle up to 280 million triples using Virtuoso, or up to 60 million triples using Jena with a MySQL back end, without much trouble and still run arbitrary SPARQL queries in a reasonable time. However, these were relatively small tests. The first one represented only 120 documents in a collection, whereas the second one was only a subset of the Wikipedia link graph. Yes, there are ways to move beyond that via proper striping and replication of the data, but that is not a key concern of such engines by default. Another sad note was that we had to drop Mulgara because we had a pretty hard time pushing it that far (most of the bugs we ran into have been fixed since last year, and they have started a push toward getting SPARQL in, so it may be time to revisit it).

However, none of the above approaches was born out of a distributed environment. Lately, I have started looking for alternative large-scale storage born from the distributed environment's soup. After the Hadoop Summit/BCDSG 2008 trip I started looking into HBase (the Hadoop community's open source take on Google's Bigtable). Not a bad alternative if you can fit your application's data needs into its structure. Since it runs on Hadoop FS, you get all its benefits for free.

After reading the above-mentioned interview, though, I found myself intrigued by CouchDB and Tokyo Cabinet. I guess I'd better go and take a look at them :D