Distributed Data Networks with GaianDB

2010-10-31

Technology

Databases, NoSQL

This weekend has been a rather fascinating one. It was BathCamp 2010, where I gave a couple of different talks on Scala and NoSQL and also listened to others talk about new and interesting tech. One of those talks was by Dale Lane from IBM about their AlphaWorks project, GaianDB.

Firstly, a very interesting talk by Dale. I have monitored the NoSQL space for the past 18 months with a close eye. The way in which we are building applications and the scale to which we as developers have to take said apps is really reaching new, dizzying heights - these are exciting times. With that said, GaianDB was a new one on me, and very out of left field.

So what is this thing?

Well, I must say that the “DB” part of the name feels a little wrong, as Gaian is essentially a way of treating lots of heterogeneous distributed data stores as one coherent unit that can be queried and accessed from any part of the “network”. IBM claim that Gaian is:

biologically inspired in that it strives to minimize network diameter and maximize connections to the most fit nodes. GaianDB advocates a flexible “store locally, query anywhere” (SLQA) paradigm.

This strikes me as pretty interesting for several reasons:

Today is the age of the supposed “cloud” where centralised data storage and large scale applications rule. It takes a brave team to buck that trend and go for a strategy that is different.
Being connected all the time, anywhere, with anything is, alas, still a dream. There are many business situations and organisations that are physically unable to be connected all the time, perhaps because it simply is not practical for either cost or network performance reasons.
Whilst it is technically possible to hold large volumes of data centrally, for a lot of telemetry-style applications this may not actually be the best route.
With more and more people/things becoming “connected”, having a system that can be both independent and collectively connected seems… interesting.
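
To make the “store locally, query anywhere” idea a bit more concrete, here is a minimal Python sketch. It is a toy model and not GaianDB's actual API: the Node class, its connect and query methods, and the sensor rows are all names I have made up for illustration. Each node keeps its own rows, and a query issued at any node is passed on to its neighbours until every reachable node has answered once.

```python
class Node:
    """Hypothetical node: stores its own rows and knows its direct neighbours."""

    def __init__(self, name):
        self.name = name
        self.rows = []        # data stored locally on this node
        self.neighbours = []  # directly connected nodes

    def connect(self, other):
        # Links are bidirectional in this toy model.
        self.neighbours.append(other)
        other.neighbours.append(self)

    def query(self, predicate, _seen=None):
        """Return matching rows from this node and every node reachable from it."""
        seen = _seen if _seen is not None else set()
        if self.name in seen:
            return []  # already answered, avoids looping around the network
        seen.add(self.name)
        results = [row for row in self.rows if predicate(row)]
        for neighbour in self.neighbours:
            results.extend(neighbour.query(predicate, seen))
        return results


# Three nodes in a line: a <-> b <-> c. Data lives only on a and c.
a, b, c = Node("a"), Node("b"), Node("c")
a.connect(b)
b.connect(c)
a.rows = [{"sensor": "temp", "value": 21}]
c.rows = [{"sensor": "temp", "value": 35}, {"sensor": "humidity", "value": 60}]

# Ask node b, which holds no data itself, for hot temperature readings.
hot = b.query(lambda r: r["sensor"] == "temp" and r["value"] > 30)
print(hot)
```

The point of the sketch is only that the caller never needs to know where a row lives: asking a, b or c gives the same set of rows. A real system would also have to deal with node failures, timeouts and duplicate links, none of which is modelled here.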

Here’s an example of Gaian nodes within a visualisation system:

Some Figures (from IBM)

Given 1000 Gaian network nodes and one million rows of data, the following was recorded:

Query Time - We are able to query all 1000 nodes in about 1/8 second. The results show that the query time grows logarithmically - in other words as you add more and more databases, the increase in query time slows down, providing excellent scaling. The way that a Gaian Network is grown from individual nodes automatically ensures this behaviour.
Fetch Time - We are able to fetch 1 million rows of data in under 5 seconds. The fetch time is proportional to the amount of data returned so that if you fetch twice the data it takes twice as long regardless of which of the 1000 nodes the data resides in. The Gaian Database actively pre-fetches the data from all the nodes to achieve this scalability.
Concurrent Queries - I injected queries from up to 40 nodes at the same time, the Gaian Database showed that it could handle these queries robustly with a modest increase in the query time due to running out of available processor time on our test platform.
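
The scaling claims above can be illustrated with some back-of-the-envelope Python. This is a rough model under my own assumptions, not IBM's measurement method: hops_to_reach assumes every node forwards a query to a fixed number of new nodes (the fanout of 3 is arbitrary), and fetch_seconds assumes a flat 200,000 rows per second, which is simply the lower bound implied by one million rows in under 5 seconds.

```python
def hops_to_reach(node_count, fanout=3):
    """Hops needed for a query to reach node_count nodes when each newly
    reached node forwards to `fanout` new nodes (a hypothetical, tree-like
    network, chosen only to show logarithmic growth)."""
    reached, frontier, hops = 1, 1, 0
    while reached < node_count:
        frontier *= fanout   # nodes reached for the first time on this hop
        reached += frontier
        hops += 1
    return hops


def fetch_seconds(rows, rows_per_second=200_000):
    """Linear fetch-time model: twice the rows takes twice as long.
    The default rate is an assumption derived from the quoted figure of
    1 million rows in under 5 seconds."""
    return rows / rows_per_second


for n in (10, 100, 1000):
    print(n, "nodes ->", hops_to_reach(n), "hops")
# 10 nodes -> 2 hops
# 100 nodes -> 4 hops
# 1000 nodes -> 6 hops

print(fetch_seconds(1_000_000), fetch_seconds(2_000_000))
# 5.0 10.0
```

In this model, multiplying the node count by ten adds only a couple of hops, which is the kind of behaviour “query time grows logarithmically” describes, while the fetch cost stays tied to the amount of data returned rather than to where it lives.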

Tentative Conclusion

For some unknown reason, I do find this sort of thing really interesting. I think there are a bunch of use cases where this type of workflow could become useful - I mean, it’s essentially like Skynet… small pieces of tech that can communicate with other pieces of nearby tech and also be commanded as a large whole. To that end, I’ll be keeping an eye on this project to see how things evolve over time and pondering some other use cases for such a technology. Who knows, perhaps we’ll see some kind of interactive advertising that merges RFID and smart telemetry over similar data networks to tailor customer experiences or something.