Thursday, September 6, 2012

NoSQL and Heavy Data

This post is OBE. I couldn't bring it to a useful conclusion, and left it alone for too long. A bunch of things have evolved in my thinking about NoSQL, as well as the work I'm doing with some of the tools. I'm publishing it only because it's a useful foil for some other things I want to write about in the future.

***

I have been paying more attention to the NoSQL world lately, and to all of the choices available. The Amazon paper about the early genesis of Dynamo is interesting, and seems to have influenced a bunch of people to create cool tools across a wide range of domains. I've also read both the Google Bigtable paper and the Percolator paper, and scanned through a presentation on F1. I'm interested in some of the scalability and performance characteristics available in the different toolsets, but also just interested in other ways to solve the persistence problem with web applications.
 
I've tried out a couple of systems, not extensively but enough to get a flavor for how they (the systems I've tried) work. My current favorite is redis, for its clean implementation of all those basic data types from that 1st level Data Structures & Algorithms class from college. I have a couple thoughts about redis, but first let's set the wayback machine to 2000 so I can get my crusty-DBA rant on.

The first non-relational database I ever used was probably Zope's ZODB. It has power and a couple weird limitations, but clearly provides more than enough capability to run large, complex concurrent/transactional systems (edit: for some circa-2000 definition of large and complex). Using Zope (and later Plone) also gave me my first taste of the hellish netherworld of upgrading that goes along with schema-less implementations.

In a relational database, you can establish constraints to ensure referential integrity, to limit the values that a column can assume, to require that master data be defined, and so on. If you really wanted to you could build relational schemas without constraints <snark>like a current project of mine</snark>, but in so doing you give up your ability to communicate your design to teams in the future. When a developer attempts to change system code against a well-defined model, they will quickly figure out any incorrect assumptions, by way of hard errors.

On the flip side, a poorly-constrained data model might allow all sorts of creativity, and will certainly not forbid it. Since we all know that what is not expressly forbidden is allowed (some programmers would say "encouraged"), if your data model is unconstrained it is implying to your future programmers that you *intend* for the design to be freely modified. Fast-forward to schema-less designs, and that intention is now a feature. Professionally, I seem to specialize in legacy systems that have been built with or de-constrained into spotty schemas, and so am extra-cranky about slapdash domain models.

During the heady pre-Plone days of the Zope Content Management Framework (CMF), any mildly interesting feature you wanted to add to your code meant an upgrade cycle. You could walk the database tree looking for objects of your type, or you could search the catalog (if you'd registered your objects there), and then poke each existing data object into compliance with your new model. Very entertaining. Current versions of Plone are more explicit about the upgrade path, paying attention and providing some services for handling that transition, but upgrades are still fraught. And they often don't work, stranding your data.
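For a feel of what that walk-and-poke upgrade looks like, here is a minimal sketch. It is not Zope or ZODB code: plain dicts and lists stand in for the persistent object tree, and the `Article` type, the `tags` attribute, and both function names are made up for illustration.

```python
# Hypothetical sketch: plain dicts stand in for a persistent object tree.
# None of these names come from Zope or ZODB.

def walk(node):
    """Yield every dict in a nested tree of dicts and lists, depth first."""
    if isinstance(node, dict):
        yield node
        children = list(node.values())  # copy, since callers may mutate node
    elif isinstance(node, list):
        children = list(node)
    else:
        return
    for child in children:
        yield from walk(child)


def upgrade_articles(root):
    """Give every old-style Article the new 'tags' attribute; return the count."""
    touched = 0
    for obj in walk(root):
        if obj.get("type") == "Article" and "tags" not in obj:
            obj["tags"] = []  # default value for the newly added attribute
            touched += 1
    return touched


site = {
    "type": "Folder",
    "items": [
        {"type": "Article", "title": "old"},
        {"type": "Folder", "items": [{"type": "Article", "title": "older"}]},
        {"type": "Article", "title": "new", "tags": ["nosql"]},
    ],
}
```

Calling `upgrade_articles(site)` touches the two old-style articles and leaves the third alone. The uncomfortable part is that nothing in the store tells you whether you found them all.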

The same sort of thing happens in relational systems, obviously, although somewhat less severely since you can always query them. The only real difference I can think of is that in an RDBMS you are guaranteed that all the data in a table adheres to the same schema, even after you apply a change. That's both blessing and curse, of course, and is one of the motivations described in the Dynamo paper. SteveY blogged about it as well (angrily, in the Drunken Blog Rants), talking about his customer service days at Amazon.

The thing is, data has gravity (ht Colin Clark of DarkStar). It distorts its surrounding reality. I have a whole spittle-flecked rant about frameworks that assume my database is empty when I start a project. I don't care if you've poured your heart and soul and several years of your life into your snazzy new framework, if you are just now getting around to dealing with *existing* data, it's a toy. My data has weight. Like gold. It's worth money. I don't let toys touch my data. But holy moly, are there only two points on the Schema line? With, and Without?

So let's talk about redis. It's a really interesting piece of software, implementing a group of data structures that are very useful for a large range of projects. To date, I don't trust it with my data, so I only use it as a caching and synchronization mechanism (Ed: you and everybody else, genius. Everyone talks about the great features it has over and above memcached, not compared to Oracle). And I think it might have a namespace problem. But for a single application I can overlook the stuff that makes my inner DBA cringe, and revel in luscious feature-rich creaminess.

On both my Mac (Homebrew) and Linux (apt) boxes, redis installed without a peep. Type 'redis-server' and you're live. (No password? No security? Inner DBA turns pale). I can go on and on about the features, but that ain't what I'm about.

As I savor my way through another delightful morsel (sorted sets? mmmmmmmm) the thought occurs to me that redis is an ideal prototype system for a large-scale deployment in AWS using SimpleDB and SQS. Of course, redis does quite a bit more than those two services in many areas, and quite a bit less in one small but important one (scale). But if you limit yourself to lists and hashes, you can easily build an application that runs locally against redis and remotely in AWS using the services. Why, you ask? Because you can then develop your application even when you don't have network, and yet easily deploy to an AWS setup if you find yourself needing an unreasonable amount of cpu power and insane scalability.
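A rough sketch of how that might be arranged, assuming the application only ever talks to a tiny interface of list and hash operations. Everything named here (`Store`, `MemoryStore`, `enqueue_signup`) is hypothetical. The in-memory class exists only so the sketch runs with no server. A redis-backed class would map the same four methods onto list and hash commands, and an AWS-backed one onto SQS messages and SimpleDB attributes; both are left out because they need a server or credentials.

```python
from collections import defaultdict, deque
from typing import Optional, Protocol


class Store(Protocol):
    """The small subset of operations the application is allowed to use."""

    def push(self, queue: str, item: str) -> None: ...
    def pop(self, queue: str) -> Optional[str]: ...
    def put(self, key: str, fields: dict) -> None: ...
    def get(self, key: str) -> dict: ...


class MemoryStore:
    """In-process stand-in (an assumption of this sketch, not a real backend)."""

    def __init__(self) -> None:
        self._queues = defaultdict(deque)
        self._hashes = {}

    def push(self, queue: str, item: str) -> None:
        self._queues[queue].append(item)

    def pop(self, queue: str) -> Optional[str]:
        q = self._queues[queue]
        return q.popleft() if q else None  # first in, first out

    def put(self, key: str, fields: dict) -> None:
        self._hashes.setdefault(key, {}).update(fields)

    def get(self, key: str) -> dict:
        return dict(self._hashes.get(key, {}))  # copy, so callers can't mutate


def enqueue_signup(store: Store, user_id: str, email: str) -> None:
    """Application code sees only the Store interface, never a backend."""
    store.put(f"user:{user_id}", {"email": email})
    store.push("signups", user_id)
```

The discipline is in keeping the interface this small. The moment the application reaches for a sorted set, the local and remote backends stop being interchangeable.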

To bring this all back together, how does one take a (delicious, lovely) toolset like redis and apply any rigor? It has no data types, let alone schemas. You could encode object type and version in the tag (Person.v27.id13579), or you could introspect your runtime objects and serialize them natively (e.g. pickles in Python, Zope-style) or in JSON. Or use Google Protocol Buffers and a bit of metadata. You could constrain yourself to hashes for data storage, and employ an envelope technique to encode metadata around your serialized objects. Can I just say it? Bleah. And a cursory reading of Google's F1 presentation suggests there are some people holding a similar opinion. Or at least, it assuages my ego to think so. (More on this idea, deeper support for structured data, in a later post).
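For the record, here is roughly what the tag-plus-envelope idea could look like, as a sketch rather than a recommendation. The field names (`type`, `version`, `body`) and the helper functions are my own invention, and the body is serialized as JSON purely as one of the options above. The returned dict is flat strings, so it could be stored as a single redis hash.

```python
import json


def make_key(type_name: str, version: int, obj_id: int) -> str:
    """Build a key in the Person.v27.id13579 style."""
    return f"{type_name}.v{version}.id{obj_id}"


def wrap(type_name: str, version: int, payload: dict) -> dict:
    """Return flat string fields: metadata plus the serialized object."""
    return {
        "type": type_name,
        "version": str(version),
        "body": json.dumps(payload, sort_keys=True),
    }


def unwrap(fields: dict, expected_type: str, max_version: int) -> dict:
    """Check the envelope before trusting the body."""
    if fields["type"] != expected_type:
        raise ValueError(f"expected {expected_type}, found {fields['type']}")
    if int(fields["version"]) > max_version:
        raise ValueError(f"version {fields['version']} is newer than this code")
    return json.loads(fields["body"])
```

Note who is doing the enforcing: every reader has to remember to call `unwrap`, because the store will happily accept anything. Which is rather the point of the complaint.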

But wait, there's more. MongoDB. CouchDB. Riak and Cassandra and Dynamo. Neo4J. Prophet. A bunch of other stuff I haven't heard of (LevelDB). Clearly, there is a serious need that the relational model doesn't address. Perhaps less clear is the fact that these tools lead to distributed architectures that challenge the notion of System of Record for a particular object/document/id. I'm all in favor of higher scalability and lighter weight to drive the applications, but only if I can eliminate any mismatches in expectations about duplicates and conflicts and transactions (or lack thereof) that the business teams may have floating around.

I'm doing a lot in redis lately, and leaning on Riak for a financial tick store problem I've been working on. The more I learn about the available tools, and talk to people doing real work with heavy data, the less convinced I am that the current crop of NoSQL databases is anywhere near "done". Don't get me wrong, I'm enjoying the mental exercise of trying out different solutions to my problem space, but I'm beginning to formulate a concept about where all these tools are in their lifecycle and what that means to the data-is-gold crowd.

***

Apologies for the lack of useful conclusions. I think this is going somewhere but will take a while to get there. I'm attending the Basho RICON sessions in a few weeks, and really looking forward to some geekage, listening to some of the really big users talk about their experiences, and challenging my strawman tick-store architecture among people who have architecture experience.
