Friday, April 26, 2013

Database roles: BigData, BigUsers, Big...Problems?

Today at lunch we had an interesting discussion about the role of databases in contemporary IT development - and a lot of buzzwords were thrown across the table: Hadoop, NoSQL, BigData, BigUsers (reasonably small data + a high number of concurrent users), ACID compliance, etc. To be fair, a few too many buzzwords - I have a very bad feeling that even contemporary architects have stopped understanding the core concepts behind their work!

Let's start from the key point - we have databases to manage DATA. One of the key elements of this task is to make sure that data is reliably stored and retrieved. And here is the catch - what do we mean by reliable? Or, to be precise - what happens to you/your company/your customers if some piece of data is lost forever? The answer to this question drives the whole technology stack! For example, if you work with medical/legal/official data - a small chunk of lost information (if noticed) could mean litigation at best and people's lives at worst!

Let's be clear - the majority of current NoSQL DB solutions are explicitly not ACID-compliant (or at least not 100% ACID-compliant). For example, I found a pretty good analysis of MongoDB and CouchDB - and it is clear that even their proponents say that there are always trade-offs between performance and data reliability. Some articles even suggest a dual-environment implementation, where you have a NoSQL database for non-critical data plus an RDBMS for critical data.

Just to clarify what I mean by ACID compliance:
  • Atomicity requires that each transaction is executed in its entirety, or fails without any change being applied.
    • I.e. if you have a successful INSERT and a successful DELETE in the same transaction - either both are committed or neither is.
  • Consistency requires that the database only passes from a valid state to the next one, without intermediate points.
    • I.e. it is impossible to catch the database in a state where some rules (for example, a primary key) are not yet enforced for the stored data.
  • Isolation requires that if transactions are executed concurrently, the result is equivalent to their serial execution. A transaction cannot see the partial result of the application of another one.
    • I.e. each transaction works in its own realm until it tries to commit the data.
  • Durability means that the result of a committed transaction is permanent, even if the database crashes immediately afterwards or the power is lost.
    • I.e. it is impossible to have a situation when the application/user thinks the data is committed but after the power failure it is gone. 
As we can see from that list, all of these requirements are technically very challenging to implement, especially with a high number of concurrent users and significant data volumes - that's why Oracle went to extremes with its UNDO/REDO/LOG mechanisms. But that's the price to pay for being sure that if you saved the data - it will NEVER disappear.
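To make the atomicity bullet concrete, here is a minimal sketch in Python using the standard-library sqlite3 module (chosen only because it needs no setup - it is not one of the databases discussed above). The table `items` and the helper `replace_item` are hypothetical names made up for this illustration. The INSERT and DELETE run in one transaction, and if the DELETE does not hit exactly one row, the already-successful INSERT is rolled back as well.

```python
import sqlite3


def replace_item(conn, new_id, old_id):
    """Insert a new row and delete an old one as a single transaction.

    Hypothetical helper: returns True if both changes were committed,
    False if the whole transaction was rolled back.
    """
    try:
        # "with conn" commits if the block succeeds and rolls back on an exception.
        with conn:
            conn.execute("INSERT INTO items (id, name) VALUES (?, ?)", (new_id, "new"))
            cur = conn.execute("DELETE FROM items WHERE id = ?", (old_id,))
            if cur.rowcount != 1:
                # Force a failure so that the INSERT above is undone too.
                raise ValueError("row to delete not found")
        return True
    except (sqlite3.Error, ValueError):
        return False


def ids(conn):
    """Return the ids currently stored in the table, in ascending order."""
    return [row[0] for row in conn.execute("SELECT id FROM items ORDER BY id")]


# An in-memory database is enough to show atomicity (it says nothing about durability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO items (id, name) VALUES (1, 'old')")
conn.commit()

ok = replace_item(conn, 2, 1)       # both statements succeed and are committed
failed = replace_item(conn, 3, 99)  # DELETE finds nothing, so the INSERT of id 3 is rolled back

print(ok, failed, ids(conn))
```

After the second call the row with id 3 is not in the table, even though its INSERT statement itself ran without errors - that is the "both or none" behavior from the list above.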

I understand that there are environments where that small chance of data loss can be, if not ignored, then at least tolerated: we all know that Craigslist uses MongoDB - so, what's the impact of one lost ad? Somebody might get annoyed, but that's all!

But when I start hearing about medical systems being built on NoSQL solutions - I start to get nervous. Maybe in a couple of years, before going to the doctor, I will first check what kind of software they use! Just to feel safer...

1 comment:

Maxym Kharchenko said...

Good overview, thanks! :-)

But, to put in a bit of a defense for NoSQL, here is my 2c:

Of course, ACID is awesome and, given a choice, people would always choose it for precisely the reasons that you mentioned.

However, what many folks probably do not realize is that transactions in "traditional" databases may not be as ACID as they think in the first place.

I.e. the default Oracle isolation level (which, I'm betting, is used almost everywhere) is READ COMMITTED, which means that unless a transaction is 1 statement only, the database engine allows both non-repeatable reads and phantoms ... and if you think about it, this looks very similar to what eventual consistency brings.

Many people are dealing with eventually consistent data daily without even realizing it, and it works ok ... mostly :-)
I.e. you wouldn't think that your bank would be a big fan of eventual consistency (ACID was created for them, right?), but they use it all the time in their ATMs (each ATM is an independent node that is allowed a certain "autonomy" even if it is disconnected from the main network).

Regards,
Maxym Kharchenko