Monday, April 22, 2013

Collaborate'13 Summary. Part 3. BigData confusion

This post is a continuation of my thoughts about the IOUG Collaborate'13 conference (part one, part two). As I mentioned in the preview, the whole question of BigData was on my short-list. First of all, I am still trying to figure out what we are talking about - but since there were a number of talks on the subject, I hoped to hear some words of wisdom from my esteemed colleagues.

3. BigData

If you wonder whether I came back from the conference with definitive answers - sorry, but no. The BigData community is still a very amorphous structure with a lot of ideas/tools/concepts floating around. Unfortunately, some software vendors (let's not name them, please) are also trying to fish in those muddy waters, so some presentations ended up being about 90% marketing / 10% content, which further complicated my attempts to get "the big picture". Still, a few common ideas kept coming up:
  • People should not mix up BigData and NoSQL, but they do! And that's one of the main areas of confusion:
    • BigData (Hadoop+MapReduce) is an extension of good old data mining, except that you can mine much more data, much faster, on much cheaper hardware. It makes life easier because you can first store the data in whatever way/shape/form it comes and try to make use of it later - there is no need to define a structure beforehand.
    • NoSQL is the environment where you can WORK with that (non)(semi)-structured data very efficiently from the very beginning. Minor problem - each existing solution is optimized for a pretty narrow set of tasks. Yes, it is well optimized, but you need to think hard to select the right tool for the right problem.
  • Hadoop is huge as a storage mechanism. Minor problem - as far as I understood, anything below a 50-node cluster is considered a toy box. So you need to have a real issue for it to solve, because starting budgets are in the $300k+ range.
  • NoSQL usually comes in conjunction with a regular RDBMS and rarely as a standalone solution, especially if you care about high reliability/control/audit etc. It is treated as a performance booster on the side - but it can also mask MAJOR problems in the RDBMS. I've heard a number of anecdotes in which IT staff introduced NoSQL for performance reasons, then at some point hired Oracle performance experts - and scrapped the NoSQL solution, because the Oracle RDBMS started to crunch data as fast as was needed. So, if you expect that your bad coders will write good code just because of a technology change - you may not be 100% right.
  • Toolsets are in disarray - and everybody has their own preferences. There is a lo-o-o-o-ng way to go before this environment matures.
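To make the "store first, define the structure later" idea a bit more tangible, here is a toy, single-process sketch of the map/reduce pattern in plain Python. It is not Hadoop code and not anything shown at the conference; the sample records and the names `RAW_RECORDS`, `parse`, `map_phase`, `reduce_phase` and `count_actions` are all made up for illustration. The only point is that the raw lines are kept as they arrived, and a structure is imposed only when they are read.

```python
import json
from collections import defaultdict

# Hypothetical raw records, stored exactly as they arrived (no schema up front).
RAW_RECORDS = [
    '{"user": "alice", "action": "login"}',
    'user=bob action=login',
    '{"user": "alice", "action": "purchase", "amount": 30}',
    'garbage line that fits no known shape',
    'user=carol action=purchase amount=12',
]


def parse(line):
    """Interpret a raw line at read time; return a dict, or None if unreadable."""
    try:
        record = json.loads(line)
        return record if isinstance(record, dict) else None
    except ValueError:
        pass  # not JSON, fall through to the key=value format
    pairs = [token.split("=", 1) for token in line.split() if "=" in token]
    return dict(pairs) if pairs else None


def map_phase(line):
    """Emit an (action, 1) pair for every line we can make sense of."""
    record = parse(line)
    if record and "action" in record:
        yield record["action"], 1


def reduce_phase(pairs):
    """Sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_actions(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return reduce_phase(pairs)


if __name__ == "__main__":
    print(count_actions(RAW_RECORDS))
```

On a real cluster the map and reduce steps would run in parallel across many nodes; here they are ordinary function calls, which is enough to show where the "schema" actually lives (inside `parse`, at read time).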
Summary: the sheer volume of data and the raw speed at which it grows create an enormous market for solutions that can handle them. Currently this market is "up for grabs", because the major RDBMS vendors just think differently and have different priorities, while smaller players are tied to their niche audiences. So, we (IT specialists) need to keep our eyes open - at some point "there can be only one"(c)!
