It's not how big your data is, it's how you use it!
Over the past couple of months I have met and talked to a lot of new and interesting people. Everywhere I go I encounter the same questions about Big Data; it's like some sort of mass hysteria around what is, on the face of it, a simple concept: "volumes of data".
The Same Questions, Everywhere
Example questions:
- "How much data is Big Data?"
- "I'm using NoSQL, is that Big Data?"
- "How many servers do I need to get started with Big Data?"
- "Do I have to use Hadoop to be Big Data?"
After spending many hours explaining my perspective on what 'Big Data' is, I have managed to distill it down to an amazingly simple concept.
A Simple Definition
Quite simply this:
"When you need to process volumes of data at least two orders-of-magnitude greater than what you have today, you are probably doing Big Data."
Please note, I didn't mention Gigabytes, Terabytes or even Petabytes. Nor did I pontificate about how many billions of rows and columns are needed to become 'Big Data'.
When you think about it in terms of "volumes of data", all the complexity falls away.
Three Key Questions
Ask yourself these three simple questions:
- Are my existing tools at their data/performance capacity?
- If I scale up my existing tools, is it going to be too expensive?
- Would being able to process two (or more) orders of magnitude more data give me a competitive advantage?
If you answered 'Yes' to any of the above, you are probably approaching a tipping point where you need to think about your future investment in tools.
A Practical Example
For example, if your business is currently using Excel '97 (yes, really) with its limit of 65,536 rows, ask yourself how you would deal with 6.5 million rows of data.
While this may get you laughed out of Silicon Valley by the 'Big Data Kool-kids', to you these volumes are the very definition of 'Big Data', and chances are you should be looking at MS SQL Server (at least).
Wait a minute, I didn't see any mention of NoSQL or Elephants… you mean MS SQL Server can be used for 'Big Data'?
In this very simple example, yes.
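To make that concrete, below is a minimal sketch (in Python, using the pyodbc driver) of how you might load those 6.5 million rows from a CSV export into a plain SQL Server table and then aggregate them with ordinary SQL. The table name, column layout, file name and connection string are all hypothetical placeholders, not a prescription; the point is simply that no exotic tooling is required at this scale.

```python
# Minimal sketch: loading ~6.5 million CSV rows into SQL Server.
# The 'dbo.sales' table, 'sales_export.csv' file and connection
# string below are hypothetical; adjust for your environment.
import csv
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Reporting;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.fast_executemany = True  # batch inserts instead of one round-trip per row

cursor.execute("""
    IF OBJECT_ID('dbo.sales') IS NULL
        CREATE TABLE dbo.sales (
            sale_date DATE,
            region    NVARCHAR(50),
            amount    DECIMAL(18, 2)
        )
""")

# Stream the CSV in fixed-size batches so memory use stays flat,
# even if the file grows by another order of magnitude.
insert_sql = "INSERT INTO dbo.sales (sale_date, region, amount) VALUES (?, ?, ?)"
with open("sales_export.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == 50_000:
            cursor.executemany(insert_sql, batch)
            batch.clear()
    if batch:
        cursor.executemany(insert_sql, batch)
conn.commit()

# A question Excel '97 could never answer across 6.5 million rows:
cursor.execute("SELECT region, SUM(amount) FROM dbo.sales GROUP BY region")
for region, total in cursor.fetchall():
    print(region, total)
```

For a one-off load you would more likely reach for SQL Server's own BULK INSERT or the bcp utility, but either way the conclusion stands: 6.5 million rows is routine work for a relational database.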
The Real Point
I guess the point I'm trying to get across is this… sure, you could splurge your company's time and money on playing with these awesome shiny new tools, but sooner or later you're going to need to prove that the investment was worthwhile.
It makes far more sense to either:
- pick a problem your company is facing that can't be solved with your existing tools, or
- find an 'edge' or 'advantage' that you could deliver if you had the tools to realize it.
Now that you have a clear objective and a set of requirements, you're ready to start looking for a new and innovative way to deliver something of clear value.
Your bosses will be overjoyed if you can deliver real returns on their investment, and your company stands to benefit as a whole. After all, I believe it was a recent Forbes article that stated: "By 2015, companies that are using 'Big Data' effectively will be 20% ahead of their competitors".
It's not about how big your data is; it's all about how effectively you use it!