Avinash Kaushik gave a great presentation on Big Data at the Strata Conference 2012. It's worth watching!
Avinash also wrote a post on his blog about the presentation. He defines big data as "the collection of massive databases of structured and unstructured data. The data sources include traditional (now considered puny) sources like corporate ERP/CRM systems and non-traditional (massive) sources like every technical ping from every human or mechanical sensor, all web behavior by everyone across the entire Internet, increasingly digital data from analog sources like hospitals or the atmosphere, and (good lord!) our collective tweeted wisdom."
In the post, he listed the Six Rules That Should Govern Your Big Data Existence:
1. Don't buy the hype of big data and throw millions of dollars away. But don't stand still.
Structure your big data efforts, at least initially, to fail faster while failing forward. Don't build the biggest, baddest big data environment over 32 months, only to realize it was your biggest, baddest mistake.
2. Big thinking about what big data should be solving for is supremely important
I can't think of any other time in our lives where we could literally swim endlessly in an ocean of data, without having anything to show for it. Big data is that world. If you don't know where you are going, you will get there and you'll be miserable.
3. The 10/90 rule for magnificent data success still holds true
For every $100 you have available to invest in making smart decisions, invest $10 in tools and vendor services, and invest $90 in big brains (aka people, aka analysis ninjas, aka you!).
4. Shoot for right time data, not real time data
Real time data is almost insane to shoot for because even for the smallest decisions, you'll have to do a lot of analysis first (5 hours), then present it to your superior (1 hour), who will add two bullet items and send it to a team of people (20 hours), who will in turn argue about priorities and how much the data is wrong (16 days), but ultimately come to an agreement because the deadline to make the decision passed 7 days ago (20 seconds), and send the data to the big boss who'll read just the first part of the executive summary (3 days), and decide that the data is telling her something counter to what she has always known works, and she'll make a decision based on her gut feel (5 seconds), and some action will be taken (14 days).
5. "Data quality sucks, just get over it"
Data on the web will never get to 95% clean and it will have big holes and it will be sparse in some areas. We should aim to collect, process and store data as cleanly as humanly possible, but after that we should move on to using the data, because we will still have more data about the web than what God's blessed any other channel with.
6. Eliminating noise is even more important than finding a signal
Thus far in the history of data analysis, the objective for our queries has been trying to find the signal amongst all the noise in the data. That has worked very well. We had clean business questions. The data size was smaller and the data set was more complete and we often knew what we were looking for. Known knowns and known unknowns.