Certification Survey Extra: How much of the data in Big Data is superfluous?
Certification Survey Extra is a series of periodic dispatches that give added insight into the findings of our most recent Certification Survey. These posts contain previously unpublished Certification Survey data.
It’s not for nothing that Big Data is noted for its bigness. Data visualization software company Domo, which issues an annual report about data generation and storage, projects that by next year there will be 1.7 MB of data created every second for every person on Earth. That’s roughly 12.8 billion MB (or 13.4 quadrillion bytes) of new data every second. That’s a lot of data.
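The arithmetic behind those figures checks out; here is a quick back-of-envelope sketch (the 7.5 billion world-population figure is an assumption used to make the numbers line up, not something stated in Domo's report):

```python
# Back-of-envelope check of the projection above.
# Assumption: roughly 7.5 billion people on Earth (not from the report).
MB_PER_PERSON_PER_SECOND = 1.7
WORLD_POPULATION = 7.5e9

new_data_mb_per_second = MB_PER_PERSON_PER_SECOND * WORLD_POPULATION
# Using binary megabytes (1 MB = 1,048,576 bytes) reproduces the byte figure.
new_data_bytes_per_second = new_data_mb_per_second * 1_048_576

print(f"{new_data_mb_per_second:.3g} MB/s")       # ~1.28e10, i.e. 12.8 billion MB
print(f"{new_data_bytes_per_second:.3g} bytes/s")  # ~1.34e16, i.e. 13.4 quadrillion bytes
```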
That’s the total amount of new data that will be created per second, of course, which means it’s not necessarily the amount of new data that will be collected and stored. It’s almost certainly not the amount of new data that will be analyzed, and it’s definitely not the amount of new data that will (someday) be reported in detail.
That all being the case, it’s interesting to ponder exactly how much data, out of all the data that IS collected and stored, could be considered helpful, useful, actionable, and so forth. Will every (literal) bit of the data we collect eventually, in some way, tell us something important? Or are data firms casting a wide net and hauling in a lot of stuff that, strictly speaking, no one is ever going to use for much of anything?
We decided to put the question to the certified Big Data professionals who participated in our recent Big Data Certification survey. How much of the data gathered and stored by companies that work with Big Data is ultimately superfluous? What percentage of Big Data, generally speaking, will never be put to any productive use?
Here’s what we learned:
Q: As a certified Big Data professional, what percentage of the data that most organizations and/or businesses collect and store would you estimate is never put to any productive use by those same organizations?
More than 90 percent of the data is never used — 8 percent
Between 80 and 90 percent of the data is never used — 20 percent
Between 70 and 79 percent of the data is never used — 24 percent
Between 60 and 69 percent of the data is never used — 8 percent
Between 50 and 59 percent of the data is never used — 16 percent
Between 40 and 49 percent of the data is never used — 8 percent
Between 30 and 39 percent of the data is never used — 12 percent
Between 20 and 29 percent of the data is never used — [No responses]
Between 10 and 19 percent of the data is never used — 4 percent
Less than 10 percent of the data is never used — [No responses]
You can tell at a glance that there is a wide range of opinions. At least a small group of certified Big Data professionals points to almost every range along our spectrum.
On the other hand, there is a great deal of belief that a lot of the data in Big Data is more or less digital dead wood, taking up space but not good for much more than that. More than half of those surveyed, in fact, believe that 70 percent or more of the data in Big Data is essentially superfluous.
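That tally is easy to verify from the results above; here is a quick sketch with the response shares transcribed from the list (the bucket labels are shorthand for the ranges in the survey question):

```python
# Survey response shares transcribed from the results above
# ("percent of data never used" bucket -> percent of respondents).
responses = {
    ">90": 8, "80-90": 20, "70-79": 24, "60-69": 8,
    "50-59": 16, "40-49": 8, "30-39": 12, "20-29": 0,
    "10-19": 4, "<10": 0,
}

total = sum(responses.values())  # sanity check: shares should sum to 100
seventy_plus = responses[">90"] + responses["80-90"] + responses["70-79"]

print(total)         # 100
print(seventy_plus)  # 52 -- more than half say 70% or more is never used
```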
One reason for the high level of clutter is that organizations often don’t know what they want to know — or at least they don’t know everything that they want to know — when data is collected. A lot of what Big Data can reveal isn’t discovered until after the fact. You grab everything that’s available, knowing full well beforehand that not all of it will yield anything useful.
Maybe the next big advances in Big Data will provide some relief along those lines. The more that we study existing Big Data caches, the better we may be able to recognize (and avoid, in the future) certain types of information that we don’t really need to collect and store.