You can’t strike it rich collecting and analyzing Big Data without a plan
This feature first appeared in the Summer 2017 issue of Certification Magazine.
The mystique of Big Data is that, given sufficient data and applied mathematics, we can finally unpack the chaos and understand the true effects of our actions (Clarke, 2016). We can identify the butterfly flapping its wings and see how it creates the hurricane.
This promise is especially seductive to those of us working in assessment and certification in high technology, for two reasons. First, both psychometrics and high technology hold a strong reverence for the infallibility of data analysis. Second, unlike those working in educational assessment, we struggle to accumulate hard data when it comes to assessment validation. It is no wonder that many of us are looking toward big data as our savior.
Despite the hype, however, the path to applying big data to your business may not be so simple, and the destination may not be what you expected.
Defining “Big Data”
The term Big Data is so amorphous that even its origins seem lost to history, though it is believed to have entered the lexicon sometime in the 1990s (Lohr, 2013). That initial designation referred to the rapidly growing size (or volume) of data sets, and to the effects of the storage and memory constraints of the time (Spencer, 2016).
Continued refinement of the concept added not just the volume of data, but the speed with which it is generated or grows (velocity), and the many different types (variety) of data being processed (Clarke, 2016). Today, numerous other characteristics have been associated with the term as well (Moorthy et al., 2015).
Although volume, velocity, variety, veracity, and value are the most frequently cited characteristics of big data, even these metrics can be subjective and change over time. For instance, what was once a significant volume of data in the late 1990s might be considered insignificant in the current environment. Or what constitutes big data in one organization might not be considered as such in another. Big data is, somewhat, in the eye of the beholder.
Fortunately, at the end of the day, how or what you define as big data is much less relevant than having the right data (Wessel, 2016). The success of your data analytics, big or small, hinges on this one, universal truth: If it isn’t the right data, having more of it doesn’t help. How do you know you have the right data? That starts with having a plan.
Plan to succeed
Regardless of whether you classify your data as big, little, or somewhere in between, the ultimate value of your analysis depends greatly on having a systematic approach. Even with big data, the challenge isn’t the volume, velocity, or variety, but the ability to create value with that data (Moorthy et al., 2015).
Having a good idea of what you want to achieve beforehand is critical to finding value in your analytics (Bean, 2016; Dawar, 2016; Wessel, 2016). As suggested in The Age of Analytics (Henke et al., 2016), three critical questions can guide value creation:
1) What are you looking for, what value does it provide, and how will it be measured?
2) How are you going to collect, store, and consolidate the data?
3) What do you need to analyze and derive insights from the data?
Looking at each of these critical questions individually offers further illumination.
What are you looking for?
Part of the mystique of big data is the belief that, given sufficiently large data sets and applied mathematics, traditional scientific methods no longer apply (Clarke, 2016). This belief, however, may be why few organizations have been able to demonstrate true big data success (Bean, 2016).
Without following traditional approaches of scientific inquiry (i.e. hypothesize, model, and test), it is difficult to know what you’ve found, or even if you found anything at all. While most big data analysis approaches are exploratory in design, there must be some guide for the exploration.
As an example, our program is looking to explore the relationship between customer service time-to-resolution (TTR) and the level of certification the customer service engineer has obtained. Having a hypothesis about what we expect to see, a plan for conducting the analysis, and a means to evaluate our results directs how we conduct the analysis.
Additionally, it helps us determine the data points we need (instead of looking for a single tree in an overgrown forest), and gives us high confidence that the results will have value to the organization. Without a plan, you can waste many hours aggregating and analyzing data solely in the hope that you may find something of value.
That’s not to say pure big data exploration can’t have value in uncovering new avenues of interest, but it is difficult to justify the cost of the tools and people in the hope of one major breakthrough. The results of less extensive but valuable data investigations can go a long way toward building organizational support for a data-based approach (Spencer, 2016).
How are you going to manage your data?
Most organizations approach the management of big data solely from a storage or processing perspective. In today’s cloud-based environment, however, storage and processing are the easiest aspects of data management.
The more time-consuming and difficult work is sourcing the data you need, combining it, and cleaning it, all of which must happen before any analysis can take place. Furthermore, the decisions made at this point can affect your outcomes significantly (Ransbotham, 2015).
Identifying the sources of the necessary data and getting access to them is not always a simple process. Even if you can identify a data source, questions of privacy can cause unexpected challenges. In our case, to correlate individual customer service engineers with their level of certification, we must uniquely identify individuals and have access to performance metrics that could prove detrimental to those individuals if used without aggregation.
In addition, the data was collected by another internal group, and accessing it required multiple authorizations. Of course, having a plan with a well-defined value proposition can go a long way toward overcoming these challenges and getting cooperation.
Combining data collected for purposes outside your investigation also presents challenges. In today’s technology environment, it is increasingly unlikely data is accompanied by a detailed data dictionary (Clarke, 2016). Without fully understanding the purpose of the data elements and the rules around their collection, combining them with your existing data becomes a guessing game.
In our case, the support data had multiple date-time stamps associated with TTR, and we were unsure which ones marked the start and end of TTR as it applied to our investigation. Further, the data was stored as time stamps, but we needed a single value representing the difference between them. That value had to be calculated rather than simply combined.
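As a rough illustration of that calculation, the sketch below derives a single TTR value from a pair of time stamps. It is written in Python rather than the R used for our project, and the field names opened_at and resolved_at are hypothetical. Which time stamps actually define TTR is exactly the question a data dictionary would need to answer.

```python
from datetime import datetime, timedelta


def time_to_resolution(opened_at: str, resolved_at: str) -> timedelta:
    """Return the elapsed time between two ISO 8601 time stamps.

    The parameter names are assumptions made for this sketch; real support
    data may carry several candidate time stamps for each case.
    """
    opened = datetime.fromisoformat(opened_at)
    resolved = datetime.fromisoformat(resolved_at)
    if resolved < opened:
        # A negative duration signals a data problem, not a fast resolution.
        raise ValueError("resolved_at precedes opened_at")
    return resolved - opened


# Example values invented for illustration only.
ttr = time_to_resolution("2017-03-01T09:00:00", "2017-03-03T15:30:00")
ttr_hours = ttr.total_seconds() / 3600
print(ttr_hours)
```

Raising an error on a negative duration, rather than silently storing it, keeps questionable records visible for the cleaning step that follows.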
Cleaning and normalizing data must also be done. Beyond the common need to ensure that the data is uniform (e.g. not confusing different scales like nominal vs. ordinal or ratio vs. interval) and that incomplete cases are not used, you must also examine the data with a common-sense eye.
During initial descriptive analysis of our data, we discovered cases where the TTR exceeded nine years, and others that existed for less than 30 seconds. While the cases were complete, the question then became whether they represented relevant support interactions or were spurious data. Again, having a plan helps you make appropriate decisions about which data makes sense.
What are you missing?
The last thing integral to your data analysis plan is to determine what pieces are missing to achieve your goals. Do you have access to the appropriate data? Do you have the tools necessary to analyze the data? Do you have access to the people who can do the analysis? Do you have people who can explain the results to executives?
Although there is a reported dearth of data scientists and data translators in the market, having a plan helps you determine what level of skill is necessary. In addition, the rapid growth of the big data ecosystem has resulted in the availability of many on-demand, cloud-based tools that can do the number-crunching without the cost of implementing in-house tools.
There is also increased availability of orthogonal data sources (such as geographical or weather data) that may add dimensions to your analysis either for free or at minimal cost.
For our work, my fledgling use of R (a free, open-source data analysis suite) has proven critical to the management of our project data. By building R scripts documenting and automating how we combine, transform, and clean our data for analysis, we get repeatable results even as updated data becomes available.
Ultimately, if the results prove valuable, we can codify the processing in R as well as build real-time dashboards to display the results. We also leveraged our relationship with our psychometric consultants (who live and breathe data analysis) to help navigate the many decisions required to get us where we need to be.
Evaluating the results
It should be obvious that having a plan is critical to success. What may not be as obvious is that every decision made in that planning process will affect your results (Clarke, 2016; Ransbotham, 2015). In addition, you must be careful not to fall into the trap of confusing correlation with causation. Big data analysis is exploratory in nature, meaning its goal is not to find answers, but to help you discover the right questions. Again, some common sense needs to be applied in interpreting any results.
From the choice of data sources through how and why the data was originally collected, the way data sources are linked, and the way we clean data, every decision changes what our results show (Ransbotham, 2015).
Going back to our experience with support TTR values, we chose to remove values greater than 180 days or less than 15 minutes to limit spurious, irrelevant support cases. While we documented this exclusion, as well as the analysis that led to those choices, it inherently changes the reality that the results purport to explain.
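A minimal sketch of such an exclusion rule follows, in Python rather than our R scripts. The 15-minute and 180-day thresholds come from the choice described above. The record layout (a dictionary with case_id and ttr keys) and the sample cases are assumptions made for illustration. The sketch returns the excluded cases alongside the kept ones so the exclusion itself can be documented.

```python
from datetime import timedelta

# Thresholds from the article: drop cases shorter than 15 minutes
# or longer than 180 days.
MIN_TTR = timedelta(minutes=15)
MAX_TTR = timedelta(days=180)


def split_cases(cases):
    """Split cases into (kept, excluded) by TTR.

    Each case is assumed to be a dict with a 'ttr' timedelta; this layout
    is hypothetical. Values exactly on a threshold are kept.
    """
    kept, excluded = [], []
    for case in cases:
        if MIN_TTR <= case["ttr"] <= MAX_TTR:
            kept.append(case)
        else:
            excluded.append(case)
    return kept, excluded


# Invented sample records, not real support data.
sample = [
    {"case_id": "A", "ttr": timedelta(seconds=20)},
    {"case_id": "B", "ttr": timedelta(hours=6)},
    {"case_id": "C", "ttr": timedelta(days=3300)},
    {"case_id": "D", "ttr": timedelta(minutes=15)},
]
kept, excluded = split_cases(sample)
print([c["case_id"] for c in kept], [c["case_id"] for c in excluded])
```

Keeping the thresholds as named constants in a script makes the decision easy to revisit, and easy to rerun, if later analysis suggests different cutoffs.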
Further, although we haven’t completed our analysis, what if it showed that more credentialed support engineers close cases faster? Does that unequivocally demonstrate the value of credentialing by showing that achieving certification results in greater productivity?
Or does it prove something else entirely? Perhaps that the most productive, skilled engineers are also the ones who achieve (or can achieve) certification, which would demonstrate exam validity? In actuality, it only shows that there is a correlation, or relationship. Determining causation and directionality is another matter entirely.
Which brings us to the final point: exploratory analysis like that used in most big data projects is not for finding truth, but for uncovering possibilities. Correlations may be very useful in many instances — think Amazon or Netflix recommendation engines — but they don’t tell you what the mechanism is or how to influence it.
In our case, if we find a correlation between certification and TTR, we then need to understand whether it shows the value of pushing engineers to be certified, or whether it confirms the validity of our program. At the same time, if no correlation exists, why? Our initial analysis is exactly that — initial; the start, not the end.
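One way the relationship discussed in this section might be measured is with a rank correlation, which suits an ordinal variable such as certification level. The Python sketch below implements Spearman's rank correlation by hand. Choosing Spearman is an assumption of this sketch, not a description of our analysis, and the sample values are invented for illustration, not results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented values: certification level (ordinal) and mean TTR in hours.
cert_level = [0, 1, 1, 2, 3]
ttr_hours = [50, 40, 45, 30, 20]
rho = spearman(cert_level, ttr_hours)
print(round(rho, 3))
```

A strongly negative coefficient here would say only that higher certification levels tend to accompany shorter resolution times in the sample. It would not say which way the influence runs, or whether a third factor such as experience drives both.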
Method, not magic
Whether our data is classified as big or small, we can create value by applying the tools and techniques of big data. Yet, we must be careful not to fall prey to the siren song of big data as a magical solution — it is not a silver bullet.
The best way to ensure that we create value is to plan for it, and to understand how the choices we make affect the final results. Finally, the long-term value may not even lie in the results themselves. Be prepared to launch another cycle of exploration and discovery, one that begins wherever your results suggest that you look next.
References
Bean, R. (2016). “Just using Big Data isn’t enough anymore.” Harvard Business Review, (2), 1–4. Retrieved from www.hbr.org
Clarke, R. (2016). “Big data, big risks.” Information Systems Journal, 26(1), 77–90. Retrieved from https://doi.org/10.1111/isj.12088
Henke, N., Bughin, J., Chui, M., Manyika, J., Saleh, T., Wiseman, B., & Sethupathy, G. (2016). The age of analytics: Competing in a data-driven world. Retrieved from http://www.mckinsey.com/
Lohr, S. (2013). The origins of “Big Data”: An etymological detective story. Retrieved April 19, 2017, from https://bits.blogs.nytimes.com/2013/02/01
Moorthy, J., Lahiri, R., Biswas, N., Sanyal, D., Ranjan, J., Nanath, K., & Ghosh, P. (2015). “Big Data: Prospects and challenges.” In Vikalpa: The Journal for Decision Makers (Vol. 40, pp. 74–96). Sage India. Retrieved from https://doi.org/10.1177/0256090915575450
Dawar, N. (2016). “Use Big Data to create value for customers, not just target them.” Harvard Business Review, 6–11. Retrieved from https://hbr.org/
Ransbotham, S. (2015). Detecting bias in data analysis. Retrieved April 12, 2017, from http://sloanreview.mit.edu/article/detecting-bias-in-data-analysis/
Spencer, G. A. (2016). “Big Data: More than just big and more than just data.” Frontiers of Health Services Management, 32(4), 27–33. Retrieved from http://library.capella.edu/login?url=http://search.ebscohost.com/login.aspx?direct=true&db=bah&AN=115925884&site=ehost-live&group=alumni
Wessel, M. (2016). “You don’t need Big Data — You need the right data.” Harvard Business Review. Retrieved from www.hbr.org