Mining for Good Data
No less a wit than Oscar Wilde, the Irish master, once said private data was the basis of all modern fortunes. It’s an impressive thought, given that Wilde lived before the database had even been conceived. In fact, he would be dead for decades before the first computers were born, and he would be fodder for college seminars before data mining (sifting and searching data for hidden patterns with predictive power) would truly come into its own.
Today, of course, it’s all the rage, given its usefulness. If you’ve ever bought something on Amazon, you’ve seen the power of data mining in real time. Just buy a book and within days (or even minutes), Amazon will send you an e-mail with other books you might want — books that are more often than not perfectly on point.
Amazon’s not guessing, either. Its suggestions are based on data-mining tools that look at what customers who resemble you — in age, purchase history and more — have bought and not returned.
Data Mining in Action
Let’s say you’re a DSL vendor, a BellSouth or an Earthlink. Your CRM systems tell you a large number of customers (up to 20 percent) are leaving your service for another provider after only a few months. You, of course, want to know which customers are likely to leave before they leave, so you can offer them incentives to stay. So you mine the data and discover that customers with a certain number of support calls (above all, calls that lasted beyond a certain number of minutes) are most apt to leave, especially if they live in high-rent districts and have incomes above $80,000 per year. With more study, you discover men are more likely to leave than women, and men who work from home, whose home and business addresses are the same, are the most likely of all to leave. Voilà! Now you have a model to predict at-risk customers (it’s called a churn model in industry parlance), and you can target those customers with special incentives that you don’t have to waste on people who aren’t likely to leave.
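To make the idea concrete, here is a minimal sketch of such a churn model in Python. It simply counts how many of the risk factors above apply to a customer. The $80,000 income figure comes from the example; every other threshold, the field names and the cutoff are hypothetical placeholders, since a real model would learn them from the data.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the example only says "a certain number" of
# calls and minutes, so these values are placeholders, not findings.
MIN_SUPPORT_CALLS = 3
LONG_CALL_MINUTES = 20
HIGH_INCOME = 80_000  # the one figure the example does state


@dataclass
class Customer:
    support_calls: int
    longest_call_minutes: float
    income: int
    high_rent_district: bool
    is_male: bool
    works_from_home: bool


def churn_score(c: Customer) -> int:
    """Count how many of the example's risk factors apply (0 to 6)."""
    factors = [
        c.support_calls >= MIN_SUPPORT_CALLS,
        c.longest_call_minutes >= LONG_CALL_MINUTES,
        c.income > HIGH_INCOME,
        c.high_rent_district,
        c.is_male,
        c.is_male and c.works_from_home,
    ]
    return sum(factors)


def at_risk(customers, cutoff=4):
    """Return the customers whose score meets a (hypothetical) cutoff."""
    return [c for c in customers if churn_score(c) >= cutoff]
```

Calling at_risk on your customer list returns only the people worth targeting with incentives. A production churn model would typically weight the factors statistically rather than count them equally.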
Or let’s say you’re Citibank, Discover or American Express. Of course you’d like to cut down on fraud, as it hurts both you and your customers. So you sift and sort your data and discover a pattern you’d never seen before: If a credit card is used to buy prescriptions from two or more pharmacies in the space of 60 minutes, and if each purchase is more than $100, there’s a 70 percent chance the card’s owner will report it lost or stolen within a day.
But why wait? Now that you have a predictive model, you can decline the second purchase before it’s completed, and call the cardholder ASAP to notify him or her of the problem.
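As a rough sketch, the pharmacy rule could be checked at purchase time like this. The Purchase record, its field names and the "pharmacy" category label are assumptions made for illustration; the 60-minute window and the $100 floor come from the example. The code only flags the pattern; it does not compute the 70 percent figure, which would come from mining historical data.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Purchase(NamedTuple):
    # Hypothetical record layout, for illustration only.
    card: str
    merchant: str
    category: str
    amount: float
    when: datetime


WINDOW = timedelta(minutes=60)
MIN_AMOUNT = 100.0


def is_suspicious(history, new):
    """True if `new` is a pharmacy purchase over $100 that follows another
    pharmacy purchase over $100 on the same card, at a different pharmacy,
    within the previous 60 minutes."""
    if new.category != "pharmacy" or new.amount <= MIN_AMOUNT:
        return False
    for old in history:
        if (old.card == new.card
                and old.category == "pharmacy"
                and old.merchant != new.merchant
                and old.amount > MIN_AMOUNT
                and timedelta(0) <= new.when - old.when <= WINDOW):
            return True
    return False
```

A card processor would run a check like this before approving the second purchase, then decline it and call the cardholder if it returns True.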
If you’re a grocery chain that keeps tabs on what customers buy through its discount card program, you can mine your data for profit, not just fraud prevention. Over time, you notice that some customers who buy only a few items tend to buy diapers and beer together (a well-worn example in data mining circles). In response, you rearrange your store’s layout to bring the two closer, placing an end cap of high-margin beer at the end of the diaper aisle, something you’d never think of without an expedition into your database.
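The diapers-and-beer pattern is usually measured with three numbers: support (how often the two items appear together), confidence (how often diaper buyers also buy beer) and lift (how much more often than chance). The sketch below computes all three; the basket data is made up purely for illustration.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule
    'customers who buy antecedent also buy consequent'."""
    n = len(baskets)
    has_a = sum(1 for b in baskets if antecedent in b)
    has_c = sum(1 for b in baskets if consequent in b)
    has_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    support = has_both / n
    confidence = has_both / has_a if has_a else 0.0
    # Lift above 1 means the items co-occur more often than chance predicts.
    lift = confidence / (has_c / n) if has_c else 0.0
    return support, confidence, lift


# Made-up toy baskets, not real sales data.
BASKETS = [
    {"diapers", "beer"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

support, confidence, lift = rule_stats(BASKETS, "diapers", "beer")
```

On this toy data the lift comes out slightly above 1. A real grocery chain would look for rules with much higher lift across millions of baskets before moving any shelves.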
The bottom line is simple: Data mining is today’s alchemy. It turns information into assets and data into dollars.
But it’s not for the novice. Data mining involves large numbers of systems and expert technicians to run them. It starts with a data warehouse — a relational database management system (RDBMS) of sufficient size and power to store millions, if not billions, of records. Why so many?
Because the smaller your data set, the less valuable the conclusions derived from it. A model based on 10 customers is not as helpful as one based on 10 million, because larger data sets can yield more accurate results, and anomalies (strange, out-of-place patterns in data) can be studied as objects of value in themselves. It’s hard to define an anomaly in a small data set: patterns of standard behavior emerge only when there’s enough behavior to support the notion of a standard.
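A bit of arithmetic shows the gap. The sketch below estimates the margin of error around a measured rate, such as the 20 percent churn figure from the DSL example. It assumes a simple random sample and the textbook normal approximation at 95 percent confidence, which is itself shaky for a sample as small as 10.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95 percent margin of error for a proportion p
    measured on n records (normal approximation; an assumption here)."""
    return z * math.sqrt(p * (1 - p) / n)


small = margin_of_error(0.20, 10)          # 10 customers
large = margin_of_error(0.20, 10_000_000)  # 10 million customers
```

With 10 customers the estimate is off by roughly 25 percentage points either way, which makes it nearly useless. With 10 million it is off by only a few hundredths of a point.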
The raw input for these systems comes from point-of-sale networks, CRM tools, server and site logs and polling, all of which demand network administrators; Oracle, Siebel or SQL Server experts; webmasters; Perl programmers and hordes of other technicians to maintain.
And you’ll need more than a laptop to sort that data. Most data mining projects use large servers with substantial parallel processing power. Not only must they deal with a huge number of records, but today’s models often take huge sets of factors into account. Consider this: It’s one thing to look at patterns in customers’ age and race but quite another to include their income over time, credit scores and risk factors, past purchase history, education level, profession, sex, number of children, travel patterns, zip codes and a dozen other points, each of which puts more strain on a CPU.
Data mining relies on statistics, high-end math and predictive analysis that can’t always be done with point-and-click ease. As you can guess, there’s no such thing as an out-of-the-box data mining tool (although some vendors claim to offer just that). Systems commonly are built from the ground up, including large point-of-sale and other data trapping methods, powerful RDBMS tools, analytics software, and mapping and data graphing programs that render numbers into pictures to aid in human cognition.
Books have been written on the subject (dozens if not hundreds), and it’s far too large a field to explore here. But there are a few terms you’re likely to run across. First up? A neural network. It’s a type of model based on the structure of the human neuron, meaning it is both nonlinear and trainable; that is, its predictive capacity can be improved over time with the right instruction.
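To show what "trained" means in practice, here is a sketch of a single artificial neuron (a perceptron), the simplest building block of a neural network. It learns the logical AND of two inputs by nudging its weights whenever it gets an answer wrong. Real networks stack many such units and use smoother activation functions; the data set, learning rate and epoch count here are illustrative choices.

```python
def predict(weights, bias, inputs):
    """Step activation: fire (1) only if the weighted sum is positive.
    This threshold is what makes the neuron nonlinear."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train(samples, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: adjust weights in proportion to the error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Toy training set: the logical AND of two inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(AND_SAMPLES)
```

After training, predict(weights, bias, (1, 1)) returns 1 and the other three input pairs return 0. The neuron started out knowing nothing and improved with instruction, which is the whole idea.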
Next? Clustering. It’s a method of grouping data based on similarity, that is, finding like sets of data. You might look for all patients with a certain side effect, then further cluster them by age, weight or sex. Or you might look for doctors who used to be neurologists and became psychiatrists, then examine their prescription data for patterns that non-neurologist psychiatrists don’t exhibit. The goal is to make each set of similar records as different from the next set as possible and derive insight from them.
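One common way to cluster is the k-means method, sketched below. It is only one of many clustering algorithms, chosen here for brevity. The patient records (age, weight in kilograms) are made up, and seeding the clusters with the first k points is a simplification that real tools avoid.

```python
def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point to
    its nearest centroid, then moving each centroid to its cluster's mean."""
    centroids = list(points[:k])  # naive, deterministic starting guess
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # leave an empty cluster's centroid where it is
                centroids[i] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, clusters


# Made-up patients as (age, weight in kg), for illustration only.
PATIENTS = [(25, 60), (70, 85), (27, 62), (72, 88), (24, 58), (68, 90)]

centroids, clusters = kmeans(PATIENTS, k=2)
```

On this toy data the method separates the three younger, lighter patients from the three older, heavier ones without being told those groups exist.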
And here’s an exotic concept for you: genetic algorithms. They mine data stores based on concepts from the science of evolution, including natural selection and mutation. (Darwin might be dying off in our schools, but he’s alive and well in the data mining world.)
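Here is a minimal sketch of selection, crossover and mutation at work. It evolves strings of bits toward a toy goal (as many 1s as possible). In a data mining setting the fitness function would instead score a candidate rule or model against the data. The population size, mutation rate and other settings are arbitrary illustrative choices.

```python
import random

GENES = 20  # length of each bit-string "chromosome" (arbitrary)


def fitness(individual):
    """Toy objective: the number of 1 bits."""
    return sum(individual)


def evolve(pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(GENES)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        # Natural selection: rank by fitness, keep the fitter half as parents.
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        parents = population[: pop_size // 2]
        # Elitism: the current best survives unchanged.
        children = [population[0][:]]
        while len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, GENES)  # single-point crossover
            child = mom[:cut] + dad[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    best_per_generation.append(fitness(best))
    return best, best_per_generation
```

Because the best individual is always carried forward, the best score can only hold steady or improve from one generation to the next, which is survival of the fittest in about 30 lines.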
And data mining, it’s fair to say, is alive and well in today’s enterprise. According to research firm Gartner, the global market for business intelligence alone — a subset of data mining for commercial firms — will grow to $2.9 billion in 2009. That’s a compound annual growth rate of 7.4 percent, which is large enough to put data mining on your radar and large enough to keep it there.
David Garrett is a Web designer and former IT director, as well as the author of “Herding Chickens: Innovative Techniques in Project Management.” He can be reached at firstname.lastname@example.org.
Some Useful Terms
KDD: Knowledge discovery in databases. Another term for “data mining.”
OLAP: Online Analytical Processing. A precursor to data mining and a term that’s often confused with data mining itself. OLAP tools are still a component of many data mining systems.
Algorithm: A set of rules or instructions, including, for instance, mathematical operations, designed to solve a specific problem.
BI: Business intelligence. A field of data mining restricted to the commercial enterprise. (Data mining is also used in noncommercial fields such as pure research or national security.)