By Richard Boire
It’s almost impossible to quantify the amount of data generated daily. You have transaction data, online files, social media data and, more recently, mobility data; the list goes on. Our ability to track and store all of this information opens up great opportunities, but the real challenge is making sense of it all.
That task has fallen to data scientists, who have endeavoured to keep pace. Their toolkit has evolved from simple business rules to segmentation systems, and then to models built to answer specific questions. More recently, machine learning and artificial intelligence (AI) have been added to the data scientist’s toolbox to boost the predictive power of data. But rather than replacing one approach with another, each new advancement builds on and complements the systems that came before.
So, to understand where data science is today and how it’s being applied to things like direct marketing, it helps to understand how we got here.
The early days
The field of data science can be traced back to the early 1950s and the advent of credit card products, as issuers needed effective ways to assess potential loss at the individual customer level. The traditional manual, one-on-one approach to assessing credit risk was no longer sufficient. Fair Isaac (FICO) emerged to meet this demand: it was the first firm to process individual customer records by computer, albeit a rudimentary one, while applying advanced statistical techniques to assess a given customer’s level of risk. These statistical techniques are still in use today.
Direct marketing was the next major industry to take this approach. In the late 1960s, direct marketing techniques tended to focus on the acquisition of lists that would produce a higher volume or likelihood of response. It wasn’t until the 1990s that data scientists started using exhibited behaviours to infer the potential behaviour of current or prospective customers.
In the early days, one-to-one marketing was viewed as an elite business strategy for larger organizations that had the resources to fund and perform the necessary work. With today’s dramatic decreases in the cost of storage and computing power, organizations of all sizes can now target messages to customers at a much lower cost of entry. Still, large companies like Netflix and Amazon continue to push the envelope. Netflix, for instance, began as a direct mail organization, sending DVDs to customers and receiving them back through the mail. Reengineering its business towards digital has reaped tremendous profits, but its success was built on what it had learned as a direct mail company.
At the beginning of my career in the 1980s, I was fortunate to work for leading-edge pioneers like Reader’s Digest and American Express. In those days, mainframe computers were the only technology available to process data. Unlike today’s compact laptops and mobile devices, mainframe computers had to be housed in large, separate rooms. Users would submit their computer programmes, or “jobs,” from terminals to the mainframe, which was a very slow process. Most jobs ran overnight, which meant that an error forced you to start from scratch and wait another day to see the results.
Data science and adaptation
This slow process made it difficult to test and learn from different approaches, especially since people like me were learning the concepts of data science on the fly while simultaneously trying to acquire the requisite computer programming skills. But data scientists came up with an innovative way to deal with this challenge: to experiment more and generate results faster, we started working with smaller volumes of data. Instead of processing all the data, we randomly sampled a small percentage of it. Because the selection was random, the smaller dataset remained representative of the larger files despite containing only a fraction of the records. Random sampling was used in virtually all projects, allowing much faster learning and quicker delivery of solutions.
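To make the idea concrete, here is a minimal sketch of simple random sampling in today’s tooling. The file name, the “spend” column and the 10 percent fraction are illustrative assumptions, not details from the projects described above.

```python
import pandas as pd

# Load the full customer file (hypothetical file and column names).
customers = pd.read_csv("customer_transactions.csv")

# Draw a 10% simple random sample; a fixed seed makes the sample reproducible.
sample = customers.sample(frac=0.10, random_state=42)

# Because selection is random, summary statistics of the sample should
# track the full file closely, e.g. average spend per customer.
print("Full file mean spend:", customers["spend"].mean())
print("Sample mean spend:   ", sample["spend"].mean())
```

The payoff is the same as it was on the mainframe: an analysis or model built on the sample runs in a fraction of the time while behaving much like one built on the full file.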
The PC environment
The advent of the personal computer (PC) removed some of these technological barriers to handling larger volumes of data in a timely manner. Data could now be stored directly on a PC, but storage limits meant random sampling was still necessary. It wasn’t until the 1990s and early 2000s that computer storage and processing power enabled practitioners to develop models using millions of records.
The notion of random samples and sample sizes appeared poised to become a relic of the past, but access to more data and more computing power created new challenges. Even with access to more data than ever, other factors still influence the way it is used. For example, assessing the likelihood of responding to a credit product may only require quarterly data updates, while predicting a mechanical defect in an aircraft requires real-time data. In our world of consumer marketing, information updated at reasonable periodic intervals has been very effective in building data science solutions. Adding greater volumes of data and more frequent updates can add cost and ultimately slow the process down, often for little gain.
Yet the marketing landscape is evolving as marketers turn to push-type campaigns, where an emphasis on customer experience drives real-time engagement with consumers. With mobile technology, push campaigns can tailor their communications to a consumer’s current situation. But it all begins with data.
Big Data and AI
Big Data and AI have been the game changers in allowing marketers to enhance this overall customer experience. Extremely large volumes of data can be processed quickly. For example, marketing chatbots can communicate directly with customers based on their interest in a product. These chatbots are also becoming more sophisticated: some organizations now leverage historical customer data, using machine learning and AI to steer the conversation.
Meanwhile, marketing offers and communications sent to mobile phones can now account for customers’ locations. At the same time, customers’ prior activities can be analyzed, again using machine learning techniques, to enable even better customer experiences.
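As one illustration of what such analysis might look like, the sketch below fits a simple response-propensity model on a handful of hypothetical customers, using made-up behavioural and location features (recent visits, prior purchases, distance to the nearest store); none of these names or numbers come from any real campaign.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features per customer:
# [visits_last_30_days, prior_purchases, distance_to_store_km]
X = np.array([
    [12, 3, 0.5],
    [ 1, 0, 8.2],
    [ 7, 5, 1.1],
    [ 2, 1, 6.4],
])
# 1 = responded to a past mobile offer, 0 = did not.
y = np.array([1, 0, 1, 0])

# A deliberately tiny example; real projects would use far more data,
# richer features and careful validation.
model = LogisticRegression().fit(X, y)

# Score a customer who is currently close to a store.
new_customer = np.array([[5, 2, 0.3]])
print("Probability of response:", model.predict_proba(new_customer)[0, 1])
```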
The need for privacy
One outcome of all these increased capabilities is the need for data governance, as the growing use of data becomes a reality for virtually all businesses. Organizations need both to protect customer data and to respect customers’ right to privacy. Data science practitioners need to be increasingly aware of what data they are using, given what we observed in the Facebook/Cambridge Analytica scandal. Practitioners should adopt a culture and philosophy of analyzing data in a way that serves not only corporations but consumers as well, while respecting consumers’ right to opt in to the use of their data as a tool for improved product delivery.
Evolution rooted in history
One might infer from all these changes that the data science discipline is undergoing huge disruption. In some ways it is, but the discipline still adheres to processes developed decades ago. The real change is the ability to execute tasks much more quickly and with far more automation.
Automation and speed are the expected prerequisites within this new paradigm. Yet practitioners still need to apply the same data science rigour that has been used for decades. Without this foundation, tools and technologies are sometimes treated as solutions instead of as enablers of those solutions. In this brave new world of information, data science must be a core business discipline, where the emphasis is on developing the right data science team rather than on tools and technologies.
Richard Boire is senior vice president, Environics Analytics.