
The Business Case for A/B Testing

January 15, 2020 · blog-post · ab-testing
• Understanding cause and effect can be difficult in today's complex business environment. For instance, if company sales went up or down last quarter, it can be hard to know quantitatively which departments drove the change, or whether it was driven by some external market factor.

• A/B tests are how scientists and empirical researchers measure cause and effect precisely and with explicit uncertainty. Historically they were used in economics and the social sciences, but they have become increasingly easy to apply in the business world -- most notably by digital-first companies like Facebook and Airbnb, but more and more by those in traditional industries as well.

• We walk through a specific example of the Customer Value Management team of a telco operator running an A/B test, and discuss why it measures the effect of a campaign precisely in ways that other methods cannot.

• We conclude with the high-level tools, people, and process considerations for a company moving towards A/B testing.

One of the biggest promises of data science is for business owners and executives to be able to understand the causal relationships and fundamental drivers that underpin their business. Understanding such relationships is especially important when making large-scale decisions, such as significant capital expenditures, marketing campaigns, or product launches, as well as for frequent, small decisions, such as choosing which feature to launch in a weekly team meeting or optimizing return on investment across 10-15 concurrent marketing campaigns. Without clear understanding and data-driven techniques, critical business decisions can be made on bad intuition and lead to missed opportunities.

But how do data scientists and statisticians conclusively answer these questions? What tools and concepts do they use? Are data science techniques like A/B testing limited to Silicon Valley unicorn start-ups, or can they drive incremental revenue and profit in more traditional industries? And what tools are out there for this task, and what people and process considerations should be taken into account when scoping out such a transformation?

This post is intended for businesspeople interested in A/B testing and experimentation. It digs into some of these questions, provides a high-level explanation of the math behind measurement and experimentation, and discusses some real-world applications that we at Thinking Machines are brewing up. A subsequent post, aimed more at data scientists, will go into the statistical and technical nuances that should be considered when using and scaling out A/B testing.

Measurement is hard (and often political)

Almost everyone is familiar with the phrase "correlation does not equal causation." And, of course, this is true. But, in practice, causation is even more complicated than this, because many things can be driving your outcome of interest. Say you're trying to understand last quarter's profits in an eCommerce business. They could be affected by uncountably many things, some within your control, others not -- e.g. your website's improvements, your new mobile app, the marketing team's campaigns, your new product offerings, the economic conditions in the country where most of your customers live, improvement in Internet speeds in Southeast Asia, the proverbial butterfly's wings flapping in Brazil.

In the face of this mess of possible explanations, most of us who need to make important decisions would throw up our hands in exasperation if presented with such a convoluted laundry list. Instead, we often settle on simple, actionable -- but ultimately flawed -- explanations, thinking it's the best we can do: "We ran the marketing campaign and it increased sales," or "The network outage last week is why our top-up is down," knowing full well it could have been many other things.

There are also obvious political implications to any of these conclusions within an organization. In the absence of strong data and methodology, individual departments will often lobby executives for an interpretation that favors their cause (e.g. taking credit, shifting blame), since departments seen as driving business outcomes are rewarded with more resources (budget, headcount, promotions, etc.).

While there will never be a grand unifying theory to explain the complicated and complex dynamics within each company or industry, there is a better way than these incomplete, often politically laden, "before and after" explanations that ignore the many interrelated factors driving business outcomes. Before we introduce our proposed solution, we'd like to borrow some lessons from another field that has laid much of the groundwork for measuring effects in a messy empirical world -- the so-called "dismal science," economics.

Economists, the kings and queens of empirical measurement

The codification of the scientific method enabled huge subsequent discoveries. One of its most important aspects is reproducibility: if you run the same experiment twice (e.g. drop a ball) under the same conditions (same height, same gravitational force, same wind resistance), you will get the same result (the time for the ball to reach the ground). From this basis, theories of motion and gravity that explain (or predict) outcomes in the system can be measured, refined, and iterated on. Unfortunately, for some of the empirical questions we care about, we can't simply run the experiment again. For instance, if we wanted to understand the causes and consequences of the 2008 financial crisis, we can't simply "re-run" it.

However, this experimental ideal laid much of the groundwork for the concept of a natural experiment, which became a critical tool in economics, a field that attempts to answer exactly these kinds of empirical questions (e.g. should the Fed lower the interest rate, what is the effect of FHA-backed mortgages on the housing market, etc.). A natural experiment, simply defined, is a situation in which the subjects you are studying (for instance, the California economy) were roughly randomly exposed to the effect of interest (e.g. minimum wage laws). In fact, some canonical findings in economics on the employment effects of minimum wage laws are based on exactly this set-up: studying two counties that border one another and share roughly the same underlying economic, business, and demographic characteristics, where one experiences a change in minimum wage laws and the other does not.

There is an obvious problem with this approach -- we never really know whether our treatment of interest is truly being applied randomly, or whether there is some self-selection (or, to use the technical term, endogeneity) in it. A simple example of why this matters: imagine you measure the life expectancies of two groups of people -- one that goes to a hospital, and one that never does. You will almost certainly observe that those who go to a hospital have a lower average lifespan. Can you conclude that hospitals therefore cause lower life expectancy? No -- most likely, those going to the hospital are going because they are sick, whereas those who don't go are, on average, a much healthier bunch. It is this higher average sickness, not the hospital visit itself, that causes the lower life expectancy.
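This selection effect can be made concrete with a tiny, purely hypothetical simulation (all numbers are invented for illustration): sickness both drives people to the hospital and shortens lifespan, so the hospital group looks worse even though the hospital itself helps.

```python
import random

random.seed(0)

# Hypothetical population: sickness drives BOTH hospital visits
# and lower life expectancy (the classic confounder).
population = []
for _ in range(100_000):
    sick = random.random() < 0.2                       # 20% of people are sick
    visits_hospital = sick and random.random() < 0.9   # mostly the sick visit
    base = 60 if sick else 80                          # sickness lowers lifespan
    bonus = 5 if visits_hospital else 0                # the hospital actually HELPS
    lifespan = base + bonus + random.gauss(0, 5)
    population.append((visits_hospital, lifespan))

def mean_lifespan(went_to_hospital):
    vals = [life for visited, life in population if visited == went_to_hospital]
    return sum(vals) / len(vals)

# Even though treatment adds +5 years, the hospital group averages ~65
# while the non-hospital group averages ~79, because the hospital group
# is dominated by sick people.
print(mean_lifespan(True), mean_lifespan(False))
```

The naive "before and after" comparison points in exactly the wrong direction, which is why randomized assignment, discussed next, matters so much.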

Inevitably, findings based on natural experiments or effects teased out through econometric methods, while powerful tools in their own right, will be subject to some uncertainty. What if we could do better?

A standard, successful experiment

In the modern setting, we can, for many applications and areas of interest, create our own experiments rather than merely passively receiving observational data -- and this fundamental difference enables us to measure causal effects precisely. Let's go through a real-world example of how a traditional business could use experimentation to better understand the effectiveness of its business activities.

Let's say you work in Customer Value Management at the prepaid brand of a mobile telecom operator. Whilst Sales is focused on increasing the number of your SIMs in phones, and Marketing's goal is to curate the perception of your company and its services, your goal is a deeply quantitative one: to increase the average spend of your customers. This can be accomplished in a variety of ways, but one of the most popular is customized notifications (via SMS or digital channels), e.g. when a subscriber hits their one-year anniversary, becomes eligible for a new discounted product, or perhaps immediately after they purchase a certain package.

Let's say that the Product team has come to you to help them stimulate purchases of a product, DATA10, and they ask whether a notification campaign launched to the entire subscriber base would increase or decrease average customer spend. Rather than guess, you decide to run a pilot A/B test. Of your total base of 10 million subscribers, you randomly select 10,000, notify them of DATA10 and its inclusions and benefits, and monitor in the subsequent weeks the purchase rate of both the treatment group (the 10,000 who received the notification) and the control group (the 9,990,000 who did not), knowing that any difference between these groups is caused specifically by the notification campaign itself, not by any outside factor.
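As a minimal sketch of the assignment step (the subscriber IDs and toy sizes here are hypothetical), the random split can be done with Python's standard library:

```python
import random

def assign_treatment(subscriber_ids, treatment_size, seed=42):
    """Randomly split a subscriber base into treatment and control groups.

    Randomization makes the two groups statistically comparable, so any
    later difference in purchase rate can be attributed to the notification
    itself rather than to pre-existing differences between the groups.
    """
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    treatment = set(rng.sample(subscriber_ids, treatment_size))
    control = [s for s in subscriber_ids if s not in treatment]
    return sorted(treatment), control

# Toy example: 1,000 subscribers with 10 treated
# (the real campaign would use 10,000,000 and 10,000).
subscribers = list(range(1_000))
treated, control = assign_treatment(subscribers, 10)
```

Fixing the seed is a practical touch: it lets you re-derive exactly who was in which group when analyzing results weeks later.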

Let's say we find that 7.5% of customers in the treatment group purchased DATA10, compared to 6.0% in the control group. Typically, the next step of the experimental evaluation would be performed by a statistician or data scientist, who would test the statistical significance of the observed difference by running a two-sample t-test, for instance in R, a popular language for statistical analysis.
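As a sketch of what that evaluation looks like -- written here in Python rather than R, and using a two-proportion z-test, the large-sample counterpart of the t-test for comparing rates -- the computation runs as follows (all figures taken from the example above):

```python
from math import sqrt, erfc

# Observed data from the example above
n_treat, x_treat = 10_000, 750         # treatment: 7.5% purchased
n_ctrl, x_ctrl = 9_990_000, 599_400    # control:   6.0% purchased

p_treat, p_ctrl = x_treat / n_treat, x_ctrl / n_ctrl
diff = p_treat - p_ctrl

# Pooled proportion under the null hypothesis that the groups are identical
p_pool = (x_treat + x_ctrl) / (n_treat + n_ctrl)
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_treat + 1 / n_ctrl))
z = diff / se_pool
p_value = erfc(abs(z) / sqrt(2))  # two-sided p-value from the normal tail

# 95% confidence interval for the difference (unpooled standard error)
se = sqrt(p_treat * (1 - p_treat) / n_treat + p_ctrl * (1 - p_ctrl) / n_ctrl)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"z = {z:.2f}, p = {p_value:.2e}")
print(f"95% CI for the lift: [{ci_low:+.4f}, {ci_high:+.4f}]")
```

This yields a p-value on the order of 10^-10 and a confidence interval of roughly +1.0 to +2.0 percentage points; small differences from R's t.test output are expected since the t-test uses a slightly different approximation, but the conclusion is the same.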

In this case, we can conclude that the notification campaign did in fact drive an increased purchase rate of DATA10: the p-value of 3.15 * 10^-10 means the chance of observing a difference this large if both groups came from the same underlying distribution is vanishingly small -- this is the concept of statistical significance. From the 95% confidence interval, the overall effect appears to be somewhere between +1.0 and +2.0 percentage points on the purchase rate or, in relative terms, a +15% to +33% higher purchase rate of our promo!

In a nutshell, that's it. We have now rigorously shown and quantified that this specific notification campaign -- not any other factor! -- drove your customers to purchase DATA10 at a higher rate. Of course, in practice the investigation rarely ends here -- it's common to want to know more: Why did the notification campaign work in the first place? Would it have performed better through other channels? Was the effect concentrated in certain regions of our subscriber base, or along other dimensional attributes? But, in its simplest form, this is the lifecycle of a typical, successful experiment.

Tools, People, Process

Thinking Machines is just starting to explore applications in this area, and we wanted to share some of our findings and current thinking. Our perspective is that the real-world applications of experimentation are numerous, especially as traditional companies, industries, and organizations move towards digital products and interfaces, where experimentation is easy and cost-effective to implement. The same holds in non-digital contexts where precise measurement is critical, such as making large capital expenditure decisions or optimizing customer acquisition costs.

In the market today, San Francisco-based Optimizely was an early mover in the space and provides an easy-to-integrate platform to run and evaluate web and mobile experiments to improve KPIs, conversion flows, and customer experience that doesn't require any custom development. More recently, Google has added A/B testing as a standard feature in its Google Analytics 360 suite of products, called Optimize. In the offline experimentation space, Mastercard purchased Applied Predictive Technologies in 2015 for $600mm, which had developed a retail-focused Test and Learn framework that enables retailers to, for instance, run experiments on new items on their menu, pricing, and branch openings/closures.

While many marketing and product teams transition seamlessly to an A/B testing-based approach, there are challenges common to any process change that are worth knowing ahead of time. The biggest one specific to A/B testing is likely that individual departments used to being measured against soft or loosely defined targets may now have their yearly performance determined, in full or in part, by precisely and quantitatively measured results. For instance, rather than using market research opinion indices, we may now partly measure the marketing team on how many incremental sign-ups or sales their digital campaigns drove. Further, it can be counter-intuitive and difficult to discover that many of the initiatives that took most of a team's effort in a quarter were not the ones offering the most business value. However, this tension is exactly the promise of A/B testing in the first place -- the ability to evaluate return on investment or effort precisely -- and we believe it ultimately leads to better decision-making and overall performance.

Of course, experimentation isn't appropriate for every field, for instance, high-reliability contexts such as aerospace or civil engineering, and, as with any data science or machine learning methodology, it should be deployed ethically and in compliance with relevant data privacy laws and regulations.

Closing Thoughts

We hope this post has provided a conceptual overview of the benefits and considerations that A/B testing offers to businesses, even in traditional or semi-digital contexts. If you or your organization is interested in working with us on experimentation or any other data-related topic, feel free to reach out. Or if you're an aspiring data scientist who wants to get even deeper into the weeds behind methodologies like this, please do apply via our careers page and keep an eye out for part 2 of this series that will be coming out shortly!

