Advanced Statistical Concepts for A/B Testing and Experimentation
This is the second in our two-part series on A/B testing. The first part was primarily intended for an executive audience and discussed the business case and considerations for adopting A/B testing in an organization. Now, we tackle some of the nitty-gritty, advanced statistical concepts that must be addressed by statisticians and data scientists when adopting A/B testing in practice.
We last left our story on A/B testing with a standard, successful experiment by our Customer Value Management (CVM) department. Before we dig into some of the subtleties and nuances of A/B testing as a whole, we hope the especially curious among you have already spotted the potential pitfall we need to discuss: what would have happened if the result had been non-significant? What information would we have been able to glean, or not glean, from such a non-result?
Non-results are results too.
It is possible that the results of our experiment would have been inconclusive, neither positive nor negative. This could be for one of two reasons: either the campaign truly didn't affect the purchase rate of DATA10, or there wasn't enough data to capture its effect. There is a common misconception that a non-significant result is evidence that there is no effect. However, this is only true if you have enough data -- or, equivalently, a sufficiently powered experiment.
Let's build some intuition for why this is the case. Say you have a new experimental therapy for treating sunburn and, among a population of 6 individuals who have given their informed consent, you randomly give 3 of them the therapy and the other 3 a placebo. In both the treatment and control groups, 1 in 3 recovers from the sunburn successfully. Can you conclude that the therapy has no effect? Most likely not, because you hardly have any data! If, however, you were to run the same comparison on groups of 30,000 each and observed that exactly 10,000 in each group recovered, you would have very strong evidence that the treatment has no meaningful effect. But then, how much is enough?
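In fact, with only 3 subjects per group, no outcome at all could have reached significance. We can check this with a Fisher exact test on the most extreme result possible: all 3 treated individuals recover and none of the controls do. The sketch below implements the test in Python with only the standard library (the original article's tooling is R, so this is an illustrative stand-in):

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    row2 = n - row1
    # Hypergeometric probability of a table with 'k' in the top-left cell,
    # holding the row and column totals fixed.
    def p_table(k):
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)
    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum probabilities of every table at least as extreme as the observed one.
    return sum(p_table(k) for k in range(lo, hi + 1) if p_table(k) <= p_obs + 1e-12)

# Most extreme possible outcome with 3 subjects per arm:
# 3/3 recover on treatment, 0/3 on placebo.
p = fisher_exact_two_sided(3, 0, 0, 3)
print(round(p, 3))  # 0.1 -- not significant at the 5% level
```

Even the largest effect the data could possibly show yields p = 0.10, so a non-significant result here tells us essentially nothing about whether the therapy works.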
Statistical power, and how much is enough
In analyzing experimental results, you can be wrong in one of two ways: false positives (you claim an effect exists when it doesn't) and false negatives (you claim no effect exists when it in fact does). This is typically illustrated in a confusion matrix:

|                         | Effect exists            | Effect doesn't exist     |
|-------------------------|--------------------------|--------------------------|
| Result significant      | True positive            | False positive (Type I)  |
| Result not significant  | False negative (Type II) | True negative            |
Earlier, in analyzing our CVM notification experiment, we introduced the concept of statistical significance, which is linked to Type I error (false positives). Namely, we set a threshold for the Type I error rate we are comfortable with (a threshold of 0% is impossible due to randomness); conventionally, 5% is chosen (this value is also called alpha).
To answer our question -- how much data is enough for a non-significant result to count as evidence that an effect doesn't exist -- we must consider Type II error, or false negatives. The analogue of statistical significance here is statistical power: the probability that we detect an effect given that it exists, or equivalently, 1 - Type II error rate. Typically, we set the acceptable Type II error rate at 20%, or equivalently, run experiments with 80% power. Note that we guard against false positives more strictly than false negatives, because we can always repeat an experiment to catch an effect we missed.
Finally, we can answer our original question: how many experimental subjects do we need to run an experiment properly, or, equivalently, how do we know whether a non-significant result is evidence of a non-existent effect? The magical function power.t.test in R does everything we want. If you pass it all but one of the following parameters -- the number of samples, the standard deviation of the outcome of interest, the minimum detectable treatment effect, and the Type I/II error rates -- it will solve for the excluded parameter.
Generally, we will need more samples (i.e., more data) when we have:
- a higher standard deviation of our outcome of interest
- a lower minimum detectable treatment effect
- stricter (lower) Type I/II error rates, typically set at 5% and 20% respectively
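R's power.t.test solves for any of these directly; the same calculation can be sketched in Python using the standard normal approximation to the t distribution (which is accurate at the sample sizes A/B tests typically involve). The standard deviation and minimum detectable effect below are illustrative assumptions, not values from our experiment:

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(sd, mde, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sample comparison.

    Normal approximation: n = 2 * (sd * (z_crit + z_power) / mde)^2
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)  # two-sided 5% -> ~1.96
    z_power = z.inv_cdf(power)         # 80% power -> ~0.84
    return ceil(2 * (sd * (z_crit + z_power) / mde) ** 2)

# Illustrative inputs: sd = 1, minimum detectable effect = 0.2
print(samples_per_group(sd=1.0, mde=0.2))  # 393 per group (power.t.test gives ~393.4)

# Halving the detectable effect quadruples the required sample:
print(samples_per_group(sd=1.0, mde=0.1))  # 1570
```

Note how the formula encodes the bullet points above: the required sample grows with the square of the standard deviation and shrinks with the square of the minimum detectable effect.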
The multiple comparisons problem
When doing many experiments, or measuring many metrics within one experiment, a common issue arises: the multiple comparisons problem. If the likelihood of a false positive for a single comparison is 5% -- our Type I error rate discussed above -- then as you perform more experiments or compare more metrics, the likelihood that you will have at least one false positive approaches 1. In closed form, the probability of at least one false positive among n comparisons is 1 - (1 - 0.05)^n, which approaches 1 quickly as n grows. For instance, if you were to find 14 statistically significant results at a 5% Type I error rate, it is more likely than not that at least one of them is incorrect (to check: evaluate the above expression at n = 14). A well-illustrated example of this issue is the Significant xkcd comic, in which jelly beans are falsely linked to acne when experimenters repeatedly test acne's relationship with each different color of jelly bean, in hopes of finding a single positive result (the comic implies it takes 20 experiments to find a false result; 20 is indeed the expected number, but after just 14 experiments it is already more likely than not that at least one false positive has occurred, as discussed above).
There are ways to deal with this, most commonly by adjusting the p-value threshold used to determine significance; the Bonferroni correction, for example, decreases your threshold as you do more comparisons. However, in practice, the Type I error rate of 5% is deeply embedded in practitioners' minds, and the number of comparisons you're doing isn't always a static, explicit number. Instead, it's important to be mindful of this issue, handle it on a case-by-case basis, and consider re-running experiments that conclude with barely significant results.
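As a sketch of how such a correction works: Bonferroni simply divides the significance threshold by the number of comparisons, which keeps the family-wise Type I error rate at or below alpha. The p-values below are made up for illustration:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant under the Bonferroni correction."""
    threshold = alpha / len(p_values)  # e.g. 0.05 / 5 = 0.01 for 5 comparisons
    return [p <= threshold for p in p_values]

# Illustrative p-values from, say, five metrics in one experiment.
print(bonferroni_significant([0.003, 0.04, 0.009, 0.20, 0.012]))
# [True, False, True, False, False]
```

Note that 0.04 and 0.012 would have counted as significant at the naive 5% threshold but no longer do once the correction accounts for the five comparisons.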
The Winner's Curse
Another issue that can arise when running multiple experiments is the Winner's Curse. Namely, each treatment effect you observe and claim in practice is itself subject to self-selection: if a product or marketing team performs many experiments over a long period of time, it will likely -- through no fault or bad intent of its own -- overstate the cumulative impact of those experiments, because of this self-selection towards claiming positive results and disregarding negative ones. Fortunately, there exists a closed-form bias correction for such self-selection that adds appropriate uncertainty to cumulatively calculated experimental results.
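To see the curse in action, here is a small simulation (the true effect, standard error, and selection rule are illustrative assumptions, not values from any real campaign): many noisy experiments share the same modest true lift, a team keeps only the ones that cross the significance bar, and the average reported lift of the "winners" overstates the truth.

```python
import random
from statistics import mean

random.seed(42)

TRUE_EFFECT = 0.10  # the real lift every experiment has
STD_ERROR = 0.10    # noise in each experiment's estimate
Z_CRIT = 1.96       # two-sided 5% significance bar

# Each experiment reports a noisy estimate of the same true effect.
estimates = [random.gauss(TRUE_EFFECT, STD_ERROR) for _ in range(100_000)]

# The team only "claims" experiments whose estimate is significantly positive.
winners = [est for est in estimates if est / STD_ERROR > Z_CRIT]

print(f"true effect:          {TRUE_EFFECT:.3f}")
print(f"mean over all runs:   {mean(estimates):.3f}")  # unbiased, ~0.10
print(f"mean of winners only: {mean(winners):.3f}")    # inflated, ~0.25
```

Averaging over every experiment recovers the true lift, but averaging only the claimed winners roughly doubles it -- which is precisely the bias the correction above is meant to undo.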
It's always difficult to tease out true effects, especially with messy, real-world data, but we hope that understanding some of these advanced statistical concepts will help aspiring data scientists and organizations better measure their performance. If you or your company is interested in working with us on experimentation or any other data-related topic, feel free to reach out. And if you're an aspiring data scientist who wants to get even deeper into the weeds behind methodologies like this, please do apply via our careers page!