At PBworks, we take our data seriously. So it should be no surprise to learn that we use A/B testing techniques to aid our product and website development decisions. Having a web-based product means that we can quickly learn what our customers like and what they don’t like and make changes accordingly. If you’re not familiar with A/B testing, Avinash Kaushik has a great primer.
Analyzing Test Results
As the data analyst here, I can reduce an A/B test to just two simple numbers: (1) the difference in conversion rate between the test group and the control group and (2) the level of confidence we have in that difference. The first number is easy to calculate and explain to the rest of the team, e.g. “The test site resulted in 30% more sign ups than the current site.” Everyone gets that: engineers, marketers, and managers. As an example, here is how one of our recent website experiments played out over a 2-week period:
In the chart, each day shows the cumulative conversion rate (i.e. total sign ups since the beginning of the test divided by the total visitors since the beginning) for the test site (Test) and the current site (Control). Notice how well the test site is outperforming the current site.
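For readers who want to reproduce the lines in a chart like this, here is a minimal sketch of the cumulative calculation in Python. The function name and the daily counts are made up for illustration; they are not our real numbers or our actual tooling.

```python
from itertools import accumulate


def cumulative_conversion_rates(daily_visitors, daily_signups):
    """Return one cumulative conversion rate per day of the test.

    Each value is total sign ups so far divided by total visitors so far.
    Assumes the first day has at least one visitor.
    """
    total_visitors = accumulate(daily_visitors)
    total_signups = accumulate(daily_signups)
    return [s / v for s, v in zip(total_signups, total_visitors)]


# Hypothetical daily counts, for illustration only.
visitors = [500, 450, 520]
signups = [5, 9, 4]
rates = cumulative_conversion_rates(visitors, signups)
```

Running the same function once on the Test counts and once on the Control counts gives the two lines to plot.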
However, anyone who’s played games of chance can tell you that numbers that look good on this turn may not be so hot on the next. For example, if you flipped a quarter 5 times and it came up heads 4 times, would you feel confident betting that the coin is biased towards heads? What if you flipped 80 heads out of 100 tosses? At this point, you’d be much more confident that the coin is biased towards heads. In our A/B test, we measure the conversion rate for a small subset of all visitors, let’s say 10,000 visitors with 100 sign ups. Do we believe that this conversion rate will be the same for the millions of visitors we expect in the months to come? Do we need to test 1,000,000 visitors to be confident that the observed increase will apply to all visitors and was not just the luck of the draw?
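To put rough numbers on the coin example, the sketch below computes how often a perfectly fair coin would produce a result at least that lopsided. It uses the exact binomial formula; the function name is just an illustrative choice.

```python
from math import comb


def prob_at_least(heads, tosses, p=0.5):
    """Probability of seeing `heads` or more heads in `tosses` flips
    of a coin that lands heads with probability `p` (fair by default)."""
    return sum(
        comb(tosses, k) * p**k * (1 - p) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )


# 4 or more heads in 5 flips of a fair coin
p_small = prob_at_least(4, 5)

# 80 or more heads in 100 flips of a fair coin
p_large = prob_at_least(80, 100)
```

A fair coin gives 4 or more heads in 5 flips almost one time in five, so that result says very little. It gives 80 or more heads in 100 flips far less than one time in a million, which is why the larger sample is so much more convincing.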
Statisticians have figured out a way to calculate a number representing our confidence that the population (i.e. the millions of visitors our site will see in the future) will show an increased conversion rate if the sample (i.e. the thousands of visitors that have hit the test site so far) shows an increase. Though we have this reliable, albeit complex, formula for the confidence number (using a 2-proportion z-test, or an online calculator), explaining what this number means to the rest of my team hasn’t always been easy. How would you interpret: “We saw a 30% increase in sign ups and we’re only 90% confident there is an increase”? Loosely, it means that if the test site were really no better than the current site and we ran this test 100 times, we’d expect to see an increase this large in only about 10 cases. For some organizations, this would be enough confidence to make the test site the actual site for everyone; for others, it wouldn’t. The decision of what confidence level to use comes down to a trade-off between speed and certainty.
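For the curious, here is a minimal sketch of a 2-proportion z-test in Python, using only the standard library. It assumes a one-sided test with a pooled standard error and reports the confidence as one minus the one-sided p-value; that is one common convention, and an online calculator may define things slightly differently. The function names and the visitor counts are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def confidence_of_increase(control_visitors, control_signups,
                           test_visitors, test_signups):
    """One-sided confidence that the test site out-converts the control.

    Assumption: pooled two-proportion z-test, confidence = 1 - p-value.
    """
    p_control = control_signups / control_visitors
    p_test = test_signups / test_visitors
    # Pooled rate: what both sites would convert at if there were no difference.
    pooled = (control_signups + test_signups) / (control_visitors + test_visitors)
    std_err = sqrt(pooled * (1 - pooled)
                   * (1 / control_visitors + 1 / test_visitors))
    z = (p_test - p_control) / std_err
    return normal_cdf(z)


# Hypothetical counts: 100 vs. 130 sign ups (30% more) on 10,000 visitors each.
conf = confidence_of_increase(10_000, 100, 10_000, 130)
```

With these made-up counts the confidence comes out at roughly 0.98. The same 30% lift on a tenth of the traffic would give a noticeably lower number, which is the whole reason the second number matters.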
Unlike coin flipping, though, recreating the experiment over and over again would take too long and negate most of the gains we expect from A/B testing. So it is difficult for some to internalize what this confidence level represents. Many people, especially those who are risk-averse, don’t like dealing with probabilities and will keep asking for more data. But you’ll never be 100% certain that the test site converts better than the current site. So at some point you need to stop collecting data and make a decision.
What I’ve found to be a useful aid in getting many of the risk-averse types to accept some risk has been to overlay confidence areas in the time series chart like so:
My team has dubbed this a “Sunrise Chart” (yeah, I’ve never seen a green sky during a sunrise either, but you get the picture). The solid black line and dashed blue line are the same as in the previous chart, and the colored bands represent confidence levels. If the test line veers into the green area, we have a 90% level of confidence that the test site out-converts the current site.
Many of the less technically inclined members of my team find that this chart makes sense on a more intuitive level than a statement like: “We saw a 30% increase in sign ups and we’re 90% confident there is an increase.” The chart shows this same information, but it also shows two other things. First, the random day-to-day fluctuations in conversion rate average out and the rates stabilize over time. When people see more stable conversion rates, they are more inclined to feel confident in the difference they see. Second, this chart shows that as we collect more data over time, a smaller and smaller increase is needed to reach a specific confidence level. This is essentially the same piece of information as seeing the conversion rates stabilize, but since these confidence bands are generated from a complex mathematical formula, it gives some peace of mind that the underlying math jibes with their gut.
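One way to draw the lower edge of a confidence band is sketched below in Python. This is a simplified illustration rather than the exact formula behind our charts: it assumes a one-sided z-test and approximates the standard error using the control site’s conversion rate alone. The function name, the 1% control rate, and the visitor counts are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def band_lower_edge(control_rate, control_visitors, test_visitors,
                    confidence=0.90):
    """Smallest test conversion rate that reaches `confidence`.

    Simplifying assumption: the standard error is computed from the
    control rate only, as if both sites converted at that rate.
    """
    z = NormalDist().inv_cdf(confidence)  # about 1.28 for 90%
    std_err = sqrt(control_rate * (1 - control_rate)
                   * (1 / control_visitors + 1 / test_visitors))
    return control_rate + z * std_err


# Hypothetical cumulative visitors per site at four points in the test,
# with the control site converting at a steady 1%.
cumulative_visitors = [500, 1000, 2000, 4000]
edges = [band_lower_edge(0.01, n, n) for n in cumulative_visitors]
```

Each value in `edges` is lower than the one before it, so the band creeps down towards the control line as visitors accumulate. That is the narrowing shape the chart shows: the more data we have, the smaller the increase needed to land in the green.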
To wrap things up, at PBworks we believe that A/B testing is an important tool for developing the most relevant software for our customers. However, when experimenting, it is not enough to simply compare the conversion rates of the test site with the current site. We want some level of certainty that if we do see an increase, it is not simply due to a lucky draw. That is where confidence levels come into play. Finally, and perhaps most importantly, it’s not enough for just the technically inclined to “get it” with a statistical analysis of the results. Rather, the whole team needs to be on board with the decisions that result from the experiment, so everyone needs to be comfortable with the analysis. This is where Sunrise Charts can be a valuable aid.