A/B Testing at PBworks

At PBworks, we take our data seriously. So it should be no surprise to learn that we use A/B testing techniques to aid our product and website development decisions. Having a web-based product means that we can quickly learn what our customers like and what they don’t like and make changes accordingly. If you’re not familiar with A/B testing, Avinash Kaushik has a great primer.

Analyzing Test Results
As the data analyst here, an A/B test for me can be reduced to just a few simple numbers. Those would be: (1) the difference in conversion rate between the test group and the control group and (2) the level of confidence we have in that difference. The first number is easy to calculate and explain to the rest of the team, e.g. “The test site resulted in 30% more sign ups than the current site.” Everyone gets that: engineers, marketers, and managers. As an example, here is how one of our recent website experiments played out over a two-week period:

[Chart: cumulative conversion rate by day for the test site (Test) and the current site (Control)]

In the chart, each day shows the cumulative conversion rate (i.e. total sign ups since the beginning of the test divided by the total visitors since the beginning) for the test site (Test) and the current site (Control). Notice how well the test site is outperforming the current site.
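The cumulative rate described above can be sketched in a few lines of Python. This is only an illustration: the function name and the daily counts are made up, not taken from the actual experiment.

```python
from itertools import accumulate


def cumulative_conversion_rates(daily_visitors, daily_signups):
    """Return the cumulative conversion rate at the end of each day.

    Each value is total sign ups so far divided by total visitors so far.
    """
    total_visitors = accumulate(daily_visitors)
    total_signups = accumulate(daily_signups)
    return [s / v if v else 0.0 for s, v in zip(total_signups, total_visitors)]


# Hypothetical daily counts for one site, for illustration only.
visitors = [1000, 1200, 800]
signups = [10, 18, 8]
rates = cumulative_conversion_rates(visitors, signups)

for day, rate in enumerate(rates, start=1):
    print(f"day {day}: {rate:.4f}")
```

Running the same function on the Test and Control counts gives the two lines plotted in the chart.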

However, anyone who’s played games of chance can tell you that numbers that look good on this turn may not be so hot on the next. For example, if you flipped a quarter 5 times and it came up heads 4 times, would you feel confident betting that the coin is biased towards heads? What if you flipped 80 heads out of 100 tosses? At this point, you’d be much more confident that the coin is biased towards heads. In our A/B test, we measure the conversion rate for a small subset of all visitors, let’s say 10,000 visitors with 100 sign ups. Do we believe that this conversion rate will be the same for the millions of visitors we expect in the months to come? Do we need to test 1,000,000 visitors to be confident that the observed increase will apply to all visitors and was not just the luck of the draw?
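To put rough numbers on the coin example, here is a small sketch that computes how likely a fair coin is to produce at least that many heads. The helper name is hypothetical, and the calculation is the plain binomial tail probability.

```python
from math import comb


def prob_at_least(heads, tosses, p=0.5):
    """Probability of seeing `heads` or more heads in `tosses` flips
    of a coin that lands heads with probability `p`."""
    return sum(
        comb(tosses, k) * p**k * (1 - p) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )


few_flips = prob_at_least(4, 5)      # 4 or more heads out of 5
many_flips = prob_at_least(80, 100)  # 80 or more heads out of 100

print(f"4+ heads in 5 tosses:    {few_flips:.4f}")
print(f"80+ heads in 100 tosses: {many_flips:.2e}")
```

A fair coin gives 4 or more heads in 5 tosses almost 19% of the time, so that result says very little. Getting 80 or more heads in 100 tosses from a fair coin is so unlikely that bias is the far more plausible explanation.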

Statistical Confidence
Statisticians have figured out a way to calculate a numerical representation of the confidence that the population (i.e. the millions of visitors our site will see in the future) will show an increased conversion rate if the sample (i.e. the thousands of visitors that have hit the test site so far) shows an increase. Though we have this reliable, albeit complex, formula for the confidence number (using a 2-proportion z-test, or an online calculator), explaining what this number means to the rest of my team hasn’t always been easy. How would you interpret: “We saw a 30% increase in sign ups and we’re only 90% confident there is an increase.” Roughly, this means that if the two sites actually converted at the same rate, a test like this one would show an increase at least this large only about 10 times out of 100, purely by chance. The real increase, if there is one, is not necessarily 30%. For some organizations, this would be enough confidence to make the test site the actual site for everyone; for others, it wouldn’t. The decision of what confidence level to use comes down to a trade-off between speed and certainty.
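For readers who want to see the arithmetic, here is a minimal sketch of a one-sided 2-proportion z-test using only the Python standard library. The function names and the visitor counts are assumptions for illustration, and the pooled standard error used here is one common textbook form, not necessarily the exact variant we use internally.

```python
from math import erf, sqrt


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def confidence_of_increase(control_visitors, control_signups,
                           test_visitors, test_signups):
    """One-sided confidence that the test site out-converts the control,
    based on a pooled 2-proportion z-test."""
    p_control = control_signups / control_visitors
    p_test = test_signups / test_visitors
    pooled = (control_signups + test_signups) / (control_visitors + test_visitors)
    std_err = sqrt(pooled * (1 - pooled)
                   * (1 / control_visitors + 1 / test_visitors))
    if std_err == 0:
        return 0.5  # no sign ups (or all sign ups) in both groups: no evidence
    z = (p_test - p_control) / std_err
    return normal_cdf(z)


# Hypothetical counts: 10,000 visitors per site, 100 vs. 130 sign ups.
confidence = confidence_of_increase(10_000, 100, 10_000, 130)
print(f"confidence of an increase: {confidence:.3f}")  # roughly 0.98
```

With these made-up counts, a 30% lift on 10,000 visitors per site clears a 90% threshold comfortably. The same 30% lift on a tenth of the traffic would not.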

Unlike coin flipping, though, recreating the experiment over and over again would take too long and negate most of the gains we expect from A/B testing. So it is difficult for some to internalize what this confidence level represents. Many people, especially those who are risk-averse, don’t like dealing with probabilities and will keep asking for more data. But you’ll never be 100% certain that the test site converts better than the current site. So at some point you need to stop collecting data and make a decision.

Sunrise Charts
What I’ve found to be a useful aid in getting many of the risk-averse types to accept some risk has been to overlay confidence areas in the time series chart like so:

[Chart: the same cumulative conversion rates for Test and Control, with colored confidence bands overlaid]

My team has dubbed this a “Sunrise Chart” (yeah, I’ve never seen a green sky during a sunrise either, but you get the picture). The solid black line and dashed blue line are the same as in the previous chart, and the colored bands represent confidence levels. If the test line veers into the green area, we have a 90% level of confidence that the test site out-converts the current site.

Many of the less technically inclined members of my team find that this chart makes sense on a more intuitive level than a statement like: “We saw a 30% increase in sign ups and we’re 90% confident there is an increase.” The chart shows this same information, but it also shows two other things. First, the random day-to-day fluctuations in conversion rate average out and the rates stabilize over time. When people see more stable conversion rates, they are more inclined to feel confident in the difference they see. Second, this chart shows that as we collect more data over time, a smaller and smaller increase is needed to reach a specific confidence level. This is essentially the same piece of information as seeing the conversion rates stabilize, but since these confidence bands are generated from a complex mathematical formula, it gives some peace of mind that the underlying math jibes with their gut.
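The narrowing of the bands can be illustrated with a rough sketch. It assumes equal traffic to both sites and approximates the standard error from the control rate alone, so it is a simplification for illustration rather than the exact formula behind our charts. The function name and the 1% control rate are made up.

```python
from math import sqrt
from statistics import NormalDist


def band_threshold(control_rate, visitors_per_site, confidence):
    """Approximate test-site conversion rate needed to reach `confidence`
    that the test site out-converts the control.

    Assumes equal visitors on both sites and a standard error based on
    the control rate only (a simplifying assumption).
    """
    z_crit = NormalDist().inv_cdf(confidence)
    std_err = sqrt(2 * control_rate * (1 - control_rate) / visitors_per_site)
    return control_rate + z_crit * std_err


# Hypothetical 1% control conversion rate at growing cumulative sample sizes.
for visitors in (1_000, 5_000, 20_000):
    needed = band_threshold(0.01, visitors, 0.90)
    lift = (needed / 0.01 - 1) * 100
    print(f"{visitors:>6} visitors per site: need {needed:.4f} ({lift:.0f}% lift)")
```

Plotting the threshold for each confidence level against cumulative visitors gives the lower edge of each colored band, and the printed lift shrinks as the sample grows.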

Conclusions
To wrap things up, at PBworks we believe that A/B testing is an important tool for developing the most relevant software for our customers. However, when experimenting, it is not enough to simply compare the conversion rates of the test site with the current site. We want some level of certainty that if we do see an increase, it is not simply due to a lucky draw. That is where confidence levels come into play. Finally, and perhaps most importantly, it’s not enough for just the technically inclined to “get it” with a statistical analysis of the results. Rather, the whole team needs to be on board with the decisions that result from the experiment, so everyone needs to be comfortable with the analysis. This is where Sunrise Charts can be a valuable aid.

10 thoughts on “A/B Testing at PBworks”

Kevin says:

September 16, 2009 at 11:06 pm

Why are you showing the data as a time series? Wouldn’t aggregate + error bars be more intuitive? Is this just chart porn?

bojanbabic says:

September 17, 2009 at 12:49 am

I agree, if the timing of the A/B tests is perfect. Otherwise, an A/B test can be misleading. An increase or decrease can be the result of previous campaigns, new features, or milliseconds gained from code refactoring.
What if A/B tests run for a very long time while new features are rolled out (e.g. Google’s BIG search box was tested for a year before it was finally pushed)?

It’s very hard to distinguish results, and very tough to make decisions and balance development speed against making the right turns.

Cheers

yohannes sitorus says:

September 19, 2009 at 8:39 am

Mr.weiss i can’t log in what is my user and pass

Tom Collier says:

September 21, 2009 at 2:50 pm

Kevin, if I were the only one involved in making decisions, aggregate + error bars would be sufficient. However, members from the whole team (engineers, marketing managers, etc) must ultimately buy into A/B testing so that the results can be incorporated into decision making. Many on the team aren’t necessarily well versed in statistical jargon.

Initially my talks about the null hypothesis and confidence intervals were met with glazed-over stares. (The language of statistics is far from intuitive.) But, when I presented a sunrise chart, everyone just seemed to get it. This chart tells a story about how the test played out over time. Wrapping a story around numbers makes the analysis more accessible and ultimately allowed us all to accept the results and uncertainty of the test.

1. Kalea says:
  
  September 28, 2011 at 12:27 am
  
  If not for your writing this topic could be very convoluted and oblique.
  
Tom Collier says:

September 21, 2009 at 2:56 pm

bojanbabic, we like to keep our tests short (1 – 2 weeks) for some of those very reasons. Our mix of traffic can change quite dramatically from one week to the next. However, we run a control site concurrently with the test sites and randomize which site a new visitor will see. This allows us to account for any interesting events that happen during the test. For example, if we had a spike in traffic on day 3 of a test, both the control and the test site would see the same spikes.

Tom Collier says:

September 21, 2009 at 2:59 pm

yohannes sitorus, while we boast a healthy readership of our blog, I can’t guarantee that Mr. Weiss saw your comment. Try visiting your workspace (http://.pbworks.com/) and clicking the “Contact the workspace owner” link.

kevin says:

September 21, 2009 at 3:26 pm

@Tom, sounds reasonable – you’ve a/b tested your presentation of a/b testing results . . . now if you could just present the glazed-over metrics in a more appealing chart format 🙂
