How to Avoid Potential A/B Testing Mistakes

“Enlightened trial and error outperforms the planning of flawless intellects.” - David Kelley


1. Testing without a hypothesis, or with a hypothesis that is not backed by data and insights: If we run an experiment without a hypothesis, we can still gather information and get results, but we will probably have a hard time interpreting them or learning from them, since we don’t know what we are actually trying to determine.


2. Mixing two or more hypotheses in one test: If we test two hypotheses at the same time, it becomes hard to explain a change in metrics or user behaviour afterwards, since it is not clear which hypothesis, and which related solution, is responsible for the change.


3. Calling the test off too soon, at the moment we like the results, or as soon as we see a statistically significant change in one of the variations: Even if the significance level is tempting, significance alone is not enough to decide whether the experiment should stop or continue. Instead, before running the test, we should calculate the number of visitors required (a sufficient sample size) and how long the test should run (ideally full weeks, since e-commerce behaviour differs from day to day), so that we get results we can trust. Then we need to end the test at this pre-defined time, no matter what the results are. At the end of this pre-defined time frame,

  • we could either see that there is a significant difference between the variations, which means that variation A is most probably better or worse than variation B in terms of the target metric

or

  • we could see that there is no significant difference between the variations, which means that we are unable to determine whether the change made on the website/app has a positive or a negative impact on the target metric, so as far as this test can tell, either variation is fine.

If we don’t end the test at the pre-defined time, there is a much greater chance that we will pick the wrong winner.
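
To make the risk of stopping early more concrete, here is a minimal simulation sketch. It assumes an A/A test (both variations share the same true conversion rate, so any "winner" is a false positive), a two-sided z-test for proportions, and made-up traffic numbers. The function names and parameters are hypothetical and only serve the illustration.

```python
import math
import random


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions at all (or only conversions): nothing to compare
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_false_positive_rates(n_sims=500, looks=20, batch=100,
                                  rate=0.10, alpha=0.05, seed=42):
    """Compare 'stop at the first significant look' with 'test once at the end'.

    Both variations convert at the same assumed `rate`, so every
    significant result counted here is a false positive.
    """
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _sim in range(n_sims):
        conv_a = conv_b = visitors = 0
        significant_at_any_look = False
        p_value = 1.0
        for _look in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            visitors += batch
            p_value = z_test_p_value(conv_a, visitors, conv_b, visitors)
            if p_value < alpha:
                significant_at_any_look = True  # a peeker would have stopped here
        peeking_hits += int(significant_at_any_look)
        fixed_hits += int(p_value < alpha)  # only the final, pre-defined look counts
    return peeking_hits / n_sims, fixed_hits / n_sims


if __name__ == "__main__":
    peeking, fixed = simulate_false_positive_rates()
    print(f"false positives when stopping at first significance: {peeking:.1%}")
    print(f"false positives when testing once at the end:        {fixed:.1%}")
```

Under these assumptions, the share of false winners for the "peeking" strategy should come out clearly above the share for the single pre-defined look, even though nothing differs between the two variations.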


4. Doing A/B tests without enough traffic: As mentioned in 3, we need to run our tests with a large enough sample size and for a sufficient length of time to get reliable and actionable results. That’s why we need enough traffic to reach the required sample size in a reasonable amount of time: waiting months for a test result is not an efficient way to test, learn and improve our product.
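
As a rough sketch of how traffic translates into test duration, the helper below rounds the run time up to full weeks. The function name and all numbers in the usage example are hypothetical. It also assumes that every daily visitor enters the test and that traffic is split evenly.

```python
import math


def planned_duration_weeks(sample_per_variation, n_variations, daily_visitors):
    """Number of full weeks needed to reach the required sample size.

    Assumes all daily visitors are included in the test and split
    evenly across the variations.
    """
    if daily_visitors <= 0:
        raise ValueError("daily_visitors must be positive")
    total_needed = sample_per_variation * n_variations
    days_needed = math.ceil(total_needed / daily_visitors)
    return math.ceil(days_needed / 7)  # round up to full weeks


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 users per variation, two variations.
    print(planned_duration_weeks(20_000, 2, daily_visitors=3_000))  # 2
    print(planned_duration_weeks(20_000, 2, daily_visitors=500))    # 12
```

With 3,000 daily visitors the example test fits into two weeks, while with 500 daily visitors the same test would need twelve weeks, which is where low traffic starts to hurt.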

It is difficult to give a definitive number for the minimum required traffic since the required sample size for the tests depends on:

  • The current value of the target metric and the change we expect in it due to the change made on the website/app, which in turn depends on the contrast between the variations. So the larger the expected effect, the fewer users we need.



  • The number of variations: As can be seen above, if we have two variations, then the required sample size is 122,123*2 (122,123 users per variation), whereas with four variations it goes up to 122,123*4. So the higher the number of variations, the more users we need.

In the end, the bigger the sample size, the higher the confidence we get for a given change, or the easier it is to detect smaller effects at a given confidence level.
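
The two dependencies listed above can be sketched with the textbook normal-approximation formula for comparing two conversion rates. This is only an illustration: the two-sided 5% significance level, the 80% power, and the function names are assumptions, and the figures it produces will not necessarily match any particular sample size calculator. Like the example above, it simply multiplies the per-variation sample size by the number of variations and applies no correction for multiple comparisons.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variation to detect a relative lift.

    baseline:      current conversion rate, e.g. 0.05 for 5%
    relative_lift: expected relative change, e.g. 0.10 for +10%
    """
    if relative_lift == 0:
        raise ValueError("relative_lift must be non-zero")
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


def total_sample_size(baseline, relative_lift, n_variations, **kwargs):
    """Total users needed when every variation gets the same sample size."""
    return n_variations * sample_size_per_variation(baseline, relative_lift, **kwargs)


if __name__ == "__main__":
    for lift in (0.05, 0.10, 0.20):
        n = sample_size_per_variation(baseline=0.05, relative_lift=lift)
        print(f"expected lift {lift:.0%}: {n} users per variation")
    for variations in (2, 4):
        total = total_sample_size(0.05, 0.10, variations)
        print(f"{variations} variations: {total} users in total")
```

Running it shows both effects at once: the required sample size drops as the expected lift grows, and the total grows in proportion to the number of variations.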


5. Trying to have pixel-perfect designs and technically perfect implementations for A/B tests: To run an A/B test, we don’t need finished, perfectly designed features. What we need are the essentials that help us test our hypothesis. For example, if we think that an element is confusing, we can simply black it out and easily measure the impact of having this element. So, rather than spending lots of time trying to build perfect features and designs, we need to focus on learning quickly and cheaply.


6. Changing the test setup (traffic split, audience etc.) during testing without purging the data: If we want to change the test setup after the test has started, we should always either purge the data collected before the change or start a new test, because otherwise the results mix data from two different setups, which leads to data pollution.
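
A small worked example shows one way this pollution can appear. The numbers are invented for illustration: both variations convert identically in each period, but the overall conversion rate rises in the second period (say, during a campaign) and the traffic split is changed from 50/50 to 20/80 at the same moment. The names used here are hypothetical.

```python
from typing import NamedTuple


class Period(NamedTuple):
    label: str
    visitors_a: int
    visitors_b: int
    rate_a: float  # assumed true conversion rate of A in this period
    rate_b: float  # assumed true conversion rate of B in this period


# Made-up data: A and B are identical within each period.
PERIODS = [
    Period("before change (50/50 split)", 5000, 5000, 0.10, 0.10),
    Period("after change (20/80 split)", 2000, 8000, 0.20, 0.20),
]


def pooled_rates(periods):
    """Conversion rate per variation when all periods are lumped together."""
    conversions_a = sum(p.visitors_a * p.rate_a for p in periods)
    conversions_b = sum(p.visitors_b * p.rate_b for p in periods)
    visitors_a = sum(p.visitors_a for p in periods)
    visitors_b = sum(p.visitors_b for p in periods)
    return conversions_a / visitors_a, conversions_b / visitors_b


if __name__ == "__main__":
    rate_a, rate_b = pooled_rates(PERIODS)
    print(f"pooled over both periods: A = {rate_a:.1%}, B = {rate_b:.1%}")
    rate_a, rate_b = pooled_rates(PERIODS[1:])  # data before the change purged
    print(f"after purging old data:   A = {rate_a:.1%}, B = {rate_b:.1%}")
```

Pooled over both periods, A comes out at about 12.9% and B at about 16.2%, although the two variations never differed. B only looks better because it received more of its traffic in the high-converting period. Once the data from before the change is purged, both variations show the same 20%.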

7. Less-than-ideal technical implementation: In general, we need to make sure that our tests don’t harm performance through poorly implemented solutions such as heavy CSS or inefficient JavaScript.


This is the last piece of this A/B testing essentials mini series. I hope you enjoyed it, and thanks to everyone who stuck around until the end!

References