Google Experiments with Multi-Armed Bandits for Improved Conversions

Google has explained how multi-armed bandit experiments in Google Analytics can help identify the most profitable action, and how updating the randomization distribution as the experiment progresses can save time. These experiments also shift web traffic towards the best variations gradually, without waiting until the end of the experiment.

Multi-armed bandit experiments rely on a Google algorithm that monitors the performance of multiple variations and diverts traffic towards those that appear to be performing best. As the experiment progresses, Google Analytics gets a clearer picture of which variations perform best and moves more of the traffic towards them.

Google states that a simple A/B test would take around 223 days on a website that gets 100 visits a day with a 4 percent conversion rate, if the test is to have a 95 percent probability of detecting a shift in the conversion rate from .04 to .05. With a multi-armed bandit experiment, one need not wait the full 223 days. On the first day, 50 visits are assigned to each arm, and Google then uses Bayes’ theorem to estimate the probability that each arm is the better one.
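
The 223-day figure can be roughly reproduced with a textbook sample-size formula for comparing two proportions. The sketch below is only an illustration: the two-sided 5 percent significance level and the function name `visits_per_arm` are assumptions, since the article gives only the 95 percent detection probability, the two conversion rates and the daily traffic.

```python
from math import ceil, sqrt
from statistics import NormalDist


def visits_per_arm(p_base, p_variant, alpha=0.05, power=0.95):
    """Approximate visits needed in each arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test (assumed alpha)
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired detection probability
    p_bar = (p_base + p_variant) / 2               # pooled rate under "no difference"
    term_null = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    term_alt = z_power * sqrt(p_base * (1 - p_base) + p_variant * (1 - p_variant))
    return ceil((term_null + term_alt) ** 2 / (p_variant - p_base) ** 2)


per_arm = visits_per_arm(0.04, 0.05)
days = 2 * per_arm / 100  # two arms sharing 100 visits per day
print(f"{per_arm} visits per arm, about {days:.0f} days")
```

Under these assumptions the result is a little over 11,000 visits per arm, which at 100 visits a day works out to roughly 223 days.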

In this example, suppose the original arm gets lucky on the first day and appears to have a 70 percent chance of being the better arm. Analytics then sends 70 percent of the second day's traffic to the original arm and 30 percent to the variation. The calculation is repeated at the end of the second day, and again after each following day, with traffic reallocated according to the updated probabilities. Once the variation crosses the confidence level of 95 percent, the experiment concludes. This specific experiment concluded in 66 days, saving 157 of the 223 days.
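
The daily update described above can be sketched as follows. This is a minimal illustration, not Google's implementation: it assumes a uniform Beta(1, 1) prior on each arm's conversion rate and estimates the probability that each arm is best by Monte Carlo sampling. The function name and the day-one conversion counts are hypothetical.

```python
import random


def prob_each_arm_is_best(successes, trials, draws=20000, seed=0):
    """Monte Carlo estimate of P(arm has the highest conversion rate).

    Assumes a uniform Beta(1, 1) prior on every arm (an assumption of this sketch).
    """
    rng = random.Random(seed)
    wins = [0] * len(trials)
    for _ in range(draws):
        # One joint draw from the Beta posteriors of all arms.
        sample = [rng.betavariate(1 + s, 1 + n - s) for s, n in zip(successes, trials)]
        wins[sample.index(max(sample))] += 1
    return [w / draws for w in wins]


# Hypothetical day-one data: 50 visits per arm, 3 conversions vs 2 conversions.
weights = prob_each_arm_is_best(successes=[3, 2], trials=[50, 50])

# Tomorrow's traffic is split in proportion to those probabilities.
visits_tomorrow = [round(100 * w) for w in weights]
print(weights, visits_tomorrow)
```

Running the same function on the accumulated counts at the end of each day gives the next day's split, so traffic drifts towards the arm that is more likely to be the better one.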

Google ran 500 such experiments and found that, on average, the bandit finished around 175 days sooner than the classical test sized by the power calculation, with average savings of 97.5 conversions. Google also found that the error rate was approximately the same as in a classical test and that only 5 of the 500 tests took longer than the power analysis suggested.

Google ran each multi-armed bandit experiment for a minimum of 2 weeks. After that, the experiment concluded once there was a 95 percent probability that a variation was better than the original. Users can adjust both the 95 percent confidence level and the 2 week minimum duration to suit their needs. Another metric monitored in these experiments is the Potential Value Remaining. If the experiment has identified a Champion arm, the value remaining is the increase in conversion rate one could still obtain by switching away from that Champion.

If one is 100 percent sure of the Champion, the experiment has logically concluded. If one is only 70 percent confident, there is a 30 percent chance that another arm is better, and Bayes’ rule can indicate how much better that arm might be. Google Analytics stops the experiment once there is at least a 95 percent probability that the value remaining is less than 1 percent of the Champion arm’s conversion rate. This is a 1 percent relative improvement, not one percentage point: in an experiment where the Champion converts at 4 percent, the experiment stops once the value remaining is less than .04 percentage points of conversion rate (CvR).
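
The stopping rule above can be sketched in a few lines. As before, this is an illustration under stated assumptions rather than Google's code: it uses Beta(1, 1) priors, treats the arm that wins the most posterior draws as the Champion, and measures value remaining in each draw as the relative lift of the best arm over the Champion. The function name and the example counts are hypothetical.

```python
import random


def value_remaining(successes, trials, draws=20000, q=0.95, seed=0):
    """Return (champion index, q-th quantile of the relative value remaining)."""
    rng = random.Random(seed)
    samples = [
        [rng.betavariate(1 + s, 1 + n - s) for s, n in zip(successes, trials)]
        for _ in range(draws)
    ]
    # Champion: the arm that comes out on top in the most posterior draws.
    wins = [0] * len(trials)
    for row in samples:
        wins[row.index(max(row))] += 1
    champion = wins.index(max(wins))
    # Per draw: how much better (relatively) the best arm is than the Champion.
    lifts = sorted((max(row) - row[champion]) / row[champion] for row in samples)
    return champion, lifts[min(draws - 1, int(q * draws))]


# Hypothetical counts: conversions and visits for the original and one variation.
champion, vr95 = value_remaining(successes=[40, 50], trials=[1000, 1000])
stop = vr95 < 0.01  # stop when 95% sure less than 1% of the Champion's rate remains
print(champion, vr95, stop)
```

With a 4 percent Champion, the 0.01 threshold corresponds to the .04 percentage points mentioned above.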

When several arms perform about equally well, the experiment does not need to single out the very best of them. It only needs to run long enough to establish that switching arms would not be beneficial.

This approach is especially useful for users who want to test several arms. For instance, testing 6 arms would traditionally take 919 days at 100 visits per day. However, based on the conversion rates discussed above, the bandit experiment could conclude within approximately 88 days and save about 1,173 conversions. Even in the worst case scenario, running the multi-armed bandit experiment would save around 800 conversions.
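
The article does not say how the 919-day figure was obtained. One calculation that lands on it, shown here purely as an assumption, is the same two-proportion power calculation (.04 versus .05, 95 percent power) with a Bonferroni correction for comparing five variations against the original. The function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def bonferroni_visits_per_arm(p_base, p_best, arms, alpha=0.05, power=0.95):
    """Visits per arm when alpha is divided across (arms - 1) comparisons."""
    comparisons = arms - 1
    z_alpha = NormalDist().inv_cdf(1 - alpha / comparisons / 2)  # stricter two-sided cutoff
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_best) / 2
    term_null = z_alpha * sqrt(2 * p_bar * (1 - p_bar))
    term_alt = z_power * sqrt(p_base * (1 - p_base) + p_best * (1 - p_best))
    return ceil((term_null + term_alt) ** 2 / (p_best - p_base) ** 2)


arms = 6
per_arm = bonferroni_visits_per_arm(0.04, 0.05, arms)
total_days = ceil(arms * per_arm / 100)  # 100 visits per day shared by all arms
print(per_arm, total_days)
```

Under these assumptions each arm needs a little over 15,000 visits, so the classical test would finish during day 919.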

In the simulated tests, Google Analytics quickly identifies the arms that are performing very poorly, even though there may be some confusion in the first few days. It could take around 50 days for the experiment to locate an arm that fares really well against the original arm. From then on, the best arm and the original arm split the 100 daily observations between them until the ultimate winner is found.

The cost of running the multi-armed bandit experiment, measured in conversions lost by sending traffic to inferior arms, also falls with each passing day. A classical experiment could cost 1.333 conversions for each day it runs. In the multi-armed bandit experiment the daily cost shrinks, because inferior arms are given less weight as the experiment moves from one day to the next.
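
The daily cost can be read as the expected number of conversions lost compared with sending every visit to the best arm. The sketch below illustrates that arithmetic. The six conversion rates are assumptions, not listed in this article, chosen so that an equal split of 100 visits reproduces the 1.333 figure. The bandit weights are likewise hypothetical.

```python
def expected_daily_cost(rates, visits):
    """Expected conversions lost per day versus sending all traffic to the best arm."""
    best = max(rates)
    return sum(v * (best - r) for r, v in zip(rates, visits))


# Assumed conversion rates for six arms (illustrative, not taken from the article).
rates = [0.04, 0.05, 0.045, 0.03, 0.02, 0.035]

# Classical test: 100 daily visits split equally for the whole experiment.
classical_cost = expected_daily_cost(rates, [100 / 6] * 6)

# Hypothetical bandit allocation later in the experiment, favouring the stronger arms.
bandit_cost = expected_daily_cost(rates, [30, 50, 15, 2, 1, 2])

print(round(classical_cost, 3), round(bandit_cost, 3))
```

The classical cost stays fixed for every day of the test, while the bandit's cost keeps falling as more of the weight moves onto the better arms.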

The multi-armed bandit experiment offered by Google Analytics is certainly much more efficient than classical experiments and can save a lot of time and conversions upon successful completion.