Most ads you see are chosen by a reinforcement learning model — here’s how it works

Every day, digital advertisement agencies serve billions of ads on news websites, search engines, social media networks, video streaming websites, and other platforms. And they all want to answer the same question: Which of the many ads they have in their catalog is more likely to appeal to a certain viewer? Finding the right answer to this question can have a huge impact on revenue when you are dealing with hundreds of websites, thousands of ads, and millions of visitors.

Fortunately (for the ad agencies, at least), reinforcement learning, the branch of artificial intelligence that has become renowned for mastering board and video games, provides a solution. Reinforcement learning models seek to maximize rewards. In the case of online ads, the RL model will try to find the ad that users are more likely to click on.

The digital ad industry generates hundreds of billions of dollars every year and provides an interesting case study of the powers of reinforcement learning.

Naïve A/B/n testing

To better understand how reinforcement learning optimizes ads, consider a very simple scenario: You’re the owner of a news website. To pay for the costs of hosting and staff, you have entered a contract with a company to run their ads on your website. The company has provided you with five different ads and will pay you one dollar every time a visitor clicks on one of the ads.

Your first goal is to find the ad that generates the most clicks. In advertising lingo, you will want to maximize your click-trhough rate (CTR). The CTR is ratio of clicks over number of ads displayed, also called impressions. For instance, if 1,000 ad impressions earn you three clicks, your CTR will be 3 / 1000 = 0.003 or 0.3%.

Before we solve the problem with reinforcement learning, let’s discuss A/B testing, the standard technique for comparing the performance of two competing solutions (A and B) such as different webpage layouts, product recommendations, or ads. When you’re dealing with more than two alternatives, it is called A/B/n testing.

[Read: How do you build a pet-friendly gadget? We asked experts and animal owners]

In A/B/n testing, the experiment’s subjects are randomly divided into separate groups and each is provided with one of the available solutions. In our case, this means that we will randomly show one of the five ads to each new visitor of our website and evaluate the results.

Say we run our A/B/n test for 100,000 iterations, roughly 20,000 impressions per ad. Here are the clicks-over-impression ratio of our ads:

Ad 1: 80/20,000 = 0.40% CTR

Ad 2: 70/20,000 = 0.35% CTR

Ad 3: 90/20,000 = 0.45% CTR

Ad 4: 62/20,000 = 0.31% CTR

Ad 5: 50/20,000 = 0.25% CTR

Our 100,000 ad impressions generated $352 in revenue with an average CTR of 0.35%. More importantly, we found out that ad number 3 performs better than the others, and we will continue to use that one for the rest of our viewers. With the worst performing ad (ad number 2), our revenue would have been $250. With the best performing ad (ad number 3), our revenue would have been $450. So, our A/B/n test provided us with the average of the minimum and maximum revenue and yielded the very valuable knowledge of the CTR rates we sought.

Digital ads have very low conversion rates. In our example, there’s a subtle 0.2% difference between our best- and worst-performing ads. But this difference can have a significant impact on scale. At 1,000 impressions, ad number 3 will generate an extra $2 in comparison to ad number 5. At a million impressions, this difference will become $2,000. When you’re running billions of ads, a subtle 0.2% can have a huge impact on revenue.

Therefore, finding these subtle differences is very important in ad optimization. The problem with A/B/n testing is that it is not very efficient at finding these differences. It treats all ads equally and you need to run each ad tens of thousands of times until you discover their differences at a reliable confidence level. This can result in lost revenue, especially when you have a larger catalog of ads.

Another problem with classic A/B/n testing is that it is static. Once you find the optimal ad, you will have to stick to it. If the environment changes due to a new factor (seasonality, news trends, etc.) and causes one of the other ads to have a potentially higher CTR, you won’t find out unless you run the A/B/n test all over again.

What if we could change A/B/n testing to make it more efficient and dynamic?

This is where reinforcement learning comes into play. A reinforcement learning agent starts by knowing nothing about its environment’s actions, rewards, and penalties. The agent must find a way to maximize its rewards.

In our case, the RL agent’s actions are one of five ads to display. The RL agent will receive a reward point every time a user clicks on an ad. It must find a way to maximize ad clicks.

The multi-armed bandit

multi-armed bandit