Metrics

Let's look at the online and offline metrics used to judge the performance of an ad prediction system.

The metrics used in our ad prediction system will help us select the best machine-learned models to show relevant ads to the user. They should also ensure that these models improve the overall platform, increase revenue, and provide value for advertisers.

Like any other optimization problem, there are two types of metrics to measure the effectiveness of our ad prediction system:

  1. Offline metrics
  2. Online metrics

📝 Why are both online and offline metrics important?

Offline metrics are mainly used to quickly compare models offline and see which one gives the best result. Online metrics are used to validate the model in an end-to-end system, observing how revenue and engagement rate improve before making the final decision to launch the model.

Offline metrics #

As we build models, the best way to compare them offline is to measure prediction accuracy rather than revenue impact directly. The following are a few metrics that enable us to compare models more effectively offline.

Log Loss #

Let’s first go over the area under the receiver operating characteristic curve (AUC), which is a commonly used metric for model comparison in binary classification tasks. However, given that the system needs well-calibrated prediction scores, AUC has the following shortcomings in this ad prediction scenario.

  1. AUC does not penalize for “how far off” a predicted score is from the actual label. For example, let’s take two positive examples (i.e., with actual label 1) that have predicted scores of 0.51 and 0.7 at a threshold of 0.5. AUC treats these two scores identically, even though 0.7 is much closer to the actual label.

  2. AUC is insensitive to well-calibrated probabilities, since it depends only on the rank ordering of the predicted scores.

Calibration measures the ratio of the average predicted rate to the average empirical rate. In other words, it is the ratio of the number of expected actions to the number of actually observed actions.

$$\text{Calibration} = \frac{\text{predicted rate}}{\text{actual historically observed rate}}$$

Why do we need calibration?

When we have a significant class imbalance, i.e., the distribution is heavily skewed towards one class, we calibrate our model so that its predicted score estimates the likelihood of a data point belonging to the positive class.
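As a rough illustration, calibration can be computed directly from the predicted probabilities and the observed labels. This is a minimal sketch; the function and variable names are made up for the example, and NumPy is assumed to be available.

```python
import numpy as np

def calibration_ratio(predicted_probs, observed_labels):
    """Ratio of the average predicted engagement rate to the
    empirically observed engagement rate (a value close to 1.0
    indicates a well-calibrated model)."""
    predicted_rate = np.mean(predicted_probs)   # average predicted probability
    observed_rate = np.mean(observed_labels)    # fraction of observed engagements
    return predicted_rate / observed_rate

# Hypothetical example: the model predicts ~30% engagement on average,
# but only 20% of the impressions actually led to engagement.
probs = np.array([0.2, 0.4, 0.3, 0.3, 0.3])
labels = np.array([0, 1, 0, 0, 0])
print(calibration_ratio(probs, labels))  # 1.5 -> the model over-predicts engagement
```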

Since, in our case, the model’s predicted score must be well-calibrated to be used in the auction, we need a calibration-sensitive metric. Log loss captures this effectively, as log loss (or, more precisely, cross-entropy loss) is the measure of our predictive error.

This metric captures the degree to which the predicted probabilities diverge from the class labels. As such, it is an absolute measure of quality that rewards well-calibrated, probabilistic output.

Let’s consider a scenario that shows why log loss gives a better signal than AUC. If we multiply all the predicted scores by a factor of 2, so that our average predicted rate becomes double the empirical rate, AUC won’t change (the ranking of the scores is preserved), but log loss will, penalizing the now miscalibrated scores.
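The following sketch illustrates this point with scikit-learn's metrics (assuming scikit-learn and NumPy are available; the labels and scores are made-up toy data). Doubling the scores preserves their ranking, so AUC is unchanged, while the average predicted rate becomes twice the empirical rate and log loss worsens.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

labels = np.array([0, 0, 0, 1])                 # empirical engagement rate = 0.25
scores = np.array([0.15, 0.20, 0.25, 0.40])     # roughly calibrated (mean = 0.25)
doubled = scores * 2                            # mean = 0.50, twice the empirical rate

# AUC depends only on the ordering of the scores, so it is identical.
print(roc_auc_score(labels, scores), roc_auc_score(labels, doubled))  # 1.0 1.0

# Log loss depends on the actual probability values, so it gets worse
# for the miscalibrated (doubled) scores.
print(log_loss(labels, scores), log_loss(labels, doubled))  # ~0.40 vs ~0.45
```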

In our case, it’s a binary classification task: a user either engages with the ad or does not. We use class label 0 for no engagement (with an ad) and class label 1 for engagement (with an ad). The log loss is then:

$$-\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\,\right]$$

Here:

  • N is the number of observations
  • y_i is the binary label of observation i (1 if the user engaged with the ad, 0 otherwise)
  • p_i is the model’s predicted probability that observation i belongs to class 1 (engagement)
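A minimal NumPy implementation of this formula might look as follows; the function and variable names are illustrative, not from a specific library.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Cross-entropy (log loss) for binary engagement labels.

    y_true: array of 0/1 labels (1 = user engaged with the ad)
    p_pred: array of predicted engagement probabilities
    """
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Example usage with made-up predictions
y = np.array([1, 0, 1, 0])
p = np.array([0.8, 0.1, 0.6, 0.3])
print(binary_log_loss(y, p))  # ~0.30
```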

Online metrics #

For online systems or experiments, the following are good metrics to track:

Overall revenue #

This captures the overall revenue generated by the system for the cohorts of users in an experiment or, more generally, measures the overall performance of the system. It’s important to call out that measuring revenue alone is a short-term approach; if we don’t provide enough value to advertisers, they will move away from the system. However, revenue is definitely one critical metric to track. We will discuss other essential metrics to track shortly.

Revenue is basically computed as the sum of the winning bid values (as selected by the auction) whenever the predicted event happens, e.g., if the winning bid is $0.5 and the user clicks on the ad, the advertiser is charged $0.5. The advertiser isn’t charged if the user doesn’t click on the ad.
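As a rough sketch of that computation (the event records below are hypothetical; a real system would aggregate billing events from its logs):

```python
# Each record is one served ad: the auction's winning bid and whether
# the billed event (here, a click) actually happened.
ad_events = [
    {"winning_bid": 0.50, "clicked": True},    # advertiser charged $0.50
    {"winning_bid": 0.30, "clicked": False},   # no click -> no charge
    {"winning_bid": 0.80, "clicked": True},    # advertiser charged $0.80
]

revenue = sum(event["winning_bid"] for event in ad_events if event["clicked"])
print(revenue)  # 1.3
```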

Overall ads engagement rate #

Engagement rate measures the overall rate of the action selected by the advertiser.

Some of the actions might be:

1. Click rate

This measures the ratio of user clicks to ad impressions.

2. Downstream action rate

This measures the rate of a particular action targeted by the advertiser, e.g., add-to-cart rate, purchase rate, message rate, etc.
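A minimal sketch of how these rates might be computed from aggregate counts is shown below. The function names and the choice of denominators are assumptions for illustration; for example, a downstream action rate could also be defined per impression rather than per click.

```python
def click_rate(clicks, impressions):
    """Ratio of ad clicks to ad impressions."""
    return clicks / impressions if impressions else 0.0

def downstream_action_rate(actions, clicks):
    """Rate of the advertiser-selected action (e.g., add to cart,
    purchase, message) among users who clicked the ad."""
    return actions / clicks if clicks else 0.0

print(click_rate(clicks=150, impressions=10_000))       # 0.015
print(downstream_action_rate(actions=12, clicks=150))   # 0.08
```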

More positive user engagement on the ad results in more revenue

Counter metrics #

It’s important to track counter metrics to see if the ads are negatively impacting the platform.

We want users to keep engaging with the platform, and ads should not hinder that interest. That is why it’s important to measure the impact of ads on the overall platform as well as the direct negative feedback provided by users. There is a risk that users will leave the platform if ads degrade the experience significantly.

So, for online ads experiments, we should track key platform metrics. For example, for search engines: Is session success going down significantly because of ads? Are average queries per user impacted? Is the number of returning users on the platform impacted? These are a few important metrics to track to see if ads have a significant negative impact on the platform.

Along with these top-level metrics, it’s important to track the direct negative feedback users give on an ad, such as:

  1. Hide ad
  2. Never see this ad
  3. Report ad as inappropriate

This negative feedback can lead users to perceive the product itself negatively.
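One way to monitor this during an online experiment is to track the per-impression rate of each negative feedback action and compare treatment against control; the sketch below uses a simple dictionary of hypothetical event counts.

```python
# Hypothetical counts collected for one experiment group.
feedback_counts = {"hide_ad": 120, "never_see_this_ad": 45, "report_ad": 15}
ad_impressions = 1_000_000

# Per-impression rate of each negative feedback action; a significant
# increase relative to the control group is a red flag for the new model.
for action, count in feedback_counts.items():
    print(action, count / ad_impressions)
```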

A user clicking on an ad gives a positive impression of the product.

A user reporting an ad gives a negative impression of the product.

All of the metrics discussed above can be used to measure user engagement with ads and the advertiser revenue generated by that engagement.

