2024-04-22 –, A1
Every day we engage with news, and more and more often it is curated by recommendation engines. Building such an algorithm poses some unique challenges that set it apart from movie or product recommendations. Articles have a short lifetime, because nothing is older than yesterday's news. The data is heavily biased by the positioning of articles on the page. And journalistic principles and brand identity should be represented in the article selection. At Axel Springer National Media and Tech, we overcome these challenges by combining our domain knowledge with simple statistics instead of black-box machine learning models. This talk will share some of our learnings that can be applied to recommendation systems and data science projects in general.
What is special about news recommendations?
-
We are used to recommendations from Netflix, Amazon, or TikTok. All of these apps have logged-in users who can be easily tracked. News websites, on the other hand, have a large share of unknown users who can only be tracked via first-party cookies. Therefore, the cold-start problem is much more pronounced in the user dimension. In addition, movies, products, and funny videos have relatively long lifetimes, whereas news articles are often only relevant for a few hours. This means that recommendation systems have much less time to collect information about what is relevant for whom, so there is also a lot of cold start in the item dimension.
-
Users are more critical of the selection of news articles presented to them than of selections of products or movies. News recommendation is not only about finding the most relevant items; it is also about putting items in the right relationship to each other to reflect journalistic considerations and brand values. For example, articles should often be sorted according to the seriousness of the topic or its relevance for society, and similar articles should be placed next to each other.
-
The front page plays an outsized role for news websites. Users come here to get an overview of what is happening in the world. Consequently, the data generated by these websites is heavily dominated by effects that originate in the structure and mechanics of the front page. Articles shown at the top of this page with a large image are much more likely to be clicked than an article at the bottom of the page with just a small headline.
How do news recommendations typically work?
- Recommendation engines are often closely associated with collaborative filtering. However, collaborative filtering systems struggle with cold start, which is especially prevalent for news articles and users of media sites. At the same time, there are many simple ways to rank articles: by age, by popularity, or by how often a user has read articles from the same category before. In our experience, most systems deployed in practice combine these principles with collaborative filtering. Multi-armed bandit approaches are also popular, especially for smaller widgets: the algorithm simply tries different articles and keeps showing those that tend to have the highest click-through rate (CTR).
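As a rough illustration of the bandit idea for a small widget, here is a minimal epsilon-greedy sketch. The class name `EpsilonGreedyWidget`, its methods, and the `epsilon` parameter are hypothetical and chosen for this example; epsilon-greedy is only one of several bandit strategies and is not claimed to be what any particular site uses.

```python
import random
from collections import defaultdict


class EpsilonGreedyWidget:
    """Tiny epsilon-greedy bandit over a fixed set of article ids."""

    def __init__(self, article_ids, epsilon=0.1, seed=None):
        self.article_ids = list(article_ids)
        self.epsilon = epsilon  # share of impressions used for exploration
        self.rng = random.Random(seed)
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def ctr(self, article_id):
        """Observed click-through rate; 0.0 for articles never shown."""
        shown = self.impressions[article_id]
        return self.clicks[article_id] / shown if shown else 0.0

    def select(self):
        """Explore with probability epsilon, otherwise show the best CTR so far."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.article_ids)
        return max(self.article_ids, key=self.ctr)

    def record(self, article_id, clicked):
        """Store the outcome of one impression."""
        self.impressions[article_id] += 1
        self.clicks[article_id] += int(clicked)
```

In use, the widget would call `select()` for each page view and `record()` once it is known whether the shown article was clicked.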
What is special about our approach to news recommendations?
-
One can think of recommendation as a simple click prediction problem. We have one user and many items, and we want to use features of the user and the items to predict how likely the user is to click on each item. The articles can then be ranked and selected based on these probabilities. Therefore, we are not tied to collaborative filtering algorithms but can use any machine learning algorithm of our choice.
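To illustrate the click-prediction view, the sketch below scores candidate articles for one user with a small logistic model and ranks them by predicted click probability. The feature names (`age_hours`, `category_affinity`), the weights, and the bias are all hypothetical values picked for the example; a real system would learn them from data and would likely use more features.

```python
import math

# Hypothetical weights for illustration; a real model would learn them from data.
WEIGHTS = {"age_hours": -0.1, "category_affinity": 2.0}
BIAS = -2.0


def click_probability(features):
    """Logistic model: maps a feature dict to a click probability in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_articles(candidates):
    """candidates maps article id -> feature dict for one user; best first."""
    return sorted(
        candidates,
        key=lambda article_id: click_probability(candidates[article_id]),
        reverse=True,
    )


candidates = {
    "old_sports": {"age_hours": 20.0, "category_affinity": 0.9},
    "fresh_politics": {"age_hours": 1.0, "category_affinity": 0.2},
    "fresh_sports": {"age_hours": 2.0, "category_affinity": 0.9},
}
print(rank_articles(candidates))
```

Any model that outputs a probability could replace `click_probability` here; the ranking step stays the same.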
-
A major feature of our system is identifying articles that are trending. "Most popular" feeds and rankings are widely used, but as an absolute measure they are heavily influenced by position bias. The articles at the top of the page are likely to get the most clicks, so they are put at the top of the page again. This cycle continues until the story becomes so uninteresting that it starts to perform worse than other stories in worse positions.
In contrast, we call relative performance trendingness. If a story performs better than usual for its position, it is trending. The beauty of this approach is that it makes the performance of articles at the top and at the bottom of the page comparable. An article can be 10 percent better or worse than expected in any position on the page. The ugly part is that the numbers at the bottom of the page become very small, so trendingness becomes very unstable. If an article is expected to get 1/100 of a click in a certain time interval and there is one accidental click on it, you suddenly have an incredibly trending article. Unfortunately, most news pages contain many articles that are clicked with very low probability, so such outliers occur quite frequently. The art of constructing a good measure of trendingness lies in finding a good way to regularize it to avoid these effects.
-
Position bias on news media sites is so strong that a classification model that predicts clicks using only the position of an article as a feature will have an AUC of about 0.8. Consequently, a model trained on clicks will mostly just learn patterns that are correlated with the position. For example, if politics articles tend to be placed higher on the page than sports articles, the model will learn that politics articles generally get more clicks than sports articles. We can avoid this by giving the model information about the position, but then the algorithm mostly picks up position-related patterns that cannot be exploited when choosing which article to put in one specific position.
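To make the trendingness idea and its instability at low-traffic positions concrete, here is a minimal sketch that compares observed clicks with the clicks expected from position alone. The additive smoothing and the `prior_strength` parameter are assumptions chosen for illustration; the abstract does not say which regularization is actually used.

```python
def raw_trendingness(clicks, expected_clicks):
    """Observed clicks relative to the clicks expected from position alone."""
    return clicks / expected_clicks


def smoothed_trendingness(clicks, expected_clicks, prior_strength=5.0):
    """Shrink the ratio toward 1.0 ("as expected") when there is little data.

    prior_strength is a hypothetical tuning parameter: it acts like that many
    pseudo-clicks that arrived exactly as expected.
    """
    return (clicks + prior_strength) / (expected_clicks + prior_strength)


# One accidental click where 1/100 of a click was expected.
print(raw_trendingness(1, 0.01), smoothed_trendingness(1, 0.01))

# A top-of-page article that performs 10 percent better than expected.
print(raw_trendingness(220, 200), smoothed_trendingness(220, 200))
```

With these illustrative numbers, the accidental click yields a raw ratio of about 100 but a smoothed value of roughly 1.2, while the high-traffic article stays close to 1.1 under both measures.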
-
When training our recommendation algorithm, we overcome the position bias problem by weighting clicks so that they can be compared on neutral ground. First, we determine the click probability of an article based on its position alone. Then we weight clicks and non-clicks according to how likely they were at that position.
- A click that was expected to happen with a probability of 0.1 gets a weight of 1/0.1 - 1 = 9, and a click with a probability of 0.01 gets a weight of 1/0.01 - 1 = 99. A likely click gets a lower weight than an unlikely one.
- We also derive information from non-clicks. A non-click with a probability of 0.9 gets a weight of -1/0.9 + 1 ≈ -0.11. If an article is presented in a prominent position but is not clicked by the user, this is an expression of disinterest that our algorithm can learn from.
- By turning clicks into weighted clicks, we essentially turn the problem from a classification problem into a regression problem. On average, the weighted clicks are equal across all positions, so that the position bias is eliminated.
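The weighting steps above can be sketched as follows. The function names and the `(position, clicked)` log format are hypothetical; the weights themselves follow the arithmetic given in the examples above, where `p_click` is the click probability of the position alone.

```python
from collections import defaultdict


def position_click_probabilities(impressions):
    """Estimate the click probability per position from (position, clicked) pairs."""
    shown = defaultdict(int)
    clicked_count = defaultdict(int)
    for position, clicked in impressions:
        shown[position] += 1
        clicked_count[position] += int(clicked)
    return {position: clicked_count[position] / shown[position] for position in shown}


def weighted_click(clicked, p_click):
    """Turn a click or non-click into a weighted target value.

    A click becomes 1/p_click - 1; a non-click, whose own probability is
    1 - p_click, becomes -1/(1 - p_click) + 1.
    """
    if not 0.0 < p_click < 1.0:
        raise ValueError("p_click must be strictly between 0 and 1")
    if clicked:
        return 1.0 / p_click - 1.0
    return -1.0 / (1.0 - p_click) + 1.0


print(weighted_click(True, 0.1))    # click at a prominent position
print(weighted_click(True, 0.01))   # click at a low position
print(weighted_click(False, 0.1))   # non-click at a prominent position
```

The resulting values can then serve as regression targets in place of the binary click labels.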
-
One of the features that surprised us the most with its good performance is our "article already seen" feature. For each user and every recommendable article, we keep a counter that measures how often the article was already shown in a prominent position but not clicked by the user. These scores are based on the position-based click probabilities that we also use for the weighted clicks. If an article gets shown in a position with an average CTR of 0.1, the score is 0.1 the next time the article could potentially be recommended to the user. If the article now gets shown in a lower position with a click probability of 0.01, the score increases to 0.11 next time. The model then learns that articles that were shown multiple times in prominent positions before but were not clicked are likely not going to be clicked next time they are shown, either. As a consequence, the page becomes fresher and A/B test results indicate a meaningful uplift compared to a model without this feature.
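A minimal sketch of how such an "article already seen" score could be accumulated is shown below. The class `AlreadySeenTracker` and its in-memory dictionary are hypothetical; how a production system stores and resets these scores is not described in the abstract.

```python
from collections import defaultdict


class AlreadySeenTracker:
    """Accumulates position-based click probabilities of unclicked impressions."""

    def __init__(self):
        # (user_id, article_id) -> sum of position click probabilities
        self._scores = defaultdict(float)

    def record_unclicked_impression(self, user_id, article_id, p_click):
        """Add the position's click probability for an impression without a click."""
        self._scores[(user_id, article_id)] += p_click

    def score(self, user_id, article_id):
        """Feature value to use the next time the article could be recommended."""
        return self._scores.get((user_id, article_id), 0.0)


tracker = AlreadySeenTracker()
tracker.record_unclicked_impression("user-1", "article-42", 0.1)   # prominent position
tracker.record_unclicked_impression("user-1", "article-42", 0.01)  # lower position
print(tracker.score("user-1", "article-42"))
```

The two impressions reproduce the example from the text: the score is 0.1 after the first and about 0.11 after the second.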
What have we learned?
-
Websites usually track what users do, but not what the sites themselves do. Our algorithms rely heavily on the fact that we track who saw what and in which position. This gives us the ability to overcome the position bias and significantly improve our algorithms.
-
We do simple things for complicated reasons. The key advantage of simple statistical models over black-box algorithms is that they are easier to debug. Every time we replace a boosted tree or something similar with a linear model, we realize that it is not acting the way we expected. We can then make the necessary adjustments, for example by adding well-crafted features that leverage our domain expertise. At the end of the process, the linear model becomes better than the black-box model was in the beginning.
Expected audience expertise: Python: None
Abstract as a tweet (X) or toot (Mastodon): Diving into the world of recommendations! Learn how we overcome the special challenges of recommending news at Axel Springer NMT by using simple statistics.
Dr. Christian Leschinski leads the data science team and the Customer Intelligence team at Axel Springer National Media and Tech. His work is dedicated to building data and AI products that improve the user experience and the monetisation of digital media products, and to helping organisations make data-informed decisions. This encompasses use cases ranging from programmatic advertising and subscription pricing to customer analytics and news recommendation.