When static datasets lead to static ideas
Why I left the beaten path for my position bias in features paper
I wrote recently about benchmark datasets for learning-to-rank and how to use them. In this follow-up I want to explain their limitations for my purposes, and the general perils of a literature coalescing around defaults that are too limited and too static.
Learning-to-rank and its benchmarks
Learning-to-rank (LTR) is the problem space of ranked results and recommendations. Think of search engines like Google, or other search and recommendation settings like Amazon or Spotify. The items you see in these products are ranked one way or another, which matters because people tend to select the top or most visible items. That position bias creates complications for how we model the outcomes.
The dominant benchmark dataset in LTR is one from Yahoo, with anonymized query-document pairs and their relevance as judged by human reviewers. Many papers use this dataset, allowing consistent comparisons across methods.
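The dataset is distributed in a plain SVMlight-style text format, one query-document pair per line, which standard tooling can read directly. Here is a minimal loading sketch in Python; the file name is a placeholder for wherever the data lives locally.

```python
# Sketch: loading a Yahoo/MSLR-style LTR file. The path is a placeholder.
from sklearn.datasets import load_svmlight_file

# Each line looks like: "<relevance 0-4> qid:<query id> 1:<val> 2:<val> ..."
X, y, qid = load_svmlight_file("set1.train.txt", query_id=True)

# X: sparse matrix of numeric features, one row per query-document pair
# y: graded relevance labels from human judges
# qid: anonymized query identifiers -- note there is no document identifier
print(X.shape, y[:5], qid[:5])
```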
Limitations of the standard datasets
The Yahoo dataset has some key limitations, and the corresponding dataset from Microsoft, MSLR-WEB30K, shares the same structure and the same limitations.
Challenges with the datasets:
They are old, from 2010 or before
They are from two search engines of a similar type (web search)
They are small, with only about 30k queries each
They contain only precomputed numeric features, with no raw query or document text
The relevance labels, for better or worse, are assigned by human editors and do not directly capture user judgments
There is no information about the rankings actually shown, nor about user behavior
There aren’t any user or document IDs
Yahoo even describes its dataset as having 883k “documents”, which is simply the number of records, as if documents cannot repeat¹
We don’t have the queries themselves, and these query-document sets aren’t necessarily drawn from actual searches. There is therefore no real ordering of documents shown together in a search, and no outcomes from real user behavior. These are just feature vectors for anonymous query-document pairs, combined with a human judgment of how relevant each (anonymous) document would be to its (anonymous) query.
What’s the point?
This still has some uses. We get real features engineered for mature search engines, and relevance judgments from principled reviewers. It is convenient to model with a large selection of real features, so the relationship between features and ground-truth labels is realistic. We also get a partial view of the distribution of query-document relevance (the most relevant end of it).
Other than that, we’re not getting much out of this dataset. We could just as well make our own dataset with basic assumptions about relevance distributions and feature-relevance relationships. User behavior is the part most fundamental to learning-to-rank, and that part ends up simulated anyway.
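To make that last point concrete: a common pattern in the literature is to synthesize clicks on top of the editorial labels using a position-bias model. The sketch below is illustrative rather than any particular paper’s protocol; the decay exponent and the mapping from grade to click probability are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_clicks(relevance, eta=1.0):
    """Synthesize clicks for one ranked list from graded relevance labels.

    Examination probability decays with rank as 1 / (rank + 1) ** eta
    (a common position-bias assumption); attraction maps the 0-4 grade
    to a click probability. Both choices here are illustrative.
    """
    ranks = np.arange(len(relevance))
    examination = 1.0 / (ranks + 1) ** eta
    attraction = (2.0 ** np.asarray(relevance) - 1) / (2.0 ** 4 - 1)
    return rng.random(len(relevance)) < examination * attraction

# A ranked list with editorial grades; simulated clicks concentrate near the top.
print(simulate_clicks([4, 2, 0, 3, 1]))
```

Change eta or the attraction mapping and the “observed” behavior changes with it, which is exactly the sense in which the most fundamental part ends up assumed rather than measured.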
Patterns become permanent
With a standard benchmark in place, paper writing becomes a bit more automatic, and reading the papers can start to feel repetitive. There may even be more papers precisely because the methods are so consistent. Code builds up around the standard methods, such as the PT-ranking and allRank projects for learning-to-rank, which only support the standard data format used by the Yahoo dataset and others like it. New papers are expected to benchmark on the same datasets, even when their methods aren’t particularly applicable or the datasets don’t fit the setting.
I would argue that LTR was more creative in the years before the Yahoo dataset became popular than in the years after. The behavior models were varied, and people ran eye-tracking studies to examine user behavior. Some of those papers feel wildly different from one another in framing and emphasis. Then came the Yahoo benchmark and its competition, alongside growth in the machine learning field more generally. The volume of LTR papers exploded, yet if anything the methods seem less varied.
LTR isn’t the only field that coalesced, for better or worse, around specific datasets. Consider the common datasets and limited metrics that dominate the text summarization literature. Or the dominance of ImageNet for image tasks, although at least ImageNet is larger and supports more tasks than the Yahoo learning-to-rank dataset does.
I pursued an idea instead of a dataset—so I needed a new dataset
These LTR dataset limitations were a problem for my Position bias in features paper. I had observed an interesting dynamic with document features: features that by construction had a temporal element and relied on documents appearing repeatedly across searches. Specifically, I constructed and evaluated a series of historical click-through rate (CTR) features for each document. I had a good private dataset to test on, but I wanted to demonstrate the effect on a public one.
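To give a flavor of what I mean (a simplified sketch, not the exact construction from the paper), a historical CTR feature needs a document identifier and a time-ordered log of impressions and clicks:

```python
from collections import defaultdict

def historical_ctr(search_log):
    """Running click-through rate per document over a time-ordered log of
    (doc_id, clicked) impressions. Each feature value uses only the history
    *before* its impression, so there is no leakage from the future.
    """
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    features = []
    for doc_id, clicked in search_log:
        ctr = clicks[doc_id] / impressions[doc_id] if impressions[doc_id] else 0.0
        features.append(ctr)
        impressions[doc_id] += 1
        clicks[doc_id] += int(clicked)
    return features

log = [("d1", True), ("d2", False), ("d1", False), ("d1", True)]
print(historical_ctr(log))  # [0.0, 0.0, 1.0, 0.5]
```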
Now go back to the Yahoo dataset. There is no document identifier. I can’t construct features about how a document performs across multiple searches, because I don’t know which documents are which; that is treated as an irrelevant consideration. The fundamental problem for me was that these datasets only have the concept of query-document relevance. An early Microsoft dataset did have document identifiers, but (as discussed in my last post) those documents rarely repeated.
The idea of intrinsic document relevance seemed straightforward to me, but it is foreign to the learning-to-rank literature.
So I constructed my own dataset. It is rich: search results of varying length, recurring patterns of documents shown together, document clusters of varying size, and searches occurring over time. The temporal element was necessary for the thesis of the paper, and the other factors created a setting with crucial variation.
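As a rough illustration of those ingredients (not the actual generator behind the paper’s dataset; every distribution below is a placeholder), the skeleton looks something like this:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_search_log(n_queries=1000, n_docs=200, n_clusters=20):
    """Synthetic search log with recurring document IDs, documents that tend
    to appear together (clusters), variable result-list lengths, and a time
    index. Every distribution here is an illustrative placeholder.
    """
    doc_cluster = np.arange(n_docs) % n_clusters       # assign docs to clusters
    log = []
    for t in range(n_queries):
        cluster = rng.integers(n_clusters)             # each search draws from one cluster
        candidates = np.flatnonzero(doc_cluster == cluster)
        k = int(rng.integers(3, 11))                   # varying result-list length
        shown = rng.choice(candidates, size=min(k, len(candidates)), replace=False)
        log.append({"time": t, "docs": shown.tolist()})
    return log

print(generate_search_log(n_queries=3))
```

The actual dataset is richer than this, but even a skeleton like this supports the document-level, time-aware features that the standard benchmarks rule out.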
Paths for impact
One way to succeed in academic work is to work within a dominant paradigm: use the standard datasets and follow a well-established formula for writing papers. This can be repeated indefinitely, each time testing something a bit different from prior experiments, and the results can compound into good findings.
Another approach is to ask: what questions can’t be answered with the standard datasets and methods? What are we missing? What are the biggest unknowns in the field and what data do we need so that we can discover new truths?
The second approach can be fruitful, but it’s hard. You have to find your own data and write your own code. You have to establish original methods. If you really succeed, you might even create a new setting. But often you won’t succeed, and you’ll have trouble explaining why you’ve done something different. Your outcomes have a wider range: more likely to be extremely positive, but also more likely to go nowhere. The variance is higher. The path is harder. But it might be a road worth traveling.
¹ They also rounded up, but that’s a side matter.