Benchmark datasets for learning-to-rank
The use and history of the most important datasets in the field
Benchmark datasets become an essential part of any ML subfield's literature. Consider computer vision, where the ImageNet dataset and its associated competition led to incredible advances in image tasks. ImageNet holds an even bigger place in ML history, as an inflection point that directly led to a boom in ML, with advances in algorithms, deep learning libraries, and (ultimately) even hardware. The datasets used in learning-to-rank are less well known, but within that field they have had their own impact.
Benchmark datasets
Benchmark datasets are data collected and released from real (and typically notable) projects, often with human ratings or classification. They come pre-split into training and test datasets (or with a validation set, or with cross-validation folds).1 Subsequent papers can be fairly compared on the same datasets.2
Beyond being a standard of measurement for new methods, benchmark datasets can actually affect the direction of research directly. But I’ll get into that another time; first, let’s look at datasets.
There are two dominant data sources in the learning-to-rank literature. Both are from search engines, one from Yahoo and one from Microsoft.3 To give you a sense of their popularity, the Yahoo one has been (at time of writing) cited 577 times and two versions of the Microsoft dataset have been cited 500 and 315 times.
Despite the higher combined citation count for Microsoft's datasets, I expect the Yahoo dataset is much more prevalent for benchmarking in the most notable papers in learning-to-rank. For example, it was used in Joachims et al. (2017), a highly influential paper in LTR, while the Microsoft data was not [1]. It seems like every paper on learning-to-rank has to show how its method performs on Yahoo data.
The Yahoo Learning to Rank dataset
Yahoo put together a clean dataset as part of a challenge they organized in 2010. They documented the dataset and the challenge in detail [2]. With 36,251 queries and 882,747 total records, split into domestic and international datasets, at the time it was the largest LTR dataset released. While that might not sound large by modern ML standards, that’s still the largest LTR benchmark dataset in use.4
Let’s take a peek at the data.
This is the start of the first row from the Yahoo dataset:5
0 qid:1 10:0.89028 11:0.75088 12:0.01343 17:0.4484
It extends onwards for a lot longer, but you get the idea. This is in the svm-light/libsvm data format, a basic textual format for sparse datasets.6
That item has a relevance of 0 (“bad”) and a query ID of 1, and then it has features numbered 10, 11, 12, and 17, with values 0.89028, 0.75088, 0.01343, and 0.4484. We don’t know what the actual typed query was for the search, because that could contain sensitive information. Nor do we know the actual link (the “document” in this internet search engine setting) that each row corresponds to. We also don’t know what each of the features actually means, but we do know that they are standardized. The row continues up to feature number 699. Some feature numbers aren’t there (such as 1 through 9 and 13 through 16), because the two Yahoo datasets don’t use all of the same features.
Queries have multiple records (the union of the top 5 documents from each of several internal rankers), each with their own row, but we don’t know if documents repeat with multiple records across queries. Relevance is manually labeled by internal evaluators, on a 0-4 scale corresponding to each possibility from a {bad, fair, good, excellent, perfect} set. We are meant to treat these labels as ground truth in analysis.
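To make the format concrete, here’s a minimal sketch of reading one of these files with scikit-learn, which parses svm-light/libsvm data directly. The file name is an assumption; point it at wherever your local copy of the set 1 training file lives.

```python
# A minimal sketch of loading svm-light/libsvm data with scikit-learn.
# "set1.train.txt" is an assumed file name for the Yahoo set 1 training split.
from sklearn.datasets import load_svmlight_file

X, y, qid = load_svmlight_file("set1.train.txt", query_id=True)
# X: sparse feature matrix (feature numbers absent from a row are simply zeros),
# y: relevance labels in {0, 1, 2, 3, 4},
# qid: the query ID attached to each row.
print(X.shape, y[:3], qid[:3])
```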
How to write a paper using this data
The Yahoo data contains anonymous query-document pairs, with anonymous ML features and a ground truth label. For any query we can derive a correct ordering of the documents, against which we could compare any hypothetical alternative. Often there are multiple correct orderings, given the discrete nature of the labels. If a query (with one query ID) has 5 documents with relevances 1, 4, 3, 2, 3, then the most correct orderings of the documents would be the two that match their relevance order (4, 3, 3, 2, 1).
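As a toy illustration of that example, here is the standard exponential-gain, log-discounted form of DCG (not anything specific to Yahoo’s evaluation): leaving the five documents in their original order scores worse than either of the two ideal orderings, and swapping the two 3s in the ideal ordering changes nothing.

```python
# Toy DCG/NDCG calculation for the 5-document example in the text.
import numpy as np

def dcg(rels):
    rels = np.asarray(rels, dtype=float)
    return np.sum((2 ** rels - 1) / np.log2(np.arange(2, len(rels) + 2)))

original = [1, 4, 3, 2, 3]               # labels in the order given in the text
ideal = sorted(original, reverse=True)   # [4, 3, 3, 2, 1]
print(dcg(original) / dcg(ideal))        # NDCG of leaving the documents as-is
```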
The task is to optimize ranking. Researchers may apply different ML algorithms, such as SVMs, GBDTs, and DNNs. Since user behavior is important in ranking—particularly that users are more likely to select items that are listed higher up in rankings—researchers will also apply different methods to rank effectively in the presence of that ranking bias.
You might notice that ranking bias isn’t inherent to the dataset. There is no position field, nor is actual user behavior shared. There might not be anything to share: these documents may never have been shown to users for these particular queries. Yet position and its behavioral implications are essential, at the very core of learning-to-rank, so researchers assume a user behavior model and generate data accordingly. The simulated rankings should ideally have some correlation with quality, so a classic method is to first build a naive ranker on 1% of the training data and use that to create an initial plausible ranking of all the training data.7
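Here’s a hedged sketch of that trick, with synthetic stand-ins for the real features, labels, and query IDs, and a simple pointwise ridge regression playing the role of the naive ranker; sampling 1% of queries (rather than rows) is one reasonable reading of the approach.

```python
# A sketch of the "naive initial ranker" trick: fit a simple pointwise model
# on ~1% of queries, then use its scores to order every query's documents.
# X, y, qid below are synthetic placeholders for a real training split.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 30))           # placeholder features
y = rng.integers(0, 5, size=5000)         # placeholder 0-4 relevance labels
qid = rng.integers(0, 500, size=5000)     # placeholder query IDs

unique_qids = np.unique(qid)
sample = rng.choice(unique_qids, size=max(1, len(unique_qids) // 100), replace=False)
mask = np.isin(qid, sample)               # rows belonging to the sampled 1% of queries

naive = Ridge().fit(X[mask], y[mask])     # the pointwise "naive ranker"
scores = naive.predict(X)

# Initial ranking: within each query, document indices sorted by descending naive score.
initial_ranking = {q: np.where(qid == q)[0][np.argsort(-scores[qid == q])]
                   for q in unique_qids}
```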
The user behavior model is important. This is where we make assumptions about how users are affected by ranking bias. For example models, see click_models in the allRank Github repo. Their BaseCascadeModel is typical, where users click on any examined items that are relevant, but only examine items with an exponentially decreasing likelihood at lower positions.8
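A minimal sketch of such a model might look like the following; the exponential examination decay and the “relevant means label ≥ 3” rule are illustrative assumptions rather than allRank’s exact constants.

```python
# A minimal position-based click simulation: examination probability decays
# (roughly exponentially) with position, and an examined document is clicked
# if it is relevant enough. Decay rate and relevance threshold are assumptions.
import numpy as np

def simulate_clicks(relevances_in_rank_order, decay=0.7, threshold=3, rng=None):
    rng = rng or np.random.default_rng()
    ranks = np.arange(len(relevances_in_rank_order))
    p_examine = decay ** ranks                                   # 1.0, 0.7, 0.49, ...
    is_relevant = np.asarray(relevances_in_rank_order) >= threshold
    return (rng.random(ranks.shape) < p_examine) & is_relevant   # independent per position

print(simulate_clicks([4, 3, 3, 2, 1]))   # multiple clicks per list are possible
```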
Summarizing the algorithm for writing a LTR paper:
1. Choose your ML algorithm.
2. Define how users will behave: in what ways they are affected by document position.
3. Generate a hypothetical initial ranking.
4. Generate clicks probabilistically, following the user behavior model.
5. Fit your LTR algorithm to predict clicks on the training dataset, using the provided features.
6. Evaluate performance on the test dataset, using NDCG or another ranking metric.
7. If you have a real-world search engine at your disposal, test your method against your own data and report the results. A/B tests on notable consumer products are the most convincing, but if you don’t want to take the time to run one of those (or are generally not trusted to do so), you can use the same trick of assuming a user behavior model on your own documents and simulating clicks.
8. Write up the results and submit the paper.
That omits a lot of detail; a lot of work is hidden behind some of those individual steps.9 Nevertheless, that formula describes many papers in the LTR literature that test different ML algorithms using the same benchmark datasets and assumptions. This standard formula means researchers don’t have to reinvent typical assumptions, and can focus on their topic area, which is typically the ML algorithm.
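To make steps 3 through 6 concrete, here is a compressed toy version of the recipe: synthetic features and labels stand in for a real benchmark split, LightGBM’s LGBMRanker is one possible choice of LTR algorithm, and scikit-learn’s ndcg_score is the ranking metric. A real paper would, of course, evaluate on the held-out test split rather than the training queries.

```python
# A toy end-to-end run of the recipe: simulate clicks, fit a ranker on them,
# and score the ranker with NDCG against the hidden relevance labels.
import numpy as np
from lightgbm import LGBMRanker
from sklearn.metrics import ndcg_score

rng = np.random.default_rng(0)
n_queries, docs_per_query, n_features = 200, 10, 20

X = rng.normal(size=(n_queries * docs_per_query, n_features))       # placeholder features
true_rel = rng.integers(0, 5, size=n_queries * docs_per_query)      # hidden 0-4 labels

# Step 4: simulated clicks, more likely for relevant documents near the top
# (the "hypothetical initial ranking" here is simply the given row order).
position = np.tile(np.arange(docs_per_query), n_queries)
p_click = (true_rel / 4) * 0.5 ** position
clicks = (rng.random(p_click.shape) < p_click).astype(int)

# Steps 5-6: fit on the simulated clicks, then evaluate against the hidden labels.
ranker = LGBMRanker(objective="lambdarank", n_estimators=50)
ranker.fit(X, clicks, group=[docs_per_query] * n_queries)

scores = ranker.predict(X).reshape(n_queries, docs_per_query)
print(ndcg_score(true_rel.reshape(n_queries, docs_per_query), scores, k=5))
```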
Microsoft datasets
The Microsoft sets each contain multiple different datasets, so their citation count does not make clear how much each one (and each version of each one, for that matter) has been used. Another dataset from Microsoft is more heavily used than either of the two that have explanatory papers, but it lacks its own reference paper to cite and Microsoft directs users to cite the latter of those aforementioned papers. Let me clear up this mess a little bit.
Some of the Microsoft LTR benchmark datasets predate the Yahoo one.
In 2008, Microsoft released and documented LETOR 3.0, which was allegedly a continuation of earlier LETOR 1.0 and LETOR 2.0 datasets [3]. I say “allegedly”, because I mostly know of LETOR 1.0 and 2.0 from references in the LETOR 3.0 documentation, and the (fairly hidden) link to where they can be found is no longer active and now leads to the general Microsoft Research page. But then again, the download link for LETOR 3.0 is currently broken too. I think the data might be all mixed together in one large public OneDrive.10 Okay, enough about LETORs 1 through 3. Microsoft forgot about them, and so can we.
LETOR 4.0, released in 2009, contains eight connected datasets which are entirely unrelated to LETORs 1.0 through 3.0 [4].11 These datasets are built from the 2007 and 2008 query sets (MQ2007 and MQ2008), from Bing and its predecessors. Unlike the Yahoo dataset, LETOR 4.0 has document identifiers. Both query sets are adapted for supervised, semi-supervised, aggregation, and listwise tasks, meaning there are 2 * 4 = 8 datasets in total.
Here’s the complete first line from the supervised MQ2007 dataset from LETOR 4.0:
0 qid:10 1:0.000000 2:0.000000 3:0.000000 4:0.000000 5:0.000000 6:7.240045 7:23.625574 8:22.686609 9:26.509571 10:7.195383 11:0.000000 12:0.000000 13:0.000000 14:0.000000 15:0.000000 16:61.000000 17:0.000000 18:4.000000 19:5.000000 20:70.000000 21:0.000000 22:NULL 23:NULL 24:NULL 25:0.000000 26:NULL 27:NULL 28:NULL 29:0.000000 30:NULL 31:NULL 32:NULL 33:0.000000 34:NULL 35:NULL 36:NULL 37:0.000000 38:NULL 39:NULL 40:NULL 41:0.150000 42:0.000000 43:3.000000 44:1.000000 45:17.000000 46:0.000000 #docid = GX000-00-0000000 inc = 1 prob = 0.0246906
Notably, Microsoft shared feature definitions.
While that dataset—like the Yahoo dataset—contains features in the svm-light/libsvm format, the aggregation dataset is a bit different: it contains rankings for each document in a series of numbered queries.
1 qid:10 1:NULL 2:NULL 3:NULL 4:1 5:8 6:14 7:14 8:14 9:22 10:NULL 11:NULL 12:18 13:252 14:227 15:6 16:214 17:NULL 18:NULL 19:120 20:NULL 21:NULL #docid = GX000-62-7863450 inc = 1 prob = 0.56895
These datasets might seem to have some advantages over the Yahoo dataset, since they contain document identifiers and feature definitions. However, they unfortunately have far fewer records, which makes them less realistic for testing modern ML algorithms with many parameters. They also have fewer features. Furthermore, the document identifiers are not very useful because the documents are rarely shared across queries. For MQ2007 the most common document is rated for 851 queries, but after that the next most common is only in 24, then 14, 13, 10, 8, etc., with the vast majority of documents only included for a single query.
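Counts like those can be computed by parsing the “#docid = …” comments; here’s a hedged sketch, where the file path is an assumption about where the extracted MQ2007 fold lives.

```python
# A sketch of counting, for each document, how many distinct queries it
# appears under in MQ2007, using the "#docid = ..." comments on each line.
from collections import Counter

query_docs = set()
with open("MQ2007/Fold1/train.txt") as f:        # assumed path to the extracted data
    for line in f:
        if not line.strip():
            continue
        features, _, comment = line.partition("#")
        qid = next(tok for tok in features.split() if tok.startswith("qid:"))
        docid = comment.split()[2]               # "docid = GX000-..." -> third token
        query_docs.add((qid, docid))             # count each query-document pair once

doc_counts = Counter(docid for _, docid in query_docs)
print(doc_counts.most_common(5))
```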
Microsoft also released MSLR-web30k and MSLR-web10k datasets, in 2010 [5]. Although these seem much more commonly used now than the earlier Microsoft datasets, there isn’t much to say about them beyond how similar they are to the Yahoo dataset, to which they are comparable in size and structure. They come with feature definitions, but do not have document identifiers. They’re in the typical svm-light format. MSLR-web30k is used in many LTR papers, but far less commonly than the substitutable and more well-established Yahoo dataset.
Newer alternatives
Since then, Italian and Chinese search engines (Istella and Sogou, respectively) also released LTR datasets in the typical format (here and here, respectively). These also contain human judgements and features, and have a similar number of queries to the Yahoo and MSLR datasets. I haven’t figured out why all of these datasets are the same order of magnitude in size; maybe they’re all just following Yahoo’s precedent. For both of these newer datasets, the papers introducing them are well-cited, at least sometimes for the dataset and sometimes for their methods [6, 7]. Either way, kudos to both sets of researchers for releasing and maintaining benchmark datasets corresponding to the internal data they used for their papers. The Sogou data (known as Tiangong-ULTR) has actual click outcomes on their training data and manually-labeled relevance ratings on their test data.
Then there’s Baidu. In 2022, they released the dream dataset. Orders of magnitude larger than prior datasets. From real searches. Placement heights and other presentation data. Dwell time. Actual positions. Actual clicks. While they lack constructed features, they have the query, document title, and document abstract represented with numeric tokens. This dataset has everything you could want.
Except, there’s none of that. The various download links redirect or lead to 404 errors. So for now I’ll hold off on crediting Baidu for revolutionizing LTR research.12 Cool paper, though. It’s a good read, challenging the practices in the LTR literature and demonstrating the effectiveness of leading ML approaches on their real-world data [8].
Similarly, other datasets have come and gone, cited in a small number of papers from before the providers stopped hosting their datasets.
Finale, for now
I hope this is useful to anyone who wants to delve into this literature or construct their own benchmark datasets.
If you enjoy reading about learning-to-rank, you might like some of my earlier posts:
[1] Joachims, T., Swaminathan, A., & Schnabel, T. (2017). Unbiased Learning-to-Rank with Biased Feedback. Proceedings of the Tenth ACM International Conference on Web Search and Data Mining.
[2] Chapelle, O., & Chang, Y. (2010). Yahoo! Learning to Rank Challenge Overview. Yahoo! Learning to Rank Challenge.
[3] Liu, T., Xu, J., Qin, T., Xiong, W., & Li, H. (2007). LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval.
[4] Qin, T., & Liu, T. (2013). Introducing LETOR 4.0 Datasets. ArXiv, abs/1306.2597.
[5] Microsoft (2010). Microsoft Learning to Rank Datasets. https://www.microsoft.com/en-us/research/project/mslr/.
[6] Dato, D., Lucchese, C., Nardini, F.M., Orlando, S., Perego, R., Tonellotto, N., & Venturini, R. (2016). Fast Ranking with Additive Ensembles of Oblivious and Non-Oblivious Regression Trees. ACM Transactions on Information Systems (TOIS), 35, 1 - 31.
[7] Ai, Q., Bi, K., Luo, C., Guo, J., & Croft, W.B. (2018). Unbiased Learning to Rank with Unbiased Propensity Estimation. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval.
[8] Zou, L., Mao, H., Chu, X., Tang, J., Ye, W., Wang, S., & Yin, D. (2022). A Large Scale Search Dataset for Unbiased Learning to Rank. ArXiv, abs/2207.03051.
If you think pre-splitting and known evaluation datasets might lead to tuning on the test dataset, well, I won’t do anything to disavow that belief.
Did I mention the possibility of tuning on the test data? Newer papers have a clear number they need to beat, getting published might be contingent on beating that number, and getting published is quite an incentive for researchers…
I will omit the exclamation point from Yahoo’s name. Others do this too, but even if they didn’t, I’ll make my own damn choices on when and where to use punctuation in my own damn sentences.
There are some caveats here. It depends how we measure size (Number of queries? Total number of records? Number of features?) and how we define “in use”. There is a larger dataset that might not be “in use”, which I’ll get to later. Among the leading active datasets, it might be simpler to imagine them as tied when it comes to size.
Specifically, the training file for set 1 (of the two sets).
This svm-light package was written by the prolific and aforementioned Thorsten Joachims. Remarkably, the URL from this 2010 paper to the svm-light page on his personal website still works.
Some papers are explicit about this. Others are not explicit, and I suspect some get by with simply randomizing the initial order.
Note that this is misnamed: this model is traditionally known by several names (including position-based propensity model and position-based model), with the acronym PBM. In the cascade model, users stop examining after the first click, thus never clicking more than once. The PBM can have multiple clicks on one list because each examination probability is independent.
Let’s not forget the most important step here: giving the algorithm a good name with a corresponding memorable acronym.
I am moderately amused that Microsoft’s data sharing strategy is just “dump all the data in whatever compression format you want into one big folder and we’ll put it on the internet”.
Is it a nice touch that my citation 3 is for LETOR 3.0 and citation 4 is for LETOR 4.0? I think so, if I say so myself.
They should attend the Thorsten Joachims school of link permanence.