LongEval 2026 continues the Sci-Retrieval task from 2025 and introduces three new tasks: LongEval-TopEx (Topic Extraction), LongEval-USim (User Simulation), and LongEval-RAG (Retrieval-Augmented Generation). Short descriptions, data sources and evaluation procedures for each task appear below. To receive announcements and participate in discussions, please subscribe to our Google Group or join the project Slack at longeval.slack.com.
This task aims to answer fundamental questions about the robustness and stability of retrieval systems as the search scenario evolves. The test collections for this year contain scientific publications acquired from the CORE collection of open-access scholarly documents. LongEval focuses on two core questions:
Compared to the previous year, we extended the snapshot period to three months to account for the unique dynamics of scientific search. We provide three snapshots of a growing document corpus and ask participants to submit ad-hoc ranking systems, potentially with a focus on scientific search and temporal changes.
We provide a dynamic test collection consisting of three snapshots, a train and test query set, and two pseudo qrel sets per snapshot. Additionally, after submission, LLM judgments will be created for pools from the submitted runs.
Train: To access the queries, please use the ir-datasets-longeval integration that is available on PyPI (we hope this simplifies the re-use of your code in longitudinal retrieval settings). You can download the dataset directly from the TU Wien Data Repository.
Submissions will be handled via TIRA.io. Please visit the TIRA task page and register. We highly encourage software submissions, as they facilitate repeated executions over time. We provide submission templates for your convenience; however, TREC-style run file submissions are also welcome. For further information, please have a look at the GitHub Submission Guide.
For Task-1, preliminary relevance judgments could help some approaches (e.g., to boost documents that were relevant in the past). We want to create LLM relevance judgments that use only the query. If you can make use of those relevance judgments in your approach, please reach out in the CLEF Discord chat or the LongEval Slack.
Offline evaluations of retrieval systems in the Cranfield paradigm often formalize information needs into TREC-style topics that consist of a query (what searchers submit to the search engine), a description (what searchers actually mean), and a narrative (defining which documents are relevant or not). When queries are extracted from query logs, none of these formalizations exist, which can reduce the reliability of a derived offline evaluation collection (because the subjectivity of relevance judgments increases). Task 2 of LongEval aims to extract TREC-style topics (query + description + narrative) from query logs to formalize the information need, so that subsequent offline retrieval evaluation corpora obtain more reliable relevance judgments.
We have created two baselines (one with an LLM, one that injects the query into a predefined template) that you can use as a starting point in the longeval-code repository.
We use the test queries of Task-1, i.e., approaches should generate a topic for each test query of Task-1. To access the queries, please use the LongEval ir_datasets extension that is available on PyPI (we hope this simplifies the re-use of your code in longitudinal retrieval settings).
We require that topics are generated in .jsonl format (please have a look at the baselines for more details). Every line should have the fields qid, query, description, and narrative, for example:
{"qid": "1", "query": "ransomware detection", "description": "I want to know which algorithms for ransomware detection are effective.", "narrative": "Papers that describe algorithms for ransomware detection are relevant when they also have an evaluation. Evaluation papers that just compare multiple ransomware detection algorithms are also relevant. Other aspects, such as describing which ransomware attacks exist, the history of ransomware detection, etc., are not relevant."}
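A line in this format can be produced and sanity-checked with a short Python helper. The field names come from the example above; the helper itself is an illustrative sketch, not part of the official tooling:

```python
import json

# Required fields of a TREC-style topic line, per the task description.
REQUIRED_FIELDS = ("qid", "query", "description", "narrative")

def topic_to_jsonl_line(topic: dict) -> str:
    """Serialize one topic to a .jsonl line, failing on missing fields.
    A hypothetical helper, not part of the official tooling."""
    missing = [f for f in REQUIRED_FIELDS if f not in topic]
    if missing:
        raise ValueError(f"topic is missing fields: {missing}")
    return json.dumps({f: topic[f] for f in REQUIRED_FIELDS})

line = topic_to_jsonl_line({
    "qid": "1",
    "query": "ransomware detection",
    "description": "I want to know which algorithms for ransomware detection are effective.",
    "narrative": "Papers that describe algorithms for ransomware detection are relevant.",
})
```

Writing one such line per test query yields a valid .jsonl submission file.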
For Task-2, preliminary relevance judgments could help some approaches (e.g., when you want to inject one positive and one negative document into your prompt to generate a topic). We want to create LLM relevance judgments that use only the query. If you can make use of those relevance judgments in your approach, please reach out in the CLEF Discord chat or the LongEval Slack.
Submissions will be handled via TIRA.io. Please visit the TIRA task page and register. We accept both uploads of generated topics and code submissions (code submissions can access OpenAI-compatible LLMs and/or GPUs in TIRA; we have the capacity to help with code submissions, so please reach out in the Discord chat if you need help). Please use the baselines as a starting point for your submission.
We will create relevance judgments for the generated TREC-style topics with large-language-model relevance assessors and will verify how well the created relevance judgments align over time with the future query logs (agreement with click-log qrels), how well they can distinguish retrieval systems submitted to Task-1 (ability to separate runs via nDCG), and their clarity (how consistent the annotations are across different large language models). We will measure the short- and long-term drops in those aspects. We also envision feeding the created topics into Task-3.
A central goal of the LongEval lab is to measure the performance of retrieval models over time. Other models, like user models, were not in focus. Like retrieval models, user models are not meant to be static and time-agnostic, although in practice, time and the evolution of retrieval environments are often ignored. Therefore, in Task 3, we aim to evaluate user models in longitudinal user simulations.
The core assignment in Task 3 is next-query prediction. This task was previously introduced in the Sim4IA micro shared task. For LongEval, we focus on the longitudinal aspects of this user simulation task. The objective is to accurately predict the final query of a given session (which the organizers will withhold), based on the preceding interaction history and the best-fitting user simulator.
For each provided session, participants must submit up to 5 candidate queries, ranked by their likelihood of being the next query.
Participants will receive the sessions containing the following rich set of features:
The data will be provided in the following form:
Index, User Name, Session Number, Query String, Timestamp of Query Submission,
SERP, SearchID, "[('DocId', Time of Click, Type of Click)]"
Session File Example:
1,test_user,44,effects of media literacy on learning experience ,2025-02-08 15:54:49,"[213271, 8868621, 36996622, 77231420, 62528977, 44759354, 267373, 128496322, 256706, 256108]",4dd111537a7a8d21a1ca912fe1d611ad,"[('44759354', datetime.datetime(2025, 2, 8, 15, 55, 11), 'works')]"
2,test_user,44,media literacy skills,2025-02-08 15:55:11,"[156462828, 4762625, 84628214, 153245377, 155446828, 141526785, 43469067, 31614868, 62323511, 126958714]",73071860414ec67d0fc5f01d94884421,"[('62323511', datetime.datetime(2025, 2, 8, 15, 55, 33), 'works')]"
3,test_user,44, ...
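Assuming the session file is standard CSV with the Python-literal lists shown above, a row can be parsed roughly as follows. This is an illustrative sketch; field handling in the released data may differ. Note that the click list embeds `datetime.datetime(...)` constructor calls, which `ast.literal_eval` rejects, so the sketch evaluates that one field in a restricted namespace:

```python
import ast
import csv
import datetime
import io

# One row in the format of the session file above (SERP shortened for brevity).
ROW = ('1,test_user,44,effects of media literacy on learning experience ,'
       '2025-02-08 15:54:49,"[213271, 8868621, 36996622]",'
       '4dd111537a7a8d21a1ca912fe1d611ad,'
       '"[(\'44759354\', datetime.datetime(2025, 2, 8, 15, 55, 11), \'works\')]"')

def parse_session_row(row: str) -> dict:
    idx, user, session, query, ts, serp, search_id, clicks = next(csv.reader(io.StringIO(row)))
    return {
        "index": int(idx),
        "user": user,
        "session": int(session),
        "query": query.strip(),  # raw log queries may carry trailing spaces
        "timestamp": datetime.datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
        "serp": ast.literal_eval(serp),  # plain list of document ids
        "search_id": search_id,
        # The click list contains datetime.datetime(...) calls, which
        # ast.literal_eval rejects; evaluate with a namespace exposing only
        # the datetime module (only do this with trusted files).
        "clicks": eval(clicks, {"__builtins__": {}, "datetime": datetime}),
    }
```

Iterating this over the file lines yields one interaction record per query submission, ready to feed a user simulator.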
For each implemented persona, submit a run file containing:
For each session, predict 5 diverse query candidates that remain semantically similar to the original query. Rank them in descending order of confidence.
Run file example:
{
"meta": {
"team_name": "",
"description": "",
"run_name": ""
},
"1": [ "Q1", "Q2", "Q3", "Q4", "Q5" ],
"2": [ "Q1", "Q2", "Q3", "Q4", "Q5" ],
"...": [ "Q1", "Q2", "Q3", "Q4", "Q5" ],
"45": [ "Q1", "Q2", "Q3", "Q4", "Q5" ]
}
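The run-file structure above can be assembled and sanity-checked with a few lines of Python. The helper below is hypothetical, not an official validation tool; it only mirrors the JSON layout and the up-to-5-candidates rule stated above:

```python
import json

def build_run(team_name: str, description: str, run_name: str,
              predictions: dict) -> str:
    """Assemble a run-file JSON string from per-session candidate lists.

    `predictions` maps a session id (string) to a list of up to 5 query
    candidates, ranked by confidence. An illustrative helper only.
    """
    for sid, queries in predictions.items():
        if not 1 <= len(queries) <= 5:
            raise ValueError(f"session {sid}: expected 1-5 candidates, "
                             f"got {len(queries)}")
    run = {"meta": {"team_name": team_name,
                    "description": description,
                    "run_name": run_name}}
    run.update(predictions)  # session ids become top-level keys, as above
    return json.dumps(run, indent=2)
```

Writing the returned string to a file produces a run in the format of the example above.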
We will evaluate the query candidates based on their semantic similarity to the original, withheld query and on the redundancy among the candidates themselves. We will use the Rank-Diversity Score (RDS) from the Sim4IA-Bench suite.
We will measure differences in RDS across snapshots. At the end, we would like to see which user simulation or user modeling approach was most successful across the available snapshots.
To ensure an easy start, we offer an adapted version of the SimIIR 3.0 Framework, including sample simulators and video tutorials. However, the choice of framework remains entirely up to the participants.
For constructing the user simulator that performs the next-query prediction, we encourage participants not only to effectively leverage all provided signals but also to integrate supplementary or self-engineered features to enhance their simulator's predictive power. Feel free to try out strategies like personas or specific user attributes (e.g., degree, research field, or preferred interaction style). Different initial ideas can be drawn from the results of the Sim4IA Micro Shared Task 2025.
We will not provide topic descriptions for the sessions or direct relevance judgments, although inputs and overlaps from previous tasks are especially welcome. Participants engaged in Task 2 (Topic Extraction) are strongly encouraged to leverage their approach. This would allow them to build a comprehensive topic representation not only from a single query but also from multiple queries within a session, and to use this insight to enhance their query prediction simulator.
Task 4 aims to study to what extent RAG systems cope with the evolution of scientific knowledge over time. To do that, we provide the task participants with a dataset composed of two parts:
i) A textual query (along with its query id) for which the participating system has to provide a textual answer;
ii) A set of document ids, mined from the corpus, on which the answer generation for a particular query must be solely based (this set may be seen as the output of a first-stage filter in a RAG system). This set of documents is not composed solely of documents relevant to the query, but it does contain some relevant documents. The provided documents will be part of the dataset provided within Task 1.
The problem to be solved by the participants of this task is to provide, based on the query and the list of provided documents:
a) The generated answer, using the document extracts in part b) below.
b) The document extracts that are relevant to the query (text extracted from the textual content of the documents provided in ii), using the contents from the LongEval 2026 corpus).
For the submission of participant runs:
The responses for part a) will be submitted as a string corresponding to the generated answer:
(query_id, answer)
The responses for part b) will be submitted as a set of pairs (maximum 5):
(query_id, {document extract}), where a document extract is a character string considered by the participant to be relevant for the query. As several extracts may be defined, a run will contain a set of extracts.
Examples of the exact submission format and 3 training queries (based on the 2025 dataset) will be provided by the end of February.
Overall, each submitted run is a zip file composed of three text files: one for part a) for all queries, one for part b) for all queries, and one file "description.txt" that briefly describes the submitted system. The name of the zip file will be used as the run name during the evaluation process.
A maximum of 3 runs may be submitted by each participating team.
Example of task 4 query (based on the 2025 dataset):
- Query_id: 00005
- Query: "What operation dominates the cost of homomorphic AES evaluation, and how can hardware
architectures reduce its impact?"
- Docs_id: {7964767, 66637364, 42856664, 31316619, 19943691, 19531678, 156069243, 156069094, 150704088, 141753453}
The result for this query:
- Part a: (query_id: 00005, answer: "Large-degree polynomial multiplication dominates the cost of
homomorphic AES evaluation. Hardware architectures reduce its impact by decomposing coefficients
using CRT, performing fast NTT-based multiplications in parallel, and minimizing modular conversion
overhead through optimized arithmetic pipelines.")
- Part b: (query_id: 00005, extracts: {"Polynomial multiplication dominates AES evaluation cost.", "Hardware mitigation via CRT + NTT + optimized modular arithmetic."})
The evaluation of a) will be done using classical text-generation evaluation measures, such as ROUGE or BERTScore. The evaluation of b) will be done by computing the overlap (e.g., via the Levenshtein distance) between the manually assessed relevant extracts and the retrieved parts.
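As an illustration of the kind of overlap measure mentioned for part b), the Levenshtein distance between an assessed extract and a retrieved extract can be computed as follows. This is a sketch of the standard algorithm, not the official evaluation code:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance: the minimum number of
    single-character insertions, deletions, and substitutions (all cost 1)
    needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[len(b)]
```

A lower distance between a retrieved extract and a manually assessed extract indicates a closer match; a normalized variant (dividing by the longer string's length) would make scores comparable across extracts of different lengths.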