Tasks

In this third iteration, the LongEval Lab at CLEF continues to explore temporal dynamics in IR, including the potential and limitations of temporal relevance signals for ranking, the temporal robustness of systems, and novel evaluation methods that factor in time. In doing so, the lab raises awareness of the temporally uncertain validity of conventional IR evaluations. Considering the temporal dimension provides a new perspective on search and ultimately leads to a more holistic view of the retrieval problem.

This year, the lab provides a unique test bed comprising two evolving test collections. They cover the established retrieval scenarios of Web search and scientific retrieval, which have different goals and distinct dynamics. Participants are invited to submit retrieval runs to two tasks that address these dynamics.

Task 1. LongEval-WebRetrieval:

Objectives:

The goal of Task 1 is to build an information retrieval system that can handle changes over time by following the temporal evolution of Web documents. In LongEval, we use evolving Web data to evaluate IR systems longitudinally: systems are expected to remain effective over time. Systems are evaluated on several monthly snapshots of documents and queries (lags), derived from real data acquired from the French Web search engine Qwant.
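To make this longitudinal protocol concrete, here is a minimal sketch of re-evaluating one frozen system on successive lags. The lag names and per-query scores below are invented placeholders, not LongEval data; the lab itself distributes the snapshots as document and query files per lag.

    # Hypothetical illustration of longitudinal evaluation: the same
    # (frozen) system is scored on each monthly snapshot (lag) in turn.
    # The per-query nDCG values below are invented placeholders.
    snapshots = {
        "lag1": [0.52, 0.47, 0.60],  # per-query nDCG on snapshot 1
        "lag2": [0.50, 0.44, 0.58],  # same system, newer documents/queries
        "lag3": [0.46, 0.41, 0.55],
    }

    for lag, per_query in snapshots.items():
        mean_ndcg = sum(per_query) / len(per_query)
        print(f"{lag}: mean nDCG = {mean_ndcg:.3f}")  # effectiveness over time

A system that is persistent in the sense of this task would show a flat trajectory across lags rather than a steady decline.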

The collection aims to answer fundamental questions about the robustness and stability of Web search engines as the underlying data evolves. For Web search evaluation, LongEval focuses on how retrieval effectiveness holds up when documents and queries change from one snapshot to the next.

In comparison to LongEval 2023 and 2024, this iteration enlarges the training and test collections with additional snapshots, allowing fine-grained analysis of changes in the collection from one snapshot to the next.

To assess an information retrieval system, we provide several datasets of changing Web documents and user queries.

In total, we will release 15 datasets, each containing the documents and queries of a specific monthly snapshot.

Task 2. LongEval-SciRetrieval:

The second task of the LongEval 2025 Lab is similar to the first: it examines how the effectiveness of IR systems changes over time as the underlying document collection changes, here with scientific publications as documents. The documents that make up the dataset for this task are acquired from the CORE collection of scholarly documents. To our knowledge, CORE is currently the largest aggregated collection of open-access full-text scholarly documents. CORE provides a range of services built on top of this content, currently used by over 30 million unique users each month. CORE Search provides a web UI for querying the entire database of scholarly documents and registers over one million searches each month.

As in Task 1, we will use click information to derive the relevance assessments. The test collection consists of two main components that contain both the search and the click information.

Since this is the first time this task is organized, the number of dataset lags is smaller than in the first task: we aim to release two training datasets and one or two test datasets.

Evaluation

The submitted systems will be evaluated in two ways:

(1) nDCG scores calculated on the provided test sets. This classical evaluation measure suits Web search, where the rank discount emphasises the ordering of the top results.
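To make the measure concrete, here is a minimal pure-Python sketch of nDCG@k; the official scoring will likely rely on a standard tool such as trec_eval, and the cutoff and gain values here are purely illustrative.

    import math

    def dcg(gains):
        # Discounted cumulative gain: the gain at rank r (0-based) is
        # divided by log2(r + 2), so later ranks count less.
        return sum(g / math.log2(r + 2) for r, g in enumerate(gains))

    def ndcg(ranking, qrels, k=10):
        # nDCG@k for one query: DCG of the submitted ranking divided
        # by the DCG of an ideal, relevance-sorted ordering.
        gains = [qrels.get(doc, 0) for doc in ranking[:k]]
        ideal = dcg(sorted(qrels.values(), reverse=True)[:k])
        return dcg(gains) / ideal if ideal > 0 else 0.0

    # Toy example: graded judgments for one query, one system ranking.
    qrels = {"d1": 2, "d2": 0, "d3": 1}
    print(round(ndcg(["d3", "d1", "d2"], qrels), 3))  # ~0.860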

(2) Relative nDCG Drop (RnD), computed from the difference between a system's nDCG scores on the Lag5 and Lag7 test sets. This measure captures the impact of data changes on a system's results.
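A minimal sketch follows, under the assumption that RnD normalizes the drop between the two lags by the earlier lag's score; the exact formula is defined by the lab guidelines and may differ in detail.

    def relative_ndcg_drop(ndcg_earlier, ndcg_later):
        # Assumed formula: nDCG drop between the two lags, normalized
        # by the earlier score. Consult the official lab guidelines
        # for the authoritative definition.
        return (ndcg_earlier - ndcg_later) / ndcg_earlier

    # A system scoring 0.45 on Lag5 and 0.41 on Lag7 drops by ~8.9%.
    print(round(relative_ndcg_drop(0.45, 0.41), 3))  # 0.089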

Together, these measures assess not only whether systems produce good results, but also whether they are robust to changes in the data (queries and documents) over time. A system that scores well on nDCG and also on RnD is considered able to cope with the evolution of the retrieval collection over time.