Tasks

In this edition of the LongEval Lab, we look at the temporal persistence of systems' performance. To include temporal persistence as an additional quality of the proposed models, participants are asked to propose temporal IR systems (Task 1) and longitudinal text classifiers (Task 2) that generalize well beyond a training set generated within a limited time frame.

We consider two types of temporal persistence tasks: temporal information retrieval and longitudinal text classification. For each task, we look at short-term and long-term performance persistence. We aim to answer a high-level question:

Given a longitudinal evolving benchmark for a typical NLP task, what types of models offer better temporal persistence over a short term and a long term?

Task 1. LongEval-Retrieval:

Objectives:

The goal of Task 1 is to propose an information retrieval system that can handle changes over time. The proposed retrieval system should follow the temporal evolution of Web documents. The LongEval Websearch collection relies on a large set of data (a corpus of pages, queries, and user interactions) provided by a commercial search engine (Qwant). It is designed to reflect the changes of the Web across time by providing evolving document and query sets. The queries in the collection were collected from Qwant's users over several months and can thus be expected to reflect changes in users' search preferences. The documents in the collection were then selected so that retrieval on these queries can be evaluated well at the time they were collected, and thus they also change over time.

The collection aims to answer fundamental questions about the robustness and stability of Web search engines as the underlying data evolves; this is the focus of the LongEval Websearch evaluation.

To assess an information retrieval system, we provide several datasets: three snapshots of the evolving Web documents and users' queries.

Task 2. LongEval-Classification:

The goal of the LongEval-Classification task (Task 2 at CLEF 2024) is to propose a temporally persistent classifier that mitigates the performance drop over short and long periods of time, compared to a test set from the same time frame as the training data.

The organizers will provide a training set collected over a time frame up to a time t, and two test sets: one from time t and one from time t+i, where i=1 for sub-task A and i>1 for sub-task B.

Sub-task A: short-term persistence. In this sub-task, participants will develop models that demonstrate performance persistence over short periods of time, i.e., on a test set collected 2-3 years after the training data.

Sub-task B: long-term persistence. In this sub-task, participants will develop models that demonstrate performance persistence over longer periods of time, i.e., on a test set collected 4-5 years after the training data and also temporally distant from the short-term test set.

Participants are expected to design an experimental architecture that enhances a text classifier's temporal performance persistence. Models must be evaluated in this setting without adjusting them to the timing of the target years; the focus is instead on hyperparameter tuning and objective-function optimization techniques, which allows the models to be ranked comparably. A minimal sketch of this protocol follows.
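The sketch below illustrates the train-once, test-across-time protocol under stated assumptions: a generic scikit-learn pipeline stands in for the participant's model, and `load_split` is a hypothetical placeholder with toy data, since the lab's actual data-loading interface is not described here.

```python
# Minimal sketch of the Task 2 protocol: train once on data up to time t,
# then apply the unchanged model to each test slice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def load_split(name: str):
    # Hypothetical placeholder with toy data; the real splits come with
    # the lab's data release.
    toy = {
        "train":      (["good update", "bad crash", "great fix", "awful bug"],
                       [1, 0, 1, 0]),
        "within":     (["nice patch", "terrible fault"], [1, 0]),        # time t
        "short_term": (["solid release", "broken build"], [1, 0]),       # t+1
        "long_term":  (["helpful change", "nasty regression"], [1, 0]),  # t+i, i>1
    }
    return toy[name]

texts, labels = load_split("train")
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)  # tuned once; never adjusted to a target year

# The same frozen model is applied to every test slice; the measures used
# to score these predictions are described in the Evaluation section below.
predictions = {name: model.predict(load_split(name)[0])
               for name in ("within", "short_term", "long_term")}
print(predictions)
```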

Evaluation

Task 1. LongEval-Retrieval:

The submitted systems will be evaluated in two ways:

(1) nDCG scores calculated on the provided test sets. This classical evaluation measure is consistent with Web search, where the discount emphasises the ordering of the top results.

(2) Relative nDCG Drop (RnD), measured as the difference in nDCG between the Lag5 and Lag7 test sets. This measure captures the impact of the data changes on the systems' results.

These measures assess not only the extent to which systems return good results, but also how robust they are to changes in the data (queries and documents) over time. A system that scores well on nDCG and also on RnD is considered able to cope with the temporal evolution of the collection. A minimal sketch of both computations follows.
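The sketch below, in Python, shows one way these two measures could be computed, assuming per-query graded relevance judgments and system scores are available as aligned arrays; the array layout and the toy Lag5/Lag7 values are illustrative, not the official evaluation pipeline.

```python
import numpy as np
from sklearn.metrics import ndcg_score

def mean_ndcg(qrels: np.ndarray, scores: np.ndarray, k: int = 10) -> float:
    """Mean nDCG@k over queries.

    qrels:  (n_queries, n_docs) graded relevance judgments
    scores: (n_queries, n_docs) system scores for the same documents
    """
    return ndcg_score(qrels, scores, k=k)

# Illustrative data: 2 queries, 4 candidate documents each, per snapshot.
qrels_lag5 = np.array([[3, 2, 0, 1], [0, 1, 2, 0]])
runs_lag5  = np.array([[0.9, 0.7, 0.1, 0.3], [0.2, 0.5, 0.8, 0.1]])
qrels_lag7 = np.array([[2, 3, 1, 0], [1, 0, 2, 1]])
runs_lag7  = np.array([[0.8, 0.4, 0.3, 0.2], [0.6, 0.1, 0.7, 0.2]])

ndcg_lag5 = mean_ndcg(qrels_lag5, runs_lag5)   # measure (1), per test set
ndcg_lag7 = mean_ndcg(qrels_lag7, runs_lag7)

# Measure (2): Relative nDCG Drop, taken here as the plain difference
# between the two snapshots, as described above. A smaller drop indicates
# a system that is more robust to the evolution of documents and queries.
rnd = ndcg_lag5 - ndcg_lag7
print(f"nDCG(Lag5)={ndcg_lag5:.3f}  nDCG(Lag7)={ndcg_lag7:.3f}  RnD={rnd:.3f}")
```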


Task 2. LongEval-Classification:

The performance of the submissions will be evaluated in two ways:

(1) Macro-averaged F1-score on the test set of the corresponding sub-task;

(2) Relative Performance Drop (RPD), measured by computing the difference between performance on "within-period" data and on the short- or long-term distant test sets.

Submissions will be ranked by the first metric, macro-averaged F1. A short sketch of both measures follows.
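The sketch below computes both measures, assuming gold labels and predictions are available as plain lists; the toy values and the plain-difference form of RPD follow the description above and are illustrative only.

```python
from sklearn.metrics import f1_score

# Illustrative gold labels and predictions for the within-period test set
# and one temporally distant test set (binary here; macro-F1 generalises
# to any label set).
gold_within,  pred_within  = [1, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 1]
gold_distant, pred_distant = [1, 0, 0, 1, 1, 0], [1, 1, 0, 0, 1, 0]

f1_within  = f1_score(gold_within,  pred_within,  average="macro")  # measure (1)
f1_distant = f1_score(gold_distant, pred_distant, average="macro")

# Measure (2): Relative Performance Drop, taken here as the plain
# difference described above; some formulations normalise by the
# within-period score instead. A smaller drop means better persistence.
rpd = f1_within - f1_distant
print(f"F1(within)={f1_within:.3f}  F1(distant)={f1_distant:.3f}  RPD={rpd:.3f}")
```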