In this edition of the LongEval Lab, we look at the temporal persistence of systems’ performance. To include temporal persistence as an additional quality of the proposed models, participants are asked to propose temporal IR systems (Task 1) and longitudinal text classifiers (Task 2) that generalize well beyond a training set generated within a limited time frame.
We consider two types of temporal persistence tasks: temporal information retrieval and longitudinal text classification. For each task, we look at short-term and long-term performance persistence. We aim to answer a high-level question:
Given a longitudinal evolving benchmark for a typical NLP task, what types of models offer better temporal persistence over a short term and a long term?
The goal of Task 1 is to propose an information retrieval system that can handle changes over time. The proposed retrieval system should follow the temporal evolution of Web documents. The LongEval Websearch collection relies on a large set of data (corpus of pages, queries, user interactions) provided by a commercial search engine (Qwant). It is designed to reflect the changes of the Web across time by providing evolving document and query sets. The queries in the collection were collected from Qwant's users over several months and can thus be expected to reflect changes in the users' search preferences. The documents in the collection were then selected so that retrieval on these queries can be well evaluated at the time they were collected, and thus also change over time.
The collection aims to answer fundamental questions about the robustness and stability of Web search engines against the evolution of the data. Regarding Web search evaluation, LongEval focuses on the following questions:
To assess an information retrieval system, we provide several datasets consisting of three snapshots of the changing Web documents and users’ queries:
Sub-task A, short-term persistence. In this task, participants will be asked to examine retrieval effectiveness when the test documents are dated right after the documents available in the training collection.
Sub-task B, long-term persistence. In this task, participants will be asked to examine retrieval effectiveness on documents published three months after the documents in the training collection.
The goal of Task 2 is to propose a temporally persistent classifier: given a test set from the same time frame as training and evaluation sets from short- and long-term future periods, the task is to design a classifier that mitigates the short- and long-term performance drops relative to the within-time test set.
The organizers will provide a training set collected over a time frame up to a time t, and two test sets: one from time t and one from time t+i, where i=1 for sub-task A and i>1 for sub-task B.
Sub-task A, short-term persistence. In this task, participants will be asked to develop models that demonstrate performance persistence over short periods of time (within one year of the training data).
Sub-task B, long-term persistence. In this task, participants will be asked to develop models that demonstrate performance persistence over a longer period of time (more than one year apart from the training data).
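The temporal data provision described above can be sketched as a simple split over time periods. This is an illustrative sketch only: the integer period indices, the `(period, item)` pair format, and the split names are assumptions, not the lab's actual distribution format.

```python
def temporal_split(examples, t, horizon_short=1, horizon_long=2):
    """Partition (period, item) pairs into train / within-time / short- / long-term splits.

    Training data covers periods strictly before t; the within-time test set is
    drawn from period t itself; the short-term test set comes from period
    t + horizon_short and the long-term one from t + horizon_long onwards.
    Periods are integer indices here -- a simplifying assumption.
    """
    splits = {"train": [], "within": [], "short": [], "long": []}
    for period, item in examples:
        if period < t:
            splits["train"].append(item)
        elif period == t:
            splits["within"].append(item)
        elif period == t + horizon_short:
            splits["short"].append(item)
        elif period >= t + horizon_long:
            splits["long"].append(item)
    return splits
```

The same helper covers both sub-tasks: sub-task A evaluates on the `short` split (i=1) and sub-task B on the `long` split (i>1).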
The submitted systems will be evaluated in two ways:
(1) nDCG scores calculated on the test set provided for each sub-task. This classical evaluation measure is consistent with Web search, where the discount emphasises the ordering of the top results.
(2) Relative nDCG Drop (RnD), measured by computing the difference between nDCG on the within-time held-out test data and nDCG on the short- or long-term test sets. This measure relies on the within-time data and supports evaluating the impact of data changes on the systems’ results.
These measures will be used to assess not only the extent to which systems provide good results, but also the extent to which they are robust against changes in the data (queries and documents) over time. Under these measures, a system that performs well on nDCG and also on RnD is considered able to cope with the evolution of the Information Retrieval collection over time.
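As a rough sketch, the two retrieval measures could be computed as follows. The graded relevance gains and the reading of RnD as a plain difference of nDCG scores are assumptions based on the description above, not the lab's official evaluation code.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the standard log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(ranked_gains):
    """nDCG: DCG of the system ranking divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

def relative_ndcg_drop(ndcg_within, ndcg_distant):
    """RnD sketch: within-time nDCG minus nDCG on a later (short/long-term) snapshot."""
    return ndcg_within - ndcg_distant
```

A positive RnD indicates that the system's effectiveness degraded on the later snapshot; a value near zero indicates temporal persistence.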
The performance of the submissions will be evaluated in two ways:
(1) Macro-averaged F1-score on the testing set of the corresponding sub-task;
(2) Relative Performance Drop (RPD), measured by computing the difference between performance on the "within-period" data and performance on the short- or long-term distant test sets.
The submissions will be ranked based on the first metric, macro-averaged F1.
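The classification measures above can be sketched in a few lines. Note that reading RPD as a plain difference of scores follows the description above; normalising the drop by the within-period score would be another plausible reading, so this is an assumption.

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_scores) / len(f1_scores)

def relative_performance_drop(score_within, score_distant):
    """RPD sketch: within-period score minus the score on a distant test set."""
    return score_within - score_distant
```

Macro-averaging weighs all classes equally, so the ranking metric is not dominated by whichever class happens to be most frequent in a given time period.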