LongEval CLEF 2024 Lab
Longitudinal Evaluation of Model Performance
Data
Task 1. LongEval-Retrieval
The data for this task is a sequence of web document collections and queries provided by Qwant.
Description of the Data
Queries:
The queries are extracted from Qwant's search logs, based on a set of selected topics. The query set was created in French and automatically translated into English. To account for translation quality, four translations are provided for each query, sorted by their estimated translation probability.
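The exact query file layout is not specified here; as a minimal sketch, assuming a tab-separated file with a query id followed by the four English translations already sorted by estimated translation probability, selecting the top-ranked translation could look like this (file name and column order are assumptions):

```python
import csv

# Assumed format: one query per row, tab-separated as
#   qid \t translation_1 \t translation_2 \t translation_3 \t translation_4
# with translations sorted by estimated translation probability.
def load_top_translations(path):
    top = {}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            qid, translations = row[0], row[1:]
            top[qid] = translations[0]  # keep the most probable translation
    return top

# queries_en = load_top_translations("queries_en.tsv")  # hypothetical file name
```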
Documents:
The document collection contains the documents that may be retrieved for each query. The collection is built by first extracting from the index the content of all documents that were displayed in SERPs for the selected queries. In addition, potentially non-relevant documents are randomly sampled from the Qwant index to better reflect the nature of a Web test collection; the random sampling alleviates bias and reduces the prevalence of relevant documents. Filters were applied to exclude spam and adult content.
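The actual sampling procedure is internal to Qwant; the following is only a rough sketch of the idea, mixing SERP-displayed documents with randomly sampled index documents and dropping filtered content (all names and callables are illustrative):

```python
import random

def build_collection(displayed_docs, index_doc_ids, fetch_doc, is_excluded,
                     n_random, seed=13):
    """Illustrative sketch: combine SERP-displayed documents with a random
    sample from the index, skipping documents flagged as spam/adult content."""
    rng = random.Random(seed)
    collection = {d["docid"]: d for d in displayed_docs if not is_excluded(d)}
    for docid in rng.sample(index_doc_ids, n_random):
        if docid in collection:
            continue
        doc = fetch_doc(docid)
        if not is_excluded(doc):
            collection[docid] = doc
    return list(collection.values())
```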
Relevance estimates:
The relevance estimates for LongEval-Retrieval are obtained by automatically collecting implicit user feedback. This implicit feedback is derived from a click model based on Dynamic Bayesian Networks trained on Qwant data. The output of the click model is an attractiveness probability, which is mapped to a 3-level scale (0 = not relevant, 1 = relevant, 2 = highly relevant). This set of relevance estimates will be complemented with explicit relevance assessments after the submission deadline.
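The thresholds used to discretise the attractiveness probability are not given here; a minimal sketch with assumed cut-off values could look like this:

```python
def attractiveness_to_grade(p, low=0.3, high=0.7):
    """Map a click-model attractiveness probability to the 3-level scale.
    The cut-offs 0.3 and 0.7 are assumptions for illustration only."""
    if p >= high:
        return 2  # highly relevant
    if p >= low:
        return 1  # relevant
    return 0      # not relevant
```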
An overview of the data creation process is shown in the figure below. [Figure: data creation process overview]
Collections
Participants in LongEval 2024 can also use the LongEval 2023 collections for training.
- 2023 Training set: composed of documents collected in June 2022, queries, and qrels (on the Lindat/Clarin website).
- 2023 Test set: composed of documents collected in July and September 2022, queries, and, separately, the qrels (both datasets on the Lindat/Clarin website).
- 2024 Training set: composed of documents collected in January 2023, queries, and qrels (on the TU Wien Research Data Repository).
- 2024 Test set: composed of documents collected in June and August 2023, and queries (on the TU Wien Research Data Repository).
- 2024 DocID-to-URL mapping: lists mapping each document id in the 2024 Training and Test sets (January 2023, June 2023, and August 2023 collections) to the URL of the document source; see the sketch below.
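The exact file layout of the DocID-to-URL mapping is not described here; assuming a simple two-column tab-separated file (document id, URL), it could be read as follows (the file name is hypothetical):

```python
import csv

def load_docid_to_url(path):
    """Sketch: read an assumed two-column TSV mapping document ids to source URLs."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for docid, url in csv.reader(f, delimiter="\t"):
            mapping[docid] = url
    return mapping

# urls = load_docid_to_url("docid_to_url_2023-01.tsv")  # hypothetical file name
```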
If you experience any problems logging in to the Lindat/Clarin website, please first check the instructions and then contact the organizers.
References:
More details about the collection can be found in the following paper:
P. Galuscakova, R. Deveaud, G. Gonzalez-Saez, P. Mulhem, L. Goeuriot, F. Piroi, M. Popel: LongEval-Retrieval: French-English Dynamic Test Collection for Continuous Web Search Evaluation.
Task 2. LongEval-Classification
In this edition of the task, we use and expand upon the Climate Change Twitter dataset, known as CC-SD, focusing on climate change stance, posting time, and tweet content.
The CC-SD dataset spans a 13-year period and comprises over 15 million tweets. Using a BERT model, tweets are annotated with three stance labels: believer, denier, and neutral. The annotated tweets are distributed across the timeline, with 11,292,424 believer, 1,191,386 denier, and 3,305,601 neutral tweets. Annotation is performed via transfer learning, with BERT acting as distant supervision trained on another climate change sentiment dataset, and is complemented by manual annotation to enhance precision.
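The original annotation pipeline is not released with this description; as a rough sketch, weak stance labels could be produced with a Hugging Face text-classification pipeline wrapping a BERT model fine-tuned elsewhere (the checkpoint name and label mapping below are placeholders, not the ones actually used for CC-SD):

```python
from transformers import pipeline

# Placeholder checkpoint: substitute a BERT model fine-tuned for climate-change
# stance; the model used to annotate CC-SD is not specified here.
stance_classifier = pipeline("text-classification", model="path/to/stance-bert")

# Assumed label mapping for illustration only.
LABELS = {"LABEL_0": "denier", "LABEL_1": "neutral", "LABEL_2": "believer"}

def annotate(tweets):
    """Sketch of distant supervision: apply the classifier and keep its label."""
    return [LABELS.get(pred["label"], pred["label"])
            for pred in stance_classifier(tweets)]
```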
Task 2. LongEval-Classification Datasets
The training set, spanning 2009 to 2011 and sampled from CC-SD, provides the data for model training. It is not human annotated.
The Task 2 (LongEval-Classification) datasets will be released in two phases.
Practice [Pre-Evaluation] Phase
In the Practice phase, participants work on pre-evaluation tasks with datasets from 2010 and 2014, sampled from CC-SD, allowing them to practice on a recent time frame and over a short time gap. These datasets are manually verified. In addition, human-annotated within-time and short-term practice sets, also sampled from CC-SD, are provided to refine model development before the formal evaluation.
Practice Datasets
Practice sets: within-practice and short-practice
Evaluation Phase
The Evaluation phase assesses models using datasets from 2011, 2015, and the longer period of 2018-2019, all sampled from CC-SD. These datasets are manually verified and cover within-time, short-term, and long-term predictions, offering a holistic evaluation of model performance across temporal contexts. By covering different years, the evaluation ensures thorough testing of the models' temporal persistence and performance.
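The official metrics are announced by the organizers; purely as an illustration, per-split macro-F1 and the relative drop with respect to the within-time split could be computed as follows (scikit-learn is assumed available, and the split names are illustrative):

```python
from sklearn.metrics import f1_score

def evaluate_splits(predictions, gold):
    """predictions/gold: dicts mapping split names ('within', 'short', 'long')
    to lists of stance labels. Returns macro-F1 per split and the relative drop
    from the within-time split (illustrative, not the official metric)."""
    scores = {s: f1_score(gold[s], predictions[s], average="macro") for s in gold}
    base = scores.get("within")
    drops = {s: (base - v) / base for s, v in scores.items()} if base else {}
    return scores, drops
```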
Evaluation Datasets
Testing Evaluation sets (gold labels)
Good Luck!