Data

Task 1. LongEval-Retrieval

The data for this task is a sequence of web document collections and queries provided by Qwant.

Description of the Data

Queries:
The queries are extracted from Qwant’s search logs, based on a set of selected topics. The query set was created in French and automatically translated to English. To account for translation quality, four translations are provided for each query, sorted by their estimated translation probability.
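A minimal loading sketch, assuming a hypothetical tab-separated file ("queries_en.tsv") in which each line holds a query id followed by its four English translations in decreasing order of estimated translation probability (the actual file layout is documented with the collection):

```python
import csv

# Keep only the top-ranked English translation for each query.
# ASSUMPTION: tab-separated file, query id in column 1, the four candidate
# translations in the remaining columns, already sorted by decreasing
# estimated translation probability.
def load_top_translations(path="queries_en.tsv"):
    top = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            qid, translations = row[0], row[1:]
            top[qid] = translations[0]  # highest estimated probability
    return top

queries = load_top_translations()
```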

Documents:
The document collection contains the candidate documents to be retrieved for the queries. It was built by first extracting from the Qwant index the content of all documents that were displayed in SERPs for the selected queries. In addition, potentially non-relevant documents were randomly sampled from the Qwant index to better represent the nature of a Web test collection; this random sampling alleviates bias towards relevant documents. Filters were applied to exclude spam and adult content.
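The sketch below only illustrates the idea of that sampling step; the candidate pool, the spam/adult filter, and the sample size are hypothetical and do not describe the organizers' actual pipeline:

```python
import random

# Illustrative only: extend the SERP-derived documents with randomly sampled
# documents from a (hypothetical) index pool, skipping filtered content.
def build_collection(serp_docs, index_pool, is_spam_or_adult, n_random=10_000, seed=42):
    collection = dict(serp_docs)  # docid -> text for documents shown in SERPs
    candidates = [d for d in index_pool
                  if d not in collection and not is_spam_or_adult(index_pool[d])]
    rng = random.Random(seed)
    for docid in rng.sample(candidates, min(n_random, len(candidates))):
        collection[docid] = index_pool[docid]
    return collection
```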

Relevance estimates:
The relevance estimates for LongEval-Retrieval are obtained through automatic collection of implicit user feedback. This feedback is derived from a click model based on Dynamic Bayesian Networks trained on Qwant data. The output of the click model is an attractiveness probability, which is converted to a 3-level relevance score (0 = not relevant, 1 = relevant, 2 = highly relevant). This set of relevance estimates will be complemented with explicit relevance assessments after the submission deadline.
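The actual thresholds used to discretize the attractiveness probability are not restated here; the sketch below uses purely hypothetical cut-offs to show how such a mapping to the 3-level scale could look:

```python
# Hypothetical discretization of a click-model attractiveness probability into
# the 3-level scale (0 = not relevant, 1 = relevant, 2 = highly relevant).
# The cut-offs 0.4 and 0.7 are illustrative assumptions, not the actual values.
def to_relevance_level(attractiveness, low=0.4, high=0.7):
    if attractiveness >= high:
        return 2
    if attractiveness >= low:
        return 1
    return 0

assert to_relevance_level(0.85) == 2
assert to_relevance_level(0.50) == 1
assert to_relevance_level(0.10) == 0
```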

The overview of the data creation process is displayed in the Figure below:


Collections

Participants in LongEval 2024 can also use the LongEval 2023 collections for training.

If you experience any problems logging in to the Lindat/Clarin website, please first check the instructions and then contact the organizers.


References:

More details about the collection can be found in the following paper: P. Galuscakova, R. Deveaud, G. Gonzalez-Saez, P. Mulhem, L. Goeuriot, F. Piroi, M. Popel: LongEval-Retrieval: French-English Dynamic Test Collection for Continuous Web Search Evaluation.

Task 2. LongEval-Classification

In this edition of the task, we utilize and expand upon the Climate Change Twitter dataset, known as CC-SD, focusing on climate change stance, post time, and tweet content. The CC-SD dataset spans a 13-year period and encompasses over 15 million tweets. Tweets are annotated with three stance labels, believer, denier, and neutral, using a BERT model; the annotated tweets, distributed across the timeline, comprise 11,292,424 believer, 1,191,386 denier, and 3,305,601 neutral tweets. Annotation is performed via transfer learning, with a BERT model trained on a separate climate-change sentiment dataset serving as distant supervision; manual annotation is then used to enhance precision.
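As a rough illustration of applying such a fine-tuned stance classifier to new tweets (the checkpoint path and label mapping below are placeholders, not the actual model used to build CC-SD):

```python
from transformers import pipeline

# Illustrative sketch: three-way stance prediction with a fine-tuned BERT model,
# in the spirit of the distant-supervision annotation described above.
classifier = pipeline("text-classification", model="path/to/bert-climate-stance")  # hypothetical checkpoint
label_names = {"LABEL_0": "denier", "LABEL_1": "neutral", "LABEL_2": "believer"}   # assumed mapping

tweets = ["Climate change is accelerating faster than predicted."]
for pred in classifier(tweets):
    print(label_names.get(pred["label"], pred["label"]), round(pred["score"], 3))
```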

Task 2. LongEval-Classification Datasets

The train set, spanning 2009 to 2011 and sampled from CC-SD, provides a comprehensive dataset for model training. It is not human-annotated.

The Task 2 (LongEval-Classification) datasets will be released in two phases.

Practice [Pre-Evaluation] Phase

In the Practice phase, participants work with pre-evaluation datasets from 2010 and 2014, sampled from CC-SD and manually verified, allowing them to practice on a within-time frame and over a short temporal gap. These human-annotated within-time and short-term practice sets help refine model development before the formal evaluation.

Practice Datasets
Practice sets: within-practice and short-practice

Evaluation Phase

The Evaluation phase assesses models using datasets from 2011, 2015, and the longer period of 2018-2019, all sampled from CC-SD and manually verified. These correspond to within-time, short-term, and long-term predictions, respectively, offering a holistic evaluation of model performance across different temporal contexts and ensuring thorough testing of the models' temporal persistence.

Evaluation Datasets
Testing Evaluation sets (gold-labels)

Good Luck!