Tasks

LongEval 2026 continues the Sci-Retrieval task from 2025 and introduces three new tasks: LongEval-TopEx (Topic Extraction), LongEval-USim (User Simulation), and LongEval-RAG (Retrieval-Augmented Generation). Short descriptions, data sources and evaluation procedures for each task appear below. To receive announcements and participate in discussions, please subscribe to our Google Group or join the project Slack at longeval.slack.com.

Task 1. LongEval-Sci: Ad-Hoc Scientific Retrieval

Description

Build IR systems that maintain retrieval effectiveness over time as the scientific collection evolves, and submit runs on multiple time-stamped snapshots.

Data

CORE documents and queries across multiple snapshots. Two training snapshots (e.g. 2024-11, 2025-01) and two test snapshots for short- and long-term persistence (~4M docs, ~1k queries).

Evaluation

nDCG per snapshot; robustness across snapshots measured by Relative Improvement (RI), Delta Relative Improvement (DRI) and Effect Ratio (ER).
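As an illustration of how these robustness measures relate to per-topic nDCG, the sketch below uses one common reading of RI, DRI and ER from the replicability literature, applied to a baseline run and an advanced run on two snapshots; the exact formulas used by the lab may differ, and all numbers are toy values.

import numpy as np

# Hypothetical reading of the robustness measures; the official definitions may differ.
def relative_improvement(advanced, baseline):
    """RI: relative gain of the advanced run over the baseline on mean nDCG."""
    return (advanced.mean() - baseline.mean()) / baseline.mean()

def delta_relative_improvement(adv_t0, base_t0, adv_t1, base_t1):
    """DRI: how much of the original relative gain is lost on the later snapshot."""
    return relative_improvement(adv_t0, base_t0) - relative_improvement(adv_t1, base_t1)

def effect_ratio(adv_t0, base_t0, adv_t1, base_t1):
    """ER: mean per-topic improvement on the later snapshot relative to the original one."""
    return (adv_t1 - base_t1).mean() / (adv_t0 - base_t0).mean()

# Per-topic nDCG scores (toy values) on the training snapshot (t0) and a test snapshot (t1).
base_t0 = np.array([0.30, 0.42, 0.55]); adv_t0 = np.array([0.38, 0.47, 0.60])
base_t1 = np.array([0.28, 0.40, 0.50]); adv_t1 = np.array([0.33, 0.43, 0.52])

print(relative_improvement(adv_t0, base_t0))
print(delta_relative_improvement(adv_t0, base_t0, adv_t1, base_t1))
print(effect_ratio(adv_t0, base_t0, adv_t1, base_t1))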

Task 2. LongEval-TopEx: Topic Extraction From Query Logs

Description

Extract TREC-style topics (query + description + narrative) from query logs to formalize the information need aligned with observed usage.

Data

Training snapshot at time t; build a top-10 pool from the Task-1 runs over the overlapping training queries and annotate it with multiple LLM-based relevance assessors to obtain alternative qrels per topic.
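A minimal pooling sketch follows, assuming TREC-style run files ("qid Q0 docid rank score tag"); the run file names are hypothetical.

from collections import defaultdict
from pathlib import Path

def top_k_pool(run_files, k=10):
    """Union of the top-k documents per query across the given run files."""
    pool = defaultdict(set)                     # query_id -> set of document ids
    for run_file in run_files:
        seen_per_query = defaultdict(int)
        for line in Path(run_file).read_text().splitlines():
            if not line.strip():
                continue
            qid, _, docid, *_ = line.split()
            if seen_per_query[qid] < k:         # runs are assumed to be rank-ordered
                pool[qid].add(docid)
                seen_per_query[qid] += 1
    return pool

pool = top_k_pool(["runs/bm25.trec", "runs/dense.trec"])    # hypothetical run files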

Evaluation

Assess topics on Alignment (agreement with click-log qrels), Distinguishability (ability to separate runs via nDCG) and Clarity (consistency across LLM annotations). Measure short- and long-term drops. The extracted topics feed into Task 3.

Task 3. LongEval-USim: User Simulation

Description

Model user behaviors and query reformulations from CORE search sessions to predict the next query in a sequence.

Data & setup

Pre-filtered session log excerpts (search_id, uid, queries, SERPs, clicks). A pre-configured SimIIR v3 environment is provided for simulations; Task-2 topics may be used as context.
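As a sketch of how the log excerpts can be turned into training data for next-query prediction, the snippet below groups rows by search_id into ordered query sequences; the file name, CSV layout and anything beyond the fields listed above are assumptions.

import csv
from collections import defaultdict

# Group the pre-filtered log excerpts into per-session query sequences.
sessions = defaultdict(list)                    # search_id -> ordered list of queries
with open("sessions_train.csv", newline="") as f:       # hypothetical export of the log excerpts
    for row in csv.DictReader(f):
        sessions[row["search_id"]].append(row["query"])

# Each (query history, next query) pair is one training example for a simulator.
examples = [(queries[:i], queries[i])
            for queries in sessions.values()
            for i in range(1, len(queries))]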

Task

Produce simulated queries or user-model configurations that replicate observed session behavior; train on historic logs and evaluate on held-out sessions.

Evaluation

Compare simulated and real queries/interactions in terms of indistinguishability and performance prediction. Use reproducibility measures and track short- and long-term simulator quality.

Task 4. LongEval-RAG: Retrieval-Augmented Generation

Description & data

Task 4 studies to what extent RAG systems cope with the evolution of scientific knowledge over time. To that end, participants receive a dataset composed of two parts:
i) A textual query (along with its query id) for which the participating system has to provide a textual answer;
ii) A set of document ids, mined from the corpus, on which the answer generation for that query must be solely based (this set may be seen as the output of a first-stage filter in a RAG system). The set is not expected to contain only documents relevant to the query, but it does contain some relevant ones. The provided documents are part of the Task 1 dataset.

Based on the query and the list of provided documents, participants have to produce:
a) a generated answer, built on the document extracts from part b) below;
b) the document extracts that are relevant to the query, i.e. text taken from the textual content of the documents listed in ii), using the contents of the LongEval 2026 corpus.
A minimal extractive baseline is sketched after this list.
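The sketch below illustrates one naive way to produce both parts: sentences from the provided documents are scored by term overlap with the query, the top sentences become the extracts (part b), and the answer (part a) is assembled from them. The document loading, the overlap scoring and the answer construction are placeholders, not the required approach; a real submission would typically use a proper retriever and an LLM generator.

import re

# A minimal extractive sketch, assuming the texts of the provided documents have
# already been fetched from the Task 1 corpus into doc_texts (doc_id -> text).
def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def overlap_score(query, sentence):
    q, s = set(query.lower().split()), set(sentence.lower().split())
    return len(q & s) / (len(q) or 1)

def answer_query(query, doc_ids, doc_texts, max_extracts=5):
    candidates = [s for d in doc_ids for s in split_sentences(doc_texts.get(d, ""))]
    ranked = sorted(candidates, key=lambda s: overlap_score(query, s), reverse=True)
    extracts = ranked[:max_extracts]            # part b): submission allows at most 5 extracts
    answer = " ".join(extracts)                 # part a): placeholder for real generation
    return answer, extracts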

Submission

For the submission of participants' runs:
The responses for part a) are submitted as a string corresponding to the generated answer: (query_id, answer).
The responses for part b) are submitted as a set of pairs (maximum 5):
(query_id, {document extract}), where a document extract is a character string the participant considers relevant to the query. Since several extracts may be given, a run contains a set of extracts.
Examples of the exact submission format and 3 training queries (based on the 2025 dataset) will be provided by the end of February.

Overall, each submitted run is a zip file containing three text files: one with the part a) responses for all queries, one with the part b) responses for all queries, and one file "description.txt" that briefly describes the submitted system. The name of the zip file is used as the run name during the evaluation process; a packaging sketch is given below.
Each participating team may submit a maximum of 3 runs.
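A packaging sketch, assuming the answers and extracts have already been collected in memory; the inner file names and the tab-separated layout are placeholders until the official format examples are released (see above).

import zipfile

answers = {"00005": "Large-degree polynomial multiplication dominates ..."}
extracts = {"00005": ["Polynomial multiplication dominates AES evaluation cost.",
                      "Hardware mitigation via CRT + NTT + optimized modular arithmetic."]}

run_name = "team_run1"                          # the zip name doubles as the run name
with open("answers.txt", "w") as f:             # part a) responses for all queries
    for qid, answer in answers.items():
        f.write(f"{qid}\t{answer}\n")
with open("extracts.txt", "w") as f:            # part b) responses for all queries
    for qid, exts in extracts.items():
        for ext in exts[:5]:                    # at most 5 extracts
            f.write(f"{qid}\t{ext}\n")
with open("description.txt", "w") as f:
    f.write("Short description of the submitted system.\n")

with zipfile.ZipFile(f"{run_name}.zip", "w") as z:
    for name in ("answers.txt", "extracts.txt", "description.txt"):
        z.write(name)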

Example of task 4 query (based on the 2025 dataset):
- Query_id: 00005
- Query: "What operation dominates the cost of homomorphic AES evaluation, and how can hardware architectures reduce its impact?"
- Docs_id: {7964767, 66637364, 42856664, 31316619, 19943691, 19531678, 156069243, 156069094, 150704088, 141753453}

The result for this query:
- Part a: (query_id: 00005, answer: "Large-degree polynomial multiplication dominates the cost of homomorphic AES evaluation. Hardware architectures reduce its impact by decomposing coefficients using CRT, performing fast NTT-based multiplications in parallel, and minimizing modular conversion overhead through optimized arithmetic pipelines.")
- Part b: (query_id: 00005, extracts: {"Polynomial multiplication dominates AES evaluation cost.", "Hardware mitigation via CRT + NTT + optimized modular arithmetic."})

Evaluation

Part a) will be evaluated using classical text generation measures such as ROUGE or BERTScore. Part b) will be evaluated by computing the overlap (e.g. Levenshtein distance) between the manually assessed relevant extracts and the submitted extracts.
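As a rough illustration of the extract-overlap idea, the sketch below uses the similarity ratio from Python's standard difflib as a stand-in for a Levenshtein-style measure; the official scorer, its aggregation across queries, and the ROUGE/BERTScore tooling for part a) are not reproduced here.

from difflib import SequenceMatcher

def best_overlap(submitted_extract, assessed_extracts):
    """Similarity (0..1) of a submitted extract to its closest assessed extract."""
    return max(SequenceMatcher(None, submitted_extract, gold).ratio()
               for gold in assessed_extracts)

gold = ["Polynomial multiplication dominates AES evaluation cost."]
print(best_overlap("Polynomial multiplication dominates the cost.", gold))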