LongEval 2026 continues the Sci-Retrieval task from 2025 and introduces three new tasks: LongEval-TopEx (Topic Extraction), LongEval-USim (User Simulation), and LongEval-RAG (Retrieval-Augmented Generation). Short descriptions, data sources and evaluation procedures for each task appear below. To receive announcements and participate in discussions, please subscribe to our Google Group or join the project Slack at longeval.slack.com.
This task aims to answer fundamental questions about the robustness and stability of retrieval systems as the search scenario evolves. The test collections for this year contain scientific publications acquired from the CORE collection of open-access scholarly documents. LongEval focuses on two core questions:
Compared to the previous year, we extended the snapshot period to three months to account for the unique dynamics of scientific search. We provide three snapshots of a growing document corpus and ask participants to submit ad-hoc ranking systems, potentially with a focus on scientific search and temporal changes.
We provide a dynamic test collection consisting of three snapshots, a train and test query set, and two pseudo qrel sets per snapshot. Additionally, after submission, LLM judgments will be created for pools from the submitted runs.
Train: To access the queries, please use the ir-datasets-longeval integration that is available on PyPI (we hope this simplifies the re-use of your code in longitudinal retrieval settings). You can download the dataset directly from the TU Wien Data Repository.
Submissions will be handled via TIRA.io. Please visit the TIRA task page and register. We highly encourage software submissions, as they facilitate repeated executions over time. We provide submission templates for your convenience; however, TREC-style run file submissions are also welcome. For further information, please have a look at the GitHub Submission Guide.
For Task-1, preliminary relevance judgments could help some approaches (e.g., to boost documents that were relevant in the past). We want to create LLM relevance judgments that use only the query. If you can make use of those relevance judgments in your approach, please reach out in the CLEF Discord chat or the LongEval Slack.
Offline evaluations of retrieval systems in the Cranfield paradigm often formalize information needs into TREC-style topics that consist of a query (what searchers submit to the search engine), a description (what searchers actually mean), and a narrative (defining which documents are relevant or not). When queries are extracted from query logs, none of these formalizations exist, which can reduce the reliability of a derived offline evaluation collection (because the subjectivity of relevance judgments increases). Task 2 of LongEval aims to extract TREC-style topics (query + description + narrative) from query logs to formalize the information need, so that subsequent offline retrieval evaluation corpora obtain more reliable relevance judgments.
We have created two baselines (one with an LLM, one that injects the query into a predefined template) that you can use as a starting point in the longeval-code repository.
We use the test queries of Task-1, i.e., approaches should generate a topic for each test query of Task-1. To access the queries, please use the LongEval ir_datasets extension that is available on PyPI (we hope this simplifies the re-use of your code in longitudinal retrieval settings).
We require that topics are generated in .jsonl format (please have a look at the baselines for more details). Every line should have the fields qid, query, description, and narrative, for example:
{"qid": "1", "query": "ransomware detection", "description": "I want to know which algorithms for ransomware detection are effective.", "narrative": "Papers that describe algorithms for ransomware detection are relevant when they also have an evaluation. Evaluation papers that just compare multiple ransomware detection algorithms are also relevant. Other aspects, such as describing which ransomware attacks exist, the history of ransomware detection, etc., are not relevant."}
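A line in this format can be produced and sanity-checked with a short Python helper. The field names come from the example above; the helper itself is an illustrative sketch, not part of the official tooling:

```python
import json

# Required fields of a TREC-style topic line, per the task description.
REQUIRED_FIELDS = ("qid", "query", "description", "narrative")

def topic_to_jsonl_line(topic: dict) -> str:
    """Serialize one topic to a .jsonl line, failing on missing fields.
    A hypothetical helper, not part of the official tooling."""
    missing = [f for f in REQUIRED_FIELDS if f not in topic]
    if missing:
        raise ValueError(f"topic is missing fields: {missing}")
    return json.dumps({f: topic[f] for f in REQUIRED_FIELDS})

line = topic_to_jsonl_line({
    "qid": "1",
    "query": "ransomware detection",
    "description": "I want to know which algorithms for ransomware detection are effective.",
    "narrative": "Papers that describe algorithms for ransomware detection are relevant.",
})
```

Writing one such line per test query yields a valid .jsonl submission file.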
For Task-2, preliminary relevance judgments could help some approaches (e.g., when you want to inject one positive and one negative document into your prompt to generate a topic). We want to create LLM relevance judgments that use only the query. If you can make use of those relevance judgments in your approach, please reach out in the CLEF Discord chat or the LongEval Slack.
Submissions will be handled via TIRA.io. Please visit the TIRA task page and register. We accept both uploads of generated topics and code submissions (code submissions can access OpenAI-compatible LLMs and/or GPUs in TIRA; we have the capacity to help with code submissions, so please reach out in the Discord chat if you need help). Please use the baselines as a starting point for your submission.
We will create relevance judgments for the generated TREC-style topics with large-language-model relevance assessors and will verify how well the created relevance judgments align over time with the future query logs (agreement with click-log qrels), how well they can distinguish retrieval systems submitted to Task-1 (ability to separate runs via nDCG), and their clarity (how consistent the annotations are across different large language models). We will measure the short- and long-term drops in those aspects. We also envision feeding the created topics into Task-3.
A central goal of the LongEval lab is to measure the performance of retrieval models over time. Other models, like user models, were not in focus. Like retrieval models, user models are not meant to be static and time-agnostic, although in practice, time and the evolution of retrieval environments are often ignored. Therefore, in Task 3, we aim to evaluate user models in longitudinal user simulations.
The core assignment in Task 3 is next-query prediction. This task was previously introduced in the Sim4IA micro shared task. For LongEval, we focus on the longitudinal aspects of this user simulation task. The objective is to accurately predict the final query of a given session (which the organizers will withhold), based on the preceding interaction history and the best-fitting user simulator.
For each provided session, participants must submit up to 5 candidate queries, ranked by their likelihood of being the next query.
Participants will receive the sessions containing the following rich set of features:
The data will be provided in the following form:
Index, User Name, Session Number, Query String, Timestamp of Query Submission,
SERP, SearchID, "[('DocId', Time of Click, Type of Click)]"
Session File Example:
1,test_user,44,effects of media literacy on learning experience ,2025-02-08 15:54:49,"[213271, 8868621, 36996622, 77231420, 62528977, 44759354, 267373, 128496322, 256706, 256108]",4dd111537a7a8d21a1ca912fe1d611ad,"[('44759354', datetime.datetime(2025, 2, 8, 15, 55, 11), 'works')]"
2,test_user,44,media literacy skills,2025-02-08 15:55:11,"[156462828, 4762625, 84628214, 153245377, 155446828, 141526785, 43469067, 31614868, 62323511, 126958714]",73071860414ec67d0fc5f01d94884421,"[('62323511', datetime.datetime(2025, 2, 8, 15, 55, 33), 'works')]"
3,test_user,44, ...
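Assuming the session file is standard CSV with the Python-literal lists shown above, a row can be parsed roughly as follows. This is an illustrative sketch; field handling in the released data may differ. Note that the click list embeds `datetime.datetime(...)` constructor calls, which `ast.literal_eval` rejects, so the sketch evaluates that one field in a restricted namespace:

```python
import ast
import csv
import datetime
import io

# One row in the format of the session file above (SERP shortened for brevity).
ROW = ('1,test_user,44,effects of media literacy on learning experience ,'
       '2025-02-08 15:54:49,"[213271, 8868621, 36996622]",'
       '4dd111537a7a8d21a1ca912fe1d611ad,'
       '"[(\'44759354\', datetime.datetime(2025, 2, 8, 15, 55, 11), \'works\')]"')

def parse_session_row(row: str) -> dict:
    idx, user, session, query, ts, serp, search_id, clicks = next(csv.reader(io.StringIO(row)))
    return {
        "index": int(idx),
        "user": user,
        "session": int(session),
        "query": query.strip(),  # raw log queries may carry trailing spaces
        "timestamp": datetime.datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"),
        "serp": ast.literal_eval(serp),  # plain list of document ids
        "search_id": search_id,
        # The click list contains datetime.datetime(...) calls, which
        # ast.literal_eval rejects; evaluate with a namespace exposing only
        # the datetime module (only do this with trusted files).
        "clicks": eval(clicks, {"__builtins__": {}, "datetime": datetime}),
    }
```

Iterating this over the file lines yields one interaction record per query submission, ready to feed a user simulator.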
For each implemented persona, submit a run file containing:
For each session, predict 5 diverse query candidates that remain semantically similar to the original query. Rank them in descending order of confidence.
Run file example:
{
"meta": {
"team_name": "",
"description": "",
"run_name": ""
},
"1": [ "Q1", "Q2", "Q3", "Q4", "Q5" ],
"2": [ "Q1", "Q2", "Q3", "Q4", "Q5" ],
"...": [ "Q1", "Q2", "Q3", "Q4", "Q5" ],
"45": [ "Q1", "Q2", "Q3", "Q4", "Q5" ]
}
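The run-file structure above can be assembled and sanity-checked with a few lines of Python. The helper below is hypothetical, not an official validation tool; it only mirrors the JSON layout and the up-to-5-candidates rule stated above:

```python
import json

def build_run(team_name: str, description: str, run_name: str,
              predictions: dict) -> str:
    """Assemble a run-file JSON string from per-session candidate lists.

    `predictions` maps a session id (string) to a list of up to 5 query
    candidates, ranked by confidence. An illustrative helper only.
    """
    for sid, queries in predictions.items():
        if not 1 <= len(queries) <= 5:
            raise ValueError(f"session {sid}: expected 1-5 candidates, "
                             f"got {len(queries)}")
    run = {"meta": {"team_name": team_name,
                    "description": description,
                    "run_name": run_name}}
    run.update(predictions)  # session ids become top-level keys, as above
    return json.dumps(run, indent=2)
```

Writing the returned string to a file produces a run in the format of the example above.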
We will evaluate the query candidates based on their semantic similarity to the original, withheld query and on the redundancy among the candidates themselves. We will use the Rank-Diversity Score (RDS) from the Sim4IA-Bench suite.
We will measure differences in RDS across snapshots. At the end, we would like to see which user simulation or user modeling approach was most successful across the available snapshots.
To ensure an easy start, we offer an adapted version of the SimIIR 3.0 Framework, including sample simulators and video tutorials. However, the choice of framework remains entirely up to the participants.
For constructing the user simulator that performs the next-query prediction, we encourage participants not only to effectively leverage all provided signals but also to integrate supplementary or self-engineered features to enhance their simulator's predictive power. Feel free to try out strategies like personas or specific user attributes (e.g., degree, research field, or preferred interaction style). Different initial ideas can be drawn from the results of the Sim4IA Micro Shared Task 2025.
We will not provide topic descriptions for the sessions or direct relevance judgments, although inputs and overlaps from previous tasks are especially welcome. Participants engaged in Task 2 (Topic Extraction) are strongly encouraged to leverage their approach. This would allow them to build a comprehensive topic representation not only from a single query but also from multiple queries within a session, and to use this insight to enhance their query prediction simulator.
Task 4 aims to study to what extent RAG systems cope with the evolution of scientific knowledge over time. To do that, we provide the task participants with a dataset composed of two parts:
i) A textual query (along with its query id) for which the participating system has to provide a textual answer;
ii) A set of document ids, mined from the corpus, on which the answer generation for a particular query must be solely based (this set may be seen as the output of a first-stage filter in a RAG system). This set of documents is not composed solely of documents relevant to the query, but it does contain some relevant documents. The provided documents will be part of the dataset provided within Task 1.
The problem to be solved by the participants of this task is to provide, based on the query and the list of provided documents:
a) The generated answer, using the document extracts in part b) below.
b) The document extracts that are relevant to the query (text extracted from the textual content of the documents provided in ii), using the contents from the LongEval 2026 corpus).
For the submission of participant runs:
The responses for part a) will be submitted as a string corresponding to the generated answer:
(query_id, answer)
The responses for part b) will be submitted as a set of pairs (maximum 5):
(query_id, {document extract}), where a document extract is a character string considered by the participant to be relevant for the query. As several extracts may be defined, a run will contain a set of extracts.
Examples of the exact submission format and 3 training queries (based on the 2025 dataset) will be provided by the end of February.
Overall, each submitted run is a zip file composed of three text files: one for part a) for all queries, one for part b) for all queries, and one file "description.txt" that briefly describes the submitted system. The name of the zip file will be used as the run name during the evaluation process.
A maximum of 3 runs may be submitted by each participating team.
Example of task 4 query (based on the 2025 dataset):
- Query_id: 00005
- Query: "What operation dominates the cost of homomorphic AES evaluation, and how can hardware
architectures reduce its impact?"
- Docs_id: {7964767, 66637364, 42856664, 31316619, 19943691, 19531678, 156069243, 156069094, 150704088, 141753453}
The result for this query:
- Part a: (query_id: 00005, answer: "Large-degree polynomial multiplication dominates the cost of
homomorphic AES evaluation. Hardware architectures reduce its impact by decomposing coefficients
using CRT, performing fast NTT-based multiplications in parallel, and minimizing modular conversion
overhead through optimized arithmetic pipelines.")
- Part b: (query_id: 00005, extracts: {"Polynomial multiplication dominates AES evaluation cost.", "Hardware mitigation via CRT + NTT + optimized modular arithmetic."})
The evaluation of a) will be done using classical text-generation evaluation measures, such as ROUGE or BERTScore. The evaluation of b) will be done by computing the overlap (e.g., via the Levenshtein distance) between the manually assessed relevant extracts and the retrieved parts.
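As an illustration of the kind of overlap measure mentioned for part b), the Levenshtein distance between an assessed extract and a retrieved extract can be computed as follows. This is a sketch of the standard algorithm, not the official evaluation code:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance: the minimum number of
    single-character insertions, deletions, and substitutions (all cost 1)
    needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[len(b)]
```

A lower distance between a retrieved extract and a manually assessed extract indicates a closer match; a normalized variant (dividing by the longer string's length) would make scores comparable across extracts of different lengths.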