LongEval 2026

LongEval 2026 continues the Sci-Retrieval task from 2025 and introduces three new tasks: LongEval-TopEx (Topic Extraction), LongEval-USim (User Simulation), and LongEval-RAG (Retrieval-Augmented Generation). Short descriptions, data sources and evaluation procedures for each task appear below. To receive announcements and participate in discussions, please subscribe to our Google Group or join the project Slack at longeval.slack.com.

Task 1. LongEval-Sci: Ad-Hoc Scientific Retrieval

Description

This task aims to answer fundamental questions about the robustness and stability of retrieval systems as the search scenario evolves. This year's test collections contain scientific publications acquired from the CORE collection of open access scholarly documents. LongEval focuses on two core questions:

(1) How does a search engine behave as the collection of documents available to it evolves? This question is especially important for commercial systems, where user satisfaction is central.
(2) When do we need to update an IR system as the collection of documents it searches changes? If we can assess the decrease in performance (if any) of a system on an evolving collection, we can then decide whether the system needs to be updated.

In comparison to the previous year, we enlarged the snapshots to three months to account for the unique dynamics of scientific search. We provide three snapshots of a growing document corpus and ask participants to submit ad-hoc ranking systems, potentially with a focus on scientific search and temporal changes.

Baselines

We have created two baselines (BM25 and Qwen3-Embedding-4B) in the longeval-code repository that you can use as a starting point.

Data

We provide a dynamic test collection consisting of three snapshots, a train and test query set, and two pseudo qrel sets per snapshot. Additionally, after submission, LLM judgments will be created for pools from the submitted runs.

Train:

- 100 Train queries
- Snapshot March to June 2025
- Pseudo qrels (document click-through rate and raw clicks)

Test:

- 338 New Test queries
- Three Snapshots: March to June 2025, July to August 2025, and September to November 2025
- Pseudo qrels per snapshot (document click-through rate and raw clicks)
- LLM judgments based on submission pools

To access the queries, please use the ir-datasets-longeval integration that is available on PyPI (we hope that this eases the re-use of your code in longitudinal retrieval settings). You can download the dataset directly from the TU Wien Data Repository.

We provide citation data for the scientific documents. We share a filtered OpenCitations Index (password: longeval) file that contains citation triplets, the timestamp, and additionally a mapping from the OMID to the CORE document ID. When using the citation data, please ensure that the citations match the snapshot.

Submission

Submissions will be handled via TIRA.io. Please visit the TIRA task page and register. We highly encourage software submissions as it facilitates repeated executions over time. We provide submission templates for your convenience. However, TREC-style run file submissions are also welcome. For further information, please have a look at the GitHub Submission Guide.
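
For run file submissions, a minimal sketch of writing rankings in the conventional six-column TREC run layout (query ID, the literal Q0, document ID, rank, score, run tag) might look like the following. The function name, the run tag, and the example IDs are illustrative assumptions; the GitHub Submission Guide remains the reference for what TIRA expects.

```python
def format_trec_run(rankings, tag):
    """Render {query_id: [(doc_id, score), ...]} as TREC run lines."""
    lines = []
    for qid, scored_docs in rankings.items():
        # Sort by descending score so that rank 1 is the best document.
        ordered = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
        for rank, (doc_id, score) in enumerate(ordered, start=1):
            lines.append(f"{qid} Q0 {doc_id} {rank} {score:.4f} {tag}")
    return "\n".join(lines) + "\n"


# Hypothetical query and document IDs, for illustration only.
example = {"q1": [("doc-b", 7.5), ("doc-a", 12.25)]}
run_text = format_trec_run(example, tag="my-bm25-run")
print(run_text, end="")
```

Sorting inside the function keeps the rank column consistent with the scores even if the retrieval code returns documents in arbitrary order.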

Access to Large Language Model Relevance Judgments

For Task-1, preliminary relevance judgments could help some approaches (e.g., to boost documents that were relevant in the past). We want to create LLM relevance judgments that only use the query for their judgments. If you can make use of those relevance judgments in your approach, please reach out in the Discord Chat of CLEF or the LongEval Slack.

Task 2. LongEval-TopEx: Topic Extraction From Query Logs

Description

Offline evaluations of retrieval systems in the Cranfield Paradigm often formalize information needs into TREC-style topics that consist of a query (what searchers submit to the search engine), a description (what searchers actually mean), and a narrative (defining which documents are relevant or not). When queries are extracted from query logs, none of these formalizations exist, which can reduce the reliability of a derived offline evaluation collection (because the subjectivity of relevance judgments increases). Task 2 of LongEval aims to extract TREC-style topics (query + description + narrative) from query logs to formalize the information need so that subsequent offline retrieval evaluation corpora get more reliable relevance judgments.

Baselines

We have created two baselines (one with an LLM, one that injects the query into a predefined template) in the longeval-code repository that you can use as a starting point.

Data

We use the test queries of Task-1, i.e., approaches should generate a topic for each test query of Task-1. To access the queries, please use the LongEval ir_datasets extension that is available on PyPI (we hope that this simplifies the re-use of your code in longitudinal retrieval settings).

We require that topics are generated in .jsonl format (please have a look at the baselines for more details). Every line should have the fields qid, query, description, and narrative with:

qid: The ID of the query.
query: The original query.
description: The generated description for the information need, i.e., what searchers had in their mind.
narrative: The generated narrative for the information need, i.e., defining which documents are relevant or not.

For instance, for the query "ransomware detection" with id "1" from the spot-check dataset, a valid output could look like:

{"qid": "1", "query": "ransomware detection", "description": "I want to know which algorithms for ransomware detection are effective.", "narrative": "Papers that describe algorithms for ransomware detection are relevant when they also have an evaluation. Evaluation papers that just compare multiple ransomware detection algorithms are also relevant. Other aspects, such as describing which ransomware attacks exist, the history of ransomware detection, etc., are not relevant."}
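
As a rough illustration of the required .jsonl output, the sketch below produces one record per query by injecting the query into fixed templates, in the spirit of the template baseline. The template wording and the helper names (`template_topic`, `to_jsonl`) are assumptions made for this example and are not taken from the longeval-code repository.

```python
import json


def template_topic(qid, query):
    """Build one topic record by injecting the query into fixed templates."""
    return {
        "qid": str(qid),
        "query": query,
        # Hypothetical templates; the organizers' baseline may word these differently.
        "description": f"I am looking for scientific publications about {query}.",
        "narrative": (
            f"Publications that directly address {query} are relevant. "
            f"Publications that only mention {query} in passing are not relevant."
        ),
    }


def to_jsonl(topics):
    """Serialize topic records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(t, ensure_ascii=False) + "\n" for t in topics)


topics = [template_topic("1", "ransomware detection")]
jsonl_text = to_jsonl(topics)
print(jsonl_text, end="")
```

An LLM-based approach could replace `template_topic` while keeping the same serialization step, so that every line still carries exactly the four required fields.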

Access to Large Language Model Relevance Judgments

For Task-2, preliminary relevance judgments could help some approaches (e.g., when you want to inject one positive and one negative document into your prompt to generate a topic). We want to create LLM relevance judgments that only use the query for their judgments. If you can make use of those relevance judgments in your approach, please reach out in the Discord Chat of CLEF or the LongEval Slack.

Submissions

Submissions will be handled via TIRA.io. Please visit the TIRA task page and register. We accept both uploads of generated topics and code submissions, which can access OpenAI-compatible LLMs and/or GPUs in TIRA. We have the capacity to help with code submissions, so please reach out in the Discord Chat if you need help. Please see the baselines as a starting point for your submission.

Evaluation

We will create relevance judgments for the generated TREC-style topics with large language model relevance assessors and will verify three aspects: how well the created relevance judgments align over time with future query logs (agreement with click-log qrels), how well they can distinguish retrieval systems submitted to Task-1 (ability to separate runs via nDCG), and clarity (how consistent the annotations are across different large language models). We will measure the short- and long-term drops of those aspects. We also plan to feed the created topics to Task-3.

Task 3. LongEval-USim: User Simulation

Task Description

A central goal of the LongEval lab is to measure the performance of retrieval models over time. Other models, like user models, were not in focus. Like retrieval models, user models are not meant to be static and time-agnostic, although in practice, time and the evolution of retrieval environments are often ignored. Therefore, in Task 3, we aim to evaluate user models in longitudinal user simulations.

The core assignment in Task 3 is next query prediction. This task was previously introduced in the Sim4IA micro shared task. For LongEval, we focus on the longitudinal aspects of this user simulation task. The objective is to accurately predict the final query of a given session (which the organizers will withhold), based on the preceding interaction history and the best-fitting user simulator.

For each provided session, participants must submit up to 5 candidate queries, ranked by their likelihood of being the next query.

Data & setup

Participants will receive the sessions containing the following rich set of features:

The sequence of queries submitted.
Timestamps for query submissions.
The top-10 SERP (Search Engine Results Page) retrieved for each query.
Which documents were clicked.

The data will be provided in the following form:

Index, User Name, Session Number, Query String, Timestamp of Query Submission, SERP, SearchID, "[('DocId', Type of Click)]"

Session File Example:

1,test_user,44,effects of media literacy on learning experience ,2025-02-08 15:54:49,"[213271, 8868621, 36996622, 77231420, 62528977, 44759354, 267373, 128496322, 256706, 256108]",4dd111537a7a8d21a1ca912fe1d611ad,"[('44759354', 'works')]" 
2,test_user,44,media literacy skills,2025-02-08 15:55:11,"[156462828, 4762625, 84628214, 153245377, 155446828, 141526785, 43469067, 31614868, 62323511, 126958714]",73071860414ec67d0fc5f01d94884421,"[('62323511', 'works')]" 
3,test_user,44, ...
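
A minimal sketch of reading such a session file and grouping the rows into per-session interaction histories is shown below. It assumes that the file has no header row, that query strings contain no unquoted commas, and that the SERP and click columns are Python-literal strings as in the example above; `parse_sessions` is a hypothetical helper, not part of the provided framework.

```python
import ast
import csv
import io
from collections import defaultdict

# Two rows copied from the session file example above (no header row assumed).
SAMPLE = """1,test_user,44,effects of media literacy on learning experience ,2025-02-08 15:54:49,"[213271, 8868621, 36996622, 77231420, 62528977, 44759354, 267373, 128496322, 256706, 256108]",4dd111537a7a8d21a1ca912fe1d611ad,"[('44759354', 'works')]"
2,test_user,44,media literacy skills,2025-02-08 15:55:11,"[156462828, 4762625, 84628214, 153245377, 155446828, 141526785, 43469067, 31614868, 62323511, 126958714]",73071860414ec67d0fc5f01d94884421,"[('62323511', 'works')]"
"""


def parse_sessions(handle):
    """Group rows by (user, session number), keeping the original row order."""
    sessions = defaultdict(list)
    for row in csv.reader(handle):
        if len(row) != 8:
            continue  # skip truncated or malformed rows
        index, user, session, query, timestamp, serp, search_id, clicks = row
        sessions[(user, session)].append({
            "query": query.strip(),
            "timestamp": timestamp,
            # SERP and clicks are stored as Python-literal strings.
            "serp": ast.literal_eval(serp.strip()),
            "clicks": ast.literal_eval(clicks.strip()),
        })
    return dict(sessions)


sessions = parse_sessions(io.StringIO(SAMPLE))
history = sessions[("test_user", "44")]
print([turn["query"] for turn in history])
```

`ast.literal_eval` is used instead of `eval` so that only plain literals (lists, tuples, strings, numbers) are accepted from the file.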

Submission and Submission Format

Submissions will be handled via TIRA.io. Please visit the TIRA task page and register. For Task 3, we only accept run submissions. However, we encourage you to include a link to your GitHub repository in your run file under "Description". Please provide the results of your approach for both test datasets as well as for the training dataset. Note that for the training dataset, the last query in each session should not be used as input for prediction, as it is the ground truth.

In the TIRA UI, please upload your submission as a zip file containing three files: snapshot-1.json, snapshot-2.json, and snapshot-3.json. snapshot-1.json must contain the training results, while snapshot-2.json and snapshot-3.json must contain the results for the two test datasets. Specifically, snapshot-2.json must correspond to task3_longeval_usim-sessions-06-08_2025.csv, and snapshot-3.json must correspond to task3_longeval_usim-sessions-09-11_2025.csv.

In the following, we describe the expected submission file format.

team_name: Can be freely chosen
description: Provide a brief summary of the underlying approach. This can be a link to a repository, a prompt definition, or anything that helps explain your approach.
run_name: Should be meaningful and align with the naming used in your lab notes, though it can still be chosen individually.

For each session, predict 5 diverse query candidates that remain semantically similar to the original query. Rank them in descending order of confidence.

Run file example:

{
  "meta": {
    "team_name": "",
    "description": "",
    "run_name": ""
  },
  "1": [ "Q1", "Q2", "Q3", "Q4", "Q5" ],
  "2": [ "Q1", "Q2", "Q3", "Q4", "Q5" ],
  "...": [ "Q1", "Q2", "Q3", "Q4", "Q5" ],
  "45": [ "Q1", "Q2", "Q3", "Q4", "Q5" ]
}
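
The following sketch assembles a run in the layout of the run file example above and packs the three snapshot files into a zip archive. The helper names, the placeholder metadata, and the placeholder queries are assumptions for illustration; the check accepts between one and five candidates per session, which is one reading of the "up to 5" requirement.

```python
import io
import json
import zipfile


def build_run(predictions, team_name, description, run_name):
    """Assemble one run dictionary: a "meta" block plus one entry per session."""
    run = {"meta": {"team_name": team_name, "description": description, "run_name": run_name}}
    for session_id, candidates in predictions.items():
        if not 1 <= len(candidates) <= 5:
            raise ValueError(f"session {session_id}: expected 1 to 5 candidates")
        run[str(session_id)] = list(candidates)
    return run


def zip_runs(runs_by_snapshot):
    """Pack snapshot-1.json .. snapshot-3.json into an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for number in (1, 2, 3):
            payload = json.dumps(runs_by_snapshot[number], indent=2, ensure_ascii=False)
            archive.writestr(f"snapshot-{number}.json", payload)
    return buffer.getvalue()


# Placeholder predictions; a real run holds one entry per session and snapshot.
run = build_run(
    {44: ["media literacy education", "media literacy skills students"]},
    team_name="example-team",
    description="placeholder description",
    run_name="example-run",
)
archive_bytes = zip_runs({1: run, 2: run, 3: run})
```

Writing `archive_bytes` to a file ending in .zip yields an archive with the three expected file names; in a real submission each snapshot would of course carry its own predictions.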

Evaluation

We will evaluate the query candidates based on semantic similarity to the original, withheld query, and the redundancy among the candidates themselves. We will use the Rank-Diversity Score (RDS) from the Sim4IA-Bench suite.

We will measure differences in RDS across snapshots. At the end, we would like to see which user simulation or user modeling approach was most successful across the available snapshots.

Starting Points, Framework, and Synergies

To ensure an easy start, we offer an adapted version of the SimIIR 3.0 Framework, including sample simulators and video tutorials. However, the choice of framework remains entirely up to the participants.

For constructing the user simulator that performs the next-query prediction, we encourage participants not only to effectively leverage all provided signals but also to integrate supplementary or self-engineered features to enhance their simulator's predictive power. Feel free to try out strategies like personas or specific user attributes (e.g., degree, research field, or preferred interaction style). Different initial ideas can be drawn from the results of the Sim4IA Micro Shared Task 2025.

We will not provide topic descriptions for the sessions or direct relevance judgments, although inputs and overlaps from previous tasks are especially welcome. Participants engaged in Task 2 (LongEval-TopEx) are strongly encouraged to leverage their approach. This would allow them to build a comprehensive topic representation not only from a single query but also from multiple queries within a session, and to use this insight to enhance their query prediction simulator.

Task 4. LongEval-RAG: Retrieval Augmented Generation (RAG)

Description & data

Task 4 studies to what extent RAG systems cope with the evolution of scientific knowledge over time. To do that, we provide participants with a dataset composed of two parts:
i) A textual query (along with its query id) for which the participating system must provide a textual answer;
ii) A set of document ids, mined from the corpus, on which the answer generation for that query must be solely based (this set may be seen as the first-stage filtering in a RAG system). This set is not expected to consist only of documents relevant to the query, but it does contain some relevant documents. The provided documents are part of the dataset provided for Task 1.

Participants should generate an answer in the [TREC RAG run format](https://github.com/hltcoe/rag-run-validator) that contains links to the referenced documents used to generate the answer (chosen from the documents provided in ii).

To acquire the queries, please use the ir-datasets-longeval integration that is available on pypi. You can download the dataset and also the queries from the TU Wien Data Repository.

Though no training data are available for this task, we provide an example of 4 queries using the 2025 collection.

Baselines

We have created two baselines (one with an LLM, one that just concatenates the responses) in the longeval-code repository that you can use as a starting point.

Submission

Submissions will be handled via TIRA.io. Please visit the TIRA task page and register. We accept both uploads of generated responses and code submissions. In the TIRA UI, please upload your submission as a zip file containing two files: responses.jsonl and ir-metadata.yml. Runs must be submitted in the TREC RAG run format as jsonl. Every line should have the fields metadata, references, and answer with:

metadata: contains the team_id, the run_id, the type, the narrative (which is the query), and the narrative_id (i.e., the unique query id).
references: the list of documents that you cite in your response. These are the 10 candidate documents provided to generate the answer.
answer: the output of your system for the query. It consists of the text (the generated answer to the question/narrative, based on the referenced documents) and the citations (indices pointing to the documents in the references list that were actually used to support the answer; indices start from 0).

For instance, for the query "What operation dominates the cost of homomorphic AES evaluation, and how can hardware architectures reduce its impact?" with id "4" from the 2025 dataset, one output line could look as follows:

{
    "metadata": {
        "team_id": "organizer",
        "run_id": "fs_rank-qwen3-32b_ag-qwen3-32b",
        "type": "automatic",
        "narrative_id": "4",
        "narrative": "What operation dominates the cost of homomorphic AES evaluation, and how can hardware architectures reduce its impact?"
    },
    "references": [
        7964767, 
        66637364, 
        42856664, 
        31316619, 
        19943691, 
        19531678,
        156069243, 
        156069094, 
        150704088, 
        141753453
    ],
    "answer": [
        {
            "text": "Large-degree polynomial multiplication dominates the cost of homomorphic AES evaluation. Hardware architectures reduce its impact by decomposing coefficients using CRT, performing fast NTT-based multiplications in parallel, and minimizing modular conversion overhead through optimized arithmetic pipelines",
            "citations": [
                3,
                5
            ]
        }
    ]
}
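
To make the relationship between references and citations concrete, the sketch below builds one such record and checks that every citation is a valid 0-based index into the references list. `make_response`, the team and run identifiers, and the answer text are placeholders assumed for this example; the rag-run-validator linked above remains the authoritative check of the format.

```python
import json


def make_response(team_id, run_id, query_id, query, references, text, citations):
    """Build one response record with the metadata, references, and answer fields."""
    for index in citations:
        # Citations are 0-based positions in the references list.
        if not 0 <= index < len(references):
            raise ValueError(f"citation index {index} is outside the references list")
    return {
        "metadata": {
            "team_id": team_id,
            "run_id": run_id,
            "type": "automatic",
            "narrative_id": str(query_id),
            "narrative": query,
        },
        "references": list(references),
        "answer": [{"text": text, "citations": list(citations)}],
    }


record = make_response(
    team_id="example-team",  # placeholder
    run_id="example-run",    # placeholder
    query_id="4",
    query="What operation dominates the cost of homomorphic AES evaluation, "
          "and how can hardware architectures reduce its impact?",
    references=[7964767, 66637364, 42856664, 31316619, 19943691,
                19531678, 156069243, 156069094, 150704088, 141753453],
    text="Placeholder answer text.",
    citations=[3, 5],
)
line = json.dumps(record, ensure_ascii=False)  # one line of responses.jsonl
cited_docs = [record["references"][i] for i in record["answer"][0]["citations"]]
print(cited_docs)
```

With the reference list from the example above, the citation indices 3 and 5 resolve to the documents 31316619 and 19531678.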

Participants must also provide a description of each submitted system as a yml file (see the submission skeleton for a complete example).

Evaluation

The evaluation of a), the generated answer text, will be done using classical text generation evaluation measures, such as ROUGE or BERTScore. Apart from this, we plan to run an LLM-as-a-judge evaluation. The evaluation of b), the documents used to generate the answer, will be done by computing the overlap (e.g., set-overlap precision and recall) between the files used to generate the answer and the ground truth files.
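
As an illustration of the overlap computation, set-overlap precision and recall between the documents a system used and a ground-truth set can be sketched as follows. The function name and the document IDs are assumptions for this example, and the official evaluation may differ in detail (for instance in how empty sets are handled).

```python
def overlap_precision_recall(cited, ground_truth):
    """Set-overlap precision and recall between cited and ground-truth document IDs."""
    cited_set, truth_set = set(cited), set(ground_truth)
    hits = len(cited_set & truth_set)
    # An empty set yields 0.0 here; this convention is an assumption.
    precision = hits / len(cited_set) if cited_set else 0.0
    recall = hits / len(truth_set) if truth_set else 0.0
    return precision, recall


# Hypothetical IDs: two cited documents, one of which is among three ground-truth documents.
precision, recall = overlap_precision_recall(
    [31316619, 19531678], [31316619, 7964767, 66637364]
)
print(precision, recall)
```

Here one of the two cited documents is correct, giving a precision of 0.5, and one of the three ground-truth documents is found, giving a recall of one third.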
