Tasks

LongEval 2026 continues the Sci-Retrieval task from 2025 and introduces three new tasks: LongEval-TopEx (Topic Extraction), LongEval-USim (User Simulation), and LongEval-RAG (Retrieval-Augmented Generation). Short descriptions, data sources and evaluation procedures for each task appear below. To receive announcements and participate in discussions, please subscribe to our Google Group or join the project Slack at longeval.slack.com.

Task 1. LongEval-Sci: Ad-Hoc Scientific Retrieval

Description

Build IR systems that maintain retrieval effectiveness over time as the scientific collection evolves. Participants submit runs for multiple time-stamped snapshots.

Data

CORE documents and queries across multiple snapshots (~4M documents, ~1k queries): two training snapshots (e.g. 2024-11, 2025-01) and two test snapshots for assessing short- and long-term persistence.

Evaluation

nDCG per snapshot; robustness across snapshots is measured by Relative Improvement (RI), Delta Relative Improvement (DRI), and Effect Ratio (ER).
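
For concreteness, below is a minimal sketch of the robustness measures, assuming the definitions used in the reproducibility literature: RI compares an advanced run against a baseline on one snapshot, ER relates mean per-topic improvements across snapshots, and DRI is the difference of the two RI values. The run scores are invented; the official guidelines define the exact formulas.

```python
import numpy as np

def relative_improvement(adv, base):
    """RI: relative gain of an advanced run over a baseline in mean nDCG."""
    return (np.mean(adv) - np.mean(base)) / np.mean(base)

def effect_ratio(adv_t1, base_t1, adv_t0, base_t0):
    """ER: mean per-topic improvement on the later snapshot (t1)
    divided by the mean per-topic improvement on the earlier one (t0)."""
    return np.mean(adv_t1 - base_t1) / np.mean(adv_t0 - base_t0)

def delta_relative_improvement(adv_t1, base_t1, adv_t0, base_t0):
    """DRI: RI on the earlier snapshot minus RI on the later snapshot."""
    return (relative_improvement(adv_t0, base_t0)
            - relative_improvement(adv_t1, base_t1))

# Hypothetical per-topic nDCG scores for a baseline and an advanced run
# on two snapshots (t0 = training period, t1 = test period).
base_t0 = np.array([0.40, 0.55, 0.30])
adv_t0  = np.array([0.45, 0.60, 0.35])
base_t1 = np.array([0.38, 0.50, 0.28])
adv_t1  = np.array([0.42, 0.57, 0.33])

print(relative_improvement(adv_t1, base_t1))
print(effect_ratio(adv_t1, base_t1, adv_t0, base_t0))
print(delta_relative_improvement(adv_t1, base_t1, adv_t0, base_t0))
```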

Task 2. LongEval-TopEx: Topic Extraction From Query Logs

Description

Extract TREC-style topics (query + description + narrative) from query logs, formalizing the information needs behind observed usage.
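
As an illustration, a topic in this three-field format might look as follows (all content invented):

```python
# Hypothetical example of the three-field, TREC-style topic to extract.
topic = {
    "query": "graphene battery anodes",
    "description": "Find publications on using graphene as an anode "
                   "material in lithium-ion batteries.",
    "narrative": "Relevant documents report experiments or reviews on "
                 "graphene-based anodes; documents on other electrode "
                 "materials are not relevant.",
}
```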

Data

Training snapshot at time t. A top-10 pool is built from Task-1 runs for the overlapping training queries and annotated with multiple LLM-based relevance assessors to obtain alternative qrels per topic.
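
A minimal pooling sketch, assuming Task-1 runs in standard TREC run format (qid Q0 docid rank score tag); the file names and pool depth below are placeholders.

```python
from collections import defaultdict
from pathlib import Path

def build_pool(run_files, depth=10):
    """Union of the top-`depth` documents per query across all runs."""
    pool = defaultdict(set)
    for run_file in run_files:
        ranked = defaultdict(list)
        for line in Path(run_file).read_text().splitlines():
            # Assumes well-formed TREC run lines with six fields.
            qid, _, docid, rank, _score, _tag = line.split()
            ranked[qid].append((int(rank), docid))
        for qid, docs in ranked.items():
            for _, docid in sorted(docs)[:depth]:
                pool[qid].add(docid)
    return pool

# Hypothetical Task-1 run files restricted to the overlapping training queries.
pool = build_pool(["run_bm25.txt", "run_dense.txt"], depth=10)
```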

Evaluation

Topics are assessed on Alignment (agreement with click-log qrels), Distinguishability (ability to separate runs via nDCG), and Clarity (consistency across LLM annotations). Short- and long-term drops are also measured. The extracted topics feed into Task 3.
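
One plausible reading of Distinguishability, sketched below purely as an assumption: a topic's qrels are more useful the more run pairs they separate by a paired significance test over per-topic nDCG. The official protocol may compute this differently.

```python
from itertools import combinations
from scipy import stats

def distinguishability(ndcg_per_run, alpha=0.05):
    """Fraction of run pairs separated by a paired t-test over per-topic nDCG.

    ndcg_per_run: dict mapping run name -> list of per-topic nDCG scores
    (same topic order for every run). Illustrative proxy only.
    """
    pairs = list(combinations(ndcg_per_run, 2))
    separated = sum(
        stats.ttest_rel(ndcg_per_run[a], ndcg_per_run[b]).pvalue < alpha
        for a, b in pairs
    )
    return separated / len(pairs) if pairs else 0.0
```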

Task 3. LongEval-USim: User Simulation

Description

Model user behaviors and query reformulations from CORE search sessions to predict the next query in a sequence.

Data & setup

Pre-filtered session log excerpts (search_id, uid, queries, SERPs, clicks). A pre-configured SimIIR v3 environment is provided for simulations; Task-2 topics may be used as context.
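
For orientation, a single session record with the fields listed above might look like this (all values invented; the released excerpts define the actual schema):

```python
# Hypothetical session-log record; field names follow the list above,
# the exact schema of the released excerpts may differ.
session = {
    "search_id": "s_000123",
    "uid": "u_42",
    "queries": ["graphene battery", "graphene anode lithium"],
    "serps": [
        ["core:1001", "core:1002", "core:1003"],
        ["core:2001", "core:1002", "core:2003"],
    ],
    "clicks": [["core:1002"], ["core:2001"]],
}
```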

Task

Produce simulated queries or user-model configurations that replicate observed session behavior; train on historical logs and evaluate on held-out sessions.
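
As a toy baseline for the prediction task, assuming sessions are available as ordered lists of query strings, one could predict the next query from reformulation transitions observed in the training logs. This is an illustrative sketch, not a required approach.

```python
from collections import Counter, defaultdict

def train_transitions(sessions):
    """Count query -> next-query transitions in training sessions."""
    transitions = defaultdict(Counter)
    for queries in sessions:
        for prev_q, next_q in zip(queries, queries[1:]):
            transitions[prev_q][next_q] += 1
    return transitions

def predict_next(transitions, current_query):
    """Return the most frequent follow-up query, or None if unseen."""
    follow_ups = transitions.get(current_query)
    return follow_ups.most_common(1)[0][0] if follow_ups else None

# Toy data: each session is an ordered list of query strings.
train_sessions = [
    ["graphene battery", "graphene anode", "graphene anode lithium"],
    ["graphene battery", "graphene anode"],
]
model = train_transitions(train_sessions)
print(predict_next(model, "graphene battery"))  # -> "graphene anode"
```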

Evaluation

Simulated and real queries/interactions are compared in terms of indistinguishability and performance prediction, using reproducibility measures and tracking short- and long-term simulator quality.
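
One common way to operationalise indistinguishability, given here only as an assumed proxy rather than the official metric, is to train a classifier to separate real from simulated queries: cross-validated accuracy near chance suggests the simulated output is hard to tell apart from real behavior.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def indistinguishability_score(real_queries, simulated_queries):
    """Cross-validated accuracy of a real-vs-simulated query classifier.

    Values close to 0.5 indicate the simulated queries are hard to
    distinguish from real ones (illustrative proxy only).
    """
    texts = real_queries + simulated_queries
    labels = [0] * len(real_queries) + [1] * len(simulated_queries)
    features = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, labels, cv=5).mean()
```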

Task 4. LongEval-RAG: Retrieval-Augmented Generation

Description & data

Evaluate RAG systems' ability to retrieve time-aware evidence and generate correct answers as scientific information evolves. Use short- and long-term CORE collections; queries target temporal changes or contradictions.

Submission

For each query, submit the generated response and the supporting documents/passages (with their creation, publication, or update times).
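
A possible shape for a single submission record, assuming a JSON-like format; all field names and values are placeholders until the official submission format is published.

```python
# Hypothetical submission record for one query (field names are placeholders).
submission_entry = {
    "query_id": "q_017",
    "response": "Recent studies report that ...",
    "evidence": [
        {
            "doc_id": "core:1002",
            "passage": "We observe a 12% capacity increase ...",
            "published": "2024-11-03",
            "updated": "2025-01-15",
        },
    ],
}
```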

Evaluation

Manual scoring on Answer Relevancy and Faithfulness (weighted equally).
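
With equal weights, the overall score reduces to the average of the two criteria (the grading scale and aggregation details are set by the organisers):

```python
def overall_score(answer_relevancy, faithfulness):
    """Equal-weight combination of the two manual criteria (illustrative)."""
    return 0.5 * answer_relevancy + 0.5 * faithfulness

print(overall_score(0.8, 0.6))  # -> 0.7
```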