TREC 2026

Research Agenda

Motivation

User simulation offers a scalable alternative to expensive user studies for evaluating interactive search. However, the community lacks standardized methods for validating simulators and trusting their results, which hinders their widespread adoption and slows progress.

Goal

To establish a systematic framework for validating user simulators, understand the criteria for what makes a simulator “good enough,” and create best practices for simulation-based evaluation.

Key Research Questions

  • How can we systematically validate user simulators and quantify their fidelity to real user behavior?
  • What are the criteria for determining if a user simulator is “good enough” to produce reliable system evaluations?
  • How well do current user simulation techniques generalize across different conversational search systems and task types?
  • What are the key dimensions of human search behavior that simulators need to capture for effective evaluation?

Expected Outcomes and Learning

  • The development of best practices, standardized metrics, and shared resources for creating and validating user simulators.
  • Insights into the methodological challenges of simulation-based evaluation.
  • A deeper understanding of the strengths and weaknesses of different user simulation approaches.
  • A clearer roadmap on the feasibility of augmenting or replacing expensive user studies with simulation, potentially transforming how interactive IR systems are evaluated.

Methodology

General Framework

To simulate a user's behavior within an interactive system effectively, the configuration variables that influence this behavior must be defined:

  • Task (T): The task the user is trying to accomplish.
  • System (S): The system's functionality, user interface, resources (such as data sources), and overall usability and support for the task, which together dictate the set of actions the user can perform.
  • User (U): The individual user, characterized by a profile or persona covering attributes such as age, technical proficiency, preferences, and cognitive style.

With these variables defined, user simulation can be stated as the following computational problem:

Given the variables T, S, and U, the goal is to create an agent that can simulate every action that user U may take when attempting to complete task T using system S.
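
As a concrete sketch, the three configuration variables and the simulator's contract could be typed as follows. This is a hypothetical illustration only; the `Task`, `System`, `User`, and `UserSimulator` names and fields are our assumptions, not definitions prescribed by the track.

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical types for the framework's configuration variables.
# All names and fields are illustrative assumptions.

@dataclass
class Task:
    description: str            # T: what the user is trying to accomplish

@dataclass
class System:
    actions: list[str]          # S: the set of actions the user can perform

@dataclass
class User:
    persona: dict[str, str]     # U: profile attributes (age, proficiency, ...)

class UserSimulator(Protocol):
    """An agent that, given T, S, and U, emits the user's next action."""

    def next_action(self, task: Task, system: System, user: User,
                    history: list[str]) -> str:
        """Return the next action user U would take toward task T on system S,
        given the interaction history so far."""
        ...
```

Framing the simulator as an agent with a single `next_action` step makes both turn-level and session-level simulation special cases of the same contract.
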

General Evaluation Dimensions

To assess how well a simulator solves this computational problem, we investigate three main types of evaluation:

  • Qualitative Evaluation: Measures the perceived human-likeness and realism of the simulated interactions through direct human assessment.
  • Behavioral Metrics: Measures how closely the simulated interaction patterns and variance match the quantitative data found in real human usage logs.
  • Outcome Metrics: Measures the final result of the interaction (e.g., task success or efficiency) at the conversation level, to determine whether simulators reach the same outcomes as humans.
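
To make the outcome dimension concrete, one simple outcome metric is the gap between human and simulated task-success rates. The sketch below is a minimal illustration; the per-session `success` flag and the session format are assumptions, not a track specification.

```python
# Illustrative outcome-metric sketch: compare task-success rates between
# human and simulated sessions. The session format is an assumption.

def success_rate(sessions: list[dict]) -> float:
    """Fraction of sessions that ended in task success."""
    return sum(1 for s in sessions if s["success"]) / len(sessions)

# Invented example data, for illustration only.
human_sessions = [{"success": True}, {"success": False}, {"success": True}]
sim_sessions = [{"success": True}, {"success": True}, {"success": False}]

# A small absolute gap suggests the simulator reaches human-like outcomes.
outcome_gap = abs(success_rate(human_sessions) - success_rate(sim_sessions))
```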

The First Edition (Year 1)

Objectives

The primary focus of the first edition is to lay the groundwork for long-term user simulation research. Measurable success criteria include:

  • Setting up a robust, standardized infrastructure for simulator-agent interactions.
  • Collecting high-quality baseline data and establishing annotation guidelines for human-likeness.
  • Developing initial metrics and baselines for evaluating the quality and fidelity of user simulators on a constrained conversational task.

Tasks

For the first year, we focus on conversational data search, specifically simulating a researcher looking for datasets to help answer a research question. We define this task as an instantiation of our general framework as follows:

  • Task (T): Find datasets to help answer a research question.
  • System (S): Conversational search system with a chat-based UI.
  • User (U): Researcher.

Participants will tackle this scenario through two specific challenges:

  • Task 1: Turn-level Next Utterance Prediction: Given a partial conversation history and an initial information need, predict the user’s immediate next utterance (and potential dialogue acts). Focus: Local conversational coherence and reactive behavioral realism.
  • Task 2: Session-level End-to-End Conversation Generation: Given an initial information need, generate a complete, multi-turn interaction until the simulator decides the need is satisfied or the search should be abandoned. Focus: High-level planning, persistence, and strategic goal achievement.
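
The two challenges imply different simulator interfaces: Task 1 needs only a next-utterance predictor, while Task 2 additionally needs a stopping decision. The sketch below illustrates this split; the class and method names are our assumptions, not the official track API, and the implementations are placeholders.

```python
# Hypothetical interfaces for the two Year-1 tasks; signatures are
# illustrative assumptions, not the official track API.

class TurnLevelSimulator:
    """Task 1: predict the user's immediate next utterance."""

    def next_utterance(self, info_need: str,
                       history: list[tuple[str, str]]) -> str:
        """Given the information need and a partial (speaker, text) history,
        return the user's next utterance. Placeholder implementation."""
        return f"Can you narrow the results for: {info_need}?"

class SessionLevelSimulator(TurnLevelSimulator):
    """Task 2: drive a whole session, deciding when to stop."""

    def should_stop(self, history: list[tuple[str, str]]) -> bool:
        """Decide whether the need is satisfied or the search should be
        abandoned. Placeholder: stop after five user turns."""
        user_turns = sum(1 for speaker, _ in history if speaker == "user")
        return user_turns >= 5
```

Having Task 2 extend Task 1 reflects the framing above: session-level generation is repeated next-utterance prediction plus high-level planning about when to stop.
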

Methodology

We will use a standardized API setup in which participant-developed user simulators communicate directly with a set of provided conversational search agents.
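
Under such a setup, a session reduces to a message-exchange loop between simulator and agent. The following is a minimal sketch assuming callable simulator and agent objects; the concrete transport (e.g., HTTP) and the exact interfaces are left to the track infrastructure and are assumptions here.

```python
def run_session(simulator, agent, info_need: str, max_turns: int = 20):
    """Alternate simulator and agent turns until the simulator stops or the
    turn budget is exhausted. Assumed (unofficial) interfaces:
      simulator.next_utterance(info_need, history) -> str or None (None = stop)
      agent.respond(history) -> str
    """
    history: list[tuple[str, str]] = []
    for _ in range(max_turns):
        utterance = simulator.next_utterance(info_need, history)
        if utterance is None:  # simulator decides the session is over
            break
        history.append(("user", utterance))
        history.append(("agent", agent.respond(history)))
    return history
```

A hard `max_turns` cap is a practical safeguard: it bounds cost when a simulator never decides to stop.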

First Edition Evaluation Focus

In year one, we are intentionally deferring Outcome Metrics to focus heavily on the mechanics of the conversation itself. We will evaluate submissions using:

  • Qualitative Evaluation (Human Raters):
    • We will conduct a Turing-style test in which trained annotators are presented with paired dialogues (one human, one simulated) and must guess which one is simulated.
    • We intend to identify qualitative patterns and specific points of failure beyond simple binary judgments. We hypothesize that current simulators will struggle to pass this test, and this evaluation will help us pinpoint exactly where and how they exhibit non-realistic behavior.
  • Behavioral Metrics (Quantitative Log Comparison):
    • We will quantitatively compare the logs of simulated dialogues against real human usage logs. This involves analyzing the distributions of key interaction behaviors (e.g., query length, turn count, clarification requests) and utilizing dialogue act annotations to model and compare semantic intent.
    • We intend to determine if simulators can accurately reproduce the prototypical interaction patterns and natural variance found in human dialogues, revealing how closely simulated behavior mirrors actual human search strategies.
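
As one concrete instance of such a log comparison, the distributions of an interaction statistic (say, turn count) can be compared with Jensen-Shannon divergence. The sketch below uses only the standard library; the example data is invented for illustration.

```python
import math
from collections import Counter

def distribution(values):
    """Empirical distribution of a discrete interaction statistic."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the result lies in [0, 1])."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Invented turn counts per dialogue, for illustration only.
human_turns = [4, 5, 5, 6, 7, 5]
sim_turns = [5, 5, 5, 5, 5, 5]
score = js_divergence(distribution(human_turns), distribution(sim_turns))
```

A low divergence indicates that the simulator reproduces not just the typical value of the statistic but also its natural variance; the collapsed `sim_turns` above is exactly the failure mode this metric is meant to expose.
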

Specific Outcomes

By the conclusion of the first edition, we aim to produce:

  • A dataset of conversational data search interactions with both human and simulated users.
  • A learned model (such as an LLM-as-a-judge) trained on our annotated data that can automatically predict a human-likeness score for new simulated dialogues.
  • Reusable, open-source resources for evaluating (1) a new simulator on this specific task, and (2) a new conversational agent against a validated pool of submitted simulators.