SimIIR Studio

Introduction to SimIIR Studio

Welcome to SimIIR Studio! This guide will teach you how to create and run search simulations step by step. A simulation lets you test different search strategies and behaviors to see what works best.

Think of it like this: instead of having real people search for information, you create a computer program that simulates how people search. You can then test different approaches, change parameters, and see the results quickly without needing actual users.

What you'll learn in this guide

• What a simulation is and why it's useful
• How to create simulation configuration files
• How to set up user configurations
• How to use and create components
• How the search process works
• How to put it all together in a working simulation

What is a Simulation?

A simulation in SimIIR is like a virtual search experiment. It consists of three main parts working together:

Simulated Users

Virtual users that perform searches. Each user has a goal (topic) and behaves according to their configuration.

Search Engine

The system that finds documents. This could be a real search engine or a simulated one with pre-loaded results.

Components

Building blocks that control user behavior: how they generate queries, decide which results to click, and when to stop.

Real-world example

Imagine you want to test if using AI to generate search queries works better than random queries. You'd create a simulation where virtual users search for topics like "climate change" or "machine learning". Some users use AI-generated queries, others use random queries. After running the simulation, you compare which approach found more relevant documents.

Creating Simulation Files

A simulation file is an XML document that defines your entire experiment. It tells SimIIR what you want to test, which users to simulate, what search engine to use, and where to save the results.

Basic Structure

Simple Simulation File Example

This is a basic simulation configuration

<?xml version="1.0" encoding="UTF-8"?>
<simulation>
  <!-- Choose where results are saved and what gets recorded -->
  <output>
    <baseDir>output/my_first_simulation</baseDir>
    <saveInteraction>true</saveInteraction>
    <saveLog>true</saveLog>
  </output>

  <!-- Define which users will participate -->
  <users>
    <user>
      <configFile>users/basic_user.xml</configFile>
      <topics>
        <topic>301</topic>
        <topic>310</topic>
        <topic>320</topic>
      </topics>
    </user>
  </users>

  <!-- Configure the search engine -->
  <searchInterface>
    <interface>whooshtrec</interface>
    <indexPath>example_data/index_CORE</indexPath>
  </searchInterface>

  <!-- Set how many times to run each topic -->
  <runs>1</runs>
</simulation>

Understanding Each Section

Output Settings

This section controls where and what gets saved:

baseDir

Where to save all results (logs, queries, interactions)

saveInteraction

Save user interactions (which results were clicked and how long they were viewed)

saveLog

Save detailed logs for debugging

Users Section

Define which simulated users participate and what they search for:

configFile

Path to the user configuration file (we'll create this next)

topics

Topics (search goals) for this user. Each topic is like "find information about X"

Search Engine Configuration

Configure which search system to use:

interface

Type of search engine (whooshtrec, terrier, etc.)

indexPath

Path to your document index (pre-built collection of documents)
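
To make the structure concrete, here is a minimal sketch that reads a simulation file with Python's standard xml.etree.ElementTree module and summarizes its settings. The element names come from the example above; the helper name summarize_simulation is hypothetical and is not part of SimIIR.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example simulation file from this guide.
EXAMPLE_SIMULATION = """
<simulation>
  <output>
    <baseDir>output/my_first_simulation</baseDir>
    <saveInteraction>true</saveInteraction>
    <saveLog>true</saveLog>
  </output>
  <users>
    <user>
      <configFile>users/basic_user.xml</configFile>
      <topics>
        <topic>301</topic>
        <topic>310</topic>
        <topic>320</topic>
      </topics>
    </user>
  </users>
  <searchInterface>
    <interface>whooshtrec</interface>
    <indexPath>example_data/index_CORE</indexPath>
  </searchInterface>
  <runs>1</runs>
</simulation>
"""


def summarize_simulation(xml_text):
    """Return the key settings of a simulation file as a plain dict."""
    root = ET.fromstring(xml_text.strip())
    users = [
        {
            "config_file": user.findtext("configFile"),
            "topics": [topic.text for topic in user.findall("topics/topic")],
        }
        for user in root.findall("users/user")
    ]
    return {
        "base_dir": root.findtext("output/baseDir"),
        "save_interaction": root.findtext("output/saveInteraction") == "true",
        "save_log": root.findtext("output/saveLog") == "true",
        "interface": root.findtext("searchInterface/interface"),
        "index_path": root.findtext("searchInterface/indexPath"),
        "runs": int(root.findtext("runs", default="1")),
        "users": users,
    }


summary = summarize_simulation(EXAMPLE_SIMULATION)
print(summary["interface"], summary["runs"], summary["users"][0]["topics"])
```

A quick summary like this is a cheap way to catch a mistyped path or a missing topic before starting a long run.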

User Configurations

User configuration files define how your simulated users behave. Each user has components that control different aspects of their search behavior.

Basic User Configuration

Example: basic_user.xml

<?xml version="1.0" encoding="UTF-8"?>
<user>
  <!-- User's search goal and any predefined queries -->
  <searchContext>
    <topicFile>example_data/topics/{topic}.txt</topicFile>
    <queryListFile>example_data/topics/{topic}.queries</queryListFile>
  </searchContext>

  <!-- How to generate queries -->
  <queryGenerator>
    <class>simiir.user.query_generators.SmartQueryGenerator</class>
    <params>
      <maxTerms>3</maxTerms>
      <minTerms>1</minTerms>
    </params>
  </queryGenerator>

  <!-- How to decide what to click -->
  <documentClassifier>
    <class>simiir.user.classifiers.ProbabilisticClassifier</class>
    <params>
      <clickProbsFile>example_data/probs/click.prob</clickProbsFile>
    </params>
  </documentClassifier>

  <!-- When to stop searching -->
  <stoppingDecisionMaker>
    <class>simiir.user.stopping.FixedDepthStoppingStrategy</class>
    <params>
      <depth>5</depth>
    </params>
  </stoppingDecisionMaker>
</user>

User Configuration Components

Query Generator

Controls how search queries are created. Options include:

• Smart: Uses topic words intelligently
• Random: Picks random topic terms
• LLM-based: Uses AI models to generate queries

Document Classifier

Decides which search results to click on:

• Probabilistic: Click based on position
• Relevance-based: Click if relevant
• Perfect: Always click relevant docs
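
As an illustration of the "Probabilistic" option, a position-based click model can be sketched in a few lines. The probabilities and the function name below are invented for illustration; they are not the contents of click.prob and not SimIIR's classifier API.

```python
import random

# Hypothetical click probabilities by rank (index 0 = top result).
CLICK_PROB_BY_RANK = [0.6, 0.4, 0.25, 0.15, 0.1]


def simulate_clicks(num_results, rng, probs=CLICK_PROB_BY_RANK):
    """Return the 0-based ranks that a simulated user clicks."""
    clicked = []
    for rank in range(num_results):
        # Ranks beyond the end of the table reuse the last probability.
        p = probs[min(rank, len(probs) - 1)]
        if rng.random() < p:
            clicked.append(rank)
    return clicked


rng = random.Random(42)  # fixed seed so the run is repeatable
print(simulate_clicks(10, rng))
```

Passing the random generator in explicitly keeps the behavior reproducible, which matters when you compare runs.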

Stopping Strategy

Determines when to stop searching:

• Fixed Depth: Stop after N queries
• Satisfaction: Stop when enough good results found
• Frustration: Stop if no good results
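
The three stopping rules above can each be reduced to a small yes/no check. The sketch below follows the descriptions in this list; the function names and parameters (depth, target, patience) are hypothetical and do not mirror SimIIR's own classes.

```python
def fixed_depth_stop(queries_issued, depth):
    """Stop once the user has issued `depth` queries."""
    return queries_issued >= depth


def satisfaction_stop(relevant_found, target):
    """Stop once enough relevant results have been found."""
    return relevant_found >= target


def frustration_stop(relevant_per_query, patience):
    """Stop after `patience` consecutive queries with no relevant result."""
    if patience < 1:
        raise ValueError("patience must be at least 1")
    if len(relevant_per_query) < patience:
        return False
    return all(count == 0 for count in relevant_per_query[-patience:])


# One relevant hit on the first query, then two empty queries in a row.
history = [1, 0, 0]
print(fixed_depth_stop(len(history), depth=5))
print(satisfaction_stop(sum(history), target=3))
print(frustration_stop(history, patience=2))
```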

Search Context

Provides context for the search:

• Topic files (what to search for)
• Query lists (predefined queries)
• Background knowledge
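
Before running a simulation, it can help to check that a user configuration names all four components. The sketch below assumes, based only on the examples in this guide, that every user file needs the four sections shown and that each behavior component needs a class entry. It also assumes the {topic} placeholder is filled in by plain text substitution. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sections that appear in the example user files in this guide.
REQUIRED_SECTIONS = [
    "searchContext",
    "queryGenerator",
    "documentClassifier",
    "stoppingDecisionMaker",
]
# Sections whose behavior is chosen through a <class> entry.
CLASS_SECTIONS = REQUIRED_SECTIONS[1:]


def check_user_config(xml_text):
    """Return a list of human-readable problems (empty if none found)."""
    root = ET.fromstring(xml_text.strip())
    problems = []
    for name in REQUIRED_SECTIONS:
        if root.find(name) is None:
            problems.append(f"missing <{name}>")
    for name in CLASS_SECTIONS:
        for element in root.findall(name):
            if not (element.findtext("class") or "").strip():
                problems.append(f"<{name}> has no <class>")
    return problems


def resolve_topic_path(template, topic):
    """Fill the {topic} placeholder (assumed to be simple substitution)."""
    return template.replace("{topic}", str(topic))


incomplete = "<user><queryGenerator><params/></queryGenerator></user>"
print(check_user_config(incomplete))
print(resolve_topic_path("example_data/topics/{topic}.txt", 301))
```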

Components

Components are Python classes that define specific behaviors. You can use built-in components or create your own.

Creating a Custom Component

Custom Query Generator Example

import random

from simiir.user.query_generators.base import QueryGenerator


class MyCustomQueryGenerator(QueryGenerator):
    """
    A custom query generator that combines topic words
    with trending terms
    """

    def __init__(self, topic, stopword_file,
                 max_terms=3, trending_terms=None):
        super().__init__(topic, stopword_file)
        self.max_terms = max_terms
        self.trending_terms = trending_terms or []

    def generate_query(self):
        """
        Generate a query by mixing topic words
        with trending terms
        """
        # Get words from the topic
        topic_words = self.get_topic_words()

        # Reserve one slot for a trending term only if one is available
        reserved = 1 if self.trending_terms else 0
        query_terms = self.pick_random_terms(
            topic_words,
            self.max_terms - reserved
        )

        # Add a trending term if available
        if self.trending_terms:
            query_terms.append(random.choice(self.trending_terms))

        # Join into a query string
        return ' '.join(query_terms)

    def get_topic_words(self):
        """Extract words from the topic text"""
        # Assumes the base class provides self.topic.content and self.stopwords
        words = self.topic.content.split()
        # Remove stopwords
        return [w for w in words
                if w.lower() not in self.stopwords]

    def pick_random_terms(self, terms, count):
        """Randomly pick up to N terms from the list"""
        # Clamp so random.sample never gets a negative or oversized count
        count = max(0, min(count, len(terms)))
        return random.sample(terms, count)

Using Your Custom Component

Once you've created a component, you can use it in your user configuration by specifying its class path:

<queryGenerator>
  <class>my_components.MyCustomQueryGenerator</class>
  <params>
    <maxTerms>3</maxTerms>
    <trendingTerms>
      <term>AI</term>
      <term>machine learning</term>
      <term>neural networks</term>
    </trendingTerms>
  </params>
</queryGenerator>
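
This guide does not spell out how the entries under params reach your component's constructor. As a rough mental model only, the sketch below assumes camelCase element names become snake_case keyword arguments and that nested elements become a list. The helper names are hypothetical, and the real loader may work differently.

```python
import re
import xml.etree.ElementTree as ET

COMPONENT_XML = """
<queryGenerator>
  <class>my_components.MyCustomQueryGenerator</class>
  <params>
    <maxTerms>3</maxTerms>
    <trendingTerms>
      <term>AI</term>
      <term>machine learning</term>
      <term>neural networks</term>
    </trendingTerms>
  </params>
</queryGenerator>
"""


def camel_to_snake(name):
    """Convert 'maxTerms' to 'max_terms'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def params_to_kwargs(component_xml):
    """Return (class path, keyword arguments) for one component element."""
    root = ET.fromstring(component_xml.strip())
    kwargs = {}
    for param in root.findall("params/*"):
        key = camel_to_snake(param.tag)
        children = list(param)
        if children:
            # Nested elements such as <term> are collected into a list.
            kwargs[key] = [child.text for child in children]
        else:
            text = (param.text or "").strip()
            kwargs[key] = int(text) if text.isdigit() else text
    return root.findtext("class"), kwargs


class_path, kwargs = params_to_kwargs(COMPONENT_XML)
print(class_path, kwargs)
```

Under these assumptions, maxTerms and trendingTerms line up with the max_terms and trending_terms arguments of the constructor shown earlier.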

Using the Component Studio

SimIIR Studio includes a Component Studio where you can create, test, and manage your custom components through a user-friendly interface. You can write code, test it, and deploy it directly to your simulations.

The Search Process

Understanding how a simulation runs helps you design better experiments. Here's what happens when you run a simulation:

Initialization

SimIIR loads your simulation file, creates the users, connects to the search engine, and prepares everything to run. Each user gets their topic (e.g., "Find information about climate change").

Query Generation

The user's Query Generator component creates a search query. This might be as simple as picking words from the topic, or as complex as using an AI model to generate an intelligent query.

Search Execution

The query is sent to the search engine, which returns a ranked list of documents. The search engine uses its own algorithms (BM25, language models, etc.) to find relevant documents.

Result Examination

The user looks at the search results. The Document Classifier component decides which results look interesting and worth clicking. This mimics how real users scan search results and choose what to click.

Document Interaction

The user "reads" the clicked documents. The system records which documents were clicked, how long they were viewed, and whether they were relevant. This data is crucial for evaluating search performance.

Stopping Decision

The Stopping Decision Maker component checks whether the user should continue or stop. The user might stop because they found enough good results, became frustrated, or hit a query limit. If they continue, the process returns to Query Generation (step 2).

Save Results

When the user stops, SimIIR saves everything: all queries issued, results clicked, time spent, and relevant documents found. These logs let you analyze what happened and evaluate performance.

The Complete Cycle

Steps 2-6 (Query Generation through Stopping Decision) repeat in a loop until the user decides to stop. A typical simulation might have a user issue 5-10 queries, click 10-30 documents, and spend several simulated minutes searching before deciding they have found enough information or giving up.
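
The cycle can be sketched as a small loop. This is a toy model with an invented four-document collection and made-up helper names. It is not SimIIR's implementation, but it follows the same order of steps.

```python
import random

# Toy collection: doc id -> (text, is_relevant). Invented data.
DOCS = {
    "d1": ("climate change causes", True),
    "d2": ("climate policy debate", True),
    "d3": ("sports news today", False),
    "d4": ("change management tips", False),
}
TOPIC_TERMS = ["climate", "change", "policy"]
CLICK_PROBABILITY = 0.7  # hypothetical chance of clicking any result


def generate_query(rng):
    """Query generation: pick one topic term at random."""
    return rng.choice(TOPIC_TERMS)


def search(query):
    """Search execution: return ids of documents containing the term."""
    return sorted(d for d, (text, _) in DOCS.items() if query in text.split())


def run_session(max_queries, rng):
    """Run one simulated search session and return its log."""
    log = {"queries": [], "clicked": [], "relevant_found": set()}
    while True:
        query = generate_query(rng)                # step 2
        log["queries"].append(query)
        results = search(query)                    # step 3
        for doc_id in results:                     # step 4: examine results
            if rng.random() < CLICK_PROBABILITY:
                log["clicked"].append(doc_id)      # step 5: interact
                if DOCS[doc_id][1]:
                    log["relevant_found"].add(doc_id)
        if len(log["queries"]) >= max_queries:     # step 6: stopping decision
            break
    return log                                     # step 7: save results


session = run_session(max_queries=3, rng=random.Random(0))
print(session["queries"], session["clicked"], sorted(session["relevant_found"]))
```

Swapping out generate_query or the stopping check here corresponds to choosing a different component in the user configuration.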

Using Global Variables

Global Variables allow you to securely store API keys and configuration settings that can be accessed by your custom SimIIR components. This is particularly useful when integrating with external LLM providers like OpenAI, Anthropic, Cohere, or Google Vertex AI.

Setting up Global Variables

To configure your API keys and global variables:

1. Click on your profile icon in the navigation bar
2. Select "Profile & Settings" from the dropdown
3. Navigate to the "Global Variables" tab
4. Add your API keys for the LLM providers you want to use

Accessing Variables in Custom Components

When creating custom SimIIR components, you can access global variables as environment variables. The system automatically injects your configured variables into the component execution environment.

Example: Using LangChain with Global Variables

Here's an example of a custom query generator that uses the LangChain wrapper to access OpenAI via your stored API key:

Python Example

import os

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import PromptTemplate
# Deprecated shim in langchain-core 0.3 and later; import BaseModel and
# Field from pydantic there instead
from langchain_core.pydantic_v1 import BaseModel, Field
from simiir.user.utils.langchain_wrapper import LangChainWrapper


class QueryResponse(BaseModel):
    """Schema for the JSON object the LLM is asked to return"""
    query: str = Field(description="The generated search query")


class MyCustomQueryGenerator:
    """
    Custom query generator using LangChain and global variables.
    The OPENAI_API_KEY from your global variables is automatically
    available as an environment variable.
    """

    def __init__(self):
        # The API key is loaded from the environment, so there is no need
        # to hardcode it. Fail early if it was never configured.
        if not os.getenv("OPENAI_API_KEY"):
            raise RuntimeError("OPENAI_API_KEY is not set in Global Variables")

        # Define your prompt template
        prompt = PromptTemplate(
            template="Generate a search query about: {topic}",
            input_variables=["topic"]
        )

        # Initialize LangChain wrapper
        # Provider options: 'openai', 'anthropic', 'cohere', 'vertexai', 'ollama'
        self.llm_wrapper = LangChainWrapper(
            prompt=prompt,
            provider="openai",      # Uses OPENAI_API_KEY
            model="gpt-4",
            temperature=0.7,
            verbose=False
        )

    def generate_query(self, topic):
        """Generate a query using the LLM"""
        response_schema = [
            {"name": "query", "type": "string", "description": "Generated query"}
        ]

        parser = JsonOutputParser(pydantic_object=QueryResponse)

        # Generate response with retry mechanism
        result = self.llm_wrapper.generate_response(
            output_parser=parser,
            params={"topic": topic},
            response_schema=response_schema
        )

        return result["query"]

Supported LLM Providers

• OpenAI: OPENAI_API_KEY (GPT-4, GPT-3.5, etc.)
• Anthropic: ANTHROPIC_API_KEY (Claude models)
• Cohere: COHERE_API_KEY (Cohere models)
• Vertex AI: VERTEXAI_PROJECT, VERTEXAI_LOCATION (Google Cloud models)
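
A component can check up front that the variables for its provider are present, which gives a clearer error than a failed API call. The sketch below uses the variable names listed above and the provider keys from the wrapper example. The helper name missing_variables is hypothetical.

```python
import os

# Variable names as listed in this guide, keyed by provider name.
REQUIRED_ENV = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "cohere": ["COHERE_API_KEY"],
    "vertexai": ["VERTEXAI_PROJECT", "VERTEXAI_LOCATION"],
}


def missing_variables(provider, env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    if provider not in REQUIRED_ENV:
        raise ValueError(f"unknown provider: {provider!r}")
    return [name for name in REQUIRED_ENV[provider] if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
fake_env = {"VERTEXAI_PROJECT": "my-project"}
print(missing_variables("vertexai", fake_env))
```

Accepting the environment as an argument makes the check easy to test without touching real keys.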

Accessing Other Environment Variables

You can access any of your global variables using standard environment variable access in Python:

import os

# Access your API keys
openai_key = os.getenv('OPENAI_API_KEY')
anthropic_key = os.getenv('ANTHROPIC_API_KEY')

# Access custom variables
my_custom_var = os.getenv('MY_CUSTOM_KEY')

# With default value
some_config = os.getenv('CONFIG_VALUE', 'default_value')

Security Best Practices

• Never hardcode API keys in your component code
• Always use global variables for sensitive information
• API keys are stored securely and masked in the UI
• Each user has their own isolated set of global variables

Complete Example: Putting It All Together

Let's walk through a complete example of creating and running a simulation from start to finish. We'll create a simulation that tests AI-generated queries versus simple keyword queries.

Our Goal

We want to compare two query generation strategies: one using OpenAI's GPT-4 to generate queries, and another using a simple smart query generator. We'll test them on 5 different topics and see which finds more relevant documents.

Step 1: Create User Configurations

AI User (users/ai_user.xml)

Uses GPT-4 for query generation

<?xml version="1.0" encoding="UTF-8"?>
<user>
  <searchContext>
    <topicFile>example_data/topics/{topic}.txt</topicFile>
  </searchContext>

  <queryGenerator>
    <class>simiir.user.query_generators.LMQueryGenerator</class>
    <params>
      <provider>openai</provider>
      <model>gpt-4</model>
      <temperature>0.7</temperature>
      <maxTerms>5</maxTerms>
    </params>
  </queryGenerator>

  <documentClassifier>
    <class>simiir.user.classifiers.ProbabilisticClassifier</class>
    <params>
      <clickProbsFile>example_data/probs/click.prob</clickProbsFile>
    </params>
  </documentClassifier>

  <stoppingDecisionMaker>
    <class>simiir.user.stopping.FixedDepthStoppingStrategy</class>
    <params>
      <depth>10</depth>
    </params>
  </stoppingDecisionMaker>
</user>

Smart User (users/smart_user.xml)

Uses simple keyword selection

<?xml version="1.0" encoding="UTF-8"?>
<user>
  <searchContext>
    <topicFile>example_data/topics/{topic}.txt</topicFile>
  </searchContext>

  <queryGenerator>
    <class>simiir.user.query_generators.SmartQueryGenerator</class>
    <params>
      <maxTerms>3</maxTerms>
      <minTerms>1</minTerms>
    </params>
  </queryGenerator>

  <documentClassifier>
    <class>simiir.user.classifiers.ProbabilisticClassifier</class>
    <params>
      <clickProbsFile>example_data/probs/click.prob</clickProbsFile>
    </params>
  </documentClassifier>

  <stoppingDecisionMaker>
    <class>simiir.user.stopping.FixedDepthStoppingStrategy</class>
    <params>
      <depth>10</depth>
    </params>
  </stoppingDecisionMaker>
</user>

Step 2: Create the Simulation File

Main Simulation (simulation.xml)

Configures both users with the same topics

<?xml version="1.0" encoding="UTF-8"?>
<simulation>
  <!-- Output settings -->
  <output>
    <baseDir>output/ai_vs_smart_comparison</baseDir>
    <saveInteraction>true</saveInteraction>
    <saveLog>true</saveLog>
  </output>

  <!-- Define both users -->
  <users>
    <!-- AI-powered user -->
    <user>
      <configFile>users/ai_user.xml</configFile>
      <topics>
        <topic>301</topic>
        <topic>310</topic>
        <topic>320</topic>
        <topic>330</topic>
        <topic>340</topic>
      </topics>
    </user>

    <!-- Smart query user -->
    <user>
      <configFile>users/smart_user.xml</configFile>
      <topics>
        <topic>301</topic>
        <topic>310</topic>
        <topic>320</topic>
        <topic>330</topic>
        <topic>340</topic>
      </topics>
    </user>
  </users>

  <!-- Search engine configuration -->
  <searchInterface>
    <interface>whooshtrec</interface>
    <indexPath>example_data/index_CORE</indexPath>
  </searchInterface>

  <!-- Run each topic 3 times to average out random variation -->
  <runs>3</runs>
</simulation>
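
When a comparison involves many users or topics, writing this file by hand gets tedious. The sketch below builds the same structure with Python's standard xml.etree.ElementTree module. The function name build_simulation is hypothetical, and ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_simulation(base_dir, user_configs, topics, runs,
                     interface="whooshtrec",
                     index_path="example_data/index_CORE"):
    """Return simulation XML text with one <user> entry per config file."""
    sim = ET.Element("simulation")

    output = ET.SubElement(sim, "output")
    ET.SubElement(output, "baseDir").text = base_dir
    ET.SubElement(output, "saveInteraction").text = "true"
    ET.SubElement(output, "saveLog").text = "true"

    users = ET.SubElement(sim, "users")
    for config_file in user_configs:
        user = ET.SubElement(users, "user")
        ET.SubElement(user, "configFile").text = config_file
        topics_element = ET.SubElement(user, "topics")
        for topic in topics:
            ET.SubElement(topics_element, "topic").text = str(topic)

    search_interface = ET.SubElement(sim, "searchInterface")
    ET.SubElement(search_interface, "interface").text = interface
    ET.SubElement(search_interface, "indexPath").text = index_path

    ET.SubElement(sim, "runs").text = str(runs)

    ET.indent(sim)  # pretty-print; available from Python 3.9
    return ET.tostring(sim, encoding="unicode")


xml_text = build_simulation(
    base_dir="output/ai_vs_smart_comparison",
    user_configs=["users/ai_user.xml", "users/smart_user.xml"],
    topics=[301, 310, 320, 330, 340],
    runs=3,
)
print(xml_text)
```

If each run repeats every user and topic pair, as the runs setting suggests, this configuration amounts to 2 users times 5 topics times 3 runs, or 30 simulated sessions.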

Step 3: Set Up Global Variables

Before running

Since our AI user needs OpenAI access, make sure to set up your API key in Global Variables:

1. Click your profile icon in the navigation bar
2. Select "Profile & Settings"
3. Go to the "Global Variables" tab
4. Add: OPENAI_API_KEY = your-api-key

Step 4: Run the Simulation

Using SimIIR Studio

You can run your simulation through the SimIIR Studio interface:

1. Go to the Workspace page in SimIIR Studio
2. Upload your simulation.xml file and both user configuration files
3. Click "Run Simulation"
4. Monitor progress in real-time
5. View and download results when complete

Step 5: Analyze Results

What to Look For

After the simulation completes, you'll find several output files:

*.queries

All queries generated by each user

*.interaction

Which documents were clicked and viewed

*.log

Detailed logs of the entire search session

*.trec

Results in TREC format for evaluation with standard tools
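
To compare the two users, you can score each run against relevance judgments. The sketch below assumes the .trec files use the standard six-column TREC run layout (topic, Q0, document id, rank, score, run tag) and that you have judgments in the standard four-column qrels layout. Check your own output files first, since this guide does not show their exact contents. The sample lines and function names are invented for illustration.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Map each topic to its document ids, ordered by rank."""
    ranked = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        topic, _q0, doc_id, rank, _score, _tag = parts
        ranked[topic].append((int(rank), doc_id))
    return {topic: [doc for _, doc in sorted(pairs)]
            for topic, pairs in ranked.items()}


def parse_qrels(lines):
    """Map each topic to the set of document ids judged relevant."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue
        topic, _iteration, doc_id, judgment = parts
        if int(judgment) > 0:
            relevant[topic].add(doc_id)
    return relevant


def precision_at_k(ranking, relevant, k):
    """Fraction of the top k documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


# Invented sample data in the assumed formats.
RUN_LINES = [
    "301 Q0 docA 1 9.1 ai_user",
    "301 Q0 docB 2 8.0 ai_user",
    "301 Q0 docC 3 7.2 ai_user",
]
QRELS_LINES = [
    "301 0 docA 1",
    "301 0 docB 0",
    "301 0 docC 1",
]

run = parse_trec_run(RUN_LINES)
qrels = parse_qrels(QRELS_LINES)
p_at_2 = precision_at_k(run["301"], qrels["301"], 2)
print(f"P@2 for topic 301: {p_at_2:.2f}")
```

Running the same calculation on both users' output and averaging over topics and runs gives a simple first comparison of the two strategies.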

You've Created Your First Simulation!

You now know how to:

• Create user configurations with different components
• Set up a simulation file to compare different strategies
• Use global variables to securely store API keys
• Run simulations and analyze the results

Try experimenting with different components, parameters, and topics to see how they affect search performance!

API Reference

SimIIR Studio provides a REST API wrapper for running simulations at scale programmatically. The API supports asynchronous execution, status tracking, pause/resume, and result retrieval.

Why Use the API?

• Run large-scale experiments with multiple users and topics
• Execute simulations asynchronously without blocking
• Monitor progress and status in real-time
• Pause, resume, or stop long-running simulations
• Integrate with your existing workflows and tools

Getting Started

1. Set Up the API

First, clone and set up the API wrapper locally:

# Clone simIIR framework
git clone https://github.com/simint-ai/simiir-3.git simiir
cd simiir && pip install -r requirements.txt && cd ..

# Navigate to the API directory (assumes simiir-api is already checked out here) and install
cd simiir-api
poetry install

# Configure environment (use absolute paths)
cat > .env << 'EOF'
SIMIIR_REPO_PATH=/absolute/path/to/simiir
SIMIIR_PYTHON_PATH=/absolute/path/to/simiir
SIMIIR_DATA_PATH=/absolute/path/to/simiir/example_data
EOF

# Start the API server
poetry run simiir-api

Key Endpoints

POST /simulations/

Create a new simulation with XML configuration

Request Body:

{
  "name": "My Experiment",
  "description": "Testing query strategies",
  "config_content": "<?xml version='1.0'?>...",
  "users": ["user1", "user2"],
  "topics": ["topic1", "topic2"],
  "metadata": {
    "experiment_type": "query_comparison"
  }
}

POST /simulations/{id}/start

Start executing a simulation

Starts the simulation asynchronously. Returns immediately with updated status.

GET /simulations/{id}

Get simulation status and progress

Response:

{
  "id": "sim_123",
  "name": "My Experiment",
  "status": "running",
  "progress": 45.5,
  "created_at": "2025-10-29T12:00:00Z",
  "started_at": "2025-10-29T12:01:00Z",
  "config_file_path": "/path/to/config.xml",
  "output_path": "/path/to/outputs/sim_123",
  "error_message": null
}

POST /simulations/{id}/pause

Pause a running simulation

Gracefully pauses execution. Can be resumed later from the same point.

POST /simulations/{id}/resume

Resume a paused simulation

Continues execution from where it was paused.

POST /simulations/{id}/stop

Stop a running simulation

Terminates execution. Partial results may be available.

GET /simulations/{id}/results

Download simulation results

Returns a zip file containing all output files (.queries, .log, .trec, .interaction).
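
Once you have the zip file, Python's standard zipfile module can list what it contains. The sketch below groups archive members by file extension. The file names inside the demo archive are invented, and the real layout of the results archive may differ.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_extension(zip_bytes):
    """Map each file extension to the archive members that use it."""
    groups = defaultdict(list)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            groups[PurePosixPath(name).suffix].append(name)
    return dict(groups)


# Build a small in-memory archive to stand in for a downloaded result.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("run1/user_301.queries", "climate change\n")
    archive.writestr("run1/user_301.log", "session started\n")
    archive.writestr("run1/user_301.trec", "")

groups = group_by_extension(buffer.getvalue())
for extension, names in sorted(groups.items()):
    print(extension, names)
```

With a real download, pass the response body (the bytes you would otherwise write to results.zip) to group_by_extension.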

Example Usage

Python Example

import time

import httpx

# API base URL
BASE_URL = "http://localhost:8000"

# Read your simulation XML
with open("simulation.xml", encoding="utf-8") as f:
    config_xml = f.read()

# Create simulation
response = httpx.post(
    f"{BASE_URL}/simulations/",
    json={
        "name": "My Large Scale Experiment",
        "description": "Testing with 100 users",
        "config_content": config_xml,
        "users": [f"user_{i}" for i in range(100)],
        "topics": ["301", "310", "320", "330"],
    },
)
response.raise_for_status()  # Fail fast on 4xx/5xx responses
sim_id = response.json()["id"]

# Start simulation
httpx.post(f"{BASE_URL}/simulations/{sim_id}/start").raise_for_status()

# Monitor progress
while True:
    response = httpx.get(f"{BASE_URL}/simulations/{sim_id}")
    response.raise_for_status()
    status = response.json()

    print(f"Status: {status['status']}, Progress: {status['progress']}%")

    if status["status"] in ["completed", "failed"]:
        break

    time.sleep(5)

# Download results (allow extra time for a large archive)
if status["status"] == "completed":
    response = httpx.get(
        f"{BASE_URL}/simulations/{sim_id}/results", timeout=60.0
    )
    response.raise_for_status()
    with open("results.zip", "wb") as f:
        f.write(response.content)
    print("Results downloaded!")
else:
    print(f"Simulation failed: {status.get('error_message')}")
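
For long experiments, it can be safer to put an upper bound on how long the polling loop waits. The sketch below is one way to do that. It takes the status lookup as a function argument so it can be tried without a running server. The name wait_for_simulation and its parameters are hypothetical, and the terminal states are the two used in the example above.

```python
import time

TERMINAL_STATES = {"completed", "failed"}


def wait_for_simulation(fetch_status, interval=5.0, timeout=3600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until a terminal state or until timeout seconds pass."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status["status"] in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status['status']!r} after {timeout} seconds")
        sleep(interval)


# Demo with canned responses instead of real HTTP calls.
responses = iter([
    {"status": "running", "progress": 45.5},
    {"status": "completed", "progress": 100.0},
])
final = wait_for_simulation(lambda: next(responses), sleep=lambda seconds: None)
print(final["status"], final["progress"])
```

With the real API, fetch_status would wrap the GET request to /simulations/{id} and return its parsed JSON.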

Interactive API Documentation

The API provides interactive Swagger documentation where you can test all endpoints directly in your browser.

Once the API is running, visit: http://localhost:8000/docs