Introduction to SimIIR Studio
Welcome to SimIIR Studio! This guide will teach you how to create and run search simulations step by step. A simulation lets you test different search strategies and behaviors to see what works best.
Think of it like this: instead of having real people search for information, you create a computer program that simulates how people search. You can then test different approaches, change parameters, and see the results quickly without needing actual users.
What you'll learn in this guide
- What a simulation is and why it's useful
- How to create simulation configuration files
- How to set up user configurations
- How to use and create components
- How the search process works
- How to put it all together in a working simulation
What is a Simulation?
A simulation in SimIIR is like a virtual search experiment. It consists of three main parts working together:
Simulated Users
Virtual users that perform searches. Each user has a goal (topic) and behaves according to their configuration.
Search Engine
The system that finds documents. This could be a real search engine or a simulated one with pre-loaded results.
Components
Building blocks that control user behavior: how they generate queries, decide which results to click, and when to stop.
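To make the "simulated search engine with pre-loaded results" idea concrete, here is a toy sketch. This class is purely illustrative and not part of SimIIR; the real interfaces (such as whooshtrec) query a document index.

```python
class CannedSearchEngine:
    """Toy simulated search engine that serves pre-loaded ranked results.

    Illustrative only: SimIIR's real search interfaces run queries
    against a built index rather than a lookup table.
    """

    def __init__(self, canned_results):
        # Maps a query string to a ranked list of document IDs
        self.canned_results = canned_results

    def search(self, query, top_k=10):
        # Unknown queries simply return no results
        return self.canned_results.get(query, [])[:top_k]
```

A user component would call `search()` each time it issues a query and receive the same ranked list every run, which makes experiments reproducible.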
Real-world example
Imagine you want to test if using AI to generate search queries works better than random queries. You'd create a simulation where virtual users search for topics like "climate change" or "machine learning". Some users use AI-generated queries, others use random queries. After running the simulation, you compare which approach found more relevant documents.
Creating Simulation Files
A simulation file is an XML document that defines your entire experiment. It tells SimIIR what you want to test, which users to simulate, what search engine to use, and where to save the results.
Basic Structure
Simple Simulation File Example
This is a basic simulation configuration
<?xml version="1.0" encoding="UTF-8"?>
<simulation>
    <!-- Give your simulation a name -->
    <output>
        <baseDir>output/my_first_simulation</baseDir>
        <saveInteraction>true</saveInteraction>
        <saveLog>true</saveLog>
    </output>

    <!-- Define which users will participate -->
    <users>
        <user>
            <configFile>users/basic_user.xml</configFile>
            <topics>
                <topic>301</topic>
                <topic>310</topic>
                <topic>320</topic>
            </topics>
        </user>
    </users>

    <!-- Configure the search engine -->
    <searchInterface>
        <interface>whooshtrec</interface>
        <indexPath>example_data/index_CORE</indexPath>
    </searchInterface>

    <!-- Set how many times to run each topic -->
    <runs>1</runs>
</simulation>

Understanding Each Section
Output Settings
This section controls where and what gets saved:
baseDir: Where to save all results (logs, queries, interactions)
saveInteraction: Save user interactions (which results were clicked, how long they were viewed)
saveLog: Save detailed logs for debugging
Users Section
Define which simulated users participate and what they search for:
configFile: Path to the user configuration file (we'll create this next)
topics: Topics (search goals) for this user. Each topic is like "find information about X".
Search Engine Configuration
Configure which search system to use:
interface: Type of search engine (whooshtrec, terrier, etc.)
indexPath: Path to your document index (a pre-built collection of documents)
User Configurations
User configuration files define how your simulated users behave. Each user has components that control different aspects of their search behavior.
Basic User Configuration
Example: basic_user.xml
<?xml version="1.0" encoding="UTF-8"?>
<user>
    <!-- User's search goal and behavior settings -->
    <searchContext>
        <topicFile>example_data/topics/{topic}.txt</topicFile>
        <queryListFile>example_data/topics/{topic}.queries</queryListFile>
    </searchContext>

    <!-- How to generate queries -->
    <queryGenerator>
        <class>simiir.user.query_generators.SmartQueryGenerator</class>
        <params>
            <maxTerms>3</maxTerms>
            <minTerms>1</minTerms>
        </params>
    </queryGenerator>

    <!-- How to decide what to click -->
    <documentClassifier>
        <class>simiir.user.classifiers.ProbabilisticClassifier</class>
        <params>
            <clickProbsFile>example_data/probs/click.prob</clickProbsFile>
        </params>
    </documentClassifier>

    <!-- When to stop searching -->
    <stoppingDecisionMaker>
        <class>simiir.user.stopping.FixedDepthStoppingStrategy</class>
        <params>
            <depth>5</depth>
        </params>
    </stoppingDecisionMaker>
</user>

User Configuration Components
Query Generator
Controls how search queries are created. Options include:
- Smart: Uses topic words intelligently
- Random: Picks random topic terms
- LLM-based: Uses AI models to generate queries
Document Classifier
Decides which search results to click on:
- Probabilistic: Click based on position
- Relevance-based: Click if relevant
- Perfect: Always click relevant docs
Stopping Strategy
Determines when to stop searching:
- Fixed Depth: Stop after N queries
- Satisfaction: Stop when enough good results found
- Frustration: Stop if no good results
Search Context
Provides context for the search:
- Topic files (what to search for)
- Query lists (predefined queries)
- Background knowledge
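Each of these building blocks is an ordinary Python class. As an illustration of the satisfaction-style stopping strategy listed above, here is a minimal sketch. The class and method names are hypothetical, patterned after the query-generator example below rather than SimIIR's actual base classes, so check the real API before adapting it.

```python
class SatisfactionStoppingStrategy:
    """Hypothetical stopping strategy: stop once the simulated user has
    found `target_relevant` relevant documents, or issued `max_queries`
    queries. Names are illustrative, not SimIIR's actual API.
    """

    def __init__(self, target_relevant=3, max_queries=10):
        self.target_relevant = target_relevant
        self.max_queries = max_queries
        self.relevant_found = 0
        self.queries_issued = 0

    def record_query(self):
        # Called each time the user issues a new query
        self.queries_issued += 1

    def record_click(self, was_relevant):
        # Called after the user examines a clicked document
        if was_relevant:
            self.relevant_found += 1

    def decide(self):
        """Return True if the simulated user should stop searching."""
        if self.relevant_found >= self.target_relevant:
            return True  # satisfied
        return self.queries_issued >= self.max_queries  # query budget spent
```

The same shape works for a frustration strategy: track consecutive non-relevant results instead of total relevant ones.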
Components
Components are Python classes that define specific behaviors. You can use built-in components or create your own.
Creating a Custom Component
Custom Query Generator Example
import random

from simiir.user.query_generators.base import QueryGenerator


class MyCustomQueryGenerator(QueryGenerator):
    """
    A custom query generator that combines topic words
    with trending terms.
    """

    def __init__(self, topic, stopword_file,
                 max_terms=3, trending_terms=None):
        super().__init__(topic, stopword_file)
        self.max_terms = max_terms
        self.trending_terms = trending_terms or []

    def generate_query(self):
        """
        Generate a query by mixing topic words
        with trending terms.
        """
        # Get words from the topic
        topic_words = self.get_topic_words()

        # Pick some topic words
        query_terms = self.pick_random_terms(
            topic_words,
            self.max_terms - 1
        )

        # Add a trending term if available
        if self.trending_terms:
            trending = random.choice(self.trending_terms)
            query_terms.append(trending)

        # Join into a query string
        return ' '.join(query_terms)

    def get_topic_words(self):
        """Extract words from the topic text."""
        words = self.topic.content.split()
        # Remove stopwords
        return [w for w in words
                if w.lower() not in self.stopwords]

    def pick_random_terms(self, terms, count):
        """Randomly pick up to `count` terms from the list."""
        count = min(count, len(terms))
        return random.sample(terms, count)

Using Your Custom Component
Once you've created a component, you can use it in your user configuration by specifying its class path:
<queryGenerator>
    <class>my_components.MyCustomQueryGenerator</class>
    <params>
        <maxTerms>3</maxTerms>
        <trendingTerms>
            <term>AI</term>
            <term>machine learning</term>
            <term>neural networks</term>
        </trendingTerms>
    </params>
</queryGenerator>

Using the Component Studio
SimIIR Studio includes a Component Studio where you can create, test, and manage your custom components through a user-friendly interface. You can write code, test it, and deploy it directly to your simulations.
The Search Process
Understanding how a simulation runs helps you design better experiments. Here's what happens when you run a simulation:
1. Initialization
SimIIR loads your simulation file, creates the users, connects to the search engine, and prepares everything to run. Each user gets their topic (e.g., "Find information about climate change").
2. Query Generation
The user's Query Generator component creates a search query. This might be as simple as picking words from the topic, or as complex as using an AI model to generate an intelligent query.
3. Search Execution
The query is sent to the search engine, which returns a ranked list of documents. The search engine uses its own algorithms (BM25, language models, etc.) to find relevant documents.
4. Result Examination
The user looks at the search results. The Document Classifier component decides which results look interesting and worth clicking. This mimics how real users scan search results and choose what to click.
5. Document Interaction
The user "reads" the clicked documents. The system records which documents were clicked, how long they were viewed, and whether they were relevant. This data is crucial for evaluating search performance.
6. Stopping Decision
The Stopping Decision Maker component checks if the user should continue or stop. They might stop because they found enough good results, got frustrated, or hit a query limit. If they continue, go back to step 2.
7. Save Results
When the user stops, SimIIR saves everything: all queries issued, results clicked, time spent, and relevant documents found. These logs let you analyze what happened and evaluate performance.
The Complete Cycle
This entire process (steps 2-6) repeats in a loop until the user decides to stop. A typical simulation might have a user issue 5-10 queries, click 10-30 documents, and spend several simulated minutes searching before deciding they've found enough information or need to give up.
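The cycle above can be sketched in a few lines of Python. This is an illustrative skeleton, not SimIIR's actual control loop; the component method names (`generate_query`, `decides_to_click`, and so on) are placeholders.

```python
def run_session(user, search_engine, topic, max_queries=50):
    """Illustrative sketch of one simulated search session.

    `user` bundles the query generator, document classifier, and
    stopping strategy; all method names here are placeholders.
    """
    session_log = []
    for _ in range(max_queries):                     # safety bound on the loop
        query = user.generate_query(topic)           # step 2: generate a query
        results = search_engine.search(query)        # step 3: run the search
        clicked = [doc for doc in results
                   if user.decides_to_click(doc)]    # step 4: scan the results
        for doc in clicked:                          # step 5: read documents
            user.examine(doc)
        session_log.append((query, clicked))
        if user.should_stop():                       # step 6: stop or loop again
            break
    return session_log                               # step 7: log for analysis
```

In the real framework, the equivalent of `session_log` is what gets written to the output files described later in this guide.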
Using Global Variables
Global Variables allow you to securely store API keys and configuration settings that can be accessed by your custom SimIIR components. This is particularly useful when integrating with external LLM providers like OpenAI, Anthropic, Cohere, or Google Vertex AI.
Setting up Global Variables
To configure your API keys and global variables:
- Click on your profile icon in the navigation bar
- Select "Profile & Settings" from the dropdown
- Navigate to the "Global Variables" tab
- Add your API keys for the LLM providers you want to use
Accessing Variables in Custom Components
When creating custom SimIIR components, you can access global variables as environment variables. The system automatically injects your configured variables into the component execution environment.
Example: Using LangChain with Global Variables
Here's an example of a custom query generator that uses the LangChain wrapper to access OpenAI via your stored API key:
import os

from simiir.user.utils.langchain_wrapper import LangChainWrapper
from langchain_core.prompts import PromptTemplate


class MyCustomQueryGenerator:
    """
    Custom query generator using LangChain and global variables.
    The OPENAI_API_KEY from your global variables is automatically
    available as an environment variable.
    """

    def __init__(self):
        # The API key is loaded automatically from the environment;
        # no need to hardcode it!

        # Define your prompt template
        prompt = PromptTemplate(
            template="Generate a search query about: {topic}",
            input_variables=["topic"]
        )

        # Initialize the LangChain wrapper.
        # Provider options: 'openai', 'anthropic', 'cohere', 'vertexai', 'ollama'
        self.llm_wrapper = LangChainWrapper(
            prompt=prompt,
            provider="openai",  # Uses OPENAI_API_KEY
            model="gpt-4",
            temperature=0.7,
            verbose=False
        )

    def generate_query(self, topic):
        """Generate a query using the LLM."""
        from langchain_core.output_parsers import JsonOutputParser
        from langchain_core.pydantic_v1 import BaseModel, Field

        # Define the response schema
        class QueryResponse(BaseModel):
            query: str = Field(description="The generated search query")

        response_schema = [
            {"name": "query", "type": "string", "description": "Generated query"}
        ]
        parser = JsonOutputParser(pydantic_object=QueryResponse)

        # Generate the response with a retry mechanism
        result = self.llm_wrapper.generate_response(
            output_parser=parser,
            params={"topic": topic},
            response_schema=response_schema
        )
        return result["query"]

Supported LLM Providers
OPENAI_API_KEY: GPT-4, GPT-3.5, etc.
ANTHROPIC_API_KEY: Claude models
COHERE_API_KEY: Cohere models
VERTEXAI_PROJECT, VERTEXAI_LOCATION: Google Cloud models
Accessing Other Environment Variables
You can access any of your global variables using standard environment variable access in Python:
import os

# Access your API keys
openai_key = os.getenv('OPENAI_API_KEY')
anthropic_key = os.getenv('ANTHROPIC_API_KEY')

# Access custom variables
my_custom_var = os.getenv('MY_CUSTOM_KEY')

# With a default value
some_config = os.getenv('CONFIG_VALUE', 'default_value')

Security Best Practices
- Never hardcode API keys in your component code
- Always use global variables for sensitive information
- API keys are stored securely and masked in the UI
- Each user has their own isolated set of global variables
Complete Example: Putting It All Together
Let's walk through a complete example of creating and running a simulation from start to finish. We'll create a simulation that tests AI-generated queries versus simple keyword queries.
Our Goal
We want to compare two query generation strategies: one using OpenAI's GPT-4 to generate queries, and another using a simple smart query generator. We'll test them on 5 different topics and see which finds more relevant documents.
Step 1: Create User Configurations
AI User (users/ai_user.xml)
Uses GPT-4 for query generation
<?xml version="1.0" encoding="UTF-8"?>
<user>
    <searchContext>
        <topicFile>example_data/topics/{topic}.txt</topicFile>
    </searchContext>
    <queryGenerator>
        <class>simiir.user.query_generators.LMQueryGenerator</class>
        <params>
            <provider>openai</provider>
            <model>gpt-4</model>
            <temperature>0.7</temperature>
            <maxTerms>5</maxTerms>
        </params>
    </queryGenerator>
    <documentClassifier>
        <class>simiir.user.classifiers.ProbabilisticClassifier</class>
        <params>
            <clickProbsFile>example_data/probs/click.prob</clickProbsFile>
        </params>
    </documentClassifier>
    <stoppingDecisionMaker>
        <class>simiir.user.stopping.FixedDepthStoppingStrategy</class>
        <params>
            <depth>10</depth>
        </params>
    </stoppingDecisionMaker>
</user>

Smart User (users/smart_user.xml)
Uses simple keyword selection
<?xml version="1.0" encoding="UTF-8"?>
<user>
    <searchContext>
        <topicFile>example_data/topics/{topic}.txt</topicFile>
    </searchContext>
    <queryGenerator>
        <class>simiir.user.query_generators.SmartQueryGenerator</class>
        <params>
            <maxTerms>3</maxTerms>
            <minTerms>1</minTerms>
        </params>
    </queryGenerator>
    <documentClassifier>
        <class>simiir.user.classifiers.ProbabilisticClassifier</class>
        <params>
            <clickProbsFile>example_data/probs/click.prob</clickProbsFile>
        </params>
    </documentClassifier>
    <stoppingDecisionMaker>
        <class>simiir.user.stopping.FixedDepthStoppingStrategy</class>
        <params>
            <depth>10</depth>
        </params>
    </stoppingDecisionMaker>
</user>

Step 2: Create the Simulation File
Main Simulation (simulation.xml)
Configures both users with the same topics
<?xml version="1.0" encoding="UTF-8"?>
<simulation>
    <!-- Output settings -->
    <output>
        <baseDir>output/ai_vs_smart_comparison</baseDir>
        <saveInteraction>true</saveInteraction>
        <saveLog>true</saveLog>
    </output>

    <!-- Define both users -->
    <users>
        <!-- AI-powered user -->
        <user>
            <configFile>users/ai_user.xml</configFile>
            <topics>
                <topic>301</topic>
                <topic>310</topic>
                <topic>320</topic>
                <topic>330</topic>
                <topic>340</topic>
            </topics>
        </user>
        <!-- Smart query user -->
        <user>
            <configFile>users/smart_user.xml</configFile>
            <topics>
                <topic>301</topic>
                <topic>310</topic>
                <topic>320</topic>
                <topic>330</topic>
                <topic>340</topic>
            </topics>
        </user>
    </users>

    <!-- Search engine configuration -->
    <searchInterface>
        <interface>whooshtrec</interface>
        <indexPath>example_data/index_CORE</indexPath>
    </searchInterface>

    <!-- Run each topic 3 times to average over randomness -->
    <runs>3</runs>
</simulation>

Step 3: Set Up Global Variables
Before running
Since our AI user needs OpenAI access, make sure to set up your API key in Global Variables:
- Click your profile icon in the navigation bar
- Select "Profile & Settings"
- Go to "Global Variables" tab
- Add: OPENAI_API_KEY = your-api-key
Step 4: Run the Simulation
Using SimIIR Studio
You can run your simulation through the SimIIR Studio interface:
- Go to the Workspace page in SimIIR Studio
- Upload your simulation.xml file and both user configuration files
- Click "Run Simulation"
- Monitor progress in real-time
- View and download results when complete
Step 5: Analyze Results
What to Look For
After the simulation completes, you'll find several output files:
*.queries: All queries generated by each user
*.interaction: Which documents were clicked and viewed
*.log: Detailed logs of the entire search session
*.trec: Results in TREC format for evaluation with standard tools
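As a starting point for analysis, you can tally the output files with a short script. The sketch below simply counts non-empty lines in each `.queries` file; it assumes one query per line, which you should verify against your own output before relying on the numbers.

```python
from pathlib import Path


def count_queries(output_dir):
    """Count non-empty lines in each .queries file under `output_dir`.

    Assumes one query per line; check this against your actual
    SimIIR output format before trusting the counts.
    """
    counts = {}
    for path in sorted(Path(output_dir).glob("*.queries")):
        with open(path, encoding="utf-8") as f:
            counts[path.name] = sum(1 for line in f if line.strip())
    return counts
```

For relevance-based evaluation, feed the `.trec` files to standard tools such as trec_eval instead.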
You've Created Your First Simulation!
You now know how to:
- Create user configurations with different components
- Set up a simulation file to compare different strategies
- Use global variables to securely store API keys
- Run simulations and analyze the results
Try experimenting with different components, parameters, and topics to see how they affect search performance!
API Reference
SimIIR Studio provides a REST API wrapper for running simulations at scale programmatically. The API supports asynchronous execution, status tracking, pause/resume, and result retrieval.
Why Use the API?
- Run large-scale experiments with multiple users and topics
- Execute simulations asynchronously without blocking
- Monitor progress and status in real-time
- Pause, resume, or stop long-running simulations
- Integrate with your existing workflows and tools
Getting Started
1. Setup the API
First, clone and set up the API wrapper locally:
# Clone the SimIIR framework
git clone https://github.com/simint-ai/simiir-3.git simiir
cd simiir && pip install -r requirements.txt && cd ..

# Navigate to the API directory and install
cd simiir-api
poetry install

# Configure the environment (use absolute paths)
cat > .env << 'EOF'
SIMIIR_REPO_PATH=/absolute/path/to/simiir
SIMIIR_PYTHON_PATH=/absolute/path/to/simiir
SIMIIR_DATA_PATH=/absolute/path/to/simiir/example_data
EOF

# Start the API server
poetry run simiir-api
Key Endpoints
POST /simulations/
Create a new simulation with XML configuration
Request Body:
{
    "name": "My Experiment",
    "description": "Testing query strategies",
    "config_content": "<?xml version='1.0'?>...",
    "users": ["user1", "user2"],
    "topics": ["topic1", "topic2"],
    "metadata": {
        "experiment_type": "query_comparison"
    }
}

POST /simulations/{id}/start
Start executing a simulation
Starts the simulation asynchronously. Returns immediately with updated status.
GET /simulations/{id}
Get simulation status and progress
Response:
{
    "id": "sim_123",
    "name": "My Experiment",
    "status": "running",
    "progress": 45.5,
    "created_at": "2025-10-29T12:00:00Z",
    "started_at": "2025-10-29T12:01:00Z",
    "config_file_path": "/path/to/config.xml",
    "output_path": "/path/to/outputs/sim_123",
    "error_message": null
}

POST /simulations/{id}/pause
Pause a running simulation
Gracefully pauses execution. Can be resumed later from the same point.
POST /simulations/{id}/resume
Resume a paused simulation
Continues execution from where it was paused.
POST /simulations/{id}/stop
Stop a running simulation
Terminates execution. Partial results may be available.
GET /simulations/{id}/results
Download simulation results
Returns a zip file containing all output files (.queries, .log, .trec, .interaction).
Example Usage
Python Example
import time

import httpx

# API base URL
BASE_URL = "http://localhost:8000"

# Read your simulation XML
with open("simulation.xml") as f:
    config_xml = f.read()

# Create the simulation
response = httpx.post(
    f"{BASE_URL}/simulations/",
    json={
        "name": "My Large Scale Experiment",
        "description": "Testing with 100 users",
        "config_content": config_xml,
        "users": [f"user_{i}" for i in range(100)],
        "topics": ["301", "310", "320", "330"],
    }
)
sim = response.json()
sim_id = sim["id"]

# Start the simulation
httpx.post(f"{BASE_URL}/simulations/{sim_id}/start")

# Monitor progress
while True:
    response = httpx.get(f"{BASE_URL}/simulations/{sim_id}")
    status = response.json()
    print(f"Status: {status['status']}, Progress: {status['progress']}%")
    if status["status"] in ["completed", "failed"]:
        break
    time.sleep(5)

# Download the results
if status["status"] == "completed":
    response = httpx.get(f"{BASE_URL}/simulations/{sim_id}/results")
    with open("results.zip", "wb") as f:
        f.write(response.content)
    print("Results downloaded!")

Interactive API Documentation
The API provides interactive Swagger documentation where you can test all endpoints directly in your browser.
Once the API is running, visit: http://localhost:8000/docs
