I built an agent benchmark around Exa-style semantic retrieval.
The Multi-Step Agent Retrieval Benchmark (MARB) evaluates how much better LLM agents perform on real coding and infra tasks when you swap in Exa search instead of generic web search—or no search at all.
Retrieval for software engineering
MARB snapshot · 8 tasks
Why a benchmark for Exa & agentic workflows?
Exa is a search engine made for AIs, with APIs like /search and /contents that surface high-signal technical content for models to consume directly (see the official docs for more: Exa – Getting Started).
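For orientation, here is a minimal sketch of what a /search call looks like from Python, assuming the exa_py SDK; the exact method names and response fields may differ from the current docs, so treat this as illustrative rather than authoritative.
# Minimal /search call via the exa_py SDK (assumed surface; check the Exa docs).
import os
from exa_py import Exa

exa = Exa(os.environ["EXA_API_KEY"])
response = exa.search("FastAPI background tasks best practices", num_results=5)
for result in response.results:
    print(result.title, result.url)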
Most existing benchmarks (like MTEB) focus on static embedding quality. They're useful, but they don't answer the question many Exa customers actually care about:
“If I plug Exa into my LLM agent instead of a generic web search API, do I solve more real multi-step tasks?”
MARB: Multi-Step Agent Retrieval Benchmark
MARB is a small but realistic benchmark that evaluates web search in the context of LLM agents completing multi-step software tasks. Each task is something an engineer might actually delegate to a capable agent.
Unlike generic QA datasets, MARB tasks are designed to be either unsolvable without external knowledge or hallucination-prone when the model relies solely on its training data. They target specific, often niche, library versions and configuration syntax. Example tasks include:
- “Find a Python library for OCR, read the docs, and write code to extract text from PDFs.”
- “Find recent best practices for Dockerfiles and optimize this example file.”
- “Given a simple Kubernetes Deployment, add a HorizontalPodAutoscaler with sane defaults.”
For each task, the agent can optionally call a search provider, read the returned documents, and then synthesize a final answer.
from dataclasses import dataclass

@dataclass
class AgentTask:
    id: str
    instruction: str              # e.g. "Find a Python library for OCR..."
    input_context: str            # e.g. "Use this specific PDF layout..."
    success_keywords: list[str]   # e.g. ["pytesseract", "pdf2image"]
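For illustration, here is how the first example task above might be written out as an AgentTask; the id and input_context are made up, and the keywords mirror the example in the comment above.
# Hypothetical MARB task instance (field values are illustrative).
ocr_task = AgentTask(
    id="ocr_pdf_extraction",
    instruction="Find a Python library for OCR, read the docs, and write code to extract text from PDFs.",
    input_context="Use this specific PDF layout...",
    success_keywords=["pytesseract", "pdf2image"],
)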
Agent loop & evaluation protocol
The benchmark uses a simple, model-agnostic agent loop with three phases, kept intentionally minimal so that retrieval quality stays the primary variable.
# Pseudo-code of the MARB agent loop
def run_agent(task, search_client, model):
    # 1. Plan: Model generates search queries based on task
    queries = model.plan_queries(task.instruction)

    # 2. Retrieve: Search API returns raw results
    search_results = []
    for q in queries:
        items = search_client.search(q)
        search_results.extend(items)

    # 3. Answer: Model synthesizes final response
    return model.answer(task.instruction, context=search_results)
- Planning: Given a MARB task, the model proposes a small set of concrete web search queries. This tests whether the search engine can handle the phrasing an agent naturally produces.
- Retrieval: The benchmark calls a configured web search API (e.g. Exa, Parallel, Brave, SerpAPI) with those queries and collects top-k results. We normalize these into a standard format (URL, title, snippet) to ensure fair comparison.
- Answering: The model receives the original task plus the retrieved documents and produces a final answer.
To stay focused on retrieval, MARB keeps the LLM backbone fixed (in the reference implementation, a Gemini 2.5 Flash model via GEMINI_API_KEY) and only swaps out the search provider.
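As a rough sketch of how the planning step can be wired to that backbone, assuming the google-generativeai package and the gemini-2.5-flash model id (the reference implementation may wire this differently):
# Hedged sketch of the planning phase against the fixed Gemini backbone.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash")

def plan_queries(instruction: str, max_queries: int = 3) -> list[str]:
    # Ask the model for a handful of concrete web search queries, one per line.
    prompt = (
        "You are planning web searches for a software engineering task.\n"
        f"Task: {instruction}\n"
        f"Propose up to {max_queries} search queries, one per line, no numbering."
    )
    response = model.generate_content(prompt)
    return [line.strip() for line in response.text.splitlines() if line.strip()][:max_queries]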
Each task comes with lightweight success criteria. We use deterministic keyword matching as a proxy for correctness. For example, if a task asks to "extract text from PDFs", finding pytesseract or pdf2image in the answer counts as a success. This avoids the variance and cost of "LLM-as-a-judge" while remaining directionally accurate for engineering tasks.
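Concretely, the success check can be a case-insensitive keyword scan over the final answer; a sketch, assuming the AgentTask shape defined above:
# Deterministic success check: any expected keyword in the answer counts as a pass.
def is_success(answer: str, task: AgentTask) -> bool:
    answer_lower = answer.lower()
    return any(keyword.lower() in answer_lower for keyword in task.success_keywords)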
Comparing Exa against baseline web search
The reference MARB implementation is intentionally provider-agnostic: it defines a tiny SearchClient interface, and any search API can be plugged in by writing a small adapter class against it (a sketch of one such adapter follows the provider list below).
from typing import List, Protocol

class SearchClient(Protocol):
    name: str

    def search(self, query: str, top_k: int = 10) -> List[SearchResult]:
        """
        Standard interface for all providers.
        Returns normalized SearchResult objects.
        """
        pass
Currently supported providers include:
- No search – Baseline. The agent relies only on its pretraining.
- Exa – via the /search API (docs).
- Parallel – via their Search API (docs).
- Brave Search API – privacy-focused web search (docs).
- SerpAPI – a meta-search wrapper around Google and others (docs).
- Tavily – an LLM-focused search API (used in the reference run below).
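As promised above, here is a hedged sketch of what an Exa adapter for that interface might look like, using the raw /search endpoint. The SearchResult shape and the request/response field names are assumptions for illustration, not the repository's actual code.
# Illustrative Exa adapter for the SearchClient protocol (field names assumed).
from dataclasses import dataclass
import os
import requests

@dataclass
class SearchResult:
    # Normalized shape used for fair comparison: URL, title, snippet.
    url: str
    title: str
    snippet: str

class ExaSearchClient:
    name = "exa"

    def search(self, query: str, top_k: int = 10) -> list[SearchResult]:
        # POST to Exa's /search endpoint; "numResults" and the response layout
        # follow the public docs at the time of writing and may need adjusting.
        resp = requests.post(
            "https://api.exa.ai/search",
            headers={"x-api-key": os.environ["EXA_API_KEY"]},
            json={"query": query, "numResults": top_k},
        )
        resp.raise_for_status()
        return [
            SearchResult(
                url=item.get("url", ""),
                title=item.get("title", ""),
                snippet=item.get("text", ""),  # "text" is only present when contents are requested
            )
            for item in resp.json().get("results", [])
        ]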
Running MARB for all of them yields a simple comparison of task success rate: for each provider, what percentage of tasks did the agent complete end-to-end? The CLI also records wall-clock runtime per provider, so you can see not just which search engine helps the agent solve more tasks, but how long each one takes to do so.
Provider   Agent              Solved   Total   Success %
---------------------------------------------------------
none       simple_llm_agent        6       8        75.0
serpapi    simple_llm_agent        5       8        62.5
exa        simple_llm_agent        7       8        87.5
parallel   simple_llm_agent        6       8        75.0
tavily     simple_llm_agent        6       8        75.0
How to run the benchmark
The entire benchmark is open source. To run it yourself, you'll need to set up your environment variables in a .env file:
EXA_API_KEY=...
GEMINI_API_KEY=... # The LLM backbone
PARALLEL_API_KEY=... # Optional
SERPAPI_API_KEY=... # Optional
TAVILY_API_KEY=... # Optional
Then, you can run the full comparison with a single CLI command. This script iterates through the providers, runs the agent loop for each task, and prints the final summary table.
# Install dependencies
pip install -r requirements.txt
# Run MARB against all configured providers
python -m exa_benchmark.cli \
--provider none \
--provider exa \
--provider parallel \
--provider serpapi \
--provider tavily \
--tasks marb_tasks
Under the hood, the CLI runs all selected providers in parallel using a thread pool. Each provider gets its own progress bar, so you can watch Exa, Parallel, SerpAPI, Tavily, and the no-search baseline advance simultaneously while the benchmark runs. The summary table at the end reports both success rate and wall-clock time per provider.
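The parallel fan-out itself can be as small as a ThreadPoolExecutor mapping providers to benchmark passes. This sketch assumes a run_provider helper that executes the agent loop over every task, and it omits the per-provider progress bars.
# Hedged sketch of the provider fan-out with wall-clock timing.
import time
from concurrent.futures import ThreadPoolExecutor

def run_all_providers(providers, tasks):
    results = {}

    def run_one(provider):
        start = time.perf_counter()
        outcome = run_provider(provider, tasks)  # assumed helper: agent loop over all tasks
        return provider.name, outcome, time.perf_counter() - start

    # One worker per provider so all benchmark passes advance concurrently.
    with ThreadPoolExecutor(max_workers=len(providers)) as pool:
        for name, outcome, elapsed in pool.map(run_one, providers):
            results[name] = {"outcome": outcome, "seconds": elapsed}
    return results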
Mapping search quality to real engineering outcomes
MARB is deliberately small and opinionated rather than exhaustive. The goal is not to publish a leaderboard; it's to give Exa and potential customers a concrete, reproducible way to answer:
“On the kinds of coding and infra tasks we care about, does Exa actually help our agents ship more things, faster?”
Because the tasks are grounded in realistic workflows (Dockerfiles, CI, k8s, FastAPI, logging, LLM eval harnesses), improvements on this benchmark should correlate with less time debugging agents and more tasks completed automatically.
How to improve benchmark performance
Based on the results from MARB, here are several strategies to improve search provider performance on agent-based tasks:
For providers scoring below 70%
- Improve query understanding: Focus on better semantic parsing of technical queries and domain-specific terminology.
- Enhance result relevance: Implement better ranking algorithms that prioritize authoritative technical documentation and recent content.
- Optimize snippet extraction: Ensure snippets contain actionable code examples and configuration patterns, not just descriptions.
For providers scoring 70-85%
- Fine-tune context windows: Experiment with different amounts of context returned to balance comprehensiveness with relevance.
- Add re-ranking mechanisms: Implement a second-pass ranking based on task-specific signals like recency, code presence, and source authority; a small sketch follows this list.
- Support structured queries: Allow agents to specify filters like date ranges, file types, or specific domains for more targeted results.
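To make the re-ranking idea concrete, here is a hedged sketch of a second-pass scorer over normalized results; the signals and weights are purely illustrative, not taken from MARB.
# Illustrative second-pass re-ranking over normalized SearchResult objects.
def rerank(results: list[SearchResult], recent_year: str = "2024") -> list[SearchResult]:
    def score(result: SearchResult) -> float:
        text = f"{result.title} {result.snippet}".lower()
        s = 0.0
        if recent_year in text:
            s += 1.0   # crude recency signal
        if "def " in result.snippet or "import " in result.snippet or "```" in result.snippet:
            s += 1.5   # presence of code
        if any(domain in result.url for domain in ("docs.", "readthedocs.io", "github.com")):
            s += 2.0   # source authority
        return s

    return sorted(results, key=score, reverse=True)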
For all providers
- Analyze failure patterns: Review tasks that commonly fail to identify gaps in coverage or retrieval strategy.
- Implement query expansion: Automatically expand technical terms with synonyms and related concepts (e.g., "k8s" → "Kubernetes"); a small sketch follows this list.
- Cache and learn from usage: Build provider-specific knowledge of what types of queries work best and optimize accordingly.
- Consider hybrid approaches: Combine multiple search strategies (semantic + keyword) or multiple providers for better coverage.
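And a minimal version of the query-expansion idea from the list above, with a hypothetical alias table:
# Hypothetical alias table for expanding shorthand technical terms.
ALIASES = {
    "k8s": "Kubernetes",
    "hpa": "HorizontalPodAutoscaler",
    "gha": "GitHub Actions",
}

def expand_query(query: str) -> str:
    # Append expansions so both the shorthand and the full term are searchable.
    expansions = [full for short, full in ALIASES.items() if short in query.lower().split()]
    return f"{query} ({' '.join(expansions)})" if expansions else query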
The key takeaway from MARB is that agent-oriented search differs from human search: agents need structured, actionable information with working code examples, not just conceptual explanations. In MARB's small sample, Exa, which optimizes for exactly these needs with its focus on technical content, outperformed both generic web search and the no-search baseline on engineering tasks.