What is MedAgentBench?

MedAgentBench is a comprehensive evaluation suite designed to benchmark the agent capabilities of large language models (LLMs) in medical-records settings. Unlike traditional medical AI benchmarks that focus on question answering, MedAgentBench challenges AI agents with 300 clinically relevant tasks that require interaction with a FHIR-compliant environment. By offering a structured yet unsaturated benchmark, MedAgentBench enables researchers and developers to track progress and optimize AI-driven medical agents for enhanced clinical integration.
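
Concretely, each task boils down to reading from, and often writing to, a FHIR server through its REST API. The snippet below is a minimal sketch of the read side in Python; the base URL, patient identifier, and LOINC code are hypothetical placeholders rather than values from the benchmark, and support for individual search parameters can vary by server.

```python
import requests

# Hypothetical FHIR endpoint; substitute the base URL of your local
# MedAgentBench deployment.
FHIR_BASE = "http://localhost:8080/fhir"

def latest_observation(patient_id: str, loinc_code: str):
    """Fetch the most recent Observation (e.g., a lab result) for a patient."""
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={
            "patient": patient_id,  # FHIR search parameter: which patient
            "code": loinc_code,     # LOINC code identifying the measurement
            "_sort": "-date",       # newest first
            "_count": 1,            # only the latest entry
        },
        timeout=10,
    )
    resp.raise_for_status()
    bundle = resp.json()            # FHIR search results come back as a Bundle
    entries = bundle.get("entry", [])
    return entries[0]["resource"] if entries else None

# Example with hypothetical identifiers: latest serum creatinine for one patient.
# obs = latest_observation("example-patient-id", "2160-0")
```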

Figure 1. MedAgentBench overview

Why MedAgentBench?

  • Bridges the Gap in Medical AI Benchmarking – Existing benchmarks focus on static question-answering, but MedAgentBench evaluates AI agents in interactive clinical scenarios.
  • Agent-Centric Evaluation – Assesses decision-making, planning, and execution of tasks within electronic health records (EHRs), going beyond traditional chatbot capabilities.
  • Comprehensive and Clinically Relevant – Features 300 physician-written tasks across 10 medical categories, covering essential clinical workflows.
  • Realistic Virtual Environment – Built on a FHIR-compliant interactive setup with 100 de-identified realistic patient profiles and 700,000+ medical data points, enabling practical and scalable AI testing.
  • Supports AI Development and Deployment – Provides an unsaturated benchmark to drive AI innovation in healthcare, helping developers improve AI agents for integration into clinical practice.

How is the MedAgentBench dataset structured?

Benchmark examples are based on real patient cases extracted from a de-identified clinical data warehouse curated by the STARR (STAnford Research Repository) project. Timestamps in the data warehouse are jittered at the patient level. To provide realistic contexts, we extract lab test results, vital signs, procedure orders, diagnoses, and medication orders from the last five years (with November 13, 2018 as the cutoff date). The extracted records fall into the following categories (a short sketch of reading them back through the FHIR API follows the list):

  • Patient profiles, cohort, and demographics
  • Lab test results
  • Vital signs
  • Procedure orders
  • Diagnoses
  • Medication orders
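
These categories map onto standard FHIR resource types (Patient, Observation for labs and vital signs, Procedure, Condition, MedicationRequest). The sketch below reuses the same hypothetical local endpoint as above to tally how many records of each type exist for one patient; `_summary=count` is a standard FHIR search modifier that asks the server to return only the total.

```python
import requests

FHIR_BASE = "http://localhost:8080/fhir"  # hypothetical local deployment

# FHIR resource types corresponding to the categories listed above.
RESOURCE_TYPES = ["Observation", "Procedure", "Condition", "MedicationRequest"]

def count_records(patient_id: str) -> dict:
    """Return the number of records of each resource type for one patient."""
    counts = {}
    for rtype in RESOURCE_TYPES:
        resp = requests.get(
            f"{FHIR_BASE}/{rtype}",
            params={"patient": patient_id, "_summary": "count"},
            timeout=10,
        )
        resp.raise_for_status()
        counts[rtype] = resp.json().get("total", 0)  # Bundle.total holds the count
    return counts
```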

Table 1. Characteristics of patient cohort

Characteristic | Value
Unique individuals | 100
Age (avg. ± SD), years | 58.15 ± 19.82
% Female | 47%
Number of records (total) | 785,207
Number of Observation records | 563,426
Number of Procedure records | 124,969
Number of Condition records | 74,821
Number of MedicationRequest records | 21,991

How well do AI agents perform on MedAgentBench?

The dataset can be downloaded directly from our GitHub repository. Please refer to our GitHub instructions for how to read and use the data.
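
The authoritative file layout is described in the GitHub instructions; the sketch below only illustrates the general pattern of loading a task file and iterating over tasks. The file name and field names here are hypothetical, not the repository's actual schema.

```python
import json

# Hypothetical file name; consult the GitHub instructions for the actual
# released task file and patient data dump.
with open("medagentbench_tasks.json") as f:
    tasks = json.load(f)

for task in tasks[:3]:
    # Each task is expected to pair a natural-language instruction with the
    # patient context needed to carry it out (field names are assumptions).
    print(task.get("id"), task.get("instruction", "")[:80])
```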

Table 2. Success rate (SR) of state-of-the-art LLMs on MedAgentBench

Model | Size | Form | Overall SR | Query SR | Action SR
Claude 3.5 Sonnet v2 | N/A | API | 69.67% | 85.33% | 54.00%
GPT-4o | N/A | API | 64.00% | 72.00% | 56.00%
DeepSeek-V3 | 685B | Open | 62.67% | 70.67% | 54.67%
Gemini-1.5 Pro | N/A | API | 62.00% | 52.67% | 71.33%
GPT-4o-mini | N/A | API | 56.33% | 59.33% | 53.33%
o3-mini | N/A | API | 51.67% | 54.67% | 48.67%
Qwen2.5 | 72B | Open | 51.33% | 38.67% | 64.00%
Llama 3.3 | 70B | Open | 46.33% | 50.00% | 42.67%
Gemini 2.0 Flash | N/A | API | 38.33% | 34.00% | 42.67%
Gemma2 | 27B | Open | 19.33% | 38.67% | 0.00%
Gemini 2.0 Pro | N/A | API | 18.00% | 25.33% | 10.67%
Mistral v0.3 | 7B | Open | 4.00% | 8.00% | 0.00%
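
Success rate (SR) here is simply the percentage of tasks an agent completes correctly. The overall column equals the mean of the query and action columns for every row, which is consistent with query (read-only) and action (record-modifying) tasks being equally represented. A minimal sketch of the computation over hypothetical per-task boolean outcomes:

```python
def success_rate(outcomes):
    """Percentage of tasks solved, where `outcomes` is a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Placeholder outcomes; the real benchmark grades each task automatically.
query_results = [True, False, True]
action_results = [False, True, True]

query_sr = success_rate(query_results)
action_sr = success_rate(action_results)
overall_sr = success_rate(query_results + action_results)
print(f"Query SR {query_sr:.2f}%, Action SR {action_sr:.2f}%, Overall SR {overall_sr:.2f}%")
```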

To learn more, read our publication.

If you have questions about our work, contact us at jiang6@cs.stanford.edu, kb633@stanford.edu, and gll2027@stanford.edu.