What is MedAgentBench?

MedAgentBench is a comprehensive evaluation suite designed to benchmark the agent capabilities of large language models (LLMs) in medical-records settings. Unlike traditional medical AI benchmarks that focus on question answering, MedAgentBench challenges AI agents with 300 clinically relevant tasks that require interaction with a FHIR-compliant environment. By offering a structured yet unsaturated benchmark, MedAgentBench enables researchers and developers to track progress and optimize AI-driven medical agents for enhanced clinical integration.
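
Concretely, each task boils down to reading from, and often writing to, a FHIR server through its REST API. The snippet below is a minimal sketch of the read side in Python; the base URL, patient identifier, and LOINC code are hypothetical placeholders rather than values from the benchmark, and support for individual search parameters can vary by server.

```python
import requests

# Hypothetical FHIR endpoint; substitute the base URL of your local
# MedAgentBench deployment.
FHIR_BASE = "http://localhost:8080/fhir"

def latest_observation(patient_id: str, loinc_code: str):
    """Fetch the most recent Observation (e.g., a lab result) for a patient."""
    resp = requests.get(
        f"{FHIR_BASE}/Observation",
        params={
            "patient": patient_id,  # FHIR search parameter: which patient
            "code": loinc_code,     # LOINC code identifying the measurement
            "_sort": "-date",       # newest first
            "_count": 1,            # only the latest entry
        },
        timeout=10,
    )
    resp.raise_for_status()
    bundle = resp.json()            # FHIR search results come back as a Bundle
    entries = bundle.get("entry", [])
    return entries[0]["resource"] if entries else None

# Example with hypothetical identifiers: latest serum creatinine for one patient.
# obs = latest_observation("example-patient-id", "2160-0")
```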

Figure 1. MedAgentBench overview

Why MedAgentBench?

  • Bridges the Gap in Medical AI Benchmarking – Existing benchmarks focus on static question-answering, but MedAgentBench evaluates AI agents in interactive clinical scenarios.
  • Agent-Centric Evaluation – Assesses decision-making, planning, and execution of tasks within electronic health records (EHRs), going beyond traditional chatbot capabilities.
  • Comprehensive and Clinically Relevant – Features 300 physician-written tasks across 10 medical categories, covering essential clinical workflows.
  • Realistic Virtual Environment – Built on a FHIR-compliant interactive setup with 100 de-identified realistic patient profiles and 700,000+ medical data points, enabling practical and scalable AI testing.
  • Supports AI Development and Deployment – Provides an unsaturated benchmark to drive AI innovation in healthcare, helping developers improve AI agents for integration into clinical practice.

How is the MedAgentBench dataset structured?

Benchmark examples are based on real patient cases extracted from a de-identified clinical data warehouse curated by the STARR (STAnford Research Repository) project. Timestamps in the data warehouse are jittered at the patient level. To provide realistic contexts, we extract lab test results, vital signs, procedure orders, diagnoses, and medication orders from the last five years (with November 13, 2018 as the cutoff date). The extracted records fall into the following categories (a short sketch of reading them back through the FHIR API follows the list):

  • Patient profiles, cohort, and demographics
  • Lab test results
  • Vital signs
  • Procedure orders
  • Diagnoses
  • Medication orders
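
These categories map onto standard FHIR resource types (Patient, Observation for labs and vital signs, Procedure, Condition, MedicationRequest). The sketch below reuses the same hypothetical local endpoint as above to tally how many records of each type exist for one patient; `_summary=count` is a standard FHIR search modifier that asks the server to return only the total.

```python
import requests

FHIR_BASE = "http://localhost:8080/fhir"  # hypothetical local deployment

# FHIR resource types corresponding to the categories listed above.
RESOURCE_TYPES = ["Observation", "Procedure", "Condition", "MedicationRequest"]

def count_records(patient_id: str) -> dict:
    """Return the number of records of each resource type for one patient."""
    counts = {}
    for rtype in RESOURCE_TYPES:
        resp = requests.get(
            f"{FHIR_BASE}/{rtype}",
            params={"patient": patient_id, "_summary": "count"},
            timeout=10,
        )
        resp.raise_for_status()
        counts[rtype] = resp.json().get("total", 0)  # Bundle.total holds the count
    return counts
```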

Table 1. Characteristics of patient cohort

Characteristic | Value
Unique individuals | 100
Age (avg. ± SD), years | 58.15 ± 19.82
% Female | 47%
Number of records (total) | 785,207
Number of Observation records | 563,426
Number of Procedure records | 124,969
Number of Condition records | 74,821
Number of MedicationRequest records | 21,991

How well do AI agents perform on MedAgentBench?

The dataset can be downloaded directly from our GitHub repository. Please refer to our GitHub instructions for how to read and use the data.
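
The authoritative file layout is described in the GitHub instructions; the sketch below only illustrates the general pattern of loading a task file and iterating over tasks. The file name and field names here are hypothetical, not the repository's actual schema.

```python
import json

# Hypothetical file name; consult the GitHub instructions for the actual
# released task file and patient data dump.
with open("medagentbench_tasks.json") as f:
    tasks = json.load(f)

for task in tasks[:3]:
    # Each task is expected to pair a natural-language instruction with the
    # patient context needed to carry it out (field names are assumptions).
    print(task.get("id"), task.get("instruction", "")[:80])
```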

Table 2. Success rate (SR) of state-of-the-art LLMs on MedAgentBench

Model | Size | Form | Overall SR | Query SR | Action SR
Claude 3.5 Sonnet v2 | N/A | API | 69.67% | 85.33% | 54.00%
GPT-4o | N/A | API | 64.00% | 72.00% | 56.00%
DeepSeek-V3 | 685B | Open | 62.67% | 70.67% | 54.67%
Gemini-1.5 Pro | N/A | API | 62.00% | 52.67% | 71.33%
GPT-4o-mini | N/A | API | 56.33% | 59.33% | 53.33%
o3-mini | N/A | API | 51.67% | 54.67% | 48.67%
Qwen2.5 | 72B | Open | 51.33% | 38.67% | 64.00%
Llama 3.3 | 70B | Open | 46.33% | 50.00% | 42.67%
Gemini 2.0 Flash | N/A | API | 38.33% | 34.00% | 42.67%
Gemma2 | 27B | Open | 19.33% | 38.67% | 0.00%
Gemini 2.0 Pro | N/A | API | 18.00% | 25.33% | 10.67%
Mistral v0.3 | 7B | Open | 4.00% | 8.00% | 0.00%
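
Success rate (SR) here is simply the percentage of tasks an agent completes correctly. The overall column equals the mean of the query and action columns for every row, which is consistent with query (read-only) and action (record-modifying) tasks being equally represented. A minimal sketch of the computation over hypothetical per-task boolean outcomes:

```python
def success_rate(outcomes):
    """Percentage of tasks solved, where `outcomes` is a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Placeholder outcomes; the real benchmark grades each task automatically.
query_results = [True, False, True]
action_results = [False, True, True]

query_sr = success_rate(query_results)
action_sr = success_rate(action_results)
overall_sr = success_rate(query_results + action_results)
print(f"Query SR {query_sr:.2f}%, Action SR {action_sr:.2f}%, Overall SR {overall_sr:.2f}%")
```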

To learn more, read our publication.

If you have questions about our work, contact us at jiang6@cs.stanford.edu, kb633@stanford.edu, and gll2027@stanford.edu.