Yixing Jiang∗, Kameron C. Black∗, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, Jonathan H. Chen
MedAgentBench is a comprehensive evaluation suite designed to benchmark the agent capabilities of large language models (LLMs) in medical records settings. Unlike traditional medical AI benchmarks that focus on question answering, MedAgentBench challenges AI agents with 300 clinically relevant tasks that require interaction with a FHIR-compliant environment. By offering a structured yet unsaturated benchmark, MedAgentBench enables researchers and developers to track progress and optimize AI-driven medical agents for enhanced clinical integration.
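
To make the idea of interacting with a FHIR-compliant environment concrete, the sketch below builds the kind of search request an agent might send to a FHIR REST API. Only the `[base]/[ResourceType]?param=value` search pattern comes from the FHIR standard; the base URL, the patient identifier, the code value, and the helper name `build_search_url` are hypothetical placeholders rather than details of MedAgentBench.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the actual benchmark environment may differ.
FHIR_BASE = "http://localhost:8080/fhir"


def build_search_url(resource_type: str, **params: str) -> str:
    """Return a FHIR search URL of the form [base]/[type]?key=value&..."""
    url = f"{FHIR_BASE}/{resource_type}"
    if not params:
        return url
    # Sort the parameters so the generated URL is deterministic.
    return f"{url}?{urlencode(sorted(params.items()))}"


# Example: search Observation records for a made-up patient and code.
url = build_search_url("Observation", patient="example-patient-1", code="GLU")
print(url)
```

The function only constructs the URL; sending the request and parsing the returned bundle are left out to keep the example free of network access.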
Benchmark examples are based on real patient cases extracted from a deidentified clinical data warehouse curated by the STARR (STAnford Research Repository) project. The timestamps in the data warehouse are jittered at the patient level. To provide realistic contexts, we extract lab test results, vital signs, procedure orders, diagnoses, and medication orders in the last five years (November 13, 2018 as the cutoff date).
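
Patient-level jittering generally means that all of one patient's timestamps are shifted by the same offset, so the intervals between that patient's events are preserved. The sketch below illustrates the general idea only; the offset range, the seed, and the function name `jitter_by_patient` are assumptions and do not describe how STARR actually performs the shift.

```python
import random
from datetime import datetime, timedelta


def jitter_by_patient(records, max_days=30, seed=0):
    """Shift every timestamp by one random offset per patient.

    `records` is a list of (patient_id, datetime) pairs. The offset range
    and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    offsets = {}
    shifted = []
    for patient_id, ts in records:
        # Draw the offset once per patient and reuse it for all their events.
        if patient_id not in offsets:
            offsets[patient_id] = timedelta(days=rng.randint(-max_days, max_days))
        shifted.append((patient_id, ts + offsets[patient_id]))
    return shifted


records = [
    ("A", datetime(2020, 1, 1)),
    ("A", datetime(2020, 1, 8)),
    ("B", datetime(2020, 1, 1)),
]
shifted = jitter_by_patient(records)
# The gap between patient A's two events is unchanged by the shift.
print(shifted[1][1] - shifted[0][1])
```
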
Name | Value |
---|---|
Unique individuals | 100 |
Age (avg. ± SD) | 58.15 ± 19.82 |
% Female | 47% |
Number of records (total) | 785,207 |
Number of Observation records | 563,426 |
Number of Procedure records | 124,969 |
Number of Condition records | 74,821 |
Number of MedicationRequest records | 21,991 |
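
As a quick arithmetic check on the table above, the four per-resource counts can be summed and compared with the reported total. The snippet copies its numbers from the table and additionally computes each resource type's share of all records.

```python
# Per-resource record counts copied from the table above.
counts = {
    "Observation": 563_426,
    "Procedure": 124_969,
    "Condition": 74_821,
    "MedicationRequest": 21_991,
}
reported_total = 785_207

total = sum(counts.values())
# Share of each resource type, as a percentage rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total == reported_total)
print(shares)
```

The counts add up to the reported total, and Observation records make up the large majority of the dataset.
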
Our dataset can be downloaded directly from our GitHub repository. Please refer to the instructions in that repository for how to read and use the data.
Model | Size | Form | Overall success rate (SR) | Query SR | Action SR |
---|---|---|---|---|---|
Claude 3.5 Sonnet v2 | N/A | API | 69.67% | 85.33% | 54.00% |
GPT-4o | N/A | API | 64.00% | 72.00% | 56.00% |
DeepSeek-V3 | 685B | Open | 62.67% | 70.67% | 54.67% |
Gemini-1.5 Pro | N/A | API | 62.00% | 52.67% | 71.33% |
GPT-4o-mini | N/A | API | 56.33% | 59.33% | 53.33% |
o3-mini | N/A | API | 51.67% | 54.67% | 48.67% |
Qwen2.5 | 72B | Open | 51.33% | 38.67% | 64.00% |
Llama 3.3 | 70B | Open | 46.33% | 50.00% | 42.67% |
Gemini 2.0 Flash | N/A | API | 38.33% | 34.00% | 42.67% |
Gemma2 | 27B | Open | 19.33% | 38.67% | 0.00% |
Gemini 2.0 Pro | N/A | API | 18.00% | 25.33% | 10.67% |
Mistral v0.3 | 7B | Open | 4.00% | 8.00% | 0.00% |
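
The Overall column in the table above appears to be the equal-weight average of the Query and Action columns. This is an inference from the reported numbers, not a definition stated in the text; the sketch below checks it on a few rows copied from the table.

```python
# (model, overall, query, action) success rates in percent, from the table above.
rows = [
    ("Claude 3.5 Sonnet v2", 69.67, 85.33, 54.00),
    ("GPT-4o", 64.00, 72.00, 56.00),
    ("Gemini-1.5 Pro", 62.00, 52.67, 71.33),
    ("Gemma2", 19.33, 38.67, 0.00),
]


def mean_sr(query_sr, action_sr):
    """Equal-weight average of the two category success rates."""
    return (query_sr + action_sr) / 2


for model, overall, query, action in rows:
    # Allow a small tolerance because the table values are rounded to two decimals.
    consistent = abs(mean_sr(query, action) - overall) <= 0.011
    print(f"{model}: {consistent}")
```

If the relationship holds, a model's overall ranking can hide large gaps between categories: Gemini-1.5 Pro has the highest Action score in the table but a much lower Query score.
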
If you have questions about our work, contact us at jiang6@cs.stanford.edu, kb633@stanford.edu, or gll2027@stanford.edu.