RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Chengzhi Shen¹,²,¹⁰, Weixiang Shen¹,²,³, Tobias Susetzky¹,², Chen (Cherise) Chen⁴, Jun Li¹, Yuyuan Liu⁵, Xuepeng Zhang⁶, Zhenyu Gong⁷,†, Daniel Rueckert¹,²,⁸,⁹,¹⁰,†, Jiazhen Pan¹,²,⁹,†

¹Technical University of Munich (TUM)   ²TUM University Hospital   ³LMU Munich
⁴University of Sheffield   ⁵University of Oxford   ⁶Zhongshan Hospital Fudan University
⁷Sun Yat-sen University Cancer Center   ⁸Imperial College London
⁹Munich Center for Machine Learning (MCML)   ¹⁰relAI – Konrad Zuse School of Excellence in Reliable AI

†Corresponding authors

ICU AI co-pilot decision support overview

ICU decisions are made under massive data volume and time pressure. An ICU AI co-pilot integrates evolving data streams into a decision-support panel that assesses Patient Status, identifies Acute Problems, proposes Recommended Actions, and warns against unsafe Red Flag actions.

Key Contributions

1. We define the core capabilities of a useful ICU AI co-pilot together with over 30 clinicians: assessing Patient Status, identifying Acute Problems, proposing Recommended Actions, and warning against Red Flag Actions.

2. We propose a systematic benchmark and evaluation framework. RealICU-Gold contains 930 physician-consensus windows; RealICU-Scale contains 11,862 windows labeled by Oracle, a physician-validated LLM evaluator. All annotations are created through hindsight review of the full patient trajectory, reflecting clinical correctness rather than behavior imitation.

3. We introduce ICU-Evo to study memory-augmented agents: a structured-memory agent that organizes clinical context into heterogeneous memory types aligned with how clinicians reason.

4. We identify key failure modes: a recall–safety tradeoff and an anchoring bias across frontier LLMs. Structured memory helps long-horizon reasoning but is not sufficient for safe ICU decision support.

Abstract

Intensive care units (ICUs) generate long, dense, and evolving streams of clinical information, and physicians must repeatedly reassess patient state under time pressure, which creates a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, those actions were taken with incomplete information and limited temporal context about the underlying patient state, so they may be suboptimal, which makes it difficult to assess the true reasoning capabilities of AI systems.

We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions: labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assessing Patient Status, identifying Acute Problems, proposing Recommended Actions, and flagging Red Flag actions that risk unsafe outcomes. We partition each trajectory into 30-minute windows and release two datasets: RealICU-Gold, with 930 annotated windows from 94 MIMIC-IV patients, and RealICU-Scale, which extends the annotations to 11,862 windows using Oracle, a physician-validated LLM hindsight labeler.
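The 30-minute windowing step can be illustrated with a minimal sketch. The event format `(timestamp, name, value)`, the function name `partition_into_windows`, and the choice to anchor windows at the ICU admission time are assumptions made for illustration, not details confirmed by the benchmark.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # window length stated in the abstract


def partition_into_windows(events, start):
    """Group (timestamp, name, value) events into consecutive 30-minute windows.

    Window i covers [start + i * WINDOW, start + (i + 1) * WINDOW).
    Anchoring at `start` (e.g. ICU admission) is an assumption of this sketch.
    """
    windows = defaultdict(list)
    for ts, name, value in events:
        if ts < start:
            continue  # skip events recorded before the anchor time
        index = (ts - start) // WINDOW  # timedelta // timedelta yields an int
        windows[index].append((ts, name, value))
    return dict(windows)


# Hypothetical events, purely for illustration.
start = datetime(2024, 1, 1, 8, 0)
events = [
    (start + timedelta(minutes=5), "heart_rate", 92),
    (start + timedelta(minutes=29), "map", 61),
    (start + timedelta(minutes=30), "lactate", 3.1),
    (start + timedelta(minutes=95), "heart_rate", 118),
]
windows = partition_into_windows(events, start)
print(sorted(windows))  # window indices that contain at least one event
```

Windows without any event are simply absent from the result; whether empty windows should still be emitted for evaluation is a design choice this sketch leaves open.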

Existing LLMs, including memory-augmented ones, perform poorly on RealICU, exposing two failure modes: a recall–safety tradeoff in clinical recommendations and an anchoring bias toward early interpretations of the patient. We further introduce ICU-Evo, a structured-memory agent that improves long-horizon reasoning but does not fully eliminate safety failures. Together, these contributions make RealICU a clinically grounded testbed for measuring and improving sequential AI decision support in high-stakes care.

Four Physician-Motivated Tasks

RealICU evaluates four core capabilities required for a clinically useful ICU co-pilot, defined in consultation with over 30 board-certified clinicians.

🩺 Patient Status: Assess the current overall status of the patient based on all available clinical data.

🔍 Acute Problems: Identify acute clinical problems that require continuous attention and monitoring in future bedside care.

💊 Recommended Actions: Propose short-term treatment recommendations appropriate for the patient's condition.

⚠️ Red Flag Actions: Identify actions that should be avoided for this specific patient because they might cause unsafe outcomes.

The RealICU Benchmark

RealICU is built on MIMIC-IV; its labels are produced by physicians who review the complete patient trajectory in hindsight.

Beyond Behavior Imitation: Hindsight Reviewing

Traditional ICU benchmarks treat recorded clinician actions as ground truth, but these actions reflect decisions made under uncertainty, not necessarily optimal care. RealICU's evaluation is grounded in hindsight review of the full patient trajectory, scoring LLM agents based on clinical correctness rather than behavioral imitation.

RealICU-Gold: 930 annotated windows from 94 ICU patients

Physician-consensus annotations from five senior ICU physicians who review full patient trajectories in hindsight. This set serves as the gold standard for reliable model evaluation.

RealICU-Scale: 11,862 annotated windows, roughly 13× the size of Gold

A large-scale extension labeled by Oracle, a physician-validated LLM-based hindsight evaluator calibrated against expert consensus.

Summary statistics of the RealICU patient cohort derived from MIMIC-IV, including age, gender, ICU duration, disease categories, and outcome distributions.

ICU-Evo: Agent Pipeline

We introduce ICU-Evo as an instance of a memory-augmented agent framework to study how structured memory design shapes clinical decision-making.

⚙️ Observation Agent (rule-based): Normalizes measurements, aligns them to the 30-minute window, and extracts vital-sign trend signals.

→ ⚙️ Assessment Agent (LLM): Compresses the trajectory into a summary and identifies critical clinical events.

→ ⚙️ Insight Agent (LLM): Proposes patient-specific hypotheses.

→ ⚙️ Predictor (LLM): Reads the full memory state and makes predictions for the downstream tasks.

Shared Memory: 📝 Working · 📈 Trend · ⚡ Critical Events · 📖 Trajectory · 🧩 Insight
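To make the data flow concrete, the pipeline above can be sketched as four functions that read and write one shared memory object. Everything in this sketch is an assumption for illustration: the `SharedMemory` field types, the naive trend rule (last reading minus first), the prompt strings, and the `llm` callable, which stands in for any text-in, text-out model and is not a real API.

```python
from dataclasses import dataclass, field
from typing import Callable

LLM = Callable[[str], str]  # hypothetical text-in, text-out model interface


@dataclass
class SharedMemory:
    working: dict = field(default_factory=dict)          # readings in the current window
    trend: dict = field(default_factory=dict)            # per-signal change in the window
    critical_events: list = field(default_factory=list)  # notable events so far
    trajectory: str = ""                                 # compressed running summary
    insight: list = field(default_factory=list)          # patient-specific hypotheses


def observation_agent(memory: SharedMemory, window: dict) -> None:
    """Rule-based step: store readings and a naive trend (last minus first)."""
    memory.working = window
    memory.trend = {
        name: values[-1] - values[0]
        for name, values in window.items()
        if len(values) >= 2
    }


def assessment_agent(memory: SharedMemory, llm: LLM) -> None:
    """LLM step: refresh the trajectory summary and record a critical event, if any."""
    memory.trajectory = llm(
        f"Summarize. Previous: {memory.trajectory} Window: {memory.working}"
    )
    event = llm(f"Name one critical event or reply NONE. Trends: {memory.trend}")
    if event.strip().upper() != "NONE":
        memory.critical_events.append(event)


def insight_agent(memory: SharedMemory, llm: LLM) -> None:
    """LLM step: add one patient-specific hypothesis."""
    memory.insight.append(
        llm(f"Hypothesis given: {memory.trajectory}; events: {memory.critical_events}")
    )


def predictor(memory: SharedMemory, llm: LLM) -> str:
    """LLM step: read the full memory state and answer the downstream tasks."""
    return llm(f"Predict status, problems, actions, red flags from: {memory}")


def run_window(memory: SharedMemory, window: dict, llm: LLM) -> str:
    observation_agent(memory, window)
    assessment_agent(memory, llm)
    insight_agent(memory, llm)
    return predictor(memory, llm)


# Usage with a stub model so the sketch runs without any external service.
memory = SharedMemory()
output = run_window(memory, {"heart_rate": [88, 104], "map": [70]}, lambda p: "stub")
```

Keeping the memory types as separate fields, rather than one growing text buffer, lets each agent overwrite or append only the part it owns, which is the design idea the section attributes to structured memory.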

Benchmark Results and Failure Modes

Benchmark Results

RealICU-Gold Evaluation Results — Physician-Annotated Gold Standard (930 windows)

Backbone	System	Patient Status Acc ↑	Patient Status F1 ↑	Acute Problems Hit@5 ↑	Acute Problems R@5 ↑	Recommended Actions Hit@5 ↑	Recommended Actions R@5 ↑	Red Flags HRR@5 ↓
Gemini-3.1-pro	Full-context	0.298	0.258	0.486	0.308	0.259	0.152	0.137
Gemini-3.1-pro	Local-window	0.315	0.239	0.459	0.258	0.395	0.260	0.151
Gemini-3.1-pro	RAG	0.402	0.348	0.596	0.342	0.496	0.313	0.216
Gemini-3.1-pro	ICU-Evo	0.459	0.365	0.823	0.526	0.676	0.534	0.300
GPT-5.4	Full-context	0.294	0.233	0.510	0.348	0.404	0.300	0.298
GPT-5.4	Local-window	0.233	0.184	0.500	0.293	0.380	0.281	0.165
GPT-5.4	RAG	0.288	0.256	0.599	0.349	0.480	0.398	0.234
GPT-5.4	ICU-Evo	0.312	0.264	0.867	0.570	0.676	0.534	0.473
Qwen3-235B	Full-context	0.225	0.188	0.384	0.226	0.329	0.222	0.117
Qwen3-235B	Local-window	0.152	0.154	0.213	0.126	0.352	0.242	0.080
Qwen3-235B	RAG	0.315	0.271	0.379	0.211	0.453	0.324	0.095
Qwen3-235B	ICU-Evo	0.253	0.197	0.600	0.362	0.526	0.357	0.117

↑ higher is better · ↓ lower is better (HRR@5)
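The table reports Hit@5 and R@5 without defining them on this page. A minimal sketch under the common reading (Hit@k is 1 if any of the top-k predictions matches a reference item; R@k is the fraction of reference items found among the top-k predictions) is given below. The function names `hit_at_k` and `recall_at_k` and the exact-string matching after lowercasing are assumptions for illustration; the benchmark's actual matching procedure may differ.

```python
def _normalize(item: str) -> str:
    """Lowercase and trim so that trivially different spellings match."""
    return item.strip().lower()


def hit_at_k(predicted: list, gold: list, k: int = 5) -> float:
    """1.0 if any of the first k predictions matches a gold item, else 0.0."""
    top_k = {_normalize(p) for p in predicted[:k]}
    gold_set = {_normalize(g) for g in gold}
    return 1.0 if top_k & gold_set else 0.0


def recall_at_k(predicted: list, gold: list, k: int = 5) -> float:
    """Fraction of gold items that appear among the first k predictions."""
    gold_set = {_normalize(g) for g in gold}
    if not gold_set:
        return 0.0  # convention chosen for this sketch when nothing is annotated
    top_k = {_normalize(p) for p in predicted[:k]}
    return len(top_k & gold_set) / len(gold_set)


# Hypothetical window: six ranked predictions, three reference problems.
predicted = [
    "septic shock", "acute kidney injury", "hypoxemia",
    "anemia", "delirium", "hyperkalemia",
]
gold = ["Septic shock", "Hyperkalemia", "Acute kidney injury"]

hit = hit_at_k(predicted, gold)        # "hyperkalemia" is ranked sixth, outside the top 5
recall = recall_at_k(predicted, gold)  # two of the three gold items are in the top 5
```

Per-window scores like these would then be averaged over all windows to produce a single number per system.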

RealICU-Scale Evaluation Results — Oracle-Labeled Scale Set (11,862 windows)

Backbone	System	Patient Status Acc ↑	Patient Status F1 ↑	Acute Problems Hit@5 ↑	Acute Problems R@5 ↑	Recommended Actions Hit@5 ↑	Recommended Actions R@5 ↑	Red Flags HRR ↓
Gemini-3.1-pro	Full-context	n/a	n/a	n/a	n/a	n/a	n/a	n/a
Gemini-3.1-pro	Local-window	0.405	0.264	0.487	0.265	0.447	0.307	0.066
Gemini-3.1-pro	RAG	0.442	0.312	0.568	0.315	0.466	0.331	0.073
Gemini-3.1-pro	ICU-Evo	0.519	0.348	0.827	0.518	0.514	0.330	0.087
GPT-5.4	Full-context	n/a	n/a	n/a	n/a	n/a	n/a	n/a
GPT-5.4	Local-window	0.415	0.265	0.475	0.266	0.451	0.308	0.073
GPT-5.4	RAG	0.411	0.269	0.584	0.321	0.509	0.435	0.096
GPT-5.4	ICU-Evo	0.438	0.327	0.852	0.562	0.575	0.368	0.090
Qwen3-235B	Full-context	0.201	0.116	0.401	0.232	0.455	0.299	0.215
Qwen3-235B	Local-window	0.175	0.159	0.254	0.142	0.440	0.295	0.207
Qwen3-235B	RAG	0.367	0.282	0.379	0.207	0.446	0.342	0.225
Qwen3-235B	ICU-Evo	0.304	0.177	0.649	0.375	0.515	0.327	0.292

↑ higher is better · ↓ lower is better (HRR) · n/a = not evaluated

Figure: Performance over ICU stay hours (three panels: Gemini-3.1-pro, GPT-5.4, Qwen3-235B).

Failure Modes

RealICU reveals critical limitations in current frontier LLMs that must be addressed before deployment in clinical settings. These findings underscore that RealICU remains largely unsolved.

⚠️ Recall–Safety Tradeoff

Higher recommendation recall comes with more unsafe recommendations: on RealICU-Gold, the red-flag metric HRR@5 rises as high as 0.473 (GPT-5.4 with ICU-Evo, versus 0.165 for the local-window baseline on the same backbone). Agents that suggest more actions to improve coverage also propose more potentially harmful interventions, creating a fundamental tension between completeness and safety in clinical recommendation systems.
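The tradeoff can be read directly off the RealICU-Gold table. The short sketch below copies the Recommended Actions R@5 and Red Flags HRR@5 values for the Local-window and ICU-Evo rows of each backbone and computes how both change together. The dictionary layout and the helper name `recall_and_risk_deltas` are illustrative choices, not part of the benchmark code.

```python
# (R@5 for Recommended Actions, HRR@5 for Red Flags), copied from the RealICU-Gold table.
gold_results = {
    "Gemini-3.1-pro": {"Local-window": (0.260, 0.151), "ICU-Evo": (0.534, 0.300)},
    "GPT-5.4": {"Local-window": (0.281, 0.165), "ICU-Evo": (0.534, 0.473)},
    "Qwen3-235B": {"Local-window": (0.242, 0.080), "ICU-Evo": (0.357, 0.117)},
}


def recall_and_risk_deltas(results: dict) -> dict:
    """Return (change in R@5, change in HRR@5) when moving from Local-window to ICU-Evo."""
    deltas = {}
    for backbone, systems in results.items():
        base_recall, base_hrr = systems["Local-window"]
        evo_recall, evo_hrr = systems["ICU-Evo"]
        deltas[backbone] = (
            round(evo_recall - base_recall, 3),
            round(evo_hrr - base_hrr, 3),
        )
    return deltas


deltas = recall_and_risk_deltas(gold_results)
for backbone, (d_recall, d_hrr) in deltas.items():
    print(f"{backbone}: R@5 {d_recall:+.3f}, HRR@5 {d_hrr:+.3f}")
```

For all three backbones both differences are positive: the gain in recommendation recall is accompanied by a higher red-flag rate, with the largest HRR@5 increase on GPT-5.4.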

⚠️ Anchoring Bias

Agents preserve early interpretations of a patient despite later contradictory evidence. As patient trajectories evolve over hours in the ICU, LLMs fail to adequately revise their clinical assessments in response to new laboratory results, vital signs, or medication changes, resulting in stale recommendations that do not reflect the current patient state.
