RealICU: Do LLM Agents Understand Long-Context ICU Data?
A Benchmark Beyond Behavior Imitation

Chengzhi Shen1,2,10, Weixiang Shen1,2,3, Tobias Susetzky1,2, Chen (Cherise) Chen4, Jun Li1, Yuyuan Liu5, Xuepeng Zhang6, Zhenyu Gong7,†, Daniel Rueckert1,2,8,9,10,†, Jiazhen Pan1,2,9,†
1Technical University of Munich (TUM)   2TUM University Hospital   3LMU Munich
4University of Sheffield   5University of Oxford   6Zhongshan Hospital Fudan University
7Sun Yat-sen University Cancer Center   8Imperial College London
9Munich Center for Machine Learning (MCML)   10relAI – Konrad Zuse School of Excellence in Reliable AI
Corresponding Authors
ICU AI co-pilot decision support overview

ICU decisions are made under massive data volume and time pressure. An ICU AI co-pilot integrates evolving data streams into a decision-support panel that assesses Patient Status, identifies Acute Problems, proposes Recommended Actions, and warns against unsafe Red Flag actions.

Key Contributions

1

We define the core capabilities of a useful ICU AI co-pilot in consultation with over 30 clinicians: assessing Patient Status, Acute Problems, Recommended Actions, and Red Flag Actions.

2

We propose a systematic benchmark and evaluation framework. RealICU-Gold: 930 physician-consensus windows. RealICU-Scale: 11,862 windows labeled by Oracle, a physician-validated LLM evaluator. All annotations are created through hindsight review of the full patient trajectory, reflecting clinical correctness rather than behavior imitation.

3

We introduce ICU-Evo, a structured-memory agent that organizes clinical context into heterogeneous memory types aligned with how clinicians reason, as a vehicle for studying memory-augmented agents.

4

We identify key failure modes: a recall–safety tradeoff and an anchoring bias across frontier LLMs. Structured memory helps long-horizon reasoning but is not sufficient for safe ICU decision support.

Abstract

Intensive care units (ICUs) generate long, dense, and evolving streams of clinical information, in which physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and a limited temporal view of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems.

We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created only after senior physicians review the full patient trajectory in hindsight. We formulate four physician-motivated tasks: assessing Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory into 30-minute windows and release two datasets: RealICU-Gold, with 930 annotated windows from 94 MIMIC-IV patients, and RealICU-Scale, with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler.

Existing LLMs, including memory-augmented ones, perform poorly on RealICU, exposing two failure modes: a recall–safety tradeoff in clinical recommendations and an anchoring bias toward early interpretations of the patient. We further introduce ICU-Evo, a structured-memory agent that improves long-horizon reasoning but does not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision support in high-stakes care.

Four Physician-Motivated Tasks

Informed by consultations with over 30 board-certified clinicians, RealICU evaluates four core capabilities required of a clinically useful ICU co-pilot.

🩺

Patient Status

Assess the current overall status of the patient based on all available clinical data.

🔍

Acute Problems

Identify acute clinical problems requiring continuous attention and monitoring in future bedside care.

💊

Recommended Actions

Propose short-term treatment recommendations appropriate for the patient's condition.

⚠️

Red Flag Actions

Identify actions that should be avoided for this specific patient because they might cause unsafe outcomes.
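As a concrete illustration, a per-window output covering the four tasks could be represented as a simple schema. The class and field names below are hypothetical, chosen for this sketch; they are not part of the RealICU release.

```python
from dataclasses import dataclass, field

@dataclass
class WindowPrediction:
    """Hypothetical per-window output for the four RealICU tasks."""
    patient_status: str                                           # single categorical assessment
    acute_problems: list[str] = field(default_factory=list)       # ranked problems (scored at top-5)
    recommended_actions: list[str] = field(default_factory=list)  # ranked short-term actions (top-5)
    red_flag_actions: list[str] = field(default_factory=list)     # actions to avoid for this patient

# Example prediction for one 30-minute window (illustrative values)
pred = WindowPrediction(
    patient_status="deteriorating",
    acute_problems=["septic shock", "acute kidney injury"],
    recommended_actions=["start norepinephrine", "obtain blood cultures"],
    red_flag_actions=["aggressive fluid bolus without reassessment"],
)
```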

The RealICU Benchmark

RealICU is built on MIMIC-IV; labels are produced by physicians reviewing the complete patient trajectory in hindsight.

Beyond Behavior Imitation: Hindsight Reviewing

Traditional ICU benchmarks treat recorded clinician actions as ground truth, but these actions reflect decisions made under uncertainty, not necessarily optimal care. RealICU's evaluation is grounded in hindsight review of the full patient trajectory, scoring LLM agents based on clinical correctness rather than behavioral imitation.

RealICU-Gold

930
Annotated Windows
94
ICU Patients

Physician-consensus annotations from five senior ICU physicians reviewing full patient trajectories in hindsight. The gold standard for reliable model evaluation.

RealICU-Scale

11,862
Annotated Windows
13×
Scale over Gold

Large-scale extension using Oracle, a physician-validated LLM-based hindsight evaluator calibrated against expert consensus.

RealICU patient cohort statistics

Summary statistics of the RealICU patient cohort derived from MIMIC-IV, including age, gender, ICU duration, disease categories, and outcome distributions.
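The windowing underlying both datasets, partitioning a timestamped ICU trajectory into fixed 30-minute windows, can be sketched in a few lines. The function name and event format are illustrative; this is not the RealICU release code.

```python
from datetime import datetime, timedelta

def partition_into_windows(events, window_minutes=30):
    """Group timestamped ICU events into fixed-length windows.

    `events` is a list of (timestamp, payload) pairs sorted by time;
    each window spans [start, start + window_minutes).
    """
    if not events:
        return []
    width = timedelta(minutes=window_minutes)
    windows, current = [], []
    bound = events[0][0] + width
    for ts, payload in events:
        while ts >= bound:                    # close window; skip past empty ones too
            windows.append(current)
            current, bound = [], bound + width
        current.append((ts, payload))
    windows.append(current)                   # close the final window
    return windows

t0 = datetime(2024, 1, 1, 8, 0)
events = [(t0, "HR 92"),
          (t0 + timedelta(minutes=10), "MAP 64"),
          (t0 + timedelta(minutes=45), "lactate 3.1")]
wins = partition_into_windows(events)
# → two windows: the first with 2 events, the second with 1
```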

ICU-Evo: Agent Pipeline

We introduce ICU-Evo as an instance of memory-augmented agent frameworks, to study how structured memory design shapes clinical decision-making.

⚙️

Observation Agent

Rule-based

Normalizes measurements, aligns them to the 30-minute window, and extracts vital-sign trend signals.

⚙️

Assessment Agent

LLM

Compresses the trajectory into a running summary and flags critical clinical events.

⚙️

Insight Agent

LLM

Proposes patient-specific hypotheses.

⚙️

Predictor

LLM

Reads the full memory state and produces predictions for all downstream tasks.

Shared Memory
📝 Working 📈 Trend ⚡ Critical Events 📖 Trajectory 🧩 Insight
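One ICU-Evo update step, as described above, could look like the following sketch. All function names, the memory layout, and the `*_llm` callables (stand-ins for backbone LLM calls) are illustrative assumptions, not the authors' implementation.

```python
def normalize_and_trend(window_events):
    """Toy rule-based observation step: normalized values plus placeholder trend labels."""
    normalized = {name: value for name, value in window_events}
    trends = {name: "observed" for name in normalized}
    return {"normalized": normalized, "trends": trends}

def icu_evo_step(window_events, memory, assess_llm, insight_llm, predict_llm):
    """One illustrative ICU-Evo update for a single 30-minute window.

    `memory` holds the five shared memory types from the pipeline figure:
    working, trend, critical_events, trajectory, and insight.
    """
    # Observation agent (rule-based): normalize and extract vital trends
    obs = normalize_and_trend(window_events)
    memory["working"] = obs["normalized"]
    memory["trend"] = obs["trends"]

    # Assessment agent (LLM): compress trajectory, flag critical events
    assessment = assess_llm(memory)
    memory["trajectory"] = assessment["summary"]
    memory["critical_events"].extend(assessment["critical_events"])

    # Insight agent (LLM): patient-specific hypotheses
    memory["insight"] = insight_llm(memory)

    # Predictor (LLM): read the full memory state, answer the four tasks
    return predict_llm(memory)

# Minimal usage with stub LLM calls
memory = {"working": None, "trend": None, "critical_events": [],
          "trajectory": None, "insight": None}
out = icu_evo_step(
    [("HR", 120), ("MAP", 58)],
    memory,
    assess_llm=lambda m: {"summary": "worsening shock", "critical_events": ["hypotension"]},
    insight_llm=lambda m: "possible septic shock",
    predict_llm=lambda m: {"patient_status": "unstable"},
)
```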

Benchmark Results and Failure Modes

Benchmark Results

RealICU-Gold Evaluation Results — Physician-Annotated Gold Standard (930 windows)
                              Patient Status   Acute Problems    Action Recom.     Red Flags
Backbone        System        Acc ↑   F1 ↑     Hit@5 ↑  R@5 ↑    Hit@5 ↑  R@5 ↑    HRR@5 ↓
Gemini-3.1-pro  Full-context  0.298   0.258    0.486    0.308    0.259    0.152    0.137
                Local-window  0.315   0.239    0.459    0.258    0.395    0.260    0.151
                RAG           0.402   0.348    0.596    0.342    0.496    0.313    0.216
                ICU-Evo       0.459   0.365    0.823    0.526    0.676    0.534    0.300
GPT-5.4         Full-context  0.294   0.233    0.510    0.348    0.404    0.300    0.298
                Local-window  0.233   0.184    0.500    0.293    0.380    0.281    0.165
                RAG           0.288   0.256    0.599    0.349    0.480    0.398    0.234
                ICU-Evo       0.312   0.264    0.867    0.570    0.676    0.534    0.473
Qwen3-235B      Full-context  0.225   0.188    0.384    0.226    0.329    0.222    0.117
                Local-window  0.152   0.154    0.213    0.126    0.352    0.242    0.080
                RAG           0.315   0.271    0.379    0.211    0.453    0.324    0.095
                ICU-Evo       0.253   0.197    0.600    0.362    0.526    0.357    0.117

bold = best  ·  underline = second best (per column, per backbone)  ·  HRR@5 ↓ lower is better
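The ranking metrics in the tables can be computed as below. This is a minimal sketch: Hit@5 and R@5 follow their standard definitions, and HRR@5 is interpreted here as the fraction of top-5 recommended actions that are annotated red flags, which is one plausible reading of the harmful-recommendation rate; the exact definitions are those of the paper.

```python
def hit_at_k(preds, gold, k=5):
    """1.0 if any of the top-k predictions is in the gold set, else 0.0."""
    return float(any(p in gold for p in preds[:k]))

def recall_at_k(preds, gold, k=5):
    """Fraction of gold items recovered by the top-k predictions."""
    if not gold:
        return 0.0
    return len(set(preds[:k]) & set(gold)) / len(gold)

def harmful_rec_rate_at_k(preds, red_flags, k=5):
    """Fraction of top-k recommended actions annotated as red flags (lower is better)."""
    top = preds[:k]
    if not top:
        return 0.0
    return sum(p in red_flags for p in top) / len(top)

# Illustrative single-window example
preds = ["fluid bolus", "start norepinephrine", "broad-spectrum antibiotics"]
gold = {"start norepinephrine", "source control"}
red_flags = {"fluid bolus"}
# hit_at_k → 1.0, recall_at_k → 0.5, harmful_rec_rate_at_k → 1/3
```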

RealICU-Scale Evaluation Results — Oracle-Labeled Scale Set (11,862 windows)
                              Patient Status   Acute Problems    Action Recom.     Red Flags
Backbone        System        Acc ↑   F1 ↑     Hit@5 ↑  R@5 ↑    Hit@5 ↑  R@5 ↑    HRR ↓
Gemini-3.1-pro  Full-context  —       —        —        —        —        —        —
                Local-window  0.405   0.264    0.487    0.265    0.447    0.307    0.066
                RAG           0.442   0.312    0.568    0.315    0.466    0.331    0.073
                ICU-Evo       0.519   0.348    0.827    0.518    0.514    0.330    0.087
GPT-5.4         Full-context  —       —        —        —        —        —        —
                Local-window  0.415   0.265    0.475    0.266    0.451    0.308    0.073
                RAG           0.411   0.269    0.584    0.321    0.509    0.435    0.096
                ICU-Evo       0.438   0.327    0.852    0.562    0.575    0.368    0.090
Qwen3-235B      Full-context  0.201   0.116    0.401    0.232    0.455    0.299    0.215
                Local-window  0.175   0.159    0.254    0.142    0.440    0.295    0.207
                RAG           0.367   0.282    0.379    0.207    0.446    0.342    0.225
                ICU-Evo       0.304   0.177    0.649    0.375    0.515    0.327    0.292

bold = best  ·  underline = second best (per column, per backbone)  ·  — = not evaluated  ·  HRR ↓ lower is better

Performance over ICU stay hours — Gemini-3.1-pro

Performance over ICU stay hours — GPT-5.4

Performance over ICU stay hours — Qwen3-235B

Failure Modes

RealICU reveals critical limitations in current frontier LLMs that must be addressed before deployment in clinical settings. These findings underscore that RealICU remains largely unsolved.

⚠️ Recall–Safety Tradeoff

Higher recommendation recall can increase unsafe recommendations by up to 47.3%. Agents that suggest more actions to improve coverage simultaneously propose more potentially harmful interventions, creating a fundamental tension between completeness and safety in clinical recommendation systems.

⚠️ Anchoring Bias

Agents preserve early interpretations of a patient despite later contradictory evidence. As patient trajectories evolve over hours in the ICU, LLMs fail to adequately revise their clinical assessments in response to new laboratory results, vital signs, or medication changes, resulting in stale recommendations that do not reflect the current patient state.

BibTeX

TODO