【Nature medicine】Integrated image-based deep learning and language models for primary diabetes care

Integrated image-based deep learning and language models for primary diabetes care

Abstract
Introduction
- Fig. 1 | Architecture of the DeepDR-LLM system
Results
- Study design and participants
- Fig. 2 | Study design overview for the DeepDR-LLM system evaluation.
- Performance of the LLM module (experiment 2a)
- Fig. 3 | Head-to-head comparison between DeepDR-LLM, nontuned LLaMA, PCP and endocrinology residents in both English and Chinese.
- Multiethnic validation of DeepDR-Transformer (experiment 2b)
- DeepDR-Transformer as an assistive tool (experiment 2c)
- Fig. 4 | Receiver operating characteristic curves showing performance of DeepDR-Transformer alone versus PCPs (when unassisted and assisted by DeepDR-Transformer) in identifying referable DR.
- Prospective real-world study of DeepDR-LLM (experiment 2d)
- Fig. 6 | Envisioning the future of primary diabetes care with the clinical integration of the DeepDR-LLM system.
Discussion
Methods
- Ethical approval
- Data acquisition and diagnosis criteria
- The architecture of the DeepDR-LLM system
- DeepDR-Transformer fine-tuning for the classification and segmentation from standard fundus images
- Transfer learning from standard to portable fundus images
- Evaluation of the LLM module in a retrospective dataset
- Evaluation of the performance of the DeepDR-Transformer on retrospective datasets
- Evaluation of DeepDR-Transformer as an assistive tool in identifying referable DR
- Real-world prospective study
Statistical analysis
Reporting summary
Code availability
Additional information

Abstract

Primary diabetes care and diabetic retinopathy (DR) screening persist as major public health challenges due to a shortage of trained primary care physicians (PCPs), particularly in low-resource settings. Here, to bridge the gaps, we developed an integrated image–language system (DeepDR-LLM), combining a large language model (LLM module) and image-based deep learning (DeepDR-Transformer), to provide individualized diabetes management recommendations to PCPs. In a retrospective evaluation, the LLM module demonstrated comparable performance to PCPs and endocrinology residents when tested in English and outperformed PCPs and had comparable performance to endocrinology residents in Chinese. For identifying referable DR, the average PCP’s accuracy was 81.0% unassisted and 92.3% assisted by DeepDR-Transformer. Furthermore, we performed a single-center real-world prospective study, deploying DeepDR-LLM. We compared diabetes management adherence of patients under the unassisted PCP arm (n = 397) with those under the PCP+DeepDR-LLM arm (n = 372). Patients with newly diagnosed diabetes in the PCP+DeepDR-LLM arm showed better self-management behaviors throughout follow-up (P < 0.05). For patients with referral DR, those in the PCP+DeepDR-LLM arm were more likely to adhere to DR referrals (P < 0.01). Additionally, DeepDR-LLM deployment improved the quality and empathy level of management recommendations. Given its multifaceted performance, DeepDR-LLM holds promise as a digital solution for enhancing primary diabetes care and DR screening.

由于训练有素的初级保健医生（PCP）短缺，尤其是在资源匮乏的环境中，初级糖尿病护理和糖尿病视网膜病变（DR）筛查仍然是主要的公共卫生挑战。在此，为了弥合这一差距，我们开发了一个集成图像和语言的系统（DeepDR-LLM），结合了大型语言模型（LLM模块）和基于图像的深度学习（DeepDR-Transformer），为PCP提供个性化的糖尿病管理建议。在回顾性评估中，LLM模块在英语测试中表现出与PCP和内分泌科住院医师相当的表现，并且在中文测试中表现优于PCP，且与内分泌科住院医师表现相当。在识别可转诊的DR方面，平均PCP的准确率为81.0%，而在DeepDR-Transformer辅助下为92.3%。此外，我们进行了一项单中心的真实世界前瞻性研究，部署了DeepDR-LLM。我们比较了未辅助的PCP组（n=397）与PCP+DeepDR-LLM组（n=372）的患者糖尿病管理依从性。在PCP+DeepDR-LLM组中，新诊断糖尿病患者在随访过程中表现出更好的自我管理行为（P<0.05）。对于转诊DR患者，PCP+DeepDR-LLM组的患者更有可能遵守DR转诊（P<0.01）。此外，DeepDR-LLM的部署提高了管理建议的质量和同理心水平。鉴于其多方面的性能，DeepDR-LLM有望成为增强初级糖尿病护理和DR筛查的数字解决方案。

Introduction

It has been estimated that more than 500 million people had diabetes worldwide in 2021, with 80% living in low- and middle-income countries (LMICs) [1,2]. The escalating prevalence imposes a substantial public health challenge, particularly in these low-resource settings [1,5]. In LMICs, insufficient healthcare resources and a lack of trained primary care physicians (PCPs) remain principal barriers, resulting in widespread underdiagnosis, poor primary diabetes management and inadequate and/or inappropriate referrals to diabetes specialist care [4]. This not only impacts individual health outcomes but also has broader socioeconomic consequences [4,10].

据估计，2021年全球有超过5亿人患有糖尿病，其中80%生活在低收入和中等收入国家（LMICs）。这种日益增加的患病率带来了巨大的公共卫生挑战，尤其是在这些资源匮乏的环境中。在LMICs中，医疗资源不足以及缺乏训练有素的初级保健医生（PCPs）仍然是主要障碍，导致糖尿病普遍未被诊断，初级糖尿病管理不善，转诊至糖尿病专家护理的不足和/或不当。这不仅影响了个人健康结果，还带来了更广泛的社会经济后果。

Diabetic retinopathy (DR) is the most common specific complication of diabetes, affecting 30–40% of individuals with diabetes [11–13], and remains the leading cause of blindness in economically active, working-aged adults [11]. The presence of DR also signifies a heightened risk of other complications elsewhere (for example, kidney, heart and brain) [16]. Thus, regular DR screening has been universally recommended as a key part of primary diabetes care [17]. However, DR screening is often neglected in low-resource settings in LMICs owing to a scarcity of infrastructure, manpower and sustainable cost-effective DR screening programs.

糖尿病视网膜病变（DR）是糖尿病最常见的特定并发症，影响30-40%的糖尿病患者，并且仍然是经济活跃的工作年龄成人失明的主要原因。DR的存在也意味着其他并发症（例如肾脏、心脏和大脑）的风险增加。因此，定期进行DR筛查已被普遍推荐为初级糖尿病护理的关键部分。然而，由于基础设施、人力和可持续的低成本DR筛查计划的稀缺，DR筛查在LMICs中的低资源环境中常被忽视。

Several digital technologies have emerged to address gaps in diabetes care and DR screening, including telemedicine [18–20], artificial intelligence (AI)-assisted glucose monitoring and prediction [21], retinal image-based deep learning (DL) models [22–24] and the development of low-cost and portable retinal cameras [25]. However, these solutions often focus either on enhancing diabetes management or on providing DR screening but rarely integrate both important aspects of diabetes care. These current solutions also require sufficiently trained PCPs who can use the digital tools and who understand diabetes care and the referral guidelines for severe DR cases that require specialist interventions, but there are few trained PCPs in low-resource settings [27].

为了解决糖尿病护理和DR筛查中的差距，几种数字技术应运而生，包括远程医疗、人工智能（AI）辅助的血糖监测和预测、基于视网膜图像的深度学习（DL）模型以及低成本便携式视网膜相机的发展。然而，这些解决方案通常只关注改善糖尿病管理或提供DR筛查，但很少将糖尿病护理的这两个重要方面整合在一起。这些当前的解决方案还需要具备使用这些数字工具的充分训练的PCPs，理解糖尿病护理并为需要专家干预的严重DR病例提供转诊指南，但在资源匮乏的环境中训练有素的PCPs很少。

Recently, large language models (LLMs) [28–31], achieving natural language understanding and generation, have been developing rapidly and show promise in enhancing healthcare service delivery. LLMs have the potential to optimize patient monitoring, personalization of treatment plans, and patient education, potentially resulting in improved outcomes for patients with diabetes [32–34] and retinal diseases [35]. However, while they perform well in answering some general medical queries [31], current LLMs fall short in providing reliable and detailed management recommendations for major specific diseases [31], such as diabetes.

最近，具备自然语言理解和生成能力的大型语言模型（LLMs）迅速发展，并在增强医疗服务提供方面显示出前景。LLMs有潜力优化患者监测、个性化治疗计划以及患者教育，可能会改善糖尿病患者和视网膜疾病患者的结果。然而，虽然它们在回答一些常规医学问题上表现良好，但目前的LLMs在为糖尿病等主要特定疾病提供可靠且详细的管理建议方面仍然不足。

To address these interrelated gaps in diabetes care, we developed an innovative image–language system, DeepDR-LLM, which integrates an LLM module with an image-based DL module to offer a comprehensive approach for primary diabetes care and DR screening. Our system is tailored for PCPs, particularly those working in high-volume and low-resource settings. The DeepDR-LLM system comprises two core components: an LLM module and an image-based DL module, referred to as DeepDR-Transformer (Fig. 1). Our evaluation of DeepDR-LLM’s performance relied on four experiments outlined in Fig. 2a–d. First, we developed the LLM module by fine-tuning LLaMA [38], an open-source LLM, on 371,763 real-world management recommendations from 267,730 participants. We then performed a head-to-head comparative analysis, where we examined the system’s LLM module’s proficiency in providing evidence-based diabetes management recommendations against that of LLaMA, PCPs and in-training specialists (endocrinology residents), with assessments conducted in both English and Chinese languages (Fig. 2a). Second, we trained and tested the performance of DeepDR-Transformer for referable DR detection, using multiethnic, multicountry datasets comprising 1,085,295 standard (table-top) and 161,840 portable (mobile) retinal images (Fig. 2b). Third, we evaluated the impact of DeepDR-Transformer in assisting PCPs and professional graders to identify referable DR (Fig. 2c). Finally, we conducted a two-arm, real-world prospective study to determine the impact of the DeepDR-LLM system when integrated into clinical workflow in the primary care setting. Over a 4-week period, we monitored and compared the adherence to diabetes management recommendations between patients under the care of unassisted PCPs and those under the care of PCPs assisted by DeepDR-LLM (Fig. 2d). Collectively, our work offers a digital solution for primary diabetes care combining DR screening and referral, particularly useful in high-volume, low-resource settings in LMICs.

为了解决糖尿病护理中的这些相互关联的差距，我们开发了一种创新的图像-语言系统——DeepDR-LLM，该系统将LLM模块与基于图像的DL模块结合起来，为初级糖尿病护理和DR筛查提供全面的方法。我们的系统专为PCPs设计，特别是那些在高工作量和低资源环境中工作的PCPs。DeepDR-LLM系统包含两个核心组件：LLM模块和基于图像的DL模块，称为DeepDR-Transformer。我们对DeepDR-LLM性能的评估依赖于四项实验。首先，我们通过微调LLaMA（一种使用来自267,730名参与者的371,763条真实世界管理建议的开源LLM）开发了LLM模块。然后，我们进行了正面对比分析，比较了系统的LLM模块提供基于证据的糖尿病管理建议的能力与LLaMA、PCPs和在职培训的专家（内分泌科住院医师）的表现，评估分别在英语和中文环境中进行。其次，我们训练并测试了DeepDR-Transformer在可转诊DR检测中的表现，使用了由1,085,295张标准（台式）和161,840张便携式（移动）视网膜图像组成的多种族、多国家数据集。第三，我们评估了DeepDR-Transformer在帮助PCPs和专业分级员识别可转诊DR方面的影响。最后，我们进行了一项两组的真实世界前瞻性研究，旨在确定将DeepDR-LLM系统集成到初级护理环境中的临床工作流程中的影响。在为期4周的时间里，我们监测并比较了未辅助PCPs和使用DeepDR-LLM辅助的PCPs所照顾患者的糖尿病管理建议依从性。总体而言，我们的工作提供了一种结合DR筛查和转诊的数字解决方案，特别适用于LMICs中高工作量、低资源环境中的初级糖尿病护理。

Fig. 1 | Architecture of the DeepDR-LLM system

Fig. 1 | Architecture of the DeepDR-LLM system.
The DeepDR-LLM system consists of two modules: (1) module I (LLM module), which provides individualized management recommendations for patients with diabetes; (2) module II (DeepDR-Transformer module), which performs image quality assessment, DR lesion segmentation and DR/DME grading from standard or portable fundus images. There are two modes of integrating module I and module II in the DeepDR-LLM system. In the physician-involved integration mode, the outputs of module II (that is, fundus image gradability; the lesion segmentation of microaneurysm, cotton-wool spot, hard exudate and hemorrhage; DR grade; and DME grade) could assist physicians in generating DR/DME diagnosis results (that is, fundus image gradability, DR grade, DME grade and the presence of lesions). In the automated integration mode, the DR/DME diagnosis results include fundus image gradability, DR grade, DME grade classified by module II, and the presence of lesions segmented out by module II. These DR/DME diagnosis results and other clinical metadata will be fed into module I to generate individualized management recommendations for people with diabetes.

图1 | DeepDR-LLM系统架构。
DeepDR-LLM系统由两个模块组成：（1）模块I（LLM模块），为糖尿病患者提供个性化的管理建议；（2）模块II（DeepDR-Transformer模块），对标准或便携式眼底图像进行图像质量评估、DR病变分割及DR/DME分级。在DeepDR-LLM系统中，模块I和模块II有两种集成模式。在医生参与的集成模式下，模块II的输出（即眼底图像可分级性；微动脉瘤、棉絮斑、硬性渗出和出血的病变分割；DR分级；以及DME分级）可以帮助医生生成DR/DME诊断结果（包括眼底图像可分级性、DR分级、DME分级以及病变的存在情况）。在自动化集成模式下，DR/DME诊断结果包括模块II分类的眼底图像可分级性、DR分级、DME分级，以及由模块II分割出的病变。这些DR/DME诊断结果和其他临床元数据将被输入到模块I中，以生成糖尿病患者的个性化管理建议。
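The two integration modes described in the Fig. 1 caption can be sketched as a small data flow. This is only an illustration under stated assumptions: the class `FundusFindings`, the functions `build_llm_input` and `integrate`, and the plain-text input format are all hypothetical names and choices, not the authors' implementation. The caption does not say how the diagnosis results are actually encoded for module I.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FundusFindings:
    """Module II outputs named in the Fig. 1 caption (field names are hypothetical)."""
    gradable: bool
    dr_grade: str
    dme_grade: str
    lesions: List[str] = field(default_factory=list)


def build_llm_input(findings: FundusFindings, metadata: Dict[str, str]) -> str:
    """Merge DR/DME diagnosis results with clinical metadata into one text input."""
    lesion_text = ", ".join(findings.lesions) if findings.lesions else "none detected"
    lines = [
        f"Fundus image gradable: {'yes' if findings.gradable else 'no'}",
        f"DR grade: {findings.dr_grade}",
        f"DME grade: {findings.dme_grade}",
        f"Lesions present: {lesion_text}",
    ]
    lines += [f"{key}: {value}" for key, value in metadata.items()]
    return "\n".join(lines)


def integrate(
    module2_output: FundusFindings,
    metadata: Dict[str, str],
    physician_review: Optional[Callable[[FundusFindings], FundusFindings]] = None,
) -> str:
    """Automated mode when physician_review is None; physician-involved mode otherwise."""
    findings = physician_review(module2_output) if physician_review else module2_output
    return build_llm_input(findings, metadata)
```

In the automated mode the module II output passes through unchanged. In the physician-involved mode a reviewer callback may confirm or amend the findings before they reach module I.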

Results

Study design and participants

The DeepDR-LLM system consists of two modules: (1) module I (the LLM module), which provides individualized management recommendations for patients with diabetes; (2) module II (the DeepDR-Transformer module), which performs image quality assessment, lesion segmentation and DR grading from standard or portable fundus images for each patient. The outputs of module II (results of real-time DR screening) can also be used as inputs for the LLM module (module I). Extended Data Fig. 1 depicts a schematic overview of the DeepDR-LLM system.

DeepDR-LLM系统由两个模块组成：（1）模块I（LLM模块），为糖尿病患者提供个性化管理建议；（2）模块II（DeepDR-Transformer模块），对每位患者的标准或便携式眼底图像进行图像质量评估、病变分割和DR分级。模块II的输出（实时DR筛查结果）也可以作为模块I（LLM模块）的输入。扩展数据图1展示了DeepDR-LLM系统的示意图。

The LLM module was retrospectively evaluated in head-to-head comparisons against the nontuned LLaMA, PCPs and endocrinology residents, in both English and Chinese languages.

LLM模块通过与未调优的LLaMA、PCPs和内分泌科住院医生在中英文语言环境下的对比评估进行回顾性评估。

The DeepDR-Transformer module was developed and validated in 14 datasets across 5 countries (China, Singapore, India, Thailand and the UK) with standard fundus images, and 7 datasets across 3 countries (China, Algeria and Uzbekistan) with portable fundus images. The characteristics of the datasets are summarized in Supplementary Tables 1 and 2.

DeepDR-Transformer模块在5个国家（中国、新加坡、印度、泰国和英国）的14个标准眼底图像数据集和3个国家（中国、阿尔及利亚和乌兹别克斯坦）的7个便携式眼底图像数据集上开发并验证。数据集的特征总结在补充表1和表2中。

Fig. 2 | Study design overview for the DeepDR-LLM system evaluation.

Fig. 2 | Study design overview for the DeepDR-LLM system evaluation.
(a) Head-to-head comparative assessment of diabetes management recommendations generated by DeepDR-LLM, nontuned LLaMA, PCPs, and endocrinology residents, using 100 cases randomly selected from CNDCS.
(b) Efficacy analysis of the DeepDR-Transformer module on multiethnic datasets of standard and portable fundus images.
(c) Utility evaluation of the DeepDR-Transformer module as an assistive tool for PCPs and professional graders in the detection of referable DR.
(d) Study design of a two-arm, real-world, prospective study to evaluate the impact of DeepDR-LLM on patients’ self-management behavior. In the outcome analysis, for substudy I, 253 participants in the unassisted PCP arm and 234 participants in the PCP+DeepDR-LLM arm were included; for substudy II, 154 participants in the unassisted PCP arm and 144 participants in the PCP+DeepDR-LLM arm were included.

图2 | DeepDR-LLM系统评估的研究设计概述。
(a) 对比评估DeepDR-LLM、未调优的LLaMA、PCPs和内分泌科住院医生生成的糖尿病管理建议，使用从CNDCS随机选择的100例病例。
(b) 对DeepDR-Transformer模块在多种族数据集的标准和便携式眼底图像上的有效性进行分析。
(c) 评估DeepDR-Transformer模块作为辅助工具帮助PCPs和专业分级员检测可转诊DR的实用性。
(d) 设计一项两组的真实世界前瞻性研究，评估DeepDR-LLM对患者自我管理行为的影响。在结果分析中，子研究I中包含了未辅助PCP组的253名参与者和PCP+DeepDR-LLM组的234名参与者；子研究II中包含了未辅助PCP组的154名参与者和PCP+DeepDR-LLM组的144名参与者。

Performance of the LLM module (experiment 2a)

To evaluate the DeepDR-LLM system’s proficiency in providing diabetes management recommendations in both English and Chinese languages, we compared DeepDR-LLM against LLaMA, PCPs and endocrinology residents on the basis of 100 cases randomly selected from the China National Diabetic Complications Study (CNDCS) (Supplementary Table 3 and Extended Data Fig. 2). The recommendations were evaluated on the basis of three axes, namely the extent of inappropriate content, extent of missing content and likelihood of possible harm (Supplementary Table 4).

为了评估DeepDR-LLM系统在提供中英文糖尿病管理建议方面的能力，我们将DeepDR-LLM与LLaMA、PCPs和内分泌科住院医生进行了比较，基于从中国国家糖尿病并发症研究（CNDCS）中随机选择的100例病例（补充表3和扩展数据图2）。管理建议根据三个方面进行评估：不适当内容的程度、缺失内容的程度和可能危害的可能性（补充表4）。

Figure 3a reports evaluations of diabetes management recommendations generated in four different ways (DeepDR-LLM, LLaMA, PCP and resident) summarized into three different domains (inappropriate content, missing content and likelihood of possible harm) in both English and Chinese languages. In English, 71% of DeepDR-LLM recommendations were judged to have no inappropriate content, higher than LLaMA (51%), but comparable to the PCP (71%). In addition, 36% of DeepDR-LLM recommendations were judged not to have missing content (PCP: 27%). Lastly, 57% of DeepDR-LLM recommendations were rated as ‘low likelihood’ for possible harm, comparable to 55% in PCP. In Chinese, 77% of DeepDR-LLM recommendations were judged to have no inappropriate content, higher than LLaMA (66%) and PCP (54%). Additionally, 63% of DeepDR-LLM recommendations were judged not to have missing content, compared to 46% in PCP. Eighty-eight percent of DeepDR-LLM recommendations were rated as ‘low likelihood’ for possible harm, compared to 60% in PCP.

图3a 显示了四种不同方式生成的糖尿病管理建议的评估结果（DeepDR-LLM、LLaMA、PCP和住院医生），并总结为三个领域（不适当内容、缺失内容和可能的危害可能性），分别在英文和中文语言环境下进行评估。在英文中，71%的DeepDR-LLM建议被认为没有不适当内容，高于LLaMA的51%，但与PCP的71%相当。此外，36%的DeepDR-LLM建议被认为没有缺失内容（PCP为27%）。最后，57%的DeepDR-LLM建议被评为“低危害可能性”，与PCP的55%相当。在中文中，77%的DeepDR-LLM建议被认为没有不适当内容，高于LLaMA的66%和PCP的54%。此外，63%的DeepDR-LLM建议被认为没有缺失内容，而PCP为46%。88%的DeepDR-LLM建议被评为“低危害可能性”，而PCP为60%。

Figure 3b shows the total scores (defined as the sum of domain-specific scores) of the management recommendations generated in four different ways. In English, management recommendations given by DeepDR-LLM were significantly better than those given by LLaMA (P < 0.001) and comparable to the PCP and endocrinology resident. In Chinese, management recommendations given by DeepDR-LLM were significantly better than those by LLaMA (P < 0.001) and PCP (P = 0.010) but comparable to the endocrinology resident.

图3b 显示了四种不同方式生成的管理建议的总分（定义为领域特定分数的总和）。在英文中，DeepDR-LLM给出的管理建议显著优于LLaMA给出的建议（P<0.001），并且与PCP和内分泌科住院医生的建议相当。在中文中，DeepDR-LLM给出的管理建议显著优于LLaMA（P<0.001）和PCP（P=0.010）的建议，但与内分泌科住院医生的建议相当。

Fig. 3 | Head-to-head comparison between DeepDR-LLM, nontuned LLaMA, PCP and endocrinology residents in both English and Chinese.

图3 | DeepDR-LLM、未微调的LLaMA、PCP和内分泌科住院医生在英文和中文中的头对头比较。

(a) Evaluators were invited to rate management recommendations for patients with diabetes based on three domains: the extent of inappropriate content, the extent of missing content, and the likelihood of possible harm, using 100 cases randomly selected from the China National Diabetic Complications Study (CNDCS). The box plot ( $n\!=\!100$ ) shows the median and quartiles with whiskers representing the data range. The comparison was performed using two-sided Friedman tests. Post-hoc pairwise comparisons were conducted using two-sided Wilcoxon signed-rank tests, and $P$ values for multiple comparisons were adjusted using the Bonferroni method.
(a) 评估人员受邀根据三个领域（不适当内容的程度、缺失内容的程度和可能的危害可能性），对糖尿病患者的管理建议进行评分，使用了从中国国家糖尿病并发症研究（CNDCS）中随机选择的100个病例。箱线图 ( $n\!=\!100$ ) 显示了中位数和四分位数，须状线表示数据范围。比较是通过双侧Friedman测试进行的。事后成对比较使用了双侧Wilcoxon符号秩检验， $P$ 值使用Bonferroni方法进行了多重比较的调整。

(b) The total scores of management recommendations generated by LLaMA, DeepDR-LLM, PCPs, and endocrinology residents, using 100 cases randomly selected from CNDCS. The box plot ( $n\!=\!100$ ) displays the median and quartiles with whiskers representing the data range. The comparison was conducted using two-sided Friedman tests, and post-hoc pairwise comparisons were performed using two-sided Wilcoxon signed-rank tests. $P$ values for multiple comparisons were adjusted using the Bonferroni method. $^{**}\!P\!=\!0.010$ , $^{***}P\! <\!0.001$ .
(b) 使用从CNDCS随机选择的100个病例，比较了由LLaMA、DeepDR-LLM、PCP和内分泌科住院医生生成的管理建议的总分。箱线图 ( $n\!=\!100$ ) 显示了中位数和四分位数，须状线表示数据范围。比较是通过双侧Friedman测试进行的，事后成对比较使用了双侧Wilcoxon符号秩检验， $P$ 值使用Bonferroni方法进行了多重比较的调整。 $^{**}\!P\!=\!0.010$ ， $^{***}\!P\! <\!0.001$ 。
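The Bonferroni step mentioned in the Fig. 3 caption can be illustrated with a few lines of Python. This is a minimal sketch, not the authors' analysis code. The function names are hypothetical. It assumes every pairwise combination of the four recommendation sources is compared, which the caption does not spell out. The Friedman and Wilcoxon tests themselves are not implemented here.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def pairwise_comparisons(groups: List[str]) -> List[Pair]:
    """Enumerate every unordered pair of groups for post-hoc testing."""
    return list(combinations(groups, 2))


def bonferroni_adjust(raw_p: Dict[Pair, float]) -> Dict[Pair, float]:
    """Multiply each raw P value by the number of comparisons, capped at 1."""
    m = len(raw_p)
    return {pair: min(1.0, p * m) for pair, p in raw_p.items()}


# The four sources compared in Fig. 3 give six possible pairs.
sources = ["DeepDR-LLM", "LLaMA", "PCP", "Resident"]
pairs = pairwise_comparisons(sources)
```

With four sources there are six pairs, so under this assumption a raw P value would need to fall below 0.05 / 6 to remain below 0.05 after adjustment.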

Multiethnic validation of DeepDR-Transformer (experiment 2b)

DeepDR-Transformer的多族裔验证（实验2b）

The DeepDR-Transformer module was retrospectively developed and validated in 14 datasets with standard fundus images and 7 datasets with portable fundus images. The characteristics of datasets used in the performance evaluation of DeepDR-Transformer are summarized in Supplementary Tables 1 and 2.

DeepDR-Transformer模块在14个标准眼底图像数据集和7个便携式眼底图像数据集上进行了回顾性开发和验证。用于DeepDR-Transformer性能评估的数据集特点总结在补充表1和2中。

Supplementary Tables 5 and 6 summarize the performances of DeepDR-Transformer in image quality assessment and lesion segmentation. For DR grading, we assessed the performance of the DeepDR-Transformer model in detecting early-to-late stages of DR (multiclass) from standard fundus images and referable DR from portable fundus images (Supplementary Table 7). In standard fundus images, the DeepDR-Transformer model showed excellent performance in identifying referable DR, with areas under the receiver operating characteristic curve (AUCs) ranging from 0.892 to 0.933 across 12 external test sets. In portable fundus images, the model showed AUCs ranging from 0.896 to 0.920 across six external test sets.

补充表5和6总结了DeepDR-Transformer在图像质量评估和病变分割中的表现。在DR分级方面，我们评估了DeepDR-Transformer模型在标准眼底图像中检测早期到晚期DR（多类别）和在便携式眼底图像中检测可转诊DR（补充表7）的性能。在标准眼底图像中，DeepDR-Transformer模型在识别可转诊DR方面表现出色，12个外部测试集中的接收器操作特征曲线下面积（AUCs）范围为0.892到0.933。在便携式眼底图像中，该模型在六个外部测试集中的AUCs范围为0.896到0.920。
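For readers unfamiliar with the AUC values quoted above, the metric can be computed directly from its rank interpretation: the probability that a randomly chosen referable eye receives a higher model score than a randomly chosen nonreferable eye. The sketch below is a generic illustration with made-up scores. The function name `roc_auc` is hypothetical and nothing here reproduces the paper's evaluation pipeline.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = referable DR, 0 = nonreferable; scores are invented.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the toy example three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75. The pairwise loop is quadratic and is meant for clarity rather than for datasets of the size used in the study.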

DeepDR-Transformer as an assistive tool (experiment 2c)

DeepDR-Transformer作为辅助工具（实验2c）

To evaluate DeepDR-Transformer as an assistive tool for PCPs and professional nonphysician graders (these graders are now used in many DR screening programs, such as those in the UK, Singapore and Vietnam, in place of PCPs) in identifying referable DR, we assessed both the accuracy and time efficiency of the grading processes with and without the assistance of the DeepDR-Transformer module (Fig. 4, Extended Data Tables 1–3 and Supplementary Fig. 1). Based on standard fundus images graded by PCPs in the urban area (Fig. 4a and Extended Data Table 1), we observed a sensitivity range of 37.2–81.6% for unassisted PCPs, which subsequently increased to 78.0–98.4% with DeepDR-Transformer assistance. Similarly, specificity improved from the original range of 84.4–94.8% (unassisted) to 90.4–98.8% when assisted with DeepDR-Transformer. Moreover, with the assistance of DeepDR-Transformer, the median time taken for assessment was reduced from 14.66 s (interquartile range (IQR) 14.09–15.57) per eye to 11.31 s (IQR 10.82–11.84) (P < 0.001), indicating a significant enhancement in both the accuracy and efficiency of DR grading.

为了评估DeepDR-Transformer作为PCP和专业非医生评分员（这些评分员现在在许多DR筛查项目中使用，例如英国、新加坡和越南，代替PCP）识别可转诊DR的辅助工具，我们评估了在有无DeepDR-Transformer模块辅助下的评分过程的准确性和时间效率（图4、扩展数据表1–3和补充图1）。根据在城市地区由PCP评分的标准眼底图像（图4a和扩展数据表1），我们观察到在没有辅助的情况下PCP的敏感性范围为37.2–81.6%，而在DeepDR-Transformer辅助下增加到78.0–98.4%。类似地，特异性从原始范围84.4–94.8%（无辅助）提高到90.4–98.8%（有DeepDR-Transformer辅助）。此外，在DeepDR-Transformer的帮助下，评估的中位时间从每只眼14.66秒（四分位数范围14.09–15.57）减少到11.31秒（四分位数范围10.82–11.84）（P<0.001），表明DR评分的准确性和效率都有显著提高。
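The sensitivity and specificity ranges above can be tied back to eye counts. The sketch below assumes that each PCP graded the full urban standard-image set of 250 referable and 250 nonreferable eyes (the composition given in the Fig. 4a caption). The counts in the example are back-calculated from the reported percentages under that assumption and are not taken from the paper's tables. Function names are hypothetical.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of referable eyes that the grader correctly flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of nonreferable eyes that the grader correctly clears."""
    return true_neg / (true_neg + false_pos)


def accuracy(true_pos: int, true_neg: int, total: int) -> float:
    """Share of all eyes graded correctly."""
    return (true_pos + true_neg) / total


REFERABLE = 250      # from the Fig. 4a caption
NONREFERABLE = 250   # from the Fig. 4a caption

# Lower end of the unassisted range: 93 of 250 referable eyes detected (assumed count).
low_unassisted = sensitivity(true_pos=93, false_neg=REFERABLE - 93)
# Upper end of the assisted range: 246 of 250 referable eyes detected (assumed count).
high_assisted = sensitivity(true_pos=246, false_neg=REFERABLE - 246)
# Lower end of the unassisted specificity range: 211 of 250 cleared (assumed count).
low_spec = specificity(true_neg=211, false_pos=NONREFERABLE - 211)
```

Under this assumption, 37.2% sensitivity means that 157 of 250 referable eyes would have been missed by the least sensitive unassisted reader, against 4 of 250 at the top of the assisted range.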

Fig. 4 | Receiver operating characteristic curves showing performance of DeepDR-Transformer alone versus PCPs (when unassisted and assisted by DeepDR-Transformer) in identifying referable DR.

图4 | 接收器操作特征曲线显示DeepDR-Transformer单独与PCP（在未辅助和得到DeepDR-Transformer辅助时）在识别可转诊DR中的表现。

(a) Standard fundus images (500 eyes: 250 nonreferable eyes and 250 referable eyes) graded by PCPs in the urban area.
(a) 城市地区由PCP评分的标准眼底图像（500只眼睛：250只非可转诊眼睛和250只可转诊眼睛）。

(b) Portable fundus images (500 eyes: 250 nonreferable eyes and 250 referable eyes) graded by PCPs in the urban area.
(b) 城市地区由PCP评分的便携式眼底图像（500只眼睛：250只非可转诊眼睛和250只可转诊眼睛）。

(d) Portable fundus images (500 eyes: 250 nonreferable eyes and 250 referable eyes) graded by PCPs in the rural area.
(d) 农村地区由PCP评分的便携式眼底图像（500只眼睛：250只非可转诊眼睛和250只可转诊眼睛）。

Prospective real-world study of DeepDR-LLM (experiment 2d)

DeepDR-LLM的前瞻性现实世界研究（实验2d）

To evaluate the impact of implementing the integrated DeepDR-LLM system (combining both the LLM and DeepDR-Transformer modules) on diabetes self-management behaviors, we carried out a proof-of-concept, two-arm, prospective study in a real-world setting. Extended Data Fig. 3 shows the study design of this real-world prospective study (showing numbers of participants included in the outcome analysis). Participants were allocated to two groups: one receiving management recommendations from PCPs without the assistance of DeepDR-LLM (referred to as the unassisted PCP arm) and the other receiving augmented input where PCPs’ recommendations were enhanced with insights from DeepDR-LLM (referred to as the PCP+DeepDR-LLM arm). Comparisons of baseline characteristics of included participants in two substudies between the two arms are presented in Extended Data Table 4.

为了评估实施集成DeepDR-LLM系统（结合LLM和DeepDR-Transformer模块）对糖尿病自我管理行为的影响，我们在现实世界环境中进行了一个概念验证的两臂前瞻性研究。扩展数据图3展示了这项现实世界前瞻性研究的研究设计（显示了纳入结果分析的参与者数量）。参与者被分配到两个组：一个组接受来自PCP的管理建议，未获得DeepDR-LLM的辅助（称为未辅助PCP组），另一个组获得增强输入，PCP的建议通过DeepDR-LLM的见解得到增强（称为PCP+DeepDR-LLM组）。在两个亚研究中纳入的参与者基线特征的比较见于扩展数据表4。

Patients with newly diagnosed diabetes at baseline were followed up at 2 weeks and 4 weeks to evaluate their self-management practices. Patients in the PCP+DeepDR-LLM arm showed better self-management of diabetes in several aspects at the 2-week follow-up, including decreased consumption of refined grains and alcohol, increased consumption of whole grains and fresh vegetables, increased physical activity, and adherence to drug therapy (all P < 0.05, after adjusting for age, sex, and baseline HbA1c level; Extended Data Table 5). At the 4-week follow-up, participants in the PCP+DeepDR-LLM arm maintained better self-management of diabetes and exhibited increased consumption of fresh fruits, decreased consumption of starchy vegetables, more frequent blood glucose monitoring, and better adherence to antidiabetic medication, compared to those in the unassisted PCP arm (all P < 0.05, after adjusting for age, sex, and baseline HbA1c level).

对于在基线时被诊断为新发糖尿病的患者，他们在2周和4周后进行随访，以评估他们的自我管理实践。在2周随访中，PCP+DeepDR-LLM组的患者在多个方面表现出更好的糖尿病自我管理，包括减少精制谷物和酒精的摄入、增加全谷物和新鲜蔬菜的摄入、增加身体活动以及遵循药物治疗（所有P<0.05，调整年龄、性别和基线HbA1c水平后；扩展数据表5）。在4周随访中，PCP+DeepDR-LLM组的参与者保持了更好的糖尿病自我管理，并表现出增加新鲜水果摄入、减少淀粉类蔬菜摄入、更加频繁的血糖监测以及更好地遵循抗糖尿病药物治疗的行为，与未辅助PCP组相比（所有P<0.05，调整年龄、性别和基线HbA1c水平后）。

For patients diagnosed with referable DR at baseline visit, the 2-week follow-up revealed a significantly positive trend. Those patients in the PCP+DeepDR-LLM arm were more likely to follow through with their referral and consult an ophthalmologist within 2 weeks (77.78% versus 58.44%; P=0.001, as indicated in Extended Data Table 5). Furthermore, patients in the PCP+DeepDR-LLM arm scheduled their post-referral ophthalmologist appointments significantly sooner than those in the unassisted PCP arm (4 (IQR 3–5) days versus 7 (IQR 6–8) days; P<0.001). These findings underscore the positive influence of the integrated DeepDR-LLM system in fostering more proactive self-management actions.

对于在基线访问时被诊断为可转诊DR的患者，2周的随访显示了显著的积极趋势。那些在PCP+DeepDR-LLM组的患者更可能继续进行转诊并在2周内咨询眼科医生（77.78%对58.44%；P=0.001，如扩展数据表5所示）。此外，PCP+DeepDR-LLM组的患者安排他们的转诊后眼科医生预约的时间明显早于未辅助PCP组（4天（IQR 3–5）对7天（IQR 6–8）；P<0.001）。这些发现强调了集成DeepDR-LLM系统在促进更积极的自我管理行为方面的积极影响。
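The referral-adherence percentages can be checked against the substudy II group sizes given in the Fig. 2d caption (144 participants in the PCP+DeepDR-LLM arm and 154 in the unassisted PCP arm). The sketch below assumes the underlying counts were 112 of 144 and 90 of 154, which reproduce the reported 77.78% and 58.44%. These counts are back-calculated, not quoted from the paper. The unadjusted Pearson chi-square test shown is a generic illustration: the text does not state which test produced P = 0.001, so the P value computed here should not be expected to match it.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Assumed counts: adhered / did not adhere to the ophthalmology referral.
assisted_yes, assisted_no = 112, 144 - 112      # PCP+DeepDR-LLM arm
unassisted_yes, unassisted_no = 90, 154 - 90    # unassisted PCP arm

assisted_rate = 100 * assisted_yes / (assisted_yes + assisted_no)
unassisted_rate = 100 * unassisted_yes / (unassisted_yes + unassisted_no)
chi2, p_value = chi_square_2x2(assisted_yes, assisted_no, unassisted_yes, unassisted_no)
```

The rates computed from the assumed counts round to the percentages reported in the text, which supports reading them as proportions of the substudy II participants in each arm.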

In addition, we carried out a post-deployment evaluation to assess the quality and level of empathy of the management recommendations provided by the DeepDR-LLM system alone, PCP alone, and PCP+DeepDR-LLM (Extended Data Fig. 4). This evaluation involved three consultant-level endocrinologists and 372 patients. Across the 372 cases evaluated by the three endocrinologists, PCP+DeepDR-LLM’s recommendations were the most preferred of the three versions (56.36%; Fig. 5a). In total, 68.37% of PCP+DeepDR-LLM’s recommendations were rated as either ‘good’ or ‘very good’ quality, and 71.06% of recommendations were deemed ‘empathetic’ or ‘very empathetic’ (Fig. 5b). From the patients’ perspective, the majority (238/372, 63.98%) also favored PCP+DeepDR-LLM’s recommendations over the other two versions (Fig. 5a). Similarly, 69.35% of PCP+DeepDR-LLM’s recommendations were rated by the surveyed patients as either ‘good’ or ‘very good’ quality, and 73.92% of recommendations were deemed ‘empathetic’ or ‘very empathetic’ (Fig. 5c).

此外，我们进行了部署后评估，以评估DeepDR-LLM系统单独、PCP单独以及PCP+DeepDR-LLM提供的质量和同理心水平（扩展数据图4）。这项评估涉及三位顾问级内分泌专家和372名患者。在这372个案例中，三位内分泌专家评价的三种版本的管理建议中，PCP+DeepDR-LLM的建议最受欢迎（56.36% 图5a）。总的来说，68.37%的PCP+DeepDR-LLM建议被评为‘良好’或‘非常好’，71.06%的建议被认为是‘有同情心’或‘非常有同情心’（图5b）。从患者的角度来看，大多数（238/372，63.98%）也更喜欢PCP+DeepDR-LLM的建议，而不是其他两个版本（图5a）。类似地，69.35%的PCP+DeepDR-LLM的建议被调查的患者评为‘良好’或‘非常好’，73.92%的建议被认为是‘有同情心’或‘非常有同情心’（图5c）。

Finally, to capture the PCPs’ perceptions of and satisfaction with the DeepDR-LLM system after using its insights, the 12 PCPs who participated in the PCP+DeepDR-LLM arm of the real-world prospective study were also asked to complete a user satisfaction questionnaire. This questionnaire was completed within 2 weeks after the study closure (Extended Data Table 6). Across the 12 PCPs, the DeepDR-LLM system obtained an average score of 4.42 for being understandable (out of 5.00), 4.33 for time-saving, 4.17 for effectiveness, and 4.17 for being safe in clinical practice. It also obtained an overall satisfaction score of 4.50.

最后，为了了解PCP在使用DeepDR-LLM系统见解后的感知和满意度，参与现实世界前瞻性研究的12名PCP还被要求填写用户满意度问卷。该问卷在研究结束后2周内完成（扩展数据表6）。在这12名PCP中，DeepDR-LLM系统在易于理解方面获得了4.42分（满分5.00），在节省时间方面获得了4.33分，在有效性方面获得了4.17分，在临床实践中的安全性方面获得了4.17分。它还获得了4.50的总体满意度评分。

Fig. 6 | Envisioning the future of primary diabetes care with the clinical integration of the DeepDR-LLM system.

图6 | 设想DeepDR-LLM系统融入临床后初级糖尿病护理的未来。

First, patients with diabetes undergo comprehensive evaluations that include medical history taking that can be augmented by automated voice-to-text technology, physical examinations, laboratory assessments and fundus imaging. Following this, the DeepDR-LLM system processes the accumulated clinical data to concurrently deliver DR screening results and tailored management recommendations for PCPs. Subsequently, augmented with these AI-derived insights, PCPs then offer treatment guidance and health education to patients, either in person or through teleconsultation services.

首先，糖尿病患者接受全面的评估，包括病史采集，可以通过自动语音转文本技术、体检、实验室评估和眼底成像来增强。在此之后，DeepDR LLM系统处理累积的临床数据，同时为PCP提供DR筛查结果和量身定制的管理建议。随后，在这些人工智能衍生的见解的基础上，初级保健医生为患者提供治疗指导和健康教育，无论是面对面还是通过远程咨询服务。

Discussion

Primary diabetes care that is accessible, timely and appropriate persists as a major public health challenge due to insufficient healthcare infrastructure and a lack of trained PCPs, particularly in low-resource settings in many LMICs. Adding to this complexity in primary diabetes care is the need to manage diabetes complications, such as DR, the most specific complication, with its presence often signaling other complications in major organ systems (for example, kidney, heart and brain). While DR screening has been widely recommended by international guidelines, such programs are lacking in low-resource settings due to the scarcity of infrastructure and a lack of trained PCPs who can administer and manage such programs. To address these gaps, we developed an integrated image–language system (DeepDR-LLM) combining an LLM module and a DL module (DeepDR-Transformer), with an aim to provide tailored personalized diabetes management recommendations and real-time fully automated DR screening and referral recommendations to aid the PCPs working in primary diabetes care.

基础糖尿病护理的可及性、及时性和适宜性仍然是一个主要的公共卫生挑战，这主要是由于医疗基础设施不足和训练有素的初级保健医生 (PCPs) 缺乏，尤其是在许多低资源国家和地区。基础糖尿病护理的复杂性还在于需要管理糖尿病并发症，如糖尿病视网膜病变（DR），这是最具特异性的并发症，其出现往往预示着其他主要器官系统（例如，肾脏、心脏和大脑）的并发症。尽管国际指南广泛推荐进行 DR 筛查，但由于基础设施稀缺和缺乏能够管理这些项目的训练有素的初级保健医生，这类程序在低资源环境中仍然不足。为了解决这些问题，我们开发了一个集成的图像-语言系统（DeepDR-LLM），结合了 LLM 模块和 DL 模块（DeepDR-Transformer），旨在提供量身定制的个性化糖尿病管理建议和实时完全自动化的 DR 筛查及转诊建议，以协助从事基础糖尿病护理的初级保健医生。

Key features and findings of our system should be emphasized. First, our LLM module was fine-tuned on an open-source LLM (using more than 300,000 real-world management recommendations from more than 250,000 participants), focusing on providing individualized and reliable management recommendations for the PCPs to manage common scenarios in diabetes. In our head-to-head analysis (experiment 2a), we showed that our LLM module performed better than nontuned ‘generic’ LLMs (that is, LLaMA) and PCPs, and with comparable performance to endocrinology residents. Furthermore, our two-arm, real-world prospective study in a primary diabetes care context demonstrated that the integration of DeepDR-LLM with PCP consultations enhanced self-management behaviors in newly diagnosed patients with diabetes and increased adherence to DR referrals for those with identified referable DR.

系统的关键特性和发现应予以强调。首先，我们的 LLM 模块在一个开源 LLM 上进行了微调（使用了来自超过 250,000 名参与者的 300,000 多个实际管理建议），重点是为初级保健医生提供个性化和可靠的管理建议，以应对糖尿病中的常见情境。在我们的对比分析（实验 2a）中，我们展示了我们的 LLM 模块比未调整的“通用” LLM（即 LLaMA）和初级保健医生表现更好，并且与内分泌科住院医生的表现相当。此外，我们在基础糖尿病护理背景下进行的双臂真实世界前瞻性研究表明，DeepDR-LLM 与初级保健医生咨询的整合提高了新诊断糖尿病患者的自我管理行为，并增加了对已识别的可转诊 DR 患者的 DR 转诊遵循率。

For the LLM module, in the head-to-head comparison (experiment 2a), we demonstrated that the LLM module of the DeepDR-LLM system could mostly generate reliable management recommendations for patients with diabetes in the retrospective evaluations in both English and Chinese. Previous studies have shown the promising potential of ‘generic’ LLMs in generating answers to real-world consumer queries for medical information, which are usually general and somewhat superficial. However, previous LLMs did not provide specific and detailed management recommendations for patients with common diseases, such as diabetes. Another limitation of previous head-to-head evaluations between LLMs and clinicians was the lack of model answers serving as benchmarks against which to compare the performance of the different participants. In our study, we enlisted an international panel of experts in endocrinology and ophthalmology (names listed in Methods) to formulate the model answers for each case, using established clinical guidelines (that is, 2023 American Diabetes Association Guidelines on Diabetes Care and 2018 International Council of Ophthalmology Guidelines on Diabetic Eye Care). Encouragingly, the LLM module showed performance comparable to endocrinology residents in Chinese and to PCPs in English on all three evaluated axes. These results demonstrated the potential of the DeepDR-LLM system to provide reliable management recommendations for PCPs to manage patients with diabetes.

对于 LLM 模块，在对比分析（实验 2a）中，我们展示了 DeepDR-LLM 系统的 LLM 模块在回顾性评估中能够在英语和中文中大多数情况下生成可靠的糖尿病管理建议。以往研究表明，“通用” LLM 在生成针对真实世界消费者查询的医疗信息答案方面具有良好的潜力，这些答案通常是一般性的且略显肤浅。然而，以往的 LLM 并没有为常见疾病（如糖尿病）患者提供具体和详细的管理建议。另一个局限性是以往 LLM 与临床医生之间的对比评估中缺乏模型答案作为基准来比较不同参与者的表现。在我们的研究中，我们邀请了一个由内分泌学和眼科学国际专家组成的专家小组（名单见方法部分），根据已建立的临床指南（即 2023 年美国糖尿病协会糖尿病护理指南和 2018 年国际眼科学理事会糖尿病眼护理指南）制定每个案例的模型答案。令人鼓舞的是，LLM 模块在所有三个评估维度上在中文中表现出与内分泌科住院医生相当的性能，在英语中则与初级保健医生相当。这些结果表明 DeepDR-LLM 系统有潜力为初级保健医生提供可靠的糖尿病管理建议。

With respect to the image-based DL component for DR screening, the DeepDR-Transformer module provided robust performance of DR grading in diverse multiethnic cohorts of patients with diabetes (experiment 2b). Importantly, we demonstrated this performance in both standard (desktop) and portable (mobile) fundus images. Existing DL systems for DR screening primarily focused on standard retinal images taken with more expensive desktop fundus cameras. In this study, we showed that DeepDR-Transformer could also achieve optimal performance in lower-resolution portable fundus images, with AUCs ranging from 0.896 to 0.920 for detecting referable DR across six external test datasets from China, Algeria and Uzbekistan. The robustness and generalizability of the DeepDR-Transformer module for identifying referable DR from portable fundus images could potentially empower point-of-care DR screening by PCPs in lower-resourced settings, where future DR screening models will probably involve such smaller, cheaper fundus cameras rather than standard retinal cameras.

关于基于图像的 DL 组件用于 DR 筛查，DeepDR-Transformer 模块在多种族糖尿病患者队列中的 DR 分级表现出强大的性能（实验 2b）。重要的是，我们在标准（台式）和便携式（移动）眼底图像中都展示了这种性能。现有的 DR 筛查 DL 系统主要集中在使用更昂贵的台式眼底相机拍摄的标准视网膜图像上。在这项研究中，我们展示了 DeepDR-Transformer 在低分辨率便携式眼底图像中也能实现最佳性能，检测可转诊 DR 的 AUC 从 0.896 到 0.920，在来自中国、阿尔及利亚和乌兹别克斯坦的六个外部测试数据集中表现一致。DeepDR-Transformer 模块在从便携式眼底图像中识别可转诊 DR 的稳健性和普适性有可能赋能在资源不足的环境中由初级保健医生进行现场 DR 筛查，未来的 DR 筛查模型可能会使用较小、更便宜的眼底相机，而不是标准的视网膜相机。

Finally, to further demonstrate the impact of DeepDR-LLM on patients’ self-management behavior for diabetes care (experiment 2d), we conducted a two-arm, real-world prospective study in a primary care setting. In the unassisted PCP arm, PCPs gave the management recommendations without the help of DeepDR-LLM. We found that these recommendations given by PCPs were generally rule-based with ‘one-size-fits-all’ treatment targets and lifestyle interventions, with little personalization (examples shown in Supplementary Table 8). These findings are probably explained by routine generic answers provided by PCPs, in part due to the lack of in-depth diabetes-specific training of PCPs, a problem even in high-resource settings. On the other hand, in the PCP+DeepDR-LLM arm, using electronic health records and fundus images, our integrated DeepDR-LLM system could generate good quality and empathetic recommendations. These suggestions were then used by PCPs to formulate management plans for each patient. Evaluations by consultant-level endocrinologists and patients indicated that the integration of DeepDR-LLM could significantly enhance the quality and perceived empathy of the PCPs’ recommendations.

最后，为了进一步展示 DeepDR-LLM 对糖尿病护理中患者自我管理行为的影响（实验 2d），我们在初级保健环境中进行了一个双臂真实世界前瞻性研究。在未使用 DeepDR-LLM 的初级保健医生组中，初级保健医生在没有 DeepDR-LLM 帮助的情况下给出了管理建议。我们发现这些建议通常是基于规则的，“一刀切”的治疗目标和生活方式干预，个性化程度较低（具体例子见补充表 8）。这些发现可能是由于初级保健医生提供的常规通用答案，部分原因是初级保健医生缺乏深入的糖尿病特定培训，这在资源丰富的环境中也存在问题。另一方面，在 PCP $^+$ DeepDR-LLM 组中，利用电子健康记录和眼底图像，我们的集成 DeepDR-LLM 系统能够生成高质量和富有同理心的建议。这些建议被初级保健医生用于为每位患者制定管理计划。内分泌科顾问和患者的评估表明，DeepDR-LLM 的整合可以显著提高初级保健医生建议的质量和感知的同理心。

Current digital and AI solutions cannot realize their full potential unless seamlessly integrated into existing clinical workflows. We showed that the integration of the DeepDR-LLM system into primary diabetes care could improve patient outcomes in two aspects. First, for patients with newly diagnosed diabetes, the DeepDR-LLM system could promote better self-management behaviors, including dietary modifications (for example, increased consumption of whole grains and decreased consumption of starchy vegetables), increased physical activities and adherence to antidiabetic medication. Concurrently, for those patients diagnosed with referable DR, receiving recommendations from PCPs that were augmented with DeepDR-LLM’s recommendation could improve the compliance rate of attending the ophthalmologists within 2 weeks, as well as shorten the referral interval. These results highlight the beneficial impact of the integrated DeepDR-LLM system in promoting patient engagement and encouraging more proactive health management behaviors.

当前的数字和 AI 解决方案无法充分发挥其潜力，除非与现有的临床工作流程无缝集成。我们展示了将 DeepDR-LLM 系统整合到基础糖尿病护理中可以在两个方面改善患者结果。首先，对于新诊断的糖尿病患者，DeepDR-LLM 系统可以促进更好的自我管理行为，包括饮食调整（例如，增加全谷物摄入和减少淀粉质蔬菜摄入）、增加体力活动和遵循抗糖尿病药物治疗。同时，对于那些被诊断为可转诊 DR 的患者，接受来自初级保健医生的、由 DeepDR-LLM 增强的建议可以提高在 2 周内就诊眼科医生的依从率，并缩短转诊间隔。这些结果突显了集成 DeepDR-LLM 系统在促进患者参与和鼓励更主动健康管理行为方面的积极影响。

For the implementation of digital solutions, feedback from end-users (in this case, PCPs) is critical. In our real-world prospective evaluation of the integrated DeepDR-LLM system, post deployment, most PCPs deemed the system simple and understandable, effective and safe. PCPs who participated in our survey also indicated they would like to use the DeepDR-LLM system in their future practice. Thus, our DeepDR-LLM system holds great potential for primary diabetes care to empower AI-assisted face-to-face consultation or teleconsultation. Nevertheless, for clinical adoption, other workflow challenges need to be addressed, including addressing data quality issues, ethical, privacy and legal considerations, and integration with existing healthcare information technology infrastructure. Thus, future research directions for DeepDR-LLM should focus on developing more transparent and unbiased datasets applicable to more diverse populations, thereby mitigating data quality issues and the risk of bias and discrimination; exploring ethical and legal frameworks for safe and responsible primary care setting implementation; integration with other technologies (for example, wearables) to further optimize patient engagement; and evaluating the long-term cost-effectiveness and patient outcomes as well as identifying areas for further improvement and refinement.

对于数字解决方案的实施，来自最终用户（在此情况下为初级保健医生）的反馈至关重要。在我们对集成 DeepDR-LLM 系统的真实世界前瞻性评估中，大多数初级保健医生在部署后认为该系统简单易懂、有效且安全。参与我们调查的初级保健医生还表示，他们希望在未来的实践中使用 DeepDR-LLM 系统。因此，我们的 DeepDR-LLM 系统在基础糖尿病护理中具有巨大的潜力，可以支持 AI 辅助的面对面咨询或远程咨询。然而，临床采纳还需解决其他工作流程挑战，包括解决数据质量问题、伦理、隐私和法律考虑，以及与现有医疗信息技术基础设施的集成。因此，未来对 DeepDR-LLM 的研究方向应集中于开发更透明和无偏的数据集，以适用于更广泛的人群，从而减轻数据质量问题及偏见和歧视的风险；探索安全和负责任的初级护理设置实施的伦理和法律框架；与其他技术（例如可穿戴设备）的集成，以进一步优化患者参与；评估长期成本效益和患者结果，并确定进一步改进和完善的领域。

Our study had limitations. First, since our integrated system was trained and fine-tuned exclusively on Chinese populations, additional training or fine-tuning on more diverse clinical and demographic cohorts may further improve the diagnostic accuracy and clinical utility of this system. However, we tested the generalizability of the DeepDR-Transformer module in diverse multiethnic multicountry datasets, and it showed consistently robust performance across them. Second, the LLM module of the DeepDR-LLM system was evaluated in English and Chinese. Future studies should extend this evaluation to other languages to better assess its broader applicability. Additionally, we did not compare the performance between the LLM module and other open-source LLMs due to concerns about privacy leakage. Third, in the evaluation of the DeepDR-Transformer module of the DeepDR-LLM system as an assistive tool in identifying referable DR, a 1-week washout period between unassisted and DeepDR-Transformer-assisted decisions may not be sufficient to fully eliminate recall bias. Fourth, our real-world prospective evaluation of the DeepDR-LLM system was not designed as a randomized controlled trial, and it primarily focused on self-management behaviors as the key clinical outcomes of interest, with a relatively short follow-up period, and not sufficiently on objective clinical outcomes (for example, documented progression of DR). As such, the findings of our study could potentially be influenced by sampling bias and self-reporting inaccuracies. Additionally, the PCPs were the same in the two arms, which could lead to biases in the intervention due to prior approaches and expectations. Despite these limitations, our study serves as a foundational proof-of-concept that can inform the design of future prospective or community-based studies or randomized controlled trials. We believe that it is essential to evaluate the longer-term effectiveness of this intervention via future (preferably blinded) randomized studies with a more extended observation period and multiple clinical outcomes (including objective measurements, duration of the consultation interactions, PCPs’ attitude toward the proposed system and subsequent patient outcomes).

我们的研究存在一些局限性。首先，由于我们的集成系统仅在中国人群上进行了训练和微调，额外的训练或微调在更多多样化的临床和人口统计队列上可能进一步提高该系统的诊断准确性和临床实用性。然而，我们在不同民族和多国数据集中测试了 DeepDR-Transformer 模块的泛化能力，这些数据集显示了在不同数据集中的一致性强大的性能。其次，DeepDR-LLM 系统的 LLM 模块在英语和中文中进行了评估。未来的研究应将此评估扩展到其他语言，以更好地评估其更广泛的适用性。此外，由于隐私泄露的担忧，我们没有将 LLM 模块与其他开源 LLM 的性能进行比较。第三，在评估 DeepDR-LLM 系统的 DeepDR-Transformer 模块作为识别可转诊 DR 的辅助工具时，未辅助和 DeepDR-Transformer 辅助决策之间的 1 周洗脱期可能不足以完全消除回忆偏倚。第四，我们对 DeepDR-LLM 系统的真实世界前瞻性评估不是随机对照试验，其主要关注的是自我管理行为作为关键临床结果，跟踪期相对较短，而未充分关注客观临床结果（例如，DR 的记录进展）。因此，我们研究的发现可能受到抽样偏倚和自我报告不准确的影响。此外，两组中的初级保健医生相同，这可能导致由于先前的方法和期望而在干预中产生偏差。尽管存在这些局限性，我们的研究作为概念验证的基础，可以为未来的前瞻性或社区研究或随机对照试验的设计提供参考。我们认为，有必要通过未来（最好是盲法）随机研究，评估该干预的长期效果，观察期更长，涵盖多种临床结果（包括客观测量、咨询互动的持续时间、初级保健医生对建议系统的态度和后续患者结果）。

In conclusion, we developed an integrated image–language system synergistically combining an LLM module and an image-based DL module (DeepDR-Transformer). We demonstrated that our DeepDR-LLM system could provide personalized high-quality and empathetic management recommendations for patients with diabetes based on their retinal images and routine clinical data. This integrated digital solution could provide complementary functionality to enhance individualized diabetes management and may be useful in low-resource but high-volume settings. Given its multifaceted performance and potential impact, our proposed system holds promise as a digital solution for primary diabetes care management, particularly relevant to 80% of the world’s diabetes population living in underserved, resource-limited settings.

总之，我们开发了一个集成的图像–语言系统，协同结合了 LLM 模块和基于图像的 DL 模块（DeepDR-Transformer）。我们证明了我们的 DeepDR-LLM 系统可以根据患者的视网膜图像和常规临床数据提供个性化、高质量和富有同理心的管理建议。该集成数字解决方案可以提供补充功能，以增强个性化糖尿病管理，并可能在资源有限但需求量大的环境中发挥作用。鉴于其多方面的性能和潜在影响，我们提出的系统作为初级糖尿病护理管理的数字解决方案，特别是对生活在服务不足、资源有限环境中的全球 80% 糖尿病患者，具有很大的潜力。

Methods Ethical approval

Data acquisition and diagnosis criteria

Fourteen independent cross-sectional datasets with standard fundus images and seven independent cross-sectional datasets with portable fundus images from people with diabetes were included in this study. For datasets with standard fundus images, two datasets were used to develop and internally validate the DeepDR-Transformer module: the Shanghai Integration Model (SIM) cohort$^{24}$ and the Shanghai Diabetes Prevention Program (SDPP) cohort. In addition, 12 multiethnic datasets were enrolled for external validation: the Nicheng Diabetes Screening Project (NDSP) cohort, the Diabetic Retinopathy Progression Study (DRPS) cohort, the Wuhan Tongji Health Management (WTHM) cohort, the Peking Union Diabetes Management (PUDM) cohort, the CNDCS cohort$^{50}$, the Guangzhou Diabetic Eye Study (GDES) cohort, the Chinese University of Hong Kong-Sight-Threatening Diabetic Retinopathy (CUHK-STDR) cohort$^{51}$, the Singapore Epidemiology of Eye Diseases study (SEED) cohort$^{22}$, the Singapore National Diabetic Retinopathy Screening Program (SiDRP) cohort$^{22}$, the Sankara Nethralaya-Diabetic Retinopathy Epidemiology and Molecular Genetics Study (SN-DREAMS) cohort$^{53}$, the Thai National Diabetic Retinopathy Screening Program (TNDRSP) cohort$^{54}$ and the United Kingdom Biobank (UKB) cohort. Use of data from the UK Biobank was approved with the UK Biobank Resource under application number 104443.

本研究纳入了14个独立的标准眼底图像横断面数据集和7个独立的糖尿病患者可移植眼底图像横断面数据库。对于具有标准眼底图像的数据集，使用两个数据集来开发和内部验证DeepDR Transformer模块：上海整合模型（SIM）队列24和上海糖尿病预防计划（SDPP）队列。此外，12个多民族数据集被纳入外部验证：泥城糖尿病筛查项目（NDSP）队列、糖尿病视网膜病变进展研究（DRPS）队列、武汉同济健康管理（WTHM）队列、北京联合糖尿病管理（PUDM）队列、CNDCS队列50、广州糖尿病眼科研究（GDES）队列、香港中文大学-轻度糖尿病视网膜病变（CUHK-STDR）队列51、新加坡眼科疾病流行病学研究（SEED）队列22、新加坡国家糖尿病视网膜病变筛查项目（SiDRP）队列22和Sankara Netheralaya-糖尿病视网膜病变流行病学和分子遗传学研究（SN-DREAMS）队列53、泰国国家糖尿病视网膜病变筛查计划（TNDRSP）队列54和英国生物银行（UKB）队列。英国生物库资源部批准了英国生物库数据的使用，申请号为104443。

Portable fundus images from the NDSP cohort were utilized to fine-tune the DeepDR-Transformer module. Another six datasets were included for external validation: the Chinese Portable Screening Study for Diabetic Retinopathy-East (CPSSDRE) cohort, the Chinese Portable Screening Study for Diabetic Retinopathy-Middle (CPSSDRM) cohort, the Chinese Portable Screening Study for Diabetic Retinopathy-West (CPSSDRW) cohort, the Chinese Portable Screening Study for Diabetic Retinopathy-Northeast (CPSSDRN) cohort, the Algerian Diabetic Retinopathy Study (ADRS) cohort and the Uzbek Diabetic Retinopathy Study (UDRS) cohort. The CPSSDRE, CPSSDRM, CPSSDRW and CPSSDRN cohorts were derived from real-world DR screening programs assisted by Phoebusmed. For the ADRS and UDRS datasets, the participants were recruited in regions of Algeria and Uzbekistan, respectively. These fundus images were captured using a variety of desktop and handheld fundus cameras from Canon, Topcon, Carl Zeiss, Optomed and MicroClear.

来自NDSP队列的便携式眼底图像用于微调DeepDR Transformer模块。另外六个数据集被纳入外部验证：中国东部糖尿病视网膜病变便携式筛查研究（CPSSDRE）队列、中国中部糖尿病视网膜病便携式筛查研究。CPSSDRE、CPSSDRM、CPSSDRW和CPSSDRN队列来自Phoebusmed协助的真实DR筛查项目。对于ADRS和UDRS数据集，参与者分别在阿尔及利亚和乌兹别克斯坦地区招募。这些眼底图像是使用佳能、拓普康、卡尔蔡司、Optomed和MicroClear的各种台式和手持式眼底相机拍摄的。

DR severity was graded into five levels (non-DR, mild nonproliferative DR (NPDR), moderate NPDR, severe NPDR or proliferative DR (PDR)) according to the International Clinical Diabetic Retinopathy Disease Severity Scale (AAO, October 2002)$^{55}$. Diabetic macular edema (DME) was considered to be present when there was retinal thickening at or within one disk diameter of the macular center or definite hard exudates in this region$^{56}$. Referable DR was defined as moderate NPDR or worse, DME or both. The adjudication process and interrater reliability of DR and DME grading of each dataset are presented in Supplementary Table 9. Retinal photographs were flagged as ungradable according to our previous study$^{24}$. Diabetes was diagnosed according to the latest American Diabetes Association guidelines$^{57}$.

根据国际临床糖尿病视网膜病变严重程度量表（AAO，2002年10月）^{55}$，DR严重程度分为五个级别（分别为非DR、轻度非增殖性DR（NPDR）、中度NPDR、重度NPDR或增生性DR（PDR））。当黄斑中心一个盘直径处或范围内出现视网膜增厚或该区域56出现明确的硬性渗出物时，认为存在糖尿病性黄斑水肿（DME）。可参考的DR被定义为中度NPDR或更严重，DME或两者兼而有之。每个数据集的DR和DME分级的判定过程和评分者间可靠性如补充表9所示。根据我们之前的研究24，视网膜照片被标记为不可分级。糖尿病是根据最新的美国糖尿病协会指南57诊断的。
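The referable-DR rule stated above (moderate NPDR or worse, DME, or both) can be written as a small predicate. This is only an illustrative sketch: the integer coding 0 to 4 for the five severity levels and the names `DRGrade` and `is_referable` are assumptions made here, not identifiers from the study.

```python
from enum import IntEnum


class DRGrade(IntEnum):
    """Five-level DR severity scale; the 0-4 integer coding is an assumption."""
    NON_DR = 0
    MILD_NPDR = 1
    MODERATE_NPDR = 2
    SEVERE_NPDR = 3
    PDR = 4


def is_referable(dr_grade: DRGrade, dme_present: bool) -> bool:
    """Referable DR: moderate NPDR or worse, DME present, or both."""
    return dr_grade >= DRGrade.MODERATE_NPDR or dme_present
```

For example, a patient with mild NPDR and no DME is not referable under this rule, whereas mild NPDR with DME is.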

The architecture of the DeepDR-LLM system

The DeepDR-LLM consists of two modules: the LLM module (module I) and the DeepDR-Transformer module (module II). Module II is used for image quality assessment, lesion segmentation and DR/DME grading from standard or portable fundus images, based on image-based DL. Module I is used for integrating clinical metadata of people with diabetes, including medical history, physical examinations, laboratory tests and DR/DME diagnosis results, to provide personalized diabetes management recommendations, based on LLM. Specifically, DR/DME diagnosis results could be derived from medical records or module II. In the integrated fashion, DeepDR-LLM could combine DR/DME diagnosis results derived from module II using fundus images as inputs with other clinical metadata to generate individualized management recommendations for people with diabetes.

DeepDR LLM由两个模块组成：LLM模块（模块I）和DeepDR Transformer模块（模块II）。模块II用于基于图像DL的标准或便携式眼底图像的图像质量评估、病变分割和DR/DME分级。模块I用于整合糖尿病患者的临床元数据，包括病史、体检、实验室检查和DR/DME诊断结果，以基于LLM提供个性化的糖尿病管理建议。具体而言，DR/DME诊断结果可以从病历或模块II中得出。以集成的方式，DeepDR LLM可以将使用眼底图像作为输入的模块II得出的DR/DME诊断结果与其他临床元数据相结合，为糖尿病患者生成个性化的管理建议。

LLM module’s supervised fine-tuning. Module I is a domain knowledge enhanced LLM model that is designed to formulate diabetes management recommendations, based on various clinical metadata from medical history, physical examinations, laboratory tests, and DR and DME diagnosis results. The primary foundational LLM (that is, LLaMA) was not directly effective in generating diabetes management recommendations due to a lack of domain-specific knowledge. Recognizing this gap, we developed a supervised fine-tuning approach to integrate diabetes management-related knowledge into the LLM training process. This approach could enhance the model’s capability to generate diabetes management recommendations by adding essential domain knowledge to the foundational LLM. The dataset for supervised fine-tuning was retrospectively sourced from 371,763 paired clinical data and real-world management recommendations from 267,730 participants from Shanghai Sixth People’s Hospital and Huadong Sanatorium after de-identification. Characteristics of the dataset are presented in Supplementary Table 10. Our proposed supervised fine-tuning approach can work with various LLM models, and we used LLaMA-7B as the foundational LLM for module I in further experiments. As updating all parameters (that is, the original weights of the LLM) during the fine-tuning of LLM is evidently not optimal in terms of efficiency$^{58}$, we employ the LoRA$^{59}$ and Adapter$^{60}$ techniques here. Specifically, LoRA adds additional network layers that form a bypass path alongside the original LLM weights; this bypass emulates the intrinsic rank by executing a dimensionality reduction followed by a dimensionality increase. During training, the parameters of the LLM remain fixed, with only the matrices $A$ (for reduction) and $B$ (for expansion) undergoing training. The dimensionality-reducing matrix $A$ is initialized with a random Gaussian distribution, whereas the dimensionality-expanding matrix $B$ is initialized as a zero matrix. The process is formulated as

$y=W_{O}x+B A x,$

where $x$ and $y$ are the input and output, respectively, and $W_{O}$ is the pretrained weight of the original LLM.

LLM 模块的监督微调。模块 I 是一个领域知识增强的 LLM 模型，旨在根据来自病史、体检、实验室测试以及 DR 和 DME 诊断结果的各种临床元数据制定糖尿病管理建议。由于缺乏领域特定知识，主要的基础 LLM（即 LLaMA）在生成糖尿病管理建议方面效果不佳。意识到这一差距，我们开发了一种监督微调方法，将糖尿病管理相关知识集成到 LLM 训练过程中。这种方法可以通过向基础 LLM 添加必要的领域知识来提高模型生成糖尿病管理建议的能力。监督微调的数据集来源于上海第六人民医院和华东疗养院的 371,763 对临床数据和 267,730 名参与者的真实管理建议，经过去标识化处理。数据集的特征见附表 10。我们提出的监督微调方法可以与各种 LLM 模型配合使用，我们在进一步的实验中使用了 LLaMA-7B 作为模块 I 的基础 LLM。由于在微调 LLM 过程中更新所有参数（即 LLM 的原始权重）在效率方面显然不够理想，我们在这里采用了 LoRA 和 Adapter 技术。具体而言，LoRA 添加了额外的网络层，形成了垂直于原始 LLM 的旁路路径，通过执行一维降维后进行维度增加，从而模拟固有秩。在训练过程中，LLM 的参数保持固定，仅矩阵 A（用于降维）和 B（用于扩展）进行训练。降维矩阵 A 以随机高斯分布初始化，而维度扩展矩阵 B 则初始化为零矩阵。过程形式化为

$y=W_{O}x+B A x,$

其中 $x$ 和 $y$ 分别是输入和输出。 $W_o$ 是原始 LLM 的预训练权重。
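To make the bypass formula concrete, the following minimal sketch evaluates $y=W_{O}x+BAx$ with plain Python lists. It is not the authors' implementation: the helper names (`matvec`, `init_lora`, `lora_forward`), the Gaussian standard deviation of 0.01 and the optional `scale` factor are assumptions introduced for illustration. The source formula has no scaling term, so `scale` defaults to 1.0.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def init_lora(d_out, d_in, rank, seed=0):
    """A (rank x d_in) is drawn from a Gaussian, B (d_out x rank) starts at zero."""
    rng = random.Random(seed)
    # The standard deviation 0.01 is an illustrative assumption.
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return A, B


def lora_forward(W0, A, B, x, scale=1.0):
    """Frozen path W0 x plus trainable low-rank bypass B(Ax)."""
    base = matvec(W0, x)                # pretrained weights, kept fixed
    update = matvec(B, matvec(A, x))    # reduce to `rank` dims, then expand
    return [b + scale * u for b, u in zip(base, update)]
```

Because $B$ starts as a zero matrix, the bypass contributes nothing at initialization, so fine-tuning begins from exactly the pretrained model's output.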

In addition, within each Transformer layer of the LLM, we embed additional initialized Adapter networks, which are used for dimensionality reduction and subsequent expansion of the Transformer’s feature representations. Each Adapter network, consisting of a two-layered multilayer perceptron (MLP) and an activation layer, is placed after the feed-forward layer and before the residual connection in a Transformer layer.

此外，在每个 Transformer 层中，我们嵌入了额外初始化的 Adapter 网络，这些网络用于 Transformer 特征表示的降维和后续扩展。每个 Adapter 网络由一个两层的多层感知器（MLP）和一个激活层组成，位于 Transformer 层的前馈层之后，残差连接之前。
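The Adapter placement described above can be sketched as a bottleneck MLP applied to the feed-forward output before the residual addition. Several details here are assumptions, not taken from the paper: the GELU activation, the internal skip connection inside the adapter (common in adapter designs), and all function names.

```python
import math


def gelu(v):
    """Gaussian error linear unit (the activation choice is an assumption)."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(W, b, x):
    """Affine map W x + b with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def adapter(h, W_down, b_down, W_up, b_up):
    """Two-layer MLP: project down, apply activation, project back up.

    The internal skip connection (h + ...) is an assumption of this sketch.
    """
    z = [gelu(v) for v in linear(W_down, b_down, h)]
    up = linear(W_up, b_up, z)
    return [hi + ui for hi, ui in zip(h, up)]


def ffn_sublayer_output(residual_in, ffn_out, adapter_params):
    """Adapter sits after the feed-forward output and before the residual add."""
    adapted = adapter(ffn_out, *adapter_params)
    return [r + a for r, a in zip(residual_in, adapted)]
```

With a zero-initialized up-projection, the adapter is an identity map at the start of training in this sketch, so the frozen LLM's behavior is unchanged until the new layers are updated.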

Combining the above two techniques, the training focuses solely on the newly incorporated layers, with the parameters of the original LLM frozen. For the training phase, we set a learning rate of $10^{-4}$ with a cosine learning rate scheduler, a warmup ratio of 0.03 and training epochs of 10. For the detailed training parameters, we used a batch size of 8, selected mapping dimensions of 4,096 for both LoRA and Adapters, and limited the maximum text length to 512 tokens, with a rank of 64, an alpha of 128 and a dropout rate of 0.05.

结合上述两种技术，训练仅集中于新加入的层，而原始 LLM 的参数保持冻结。在训练阶段，我们设置了 $10^{-4}$ 的学习率，使用余弦学习率调度器，预热比率为 0.03，训练轮次为 10。详细训练参数为：批量大小 8，LoRA 和 Adapter 的映射维度均为 4,096，最大文本长度限制为 512 个标记，秩为 64，alpha 为 128，dropout 率为 0.05。
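The learning-rate settings above (peak $10^{-4}$, cosine scheduler, warmup ratio 0.03) can be illustrated with a small schedule function. The linear shape of the warmup and the decay to zero at the end of training are assumptions of this sketch; the text specifies only the ratio and the scheduler type. `lr_at_step` is a hypothetical name.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_ratio=0.03):
    """Learning rate at a 0-indexed optimizer step.

    Linear warmup over the first `warmup_ratio` of steps (assumed shape),
    followed by cosine decay from `peak_lr` toward zero.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

Calling `lr_at_step(s, total_steps)` for each step `s` traces a short ramp to the peak rate followed by a smooth decay over the remaining steps.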

DeepDR-Transformer module’s development and training. As mentioned before, module II serves as a tool for module I in analyzing fundus images for DR predictions. So, we propose a separate model named DeepDR-Transformer, which can extract distinct features from fundus images after fine-tuning on specific tasks.

DeepDR-Transformer 模块的开发和训练。如前所述，模块 II 作为模块 I 的工具，用于分析视网膜图像以进行 DR 预测。因此，我们提出了一个名为 DeepDR-Transformer 的单独模型，该模型可以在特定任务上微调后，从视网膜图像中提取不同的特征。

We address the prediction and analysis of fundus images, including two main objectives: standard retinal image prediction and portable retinal image prediction. We utilize standard fundus images and related labels from the developmental dataset for model training. Moreover, we incorporate the Vision Transformer (ViT) architecture$^{61}$ and conduct supervised training with this dataset. We train DeepDR-Transformer for four tasks using standard fundus images: quality assessment models for images (determining gradability), DR grading prediction models, prediction models for DME (present or absent) and lesion segmentation models (microaneurysms, hemorrhages, cotton-wool spots (CWS) and hard exudates). For each model, we load pretrained weights from ImageNet$^{62}$, initiating end-to-end fine-tuning thereafter. For the structured prediction output yielded by module II (DeepDR-Transformer), we devise standardized linguistic templates, for example, ‘DR grade: 0 (DR not present); DME grade: 0 (DME not present)’. These linguistic templates could be subsequently integrated as a part of the input prompt for module I (LLM module), thus forming the integrated DeepDR-LLM system altogether. For instance, the DR/DME diagnosis results generated by DeepDR-Transformer, along with other clinical metadata, could be fed into the LLM module to generate individualized management recommendations for people with diabetes.

我们解决了视网膜图像的预测和分析，包括两个主要目标：标准视网膜图像预测和便携式视网膜图像预测。我们利用标准视网膜图像及相关标签进行模型训练。此外，我们结合了 Vision Transformer (ViT) 架构，并使用该数据集进行了监督训练。我们使用标准视网膜图像训练 DeepDR-Transformer 以完成四个任务：图像质量评估模型（确定可分级性）、DR 评分预测模型、DME 预测模型（有无）和病变分割模型（微动脉瘤、出血、棉絮斑（CWS）和硬性渗出物）。对于每个模型，我们从 ImageNet 加载预训练权重，然后进行端到端的微调。对于模块 II（DeepDR-Transformer）产生的结构化预测输出，我们设计了标准化的语言模板，例如‘DR 等级：0（DR 不存在）；DME 等级：0（DME 不存在）’。这些语言模板随后可以作为模块 I（LLM 模块）输入提示的一部分，从而形成集成的 DeepDR-LLM 系统。例如，DeepDR-Transformer 生成的 DR/DME 诊断结果，以及其他临床元数据，可以输入 LLM 模块，以生成个性化的糖尿病管理建议。
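The templating step described above can be sketched as follows. Only the grade-0 wording is quoted from the text. The labels for the other grades, the `key: value` layout of the clinical metadata and the function names are hypothetical choices made for illustration.

```python
# Only the grade-0 descriptions come from the text; the rest are assumed labels.
DR_LABELS = {
    0: "DR not present",
    1: "mild NPDR",
    2: "moderate NPDR",
    3: "severe NPDR",
    4: "PDR",
}
DME_LABELS = {0: "DME not present", 1: "DME present"}


def render_findings(dr_grade: int, dme_grade: int) -> str:
    """Turn module II's structured output into a standardized sentence."""
    return (
        f"DR grade: {dr_grade} ({DR_LABELS[dr_grade]}); "
        f"DME grade: {dme_grade} ({DME_LABELS[dme_grade]})"
    )


def build_prompt(clinical_metadata: dict, dr_grade: int, dme_grade: int) -> str:
    """Combine other clinical metadata with the templated findings (assumed layout)."""
    lines = [f"{key}: {value}" for key, value in clinical_metadata.items()]
    lines.append(render_findings(dr_grade, dme_grade))
    return "\n".join(lines)
```

A call such as `build_prompt({"HbA1c": "8.1%"}, 2, 0)` would yield a two-line prompt fragment whose last line carries the image-derived findings.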

DeepDR-Transformer fine-tuning for the classification and segmentation from standard fundus images

We choose ViT as the backbone model of our DeepDR-Transformer for its robust performance in modeling images. Our DeepDR-Transformer module is initialized by the pretrained weights from ImageNet and then fine-tuned on the developmental dataset for image quality assessment, DR grading, DME grading, and lesion segmentation.

我们选择 ViT 作为 DeepDR-Transformer 的主干模型，因为它在图像建模方面表现强劲。我们的 DeepDR-Transformer 模块通过 ImageNet 预训练权重初始化，然后在开发数据集上进行微调，以进行图像质量评估、DR 分级、DME 分级和病变分割。

The architecture of the DeepDR-Transformer module is composed of a series of Transformer layers. We represent the output features of these layers as $Z^{1}, Z^{2}, \cdots, Z^{n}$ , where $Z^{n}$ corresponds to the feature derived from the nth Transformer layer. Our DeepDR-Transformer model is initialized by the pretrained weights from ImageNet and then fine-tuned on the developmental dataset for four classification and segmentation tasks, respectively.

DeepDR-Transformer 模块的架构由一系列 Transformer 层组成。我们将这些层的输出特征表示为 $Z^{1}, Z^{2}, \cdots, Z^{n}$ ，其中 $Z^{n}$ 对应于第 n 层 Transformer 提取的特征。我们的 DeepDR-Transformer 模型通过 ImageNet 预训练权重初始化，然后在开发数据集上分别进行四个分类和分割任务的微调。

The tasks of fundus image quality assessment, DR grading, and DME grading are three classification problems. We apply the global average pooling to the final layer feature $Z^{n}$ of the DeepDR-Transformer module. Subsequently, it is processed by a fully connected linear layer to produce a vector that matches the number of classes in the respective classification task.

视网膜图像质量评估、DR 分级和 DME 分级是三个分类问题。我们对 DeepDR-Transformer 模块的最终层特征 $Z^{n}$ 应用全局平均池化。随后，经过一个全连接线性层处理，生成一个与各自分类任务中的类别数匹配的向量。

The objective of the fundus image lesion segmentation is to generate lesion pixel-level masks within the original two-dimensional image size of $h \times w$ , where $h$ represents the height and $w$ denotes the width of the original image. Consequently, we transform the feature $Z^{n} \in \mathbb{R}^{\frac{h \times w}{p \times p} \times c}$ into a feature $O_{S} \in \mathbb{R}^{\frac{h}{p} \times \frac{w}{p} \times c}$ , where $p$ is the patch size and $c$ is the number of channels. We alternate between convolutional layers and upsampling operations with a factor of $2 \times$ . Thus, to restore from $O_{S}$ to the original size of the input image, four upsampling operations are required. The final channel number is adjusted to 5, where the 0th channel represents the background and the other channels represent the lesions.

视网膜图像病变分割的目标是生成原始二维图像大小为 $h \times w$ 的病变像素级掩膜，其中 $h$ 表示图像的高度， $w$ 表示图像的宽度。因此，我们将特征 $Z^{n} \in \mathbb{R}^{\frac{h \times w}{p \times p} \times c}$ 转换为特征 $O_{S} \in \mathbb{R}^{\frac{h}{p} \times \frac{w}{p} \times c}$ ，其中 $p$ 是补丁大小， $c$ 是通道数。我们在卷积层和上采样操作之间交替，使用 $2 \times$ 的上采样因子。因此，为了将 $O_{S}$ 恢复到输入图像的原始大小，需要四次上采样操作。最终的通道数调整为 5，其中第 0 通道表示背景，其他通道表示病变。
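The shape bookkeeping in this paragraph can be checked with a few lines of arithmetic. The sketch below tracks only tensor shapes, not the convolutional layers themselves; the function name and the default channel count of 768 (taken from the embedding size reported for the Transformer layers) are choices made here for illustration.

```python
def segmentation_shapes(h, w, p=16, c=768, num_classes=5):
    """Trace feature-map sizes from ViT tokens back to full resolution.

    Returns (token count, reshaped grid, sizes after each 2x upsampling,
    final mask shape). Assumes p is a power of two dividing both h and w.
    """
    if p < 1 or p & (p - 1) != 0:
        raise ValueError("patch size must be a power of two")
    if h % p or w % p:
        raise ValueError("image size must be divisible by the patch size")
    num_tokens = (h * w) // (p * p)      # rows of Z^n
    grid = (h // p, w // p, c)           # shape of O_S
    stages = []
    cur_h, cur_w = h // p, w // p
    while cur_h < h:                     # each step is one 2x upsampling
        cur_h, cur_w = cur_h * 2, cur_w * 2
        stages.append((cur_h, cur_w))
    return num_tokens, grid, stages, (h, w, num_classes)
```

With the reported input resolution of 448 × 448 and patch size 16, this gives 784 tokens, a 28 × 28 grid, and four doublings (56, 112, 224, 448), matching the four upsampling operations stated above.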

All these tasks are considered classification problems (with segmentation being pixel-level classification). The loss function employed across these tasks is cross-entropy loss. We set the number of Transformer layers as 12 and the patch size $p$ as 16. We used the standardized structure for Transformer layers, with the following parameters for each layer: an embedding size of 768, an MLP size of 3072 (derived from an MLP ratio of 4), and 12 attention heads. The activation function is Gaussian error linear units, and layer normalization is applied. Our learning strategy includes a learning rate set at ${10}^{-3}$ , a weight decay of 0.05, and a layer decay of 0.75. We leverage the stochastic gradient descent optimizer for optimization tasks. To enhance stability and mitigate overfitting, the learning rate is scheduled to decrease by a factor of 0.1 every 10 epochs throughout a span of 40 epochs. Each gradient update iteration is configured with a batch size of 16, and the model’s input image resolution is set at $448 \times 448$ pixels. To improve the training dataset’s diversity and prevent overfitting, data augmentation techniques are utilized, including random resized cropping, affine transformations, horizontal and vertical flips, and Krizhevsky-inspired color augmentation. This color augmentation method introduces color noise to images based on precomputed eigenvectors and eigenvalues. It generates a color vector from a normal distribution (mean 0, standard deviation 0.5), calculates the noise using these eigenvalues and eigenvectors, and adds the resulting noise to the input image to achieve realistic color variation.

所有这些任务被视为分类问题（其中分割为像素级分类）。在这些任务中使用的损失函数是交叉熵损失。我们将 Transformer 层的数量设置为 12，补丁大小 $p$ 设置为 16。我们使用了标准化的 Transformer 层结构，每层的参数如下：嵌入大小为 768，MLP 大小为 3072（来源于 MLP 比例为 4），以及 12 个注意力头。激活函数为高斯误差线性单元，并应用层归一化。我们的学习策略包括将学习率设置为 ${10}^{-3}$ ，权重衰减为 0.05，层衰减为 0.75。我们利用随机梯度下降优化器进行优化任务。为了提高稳定性并减轻过拟合，学习率计划在 40 轮中每 10 轮下降 0.1 倍。每次梯度更新迭代的批量大小设置为 16，模型的输入图像分辨率设置为 $448 \times 448$ 像素。为了提高训练数据集的多样性并防止过拟合，采用了数据增强技术，包括随机调整裁剪、仿射变换、水平和垂直翻转，以及 Krizhevsky 启发的颜色增强。这种颜色增强方法基于预计算的特征向量和特征值为图像引入颜色噪声。它从正态分布（均值 0，标准差 0.5）中生成颜色向量，使用这些特征值和特征向量计算噪声，并将结果噪声添加到输入图像中，以实现逼真的颜色变化。
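The color augmentation described above can be sketched in a few lines. The eigenvalues and eigenvectors below are hypothetical placeholders (in practice they would be precomputed from the RGB covariance of the training images), and the function names are illustrative. Images are represented as nested lists of RGB pixels, and no clipping is applied.

```python
import random

# Hypothetical placeholder statistics; real values would be precomputed
# from the training set's RGB pixel covariance.
EIGVALS = [0.2, 0.02, 0.005]
EIGVECS = [            # rows are R, G, B; columns are eigenvectors
    [-0.57, 0.71, 0.41],
    [-0.58, 0.00, -0.81],
    [-0.58, -0.71, 0.41],
]


def color_noise(rng, std=0.5):
    """One RGB offset per image: sum over k of eigvec_k * alpha_k * eigval_k."""
    alphas = [rng.gauss(0.0, std) for _ in range(3)]
    return [
        sum(EIGVECS[ch][k] * alphas[k] * EIGVALS[k] for k in range(3))
        for ch in range(3)
    ]


def augment(image, rng, std=0.5):
    """Add the same color offset to every pixel of an H x W x 3 image."""
    noise = color_noise(rng, std)
    return [[[px[ch] + noise[ch] for ch in range(3)] for px in row] for row in image]
```

Because a single offset is drawn per image, relative color differences between pixels are preserved while the overall color cast varies from sample to sample.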

Transfer learning from standard to portable fundus images

The fine-tuned DeepDR-Transformer models, initially trained on standard fundus images, may yield inconsistent results when deployed on portable fundus images, given the inherent disparities in equipment, noise, and image dimensions. To address this, we utilize transfer learning on portable device images. This adaptation leveraged a tuning set derived from the NDSP dataset, including labels for image quality assessment, DR grading, and DME detection.

Integration of module I and module II. In our DeepDR-LLM system, there are two modes of integrating module I and module II.

In the physician-involved integration mode, the outputs of module II (that is, fundus image gradability; the lesion segmentation of microaneurysm, CWS, hard exudate, and hemorrhage; DR grade; and DME grade) could help physicians generate DR/DME diagnosis results (that is, fundus image gradability; DR grade; DME grade; and the presence of lesions). These DR/DME diagnosis results and other clinical metadata will be fed into module I to generate individualized management recommendations for people with diabetes.

In the automated integration mode, the DR/DME diagnosis results from module II and other clinical metadata could be automatically fed into module I to generate individualized management recommendations for people with diabetes. Specifically, the DR/DME diagnosis results include fundus image gradability, DR grade, DME grade classified by module II, and the presence of lesions segmented out by module II.
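As a rough illustration of the automated integration mode, the sketch below packages module II outputs together with clinical metadata into a single text input for module I. The class name, field names, text layout, and example values are all hypothetical; the paper does not specify the exact input format used by the LLM module.

```python
from dataclasses import dataclass, field


@dataclass
class RetinalFindings:
    """DR/DME diagnosis results as described for module II (hypothetical container)."""
    gradable: bool
    dr_grade: str
    dme_grade: str
    lesions: list[str] = field(default_factory=list)


def build_llm_input(findings: RetinalFindings, metadata: dict[str, str]) -> str:
    """Combine clinical metadata and retinal findings into one plain-text input."""
    lines = [f"{key}: {value}" for key, value in metadata.items()]
    lines.append(f"Fundus image gradable: {'yes' if findings.gradable else 'no'}")
    lines.append(f"DR grade: {findings.dr_grade}")
    lines.append(f"DME grade: {findings.dme_grade}")
    lesions = ", ".join(findings.lesions) if findings.lesions else "none"
    lines.append(f"Lesions present: {lesions}")
    return "\n".join(lines)


# Illustrative values only, not taken from the study data.
example = build_llm_input(
    RetinalFindings(True, "moderate NPDR", "no DME", ["microaneurysm", "hemorrhage"]),
    {"Age": "58", "HbA1c": "8.1%"},
)
```

In the physician-involved mode the same structure would apply, except that the findings would be confirmed or edited by a physician before being passed on.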

Evaluation of the LLM module in a retrospective dataset

To evaluate the capability of the LLM module to provide comprehensive diabetes management recommendations in both English and Chinese languages, we curated a retrospective dataset comprising 100 cases randomly selected from CNDCS (Supplementary Table 3). The flowchart of the evaluation is depicted in Extended Data Fig. 2.

We first translated the case scenarios into English. An international expert panel was then convened to derive reference evidence-based management recommendations from initial drafts created by four senior consultant-level endocrinologists (W.J., Y.B., H.L., and J.Y.). The international expert panel comprised eight endocrinologists—J.C.N.C., J.B.E.-T., L.C., A.O.Y.L., J.E.S., L.-L.L., R.S., and Y.M.B.—and two ophthalmologists, G.S.W.T. and L.J.C. After thorough review and consensus-building discussions, this group of ten experts subsequently agreed upon the English model answers, establishing the benchmark for the management recommendation evaluations in English.

For the Chinese recommendation evaluations, three consultant-level endocrinologists (W.J., H.L., and J.Y.) and two consultant-level ophthalmologists (T.C. and Q.W.) first translated the English reference recommendations into Chinese. They then contextualized these by incorporating guidelines from the Chinese Diabetes Society, aligning the recommendations with local clinical practice in China. These Chinese model answers, which had been carefully reviewed by Chinese experts, were then used for the assessments in Chinese.

Utilizing the aforementioned 100 cases, we generated management recommendations using both the nontuned LLaMA and our fine-tuned LLM module in DeepDR-LLM, in both English and Chinese. For the English-language assessment, we invited an endocrinology resident and a PCP (A.A., with more than 10 years of clinical experience) to formulate management strategies for these cases. The recommendations from LLaMA, DeepDR-LLM, the resident, and the PCP were then anonymized and subsequently appraised by a separate assessment panel of eight consultant-level physicians (L.-L.L., C.C.L., H.C.T., Z.H.L., C.S.-Y.T., S.L.K., A.Y.L.L., and S.F.M.), measured against the preestablished model answers in English described above. In a parallel process for the Chinese-language assessment, we sought recommendations from an endocrinology resident and a PCP (Y. Huang, with 15 years of clinical experience) from China. These recommendations were similarly anonymized and then evaluated by a separate assessment panel of four consultant-level endocrinologists from China, against the Chinese model answers previously generated. The 100 cases were distributed at random for assessment in both English and Chinese. Evaluations were anchored to three domains: the extent of inappropriate content, the extent of missing content, and the likelihood of possible harm. This evaluation framework was adapted from a methodology employed in a prior study (refer to Supplementary Table 4). Supplementary Table 8 shows an example of one case, along with its corresponding model answer for management, and four management recommendations provided by LLaMA, DeepDR-LLM, PCP, and endocrinology resident.

Moreover, we conducted an additional ablation study to investigate whether integrating the DeepDR-Transformer module affects the performance of diabetes management recommendations. The original head-to-head comparison of management recommendations provided by DeepDR-LLM, LLaMA, PCPs, and endocrinology residents did not use the DeepDR-Transformer module. We included participants with gradable standard fundus images and simply input the ground truth DR/DME grading and other clinical metadata into the LLM module to generate diabetes management recommendations.

To investigate whether the integration of the DeepDR-Transformer module, an image-based DL module (module II), would affect the performance of diabetes management recommendations, we conducted ablation studies in both English and Chinese languages. The design of the ablation studies is shown in Supplementary Fig. 3. There were three arms in the comparison:

Arm 1: input the ground truth DR/DME diagnosis results (that is, fundus image gradability, DR grade, DME grade, and the presence of lesions) and other clinical metadata into the LLM module.

Arm 2 (using the automated integration mode of the DeepDR-LLM system): input the DR/DME diagnosis results derived from the DeepDR-Transformer module (module II) and other clinical metadata into the LLM module.

Arm 3: input the other clinical metadata but without DR/DME diagnosis results into the LLM module.

For evaluations in English, we invited ten physicians from Singapore, Malaysia, Spain and the USA. For evaluations in Chinese, we invited four consultant-level endocrinologists from China. The evaluation results are shown in Supplementary Fig. 4. The results showed that the performance of the LLM module (module I) after integration with module II (that is, arm 2 in this experiment) was comparable to that using the ground truth DR/DME diagnosis results as inputs (arm 1). As expected, when DR/DME diagnosis results were not input into the LLM module, its performance decreased significantly.

Evaluation of the performance of the DeepDR-Transformer on retrospective datasets

The DeepDR-Transformer module was retrospectively developed and validated in 14 datasets with standard fundus images and 7 datasets with portable fundus images as described before. The characteristics of the participants and eyes used in the performance evaluation of DeepDR-Transformer are summarized in Supplementary Tables 1 and 2.

For image quality assessment, we assessed the discriminative performance of the DeepDR-Transformer module for gradability assessment (gradable or ungradable image) on the internal test dataset, four external test datasets with standard fundus images (NDSP, DRPS, WTHM, and PUDM) and six external test datasets with portable fundus images (CPSSDRE, CPSSDRM, CPSSDRW, CPSSDRN, ADRS, and UDRS).

For lesion segmentation, we annotated retinal lesions, including microaneurysms, CWS, hard exudates, and hemorrhages, on 5,690 gradable eyes (11,380 images) in the developmental dataset and 2,438 gradable eyes (4,876 images) in the internal test dataset (7:3). Each fundus image was annotated by two ophthalmologists, yielding two annotations for each type of lesion. We considered the two annotations valid if the Intersection over Union (IoU) between them was greater than 0.85. Otherwise, a senior supervisor checked the annotations and gave feedback, and the two ophthalmologists reannotated the image until the IoU exceeded 0.85. Finally, we took the union of the valid annotations as the final ground truth segmentation annotation. We assessed the performance of DeepDR-Transformer for segmenting microaneurysm, CWS, hard exudate, and hemorrhage on the internal test dataset.

For DR grading, we assessed the performance of DeepDR-Transformer for detecting early-to-late stages of DR, DME, and referable DR on the internal test dataset and 12 external test datasets with standard fundus images (NDSP, DRPS, WTHM, PUDM, CNDCS, GDES, CUHK-STDR, SEED, SiDRP, SN-DREAMS, TNDRSP, and UKB). Moreover, we assessed the performance of DeepDR-Transformer for detecting referable DR on six external test datasets with portable fundus images (CPSSDRE, CPSSDRM, CPSSDRW, CPSSDRN, ADRS, and UDRS).

Evaluation of DeepDR-Transformer as an assistive tool in identifying referable DR

In this retrospective evaluation, we enlisted three distinct study sites: Huadong Sanatorium in the urban area of Shanghai, China; The People’s Hospital of Sixian County in the rural area of Anhui Province, China; and Singapore National Eye Centre, Singapore. While the two study sites in China employed PCPs for DR grading, the Singapore study site utilized professional graders. At each of the two study sites in China, we recruited six PCPs with different levels of experience in DR grading: two with under 2 years (junior), two with around 4 years (intermediate), and two with over 6 years (senior). At the study site in Singapore, we recruited three graders with varying levels of experience in DR grading: one junior grader with under 2 years of experience, one intermediate grader with 4 years, and one senior grader with over 6 years of experience.

For the fundus images used in the study sites in China, 500 gradable eyes of standard fundus images (250 nonreferable eyes and 250 referable eyes) were randomly selected from six external test datasets (NDSP, DRPS, WTHM, PUDM, CNDCS, and GDES), while 500 gradable eyes of portable fundus images (250 nonreferable eyes and 250 referable eyes) were randomly selected from six external test datasets (CPSSDRE, CPSSDRM, CPSSDRW, CPSSDRN, ADRS, and UDRS). For the fundus images used in the study site in Singapore, 300 gradable eyes of standard fundus images (150 nonreferable eyes and 150 referable eyes) were randomly selected from the SEED study. Referable DR was defined as moderate NPDR or worse, DME or both.
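The referable DR definition above can be expressed as a small rule. The sketch below assumes a five-level ordered DR severity scale (no DR, mild NPDR, moderate NPDR, severe NPDR, PDR) and a binary DME flag; the scale labels and the function name are assumptions made for illustration, since the grading criteria are defined elsewhere in the paper.

```python
# Assumed ordered severity scale, from least to most severe.
DR_SEVERITY = ["no DR", "mild NPDR", "moderate NPDR", "severe NPDR", "PDR"]
REFERABLE_FROM = DR_SEVERITY.index("moderate NPDR")


def is_referable(dr_grade: str, has_dme: bool) -> bool:
    """Referable DR: moderate NPDR or worse, DME, or both."""
    return DR_SEVERITY.index(dr_grade) >= REFERABLE_FROM or has_dme
```

Note that under this rule an eye with mild NPDR but with DME is still referable, because either condition alone is sufficient.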

To evaluate the accuracy and time efficiency of detecting referable DR cases, we conducted a comparative analysis before and after the integration of the DeepDR-Transformer module into the grading process. Initially, all human experts (that is, PCPs or professional graders) determined the referability of cases without the aid of DeepDR-Transformer. After a washout period of 1 week to minimize recall bias, these experts reassessed the same cases, this time with the assistance of the DeepDR-Transformer module. To ensure the integrity of the evaluation, the sequence of the cases was randomized before each grading session.

Real-world prospective study

The real-world two-arm, prospective study was conducted in Huadong Sanatorium (affiliated to Shanghai Municipal Health Commission), which is a public medical institution integrating high-volume primary care and health examinations. The study aimed to investigate the impact of the DeepDR-LLM system on patient health outcomes, and satisfaction of both patients and PCPs, when deployed into a high-volume primary care setting. This real-world prospective study was approved by the Ethics Committee of Huadong Sanatorium (2023-08, approved 2 April 2023). The number of enrolled participants was estimated on the basis of the proportion of participants with diabetes and average visits per week in the study site, before the deployment of DeepDR-LLM.

The study design of the real-world prospective study is shown in Extended Data Fig. 3 (showing numbers of participants included in the outcome analysis), and the flow diagram illustrating the screening, selection, and management of study participants is shown in Supplementary Fig. 2. In these 12 weeks, 20,124 participants attended the health examinations. They received medical history taking, physical examinations, laboratory tests, and fundus examinations (Supplementary Table 11). Among them, patients with diabetes and gradable fundus images (n = 1,994) were subsequently recruited and included in this study. Details of the inclusion and exclusion criteria are shown in Supplementary Section B. These participants were allocated into two arms (the unassisted PCP arm and the PCP+DeepDR-LLM arm) according to the visit time of the participant. The physician-involved integration mode of the DeepDR-LLM system was deployed in the PCP+DeepDR-LLM arm. Participants attending health examinations from 10 April 2023 to 21 May 2023 (first 6 weeks of evaluation period) were included in the unassisted PCP arm, while those from 22 May 2023 to 2 July 2023 (later 6 weeks of evaluation period) were included in the PCP+DeepDR-LLM arm. In this study, a total of 12 PCPs were responsible for primary diabetes care management (Supplementary Table 12). In the unassisted PCP arm, based on examination results, PCPs gave management recommendations. In the PCP+DeepDR-LLM arm, the DeepDR-LLM system was integrated into the clinical workflow (Extended Data Fig. 4). Initially, PCPs gave management recommendations independently. Then, the DeepDR-LLM system assisted PCPs in generating DR/DME diagnosis results and utilized DR/DME diagnosis results and patient information from the electronic health systems, including medical history, physical examinations, and laboratory tests to automatically generate recommendations. 
Subsequently, PCPs edited and produced their final recommendations by taking DeepDR-LLM’s recommendations into account. In both arms, participants were given treatment advice for diabetes face to face by PCPs based on the above recommendations (details in Supplementary Section B).
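Because participants were assigned to arms by visit date rather than by randomization, the allocation rule is fully determined by the calendar. A minimal sketch using the dates given above (the function and constant names are hypothetical):

```python
from datetime import date
from typing import Optional

# Evaluation windows stated in the text (inclusive on both ends).
UNASSISTED_WINDOW = (date(2023, 4, 10), date(2023, 5, 21))
ASSISTED_WINDOW = (date(2023, 5, 22), date(2023, 7, 2))


def allocate_arm(visit: date) -> Optional[str]:
    """Map a health-examination date to a study arm, or None if outside the study."""
    if UNASSISTED_WINDOW[0] <= visit <= UNASSISTED_WINDOW[1]:
        return "unassisted PCP"
    if ASSISTED_WINDOW[0] <= visit <= ASSISTED_WINDOW[1]:
        return "PCP+DeepDR-LLM"
    return None


def window_days(window: tuple[date, date]) -> int:
    """Length of an inclusive date window in days."""
    return (window[1] - window[0]).days + 1
```

Each window spans 42 days, which matches the two 6-week halves of the 12-week evaluation period.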

These participants registered on the mobile follow-up platform deployed in the study site, which could reach the participants via instant messaging and collect information on their current condition of diabetes management using online questionnaires. They were followed up at 2 weeks and/or 4 weeks through the mobile follow-up platform.

All participants diagnosed with referable DR were contacted at the 2-week follow-up to check whether (and when) they attended appointments with an ophthalmologist. All participants with newly diagnosed diabetes filled out a questionnaire on their diabetes management status at baseline, 2-week follow-up, and 4-week follow-up (Extended Data Fig. 3). The questionnaire investigated the frequency of blood glucose monitoring, physical therapy, nutrient therapy, drug therapy, and cessation of drinking and smoking.

The post-deployment evaluation of management recommendations (ranking, quality, and empathy) was conducted in substudies I and II of the PCP+DeepDR-LLM arm, with ratings provided by three consultant-level endocrinologists and by participants. Participants’ opinions on the three recommendations were collected at the 4-week follow-up. We collected opinions from 372 participants with newly diagnosed diabetes and/or referable DR (6 participants with both newly diagnosed diabetes and referable DR) in the PCP+DeepDR-LLM arm. Each of the three consultant-level endocrinologists was invited to evaluate all the cases. For each case, the recommendations from the PCP, DeepDR-LLM, and PCP+DeepDR-LLM were anonymized and randomly ordered. The endocrinologists and surveyed participants ranked these three recommendations and judged both ‘the quality of information provided’ (very poor, poor, acceptable, good or very good) and ‘the empathy or bedside manner provided’ (not empathetic, slightly empathetic, moderately empathetic, empathetic or very empathetic).

Furthermore, PCPs who used the DeepDR-LLM system in this real-world study were invited to complete a satisfaction questionnaire within two weeks after the conclusion of the study. The questionnaire comprised seven items assessing these PCPs’ views on integrating DeepDR-LLM into daily routine practice (Extended Data Table 6).

Statistical analysis

In the retrospective evaluation of the LLM module in both English and Chinese languages, the total score was calculated by summing the scores in the three domains and ranged from 3 to 9 points. For ‘extent of inappropriate content’ and ‘extent of missing content’, 1 point was given for ‘Present, substantial clinical significance’, 2 points for ‘Present, little clinical significance’ and 3 points for ‘None’. For ‘likelihood of possible harm’, 1 point was given for ‘High’, 2 points for ‘Medium’ and 3 points for ‘Low’. We compared the total scores of DeepDR-LLM, LLaMA, PCPs and endocrinology residents using the Friedman test. Post-hoc pairwise comparisons were performed using the Wilcoxon signed-rank test. $P$ values for multiple comparisons were adjusted using the Bonferroni method.
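The scoring rubric and the Bonferroni adjustment can be sketched in a few lines. The function names are hypothetical, and the $P$ values in the example are made up purely to show the arithmetic; with four groups there are six pairwise comparisons, so each raw $P$ value is multiplied by 6 and capped at 1.

```python
# Points per rating, as defined in the text.
CONTENT_POINTS = {
    "Present, substantial clinical significance": 1,
    "Present, little clinical significance": 2,
    "None": 3,
}
HARM_POINTS = {"High": 1, "Medium": 2, "Low": 3}


def total_score(inappropriate: str, missing: str, harm: str) -> int:
    """Sum of the three domain scores; ranges from 3 (worst) to 9 (best)."""
    return CONTENT_POINTS[inappropriate] + CONTENT_POINTS[missing] + HARM_POINTS[harm]


def bonferroni(p_values: list[float]) -> list[float]:
    """Bonferroni-adjusted P values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw P values for the six pairwise comparisons among four groups.
raw = [0.001, 0.004, 0.02, 0.05, 0.2, 0.6]
adjusted = bonferroni(raw)
```

A recommendation rated ‘None’, ‘None’ and ‘Low’ scores the maximum of 9, while one rated at the worst level in every domain scores 3.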

In the development and validation of the DeepDR-Transformer module, the performance of the image quality assessment and DR grading was measured by the AUCs generated by plotting sensitivity (true positive rate) versus 1 − specificity (false positive rate). The operating thresholds for sensitivity and specificity were selected using the Youden index. The performance of lesion segmentation was measured using the IoU and F score. Cluster-bootstrap, bias-corrected, asymptotic two-sided $95\%$ confidence intervals (CIs) adjusted for clustering by patients were calculated and presented for proportions (sensitivity, specificity) and AUC, respectively.
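The Youden index is $J = \text{sensitivity} + \text{specificity} - 1$, and the operating threshold is the one that maximizes $J$. The sketch below shows this selection on toy data; the function names and scores are illustrative, a sample is called positive when its score is at or above the threshold (an assumed convention), and both classes must be present in the labels.

```python
def sens_spec(scores: list[float], labels: list[int], threshold: float) -> tuple[float, float]:
    """Sensitivity and specificity when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(scores: list[float], labels: list[int]) -> tuple[float, float]:
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    best_t, best_j = scores[0], -1.0
    for t in sorted(set(scores)):
        sens, spec = sens_spec(scores, labels, t)
        j = sens + spec - 1.0
        if j > best_j:  # keep the lowest threshold among ties
            best_t, best_j = t, j
    return best_t, best_j


# Toy, perfectly separable example (not study data).
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
toy_labels = [0, 0, 0, 1, 1, 1]
threshold, j_value = youden_threshold(toy_scores, toy_labels)
```

Candidate thresholds are restricted to the observed scores, which is sufficient because sensitivity and specificity only change at those values.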

In the evaluation of DeepDR-Transformer as an assistive tool for PCPs and professional graders in identifying referable DR, the performance was measured by sensitivity and specificity of detecting referable DR. The $95\%$ CIs of the assessment time per eye were calculated using bootstrap methods. The assessment time before and after the DeepDR-Transformer assistance was compared using Wilcoxon signed-rank tests.
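The text does not state which bootstrap variant or summary statistic was used for assessment time. As one plausible reading, the sketch below computes a percentile bootstrap CI for the mean time per eye; the function name, the number of resamples, the fixed seed, and the timing values are all assumptions for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(values: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean (assumed statistic and method)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical assessment times in seconds per eye (not study data).
times = [10.0, 12.0, 9.0, 14.0, 11.0, 13.0, 10.0, 12.0]
ci_low, ci_high = bootstrap_ci(times)
```

For paired before/after times from the same graders, the comparison itself would still use the Wilcoxon signed-rank test as described; the bootstrap here only addresses the interval estimate.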

In the real-world prospective study, to compare the differences in outcomes at baseline, 2-week and 4-week follow-up among participants with newly diagnosed diabetes or referable DR between two arms, we performed linear mixed models, logistic regression models, and linear regression models, adjusting for age, sex and baseline HbA1c. For post-deployment evaluation of management recommendations by both endocrinologists and participants, we reported the percentage of evaluators for their first-choice preference as well as the Clopper–Pearson $95\%$ CI. All hypotheses tested were two-sided, and a $P$ value of less than 0.05 was considered statistically significant.
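The Clopper–Pearson interval is the exact binomial CI: its lower bound $p_L$ solves $P(X \ge k \mid p_L) = \alpha/2$ and its upper bound $p_U$ solves $P(X \le k \mid p_U) = \alpha/2$. The sketch below finds both by bisection using only the standard library. It is an illustration for modest sample sizes (the float arithmetic may overflow for very large $n$), not the software used in the study, and the function names are hypothetical.

```python
from math import comb
from typing import Callable


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(func: Callable[[float], float], target: float) -> float:
    """Bisection for func(p) = target, where func is increasing on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) CI for a proportion k / n."""
    # P(X >= k) = 1 - cdf(k - 1) increases with p; solve for alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # P(X <= k) = alpha / 2 is equivalent to P(X > k) = 1 - alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper
```

When $k = 0$ the upper bound has the closed form $1 - (\alpha/2)^{1/n}$, which gives a convenient check on the bisection.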

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Code availability

The code being used in the current study for developing the algorithm is provided via GitHub at https://github.com/DeepPros/DeepDR-LLM .

Additional information

Extended data is available for this paper at https://doi.org/10.1038/s41591-024-03139-8 .

Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41591-024-03139-8 .

Correspondence and requests for materials should be addressed to Weiping Jia, Yih-Chung Tham, Huating Li, Bin Sheng or Tien Yin Wong.

Peer review information Nature Medicine thanks Stephen Gilbert, Francisco Pasquel, Sergey Tarima and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Lorenzo Righetto and Sonia Muliyil, in collaboration with the Nature Medicine team.

Reprints and permissions information is available at www.nature.com/reprints .