
[Nature Medicine] A visual–language foundation model for pathology image analysis using medical Twitter



  • Abstract
  • Introduction
  • Results
    • Creating OpenPath from Twitter and other public sources
    • Training a visual–language AI using OpenPath
    • PLIP can classify new images without further training
    • PLIP improves image representations for training models
    • PLIP enhances pathology image retrieval from text inputs
    • PLIP enhances pathology image retrieval from image inputs
  • Discussion
  • Methods
    • Description of the OpenPath dataset
    • External validation datasets
    • Model training and tuning
    • Evaluation metrics and statistical analysis
      • F1 score and MCC
      • Image retrieval evaluation
      • Statistical significance and correlation
  • Data availability
  • Code availability
  • Additional information


Abstract

The lack of annotated publicly available medical images is a major barrier for computational research and education innovations. At the same time, many de-identified images and much knowledge are shared by clinicians on public forums such as medical Twitter. Here we harness these crowd platforms to curate OpenPath, a large dataset of 208,414 pathology images paired with natural language descriptions. We demonstrate the value of this resource by developing pathology language–image pretraining (PLIP), a multimodal artificial intelligence with both image and text understanding, which is trained on OpenPath. PLIP achieves state-of-the-art performance for classifying new pathology images across four external datasets: for zero-shot classification, PLIP achieves F1 scores of 0.565–0.832, compared with 0.030–0.481 for the previous contrastive language–image pretrained model. Training a simple supervised classifier on top of PLIP embeddings also yields a 2.5% improvement in F1 score over using other supervised model embeddings. Moreover, PLIP enables users to retrieve similar cases by either image or natural language search, greatly facilitating knowledge sharing. Our approach demonstrates that publicly shared medical information is a tremendous resource that can be harnessed to develop medical artificial intelligence for enhancing diagnosis, knowledge sharing and education.

Introduction

Recent advances in artificial intelligence (AI) in computational pathology have produced powerful tools for distinguishing cell or tissue types, generating diagnostic outputs, and retrieving relevant images from routinely stained H&E slides (1–5). Although several high-quality datasets are available for task-specific machine learning, such as PanNuke (6), Lizard (7) and NuCLS (8), progress in computational pathology has been hindered by the lack of sufficiently diverse datasets with pathology labels annotated in natural language. This data deficiency is particularly acute given that more than 8,000 diseases exist (9) and their pathological classifications continue to evolve as molecular and cellular understanding of disease advances (10). Although few-shot learning with fine-tuning can mitigate this issue, a generalized pathology AI system that can address multiple clinical needs has yet to be developed.

In parallel, numerous de-identified pathology images are shared openly on the Internet, particularly on social media (13), where clinicians discuss de-identified medical images with their peers (14–17). For example, Schaumberg et al. collected 13,626 images from Twitter together with images from PubMed articles and built machine learning models for diverse diagnostic tasks. These publicly accessible data and the accompanying discussions are valuable to the pathology community for knowledge sharing and education. Notably, Twitter hosts a comprehensive set of pathology subspecialty hashtags, user-generated tags that were systematically organized during the 2016 United States and Canadian Academy of Pathology (USCAP) meeting (18). The shared cases range from common to rare, most stained with H&E and some with immunohistochemistry. The public discussion around these images remains an underutilized resource for advancing medical AI (13).

In this study, we mined popular pathology Twitter hashtags to collect 243,375 public pathology images. We expanded this collection with data from other Internet sites, retrieved through the Large-scale Artificial Intelligence Open Network (LAION) (22), and applied rigorous quality-control steps. The result is OpenPath, a collection of 208,414 pathology image–text pairs. To our knowledge, OpenPath is the largest publicly available pathology dataset annotated with text descriptions. Leveraging this resource, we built a general-purpose AI foundation model for pathology that integrates both image and language. In contrast to prior approaches that rely on visual information alone, incorporating natural language annotations during training improves the semantic knowledge learned from images and equips the model for diverse downstream tasks.

This study describes the OpenPath dataset and introduces PLIP (pathology language–image pretraining), a model trained by contrastive learning on paired images and captions from OpenPath. We first evaluate PLIP's ability to generalize to unseen captions via zero-shot learning (23). We further show that PLIP serves as a general-purpose image encoder, capturing improved image representations for pathology that support linear probing for classification across diverse tissue types and learning tasks, which is particularly valuable in clinical settings where annotated data are scarce. Finally, PLIP provides a flexible pathology image search engine based on either text or image input, serving as an educational and knowledge-sharing tool for clinicians and pathology trainees; we systematically assess its retrieval performance.

Results

Creating OpenPath from Twitter and other public sources


Guided by the 2016 USCAP recommendations and the Pathology Hashtag Ontology project, we selected 32 Twitter hashtags dedicated to pathology subspecialties. Using these hashtags as keywords, we retrieved tweets posted between 21 March 2006 (the date of the first tweet) and 15 November 2022, forming a large pathology data resource in which each image is paired with its natural language description. The definition of each hashtag is provided in Extended Data Table 1. Data retrieval followed the usage policies of Twitter and related entities. To ensure quality and reliability, we applied strict inclusion and exclusion criteria: retweets were removed, potentially sensitive or irrelevant tweets were excluded, non-pathology images were filtered out, and the remaining text was cleaned of redundant content. The final OpenPath dataset comprises (1) 116,504 image–text pairs from tweets under the 32 pathology hashtags; (2) 59,869 image–text pairs from the most-liked replies associated with those tweets; and (3) 32,041 additional image–text pairs collected from the broader Internet through the LAION database. The median caption length is 17 words.

Training a visual–language AI using OpenPath

Unlike traditional supervised learning based solely on categorical labels, natural language text carries richer semantic knowledge and interrelated information, which can further improve image understanding and support multiple downstream applications. In this study, we fine-tuned a pretrained contrastive language–image pretraining (CLIP) model (25) on OpenPath using contrastive learning. To do so, we added a pathology image preprocessing pipeline consisting of image downsampling, random cropping and data augmentation (Methods). During training, the PLIP model generates two embedding vectors from the text and image encoders (Fig. 1e). Contrastive learning then optimizes these embeddings so that paired image–text vectors are highly similar while unpaired vectors have low similarity (Fig. 1f and Methods).
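To make the training objective concrete, the following is a minimal sketch of CLIP-style symmetric contrastive fine-tuning on batches of paired pathology images and captions; the encoder objects, optimizer and temperature value are placeholders rather than the authors' released training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    # L2-normalize so that dot products equal cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # match each image to its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)        # match each caption to its image
    return (loss_i2t + loss_t2i) / 2

# Assumed training step: `image_encoder` and `text_encoder` are the two CLIP towers,
# `images` are preprocessed tensors and `tokens` are the tokenized captions.
def training_step(image_encoder, text_encoder, optimizer, images, tokens):
    image_emb = image_encoder(images)                      # (B, 512) image embeddings
    text_emb = text_encoder(tokens)                        # (B, 512) text embeddings
    loss = clip_contrastive_loss(image_emb, text_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```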

PLIP can perform a wide range of inferences across medical applications without explicit task-specific training. The following sections illustrate how PLIP is applied to various downstream tasks.

PLIP can classify new images without further training

We first assessed PLIP's zero-shot capability, that is, its ability to recognize new classes at scale without retraining (Fig. 2a). The evaluation used four external validation datasets: (1) the Kather colon dataset with nine tissue types; (2) the PanNuke dataset (benign versus malignant); (3) the DigestPath dataset (benign versus malignant); and (4) the WSSS4LUAD dataset (tumor versus normal) (Fig. 2b and Supplementary Fig. 1). For each dataset, labels were converted into descriptive sentences, for example transforming "tumor" into "an H&E image of tumor". PLIP was benchmarked against the original CLIP model, which was trained on diverse natural images from the Internet, using the weighted F1 score, a metric that combines precision and recall while accounting for class imbalance. PLIP consistently outperformed both the baseline CLIP model and majority-class prediction across all datasets: on the Kather colon dataset (nine classes), PLIP achieved F1 = 0.565 (95% CI = 0.559–0.572); on PanNuke (benign versus malignant), F1 = 0.656 (95% CI = 0.639–0.667); on DigestPath (benign versus malignant), F1 = 0.832 (95% CI = 0.829–0.834); and on WSSS4LUAD (tumor versus normal), F1 = 0.734 (95% CI = 0.723–0.745). All of these results substantially surpassed CLIP and majority-class prediction (Supplementary Tables 3 and 4). The Matthews correlation coefficient likewise showed that PLIP performed best on all evaluated datasets (Supplementary Table 4).
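The zero-shot procedure described above can be sketched as follows; the text encoder call is a placeholder, and the prompt template follows the Methods ("an H&E image of {keyword}").

```python
import numpy as np

def zero_shot_classify(image_embeddings, class_texts, text_encoder):
    """Assign each image to the class whose prompted caption is most similar.

    image_embeddings: (N, D) array of PLIP image embeddings.
    class_texts: list of prompts such as "an H&E image of tumor".
    text_encoder: assumed callable returning a (C, D) array of text embeddings.
    """
    text_emb = text_encoder(class_texts)                              # (C, D)
    # L2-normalize both modalities so dot products are cosine similarities
    img = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    similarity = img @ txt.T                                           # (N, C)
    return similarity.argmax(axis=1)                                   # predicted class per image

# Example prompts for a binary benign/malignant dataset such as PanNuke
prompts = ["an H&E image of benign", "an H&E image of malignant"]
```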

Figure 2d shows the confusion matrix between ground-truth and predicted labels for the Kather colon dataset. Compared with the other models examined in Extended Data Fig. 2, PLIP achieved reasonable zero-shot performance on this challenging nine-class task. It reliably distinguished several tissue types, including adipose tissue (ADI), background (BACK), colorectal adenocarcinoma epithelium (TUM) and lymphocytes (LYM), but struggled with others such as mucus (MUC) and debris (DEB); these errors largely arose because MUC, DEB and smooth muscle (MUS) were often misclassified as cancer-associated stroma (STR).

We also examined zero-shot performance for distinguishing benign from malignant tissue within each of the 19 organ types in the PanNuke dataset. Compared with the baseline CLIP model (Fig. 2e), PLIP achieved higher weighted F1 scores in 14 organs (Supplementary Table 3). Seven organs, including adrenal gland, esophagus, liver, ovary, stomach, testis and uterus, reached F1 > 0.8, whereas the baseline CLIP scored only F1 = 0.3–0.6. Imbalanced class distributions in some organs, such as kidney and lung, may explain PLIP's comparatively lower performance there.


Fig. 1 | Overview of the study.

a, Flowchart of data acquisition from medical Twitter posts.
b, Overview of the OpenPath dataset.
c, The number of image–text pairs available in the OpenPath dataset.
d, Distribution of the number of words per sentence across the collected tweets.
e, A PLIP model is trained via contrastive learning on paired image–text data from tweets and their top-liked replies under each hashtag.
f, Graphical illustration of the contrastive learning training process.

PLIP improves image representations for training models

To better understand the capability of the PLIP image encoder, we examined the image representations it produces on the four validation datasets (Kather colon, PanNuke, DigestPath and WSSS4LUAD). We first computed image embeddings with the PLIP image encoder and then reduced their dimensionality with uniform manifold approximation and projection (UMAP). Although PLIP was never trained on these datasets, it separated the different tissue subtypes in the Kather colon dataset; in particular, it distinguished normal colon mucosa (NORM) from colorectal adenocarcinoma epithelium (TUM) more clearly than the baseline CLIP model.
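A minimal sketch of this embedding-and-projection step is shown below, assuming a placeholder PLIP image encoder and the umap-learn package; it is illustrative rather than the authors' exact analysis code.

```python
import numpy as np
import umap  # from the umap-learn package, assumed installed

def project_embeddings(images, image_encoder):
    """Embed image patches with a (placeholder) PLIP image encoder and project to 2D."""
    embeddings = np.vstack([image_encoder(img) for img in images])      # (N, 512)
    # L2-normalize before projection so distances reflect cosine geometry
    embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
    reducer = umap.UMAP(n_components=2, metric="cosine", random_state=0)
    return reducer.fit_transform(embeddings)                            # (N, 2) coordinates
```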

In the PanNuke dataset, PLIP produced organ-wise separations that were particularly clear for the breast and colon subsets; the colon subset formed two distinct subclusters, one of which was dominated by malignant tissue images. In both the DigestPath and WSSS4LUAD datasets, PLIP separated normal and benign patches from tumor and malignant patches. In DigestPath, we additionally observed clear separation between different image downsampling rates and staining patterns.

Encouraged by these findings, we reasoned that the PLIP image encoder could serve as a pretrained backbone for a variety of pathology classification tasks. We trained a simple linear classifier on the embeddings of the training splits of the four datasets (Kather colon, PanNuke, DigestPath and WSSS4LUAD) and evaluated performance on their test splits, comparing against embeddings from the original CLIP encoder and from MuDiPath, a multi-task deep neural network pretrained on pathology images.

By incorporating natural language descriptions, PLIP consistently outperformed the comparison models across all four datasets. On the Kather colon dataset (nine-class classification), PLIP achieved F1 = 0.877, a 6.3% improvement over the second-best model, MuDiPath (F1 = 0.825). On the PanNuke dataset (binary classification), PLIP attained F1 = 0.902, the highest among the compared models. On the DigestPath dataset, PLIP improved the F1 score by 3.5% over MuDiPath, reaching F1 = 0.856. An ablation study with different combinations of training data further showed that PLIP performed best when all OpenPath data were included in training.

To gain deeper insight into the advantages of PLIP, we compared it with the end-to-end deep learning model ViT-B/32 by fine-tuning both on the four external validation datasets. ViT-B/32 shares the same architecture as PLIP's image encoder, so this comparison directly measures how much contrastive pretraining on OpenPath improves performance. We evaluated fine-tuning with different fractions of the training data (1%, 5%, 10%, 50% and 100%). Fine-tuned PLIP outperformed the end-to-end supervised model, and the advantage was particularly pronounced when only a small fraction of the training data was available.

These results indicate that PLIP matches or exceeds the performance of conventional deep learning models trained only on categorical labels. This advantage likely stems from the detailed text annotations, which allow the model to exploit higher-level semantic relationships and to capture both visual and sub-visual features of the images.

PLIP enhances pathology image retrieval from text inputs

PLIP can identify and rank the images most relevant to a given text prompt, a task known as text-to-image retrieval (33) (Fig. 4a).

To evaluate this ability, we collected four sets of images with captions: (1) a Twitter validation dataset (Twitter); (2) PathPedia images (PathPedia); (3) PubMed pathology images (34) (PubMed); and (4) pathology book images (34) (Books) (Fig. 4b). The Twitter validation dataset contained 2,023 image–text pairs posted from 16 November 2022 to 15 January 2023 (Fig. 4c and Extended Data Fig. 1b) and was expected to have an image–text distribution similar to that used to train PLIP. In contrast, PathPedia (210 candidate images), PubMed (1,419 image–text pairs) and Books (558 image–text pairs) consisted of relatively concise texts (Fig. 4d).

We evaluated retrieval performance on the Twitter validation dataset using Recall@10 and Recall@50 (Methods). Finding the exact image associated with a given text is challenging because many similar images could match one description. Nonetheless, PLIP greatly improved retrieval performance, with Recall@10 = 0.271 (4.5 times higher than CLIP) and Recall@50 = 0.527 (4.1 times higher than CLIP) (Fig. 4e). Given the large candidate pool (n = 2,023) spanning many tissue types, a 52.7% chance of retrieving the target image within the top 50 demonstrates that the task is challenging yet achievable.
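The Recall@K metric used here can be computed as in the following sketch, assuming precomputed caption and image embeddings in the shared PLIP space, with row i of the text array paired with row i of the image array.

```python
import numpy as np

def recall_at_k(text_embeddings, image_embeddings, k=10):
    """Fraction of captions whose paired image (same row index) ranks in the top-k."""
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    v = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    similarity = t @ v.T                                      # (N, N) cosine similarities
    order = np.argsort(-similarity, axis=1)                   # candidates sorted per caption
    # rank of the true image for each caption (0 = most similar candidate)
    ranks = np.argmax(order == np.arange(len(t))[:, None], axis=1)
    return float(np.mean(ranks < k))
```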

In addition, PLIP substantially outperformed both the baseline CLIP model and random retrieval across the other datasets (Fig. 4e). On PathPedia, PLIP achieved Recall@10 = 0.409 and Recall@50 = 0.752; on PubMed, Recall@10 = 0.069 and Recall@50 = 0.206; and on Books, Recall@10 = 0.265 and Recall@50 = 0.659. On average, these performances are 2–5 times higher than those of the baseline CLIP model. PLIP showed its largest advantage on the Twitter validation dataset (fold changes of 55.3 and 21.4 for Recall@10 and Recall@50, respectively, compared with random retrieval). The improvement was smallest for PathPedia, probably because this curated collection does not cover all of the nuances and variations present in pathology image captions.

We further performed text-to-image retrieval on the top ten hashtags of the Twitter validation dataset, each with more than 100 candidate images (Fig. 4f,g and Extended Data Fig. 9a,b). Measured by Recall@10, gynecological pathology (#Gynpath) benefited most from PLIP (Recall@10 = 0.557) (Fig. 4f); measured by Recall@50, head and neck pathology (#ENTPath) benefited most (Recall@50 = 0.925) (Extended Data Fig. 9a). Spearman correlation analysis showed that PLIP's performance improvement was significantly associated with the number of candidate images: for Recall@10, PLIP versus random retrieval had ρ = 0.88 (P = 8.14 × 10⁻⁴) and PLIP versus CLIP had ρ = 0.64 (P = 4.79 × 10⁻²) (Fig. 4g); for Recall@50, PLIP versus random retrieval had ρ = 0.98 (P = 1.47 × 10⁻⁶) and PLIP versus CLIP had ρ = 0.85 (P = 1.64 × 10⁻³) (Extended Data Fig. 9b).

PLIP enhances pathology image retrieval from image inputs

We next evaluated image-to-image retrieval, in which similarity is measured between image embeddings (Fig. 5a). The evaluation started from the Twitter validation dataset (Fig. 5b), which contains 2,023 image–text pairs from 925 tweets; 525 of these tweets contained multiple images, yielding a search space of 1,623 images. For each target image we ranked all other images and computed Recall@10 and Recall@50, defined as the proportion of images from the same tweet that appear within the top 10 or top 50 retrievals. We compared PLIP with three baselines: CLIP, MuDiPath and SISH (Fig. 5c). Although all models retrieved some relevant images, PLIP achieved the best performance, with Recall@10 = 0.646 (versus 0.353 for CLIP, 0.336 for MuDiPath and 0.356 for SISH) and Recall@50 = 0.814 (versus 0.513 for CLIP, 0.485 for MuDiPath and 0.474 for SISH).

Additional evaluations were conducted on three external validation datasets, each with a distinct focus (Supplementary Fig. 2): (1) tissue types, the Kather colon dataset (nine colon tissue types); (2) organ types, the PanNuke dataset (19 organs); and (3) staining textures, the KIMIA Path24C dataset (24 staining textures) (35). By evaluating class retrieval accuracy at the top K, we measured the purity of the K retrieved images with respect to the class of the query. PLIP consistently outperformed the other models across all three datasets (Fig. 5d–f and Supplementary Table 6). For example, on the Kather colon dataset, PLIP achieved 0.998 at K = 10 (CLIP = 0.984, MuDiPath = 0.994, SISH = 0.993); on the PanNuke dataset, PLIP achieved 0.954 at K = 10 (CLIP = 0.915, MuDiPath = 0.927, SISH = 0.944); and on the KIMIA Path24C dataset, PLIP achieved 0.906 at K = 10 (CLIP = 0.858, MuDiPath = 0.879, SISH = 0.885). These results across tissue types, organ types and staining textures suggest that PLIP is a preferred model for image-to-image retrieval in pathology.

Finally, both text-to-image and image-to-image retrieval can serve as search engines within digital pathology platforms. As demonstrated on our interactive platform (https://tinyurl.com/webplip), the system understands semantic meanings and interrelated concepts, such as "breast tumors surrounded by adipose tissue" (Fig. 5g). This gives users a powerful way to explore large pathology databases and quickly identify images that match specific descriptions. Image-to-image retrieval can likewise be used to find similar cases, for example histological images containing mitotic figures, demonstrating the system's ability to capture key concepts within input images (Fig. 5h).

Fig. 2 | PLIP predicts new classes via zero-shot transfer learning


a, Graphical illustration of zero-shot classification. The classification output is determined by selecting the candidate text with the highest cosine similarity to the input image.
b, The four external validation datasets: the Kather colon dataset with nine tissue types; the PanNuke dataset with benign and malignant tissues; the DigestPath dataset with benign and malignant tissues; and the WSSS4LUAD dataset with tumor and normal tissues.
c, Zero-shot performance measured by weighted F1 scores across the four datasets. Note that performance on the Kather colon dataset is based on nine-class zero-shot evaluation, whereas the other datasets use binary zero-shot evaluation. In each box plot, the centre line indicates the mean and the error bars indicate the 95% confidence interval. Number of test samples per dataset: Kather colon (n = 7,180); PanNuke (n = 1,888); DigestPath (n = 18,814); WSSS4LUAD (n = 3,028).
d, Confusion matrix for the Kather colon dataset, with ground-truth and predicted labels shown in rows and columns, respectively.
e, Zero-shot evaluation of the PanNuke dataset within each organ type.

Fig. 3 | Image embedding analysis and linear probing results


a, Image embeddings generated by the PLIP model for the Kather colon dataset.
b, Image embeddings generated by the PLIP model for the PanNuke dataset.
c, Image embeddings generated by the PLIP model for the DigestPath dataset.
d, Image embeddings generated by the PLIP model for the WSSS4LUAD dataset.
e, Graphical illustration of linear probing transfer learning, in which "frozen" indicates that the loss from the linear classifier is not used to update the parameters of the image encoder.
f, F1 scores (±s.d.) on the test sets, averaged over five repetitions with different random seeds. The "Average" column reports the mean performance across the four datasets. P values were computed with two-sided Student's t-tests and are presented in the bottom two rows.

Fig. 4 | Text-to-image retrieval for pathology images


a, Graphical illustration of pathology image retrieval from text input.
b, Density plot of pathology subspecialty-specific hashtags.
c, Description of the Twitter validation dataset and an example text caption.
d, Descriptions of the PathPedia, PubMed and Books datasets along with example text captions.
e, Image retrieval performance across the validation datasets.
f, Text-to-image retrieval performance at Recall@10 within each pathology subspecialty-specific hashtag.
g, Spearman correlations between the number of candidates and the fold change in Recall@10 for PLIP versus CLIP and PLIP versus random retrieval. Regression estimates are shown with 95% confidence intervals in grey or purple.

Fig. 5 | Image-to-image retrieval for pathology images


a, Graphical illustration of image-to-image retrieval.
b, Graphical illustration of image-to-image retrieval analysis on the Twitter validation dataset.
c, Image-to-image retrieval performance on the Twitter validation dataset. Numbers in the boxes indicate Recall@10 and Recall@50 scores and their fold changes relative to random retrieval.
d, Image-to-image retrieval performance on the Kather colon dataset.
e, Image-to-image retrieval performance on the PanNuke dataset.
f, Image-to-image retrieval performance on the KIMIA Path24C dataset.
g, Examples of text-to-image retrieval.
h, Examples of image-to-image retrieval (featuring a mitotic figure).

Discussion

The rapid advancement of machine learning in computer vision and natural language processing has depended heavily on annotated datasets. Compared with other domains, annotating pathology images is costly and time consuming, requiring specialized expertise that typically takes many years of training to acquire (36). These challenges hinder progress in AI for understanding histopathological features, resolving anatomical structures and disease heterogeneity, and distinguishing disease subtypes for personalized treatment (38).

The surge of data shared online offers a significant yet largely untapped opportunity for medical AI. Twitter, in particular, has evolved into a vibrant hub for pathologists (21). Curating this openly available information forms the basis of the OpenPath image–text dataset, which enables AI systems to learn both global and localized pathological characteristics. In this study, we used OpenPath to train PLIP by fine-tuning a state-of-the-art visual–language pretrained model (25).

Unlike classical machine learning methods in digital pathology that learn from a fixed set of labels, PLIP is a general-purpose solution that can be applied to a wide range of tasks, including adapting to new data and providing zero-shot predictions for any image input. This ability to handle a variable number of classes is particularly valuable when learning objectives change after the model has been trained, and it aligns with the constantly evolving diagnostic criteria in pathology (10). The improved image representations were further demonstrated quantitatively through linear probing and fine-tuning analyses: compared with task-specific deep learning models, the fine-tuned PLIP image encoder performed better on all four validation datasets, with the advantage being particularly notable when a smaller amount of training data was used, highlighting the representation-learning benefits of PLIP.

This study has several limitations. First, data collected from Twitter can be noisy. A strict filtering pipeline substantially improved data quality, which we confirmed through human evaluation, and several image preprocessing and augmentation steps were applied before the image encoder. Second, recognizing and correcting for different magnification levels and staining styles in pathology images remains challenging; by training on diverse data, PLIP nevertheless demonstrated an ability to handle different magnifications and staining protocols. Third, zero-shot classification based on prompts can be unstable (39); we expect that continued prompt optimization could further improve zero-shot performance. Beyond zero-shot learning, we evaluated PLIP via linear probing, fine-tuning, text-to-image retrieval and image-to-image retrieval, and future work will explore more complex diagnostic tasks such as disease subtyping or grading. Fourth, although PLIP performs well across many tasks and datasets, task-specific models can still surpass it in particular domains; for example, a VGG19 model reached 0.943 for predicting the nine tissue types in the Kather colon dataset (26). Finally, our training followed the standard CLIP architecture; although this architecture can be trained efficiently, further algorithmic improvements could boost performance. Because of computational constraints, all input images were resized to 224 × 224 pixels, consistent with the original CLIP, which may discard some visual and sub-visual features of pathology images.

The progress of the PLIP model across diverse learning tasks has been enabled by the curation of OpenPath, the largest publicly available dataset of paired pathology images and text descriptions. We anticipate that open-sourcing PLIP and OpenPath will bring further advances to medical AI, enabling pathology AI to be built on top of this foundation model and facilitating medical knowledge sharing through the PLIP search engine.

Methods

Description of the OpenPath dataset

Release policy.
In accordance with the policies of Twitter and other entities such as LAION, all information in this dataset is provided in association with its original source. Specifically, data collected from Twitter are released as tweet IDs, and data collected from LAION are released as the URLs pointing to the images. Users should refer to the original sources to confirm that their use complies with the relevant policies.


Twitter collection.
All Twitter posts (tweets) with English captions under 32 pathology subspecialty-specific hashtags were included according to the recommendation from the 2016 USCAP meeting and the Pathology Hashtag Ontology projects 24. The complete list of hashtags and descriptions is shown in Extended Data Table 1. Tweets were collected from 21 March 2006 (the date of the first Twitter post) to 15 November 2022. Conversations including replies were collected for each of the tweets from 21 March 2006 to 22 November 2022 (7 days after the last tweet was collected on 15 November 2022). In total, we collected 232,067 tweets and 243,375 image–text pairs. Among those tweets, we further collected 88,250 replies for which (1) the associated tweets had replies, (2) sentences contained at least one keyword from the International Classification of Diseases, 11th Revision codebook (February 2022 version) and (3) received the highest number of likes among all replies.



Exclusion criteria.
Several exclusion criteria were applied to the raw data (Extended Data Fig. 1a): (1) tweets flagged as possibly sensitive by Twitter; (2) duplicate image–text pairs; (3) images that could not be downloaded or opened; (4) non-pathology images identified by the pathology classification model; and (5) texts containing question marks, as questions typically lack image-specific information. After applying the inclusion and exclusion criteria, we obtained 116,504 unique image–text pairs from tweets and 59,869 from the top replies. The Twitter validation dataset was collected between 16 November 2022 and 15 January 2023 using the same criteria (Extended Data Fig. 1b), yielding 2,023 unique image–text pairs.


PathLAION collection.
We constructed the PathLAION collection, which contains pathology image–text pairs from the Internet outside Twitter. PathLAION is a subset of the LAION-5B dataset, which originally contains 5.8 billion image–text pairs collected from the Internet. PathLAION was built by querying LAION-5B with pathology images from the Twitter data and retrieving the most similar images, with similarity computed as the cosine similarity of CLIP image embeddings; sampling stopped when all retrieved images were duplicates. Repeating this procedure with 1,000 different images sampled from the Twitter data yielded the 32,041 unique pathology images used in this study.


Quality control of the training dataset.
To further improve the overall quality of the OpenPath dataset, we applied an additional quality-control step. We built a classifier to identify microscopic pathology images and used it to remove non-pathology samples from the Twitter and LAION data. Because not all downloaded images are microscopic pathology slides, this step was essential to ensure data quality and relevance.


High-quality text descriptions.
To ensure high-quality text descriptions for the associated images, we cleaned the text of tweets and replies as follows: removing "@username" mentions; removing hashtags; removing the HTML tags used for italic and bold formatting (for example, replacing "<i>keyword</i>" or "<b>keyword</b>" with "keyword"); removing emojis; removing newline (\n) and carriage-return (\r) characters; removing extra whitespace; and removing links starting with "http://" or "https://". Tweets containing question marks were also removed, because such tweets typically ask about a pathology image rather than describe it. For PathLAION, language detection with langdetect was used to keep only image–text pairs with English captions. Statistics of text length in the final training dataset are shown in Supplementary Table 1.
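The cleaning steps above can be approximated with a simple regular-expression pipeline such as the sketch below; the emoji ranges and exact patterns are assumptions, not the released preprocessing code.

```python
import re

def clean_caption(text: str) -> str:
    """Illustrative caption-cleaning pipeline for tweets and replies (a sketch)."""
    text = re.sub(r"@\w+", "", text)                      # drop @username mentions
    text = re.sub(r"#\w+", "", text)                      # drop hashtags
    text = re.sub(r"</?(i|b)>", "", text)                 # drop italic/bold HTML tags
    text = re.sub(r"https?://\S+", "", text)              # drop links
    text = re.sub(r"[\n\r]", " ", text)                   # drop newlines / carriage returns
    # approximate emoji removal over common emoji code-point ranges
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)
    text = re.sub(r"\s+", " ", text).strip()              # collapse extra whitespace
    return text

# Captions containing question marks would be discarded upstream, for example:
# captions = [clean_caption(t) for t in raw_captions if "?" not in t]
```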

External validation datasets

To assess the performance of the proposed model, we collected a set of publicly available datasets.

To study zero-shot and linear probing performance, we used four datasets: (1) the Kather colon dataset, with 100,000 training patches and 7,180 validation patches from colorectal cancer tissue; (2) the PanNuke dataset, covering 19 organ types and five nucleus types with 7,558 patches, each containing at least one cell; an image was labelled malignant if it contained at least ten tumor cells and tumor cells accounted for more than 30% of all cells, and benign if it contained no tumor cells, yielding 2,866 malignant and 3,368 benign images; (3) the DigestPath dataset, originally whole-slide images at low magnification, which were cropped into patches at different downsampling rates (2×, 4×, 8×, 16× and 32×) with a final size of 224 × 224 pixels and 10% spatial overlap; patches with more than 50% non-tissue pixels (RGB threshold = 200) were excluded, and a patch was labelled malignant if at least 50% of its area overlapped a malignant annotation, yielding 6,690 malignant and 56,023 benign patches; and (4) the WSSS4LUAD dataset, in which images were dichotomized by the presence of tumor, yielding 6,579 tumor and 3,512 normal images. All patches were resized to 224 × 224 pixels before being passed to the encoder. All datasets except the Kather colon dataset were split into training and validation sets at a 7:3 ratio; to avoid data leakage in DigestPath, the training and validation splits shared no sample identifiers. For consistency of benchmarking, both zero-shot and linear probing analyses were evaluated on the validation splits.

For the text-to-image retrieval analysis, in addition to the Twitter validation dataset, we included image–text pairs from PathPedia, PubMed pathology images and pathology books as external evaluation sets (see Results).

For the image-to-image retrieval analysis, we used the Kather colon dataset (nine tissue types) and the PanNuke dataset (19 organ types). We additionally included the KIMIA Path24C dataset (35), which comprises 24 pathology image staining textures, to assess the model's ability to retrieve images sharing the same texture pattern.

Model training and tuning

All experiments were run in Python v.3.9. Detailed software versions are: pytorch v.1.13; CUDA v.11.7; CUDNN v.8.5.0; scipy v.1.9.3; torchvision v.0.14.0; pillow v.9.1.0; scikit-learn v.1.1.2; scikit-image v.0.19.2; pandas v.1.4.2; numpy v.1.23.5; multiprocess v.0.70.13; langdetect v.1.0.9; and Twitter API v.2.0 with Python v.3.9.


Model training. We built the PLIP model following the architecture described by Radford et al. (25), which combines a vision transformer image encoder (ViT-B/32, accepting images of up to 224 × 224 pixels) with a text transformer encoder (maximum sequence length of 76 tokens) (40). Before being passed to the image encoder, images were downsampled to a maximum dimension of 512 pixels and then randomly cropped to 224 × 224 pixels. The image and text encoders both output 512-dimensional embedding vectors, which were optimized by minimizing a batch-wise contrastive loss (41). Contrastive learning encourages high cosine similarity between paired image and text embeddings and low similarity between unpaired ones, forcing the model to learn correct associations between visual content and its textual description.
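An illustrative torchvision version of this preprocessing is sketched below; the flip augmentation and the normalization constants (CLIP's published statistics) are assumptions rather than the authors' exact pipeline.

```python
from torchvision import transforms

# A sketch of the preprocessing described above: downsample toward a 512-px scale,
# then take a random 224 x 224 crop for ViT-B/32. Augmentation details are assumed.
preprocess = transforms.Compose([
    transforms.Resize(512),                          # rescale the shorter side to 512 px (approximation)
    transforms.RandomCrop(224, pad_if_needed=True),  # random 224 x 224 crop fed to the image encoder
    transforms.RandomHorizontalFlip(),               # assumed light augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),   # CLIP's published stats
                         std=(0.26862954, 0.26130258, 0.27577711)),
])
```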

To determine the optimal hyperparameters, we searched over combinations of training data subsets and learning rates using linear probing. The model was evaluated every quarter of an epoch (1 step = 1/4 epoch) over a total of 12 training epochs. The best performance was obtained with a learning rate of 1 × 10⁻⁵, 10 training steps and all available training data (tweets + replies + PathLAION). A systematic ablation study with different data combinations under the same hyperparameters, reported in Supplementary Table 5, confirmed that including all data sources (tweets + replies + PathLAION) yielded the best-performing PLIP model.

Zero-shot classification. Owing to its understanding of text, PLIP can classify unseen data with new labels, a capability commonly called zero-shot learning (23), which enables new classes to be learned at scale without retraining. In zero-shot learning, the classification output is the candidate text whose representation is most similar to that of the input image. Because images and texts are embedded in a shared vector space, the model can measure the similarity between a target image and each text candidate, and the predicted text is the one with the maximum similarity; no additional training data are needed. In this study we used four benchmark datasets (Kather colon, PanNuke, DigestPath and WSSS4LUAD), retaining their validation splits for zero-shot evaluation. Text candidates were formulated as "an H&E image of {keyword}". For the Kather colon dataset the keywords were (1) adipose tissue, (2) background, (3) debris, (4) lymphocytes, (5) mucus, (6) smooth muscle, (7) normal colon mucosa, (8) cancer-associated stroma and (9) colorectal adenocarcinoma epithelium. For the PanNuke and DigestPath datasets the keywords were (1) benign and (2) malignant, and for the WSSS4LUAD dataset they were (1) normal and (2) tumor. To estimate confidence intervals (CIs), zero-shot predictions were bootstrapped 100 times, each time using 70% of the available data.
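The bootstrapped confidence intervals can be computed as in the following sketch; the 100 iterations and 70% fraction follow the text, while sampling without replacement is an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, n_boot=100, frac=0.7, seed=0):
    """Bootstrap a 95% CI for the weighted F1 score, resampling 70% of the data each time."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        # 70% subsample per iteration (without replacement is an assumption here)
        idx = rng.choice(n, size=int(frac * n), replace=False)
        scores.append(f1_score(y_true[idx], y_pred[idx], average="weighted"))
    lower, upper = np.percentile(scores, [2.5, 97.5])
    return float(np.mean(scores)), (float(lower), float(upper))
```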

Linear probing

Linear probing is a common technique for examining the quality of learned features (42). We computed feature embeddings from the images and trained a linear classifier on these representations for each classification task. All feature embeddings were L2-normalized before being fed to the linear classifier, which was implemented as logistic regression using the stochastic gradient descent-based SGDClassifier from the scikit-learn Python library (44). For comparison, we benchmarked the PLIP backbone against the baseline CLIP backbone and against MuDiPath, a multi-task pretrained deep neural network with a DenseNet121 architecture trained on approximately 900,000 images across 22 classification tasks.
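A minimal linear probing sketch using scikit-learn's SGDClassifier is shown below; the regularization strength and iteration budget are assumptions, and the embedding arrays are assumed to be precomputed.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def linear_probe(train_emb, train_labels, test_emb, test_labels):
    """Train a logistic-regression linear probe on L2-normalized embeddings (a sketch)."""
    # L2-normalize embeddings, as described in the Methods
    train_x = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test_x = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    # SGD-based logistic regression; alpha and max_iter are assumed values
    clf = SGDClassifier(loss="log_loss", alpha=1e-4, max_iter=1000, random_state=0)
    clf.fit(train_x, train_labels)
    return clf.score(test_x, test_labels), clf.predict(test_x)
```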

Comparison to task-specific supervised models

Fine-tuning was conducted to compare the PLIP image encoder with the end-to-end deep learning model ViT-B/32 on the image classification tasks of the four external validation datasets. The ViT-B/32 model was pretrained on the ImageNet dataset (46). For the PLIP image encoder, the last layer was followed by a linear classifier. The batch size was set to 128 for both models, and fine-tuning used the Adam optimizer with decoupled weight decay (AdamW) (47), a weight decay of 0.1 and ten training epochs. A hyperparameter search over learning rates {1 × 10⁻⁶, 1 × 10⁻⁵, 1 × 10⁻⁴, 1 × 10⁻³, 1 × 10⁻²} was conducted on a validation split created from 30% of the original training split, selecting the learning rate with the highest weighted F1 score. Once the optimal learning rate was determined, the models were trained on all available training data and evaluated on the test split to assess generalization to a new pathology image dataset. For the Kather dataset, 10% of the original training data were held out as the test split.
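The fine-tuning setup can be sketched as follows; the encoder object and data loader are placeholders, while the batch size, optimizer, weight decay and epoch count follow the description above.

```python
import torch
from torch import nn
from torch.optim import AdamW

def fine_tune(encoder, num_classes, train_loader, lr=1e-5, epochs=10, device="cuda"):
    """Sketch of end-to-end fine-tuning: encoder + linear head, AdamW with weight decay 0.1.

    `encoder` is assumed to map a batch of images to 512-dimensional features; the learning
    rate would be chosen from the grid above using a held-out validation split.
    """
    head = nn.Linear(512, num_classes)
    model = nn.Sequential(encoder, head).to(device)
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:        # batches of 128, as in the Methods
            images, labels = images.to(device), labels.to(device)
            loss = criterion(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```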

Image retrieval

Similar to zero-shot learning, which identifies the most similar text for a given image from a pool of candidates (33), image retrieval identifies the most similar image from a pool of candidates for a given text or image query. Both rely on cosine similarity between each image–text or image–image pair computed in the shared embedding space.

For text-to-image retrieval, we encoded each natural language query and identified the images most similar to it. Performance was assessed with Recall@10 and Recall@50, which measure how often the target image appears within the top 10 or top 50 retrieved results (48). For example, in a dataset of 100 image–text pairs, each text is used to retrieve its paired image; if 50 of the 100 target images appear within their respective top 10 results, then Recall@10 = 0.50.

Image-to-image retrieval performance was evaluated by the class retrieval accuracy across models, also known as the mean average precision at K (MAP@K). A retrieved image was considered "relevant" if its class matched that of the query image, and precision was computed as the proportion of relevant images among the top K retrievals. The average precision at K (AP@K) is defined as:

\mathsf{AP@K}=\frac{1}{K}\sum_{i=1}^{K}P_{i}\times R_{i}

where P_{i} is the precision at position i and R_{i} is a relevance indicator that equals 1 if the ith retrieved item is relevant and 0 otherwise. The class retrieval accuracy was then computed as:

\mathsf{MAP@K}=\frac{1}{n}\sum_{q=1}^{n}\mathsf{AP@K}_{q}

where n is the total number of query samples. Note that the score can decrease as K increases, because more of the top retrieved images may be irrelevant. In our image-to-image benchmark, cosine similarity between image embeddings was used for the PLIP and CLIP models, whereas MuDiPath and SISH (5) measured similarity with Euclidean and Hamming distances, respectively. Following the SISH pipeline, we built a large-scale vector-quantized variational autoencoder using their pretrained weights.
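The AP@K and MAP@K definitions above translate directly into code, as in this sketch over precomputed embeddings and class labels.

```python
import numpy as np

def average_precision_at_k(query_emb, candidate_emb, query_label, candidate_labels, k=10):
    """AP@K for one query: (1/K) * sum over the top-k of precision_i * relevance_i."""
    candidate_labels = np.asarray(candidate_labels)
    sims = candidate_emb @ query_emb / (
        np.linalg.norm(candidate_emb, axis=1) * np.linalg.norm(query_emb) + 1e-12)
    top_k = np.argsort(-sims)[:k]                                      # indices of top-k retrievals
    relevant = (candidate_labels[top_k] == query_label).astype(float)  # R_i indicator
    precisions = np.cumsum(relevant) / (np.arange(k) + 1)              # P_i at each position
    return float(np.sum(precisions * relevant) / k)

def map_at_k(query_embs, candidate_embs, query_labels, candidate_labels, k=10):
    """MAP@K: average of AP@K over all query images."""
    return float(np.mean([
        average_precision_at_k(q, candidate_embs, ql, candidate_labels, k)
        for q, ql in zip(query_embs, query_labels)
    ]))
```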

Evaluation metrics and statistical analysis

F1 score and MCC

The F1 score was used to evaluate zero-shot and linear probing performance. It ranges from 0 to 1 and is the harmonic mean of precision and recall:

\mathsf{F1}=\frac{2\times \mathrm{precision}\times \mathrm{recall}}{\mathrm{precision}+\mathrm{recall}}=\frac{2\,TP}{2\,TP+FP+FN}

where TP is the number of true positives, FP the number of false positives and FN the number of false negatives; higher values indicate better performance. We report the weighted F1 score, computed as the average of the per-class F1 scores weighted by the number of samples in each class. In addition to F1, we compared PLIP with the baseline CLIP model in the zero-shot setting using the Matthews correlation coefficient (MCC), which ranges from −1 to 1: MCC = 1 indicates perfect prediction, MCC = 0 indicates performance no better than random and MCC = −1 indicates complete disagreement between predictions and ground truth.
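Both metrics are available in scikit-learn, as in this minimal sketch with placeholder labels.

```python
from sklearn.metrics import f1_score, matthews_corrcoef

def evaluate_predictions(y_true, y_pred):
    """Weighted F1 (per-class F1 averaged by class frequency) and MCC, as described above."""
    weighted_f1 = f1_score(y_true, y_pred, average="weighted")
    mcc = matthews_corrcoef(y_true, y_pred)
    return weighted_f1, mcc

# Example with placeholder labels
print(evaluate_predictions([0, 1, 1, 2, 2, 2], [0, 1, 2, 2, 2, 1]))
```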

Image retrieval evaluation

To evaluate performance on the image retrieval task, precision was defined as

\mathsf{Precision}=\frac{TP}{TP+FP}

which quantifies the proportion of retrieved results that contain the target image. In our image retrieval experiments, precision was evaluated within the top 10 and the top 50 retrieved images.

Statistical significance and correlation

A two-sided Student's t-test was used to compare performance between models. Spearman rank correlation with a two-sided P value was used to assess the correlation between the number of candidates and the fold change in image retrieval performance.

Data availability


All data in OpenPath are publicly available from Twitter and LAION-5B (https://laion.ai/blog/laion-5b/). The tweet IDs used for training and validation are available at https://tinyurl.com/openpathdata. The validation datasets are publicly available from their original sources. The ImageNet dataset (https://www.image-net.org/) was used to pretrain the ViT-B/32 model. The trained model, source code and interactive results are available at https://tinyurl.com/webplip.

Code availability

The trained model and source code are available at https://tinyurl.com/webplip.

Additional information

