Literature Express: Parkinson's Disease Sharing - Multi-modality Machine Learning Predicting Parkinson's Disease
Title
Multi-modality machine learning predicting Parkinson’s disease
01
Introduction
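Early and accurate diagnosis of progressive neurodegenerative diseases is an essential basis for developing and applying new interventions. The early-detection paradigm aims to identify the disease state before the patient notices symptoms, and to analyze, prevent, or treat it while the disease process is most amenable to intervention.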
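This study describes work that uses cost-effective, data-driven methods to improve accuracy and enable early diagnosis. The report also describes the use of GenoML, an open-source automated machine learning platform, on multimodal genomic and clinical data to facilitate analysis at production scale.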
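The latest strategic vision statement from the National Human Genome Research Institute states that, by around 2030, epigenetic features and transcriptomic data will be systematically integrated into models that predict phenotype from genotype. Biomedical researchers now stand at the convergence of two major advances. The first is the large-scale availability of clinical, demographic, and genetic/genomic data. The second is progress in automating machine learning (ML) pipelines and in artificial intelligence, which makes it possible to extract the most value from those data.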
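Only about 80% of clinical diagnoses of Parkinson's disease (PD) made at the first visit are later confirmed by pathological examination. Earlier studies relied mainly on traditional statistical methods and linear regression models to analyze markers of neurodegenerative disease. More recent machine learning studies have found several effective feature combinations for predicting neurodegenerative disease, including biofluid markers, imaging, gene expression data, and movement-related measures. Although these approaches classify reasonably well, the present work aims to build predictive models from data features that are simpler to obtain.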
Results
Combining multiple modalities outperforms single-modality prediction
Integrating multiple modalities improved model performance when predicting PD diagnosis in a mixed population of cases and controls. Table 1 gives an overview of basic clinical and demographic characteristics, and Figure 1 gives an overview of the analysis. Further details on the cohorts and on interpreting the machine learning metrics and models are in Supplementary Notes 2 and 3. In withheld PPMI samples, the multimodal model reached a higher area under the curve (AUC; 89.72%) than the model using only clinico-demographic data, which can be obtained in advance (87.52%), the genetics-only model built from genome sequencing data and a polygenic risk score (PRS; 70.66%), or the transcriptomics-only model built from genome-wide whole-blood RNA sequencing data (79.73%). Performance improved after tuning (Table 3): the mean AUC in PPMI was 80.75 for the untuned model (standard deviation 8.84; range 69.44–88.51) and 82.17 for the tuned model (standard deviation 8.96; range 70.93–90.17). Similar improvements appear when the model is validated in the PDBP dataset, where the combined-modality model had an AUC of 83.84% before tuning (Table 4 and Figure 3). The multimodal model also had the lowest false positive and false negative rates of all models evaluated, compared with models restricted to a single modality, in both the withheld PPMI test set and the PDBP validation set.
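As a rough illustration of what the AUC figures above measure, the following Python sketch computes an AUC as the fraction of case/control pairs in which the case receives the higher score, and summarizes a set of AUC values by mean, standard deviation, and range. The labels, scores, and function names are invented for illustration and are not taken from the study or from GenoML; the use of the sample standard deviation is also an assumption.

```python
from itertools import product
from statistics import mean, stdev


def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as half a win.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score, control_score in product(cases, controls):
        if case_score > control_score:
            wins += 1.0
        elif case_score == control_score:
            wins += 0.5
    return wins / (len(cases) * len(controls))


def summarize(aucs):
    """Mean, sample standard deviation (an assumption), and range of AUC values."""
    return {"mean": mean(aucs), "sd": stdev(aucs), "range": (min(aucs), max(aucs))}


# Toy data, invented for illustration: 1 = PD case, 0 = control.
labels = [1, 1, 1, 0, 0, 0]
single_modality_scores = [0.9, 0.4, 0.6, 0.5, 0.3, 0.7]
multi_modality_scores = [0.9, 0.6, 0.8, 0.5, 0.3, 0.7]

auc_single = roc_auc(labels, single_modality_scores)
auc_multi = roc_auc(labels, multi_modality_scores)
print(f"single-modality AUC: {auc_single:.3f}")
print(f"multi-modality AUC:  {auc_multi:.3f}")
print(summarize([70.0, 80.0, 90.0]))
```

The pairwise definition is quadratic in the number of samples, which is fine for a small example; library implementations typically use a rank-based calculation instead.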
Fig

Figure 1: Workflow and data summary. Scientific notation in the workflow diagram denotes the minimum p values from reference GWAS or differential expression studies, used as a pre-screen for feature inclusion. Blue indicates subsets of genetic data (also denoted G); green indicates subsets of transcriptomic data (also denoted *omics or O); yellow indicates clinico-demographic data (also denoted C + D); purple indicates combined data modalities. PD, Parkinson's disease; AMP-PD, Accelerating Medicines Partnership for Parkinson's Disease; PPMI, Parkinson's Progression Marker Initiative; PDBP, Parkinson's Disease Biomarker Program; WGS, whole-genome sequencing; GWAS, genome-wide association study; QC, quality control; MAF, minor allele frequency; PRS, polygenic risk score.

Figure 2: Receiver operating characteristic (ROC) curves and case probability density plots in withheld training samples at default thresholds, comparing performance metrics across data modalities in the PPMI dataset. The p values mentioned are the significance thresholds used for each data type; clinico-demographic features are the exception, since all of them are included.
a) PPMI combined *omics dataset (genetics p value threshold = 1E-5; transcriptomics p value threshold = 1E-2; plus clinico-demographic information);
b) PPMI genetics-only dataset (p value threshold = 1E-5);
c) PPMI clinico-demographics-only dataset;
d) PPMI transcriptomics-only dataset (p value threshold = 1E-2).
Note that x-axis limits may vary, because some models produce less extreme probability distributions than others, depending on their fit to the input data and on the algorithm used. More detailed images are in Supplementary Fig. 5. PPMI, Parkinson's Progression Marker Initiative; ROC, receiver operating characteristic curve.
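The per-modality p value thresholds in this caption can be read as a simple pre-screening filter. The sketch below is one hedged way to express that filter in Python: genetic features are kept when their reference p value is below 1E-5, transcriptomic features when it is below 1E-2, and clinico-demographic features are always kept. The feature names, the tuple layout, and the `prescreen` function are hypothetical and are not part of the study's pipeline.

```python
# Per-modality p value cutoffs as stated in the caption; the dictionary itself
# and its keys are illustrative assumptions.
P_THRESHOLDS = {"genetics": 1e-5, "transcriptomics": 1e-2}


def prescreen(features):
    """Return the names of features that pass the pre-screen.

    `features` is a list of (name, modality, reference_p) tuples.
    Clinico-demographic features are kept regardless of p value.
    """
    kept = []
    for name, modality, reference_p in features:
        if modality == "clinico_demographic":
            kept.append(name)
        elif reference_p is not None and reference_p < P_THRESHOLDS[modality]:
            kept.append(name)
    return kept


# Hypothetical candidate features with invented reference p values.
candidates = [
    ("snp_A", "genetics", 3e-8),
    ("snp_B", "genetics", 2e-4),
    ("gene_X", "transcriptomics", 5e-3),
    ("gene_Y", "transcriptomics", 0.2),
    ("age", "clinico_demographic", None),
]

print(prescreen(candidates))
```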

Figure 3: Receiver operating characteristic (ROC) curves and case probability density plots in the external dataset (PDBP) at validation, for the trained and then tuned models at default thresholds. Probabilities are the predicted case status (r1), so controls (status 0) cluster toward the left and PD cases (status 1) cluster toward the right. a) Testing in PDBP of the combined *omics model developed in PPMI, before hyperparameter tuning; b) testing in PDBP of the same model after hyperparameter tuning. PPMI, Parkinson's Progression Marker Initiative; PDBP, Parkinson's Disease Biomarker Program; ROC, receiver operating characteristic curve.
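To make the idea of classifying at a default threshold concrete, the sketch below turns predicted case probabilities into case/control calls and reports the false positive and false negative rates. The value 0.5 for the default threshold is an assumption, since the caption does not state it, and the labels and probabilities are invented.

```python
def error_rates(labels, probabilities, threshold=0.5):
    """False positive rate and false negative rate at a probability threshold.

    A sample is called a case when its predicted probability is >= threshold.
    The 0.5 default is an assumption, not a value taken from the study.
    """
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probabilities):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1 and predicted == 0:
            fn += 1
        elif y == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return fpr, fnr


# Toy data, invented for illustration: controls (0) mostly score low,
# cases (1) mostly score high, as in the density plots described above.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
probabilities = [0.1, 0.2, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9]

fpr, fnr = error_rates(labels, probabilities)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```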

Figure 4: Feature importance plots for the top 5% of features in the data. In the left plot, lower feature values are shown in blue and higher values in red, relative to the baseline risk estimate. The right plot shows directionality: features that predict cases are shown in red, and features that better predict controls are shown in blue. SHAP, Shapley values; UPSIT, University of Pennsylvania Smell Identification Test; PRS, polygenic risk score.
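Shapley values allocate a prediction's departure from the baseline estimate fairly across features. The sketch below computes them exactly for a tiny invented risk function by averaging each feature's marginal contribution over every ordering of the features. The risk function, its numbers, and the feature names are made up for illustration; practical tools approximate this calculation, because exact enumeration grows factorially with the number of features.

```python
from itertools import permutations


def shapley_values(value, features):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    `value` maps a frozenset of known features to a model output.
    """
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        known = set()
        for f in order:
            before = value(frozenset(known))
            known.add(f)
            after = value(frozenset(known))
            totals[f] += after - before
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_risk(known):
    """Invented risk estimate; not the study's model."""
    risk = 0.10  # baseline risk estimate with no features known
    if "smell_test" in known:
        risk += 0.30
    if "prs" in known:
        risk += 0.10
    if "age" in known:
        risk += 0.05
    if "smell_test" in known and "prs" in known:
        risk += 0.04  # interaction term, shared equally by the two features
    return risk


features = ["smell_test", "prs", "age"]
phi = shapley_values(toy_risk, features)
for name in features:
    print(f"{name}: {phi[name]:+.3f}")
```

The values sum to the difference between the full-feature estimate and the baseline, which is the efficiency property that makes the allocation "fair".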
Table

Table 1. Descriptive statistics of studies included from AMP PD.

Table 2. Performance metric summaries comparing training in withheld samples in PPMI.

Table 3. Performance metric summaries comparing tuned cross-validation in withheld samples in PPMI.

Table 4. Performance metric summaries comparing combined tuned and untuned model performance on the PDBP validation dataset.

Table 5. Optimizing the AUC threshold in withheld training samples and in the validation data.
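The caption does not say which criterion was used to optimize the threshold. As one possible reading, the sketch below scans candidate probability thresholds and picks the one that maximizes Youden's J (sensitivity + specificity - 1). The choice of Youden's J, the function name, and the toy data are all assumptions made for illustration.

```python
def best_threshold(labels, probabilities):
    """Return (threshold, J) maximizing Youden's J = sensitivity + specificity - 1.

    Candidate thresholds are the observed probabilities; a sample is called a
    case when its probability is >= threshold. Ties keep the lowest threshold.
    """
    n_cases = sum(1 for y in labels if y == 1)
    n_controls = sum(1 for y in labels if y == 0)
    if n_cases == 0 or n_controls == 0:
        raise ValueError("need at least one case and one control")
    best_t, best_j = None, float("-inf")
    for t in sorted(set(probabilities)):
        tp = sum(1 for y, p in zip(labels, probabilities) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(labels, probabilities) if y == 0 and p < t)
        j = tp / n_cases + tn / n_controls - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Toy data, invented for illustration: 1 = PD case, 0 = control.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
probabilities = [0.1, 0.2, 0.35, 0.45, 0.4, 0.6, 0.7, 0.9]

threshold, j = best_threshold(labels, probabilities)
print(f"chosen threshold: {threshold}, Youden's J: {j:.2f}")
```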
