Understanding the latest medical research with Spark NLP and biomedical knowledge graphs
Use NLP techniques to extract relationships from biomedical articles and construct a biomedical knowledge graph
The biomedical domain is a prime example of a field that studies the relationships between entities such as genes, drugs, and diseases. It is practically impossible for any single doctor to keep up with all the newly published research. For example, while writing this article I checked the PubMed Central database and found roughly 10,000 new articles added on the 10th of March 2022 alone — thousands of new articles every single day. Even a team of doctors would struggle to read through all of them and extract the valuable information.
Luckily, NLP techniques give us a powerful toolbox for keeping up with state-of-the-art biomedical research. In a previous blog post I already walked through constructing a biomedical knowledge graph, but there the focus was mostly on named entity recognition. In this post we will go a step further and dive into biomedical relation extraction.
Let's quickly recap what a relation extraction model is supposed to do. For example, suppose you are analyzing the following sentence:
    Been taking Lipitor for 15 years , have experienced severe fatigue a lot!
The first step is to identify all the biomedical entities in the text. In this example, we can detect the drug Lipitor and the symptom severe fatigue. Relation extraction models are usually highly customized and very domain specific. In practice, we could train a model to detect adverse drug events, i.e. the relationship between a drug and the side effect it causes. And if you are anything like me, you are probably already thinking of a graph as the way to store and represent the relationship between the two entities.
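To make the goal concrete, this is the kind of structured output we want the pipeline to produce for the sentence above (the values are purely illustrative, not actual model output):

    # Purely illustrative: the shape of the result we are after, not real model output.
    extracted = {
        "entities": [
            {"text": "Lipitor", "type": "DRUG"},
            {"text": "severe fatigue", "type": "ADE"},
        ],
        "relations": [
            # An adverse drug event linking the drug to its side effect
            {"start": "Lipitor", "end": "severe fatigue", "type": "ADE", "confidence": 0.95},
        ],
    }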

Graph databases are built to store entities and the relationships between them, which makes them a natural fit for organizing the highly connected data that relation-extraction NLP models produce.
Agenda
In this post, we start by downloading biomedical articles from PubMed. PubMed provides an API for retrieving data as well as an FTP site where daily updates are published. The FTP site does not spell out an explicit usage license, but it does describe the terms and conditions you must follow when using the data:
    NLM freely provides PubMed data. Please note some abstracts may be protected by copyright.
      
    General Terms and Conditions:
    -Users of the data agree to: 
    --acknowledge NLM as the source of the data in a clear and conspicuous manner,
    --properly use registration and/or trademark symbols when referring to NLM products, and
    --not indicate or imply that NLM has endorsed its products/services/applications.
Since the data will only be used for a simple demonstration of an NLP pipeline, we are good to go.
Next, we will run the data through an NLP pipeline to extract relationships between biomedical entities. There are plenty of open-source named entity recognition models out there, but unfortunately I haven't come across any biomedical relation extraction models that don't require manual training. Since the goal of this post is to show how to apply these tools to real-world problems rather than how to build them, we will use the John Snow Labs Healthcare models. John Snow Labs offers free models for recognizing entities and extracting relations from news-like text, but the biomedical models are not open source. Luckily, they offer a free 30-day trial for the healthcare models. To follow along, start the free trial and obtain the license keys.
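For reference, the licensing setup boils down to the snippet below. It is condensed from the full notebook at the end of the article and assumes you are running in Colab with the license JSON downloaded from John Snow Labs.

    # Condensed from the full notebook below: upload the John Snow Labs license
    # JSON, expose it as environment variables, and start a licensed Spark session.
    import json, os
    from google.colab import files

    license_keys = files.upload()
    with open(list(license_keys.keys())[0]) as f:
        license_keys = json.load(f)
    os.environ.update(license_keys)

    # !pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION
    # !pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION --extra-index-url https://pypi.johnsnowlabs.com/$SECRET

    import sparknlp_jsl
    spark = sparknlp_jsl.start(license_keys['SECRET'])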
In the last part of this post, we will store the extracted relations in Neo4j, a native graph database designed to store and analyze highly interconnected data. I will also discuss a few considerations regarding the different graph models we could use to represent the data.
Steps
* Download and parse the daily update of articles from the PubMed FTP site
* Store the articles in Neo4j
* Use John Snow Labs models to extract relations from the text
* Store and analyze the relations in Neo4j
As always, all the code is available at the bottom of the article.
Download the daily update from the PubMed FTP site
As mentioned, PubMed daily updates are available on their FTP site in XML format. Each file has an incremental ID. I first tried to compute the incremental file ID for a specific date programmatically, but it's not straightforward and I didn't want to waste time figuring it out, so you will have to copy the desired file location into the code manually.
    https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
    import urllib.request
    import gzip
    import io
    import xmltodict

    # Get latest pubmed daily update location at
    # https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/

    url = "https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/pubmed22n1211.xml.gz"
    oec = xmltodict.parse(gzip.GzipFile(fileobj=io.BytesIO(urllib.request.urlopen(url).read())))
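If you would rather not copy the file name by hand, a minimal sketch like the one below can pick the newest file from the HTTPS directory listing. This is not part of the original workflow and assumes the index page keeps its current format (an HTML listing of .xml.gz files):

    import re
    import urllib.request

    # Hypothetical helper: scrape the directory index and take the newest update file.
    base = "https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/"
    listing = urllib.request.urlopen(base).read().decode()
    update_files = sorted(set(re.findall(r"pubmed\d+n\d+\.xml\.gz(?!\.md5)", listing)))
    latest_url = base + update_files[-1]
    print(latest_url)  # paste this value into the url variable above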
My gut feeling was that it would be easier to convert the XML to a dictionary and process that. However, if I had to do it again, I would probably use XML search functions instead, because I had to handle several edge cases to extract the required data from the dictionary format correctly.
The code that parses the dictionary is fairly long and, frankly, a bit tedious, so I will skip it here; the complete implementation is available at the end of the article. If you are curious what the search-based alternative might look like, see the sketch below.
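A rough sketch of that alternative, reusing the url variable defined above (this is not the code used in this post, just the approach I would try next time):

    import gzip
    import io
    import urllib.request
    import xml.etree.ElementTree as ET

    # Sketch only: search the XML tree directly instead of converting it to a dict.
    raw = gzip.GzipFile(fileobj=io.BytesIO(urllib.request.urlopen(url).read()))
    tree = ET.parse(raw)
    for article in tree.iterfind('.//PubmedArticle'):
        pmid = article.findtext('.//PMID')
        title = article.findtext('.//ArticleTitle')
        abstract = ' '.join(el.text or '' for el in article.iterfind('.//AbstractText'))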
Store articles in Neo4j
Before moving on to the NLP pipeline, we will store the articles in Neo4j. The graph model for the text looks like this:

The graph model is fairly self-explanatory. Articles sit at the center of the graph. We store their PubMed ID, title, country, and dates as node properties. Of course, we could refactor the country into a separate node, but here I modeled it as a node property. Each article contains one or more sections of text, such as the abstract, methods, or conclusions; the section type is stored as a property on the relationship between the article and the section. We also know who authored a particular paper. PubMed records additionally list the entities mentioned or researched in the paper, which we store as Mesh nodes, since these entities are mapped to the MeSH ontology.
P.S. For most articles, only the abstract is available. You could probably download the full text of most articles through the PubMed API, but we won't do that here.
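If you did want to pull more text per article, NCBI's E-utilities are the usual route. A hedged sketch follows; note that full text is generally only available for the open-access subset via PubMed Central, so this example just fetches the abstract for a hypothetical PMID:

    import urllib.request

    # Hypothetical PMID, for illustration only
    pmid = "35240000"
    efetch = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
              f"?db=pubmed&id={pmid}&rettype=abstract&retmode=text")
    print(urllib.request.urlopen(efetch).read().decode()[:500])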
Before importing the data, we need to set up our Neo4j environment. If you are working in a Colab notebook, I suggest opening a Blank Project in Neo4j Sandbox; Neo4j Sandbox is a free, time-limited cloud instance of Neo4j. Otherwise, if you prefer a local environment, download and install the Neo4j Desktop application and make sure the APOC library is installed.
Once you have set up your Neo4j instance, copy the connection details into the script.
    # Define Neo4j connections
    import pandas as pd
    from neo4j import GraphDatabase
    host = 'bolt://18.214.25.95:7687'
    user = 'neo4j'
    pw= 'lubrication-motel-salutes'
    driver = GraphDatabase.driver(host,auth=(user, pw))
    
    def run_query(query, params={}):
        # Run a Cypher query and return the results as a Pandas DataFrame
        with driver.session() as session:
            result = session.run(query, params)
            return pd.DataFrame([r.values() for r in result], columns=result.keys())
A good habit when working with Neo4j is to define unique constraints and the accompanying indexes in your data model; they significantly improve the performance of both imports and read queries.
    # Define constraints
    run_query("CREATE CONSTRAINT IF NOT EXISTS ON (a:Article) ASSERT a.pmid IS UNIQUE;")
    run_query("CREATE CONSTRAINT IF NOT EXISTS ON (a:Author) ASSERT a.name IS UNIQUE;")
    run_query("CREATE CONSTRAINT IF NOT EXISTS ON (m:Mesh) ASSERT m.id IS UNIQUE;")
Now that everything is set up, we can go ahead and import the articles into Neo4j.
    import_pubmed_query = """
    UNWIND $data AS row
    // Store article
    MERGE (a:Article {pmid: row.pmid})
    SET a.completed_date = date(row.completed_date),
    a.revised_date = date(row.revised_date),
    a.title = row.title,
    a.country = row.country
    // Store sections of articles
    FOREACH (map IN row.text | 
    CREATE (a)-[r:HAS_SECTION]->(text:Section)
    SET text.text = map.text,
        r.type = map.label)
    // Store Mesh headings        
    FOREACH (heading IN row.mesh | 
    MERGE (m:Mesh {id: heading.mesh_id})
    ON CREATE SET m.text = heading.text
    MERGE (a)-[r:MENTIONS_MESH]->(m)
    SET r.isMajor = heading.major_topic)
    // Store authors    
    FOREACH (author IN row.author | 
    MERGE (au:Author {name: author})
    MERGE (a)<-[:AUTHORED]-(au))
    """
    
    # Import pubmed articles into Neo4j
    step = 1000
    for x in range(0, len(params), step):
        chunk = params[x:x+step]
        try:
            run_query(import_pubmed_query, {'data': chunk})
        except Exception as e:
            print(e)
The import is split into batches of 1,000 articles to avoid a single huge transaction and the memory issues that can come with it. The import Cypher statement is a bit long, but nothing too complex. If you want to get more comfortable with Cypher, I recommend the official courses at Neo4j's GraphAcademy.
If you open Neo4j Browser, you should be able to see the articles stored in the graph.

Before moving on to the NLP pipeline, we can quickly inspect the data.
    MATCH (a:Article)
    RETURN count(*) AS count
That is far more articles than a single day's worth of new publications. We can compare the revised versus the completed dates to better understand why there are so many.
    MATCH (a:Article)
    RETURN a.pmid AS article_id,
       a.completed_date AS completed_date,
       a.revised_date AS revised_date
    ORDER BY completed_date ASC
    LIMIT 5
Results

I have no idea why articles older than 20 years are still being revised, but that is the information we get from the XML files. Next, we can inspect which MeSH entities are most frequently researched as major topics in articles completed in 2020 or later.
    MATCH (a:Article)-[rel:MENTIONS_MESH]->(mesh_entity)
    WHERE a.completed_date.year >= 2020 AND rel.isMajor = "Y"
    RETURN mesh_entity.text as entity, count(*) AS count
    ORDER BY count DESC
    LIMIT 5
Results

Interestingly, COVID-19 comes out on top even though we imported only a single daily update. Before relation-extraction NLP models gained popularity, co-occurrence networks were a common way to identify potential links between entities. For example, we can inspect which entities most frequently co-occur with COVID-19.
    MATCH (e1:Mesh)<-[:MENTIONS_MESH]-(a:Article)-[:MENTIONS_MESH]->(e2)
    WHERE e1.text = 'COVID-19'
    RETURN e1.text AS entity1, e2.text AS entity2, count(*) AS count
    ORDER BY count DESC
    LIMIT 5
The co-occurrence results for COVID-19 make sense, although they don't tell us much beyond the fact that it relates to humans and pandemics and is strongly connected to SARS-CoV-2.
Relation extraction NLP pipeline
Simple co-occurrence analysis can be a powerful technique for analyzing relations between entities, but it ignores a lot of the information available in the text. For that reason, researchers have invested considerable effort into training relation extraction models.
Where co-occurrence analysis can only suggest that two entities are somehow connected, a relation extraction model identifies and classifies the link directly from the text.

As mentioned, a relation extraction model aims to identify the type of relationship between entities. To illustrate why the type of relationship matters, consider this simple example:

A drug and a condition may simply co-occur in the text; what we really need to know is whether the drug is meant to treat that condition or whether the condition is a side effect of taking the drug.
Relation extraction models are mostly very domain specific and are trained to detect only certain types of links. For this example, I have included two John Snow Labs models in the NLP pipeline: one detects adverse drug events between drugs and conditions, while the other extracts relations between drugs and proteins.
John Snow Labs' NLP pipelines are built on top of Apache Spark. In short: the input to an NLP pipeline is a Spark DataFrame, and each pipeline step reads its input columns from that DataFrame and writes its results back into it. Here is a simple example:
    documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
    sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")
    
    tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")
This example pipeline has three steps. First, the DocumentAssembler takes the text column of the Spark DataFrame as input and stores the result in the document column. Next, the SentenceDetector reads the document column and splits the text into sentences. Finally, the Tokenizer splits the detected sentences into tokens and writes them to its output column.
You can chain as many steps in a pipeline as you like; the crucial part is that every step has valid input and output columns. While defining a pipeline is relatively simple, our actual NLP pipeline contains quite a few steps, so it is easier to present it as a diagram.

Some of the steps are shared between the ADE (Adverse Drug Event) and ReDL (drugs and proteins) branches. However, since the two models detect relationships between different types of entities, we need two separate NER models to recognize both kinds of entities. The recognized entities are then fed into the relation extraction models. The ADE model outputs only two classes (0, 1), where 1 indicates an adverse drug event, while the ReDL model is trained to detect nine different types of relations between drugs and proteins (ACTIVATOR, INHIBITOR, AGONIST, and so on). The assembled pipeline is shown below.
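The individual stages are defined in the full notebook at the end of the article; assembled, the pipeline looks roughly like this:

    # Abridged from the complete notebook below: shared preprocessing,
    # two NER branches, and the two relation extraction models.
    pipeline = Pipeline(stages=[
        documenter, sentencer, tokenizer,     # shared preprocessing
        pos_tagger, dependency_parser,        # PoS tags and dependency parse
        redl_words_embedder,                  # clinical word embeddings
        redl_drugprot_ner_tagger,             # drug/protein NER
        redl_ner_converter,
        ade_words_embedder,                   # BioBERT embeddings
        ade_ner_tagger,                       # drug/ADE NER
        ade_ner_converter,
        drugprot_re_ner_chunk_filter,         # candidate entity pairs for ReDL
        drugprot_re_Model,                    # drug-protein relations
        ade_re_model])                        # adverse drug event relations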
Lastly, we need to define the graph model that will represent the extracted entities and relations. It mostly comes down to whether you want the extracted relationships to point back to their original text.

If you don't need to trace a relationship back to the original text, the model is relatively simple. However, since we know that NLP extraction is never perfect, we usually want to link the extracted relations to the text they came from, so that any relation can be verified by inspecting the original passage. In the model on the right I deliberately left out the exact relationship types between the entity and relation nodes: we could use a generic relationship type or a more specific extracted type such as CAUSES or INHIBITS. Here I went with the generic option, so the final graph model looks like this:

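The Cypher statement that imports the extracted relations (the full version is in the notebook at the end of the article) merges both entities with APOC, merges an intermediate Relationship node between them, and links all three back to the Section they came from:

    import_rels_query = """
    UNWIND $data AS row
    MATCH (a:Section) WHERE id(a) = toInteger(row.node_id)
    WITH row, a
    CALL apoc.merge.node(['Entity', row.entity_1_type],
                         {name: row.entity_1_label}, {}, {}) YIELD node AS startNode
    CALL apoc.merge.node(['Entity', row.entity_2_type],
                         {name: row.entity_2_label}, {}, {}) YIELD node AS endNode
    MERGE (startNode)-[:RELATIONSHIP]->(rel:Relationship {type: row.rel_type})-[:RELATIONSHIP]->(endNode)
    MERGE (a)-[:MENTIONS]->(startNode)
    MERGE (a)-[:MENTIONS]->(endNode)
    MERGE (a)-[rm:MENTIONS]->(rel)
    SET rm.confidence = row.confidence
    """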
The only thing left is to execute the code and import the extracted biomedical relations into Neo4j.
    from datetime import datetime
    
    # Define NLP input
    nlp_input = run_query("""
    MATCH (t:Section)
    RETURN id(t) AS nodeId, t.text as text
    LIMIT 1000
    """)
    
    # Run through NLP pipeline and store results
    step = 100  #batch size
    for i in range(0, len(nlp_input), step):
      print(f"Start processing row {i} at {datetime.now()}")
      # Create a chunk from the original Pandas Dataframe
      chunk_df = nlp_input[i: i + step]
      # Convert Pandas into Spark Dataframe
      sparkDF=spark.createDataFrame(chunk_df)
      # Run through NLP pipeline
      result = pipeline.fit(sparkDF).transform(sparkDF)
      df = result.toPandas()
      # Extract REL params
      rel_params = extract_rel_params(df)
      # Store to Neo4j
      run_query(import_rels_query, {'data': rel_params})
This code processes only 1,000 sections, but you can increase the limit if you want. Since we didn't define a unique ID for the Section nodes, I fetched the text together with the internal node IDs from Neo4j, which makes importing the relations faster, as matching nodes by long text values is not exactly optimal. Usually, you can get around this by computing and storing a hash of the text, such as sha1. In Google Colab, processing 1,000 sections takes about an hour.
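As a minimal sketch of that hashing idea (not part of the original pipeline), you could store a sha1 of each section's text so that later imports can match sections by hash instead of by internal node id:

    import hashlib

    def text_sha1(text):
        # Stable fingerprint of the section text
        return hashlib.sha1(text.encode('utf-8')).hexdigest()

    hash_params = [{'nodeId': row['nodeId'], 'sha1': text_sha1(row['text'])}
                   for _, row in nlp_input.iterrows()]

    run_query("""
    UNWIND $data AS row
    MATCH (s:Section) WHERE id(s) = row.nodeId
    SET s.sha1 = row.sha1
    """, {'data': hash_params})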
Now we can examine the results. First, we will look at the relationships with the most mentions.
    MATCH (start:Entity)-[:RELATIONSHIP]->(r)-[:RELATIONSHIP]->(end:Entity)
    WITH start, end, r,
     size((r)<-[:MENTIONS]-()) AS totalMentions
    ORDER BY totalMentions DESC
    LIMIT 5
    RETURN start.name AS startNode, r.type AS rel_type, end.name AS endNode, totalMentions
Since I am not a medical doctor, I won't comment on the results, as I have no idea how accurate they are. If we wanted to ask a medical expert whether a specific relation is valid, we could present them with the original text and let them decide.
    MATCH (start:Entity)-[:RELATIONSHIP]->(r)-[:RELATIONSHIP]->(end:Entity)
    WHERE start.name = 'cytokines' AND end.name = 'chemokines'
    MATCH (r)<-[:MENTIONS]-(section)<-[:HAS_SECTION]-(article)
    RETURN section.text AS text, article.pmid AS pmid
    LIMIT 5
It might also be interesting to search for indirect relationships between specific entities.
    MATCH (start:Entity), (end:Entity)
    WHERE start.name = "cytokines" AND end.name = "CD40L"
    MATCH p=allShortestPaths((start)-[:RELATIONSHIP*..5]->(end))
    RETURN p LIMIT 25
Results

Purely by coincidence, all the interactions in the result happen to be of the INDIRECT_UPREGULATOR type. You can, of course, explore other types of relationships as well.
Next steps
There are a couple of ways we could enhance our NLP pipeline. The first that comes to mind is adding entity linking or resolver models. An entity resolver maps an extracted entity to a target knowledge base such as UMLS or Ensembl. By accurately linking entities to a target knowledge base we gain two things: entity disambiguation and the ability to enrich our knowledge graph with external sources.
Entity disambiguation matters because, for example, I have already spotted two entity nodes in our graph that likely refer to the same real-world entity. A quick heuristic for surfacing such candidates is sketched below the figure.

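As a quick illustration, a simple (and admittedly naive) way to surface candidate duplicates in our graph is a case-insensitive name match between Entity nodes; this is just a heuristic, not a proper resolver:

    # Naive duplicate candidates: entity names that differ only in casing.
    # Fine for a small graph; a real resolver maps entities to a knowledge base instead.
    run_query("""
    MATCH (e1:Entity), (e2:Entity)
    WHERE id(e1) < id(e2) AND toLower(e1.name) = toLower(e2.name)
    RETURN e1.name AS entity1, e2.name AS entity2
    LIMIT 10
    """)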
While John Snow Labs offers multiple entity resolution models, it takes some domain knowledge to map entities to a chosen target knowledge base effectively.
I have seen real-world biomedical knowledge graphs use several target knowledge bases, such as UMLS, OMIM, and Entrez, to cover all types of entities.
The second benefit of entity resolvers is that we can enrich our knowledge graph with external biomedical sources. For example, one application would be to import existing knowledge from a knowledge base and then discover new relations between entities through NLP extraction.
Lastly, you could also use graph machine learning libraries such as the Neo4j GDS, PyKEEN, or PyTorch Geometric to predict new relationships; a rough sketch using GDS follows.
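A rough sketch of that last idea, assuming the Graph Data Science library (2.x) is installed on your Neo4j instance: project entities together with the sections that mention them, and use neighborhood similarity as a crude source of candidate links.

    # Project Entity and Section nodes with reversed MENTIONS relationships,
    # so that entities point at the sections that mention them.
    run_query("""
    CALL gds.graph.project('entities', ['Entity', 'Section'],
                           {MENTIONS: {orientation: 'REVERSE'}})
    """)

    # Entities mentioned in similar sets of sections are crude candidates
    # for links the relation extraction models may have missed.
    run_query("""
    CALL gds.nodeSimilarity.stream('entities')
    YIELD node1, node2, similarity
    RETURN gds.util.asNode(node1).name AS entity1,
           gds.util.asNode(node2).name AS entity2,
           similarity
    ORDER BY similarity DESC
    LIMIT 10
    """)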
Let me know if you build any exciting applications by combining NLP pipelines and graph databases, and also if you have suggestions for improving any of the NLP or knowledge graph steps in this post. Thanks for reading!
Project source code
    https://github.com/tomasonjo/blogs/blob/master/pubmed/Pubmed%20NLP.ipynb
    # -*- coding: utf-8 -*-
    """Pubmed NLP.ipynb
    
    
    """
    
    # Import John Snow License keys
    import json
    
    from google.colab import files
    
    license_keys = files.upload()
    
    with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)
    
    # Defining license key-value pairs as local variables
    locals().update(license_keys)
    
    # Adding license key-value pairs to environment variables
    import os
    os.environ.update(license_keys)
    
    # Installing pyspark and spark-nlp
    ! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION
    
    # Installing Spark NLP Healthcare
    ! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET
    
    # Installing Spark NLP Display Library for visualization
    ! pip install -q spark-nlp-display
    # Installing neo4j driver and xml parser
    ! pip install neo4j xmltodict
    
    """# Agenda
    In this post, we will start by downloading biomedical articles from PubMed. The PubMed provides an API to retrieve data as well as a FTP site, where daily updates are available.
    Next, we will run the data through an NLP pipeline to extract relationships between biomedical entities. There are many open-source named entity recognition models out there, but unfortunately, I haven't come across any biomedical relation extraction models that don't require manual training. Since the goal of this post is not to teach you how to train a biomedical relation extraction model but rather how to apply it to solve real-world problems, we will be using the John Snow Labs Healthcare models. [John Snow Labs](https://www.johnsnowlabs.com/) offer free models for recognizing entities and extracting relations from news-like text. However, the biomedical models are not open-source. Luckily for us, they offer a free 30-day trial period for healthcare models. To follow along with the examples in this post, you will need to start the free trial and obtain the license keys.
    In the last part of this post, we will store the extracted relations in Neo4j, a native graph database designed to store and analyze highly interconnected data. I will also explain some considerations regarding the different graph models we can use to represent the data.
    # Steps
    * Download and parse daily update of articles from the PubMed FTP site
    * Store articles in Neo4j
    * Use John Snow Labs models to extract relations from text
    * Store and analyze relations in Neo4j
    
    # Download daily update from the PubMed FTP site
    As mentioned, the PubMed daily updates are available on their FTP site. The data is available in XML format. The files have an incremental ID. I've first tried to calculate the incremental file id for a specific date programmatically. However, it's not straightforward, and I didn't want to waste my time figuring it out, so you will have to copy the desired file location in the code manually.
    """
    
    import urllib.request
    import gzip
    import io
    import xmltodict
    
    # Get latest pubmed daily update location at
    # https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
    
    url = "https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/pubmed22n1211.xml.gz"
    oec = xmltodict.parse(gzip.GzipFile(fileobj=io.BytesIO(urllib.request.urlopen(url).read())))
    
    """My gut instinct was that it would be easier to convert the XML to a dictionary and process that. However, if I had to do it again, I would probably use XML search functions as I had to include several exceptions to extract required data from the dictionary format correctly."""
    
    from datetime import date
    
    # Export pubmed article params
    params = list()
    
    for row in oec['PubmedArticleSet']['PubmedArticle']:

        # Skip articles without abstract or other text
        if not row['MedlineCitation']['Article'].get('Abstract'):
            continue

        # Article id
        pmid = row['MedlineCitation']['PMID']['#text']

        abstract_raw = row['MedlineCitation']['Article']['Abstract']['AbstractText']

        if isinstance(abstract_raw, str):
            text = [{'label': 'SINGLE', 'text': abstract_raw}]
        elif isinstance(abstract_raw, list):
            text = [{'label': el.get('@Label', 'SINGLE'), 'text': el['#text']}
                    for el in abstract_raw if not isinstance(el, str) and el.get('#text')]
        else:
            text = [{'label': abstract_raw.get('@Label', 'SINGLE'),
                     'text': abstract_raw.get('#text')}]

        # Completed date
        if row['MedlineCitation'].get('DateCompleted'):
            completed_year = int(row['MedlineCitation']['DateCompleted']['Year'])
            completed_month = int(row['MedlineCitation']['DateCompleted']['Month'])
            completed_day = int(row['MedlineCitation']['DateCompleted']['Day'])
            completed_date = date(completed_year, completed_month, completed_day)
        else:
            completed_date = None

        # Revised date
        revised_year = int(row['MedlineCitation']['DateRevised']['Year'])
        revised_month = int(row['MedlineCitation']['DateRevised']['Month'])
        revised_day = int(row['MedlineCitation']['DateRevised']['Day'])
        revised_date = date(revised_year, revised_month, revised_day)

        # Title
        title_raw = row['MedlineCitation']['Article']['ArticleTitle']
        if isinstance(title_raw, str):
            title = title_raw
        else:
            title = title_raw['#text'] if title_raw else None

        # Country
        country = row['MedlineCitation']['MedlineJournalInfo']['Country']

        # Mesh headings
        mesh_raw = row['MedlineCitation'].get('MeshHeadingList')
        if mesh_raw:
            if isinstance(mesh_raw['MeshHeading'], list):
                mesh = [{'mesh_id': el['DescriptorName']['@UI'],
                         'text': el['DescriptorName']['#text'],
                         'major_topic': el['DescriptorName']['@MajorTopicYN']}
                        for el in mesh_raw['MeshHeading']]
            else:
                mesh = [{'mesh_id': el['DescriptorName']['@UI'],
                         'text': el['DescriptorName']['#text'],
                         'major_topic': el['DescriptorName']['@MajorTopicYN']}
                        for el in [mesh_raw['MeshHeading']]]
        else:
            mesh = []

        # Authors
        authors_raw = row['MedlineCitation']['Article'].get('AuthorList')
        if not authors_raw:
            authors = []
        elif isinstance(authors_raw['Author'], list):
            authors = [f"{el['ForeName']} {el['LastName']}"
                       for el in authors_raw['Author'] if el.get('ForeName')]
        else:
            authors = [f"{authors_raw['Author']['ForeName']} {authors_raw['Author']['LastName']}"] \
                if authors_raw['Author'].get('ForeName') else None

        params.append({'pmid': pmid, 'text': text, 'completed_date': completed_date,
                       'revised_date': revised_date, 'title': title, 'country': country,
                       'mesh': mesh, 'author': authors})
    
    """# Store articles in Neo4j
    Before moving onto the NLP extraction pipeline, we will store the articles in Neo4j.
    In the center of the graph are the articles. We store their PubMed ids, title, country, and dates as properties. Of course, we could refactor the country as a separate node if we wanted to, but here I modeled them as node properties. Each article contains one or more sections of texts. Several types of sections are available, like the abstract, methods, or conclusions. I've stored the section type as the relationship property between the article and the section. We also know who authored a particular research paper. PubMed articles in particular also contain the entities mentioned or researched in the paper, which we will store as the Mesh node as the entities are mapped to the Mesh ontology.
    P.s. For most articles, only the abstract is available. You could probably download full-text for most articles through the PubMed API. However, we won't do that here.
    Before importing the data, we have to set up our Neo4j environment. If you are using the Colab notebook, I suggest you open a [Blank Project in Neo4j Sandbox](https://sandbox.neo4j.com/?usecase=blank-sandbox). Neo4j Sandbox is a free time-limited cloud instance of Neo4j. Otherwise, if you want a local Neo4j environment, I suggest you download and install the [Neo4j Desktop](https://neo4j.com/download/) application. Make sure you install the APOC library in your local environment. 
    Once you have set up your Neo4j instance, copy the connection details to script.
    """
    
    # Define Neo4j connections
    import pandas as pd
    from neo4j import GraphDatabase
    host = 'bolt://54.89.97.91:7687'
    user = 'neo4j'
    pw = 'witnesses-bells-drunk'
    driver = GraphDatabase.driver(host, auth=(user, pw))

    def run_query(query, params={}):
        # Run a Cypher query and return the results as a Pandas DataFrame
        with driver.session() as session:
            result = session.run(query, params)
            return pd.DataFrame([r.values() for r in result], columns=result.keys())
    
    """A good practice when dealing with Neo4j is to define unique constraints and indexes to optimize the performance of both import and read queries."""
    
    # Define constraints
    run_query("CREATE CONSTRAINT IF NOT EXISTS ON (a:Article) ASSERT a.pmid IS UNIQUE;")
    run_query("CREATE CONSTRAINT IF NOT EXISTS ON (a:Author) ASSERT a.name IS UNIQUE;")
    run_query("CREATE CONSTRAINT IF NOT EXISTS ON (m:Mesh) ASSERT m.id IS UNIQUE;")
    
    """Now that we are all set, we can go ahead and import articles into Neo4j."""
    
    import_pubmed_query = """
    UNWIND $data AS row
    // Store article
    MERGE (a:Article {pmid: row.pmid})
    SET a.completed_date = date(row.completed_date),
    a.revised_date = date(row.revised_date),
    a.title = row.title,
    a.country = row.country
    // Store sections of articles
    FOREACH (map IN row.text | 
    CREATE (a)-[r:HAS_SECTION]->(text:Section)
    SET text.text = map.text,
        r.type = map.label)
    // Store Mesh headings        
    FOREACH (heading IN row.mesh | 
    MERGE (m:Mesh {id: heading.mesh_id})
    ON CREATE SET m.text = heading.text
    MERGE (a)-[r:MENTIONS_MESH]->(m)
    SET r.isMajor = heading.major_topic)
    // Store authors    
    FOREACH (author IN row.author | 
    MERGE (au:Author {name: author})
    MERGE (a)<-[:AUTHORED]-(au))
    """
    
    # Import pubmed articles into Neo4j
    step = 1000
    for x in range(0, len(params), step):
        chunk = params[x:x+step]
        try:
            run_query(import_pubmed_query, {'data': chunk})
        except Exception as e:
            print(e)
    
    """The import is split into batches of 1000 articles to avoid dealing with a single huge transaction and potential memory issues. The import Cypher statement is a bit longer, but nothing too complex. We can quickly inspect the data before moving on to the NLP pipeline."""
    
    run_query("""
    MATCH (a:Article)
    RETURN count(*) AS count
    """)
    
    """We can compare the revised versus the completed date to understand better why there are so many articles."""
    
    run_query("""
    MATCH (a:Article)
    RETURN a.pmid AS article_id, a.completed_date AS completed_date, a.revised_date AS revised_date
    ORDER BY completed_date ASC
    LIMIT 5
    """)
    
    """I have no idea why articles older than 20 years are being revised, but we get that information from the XML files. Next, we can inspect which mesh entities are most frequently researched as major topics in the articles completed in 2020 or later."""
    
    run_query("""
    MATCH (a:Article)-[rel:MENTIONS_MESH]->(mesh_entity)
    WHERE a.completed_date.year >= 2020 AND rel.isMajor = "Y"
    RETURN mesh_entity.text as entity, count(*) AS count
    ORDER BY count DESC
    LIMIT 5
    """)
    
    """Interestingly, COVID-19 comes out on top even though we imported only a single daily update. Before relation extraction NLP models gained popularity, you could use co-occurrence networks to identify potential links between entities. For example, we can inspect which entities most frequently co-occur with COVID-19."""
    
    run_query("""
    MATCH (e1:Mesh)<-[:MENTIONS_MESH]-(a:Article)-[:MENTIONS_MESH]->(e2)
    WHERE e1.text = 'COVID-19'
    RETURN e1.text AS entity1, e2.text AS entity2, count(*) AS count
    ORDER BY count DESC
    LIMIT 5
    """)
    
    """Co-occurrence results for COVID-19 make sense, even though they don't explain much other than it's related to humans and pandemics and has a strong connection to SARS-CoV-2.
    # Relation Extraction NLP pipeline
    Simple co-occurrence analysis can be a powerful technique to analyse relations between entities, but it ignores a lot of information that is available in the text. For that reason, researches have been investing a lot of effort in building in training relation extraction models.
    """
    
    import json
    import os
    from pyspark.ml import Pipeline,PipelineModel
    from pyspark.sql import SparkSession
    
    from sparknlp.annotator import *
    from sparknlp_jsl.annotator import *
    from sparknlp.base import *
    import sparknlp_jsl
    import sparknlp
    import pyspark.sql.functions as F
    
    import warnings
    warnings.filterwarnings('ignore')
    
    params = {"spark.driver.memory":"6G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 
    
    spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)
    
    print ("Spark NLP Version :", sparknlp.version())
    print ("Spark NLP_JSL Version :", sparknlp_jsl.version())
    
    spark
    
    """Relationship extraction models are mostly very domain-specific and trained to detect only specific types of links. For this example, I have decided to include two John Snow Labs models in the NLP pipeline. One model will detect adverse drug effects between drugs and conditions, while the other model is used to extract relations between drugs and proteins."""
    
    documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
    sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")
    
    tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")
    
    # PoS and Dependency parser
    
    pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")
    
    dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")
    
    # NER for ReDL
    
    redl_words_embedder = WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("redl_embeddings")
    
    redl_drugprot_ner_tagger = MedicalNerModel.pretrained("ner_drugprot_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "redl_embeddings")\
    .setOutputCol("redl_ner_tags")
    
    redl_ner_converter = NerConverter()\
    .setInputCols(["sentences", "tokens", "redl_ner_tags"])\
    .setOutputCol("redl_ner_chunks")
    
    # NER for ADE
    
    ade_words_embedder = BertEmbeddings() \
    .pretrained("biobert_pubmed_base_cased", "en") \
    .setInputCols(["sentences", "tokens"]) \
    .setOutputCol("ade_embeddings")
    
    ade_ner_tagger = MedicalNerModel() \
    .pretrained("ner_ade_biobert", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "ade_embeddings"]) \
    .setOutputCol("ade_ner_tags")
    
    ade_ner_converter = NerConverter() \
    .setInputCols(["sentences", "tokens", "ade_ner_tags"]) \
    .setOutputCol("ade_ner_chunks")
    
    # ReDL relaton extraction
    
    # Set a filter on pairs of named entities which will be treated as relation candidates
    drugprot_re_ner_chunk_filter = RENerChunksFilter()\
    .setInputCols(["redl_ner_chunks", "dependencies"])\
    .setOutputCol("redl_re_ner_chunks")\
    .setMaxSyntacticDistance(4)
    
    drugprot_re_Model = RelationExtractionDLModel()\
    .pretrained('redl_drugprot_biobert', "en", "clinical/models")\
    .setPredictionThreshold(0.9)\
    .setInputCols(["redl_re_ner_chunks", "sentences"])\
    .setOutputCol("redl_relations")
    
    # ADE relation extraction
    
    ade_re_model = RelationExtractionModel()\
        .pretrained("re_ade_biobert", "en", 'clinical/models')\
        .setInputCols(["ade_embeddings", "pos_tags", "ade_ner_chunks", "dependencies"])\
        .setOutputCol("ade_relations")\
        .setMaxSyntacticDistance(3)\
        .setPredictionThreshold(0.9)\
        .setRelationPairs(["drug-ade"]) # Possible relation pairs. Default: All Relations.
    
    # Define whole pipeline
    pipeline = Pipeline(
    stages=[documenter, sentencer, tokenizer,
            pos_tagger,
            dependency_parser,
            redl_words_embedder,
            redl_drugprot_ner_tagger,
            redl_ner_converter,
            ade_words_embedder,
            ade_ner_tagger,
            ade_ner_converter,
            drugprot_re_ner_chunk_filter,
            drugprot_re_Model,
            ade_re_model])
    
    """Some of the steps are relevant for both the ADE (Adverse Drug Effect) and REDL (Drugs and Proteins) relations. However, since the models detect relationships between different types of entities, we have to use two NER models to detect both types of entities. Then we can simply feed those entities into relation extraction models. For example, the ADE model will produce only two types of relationships (0,1), where 1 indicates an adverse drug effect. On the other hand, the REDL model is trained to detect nine different types of relations between drugs and proteins (ACTIVATOR, INHIBITOR, AGONIST…)."""
    
    def extract_rel_params(df):
        """
        Extract relationship parameters from the output dataframe for ADE and ReDL relations
        """
        rel_params = list()
        for i, row in df.iterrows():
            node_id = row['nodeId']
            if row['redl_relations']:
                for result in row['redl_relations']:
                    rel_type = result['result'].replace('-', '_')
                    confidence = result['metadata']['confidence']
                    entity_1_type = result['metadata']['entity1']
                    entity_1_label = result['metadata']['chunk1']
                    entity_2_type = result['metadata']['entity2']
                    entity_2_label = result['metadata']['chunk2']

                    rel_params.append({'node_id': node_id, 'rel_type': rel_type, 'confidence': confidence,
                                       'entity_1_type': entity_1_type, 'entity_1_label': entity_1_label,
                                       'entity_2_type': entity_2_type, 'entity_2_label': entity_2_label})
            if row['ade_relations']:
                for result in row['ade_relations']:
                    # Skip when ADE is not found
                    if result['result'] == '0':
                        continue
                    rel_type = 'ADE'
                    confidence = result['metadata']['confidence']
                    entity_1_type = result['metadata']['entity1']
                    entity_1_label = result['metadata']['chunk1']
                    entity_2_type = result['metadata']['entity2']
                    entity_2_label = result['metadata']['chunk2']

                    rel_params.append({'node_id': node_id, 'rel_type': rel_type, 'confidence': confidence,
                                       'entity_1_type': entity_1_type, 'entity_1_label': entity_1_label,
                                       'entity_2_type': entity_2_type, 'entity_2_label': entity_2_label})

        return rel_params
    
    """Lastly, we need to define the graph model to represent extracted entities. Mostly, it depends if you want the extracted relationships to point to their original text or not."""
    
    # Define neo4j import query
    import_rels_query = """
    UNWIND $data AS row
    MATCH (a:Section)
    WHERE id(a) = toInteger(row.node_id)
    WITH row, a 
    CALL apoc.merge.node(
      ['Entity', row.entity_1_type],
      {name: row.entity_1_label},
      {},
      {}
    ) YIELD node AS startNode
    CALL apoc.merge.node(
      ['Entity', row.entity_2_type],
      {name: row.entity_2_label},
      {},
      {}
    ) YIELD node AS endNode
    
    MERGE (startNode)-[:RELATIONSHIP]->(rel:Relationship {type: row.rel_type})-[:RELATIONSHIP]->(endNode)
    
    MERGE (a)-[:MENTIONS]->(startNode)
    MERGE (a)-[:MENTIONS]->(endNode)
    MERGE (a)-[rm:MENTIONS]->(rel)
    SET rm.confidence = row.confidence
    
    """
    
    """The only thing left is to execute the code and import extracted biomedical relations into Neo4j."""
    
    from datetime import datetime
    
    # Define NLP input
    nlp_input = run_query("""
    MATCH (t:Section)
    RETURN id(t) AS nodeId, t.text as text
    LIMIT 1000
    """)
    
    # Run through NLP pipeline and store results
    step = 100  #batch size
    for i in range(0, len(nlp_input), step):
      print(f"Start processing row {i} at {datetime.now()}")
      # Create a chunk from the original Pandas Dataframe
      chunk_df = nlp_input[i: i + step]
      # Convert Pandas into Spark Dataframe
      sparkDF=spark.createDataFrame(chunk_df)
      # Run through NLP pipeline
      result = pipeline.fit(sparkDF).transform(sparkDF)
      df = result.toPandas()
      # Extract REL params
      rel_params = extract_rel_params(df)
      # Store to Neo4j
      run_query(import_rels_query, {'data': rel_params})
    
    """This code processes only 1000 sections, but you can increase the limit if you want. Since we didn't specify any unique id of the Section nodes, I've fetched the text and section internal node ids from Neo4j, which will make the import of relations faster as matching nodes by long text is not the most optimized way. Usually, you can get around this problem by calculating and storing a hash of text like sha1. In Google Colab, it takes about an hour to process 1000 sections.
    
    Now we can examine the results. First, we will look at the relationships with the most mentions.
    """
    
    run_query("""
    MATCH (start:Entity)-[:RELATIONSHIP]->(r)-[:RELATIONSHIP]->(end:Entity)
    WITH start, end, r,
      size((r)<-[:MENTIONS]-()) AS totalMentions
    ORDER BY totalMentions DESC
    LIMIT 5
    RETURN start.name AS startNode, r.type AS rel_type, end.name AS endNode, totalMentions
    """)
    
    """Since I am not a medical doctor, I won't comment the results as I have no idea how accurate they are. If we were to ask a medical doctor if a specific relation is valid, we can present them with the original text and let them decide."""
    
    run_query("""
    MATCH (start:Entity)-[:RELATIONSHIP]->(r)-[:RELATIONSHIP]->(end:Entity)
    WHERE start.name = 'cytokines' AND end.name = 'chemokines'
    MATCH (r)<-[:MENTIONS]-(section)<-[:HAS_SECTION]-(article)
    RETURN section.text AS text, article.pmid AS pmid
    LIMIT 5
    """)
    
    """What might also be interesting is to search for indirect relationships between specific entities."""
    
    run_query("""
    MATCH (start:Entity), (end:Entity)
    WHERE start.name = "cytokines" AND end.name = "CD40L"
    MATCH p=allShortestPaths((start)-[:RELATIONSHIP*..5]->(end))
    RETURN [n in nodes(p) | coalesce(n.name, n.type)] AS result LIMIT 25
    """)
    
    """# Next steps
    There are a couple of options we have to enhance our NLP pipeline. The first that comes to mind is using entity linking or resolver models. Basically the entity resolver maps an entity to a target knowledge base like UMLS or Ensembl. By accurately linking entities to a target knowledge base we achieve two things:
    * Entity disambiguation
    * Ability to enrich our knowledge graph with external sources 
    
    For example, I've found two nodes entities in our graph that might refer to the same real-world entity. While John Snow Labs offers multiple Entity Resolution models, it takes a bit of domain knowledge to map entities to a specified target knowledge base efficiently. I've seen some real-world biomedical knowledge graphs that use multiple target knowledge bases like UMLS, OMIM, Entrez to cover all types of entities.
    The second feature of using entity resolvers is that we can enrich our knowledge graph by using external biomedical sources. For example, one application would be to use a knowledge base to import existing knowledge and then find new relations between entities through NLP extraction.
    Lastly, you could also use various graph machine learning libraries like the Neo4j GDS, PyKEEN, or even PyTorch Geometric to predict new relationships.
    
    Let me know if you find any exciting application using the combination of NLP pipelines and graph databases. Also let me know if you have some suggestions to improve any of the NLP or Knowledge Graph steps in this post. Thanks for reading!
    """
    