
A Detailed Guide to BeautifulSoup4 + Scraping PubMed Medical Literature


This article takes a detailed look at how BeautifulSoup4 is used, walking through the extraction of titles, abstracts, and PMIDs from PubMed medical literature.

01

BeautifulSoup4

Installation and Basic Usage

Installation

    pip install beautifulsoup4 -i http://pypi.tuna.tsinghua.edu.cn/simple/ --trusted-host pypi.tuna.tsinghua.edu.cn

Usage

You can parse a document by passing it to the BeautifulSoup constructor; you can pass in either a string or an open file handle.

    # import
    from bs4 import BeautifulSoup

    # instantiate, from a file handle or from a string
    soup = BeautifulSoup(open("index.html"))
    soup = BeautifulSoup("<html>data</html>")

Beautiful Soup automatically picks the most suitable parser for the input it is given, and you can also name a parser explicitly.
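For example (a minimal sketch; lxml is an optional third-party parser that must be installed separately with pip install lxml):

    from bs4 import BeautifulSoup

    markup = "<html><body><p>data</p></body></html>"
    soup = BeautifulSoup(markup)                 # let Beautiful Soup pick a parser
    soup = BeautifulSoup(markup, "html.parser")  # pin the built-in HTML parser
    soup = BeautifulSoup(markup, "lxml")         # requires: pip install lxml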

Source

    def __init__(self, markup="", features=None, builder=None,
                 parse_only=None, from_encoding=None, exclude_encodings=None,
                 element_classes=None, **kwargs):
        """Constructor.

        :param markup: A string or a file-like object representing
           markup to be parsed.

        :param features: Desirable features of the parser to be
           used. This may be the name of a specific parser ("lxml",
           "lxml-xml", "html.parser", or "html5lib") or it may be the
           type of markup to be used ("html", "html5", "xml"). It's
           recommended that you name a specific parser, so that
           Beautiful Soup gives you the same results across platforms
           and virtual environments.

        :param builder: A TreeBuilder subclass to instantiate (or
           instance to use) instead of looking one up based on
           `features`. You only need to use this if you've implemented a
           custom TreeBuilder.

        :param parse_only: A SoupStrainer. Only parts of the document
           matching the SoupStrainer will be considered. This is useful
           when parsing part of a document that would otherwise be too
           large to fit into memory.

        :param from_encoding: A string indicating the encoding of the
           document to be parsed. Pass this in if Beautiful Soup is
           guessing wrongly about the document's encoding.

        :param exclude_encodings: A list of strings indicating
           encodings known to be wrong. Pass this in if you don't know
           the document's encoding but you know Beautiful Soup's guess is
           wrong.

        :param element_classes: A dictionary mapping BeautifulSoup
           classes like Tag and NavigableString, to other classes you'd
           like to be instantiated instead as the parse tree is
           built. This is useful for subclassing Tag or NavigableString
           to modify default behavior.

        :param kwargs: For backwards compatibility purposes, the
           constructor accepts certain keyword arguments used in
           Beautiful Soup 3. None of these arguments do anything in
           Beautiful Soup 4; they will result in a warning and then be
           ignored.

           Apart from this, any keyword arguments passed into the
           BeautifulSoup constructor are propagated to the TreeBuilder
           constructor. This makes it possible to configure a
           TreeBuilder by passing in arguments, not just by saying which
           one to use.
        """

Beautiful Soup transforms a complex HTML document into a tree of Python objects. These objects fall into four types: Tag objects represent markup tags; NavigableString objects hold the navigable string content inside tags; the BeautifulSoup object represents the parsed document as a whole and manages its structure; and Comment objects hold comment text.
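A minimal sketch showing all four types on a hand-written snippet:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<b class='boldest'><!--a comment-->bold text</b>", 'html.parser')
    tag = soup.b
    print(type(soup))             # <class 'bs4.BeautifulSoup'>
    print(type(tag))              # <class 'bs4.element.Tag'>
    print(type(tag.contents[0]))  # <class 'bs4.element.Comment'>
    print(type(tag.contents[1]))  # <class 'bs4.element.NavigableString'>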

A Tag object corresponds to a tag in the original XML or HTML document:

    from bs4 import BeautifulSoup
    import requests

    # Send a request
    response = requests.get('https://www.baidu.com').content
    soup = BeautifulSoup(response, 'html.parser')

    print(soup.prettify())    # .prettify() pretty-prints the HTML structure
    print(soup.title)         # the whole <title> tag -- <title>百度一下,你就知道</title>
    print(soup.title.name)    # the tag's name -- title
    print(soup.title.string)  # the tag's text -- 百度一下,你就知道
    print(soup.find_all('a')) # every <a> tag, returned as a list
    print(soup.find('p'))     # the first <p> tag

    href_list = []
    for item in soup.find_all('a'):
        i = item.get('href')  # read each <a> tag's href attribute
        href_list.append(i)
    print(href_list)

Attributes

A tag can have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". All of a tag's attributes can be read and set dictionary-style.

You can also access the attributes dictionary directly via .attrs:

    print(soup.p.attrs) # {'id': 'lh'}

Tags support all three basic attribute operations: add, delete, and modify. To repeat, they work just like a dictionary:

    soup.p['id'] = 1
    print(soup.p.attrs)  # {'id': 1}
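Adding and deleting work the same dictionary-style way; a short sketch (the attribute name data-new is just an illustration):

    soup.p['data-new'] = 'added'   # add a new attribute
    del soup.p['data-new']         # delete it again
    print(soup.p.get('data-new'))  # None -- .get() avoids a KeyError for missing attributes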

A string usually lives inside a tag. Beautiful Soup wraps these in-tag strings with the NavigableString class.

You can convert a NavigableString directly to a plain Unicode string with str() (unicode() in Python 2).

The string inside a tag can't be edited in place, but it can be swapped out with the replace_with() method (see the sketch below).

A NavigableString supports most of the attributes used for navigating and searching the document tree, but not all of them. In particular, because a string can't contain anything nested (the way a tag can contain a string or another tag), strings don't support the .contents or .string attributes, or the find() method.
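A minimal sketch of converting and replacing a tag's string:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<b class='boldest'>Extremely bold</b>", 'html.parser')
    tag = soup.b
    print(type(tag.string))  # <class 'bs4.element.NavigableString'>
    text = str(tag.string)   # plain Python string (unicode() in Python 2)
    tag.string.replace_with("No longer bold")
    print(tag)               # <b class="boldest">No longer bold</b>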

BeautifulSoup

Once instantiated, the BeautifulSoup object represents the document as a whole. For most purposes you can treat it like a Tag: it supports the methods for navigating and searching the document tree.

Because the BeautifulSoup object isn't a real HTML or XML tag, it has no name and no attributes. But since it's sometimes handy to check its .name, it has been given the special value "[document]":

    print(soup.name) #[document]

A Comment object is a subclass of NavigableString; it prints comment text with special formatting.
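For example:

    from bs4 import BeautifulSoup

    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    soup = BeautifulSoup(markup, 'html.parser')
    comment = soup.b.string
    print(type(comment))  # <class 'bs4.element.Comment'>
    print(comment)        # Hey, buddy. Want to buy a used parser?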

The find() method returns the first matching result. Source:

    def find(self, name=None, attrs={}, recursive=True, string=None,
             **kwargs):
        """Look in the children of this PageElement and find the first
        PageElement that matches the given criteria.

        All find_* methods take a common set of arguments. See the online
        documentation for detailed explanations.

        :param name: A filter on tag name.
        :param attrs: A dictionary of filters on attribute values.
        :param recursive: If this is True, find() will perform a
            recursive search of this PageElement's children. Otherwise,
            only the direct children will be considered.
        :param limit: Stop looking after finding this many results.
        :kwargs: A dictionary of filters on attribute values.
        :return: A PageElement.
        :rtype: bs4.element.Tag | bs4.element.NavigableString
        """

find_all(name, attrs, recursive, string, **kwargs)

find_all() searches through the current tag's descendant nodes and retrieves all of them that match the given filter conditions.
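A short self-contained sketch of the common filter styles:

    from bs4 import BeautifulSoup

    html = '<div><a id="one" class="link">1</a><a id="two" class="link">2</a><p>x</p></div>'
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.find_all('a'))                      # by tag name
    print(soup.find_all(attrs={'class': 'link'}))  # by attribute value
    print(soup.find_all('a', limit=1))             # stop after one result
    print(soup.find('a', id='two'))                # keyword filter; find() returns the first match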

find_parent() and find_parents() find, respectively, the first ancestor node that matches and the set of all matching ancestor nodes.

find_next_siblings() returns all later sibling nodes that match the criteria; find_next_sibling() returns only the first such tag node.

find_previous_siblings() returns all earlier sibling nodes that match; find_previous_sibling() returns only the first one.

find_all_next() returns all matching nodes that appear after the current element in the document; find_next() returns only the first one.

find_all_previous() returns all matching nodes that appear before the current element; find_previous() returns only the first one.
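A small self-contained sketch of these navigation methods:

    from bs4 import BeautifulSoup

    html = '<div><p id="a">1</p><p id="b">2</p><p id="c">3</p></div>'
    soup = BeautifulSoup(html, 'html.parser')
    b = soup.find('p', id='b')
    print(b.find_parent('div'))          # the enclosing <div>
    print(b.find_next_siblings('p'))     # [<p id="c">3</p>]
    print(b.find_previous_sibling('p'))  # <p id="a">1</p>
    print(b.find_all_next('p'))          # [<p id="c">3</p>]
    print(b.find_previous('p'))          # <p id="a">1</p>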

CSS Selectors

Beautiful Soup supports the most widely-used CSS selectors. Pass a string to the .select() method of a Tag or of the BeautifulSoup object itself to locate tags using CSS selector syntax:

    print(soup.select('title'))  # <title> tags
    print(soup.select('body a')) # <a> tags anywhere under <body>
    print(soup.select('p > a'))  # <a> tags that are direct children of a <p>
    print(soup.select('.bri'))   # by class name; .bri matches class="bri"
    print(soup.select('#kw'))    # by id

02

Hands-On: Scraping PubMed Article Titles, Abstracts, and PMIDs

    from bs4 import BeautifulSoup
    import requests
    import time
    import csv

    term = input("Enter search keywords: ")
    term = term.replace(" ", "+")
    # Search URL with the article's preset filters (free full text, last 5 years)
    url = f'https://pubmed.ncbi.nlm.nih.gov/?term={term}&filter=simsearch1.fha&filter=datesearch.y_5'
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    }).content
    time.sleep(3)

    # Parse the results page with BeautifulSoup
    soup = BeautifulSoup(response, 'html.parser')

    # Each result's title is an <a class="docsum-title"> linking to the article's detail page
    results = soup.find_all(name='a', attrs={'class': 'docsum-title'})

    with open('article.csv', 'a', newline='', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['Title', 'PMID', 'Abstract'])
        writer.writeheader()  # write the header once, not once per row
        for item in results:
            # Build the detail-page URL from the relative href
            href = item.get('href')
            url2 = f'https://pubmed.ncbi.nlm.nih.gov{href}'
            response2 = requests.get(url2).content
            soup2 = BeautifulSoup(response2, 'html.parser')

            # Title
            head_tag = soup2.find('h1', attrs={'class': 'heading-title'})
            headline = head_tag.text.strip()

            # PMID
            current_id = soup2.find('strong', attrs={'class': 'current-id'})
            pmid = current_id.text.strip()

            # Abstract
            abstract_tag = soup2.find('div', attrs={'class': 'abstract-content selected'}).find_next('p')
            abstract = abstract_tag.text.strip()

            # Append one row per article; the with-block closes the file automatically
            writer.writerow({'Title': headline, 'PMID': pmid, 'Abstract': abstract})
            time.sleep(5)

Note: this code was written hastily. It hasn't been wrapped into functions or otherwise optimized, and it only fetches a single page of results, but none of that is a serious problem; I'm sure everyone can handle it easily. Feel free to discuss!
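As for the single-page limitation, here is one hedged sketch: PubMed's search UI pages its results through a page query parameter (an observed URL pattern, not a documented API), so the result-processing loop could be wrapped like this, reusing the url variable from the script above:

    # Sketch: iterate over result pages via PubMed's "page" query parameter
    # (observed in the search UI; not a documented API)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"}
    for page in range(1, 6):  # first five result pages
        page_html = requests.get(f'{url}&page={page}', headers=headers).content
        page_soup = BeautifulSoup(page_html, 'html.parser')
        results = page_soup.find_all(name='a', attrs={'class': 'docsum-title'})
        # ...then process each result exactly as in the loop above...
        time.sleep(3)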


Reference: official documentation

Beautiful Soup 4.4.0 Documentation — Beautiful Soup 4.2.0 documentation
