
A Detailed Guide to BeautifulSoup4 + Scraping PubMed Medical Literature


This article takes a detailed look at how BeautifulSoup4 is used, walking through the extraction of titles, abstracts, and PMIDs from PubMed medical literature.

01

BeautifulSoup4

Installation and Basic Usage

Installation

    pip install beautifulsoup4 -i http://pypi.tuna.tsinghua.edu.cn/simple/ --trusted-host pypi.tuna.tsinghua.edu.cn

Usage

You can parse a document by passing it to the BeautifulSoup constructor; you can pass in either a string or an open file handle.

    # import
    from bs4 import BeautifulSoup

    # instantiate, from a file handle or from a string
    soup = BeautifulSoup(open("index.html"))
    soup = BeautifulSoup("<html>data</html>")

Beautiful Soup automatically picks the most suitable parser for the input it is given, and you can also name a parser explicitly.
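For example (a minimal sketch; lxml is an optional third-party parser that must be installed separately with pip install lxml):

    from bs4 import BeautifulSoup

    markup = "<html><body><p>data</p></body></html>"
    soup = BeautifulSoup(markup)                 # let Beautiful Soup pick a parser
    soup = BeautifulSoup(markup, "html.parser")  # pin the built-in HTML parser
    soup = BeautifulSoup(markup, "lxml")         # requires: pip install lxml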

Source

    def __init__(self, markup="", features=None, builder=None,
                 parse_only=None, from_encoding=None, exclude_encodings=None,
                 element_classes=None, **kwargs):
        """Constructor.

        :param markup: A string or a file-like object representing
           markup to be parsed.

        :param features: Desirable features of the parser to be
           used. This may be the name of a specific parser ("lxml",
           "lxml-xml", "html.parser", or "html5lib") or it may be the
           type of markup to be used ("html", "html5", "xml"). It's
           recommended that you name a specific parser, so that
           Beautiful Soup gives you the same results across platforms
           and virtual environments.

        :param builder: A TreeBuilder subclass to instantiate (or
           instance to use) instead of looking one up based on
           `features`. You only need to use this if you've implemented a
           custom TreeBuilder.

        :param parse_only: A SoupStrainer. Only parts of the document
           matching the SoupStrainer will be considered. This is useful
           when parsing part of a document that would otherwise be too
           large to fit into memory.

        :param from_encoding: A string indicating the encoding of the
           document to be parsed. Pass this in if Beautiful Soup is
           guessing wrongly about the document's encoding.

        :param exclude_encodings: A list of strings indicating
           encodings known to be wrong. Pass this in if you don't know
           the document's encoding but you know Beautiful Soup's guess is
           wrong.

        :param element_classes: A dictionary mapping BeautifulSoup
           classes like Tag and NavigableString, to other classes you'd
           like to be instantiated instead as the parse tree is
           built. This is useful for subclassing Tag or NavigableString
           to modify default behavior.

        :param kwargs: For backwards compatibility purposes, the
           constructor accepts certain keyword arguments used in
           Beautiful Soup 3. None of these arguments do anything in
           Beautiful Soup 4; they will result in a warning and then be
           ignored.

           Apart from this, any keyword arguments passed into the
           BeautifulSoup constructor are propagated to the TreeBuilder
           constructor. This makes it possible to configure a
           TreeBuilder by passing in arguments, not just by saying which
           one to use.
        """

Beautiful Soup transforms a complex HTML document into a tree of Python objects. These objects fall into four types: Tag objects represent markup tags; NavigableString objects hold the navigable string content inside tags; the BeautifulSoup object represents the parsed document as a whole and manages its structure; and Comment objects hold comment text.
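A minimal sketch showing all four types on a hand-written snippet:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<b class='boldest'><!--a comment-->bold text</b>", 'html.parser')
    tag = soup.b
    print(type(soup))             # <class 'bs4.BeautifulSoup'>
    print(type(tag))              # <class 'bs4.element.Tag'>
    print(type(tag.contents[0]))  # <class 'bs4.element.Comment'>
    print(type(tag.contents[1]))  # <class 'bs4.element.NavigableString'>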

A Tag object corresponds to a tag in the original XML or HTML document:

    from bs4 import BeautifulSoup
    import requests

    # Send a request
    response = requests.get('https://www.baidu.com').content
    soup = BeautifulSoup(response, 'html.parser')

    print(soup.prettify())    # .prettify() pretty-prints the HTML structure
    print(soup.title)         # the whole <title> tag -- <title>百度一下,你就知道</title>
    print(soup.title.name)    # the tag's name -- title
    print(soup.title.string)  # the tag's text -- 百度一下,你就知道
    print(soup.find_all('a')) # every <a> tag, returned as a list
    print(soup.find('p'))     # the first <p> tag

    href_list = []
    for item in soup.find_all('a'):
        i = item.get('href')  # read each <a> tag's href attribute
        href_list.append(i)
    print(href_list)

Attributes

A tag can have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". All of a tag's attributes can be read and set dictionary-style.

You can also access the attributes dictionary directly via .attrs:

    print(soup.p.attrs) # {'id': 'lh'}

Tags support all three basic attribute operations: add, delete, and modify. To repeat, they work just like a dictionary:

    soup.p['id'] = 1
    print(soup.p.attrs)  # {'id': 1}
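Adding and deleting work the same dictionary-style way; a short sketch (the attribute name data-new is just an illustration):

    soup.p['data-new'] = 'added'   # add a new attribute
    del soup.p['data-new']         # delete it again
    print(soup.p.get('data-new'))  # None -- .get() avoids a KeyError for missing attributes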

A string usually lives inside a tag. Beautiful Soup wraps these in-tag strings with the NavigableString class.

You can convert a NavigableString directly to a plain Unicode string with str() (unicode() in Python 2).

The string inside a tag can't be edited in place, but it can be swapped out with the replace_with() method (see the sketch below).

A NavigableString supports most of the attributes used for navigating and searching the document tree, but not all of them. In particular, because a string can't contain anything nested (the way a tag can contain a string or another tag), strings don't support the .contents or .string attributes, or the find() method.
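A minimal sketch of converting and replacing a tag's string:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("<b class='boldest'>Extremely bold</b>", 'html.parser')
    tag = soup.b
    print(type(tag.string))  # <class 'bs4.element.NavigableString'>
    text = str(tag.string)   # plain Python string (unicode() in Python 2)
    tag.string.replace_with("No longer bold")
    print(tag)               # <b class="boldest">No longer bold</b>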

BeautifulSoup

Once instantiated, the BeautifulSoup object represents the document as a whole. For most purposes you can treat it like a Tag: it supports the methods for navigating and searching the document tree.

Because the BeautifulSoup object isn't a real HTML or XML tag, it has no name and no attributes. But since it's sometimes handy to check its .name, it has been given the special value "[document]":

    print(soup.name) #[document]

A Comment object is a subclass of NavigableString; it prints comment text with special formatting.
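For example:

    from bs4 import BeautifulSoup

    markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
    soup = BeautifulSoup(markup, 'html.parser')
    comment = soup.b.string
    print(type(comment))  # <class 'bs4.element.Comment'>
    print(comment)        # Hey, buddy. Want to buy a used parser?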

The find() method returns the first matching result. Source:

    def find(self, name=None, attrs={}, recursive=True, string=None,
             **kwargs):
        """Look in the children of this PageElement and find the first
        PageElement that matches the given criteria.

        All find_* methods take a common set of arguments. See the online
        documentation for detailed explanations.

        :param name: A filter on tag name.
        :param attrs: A dictionary of filters on attribute values.
        :param recursive: If this is True, find() will perform a
            recursive search of this PageElement's children. Otherwise,
            only the direct children will be considered.
        :param limit: Stop looking after finding this many results.
        :kwargs: A dictionary of filters on attribute values.
        :return: A PageElement.
        :rtype: bs4.element.Tag | bs4.element.NavigableString
        """

find_all(name, attrs, recursive, string, **kwargs)

find_all() searches through the current tag's descendant nodes and retrieves all of them that match the given filter conditions.
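A short self-contained sketch of the common filter styles:

    from bs4 import BeautifulSoup

    html = '<div><a id="one" class="link">1</a><a id="two" class="link">2</a><p>x</p></div>'
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.find_all('a'))                      # by tag name
    print(soup.find_all(attrs={'class': 'link'}))  # by attribute value
    print(soup.find_all('a', limit=1))             # stop after one result
    print(soup.find('a', id='two'))                # keyword filter; find() returns the first match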

find_parent() and find_parents() find, respectively, the first ancestor node that matches and the set of all matching ancestor nodes.

find_next_siblings() returns all later sibling nodes that match the criteria; find_next_sibling() returns only the first such tag node.

find_previous_siblings() returns all earlier sibling nodes that match; find_previous_sibling() returns only the first one.

find_all_next() returns all matching nodes that appear after the current element in the document; find_next() returns only the first one.

find_all_previous() returns all matching nodes that appear before the current element; find_previous() returns only the first one.
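A small self-contained sketch of these navigation methods:

    from bs4 import BeautifulSoup

    html = '<div><p id="a">1</p><p id="b">2</p><p id="c">3</p></div>'
    soup = BeautifulSoup(html, 'html.parser')
    b = soup.find('p', id='b')
    print(b.find_parent('div'))          # the enclosing <div>
    print(b.find_next_siblings('p'))     # [<p id="c">3</p>]
    print(b.find_previous_sibling('p'))  # <p id="a">1</p>
    print(b.find_all_next('p'))          # [<p id="c">3</p>]
    print(b.find_previous('p'))          # <p id="a">1</p>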

CSS Selectors

Beautiful Soup supports the most widely-used CSS selectors. Pass a string to the .select() method of a Tag or of the BeautifulSoup object itself to locate tags using CSS selector syntax:

    print(soup.select('title'))  # <title> tags
    print(soup.select('body a')) # <a> tags anywhere under <body>
    print(soup.select('p > a'))  # <a> tags that are direct children of a <p>
    print(soup.select('.bri'))   # by class name; .bri matches class="bri"
    print(soup.select('#kw'))    # by id

02

Hands-On: Scraping PubMed Article Titles, Abstracts, and PMIDs

    from bs4 import BeautifulSoup
    import requests
    import time
    import csv

    term = input("Enter search keywords: ")
    term = term.replace(" ", "+")
    # Search URL with the article's preset filters (free full text, last 5 years)
    url = f'https://pubmed.ncbi.nlm.nih.gov/?term={term}&filter=simsearch1.fha&filter=datesearch.y_5'
    response = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    }).content
    time.sleep(3)

    # Parse the results page with BeautifulSoup
    soup = BeautifulSoup(response, 'html.parser')

    # Each result's title is an <a class="docsum-title"> linking to the article's detail page
    results = soup.find_all(name='a', attrs={'class': 'docsum-title'})

    with open('article.csv', 'a', newline='', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['Title', 'PMID', 'Abstract'])
        writer.writeheader()  # write the header once, not once per row
        for item in results:
            # Build the detail-page URL from the relative href
            href = item.get('href')
            url2 = f'https://pubmed.ncbi.nlm.nih.gov{href}'
            response2 = requests.get(url2).content
            soup2 = BeautifulSoup(response2, 'html.parser')

            # Title
            head_tag = soup2.find('h1', attrs={'class': 'heading-title'})
            headline = head_tag.text.strip()

            # PMID
            current_id = soup2.find('strong', attrs={'class': 'current-id'})
            pmid = current_id.text.strip()

            # Abstract
            abstract_tag = soup2.find('div', attrs={'class': 'abstract-content selected'}).find_next('p')
            abstract = abstract_tag.text.strip()

            # Append one row per article; the with-block closes the file automatically
            writer.writerow({'Title': headline, 'PMID': pmid, 'Abstract': abstract})
            time.sleep(5)

Note: this code was written hastily. It hasn't been wrapped into functions or otherwise optimized, and it only fetches a single page of results, but none of that is a serious problem; I'm sure everyone can handle it easily. Feel free to discuss!
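As for the single-page limitation, here is one hedged sketch: PubMed's search UI pages its results through a page query parameter (an observed URL pattern, not a documented API), so the result-processing loop could be wrapped like this, reusing the url variable from the script above:

    # Sketch: iterate over result pages via PubMed's "page" query parameter
    # (observed in the search UI; not a documented API)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"}
    for page in range(1, 6):  # first five result pages
        page_html = requests.get(f'{url}&page={page}', headers=headers).content
        page_soup = BeautifulSoup(page_html, 'html.parser')
        results = page_soup.find_all(name='a', attrs={'class': 'docsum-title'})
        # ...then process each result exactly as in the loop above...
        time.sleep(3)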


Reference: official documentation

Beautiful Soup 4.4.0 Documentation — Beautiful Soup 4.2.0 documentation
