A Deep Dive into BeautifulSoup4 + Scraping PubMed Medical Literature
This post works through BeautifulSoup4 in depth and then puts it into practice: a detailed walk-through of extracting the titles, abstracts, and PMIDs of PubMed medical articles.
01
—
BeautifulSoup4
Installation and first steps
Installation
pip install beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple
Usage
You can have Beautiful Soup parse a document by passing it to the BeautifulSoup constructor, either as an open file handle or as a string of markup.
# Import
from bs4 import BeautifulSoup
# Instantiate from a file handle or from a string (naming the parser avoids a warning)
soup = BeautifulSoup(open("index.html"), "html.parser")
soup = BeautifulSoup("<html>data</html>", "html.parser")
Beautiful Soup can pick the most suitable parser for the given input on its own, and you can also name a specific parser explicitly.
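For example, a minimal sketch contrasting automatic parser selection with naming one explicitly (html.parser ships with Python; lxml is a third-party install):

from bs4 import BeautifulSoup

markup = "<html><body><p>data</p></body></html>"

# No parser named: Beautiful Soup picks the best one installed,
# and emits a warning, since the choice can differ across machines
soup_auto = BeautifulSoup(markup)

# Naming the parser explicitly gives reproducible results
soup_std = BeautifulSoup(markup, "html.parser")   # built into Python
# soup_lxml = BeautifulSoup(markup, "lxml")       # needs: pip install lxml
print(soup_std.p.string)  # data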
Source code
def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, exclude_encodings=None,
             element_classes=None, **kwargs):
    """Constructor.

    :param markup: A string or a file-like object representing
        markup to be parsed.

    :param features: Desirable features of the parser to be
        used. This may be the name of a specific parser ("lxml",
        "lxml-xml", "html.parser", or "html5lib") or it may be the
        type of markup to be used ("html", "html5", "xml"). It's
        recommended that you name a specific parser, so that
        Beautiful Soup gives you the same results across platforms
        and virtual environments.

    :param builder: A TreeBuilder subclass to instantiate (or
        instance to use) instead of looking one up based on
        `features`. You only need to use this if you've implemented a
        custom TreeBuilder.

    :param parse_only: A SoupStrainer. Only parts of the document
        matching the SoupStrainer will be considered. This is useful
        when parsing part of a document that would otherwise be too
        large to fit into memory.

    :param from_encoding: A string indicating the encoding of the
        document to be parsed. Pass this in if Beautiful Soup is
        guessing wrongly about the document's encoding.

    :param exclude_encodings: A list of strings indicating
        encodings known to be wrong. Pass this in if you don't know
        the document's encoding but you know Beautiful Soup's guess is
        wrong.

    :param element_classes: A dictionary mapping BeautifulSoup
        classes like Tag and NavigableString, to other classes you'd
        like to be instantiated instead as the parse tree is
        built. This is useful for subclassing Tag or NavigableString
        to modify default behavior.

    :param kwargs: For backwards compatibility purposes, the
        constructor accepts certain keyword arguments used in
        Beautiful Soup 3. None of these arguments do anything in
        Beautiful Soup 4; they will result in a warning and then be
        ignored.

    Apart from this, any keyword arguments passed into the
    BeautifulSoup constructor are propagated to the TreeBuilder
    constructor. This makes it possible to configure a
    TreeBuilder by passing in arguments, not just by saying which
    one to use.
    """

Beautiful Soup transforms a complex HTML document into a tree of Python objects, which fall into four types: Tag objects represent tag markup, NavigableString objects hold the navigable strings inside tags, the BeautifulSoup object represents and manages the parsed document as a whole, and Comment objects hold comment content.
A Tag object corresponds to a tag in the original XML or HTML document:
from bs4 import BeautifulSoup
import requests

# Fetch the page
response = requests.get('https://www.baidu.com').content
soup = BeautifulSoup(response, 'html.parser')
print(soup.prettify())     # .prettify() pretty-prints the HTML structure
print(soup.title)          # the whole <title> tag -- <title>百度一下,你就知道</title>
print(soup.title.name)     # the tag's name -- title
print(soup.title.string)   # the tag's text -- 百度一下,你就知道
print(soup.find_all('a'))  # every <a> tag, returned as a list
print(soup.find('p'))      # the first <p> tag
href_list = []
for item in soup.find_all('a'):
    i = item.get('href')   # read each <a> tag's href attribute
    href_list.append(i)
print(href_list)
Attributes
A tag can have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". You work with a tag's attributes the same way you work with a dictionary: read them and set them.
You can also access the attribute dictionary directly via .attrs:
print(soup.p.attrs) # {'id': 'lh'}
Tag attributes can be added, modified, and deleted. Again, the operations are dictionary-like:
soup.p['id'] = 1
print(soup.p.attrs) # attributes of the first <p> tag: {'id': 1}
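Adding and deleting use the same dict-style syntax; a small sketch reusing the soup from above:

p = soup.p
p['data-note'] = 'demo'   # add a new attribute
print(p.attrs)            # {'id': 1, 'data-note': 'demo'}
del p['id']               # delete an attribute
print(p.attrs)            # {'data-note': 'demo'}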
Strings usually live inside tags. Beautiful Soup wraps these strings in the NavigableString class.
A NavigableString can be converted to a plain Unicode string with str() (the unicode() function mentioned in older docs is Python 2 only).
The string inside a tag cannot be edited in place, but it can be swapped for another string with replace_with().
A NavigableString supports most of the attributes used for navigating and searching the tree, but not all of them. In particular, a string cannot contain anything else (whereas a tag can contain a string or another tag), so strings do not support .contents, .string, or the find() method.
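A short self-contained example of converting and replacing a tag's string:

from bs4 import BeautifulSoup

bold_soup = BeautifulSoup("<b class='boldest'>Extremely bold</b>", "html.parser")
tag = bold_soup.b
print(type(tag.string))   # <class 'bs4.element.NavigableString'>
print(str(tag.string))    # convert to a plain Python string: Extremely bold
tag.string.replace_with("No longer bold")
print(tag)                # <b class="boldest">No longer bold</b>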
BeautifulSoup
Once instantiated, the BeautifulSoup object represents the document as a whole. For most purposes you can treat it like a Tag object: it supports the methods described for navigating and searching the tree.
Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no real name or attributes. But it is sometimes handy to look at its .name, so it has been given the special value "[document]":
print(soup.name) #[document]
A Comment object is a subclass of NavigableString: it displays comment content with special formatting.
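For example:

from bs4 import BeautifulSoup, Comment

comment_soup = BeautifulSoup("<b><!--Hey, buddy. Want to buy a used parser?--></b>",
                             "html.parser")
comment = comment_soup.b.string
print(type(comment))             # <class 'bs4.element.Comment'>
print(comment)                   # Hey, buddy. Want to buy a used parser?
print(comment_soup.b.prettify()) # printed with the <!-- --> markers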
The find() method returns the first matching result:
def find(self, name=None, attrs={}, recursive=True, string=None,
         **kwargs):
    """Look in the children of this PageElement and find the first
    PageElement that matches the given criteria.

    All find_* methods take a common set of arguments. See the online
    documentation for detailed explanations.

    :param name: A filter on tag name.
    :param attrs: A dictionary of filters on attribute values.
    :param recursive: If this is True, find() will perform a
        recursive search of this PageElement's children. Otherwise,
        only the direct children will be considered.
    :param limit: Stop looking after finding this many results.
    :kwargs: A dictionary of filters on attribute values.
    :return: A PageElement.
    :rtype: bs4.element.Tag | bs4.element.NavigableString
    """
find_all(name, attrs, recursive, string, **kwargs)
find_all() scans all of the current tag's descendants and returns every node that matches the given filter conditions.
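The filters can take several forms: a tag name, a regular expression, a list, or keyword arguments matching attribute values. A few quick examples, reusing the soup from the Baidu example above:

import re

soup.find_all('a')                # by tag name
soup.find_all(re.compile('^b'))   # names starting with "b" (b, body, ...)
soup.find_all(['a', 'title'])     # any name in the list
soup.find_all(id='kw')            # by attribute value via keyword argument
soup.find_all('a', limit=2)       # stop after two matches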
find_parent() and find_parents() find, respectively, the first ancestor node matching the criteria and the full set of matching ancestors.
find_next_siblings() returns all later siblings that match; find_next_sibling() returns only the first matching sibling tag after this one.
find_previous_siblings() returns all earlier siblings that match, while find_previous_sibling() returns only the first such match.
find_all_next() returns every matching node that appears after this element in the document, while find_next() returns only the first one.
find_all_previous() returns every matching node that appears before this element, while find_previous() returns only the first one.
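A small self-contained demo of a few of these navigation methods:

from bs4 import BeautifulSoup

html = "<div><p>first</p><p id='mid'>middle</p><p>last</p></div>"
demo = BeautifulSoup(html, "html.parser")
mid = demo.find('p', id='mid')
print(mid.find_parent('div').name)            # div (nearest matching ancestor)
print(mid.find_next_sibling('p').string)      # last
print(mid.find_previous_sibling('p').string)  # first
print(mid.find_all_next('p'))                 # every <p> after this point
print(mid.find_all_previous('p'))             # every <p> before this point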
CSS selectors
BeautifulSoup supports the most commonly used CSS selectors. Pass a selector string to the .select() method of a Tag or of the BeautifulSoup object itself to locate the tags you need:
print(soup.select('title'))   # the <title> tag
print(soup.select('body a'))  # every <a> inside <body>
print(soup.select('p > a'))   # <a> tags that are direct children of a <p>
print(soup.select('.bri'))    # by class name; .bri matches class="bri"
print(soup.select('#kw'))     # by id
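Two more selector features worth knowing, sketched against the same soup: select_one() for a single result, and attribute selectors:

print(soup.select_one('#kw'))           # first match only (or None)
print(soup.select('a[href]'))           # <a> tags that have an href attribute
print(soup.select('a[href^="http"]'))   # href values starting with "http"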
02
—
Hands-on: scraping titles, abstracts, and PMIDs from PubMed
from bs4 import BeautifulSoup
import requests
import time
import csv

term = input("Enter search keywords: ")
term = term.replace(" ", "+")
url = f'https://pubmed.ncbi.nlm.nih.gov/?term={term}&filter=simsearch1.fha&filter=datesearch.y_5'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
}
response = requests.get(url, headers=headers).content
# Parse the search results page with BeautifulSoup
soup = BeautifulSoup(response, 'html.parser')
results = soup.find_all(name='a', attrs={'class': 'docsum-title'})

with open('article.csv', 'a', newline='', encoding='utf-8-sig') as csvfile:
    fieldnames = ['Title', 'PMID', 'Abstract']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()  # write the header row once, not once per article
    for item in results:
        # Build each article's detail-page URL from its relative href
        href = item.get('href')
        url2 = f'https://pubmed.ncbi.nlm.nih.gov{href}'
        response2 = requests.get(url2, headers=headers).content
        soup2 = BeautifulSoup(response2, 'html.parser')
        # Title
        head_tag = soup2.find('h1', attrs={'class': 'heading-title'})
        headline = head_tag.text.strip()
        # PMID
        current_id = soup2.find('strong', attrs={'class': 'current-id'})
        pmid = current_id.text.strip()
        # Abstract (guard against articles that have none)
        abstract_div = soup2.find('div', attrs={'class': 'abstract-content selected'})
        abstract = abstract_div.find_next('p').text.strip() if abstract_div else ''
        # Append the row to the CSV file
        writer.writerow({'Title': headline, 'PMID': pmid, 'Abstract': abstract})
        time.sleep(5)  # pause between detail requests to be polite to the server
Note: this code was thrown together quickly and has not been encapsulated or optimized, and as written it only fetches a single results page. None of that is a serious problem, though, and I'm sure everyone can handle it; a rough multi-page sketch follows below. Haha, feel free to discuss!
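A hypothetical way to cover multiple pages, reusing term and headers from above. It assumes PubMed paginates results via a "page" query parameter, which is an assumption taken from the site's URL structure, so verify it before relying on it:

# Assumed: PubMed accepts a "page" query parameter for result pagination
for page in range(1, 4):  # first three result pages
    page_url = (f'https://pubmed.ncbi.nlm.nih.gov/?term={term}'
                f'&filter=simsearch1.fha&filter=datesearch.y_5&page={page}')
    page_html = requests.get(page_url, headers=headers).content
    page_soup = BeautifulSoup(page_html, 'html.parser')
    page_results = page_soup.find_all('a', attrs={'class': 'docsum-title'})
    # ...then process page_results exactly as in the loop above...
    time.sleep(3)  # pause between page requests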

Reference: the official documentation
Beautiful Soup 4.4.0 documentation
