【Python Web Scraping, Part 2】Looking Up Book Information by Scraping Douban Books
I have been learning web scraping for about half a month now. Honestly, all I have really picked up in that time is how to use requests and BeautifulSoup, which is a little embarrassing; I have not touched the Scrapy framework yet. Still, BeautifulSoup alone can already do quite a lot, and I have used pandas in a simple way to tidy up the scraped data, though I am not yet at the point of doing real analysis.
At first I did not realize that scraping too fast gets your IP banned, so one day Douban simply stopped returning anything. After some searching I learned to add request headers and a delay between requests (honestly, I really did not want to add the delay: it is slow, and since I do not know distributed scraping the whole thing becomes quite inefficient, but to avoid another IP ban I added it anyway). I am not sure whether my request headers actually help. Articles online also mention cookies, which I still have not figured out how to use, and a senior classmate suggested an IP (proxy) pool, which I do not know how to set up either, so I am still very much a beginner. Anyway, I wrote a simple scraper-based book-information lookup. It is far from perfect, but here it is.
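Since cookies and an IP (proxy) pool come up above without any working example, here is a minimal, untested sketch of how both can be handed to requests. Everything in it is a placeholder assumption: the cookie name and value and the proxy addresses are made up for illustration and are not used by the script below.

# Minimal sketch (not part of the script below): passing cookies and a rotating proxy to requests
import random
import requests

cookies = {'bid': 'XXXXXXXXXXX'}   # placeholder: a cookie value copied from a logged-in browser session
proxy_pool = [                     # placeholder addresses standing in for a real proxy pool
    'http://111.111.111.111:8080',
    'http://122.122.122.122:3128',
]

def get_with_proxy(url, headers):
    # requests accepts cookies= (a dict) and proxies= (scheme -> proxy URL) directly
    proxy = random.choice(proxy_pool)
    return requests.get(url, headers=headers,
                        cookies=cookies,
                        proxies={'http': proxy, 'https': proxy},
                        timeout=10)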
# Look up a book's Douban rating and related information from a user-supplied title
import time

import pandas
import requests
from bs4 import BeautifulSoup
url = 'https://read.douban.com/search?q={}&start={}'
url2 = 'https://read.douban.com'

# Request headers: a browser-like User-Agent makes the requests less likely to be blocked
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Host': 'read.douban.com',
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
}
def get_bookcon(url):
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Print the book's introduction
    print('Introduction')
    for soup3 in soup.find_all('div', class_='article-profile-intro article-abstract collapse-context'):
        for soup4 in soup3.find_all('p'):
            print(soup4.text)
    print('\n\n')
    # Print the book's popular highlights
    print('Popular highlights')
    for soup3 in soup.find_all('div', class_='popular-annotations'):
        for soup4 in soup3.find_all('li'):
            print('"', soup4.text, '"', '\n')
    print('\n')
def get_message(soup2, bookname):
    result = {}
    # Print the information of every search result that has both a title and an author
    titles = soup2.find_all('div', class_='title')
    authors = soup2.find_all('a', class_='author-item')
    if titles and authors:
        booktitle = titles[0].text
        booklink = url2 + titles[0].find_all('a')[0]['href']  # link to the book's detail page
        bookauthor = authors[0].text
        ratings = soup2.find_all('span', class_='rating-average')
        if ratings:
            bookscore = float(ratings[0].text)
        else:
            print('This book has no rating yet; defaulting to 0')
            bookscore = 0
        print('Title:', '《' + booktitle + '》', '\tAuthor:', bookauthor, '\tRating:', bookscore, '\n\n')
        print(booklink)
        get_bookcon(booklink)
        # Only keep results whose title exactly matches the user's query
        if booktitle == bookname:
            result['Title'] = booktitle
            result['Author'] = bookauthor
            result['Rating'] = bookscore
    return result
bookname = input('Enter the book title/author to look up: ')
booklist2 = []

def search_book(bookname):
    session = requests.Session()
    i = 1
    for num in range(0, 50, 10):  # only the first five pages of search results are checked
        print('Books on page', i)
        newurl = url.format(str(bookname), num)
        bookhtml = session.get(newurl, headers=headers)
        soup = BeautifulSoup(bookhtml.text, 'lxml')
        for soup2 in soup.find_all('li', class_='item store-item'):
            booklist2.append(get_message(soup2, bookname))
        time.sleep(2)  # wait between pages to avoid getting the IP banned
        i = i + 1

search_book(bookname)
df2 = pandas.DataFrame(booklist2, columns=['Title', 'Author', 'Rating'])
df2 = df2.dropna(how='any')  # drop all-NaN rows (results that did not match the query)
print(df2)
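As a small follow-up that is not part of the original script: once the matching rows are collected, pandas can write them straight to a CSV file for later use. The file name here is just an example.

# Example only: persist the cleaned results ('douban_books.csv' is an arbitrary name)
df2.to_csv('douban_books.csv', index=False, encoding='utf-8-sig')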


Anyone more experienced is welcome to test and improve it.