【Python Crawler, Part 2】Looking up book information by scraping Douban Read

I've been learning web scraping for about half a month now. Honestly, all I've really picked up in that time is how to use requests and BeautifulSoup — embarrassingly little. I haven't touched the Scrapy framework yet, but BeautifulSoup alone can already get a lot done. I've also used pandas in a simple way to tidy up the scraped data, though not to the point of real analysis yet.
I didn't realize at first that scraping too fast would get my IP banned, so one day nothing came back from Douban at all. After some searching I learned to add request headers and a delay between requests (honestly, I really didn't want to add the delay — it makes everything painfully slow, and I don't know how to do distributed crawling, so it kills efficiency — but to avoid another ban it stays). I'm still not sure whether my headers actually help. Tutorials also mention cookies, which I haven't figured out yet, and a senior classmate suggested an IP pool, which I also don't know how to set up. So I'm still very much a beginner. Anyway, here's a simple crawler-based book lookup script. It's far from perfect, but I'm putting it out there.
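On the two things mentioned above that I haven't figured out: requests does accept a `proxies` argument, which is how an IP pool would plug in, and a randomized delay looks slightly less bot-like than a fixed `sleep(2)`. A minimal sketch, assuming a made-up placeholder proxy address (not a working proxy):

```python
import random
import time

import requests

# Hypothetical proxy address -- with a real IP pool you would rotate these
PROXIES = {'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'}

def pick_delay(low=1.0, high=3.0):
    """Choose a randomized pause; varying the wait looks less mechanical than a fixed sleep."""
    return random.uniform(low, high)

def polite_get(url, headers=None, use_proxy=False):
    """GET a page, pausing first so the crawler doesn't hammer the server."""
    time.sleep(pick_delay())
    return requests.get(url, headers=headers,
                        proxies=PROXIES if use_proxy else None,
                        timeout=10)
```

This is just the shape of the idea; a real IP pool would also need to detect dead proxies and retry.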

    # Look up a book's Douban rating and related information from a user-supplied title
    import requests
    import time
    import pandas
    from bs4 import BeautifulSoup
    url = 'https://read.douban.com/search?q={}&start={}'
    url2 = 'https://read.douban.com'
    # Request headers so the crawler looks like a regular browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Host': 'read.douban.com',
        'Connection': 'keep-alive',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8',
    }
    
    def get_bookcon(url):
        res = requests.get(url, headers=headers)
        soup = BeautifulSoup(res.text, 'html.parser')
        # Print the book's introduction
        print('Introduction')
        for soup3 in soup.find_all('div', class_='article-profile-intro article-abstract collapse-context'):
            for soup4 in soup3.find_all('p'):
                print(soup4.text)
        print('\n\n')
        # Print the book's popular highlights
        print('Popular highlights')
        for soup3 in soup.find_all('div', class_='popular-annotations'):
            for soup4 in soup3.find_all('li'):
                print('“', soup4.text, '”', '\n')
        print('\n')
    
    def get_message(soup2, bookname):
        result = {}
        # Print the info for every search result that has both a title and an author
        if len(soup2.find_all('div', class_='title')) > 0 and len(soup2.find_all('a', class_='author-item')) > 0:
            title = soup2.find_all('div', class_='title')[0].text

            booklink = url2 + soup2.find_all('div', class_='title')[0].find_all('a')[0]['href']  # link to the book's detail page

            bookauthor = soup2.find_all('a', class_='author-item')[0].text
            if len(soup2.find_all('span', class_='rating-average')) > 0:
                bookscore = float(soup2.find_all('span', class_='rating-average')[0].text)
            else:
                print('This book has no rating yet; defaulting to 0')
                bookscore = 0
            print('Title', '《', title, '》', '\tAuthor:', bookauthor, '\tRating:', bookscore, '\n\n')
            print(booklink)
            get_bookcon(booklink)
            if title == bookname:  # keep only results whose title exactly matches the user's query
                result['书名'] = title
                result['作者'] = bookauthor
                result['评分'] = bookscore
        return result
    
    
    bookname = input('Enter the book title/author to search for: ')
    booklist2 = []
    def search_book(bookname):
        session = requests.Session()
        i = 1
        for num in range(0, 50, 10):  # fetch ratings from the first five result pages by default
            print('Page', i, 'results')
            newurl = url.format(str(bookname), num)
            bookhtml = session.get(newurl, headers=headers)
            soup = BeautifulSoup(bookhtml.text, 'lxml')
            for soup2 in soup.find_all('li', class_='item store-item'):
                booklist2.append(get_message(soup2, bookname))
                time.sleep(2)  # throttle requests to avoid an IP ban
            i = i + 1


    search_book(bookname)
    df2 = pandas.DataFrame(booklist2, columns=['书名', '作者', '评分'])
    df2 = df2.dropna(how='any')  # drop rows with missing values
    print(df2)
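One detail worth noting about the last two lines: a DataFrame built from a list of dicts fills missing keys with NaN (an empty dict, which `get_message` returns for non-matching books, becomes an all-NaN row), and `dropna` returns a new frame rather than modifying in place, so the result must be reassigned. A toy example with made-up values:

```python
import pandas

rows = [
    {'书名': '活着', '作者': '余华', '评分': 9.2},  # illustrative numbers only
    {},  # a non-matching search result -> becomes an all-NaN row
]
df = pandas.DataFrame(rows, columns=['书名', '作者', '评分'])
df = df.dropna(how='any')  # dropna returns a copy; reassign to keep the result
print(len(df))  # → 1
```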

Query results
The matched book information is output as a table.
Testing and improvements from more experienced readers are welcome.
