[Crawler] Scraping the Bilibili Blackroom


Scraping information from the Bilibili blackroom (ban records)

Since Bilibili has updated its anti-crawler measures, the site can now be scraped by simulating browser actions. The following Python modules need to be installed:

    pip3 install selenium 
    pip3 install bs4
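
Note that Selenium's Firefox driver also needs geckodriver. Recent Selenium releases (4.6+) can fetch a matching driver automatically via Selenium Manager; on older versions it has to be on your PATH. A minimal smoke test (just a sketch to confirm headless Firefox launches):

    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument('--headless')
    # This line fails if no usable geckodriver can be found
    browser = webdriver.Firefox(options=options)
    browser.get('https://www.bilibili.com')
    print(browser.title)
    browser.quit()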

Selenium is used to simulate browser actions and repeatedly scroll down the blackroom page; the number of scrolls is configurable (note that the script must sleep for a moment after each scroll, otherwise the page will not finish loading the new content). Once enough of the page has been fetched, the data is cleaned.
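
If you would rather not hard-code the scroll count, a common alternative is to keep scrolling until document.body.scrollHeight stops growing. A sketch (assuming browser is an already-initialized WebDriver like the one in the crawler below):

    import time

    last_height = browser.execute_script('return document.body.scrollHeight')
    while True:
        browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
        time.sleep(0.8)  # give lazy-loaded content time to arrive
        new_height = browser.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # nothing new was loaded, so we have reached the bottom
        last_height = new_height

The complete crawler, using a fixed scroll count, is below: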

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup
    import time
    import json
    import re


    class BSpider():

        def __init__(self):
            # Run Firefox without a visible window
            options = webdriver.FirefoxOptions()
            options.add_argument('--headless')
            self.browser = webdriver.Firefox(options=options)
            self.blackroom_page = 'https://www.bilibili.com/blackroom/ban'
            self.count = 0

        # Load the page and scroll to force lazy loading
        def get_page(self):
            self.browser.get(self.blackroom_page)
            # Open the filter and select danmaku (bullet-comment) bans only
            # (find_element_by_xpath was removed in Selenium 4, hence By.XPATH)
            self.browser.find_element(By.XPATH, '//*[@id="app"]/div/div/div/div[2]/div[1]/div[2]/div[1]/i').click()
            time.sleep(0.5)
            self.browser.find_element(By.XPATH, '//*[@id="app"]/div/div/div/div[2]/div[1]/div[2]/div[2]/p[3]').click()
            time.sleep(0.5)
            # Scroll down 300 times, sleeping after each scroll
            index, max_count = 0, 300
            while index < max_count:
                print("scroll down: %d ..." % index)
                self.browser.execute_script(
                    'window.scrollTo(0,document.body.scrollHeight)'
                )
                time.sleep(0.8)
                index += 1

        # Keep only the Chinese characters in a string
        def find_chinese(self, article):
            pattern = re.compile(r'[^\u4e00-\u9fa5]')
            chinese = re.sub(pattern, '', article)
            return chinese

        # Remove the masking asterisks (*)
        def delete_star(self, article):
            pattern = re.compile(r'[*]')
            no_star = re.sub(pattern, '', article)
            return no_star

        # Parse the page and clean the data; only the ban duration
        # (permanent / 15 days / 7 days / ...) and the danmaku text are kept
        def parse_page(self):
            html = BeautifulSoup(self.browser.page_source, 'html.parser')

            output_data = []
            for dl in html.find_all('dl'):
                sub_output_data = {}
                black_cube = dl.parent
                try:
                    temp_type = black_cube.find(class_='jc').get_text()
                    first_p_text = self.delete_star(dl.dt.p.text)
                except Exception as e:
                    print(e)
                    continue  # skip entries missing either field

                sub_output_data["type"] = temp_type
                sub_output_data['article'] = first_p_text

                if first_p_text != '':
                    output_data.append(sub_output_data)

            # Write the cleaned records to disk
            print('dump to json file ...')
            with open(r'2020\ML\ML_action\3.NaiveBayes\data\blackroom.json', 'w', encoding='utf-8') as f:
                json.dump(output_data, f, ensure_ascii=False, sort_keys=False, indent=4)
            print('dump file done.')


    b = BSpider()
    print("init....")
    b.get_page()
    b.parse_page()
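
The two regex helpers can be tried standalone; they use exactly the patterns from BSpider:

    import re

    # [^\u4e00-\u9fa5] matches every character outside the common Chinese range,
    # so substituting it with '' keeps only the Chinese characters
    print(re.sub(r'[^\u4e00-\u9fa5]', '', '测试abc123弹幕!'))  # -> 测试弹幕
    # delete_star just strips the masking asterisks
    print(re.sub(r'[*]', '', '敏*感*词'))  # -> 敏感词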

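After a run, the dumped JSON can be loaded back for a quick sanity check (a sketch that reuses the output path from the crawler):

    import json

    with open(r'2020\ML\ML_action\3.NaiveBayes\data\blackroom.json', encoding='utf-8') as f:
        records = json.load(f)

    print(len(records), 'records')
    for item in records[:5]:
        print(item['type'], '|', item['article'])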