Scraping NetEase News with Scrapy
1. Create a Scrapy project
# Run the following in cmd, one command at a time
scrapy startproject news
cd news
scrapy genspider -t crawl news163 news.163.com
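After these commands the generated layout should look roughly like this (standard Scrapy scaffolding; details may vary by version):

news/
    scrapy.cfg            # deployment config
    news/
        items.py          # item definitions (step 2)
        pipelines.py      # item pipelines (step 4)
        settings.py       # project settings (step 5)
        spiders/
            news163.py    # generated CrawlSpider skeleton (step 3)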

2. Define the fields to scrape in items.py
import scrapy

class NewsItem(scrapy.Item):
    news_thread = scrapy.Field()   # article id taken from the URL
    news_title = scrapy.Field()
    news_time = scrapy.Field()
    news_source = scrapy.Field()
    source_url = scrapy.Field()
    news_text = scrapy.Field()
    news_url = scrapy.Field()
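A NewsItem behaves like a dict but only accepts the declared fields; a quick illustrative check (not part of the project files):

from news.items import NewsItem

item = NewsItem()
item['news_title'] = 'demo'
print(dict(item))        # {'news_title': 'demo'}
# item['foo'] = 1        # would raise KeyError: NewsItem does not support field: foo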

3. Write the spider in spiders/news163.py
# Import the required libraries
import scrapy
from news.items import NewsItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Work out the regular expression for article links by comparing two article URLs:
#   https://news.163.com/20/0427/20/FB8E63MK00018AOR.html
#   https://news.163.com/20/0428/07/FB9K5VRS0001899O.html
# which gives the pattern (here pinned to articles from 2020-04-28):
#   https://news.163.com/20/0428/\d+/.*?html
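The pattern can be sanity-checked before wiring it into the crawl rule; an illustrative aside, run separately (not part of news163.py):

import re

pattern = r'https://news.163.com/20/0428/\d+/.*?html'
assert re.match(pattern, 'https://news.163.com/20/0428/07/FB9K5VRS0001899O.html')
# a URL from a different date does not match the pinned pattern:
assert not re.match(pattern, 'https://news.163.com/20/0427/20/FB8E63MK00018AOR.html')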
# Write News163Spider(CrawlSpider)
# Most of the skeleton is generated by genspider; only adjust it for the target site
class News163Spider(CrawlSpider):
    name = 'news163'
    allowed_domains = ['news.163.com']
    start_urls = ['http://news.163.com/']

    rules = (
        Rule(LinkExtractor(allow=r'https://news.163.com/20/0428/\d+/.*?html'),
             callback='parse_item', follow=True),
    )

    # Populate the item
    def parse_item(self, response):
        item = NewsItem()
        # the thread id is the last URL segment minus the '.html' suffix
        item['news_thread'] = response.url.strip().split('/')[-1][:-5]
        return item
    # Example: extracting the title. Inside parse_item, call:
    #     self.get_title(response, item)
    def get_title(self, response, item):
        title = response.css('title::text').extract()  # CSS selector written for this site
        if title:
            print('title:{}'.format(title[0]))
            item['news_title'] = title[0]
# Extracting the other fields works the same way; adjust the selectors to the
# actual pages. The complete callback methods:
    def parse_item(self, response):
        item = NewsItem()
        item['news_thread'] = response.url.strip().split('/')[-1][:-5]
        self.get_title(response, item)
        self.get_time(response, item)
        self.get_source(response, item)
        self.get_source_url(response, item)
        self.get_text(response, item)
        self.get_url(response, item)
        return item

    def get_url(self, response, item):
        url = response.url
        if url:
            item['news_url'] = url

    def get_text(self, response, item):
        text = response.css('.post_text p::text').extract()
        if text:
            print('text:{}'.format(text))
            item['news_text'] = text

    def get_source_url(self, response, item):
        source_url = response.css('#ne_article_source::attr(href)').extract()
        if source_url:
            print('source_url:{}'.format(source_url[0]))
            item['source_url'] = source_url[0]

    def get_source(self, response, item):
        source = response.css('#ne_article_source::text').extract()
        if source:
            print('source:{}'.format(source[0]))
            item['news_source'] = source[0]

    def get_title(self, response, item):
        title = response.css('title::text').extract()
        if title:
            print('title:{}'.format(title[0]))
            item['news_title'] = title[0]

    def get_time(self, response, item):
        time = response.css('div.post_time_source::text').extract()
        if time:
            # drop the '来源' ("Source") label and full-width spaces around the timestamp
            cleaned = time[0].strip().replace('来源', '').replace('\u3000', '')
            print('time:{}'.format(cleaned))
            item['news_time'] = cleaned
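The CSS selectors above are easiest to tune interactively with Scrapy's shell; for example, using one of the article URLs from earlier:

scrapy shell "https://news.163.com/20/0427/20/FB8E63MK00018AOR.html"
>>> response.css('title::text').extract()
>>> response.css('.post_text p::text').extract()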
4. Write the pipeline in pipelines.py to export the data to CSV
from scrapy.exporters import CsvItemExporter

class NewsPipeline(object):
    def __init__(self):
        # CsvItemExporter needs a binary file handle, hence 'wb'
        self.file = open('news_data.csv', 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='UTF-8')
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
5. Configure settings.py
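At a minimum this means registering the pipeline so Scrapy actually runs it; a minimal sketch, assuming the pipeline class above lives in news/pipelines.py:

# settings.py
ITEM_PIPELINES = {
    'news.pipelines.NewsPipeline': 300,  # 0-1000; lower numbers run earlier
}
# tutorials commonly also disable robots.txt compliance; check the site's terms first
ROBOTSTXT_OBEY = False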

6. Run the crawl
Run scrapy crawl news163 in cmd.
A news_data.csv file appears in the project directory (in the file tree on the left of your IDE); it holds the scraped data.
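To spot-check the export, a short illustrative snippet (field names follow items.py above):

import csv

with open('news_data.csv', encoding='utf-8') as f:
    for i, row in enumerate(csv.DictReader(f)):
        print(row['news_title'], row['news_url'])
        if i >= 4:  # the first five rows are enough for a sanity check
            break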
