Scraping Dangdang product information with Scrapy + XPath

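Goal and result:

The goal is to use Scrapy with XPath expressions to scrape the title, review count and link of products listed on Dangdang (dangdang.com), to follow the pagination automatically so that many pages are crawled (for example 40 pages), and to write everything that was scraped into a database.
Because XPath expressions are used throughout, a brief usage overview comes first to make the rest easier to follow.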

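[Note] XPath versus regular expressions:

1. XPath expressions are generally a bit more efficient.

2. Regular expressions are somewhat more powerful.

3. As a rule, prefer XPath, and fall back to a regular expression for whatever XPath cannot solve.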
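
Point 3 often plays out as a two-step extraction: XPath selects the text node, and a regular expression pulls the needed part out of that text. The following is a minimal sketch using only the standard library's `re` module. The helper name `review_count` and the sample strings are made up for illustration, and the exact wording of the review text on Dangdang is an assumption.

```python
import re


def review_count(text):
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


# Hypothetical strings of the kind an XPath text() query might return.
with_number = review_count("1234条评论")
without_number = review_count("暂无评论")
```

In the spider shown later, a helper like this could be applied to each string in `item["comment"]` before the value is stored.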

A short introduction to XPath expressions

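    XPath expression basics

    1. /  steps down one level at a time
    2. text()  extracts the text inside a tag
    E.g. to extract the text of the page title:
        /html/head/title/text()
    3. //tagname  selects every tag with that name
    E.g. to select all div tags:
        //div
    4. //tagname[@attr='value']  selects the tags whose attribute has that value
    E.g. to select the <div class="tools"> tags among all div tags:
        //div[@class='tools']
    5. @attr  takes the value of an attribute
    E.g. to take the title attribute of the matching a tags:
        //a[@name='itemlist-picture']/@title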
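
The rules above can be tried out without Scrapy. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath: it has no `text()` or `/@attr` step, so the text and attribute values are read from the matched elements instead. The sample page is invented for illustration; inside a spider, `response.xpath(...)` accepts the full expressions listed above.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page invented for this illustration.
PAGE = """<html>
  <head><title>Demo shop</title></head>
  <body>
    <div class="tools">toolbar</div>
    <div class="list">
      <a name="itemlist-picture" title="Book A" href="http://example.com/a">pic</a>
      <a name="itemlist-picture" title="Book B" href="http://example.com/b">pic</a>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Rules 1 and 2: step down level by level, then take the text.
# ElementTree has no text() step, so read .text on the matched element.
page_title = root.find("./head/title").text

# Rule 3: every <div>, at any depth.
all_divs = root.findall(".//div")

# Rule 4: only the <div> whose class attribute equals "tools".
tools_divs = root.findall(".//div[@class='tools']")

# Rule 5: take an attribute; ElementTree uses .get() in place of /@title.
titles = [a.get("title") for a in root.findall(".//a[@name='itemlist-picture']")]
```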

Implementation steps

1. Create the project and enter the project directory

    scrapy startproject dangdang
    # once the command above has succeeded, enter the dangdang project directory
    cd dangdang

2. Create the spider file

    scrapy genspider -t basic dd dangdang.com

3. Open the dangdang project folder in PyCharm, find items.py and edit it.

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html

    import scrapy


    class DangdangItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()    # product titles
        link = scrapy.Field()     # product links
        comment = scrapy.Field()  # review counts

In settings.py, uncomment the following lines so that the item pipeline is enabled:

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'dangdang.pipelines.DangdangPipeline': 300,
    }

4. Write the spider file (the dd.py file just created under the spiders folder)

    # -*- coding: utf-8 -*-
    import scrapy
    from dangdang.items import DangdangItem


    class DdSpider(scrapy.Spider):
        name = 'dd'
        allowed_domains = ['dangdang.com']
        # Before scraping a URL, work out its pattern: copy the URLs of a few
        # pages and compare them to see which part changes from page to page.
        # In the URL used here, changing pg1 to pg10 jumps from page 1 to
        # page 10 while the rest of the URL stays the same.
        start_urls = ['http://category.dangdang.com/pg1-cid4008154.html']

        def parse(self, response):
            item = DangdangItem()
            # @name='...' pins down the a tags that hold the title, link and
            # review count. This is specific to this page; other pages may need
            # a different name value or a different attribute altogether.
            item["title"] = response.xpath("//a[@name='itemlist-picture']/@title").extract()
            item["link"] = response.xpath("//a[@name='itemlist-picture']/@href").extract()
            item["comment"] = response.xpath("//a[@name='itemlist-review']/text()").extract()
            print(item["title"])  # for the test in step 5; comment out later
            yield item  # hand the item to the pipeline (needed from step 6 on)

5. In a cmd window, run scrapy crawl dd --nolog to check whether the crawl works. On success the scraped titles are printed.

6. Edit pipelines.py

    class DangdangPipeline(object):
        def process_item(self, item, spider):
            # each field holds a list for the whole page; walk them by index
            for i in range(0, len(item["title"])):
                title = item["title"][i]
                link = item["link"][i]
                comment = item["comment"][i]
                print(title + ":" + link + ":" + comment)  # for testing; delete later
            return item
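
Indexing three parallel lists raises an IndexError as soon as one list is shorter than `item["title"]`, for instance when a product has no review link. One possible alternative, sketched here with plain dictionaries so it runs without Scrapy, is to pair the lists with `zip`. The function name `rows_from_item` and the sample data are hypothetical.

```python
def rows_from_item(item):
    """Pair up the parallel lists of one item as (title, link, comment) rows."""
    # zip stops at the shortest list, so a missing entry never raises IndexError
    return list(zip(item["title"], item["link"], item["comment"]))


# Made-up data; the comment list is one entry short on purpose.
sample_item = {
    "title": ["Book A", "Book B", "Book C"],
    "link": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    "comment": ["12条评论", "3条评论"],
}
rows = rows_from_item(sample_item)
```

Note that `zip` only avoids the crash. If the missing entry belongs to a product in the middle of the page, the remaining rows are paired with the wrong review counts, so extracting the three values product by product would be more robust.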

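7. Test step 6. Comment out the print statement from step 4, then run scrapy crawl dd --nolog in cmd again. The titles, links and review counts are now all printed.

8. Install the Python module for working with MySQL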

    pip install pymysql

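9. Create the database and table that will hold the scraped data. Create a database named spider, then create a table named goods in it with the columns and constraints below.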

    CREATE DATABASE spider;
    USE spider;
    CREATE TABLE goods (
        id INT(32) AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(100),
        link VARCHAR(100) UNIQUE,
        comment VARCHAR(100)
    );

10. Add the database code to pipelines.py. Remember to change the MySQL password and the database name to match your local setup.

    # -*- coding: utf-8 -*-
    import pymysql
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


    class DangdangPipeline(object):
        def process_item(self, item, spider):
            # use your own MySQL password and database name here
            conn = pymysql.connect(host="127.0.0.1", user="root", password="root", db="spider")
            cursor = conn.cursor()  # get a cursor from the connection
            for i in range(0, len(item["title"])):
                title = item["title"][i]
                link = item["link"][i]
                comment = item["comment"][i]
                # note: SQL built by concatenation fails if a value contains a quote
                sql = "insert into goods(title,link,comment) values ('" + title + "','" + link + "','" + comment + "')"
                cursor.execute(sql)  # run the SQL statement through the cursor
                conn.commit()  # commit the row
            cursor.close()
            conn.close()
            return item
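
Building the INSERT statement by string concatenation fails whenever a title contains a quote character, and it is open to SQL injection. A common alternative is a parameterized query, where the driver does the quoting. The sketch below is an illustration rather than the pipeline itself: it uses the standard library's `sqlite3` with an in-memory database so that it runs without a MySQL server, and the function name `save_rows` and the sample rows are made up. With pymysql the same idea applies, except that the placeholder is `%s` instead of `?`.

```python
import sqlite3

# In-memory stand-in for the goods table (SQLite syntax, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE goods ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "title TEXT, link TEXT UNIQUE, comment TEXT)"
)


def save_rows(conn, rows):
    """Insert (title, link, comment) tuples using placeholders."""
    sql = "INSERT INTO goods (title, link, comment) VALUES (?, ?, ?)"
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(sql, rows)


# Made-up rows; the first title contains both kinds of quote on purpose.
rows = [
    ("Kid's book \"A\"", "http://example.com/a", "12"),
    ("Book B", "http://example.com/b", "3"),
]
save_rows(conn, rows)
stored = conn.execute("SELECT title, link, comment FROM goods ORDER BY id").fetchall()
```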

11. Crawl the following pages automatically. Change dd.py as follows.

    import scrapy
    from dangdang.items import DangdangItem
    from scrapy.http import Request


    class DdSpider(scrapy.Spider):
        name = 'dd'
        allowed_domains = ['dangdang.com']
        start_urls = ['http://category.dangdang.com/pg1-cid4008154.html']

        def parse(self, response):
            item = DangdangItem()
            item["title"] = response.xpath("//a[@name='itemlist-picture']/@title").extract()
            item["link"] = response.xpath("//a[@name='itemlist-picture']/@href").extract()
            item["comment"] = response.xpath("//a[@name='itemlist-review']/text()").extract()
            yield item
            # follow the pagination up to page 80
            for i in range(2, 81):
                url = 'http://category.dangdang.com/pg' + str(i) + '-cid4008154.html'
                yield Request(url, callback=self.parse)
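
The page URLs follow the pattern found in step 4: only the number after pg changes. A small sketch of building that list separately from the spider is shown below. The helper name `page_urls` and its parameters are hypothetical, and the category id is the one from the start URL.

```python
def page_urls(first_page, last_page, category="cid4008154"):
    """Build the listing URLs for pages first_page..last_page (inclusive)."""
    template = "http://category.dangdang.com/pg{page}-{category}.html"
    return [
        template.format(page=page, category=category)
        for page in range(first_page, last_page + 1)
    ]


urls = page_urls(2, 80)
```

Because the loop sits inside parse, every crawled page yields the same 79 requests again. Scrapy's default duplicate filter drops the repeats, so each page is still fetched only once.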

12. Exception handling. Go back to pipelines.py and wrap the execute and commit calls inside the for loop in a try/except block.

    try:
        cursor.execute(sql)
        conn.commit()
    except Exception as err:
        print(err)
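
One error this try/except is likely to catch is a duplicate link: the link column is UNIQUE, so running the crawl a second time tries to insert rows that already exist. The sketch below shows the effect with the standard library's `sqlite3` and an in-memory table, so it runs without MySQL. The function name `insert_rows` and the sample rows are made up, and it catches the narrower `sqlite3.IntegrityError` (pymysql raises its own `IntegrityError` class for the same situation).

```python
import sqlite3

# In-memory stand-in for the goods table (SQLite syntax, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE goods ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "title TEXT, link TEXT UNIQUE, comment TEXT)"
)


def insert_rows(conn, rows):
    """Insert rows one by one; record rows rejected by a constraint and continue."""
    inserted = 0
    skipped = []
    for row in rows:
        try:
            conn.execute(
                "INSERT INTO goods (title, link, comment) VALUES (?, ?, ?)", row
            )
            conn.commit()
            inserted += 1
        except sqlite3.IntegrityError as err:
            conn.rollback()  # discard the failed statement, keep earlier rows
            skipped.append((row[1], str(err)))
    return inserted, skipped


# Made-up rows; the second one repeats the link of the first.
rows = [
    ("Book A", "http://example.com/a", "12"),
    ("Book A again", "http://example.com/a", "12"),
    ("Book B", "http://example.com/b", "3"),
]
inserted, skipped = insert_rows(conn, rows)
```

Because each row is committed on its own, one rejected row does not stop the rows after it from being stored.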

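13. In the command line, run scrapy crawl dd --nolog and wait for it to finish. The database then contains the product information from the first 80 pages of the Dangdang listing.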

Complete code:

dd.py

    # -*- coding: utf-8 -*-

    import scrapy
    from dangdang.items import DangdangItem
    from scrapy.http import Request


    class DdSpider(scrapy.Spider):
        name = 'dd'
        allowed_domains = ['dangdang.com']
        start_urls = ['http://category.dangdang.com/pg1-cid4008154.html']

        def parse(self, response):
            item = DangdangItem()
            item["title"] = response.xpath("//a[@name='itemlist-picture']/@title").extract()
            item["link"] = response.xpath("//a[@name='itemlist-picture']/@href").extract()
            item["comment"] = response.xpath("//a[@name='itemlist-review']/text()").extract()
            yield item
            # follow the pagination up to page 80
            for i in range(2, 81):
                url = 'http://category.dangdang.com/pg' + str(i) + '-cid4008154.html'
                yield Request(url, callback=self.parse)

items.py

    # -*- coding: utf-8 -*-

    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html

    import scrapy


    class DangdangItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        title = scrapy.Field()    # product titles
        link = scrapy.Field()     # product links
        comment = scrapy.Field()  # review counts

pipelines.py

    # -*- coding: utf-8 -*-

    import pymysql
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


    class DangdangPipeline(object):
        def process_item(self, item, spider):
            # use your own MySQL password and database name here
            conn = pymysql.connect(host="127.0.0.1", user="root", password="root", db="spider")
            cursor = conn.cursor()
            for i in range(0, len(item["title"])):
                title = item["title"][i]
                link = item["link"][i]
                comment = item["comment"][i]
                sql = "insert into goods(title,link,comment) values ('" + title + "','" + link + "','" + comment + "')"
                try:
                    cursor.execute(sql)
                    conn.commit()
                except Exception as err:
                    print(err)
            cursor.close()
            conn.close()
            return item
