Advertisement

scrapy爬取某网站景区评论爬虫

阅读量:

step1.研究网页结构,每个景点有一个景区的超链接 https://piao.ctrip.com/ticket/dest/t2286.html

step2.链接到景区页面后,评论在scrapy shell中不显示,推测应该是由ajax等方式发起的异步请求。

  1. 找到的地址是:https://sec-m.ctrip.com/restapi/soa2/12530/json/viewCommentList
  2. 请求体中包含景区的viewid(即景区链接里的2286),其他就是一些分页等的内容,可以自己设定。

step3.计划这个爬虫分2步

  1. 爬取景点的code
  2. 根据code爬取 景区的评论

step4.源码放到git上了:https://github.com/wenwen0220/xiechengDemo

主要代码如下:

爬取code:

复制代码
 import scrapy

    
 from xiechengDemo.items import SceneryCodeItem
    
 import random
    
 import re
    
 #爬取景区的code
    
 class SceneryCodeSpider(scrapy.Spider):
    
 	name = "sceneryCode"
    
 	#要爬取的url集合
    
 	# start_urls = ['https://you.ctrip.com/sightlist/shandong100039/s0-p2.html']
    
 	#可以直接读取文件
    
 	start_urls=[i.strip() for i in open('/Users/jw/python/xiechengDemo/urls.txt').readlines()]
    
  
    
 	def parse(slf,response):
    
 		# print(response)
    
 		#用xpath获取需要的内容
    
 		sceneryName_list=response.xpath('.//*[@class="list_mod2"]/div[2]/dl/dt/a/text()').extract()
    
 		#获取景区的url连接地址
    
 		sceneryUrl_list=response.xpath('.//*[@class="list_mod2"]/div[2]/dl/dt/a/@href').extract()
    
 		# print(sceneryName_list)
    
 		list=[]
    
  
    
 		for i,j in zip(sceneryName_list,sceneryUrl_list):
    
 			#将url切分,获取景区code与城市名称
    
 			uri=j.split("/")
    
 			sceneryItem=SceneryCodeItem()
    
 			# item['_id']=str(random.randint(1,1000))
    
 			sceneryItem['provinceName']= "shandong"
    
 			#获取所有非数字的,正则表达式(qingdao)
    
 			sceneryItem['cityName']= re.findall("\D+",uri[2])[0]
    
 			sceneryItem['sceneryName']=i
    
 			#获取所有数字的,正则表达式(1234)
    
 			sceneryItem['sceneryCode']=re.findall("\d+",uri[3])[0]
    
 			print(sceneryItem)
    
 			yield sceneryItem
    
 		# 	list.append(sceneryItem)
    
 		# return list

爬取评论

复制代码
 import scrapy

    
 from xiechengDemo.items import SceneryCommentsItem
    
 import random
    
 import json
    
 import re
    
 import datetime
    
 from datetime import date
    
  
    
 #根据景区的id爬取景区的评论
    
 class SceneryCommentSpider(scrapy.Spider):
    
 	name = "sceneryComment"
    
  
    
 	def start_requests(self):
    
  
    
 		postUrl="https://sec-m.ctrip.com/restapi/soa2/12530/json/viewCommentList"
    
 		for data in self.getBody():
    
 			#FormRequest方法的content-type默认是“application/x-www-form-urlencoded”,请求会返回空,用下边的方法替换。
    
 			# yield scrapy.FormRequest(url=postUrl,formdata=data,callback=self.parse) 
    
 			yield scrapy.Request(
    
 				postUrl, 
    
 				body=json.dumps(data[0]), 
    
 				method='POST', 
    
 				headers={'Content-Type': 'application/json'},
    
 				callback=lambda response,sceneryCode=data[1],sceneryName=data[2]: self.parse(response,sceneryCode,sceneryName))
    
  
    
 	def parse(slf,response,sceneryCode,sceneryName):
    
 		# print(response.text)
    
  
    
 		# date=time.strftime('%Y-%m-%d %H:%M:%S',time.localtime(time.time()))
    
 		#获取今天的时间
    
 		# today = date.today()
    
 		beginDate=date(2019,1,1)
    
  
    
  
    
 		jsonArray=json.loads(response.body)['data']['comments']
    
 		for i in jsonArray:
    
 			#评论日期
    
 			# commentDate=datetime.datetime.strptime(i['date'],'%Y-%m-%d')
    
 			#获取年-月-日,格式是str
    
 			commentDateStr=datetime.datetime.strptime(i['date'], '%Y-%m-%d %H:%M').strftime('%Y-%m-%d')
    
 			#str转换成datetime
    
 			b=datetime.datetime.strptime(commentDateStr,'%Y-%m-%d')
    
 			#datetime转换成date
    
 			commentDate=datetime.datetime.date(b)
    
 			# print("------is",commentDate)
    
 			#不是2019年的就跳出
    
 			if commentDate<beginDate :
    
 				continue
    
  
    
 			sceneryCommentsItem=SceneryCommentsItem()
    
 			sceneryCommentsItem['id']=i['id']
    
 			sceneryCommentsItem['uid']=i['uid']
    
 			sceneryCommentsItem['title']=i['title']
    
 			sceneryCommentsItem['content']=i['content']
    
 			sceneryCommentsItem['date']=i['date']
    
 			sceneryCommentsItem['score']=i['score']
    
 			sceneryCommentsItem['sceneryCode']=sceneryCode
    
 			sceneryCommentsItem['sceneryName']=sceneryName
    
 			yield sceneryCommentsItem
    
  
    
 	#获取body的方法
    
 	def getBody(self):
    
 		# f=open("/Users/didi/jw/python/xiechengDemo/sceneryCode.json")
    
 		# res=f.read
    
 		# jsonArray=json.load(res)
    
 		#读取json文件
    
 		listData=[]
    
 		with open('/Users/jw/python/xiechengDemo/sceneryCode.json','r') as f:
    
 			#直接用load方法
    
 			jsonArray=json.load(f)
    
 		for i in jsonArray:
    
 			# print(i['sceneryCode'])
    
 			#请求的内容根据自己要爬取的页面数,与页面size自定义
    
 			data={
    
 				"pageid": "10650000804",
    
 			    "viewid": i['sceneryCode'],
    
 			    "tagid": "0",
    
 			    "pagenum": "1",
    
 			    "pagesize": "50",
    
 			    "contentType": "json",
    
 			    "head": {
    
 			        "appid": "100013776",
    
 			        "cid": "09031037211035410190",
    
 			        "ctok": "",
    
 			        "cver": "1.0",
    
 			        "lang": "01",
    
 			        "sid": "8888",
    
 			        "syscode": "09",
    
 			        "auth": "",
    
 			        "extension": [
    
 			            {
    
 			                "name": "protocal",
    
 			                "value": "https"
    
 			            }
    
 			        ]
    
 			    },
    
 			    "ver": "7.10.3.0319180000"
    
 			}
    
 			list=[]
    
 			list.append(data)
    
 			list.append(i['sceneryCode'])
    
 			list.append(i['sceneryName'])
    
 			listData.append(list)
    
 		return listData

最后的结果,写到了mysql如下:

全部评论 (0)

还没有任何评论哟~