Scraping Xinpianchang (新片场) with a Scrapy Spider


    # -*- coding: utf-8 -*-
    import scrapy
    import re
    from scrapy import Request
    import json
    
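    def convert(s):
        """Convert a count string such as '1,234' to an int; return 0 otherwise."""
        if isinstance(s, str):
            digits = s.replace(',', '').strip()
            if digits.isdigit():
                return int(digits)
        return 0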
    
    class XpcSpider(scrapy.Spider):
        name = 'xpc'
        allowed_domains = ['xinpianchang.com', 'openapi-vtom.vmovier.com']
        start_urls = ['https://www.xinpianchang.com/channel/index/sort-like?from=tabArticle']

        # Collect the link to each video on the listing page
        def parse(self, response):
            # List of post ids (pid)
            pid_list = response.xpath('//ul[@class="video-list"]/li[@class="enter-filmplay"]/@data-articleid').extract()

            # Only needed by the disabled pagination code below
            cookies = {"Authorization": "01D3EF58AA36A73BCAA36A438BAA36A9459AA36AFD0C8371FE04"}

            for pid in pid_list:
                url = 'https://www.xinpianchang.com/a%s?from=ArticleList' % pid
                request = response.follow(url, self.parse_post)
                request.meta['pid'] = pid
                yield request

            # Pagination (left disabled, as in the original):
            # pages = response.xpath('//div[@class="page-wrap"]/div[@class="page"]/a/@href').extract()
            # for page in pages:
            #     yield response.follow(page, self.parse, cookies=cookies)

        # Parse a single video page
        def parse_post(self, response):
            pid = response.meta['pid']
            post = {
                "pid": pid
            }
            # Dict keys keep the original spellings ('cotegray', 'lable')
            cotegray = response.xpath('//span[contains(@class,"cate")]/a/text()').extract()
            post['cotegray'] = '|'.join([cote.strip() for cote in cotegray])
            post['title'] = response.xpath('//div[@class="title-wrap"]/h3/text()').extract()
            post['play_counts'] = response.xpath('//i[@class="fs_12 fw_300 c_b_6 v-center play-counts"]/text()').extract()
            post['like_counts'] = response.xpath('//span[@class="v-center like-counts fs_12 c_w_f fw_300"]/text()').extract()
            lable = response.xpath('//div[@class="fs_12 fw_300 c_b_3 tag-wrapper"]//text()').extract()
            post['lable'] = '|'.join([lab.strip() for lab in lable])

            # Video link: the vid is read from the page source (exactly one match is expected)
            vid, = re.findall(r'vid: "(.*?)"', response.text)
            video_url = 'https://openapi-vtom.vmovier.com/v3/video/%s?expand=resource&usage=xpc_web' % vid
            request = Request(video_url, self.parse_video)
            request.meta['post'] = post
            yield request

            # Comment link
            comment_url = "https://app.xinpianchang.com/comments?resource_id=%s&type=article&page=1&per_page=24" % pid
            request = Request(comment_url, self.parse_comment)
            yield request

            # Creator info
            cid_list = response.xpath('//ul[@class="creator-list"]/li/a/@data-userid').extract()
            creator_url = "https://www.xinpianchang.com/u%s?from=articleList"

            # Post-to-creator relation records
            for cid in cid_list:
                request = response.follow(creator_url % cid, self.parse_composer)
                request.meta['cid'] = cid
                yield request
                cr = {}
                cr['pid'] = pid
                cr['cid'] = cid
                cr['roles'] = response.xpath('//ul[@class="creator-list"]/li/a[@data-userid=$var]/following-sibling::div[1]/a/following-sibling::span/text()', var=cid).extract()
                yield cr

        # Post (video) records
        def parse_video(self, response):
            post = response.meta['post']

            result = json.loads(response.text)
            post['video'] = result['data']['resource']['default']['https_url']
            post['duration'] = result['data']['resource']['default']['duration']
            yield post

        # Comment records
        def parse_comment(self, response):
            result = json.loads(response.text)
            comments = result['data']['list']
            for li in comments:
                # Commenter info: username, avatar, user id, pid of the post,
                # comment text, and likes on the comment.
                # A new dict is built and yielded per comment, so every comment is emitted.
                comment = {}
                comment['uname'] = li['userInfo']['username']
                comment['avatar'] = li['userInfo']['avatar']
                comment['cid'] = li['userInfo']['id']
                comment['pid'] = li['resource_id']
                comment['contentlove'] = li['count_approve']
                comment['content'] = li['content']
                yield comment

        # Creator records
        def parse_composer(self, response):
            composer = {}
            composer['banner'] = response.xpath('//div[@class="banner-wrap"]/@style').extract()
            composer['name'] = response.xpath(
                '//div[@class="creator-info"]'
                '/p[@class="creator-name fs_26 fw_600 c_b_26"]/text()').extract()
            # convert() expects a single string, so take the first match only
            composer['like_counts'] = convert(response.xpath(
                '//div[@class="creator-info"]'
                '/p[@class="creator-detail fs_14 fw_300 c_b_9"]'
                '/span[@class="like-wrap"]/span[2]/text()').extract_first())
            composer['fans_counts'] = response.xpath('//span[@class="fans-wrap"]/span[2]/text()').extract()
            composer['follow_counts'] = response.xpath(
                '//div[@class="creator-info"]'
                '/p[@class="creator-detail fs_14 fw_300 c_b_9"]'
                '/span[@class="follow-wrap"]/span[2]/text()').extract()
            composer['location'] = response.xpath('//span[@class="icon-location v-center"]/following-sibling::span[1]/text()').extract()
            # Key spelling 'carerr' kept from the original
            composer['carerr'] = response.xpath(
                '//div[@class="creator-info"]'
                '/p[@class="creator-detail fs_14 fw_300 c_b_9"]'
                '/span[@class="icon-career v-center"]/following-sibling::span[1]/text()').extract()
            yield composer
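
The comment callback in the spider turns each entry of the API's `data.list` array into a flat record. The sketch below does the same flattening outside Scrapy so it can be tried without network access. The field names (`userInfo`, `resource_id`, `count_approve`, `content`) come from the spider code above. The helper name `extract_comments` and the sample payload are assumptions made up for illustration, and the real API response may contain more fields or a different shape.

```python
import json


def extract_comments(payload_text):
    """Return one flat dict per comment found under data.list in the JSON text."""
    result = json.loads(payload_text)
    rows = []
    for li in result['data']['list']:
        # Build a fresh dict per comment so earlier rows are never overwritten.
        rows.append({
            'uname': li['userInfo']['username'],
            'avatar': li['userInfo']['avatar'],
            'cid': li['userInfo']['id'],
            'pid': li['resource_id'],
            'contentlove': li['count_approve'],
            'content': li['content'],
        })
    return rows


# Hypothetical payload, invented for illustration only.
sample = json.dumps({
    'data': {
        'list': [
            {
                'userInfo': {'username': 'alice', 'avatar': 'a.png', 'id': 1},
                'resource_id': 100,
                'count_approve': 3,
                'content': 'Nice film',
            },
            {
                'userInfo': {'username': 'bob', 'avatar': 'b.png', 'id': 2},
                'resource_id': 100,
                'count_approve': 0,
                'content': 'Great color grading',
            },
        ]
    }
})

rows = extract_comments(sample)
for row in rows:
    print(row['uname'], row['content'])
```

Creating the dict inside the loop matters: reusing a single dict and emitting it once after the loop would keep only the last comment.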
