Scraping NetEase Cloud Music Songs with Over 10,000 Comments

A common approach others take is to fetch every playlist and then filter out the songs with more than 10,000 comments.
Since playlists overlap heavily,
that method is inefficient and can still miss songs.
My plan instead is to collect every artist and the hot-song list from each artist's catalog,
and keep only the songs with more than 10,000 comments.
As things stand,
hardly any artist has more than fifty songs with that many comments,
so this exploratory approach remains a workable choice for now.

After opening the artists page you will see 15 category sections in total, such as Chinese male and female vocalists and Western male and female vocalists. The section codes are:

    group = ['1001', '1002', '1003', '2001', '2002', '2003', '6001', '6002', '6003', '7001', '7002', '7003', '4001', '4002', '4003']

Each section is in turn split into 27 sub-pages by initial letter (26 letters plus the hot-artists page). The sub-page codes are:

    initial = ['0']              # '0' is the hot-artists sub-page
    for i in range(65, 91):      # 65-90 are the ASCII codes for A-Z
        initial.append(str(i))
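The same list can be built in one line, which makes the intent clearer:

    # equivalent: the hot-artists code plus the ASCII codes for 'A'..'Z'
    initial = ['0'] + [str(c) for c in range(ord('A'), ord('Z') + 1)]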

15 × 27 = 405, so we need the links to all 405 artist sub-pages, which can be generated from the codes above:

    urls = []
    for g in group:
        for i in initial:
            url = 'http://music.163.com/discover/artist/cat?id=' + g + '&initial=' + i
            urls.append(url)
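A quick sanity check confirms the count and shows what a generated link looks like:

    assert len(urls) == 15 * 27 == 405
    print(urls[0])  # http://music.163.com/discover/artist/cat?id=1001&initial=0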

Next, the crawler scrapes the artist ids from these pages:

    def get_artist(url):
        k = 0
        t = []
        while True:
            try:
                resp = request.urlopen(url)
                html = resp.read().decode('utf-8')
                soup = bs(html, 'html.parser')
                # artist links carry this class; their hrefs look like /artist?id=12345
                l = soup.find_all('a', class_='nm nm-icn f-thide s-fc0')
                p = r'\s*\/[a-z]+\?[a-z]+\=([0-9]+)'
                for i in l:
                    t.append(re.match(p, i['href']).group(1))
                return t
            except Exception as e:
                print(e)
                k += 1
                if k > 10:
                    print('Error on page ' + url)
                    return None
                t = []          # discard partial results before retrying
                continue
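For reference, here is what the regex extracts from a typical artist href (the id 6452 is only an illustration):

    import re
    p = r'\s*\/[a-z]+\?[a-z]+\=([0-9]+)'
    print(re.match(p, '/artist?id=6452').group(1))  # -> 6452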

With the artist ids in hand, the crawler visits each artist's page to collect the song ids of their 50 hot tracks:

    def get_song(artist_id):
        k = 0
        t = []
        while True:
            url = 'http://music.163.com/artist?id=' + artist_id
            try:
                req = request.Request(url)
                req.add_header('User-Agent',
                               'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4399.400 QQBrowser/9.7.12777.400')
                req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
                resp = request.urlopen(req)
                html = resp.read().decode('utf-8')
                soup = bs(html, 'html.parser')
            except Exception as e:
                k += 1
                if k > 10:
                    print('Error on artist ' + artist_id)
                    print(e)
                    return None
                continue
            try:
                # the hidden <ul class="f-hide"> holds the top-50 song links
                a = soup.find('ul', class_='f-hide')
                l = a.children
                p = r'\s*\/[a-z]+\?[a-z]+\=([0-9]+)'
                for i in l:
                    music_id = re.match(p, i.a['href']).group(1)
                    data = (music_id, artist_id)
                    t.append(data)
                return t
            except Exception as e:
                print(e)
                print('Error on artist ' + artist_id)
                return None
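Chaining the two functions, a minimal usage sketch (assuming the imports and the urls list from the full script below):

    artists = get_artist(urls[0])          # artist ids on the first sub-page
    if artists:
        songs = get_song(artists[0])       # [(music_id, artist_id), ...]
        print(songs[:5] if songs else 'no songs found')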

Then, for each song id, query the comments API to get the comment count. The request parameters are encrypted client-side by the site, and the module below reproduces that scheme:

    # -*- coding: utf-8 -*-
    from Crypto.Cipher import AES
    import base64
    import requests
    import json
    import codecs
    import time
    import random
    
    # proxy IP
    proxy_host = '122.72.18.35'
    proxy = {'http': proxy_host}
    
    # request headers
    headers = {'Host': 'music.163.com',
               'Accept': '*/*',
               'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
               'Accept-Encoding': 'gzip, deflate',
               'Content-Type': 'application/x-www-form-urlencoded',
               'Referer': 'http://music.163.com/song?id=347597',
               'Content-Length': '484',
               'Cookie': '__s_=1; _ntes_nnid=f17890f7160fd145486752ebbf2066df,1505221478108; _ntes_nuid=f17890f7160fd145486752ebbf2066df; JSESSIONID-WYYY=Z99pE%2BatJVOAGco1d%2FJpojOK94Xe9GHqe0epcCOj23nqP2SlHt1XwzWQ2FXTwaM2xgIN628qJGj8%2BikzfYkv%2FXAUo%2FSzwMxjdyO9oeQlGKBvH6nYoFpJpVlA%2F8eP57fkZAVEsuB9wqkVgdQc2cjIStE1vyfE6SxKAlA8r0sAgOnEun%2BV%3A1512200032388; _iuqxldmzr_=32; __utma=94650624.1642739310.1512184312.1512184312.1512184312.1; __utmc=94650624; __utmz=94650624.1512184312.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); playerid=10841206',
               'Connection': 'keep-alive',
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'}
    
    # offset = (page number - 1) * 20; 'total' is "true" on the first page and "false" on the rest
    first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}'
    second_param = "010001"
    third_param = "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
    forth_param = "0CoJUm6Qyw8W8jud"
    
    # build the encrypted 'params' field for a given page number
    def get_params(page):
        iv = "0102030405060708"
        first_key = forth_param
        second_key = 16 * 'F'
        if page == 1:  # first page
            first_param = '{rid:"", offset:"0", total:"true", limit:"20", csrf_token:""}'
            h_encText = AES_encrypt(first_param, first_key, iv)
        else:
            offset = str((page - 1) * 20)
            first_param = '{rid:"", offset:"%s", total:"%s", limit:"20", csrf_token:""}' % (offset, 'false')
            h_encText = AES_encrypt(first_param, first_key, iv)
        h_encText = AES_encrypt(h_encText, second_key, iv)
        return h_encText
    
    # encSecKey is a constant here because second_key is fixed at 16 * 'F'
    def get_encSecKey():
        encSecKey = "257348aecb5e556c066de214e531faadd1c55d814f9be95fd06d6bff9f4c7a41f831f6394d5a3fd2e3881736d94a02ca919d952872e7d0a50ebfa1769a7a62d512f5f1ca21aec60bc3819a9c3ffca5eca9a0dba6d6f7249b06f5965ecfff3695b54e1c28f3f624750ed39e7de08fc8493242e26dbc4484a01c76f739e135637c"
        return encSecKey
    
    # AES-CBC encryption with PKCS7-style padding
    def AES_encrypt(text, key, iv):
        pad = 16 - len(text) % 16
        text = text + pad * chr(pad)
        encryptor = AES.new(key.encode('utf-8'), AES.MODE_CBC, iv.encode('utf-8'))
        encrypt_text = encryptor.encrypt(text.encode('utf-8'))
        encrypt_text = base64.b64encode(encrypt_text)
        encrypt_text = str(encrypt_text, encoding="utf-8")  # bytes -> str, or the POST body is malformed
        return encrypt_text
    
    def get_json(url, params, encSecKey):
        data = {
            "params": params,
            "encSecKey": encSecKey
        }
        response = requests.post(url, headers=headers, data=data, proxies=proxy)
        return response.content
    
    # public entry point: total number of comments for a song id
    def get_comments_total(id):
        url = 'http://music.163.com/weapi/v1/resource/comments/R_SO_4_' + str(id) + '?csrf_token='
        params = get_params(1)
        encSecKey = get_encSecKey()
        json_text = get_json(url, params, encSecKey)
        json_dict = json.loads(json_text)
        comments_num = int(json_dict['total'])
        return comments_num
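Why can encSecKey be a constant? In the site's JavaScript the second AES key is random, and encSecKey is that key encrypted with textbook RSA, using second_param as the public exponent and third_param as the modulus. Because this script fixes second_key to 16 * 'F', the matching encSecKey never changes. A sketch of that RSA step, reconstructed by me rather than taken from the original post:

    # Sketch: derive encSecKey for an arbitrary 16-char key (textbook RSA, no padding).
    def rsa_encrypt(key, e_hex, n_hex):
        # the site reverses the key string before hex-encoding it
        key_int = int(key[::-1].encode('utf-8').hex(), 16)
        return format(pow(key_int, int(e_hex, 16), int(n_hex, 16)), 'x').zfill(256)

    # rsa_encrypt(16 * 'F', second_param, third_param) should reproduce the
    # hard-coded encSecKey above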

Finally, write each record we obtain into the database.
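The write() function in the full script inserts into a music table with five columns. The post never shows the schema, so here is a minimal one-off setup sketch inferred from the INSERT statement (the column types and the placeholder password are my assumptions):

    # One-off setup: create the table that write() inserts into (schema assumed).
    import mysql.connector

    conn = mysql.connector.connect(user='root', password='...', database='cloudmusic', charset='utf8')
    cursor = conn.cursor()
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS music (
            music_id    VARCHAR(20) PRIMARY KEY,
            music_name  VARCHAR(255),
            artist_id   VARCHAR(20),
            artist_name VARCHAR(255),
            comments    INT
        )
    ''')
    conn.commit()
    cursor.close()
    conn.close()

With the table in place, the complete script: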

    # _*_ coding: utf-8 _*_
    from urllib import request
    import requests
    import json
    from bs4 import BeautifulSoup as bs
    from Crypto.Cipher import AES
    import base64
    import re
    import mysql.connector
    import get_comments_total as gct   # the comment-count module above, saved as get_comments_total.py
    import threading
    
    
    group = ['1001', '1002', '1003', '2001', '2002', '2003', '6001', '6002', '6003', '7001', '7002', '7003', '4001', '4002',
             '4003']
    
    initial = ['0']              # '0' is the hot-artists sub-page
    for i in range(65, 91):      # 65-90 are the ASCII codes for A-Z
        initial.append(str(i))
    
    urls = []
    for g in group:
        for i in initial:
            url = 'http://music.163.com/discover/artist/cat?id=' + g + '&initial=' + i
            urls.append(url)
    
    # write a batch of records into the database
    def write(L):
        try:
            conn = mysql.connector.connect(user='root', password='lixiao187.', database='cloudmusic', charset='utf8')
            cursor = conn.cursor()
            for l in L:
                try:
                    cursor.execute(
                        'insert into music(music_id, music_name, artist_id, artist_name, comments) values (%s, %s, %s, %s, %s)',
                        l)
                    conn.commit()
                except Exception as e:
                    print(e)
                    print(l)
                    continue
            cursor.close()
            conn.close()
        except Exception as e:
            print(e)
            print(L)
    
    
    # get the list of artist ids on one initial-letter sub-page
    def get_artist(url):
        k = 0
        t = []
        while True:
            try:
                resp = request.urlopen(url)
                html = resp.read().decode('utf-8')
                soup = bs(html, 'html.parser')
                l = soup.find_all('a', class_='nm nm-icn f-thide s-fc0')
                p = r'\s*\/[a-z]+\?[a-z]+\=([0-9]+)'
                for i in l:
                    t.append(re.match(p, i['href']).group(1))
                return t
            except Exception as e:
                print(e)
                k += 1
                if k > 10:
                    print('Error on page ' + url)
                    return None
                t = []          # discard partial results before retrying
                continue
    
    
    # get the hot-song id list for one artist
    def get_song(artist_id):
        k = 0
        t = []
        while True:
            url = 'http://music.163.com/artist?id=' + artist_id
            try:
                req = request.Request(url)
                req.add_header('User-Agent',
                               'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4399.400 QQBrowser/9.7.12777.400')
                req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
                resp = request.urlopen(req)
                html = resp.read().decode('utf-8')
                soup = bs(html, 'html.parser')
            except Exception as e:
                k += 1
                if k > 10:
                    print('Error on artist ' + artist_id)
                    print(e)
                    return None
                continue
            try:
                a = soup.find('ul', class_='f-hide')
                l = a.children
                p = r'\s*\/[a-z]+\?[a-z]+\=([0-9]+)'
                for i in l:
                    music_id = re.match(p, i.a['href']).group(1)
                    data = (music_id, artist_id)
                    t.append(data)
                return t
            except Exception as e:
                print(e)
                print('Error on artist ' + artist_id)
                return None
    
    # gather all the fields we need for one song
    def get_data(music_id, artist_id):
        k = 0
        while True:
            try:
                comments = gct.get_comments_total(music_id)
                print('Song ' + music_id + ', comments: ' + str(comments))
                if comments < 10000:
                    return None
                url = 'http://music.163.com/song?id=' + music_id
                data = []
                req = request.Request(url)
                req.add_header('User-Agent',
                               'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4399.400 QQBrowser/9.7.12777.400')
                req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8')
                resp = request.urlopen(req)
                html = resp.read().decode('utf-8')
                soup = bs(html, 'html.parser')
                d = soup.find('div', class_='tit')
                p = soup.find('p', class_='des s-fc4')
                s = soup.find('span', class_='j-flag')   # fetched but unused below
                music_name = d.em.text
                artist_name = p.span['title']
                data.append(music_id)
                data.append(music_name)
                data.append(artist_id)
                data.append(artist_name)
                data.append(comments)
                return data
            except Exception as e:
                k += 1
                if k > 10:
                    print('Error on song ' + music_id)
                    return None
                continue
    
    # fetch and write records for a batch of artists
    def get_and_write(artists, name):
        data = []
        for a in artists:
            songs = get_song(a)
            if songs is None:
                continue
            for s in songs:
                d = get_data(s[0], a)
                if d is None:
                    continue
                data.append(d)
        if len(data) > 0:
            write(data)
    
    # worker for one artist sub-page
    def crawl(url, name):
        L = []
        artists = get_artist(url)
        if artists is None:
            return
        for a in artists:
            L.append(a)
            if len(L) > 9:      # hand off every batch of 10 artists to a writer thread
                t = threading.Thread(target=get_and_write, args=(L, ''))
                t.start()
                L = []
        t = threading.Thread(target=get_and_write, args=(L, ''))
        t.start()
    
    
    # top-level entry: crawl sub-pages start..end (1-indexed)
    def threads_crawl(start, end):
        L = []
        for i in range(start - 1, end):
            t = threading.Thread(target=crawl, args=(urls[i], ''))
            L.append(t)
        for t in L:
            t.start()
        for t in L:
            t.join()
    
    
    threads_crawl(1, 405)
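One caveat: threads_crawl launches a thread per sub-page, and crawl spawns a writer thread for every batch of ten artists, so the total thread count can spike well past 405. If that becomes a problem, a bounded pool from the standard library is a drop-in alternative (a sketch, not part of the original script):

    # Alternative: bound page-level concurrency with a thread pool.
    from concurrent.futures import ThreadPoolExecutor

    def threads_crawl_pooled(start, end, max_workers=20):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for u in urls[start - 1:end]:
                pool.submit(crawl, u, '')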
