

Crawling QQ Music with Scrapy: Songs, Lyrics, Comments, and Download Links

發(fā)布時(shí)間:2023/12/20 编程问答 41 豆豆

Table of Contents

  • Crawling QQ Music with Scrapy: Songs, Lyrics, Comments, and Download Links
    • Crawl Analysis
      • 1. Analyzing the singer-list page
      • 2. Writing the code
        • Fields to scrape in items.py
        • Configuration in settings.py
        • The spider code in spiders/music.py
      • 3. Analyzing the song list
        • Continuing in music.py
      • 4. Analyzing the lyrics request
        • Writing the lyrics code
        • Cleaning the lyrics
      • 5. Analyzing the comments
      • 6. The song download URL
      • 7. Saving the data to MongoDB
      • 8. Random User-Agent
      • 9. Enabling the middleware in settings.py
      • 10. Running the spider
    • Summary

Crawling QQ Music with Scrapy: Songs, Lyrics, Comments, and Download Links

I previously wrote a detailed Scrapy tutorial on crawling Douban movies, so I won't go into as much detail this time; my earlier posts also came in handy when working out the QQ Music download decryption.

Crawling Douban Movies with Scrapy
Crawling QQ Paid Music
QQ Music Lossless Download

At the time of writing, Python 3.7 conflicts with Scrapy, so Python 3.6 is recommended. The versions used here:

Python 3.6.5
Scrapy 1.5.1
pymongo 3.7.1

Crawl Analysis

Crawl strategy: start from the singer category pages, crawl each singer's songs, and then, for each song, crawl its lyrics, comments, and download link in turn.

1. Analyzing the singer-list page


Having found the URL, let's analyze the parameters it carries.

https://u.y.qq.com/cgi-bin/musicu.fcg?callback=getUCGI43917153213009863&g_tk=5381&jsonpCallback=getUCGI43917153213009863&loginUin=0&hostUin=0&format=jsonp&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq&needNewCode=0&data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A10000%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A0%2C%22cur_page%22%3A1%7D%7D%7D


Analysis shows that only the parameters inside data matter:
sin: 0 by default; becomes 80 on the second page
cur_page: the current page number
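As a quick sketch of how this pagination works, the data parameter can be rebuilt from a plain dict and URL-encoded. This is an illustration, not the spider's exact code; the payload shape is copied from the decoded URL above:

```python
import json
from urllib.parse import quote


def build_singer_list_url(page):
    """Rebuild the singer-list request URL for a given page (sketch)."""
    data = {
        "comm": {"ct": 24, "cv": 10000},
        "singerList": {
            "module": "Music.SingerListServer",
            "method": "get_singer_list",
            "param": {"area": -100, "sex": -100, "genre": -100, "index": -100,
                      # sin advances by 80 per page: page 1 -> 0, page 2 -> 80, ...
                      "sin": 80 * (page - 1), "cur_page": page},
        },
    }
    # Compact separators avoid spaces that would otherwise be encoded as '+'
    payload = json.dumps(data, separators=(',', ':'))
    return 'https://u.y.qq.com/cgi-bin/musicu.fcg?data=' + quote(payload)
```

Formatting `num` and `id` into the start URL, as the spider below does, achieves the same thing with string templates.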

2. Writing the code

首先創(chuàng)建Scrapy項(xiàng)目scrapy startproject qq_music,
生成Spider文件scrapy genspider music y.qq.com。

Fields to scrape in items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class QqMusicItem(scrapy.Item):
    # Database collection name
    collection = table = 'singer'
    id = Field()
    # Singer name
    singer_name = Field()
    # Song name
    song_name = Field()
    # Song download URL
    song_url = Field()
    # Lyrics
    lrc = Field()
    # Comments
    comment = Field()

Configuration in settings.py

MAX_PAGE = 3     # number of singer-list pages to crawl
SONGER_NUM = 1   # number of songs per singer, ordered by popularity
MONGO_URL = '127.0.0.1'
MONGO_DB = 'music'  # MongoDB database name

The spider code in spiders/music.py

# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy import Request

from qq_music.items import QqMusicItem


class MusicSpider(scrapy.Spider):
    name = 'music'
    allowed_domains = ['y.qq.com']
    # Starting crawl URL: the data parameter is URL-encoded JSON with
    # {num} (sin) and {id} (cur_page) placeholders
    start_urls = ['https://u.y.qq.com/cgi-bin/musicu.fcg?data=%7B%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer'
                  '%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genr'
                  'e%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A{num}%2C%22cur_page%22%3A{id}%7D%7D%7D']
    # Song download address
    song_down = 'https://c.y.qq.com/base/fcgi-bin/fcg_music_express_mobile3.fcg?&jsonpCallback=MusicJsonCallback&ci' \
                'd=205361747&songmid={songmid}&filename=C400{songmid}.m4a&guid=9082027038'
    # Singer's song list
    song_url = 'https://c.y.qq.com/v8/fcg-bin/fcg_v8_singer_track_cp.fcg?singermid={singer_mid}&order=listen&num={sum}'
    # Lyrics
    lrc_url = 'https://c.y.qq.com/lyric/fcgi-bin/fcg_query_lyric.fcg?nobase64=1&musicid={musicid}'
    # Song comments
    discuss_url = 'https://c.y.qq.com/base/fcgi-bin/fcg_global_comment_h5.fcg?cid=205360772&reqtype=2&biztype=1&topid=' \
                  '{song_id}&cmd=8&pagenum=0&pagesize=25'

    def start_requests(self):
        # Build one request per page; MAX_PAGE comes from settings
        for i in range(1, self.settings.get('MAX_PAGE') + 1):
            yield Request(self.start_urls[0].format(num=80 * (i - 1), id=i),
                          callback=self.parse_user)

    def parse_user(self, response):
        """
        Crawl the singer list.
        singer_mid: singer mid
        singer_name: singer name
        Then request each singer's hot songs.
        """
        singer_list = json.loads(response.text).get('singerList').get('data').get('singerlist')
        for singer in singer_list:
            singer_mid = singer.get('singer_mid')
            singer_name = singer.get('singer_name')
            yield Request(self.song_url.format(singer_mid=singer_mid,
                                               sum=self.settings.get('SONGER_NUM')),
                          callback=self.parse_song,
                          meta={'singer_name': singer_name})

3. Analyzing the song list



我們發(fā)現(xiàn)singermid和order參數(shù)是必須常量,num是獲取歌曲的數(shù)量。

Next we pick the fields we need out of the response.

Continuing in music.py

    def parse_song(self, response):
        """
        Crawl a singer's hot songs.
        songid is used to fetch comments; songmid is used to fetch the download URL.
        """
        song_list = json.loads(response.text).get('data').get('list')
        for song_info in song_list:
            music = QqMusicItem()
            music['singer_name'] = response.meta.get('singer_name')
            music['song_name'] = song_info.get('musicData').get('songname')
            song_id = song_info.get('musicData').get('songid')
            music['id'] = song_id
            song_mid = song_info.get('musicData').get('songmid')
            musicid = song_id  # same id, needed later for the lyrics request
            yield Request(url=self.discuss_url.format(song_id=song_id),
                          callback=self.parse_comment,
                          meta={'music': music, 'musicid': musicid, 'song_mid': song_mid})

4. Analyzing the lyrics request


musicid: this parameter is the songid obtained from the song list above. We continue writing code in music.py.

Writing the lyrics code

    def parse_lrc(self, response):
        """Save the song's lyrics, then request the download URL."""
        music = response.meta.get('music')
        music['lrc'] = response.text
        song_mid = response.meta.get('song_mid')
        yield Request(url=self.song_down.format(songmid=song_mid),
                      callback=self.parse_url,
                      meta={'music': music, 'songmid': song_mid})

Opening the lyrics URL directly in a browser fails because the Referer header is missing, so we build the request in Postman instead; the returned data then needs cleaning.
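Outside of Postman, the same request can be sketched with the standard library by attaching a Referer header manually. The header value and musicid below are assumptions for illustration; any player page under y.qq.com should do (the network call itself is left commented out):

```python
from urllib import request

# Hypothetical musicid, for illustration only
lrc_url = ('https://c.y.qq.com/lyric/fcgi-bin/fcg_query_lyric.fcg'
           '?nobase64=1&musicid=105420231')

# Without a Referer the server rejects the request
req = request.Request(lrc_url,
                      headers={'Referer': 'https://y.qq.com/portal/player.html'})
# resp = request.urlopen(req)  # actual network call
```

Inside the spider no extra work is needed: as noted in the summary, Scrapy fills in the Referer automatically when one callback yields the next request.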

Cleaning the lyrics

We clean the data in pipelines.py.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json
import re

import pymongo
from scrapy.exceptions import DropItem

from qq_music.items import QqMusicItem


class QqMusicPipeline(object):
    def process_item(self, item, spider):
        return item


class lrcText(object):
    """The raw lyrics need cleaning."""

    def process_item(self, item, spider):
        # Keep only the Chinese characters from the raw lyric response
        if isinstance(item, QqMusicItem):
            if item.get('lrc'):
                result = re.findall(r'[\u4e00-\u9fa5]+', item['lrc'])
                item['lrc'] = ' '.join(result)
                return item
            else:
                return DropItem('Missing Text')

5、分析評(píng)論


Same routine: analyze the parameters in the URL; the key one is the songid we obtained earlier.

    def parse_comment(self, response):
        """Crawl one page of hot comments for the song, then request the lyrics."""
        comments = json.loads(response.text).get('hot_comment').get('commentlist')
        if comments:
            comments = [{'comment_name': comment.get('nick'),
                         'comment_text': comment.get('rootcommentcontent')}
                        for comment in comments]
        else:
            comments = 'null'
        music = response.meta.get('music')
        music['comment'] = comments
        # Pass along the ids the later callbacks need
        musicid = response.meta.get('musicid')
        song_mid = response.meta.get('song_mid')
        yield Request(url=self.lrc_url.format(musicid=musicid),
                      callback=self.parse_lrc,
                      meta={'music': music, 'song_mid': song_mid})

6. The song download URL

I won't actually download the songs here, just collect the download URLs. The analysis behind this is covered in my other post on QQ Music downloads, so let's go straight to the code.

    def parse_url(self, response):
        """Parse the song's download URL from the vkey response."""
        song_text = json.loads(response.text)
        song_mid = response.meta.get('songmid')
        vkey = song_text['data']['items'][0]['vkey']  # the signed key
        music = response.meta.get('music')
        if vkey:
            music['song_url'] = ('http://dl.stream.qqmusic.qq.com/C400' + song_mid +
                                 '.m4a?vkey=' + vkey + '&guid=9082027038&uin=0&fromtag=66')
        else:
            music['song_url'] = 'null'
        yield music

7. Saving the data to MongoDB

The configuration lives in settings.py; crawler.settings.get() reads it directly.

class MongoPipline(object):
    """Save items to MongoDB."""

    def __init__(self, mongo_url, mongo_db):
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    @classmethod
    def from_crawler(cls, crawler):
        return cls(mongo_url=crawler.settings.get('MONGO_URL'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def process_item(self, item, spider):
        if isinstance(item, QqMusicItem):
            self.db[item.collection].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()

8. Random User-Agent

I maintain my own proxy pool, so I won't cover random proxies here, only random User-Agents.
Write the random-header middleware in middlewares.py.

import random


class my_useragent(object):
    """Attach a random User-Agent header to every outgoing request."""

    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

    def process_request(self, request, spider):
        # Note the header name is 'User-Agent' (hyphen, not underscore)
        request.headers['User-Agent'] = random.choice(self.user_agent_list)

9. Enabling the middleware in settings.py

ROBOTSTXT_OBEY = False

DOWNLOADER_MIDDLEWARES = {
    # 'qq_music.middlewares.QqMusicDownloaderMiddleware': 543,
    'qq_music.middlewares.my_useragent': 544,
}

ITEM_PIPELINES = {
    # 'qq_music.pipelines.QqMusicPipeline': 300,
    'qq_music.pipelines.lrcText': 300,
    'qq_music.pipelines.MongoPipline': 302,
}

10. Running the spider

(venv) qq_music git:(master) $ scrapy crawl music

Summary

  1. When testing the singer-list URL with requests, I built the nested data payload as a dict, converted it with json.dumps, and then submitted the request. It failed because '+' signs appeared in front of the numbers; stripping the '+' from the failing URL with replace produced the final working URL:
     https://u.y.qq.com/cgi-bin/musicu.fcg?callback=getUCGI5078555865872545&g_tk=5381&jsonpCallback=getUCGI5078555865872545&loginUin=0&hostUin=0&format=jsonp&inCharset=utf8&outCharset=utf-8&notice=0&platform=yqq&needNewCode=0&data=%7B%22comm%22%3A%7B%22ct%22%3A24%2C%22cv%22%3A10000%7D%2C%22singerList%22%3A%7B%22module%22%3A%22Music.SingerListServer%22%2C%22method%22%3A%22get_singer_list%22%2C%22param%22%3A%7B%22area%22%3A-100%2C%22sex%22%3A-100%2C%22genre%22%3A-100%2C%22index%22%3A-100%2C%22sin%22%3A0%2C%22cur_page%22%3A1%7D%7D%7D
  2. When chaining to the next request with yield Request(self.start_urls, callback=self.parse_user, meta={'demo': demo}), pass state through meta; otherwise, every item you yield goes straight to the pipelines.
  3. Requests are processed in sequence, and when one callback issues the next request, Scrapy automatically fills in the Referer header.
