目录

python爬虫酷狗音乐爬取练习

【python爬虫】酷狗音乐爬取练习

注意:本次爬取的音乐仅有1分钟试听,仅作学习爬虫的原理,完整音乐需要自行下载客户端。

一、 初步分析

登陆酷狗音乐后随机选取一首歌,在请求里发现一段mp3文件,复制网址,确实是我们需要的url。 https://i-blog.csdnimg.cn/img_convert/15ae184dfb743706e8562f3d4290a5ba.png 复制音频的名字,搜索找到发起请求的网址,发现是在songinfo里 https://i-blog.csdnimg.cn/img_convert/940fa1fefb6fd55db500f261224c7ce7.png 查看参数和请求头,刷新一次,查看是否有哪些参数是变化的。可以发现图中两次请求的这些参数都不同,接下来就寻找这些参数的生成方式,参数clienttime为时间戳,那么就只需要找到signature的生成方式就可以了: https://i-blog.csdnimg.cn/img_convert/f5c4b68cd9bc8af814aa6818d74cf30a.png

二、分析参数signature

1 分析过程

搜索参数signature,并在可能生成的位置打上断点,然后刷新网页 https://i-blog.csdnimg.cn/img_convert/b9b418c158631690d3b5fb2917d6e41e.png 网页断在了此处,可以看见参数signature跟函数d与数组s有关。

补充:如果s的长度不为13,需要放行一下,点击这个按钮https://i-blog.csdnimg.cn/img_convert/14f49fec2f9a6dbd03531dd447f7086e.png https://i-blog.csdnimg.cn/img_convert/02d8c19ec75badeac5d125e292b87f08.png 查看函数d的定义 https://i-blog.csdnimg.cn/img_convert/5d442fd93fab1135c0c63b971f9d5aab.png 发现函数内部没有wordsToBytes函数和函数i的相关定义,那么打下断点,查看具体函数的位置 https://i-blog.csdnimg.cn/img_convert/7112fb615bb262bfc23656dd3dfd0b8e.png 获得wordsToBytes函数的具体定义 https://i-blog.csdnimg.cn/img_convert/7712982bf6b6c4c0eeee33e2ab3505f7.png 获得i函数的具体定义 https://i-blog.csdnimg.cn/img_convert/03a7ddbbb192bf52e3db2aa8e860cba9.png i函数里没有r.stringToBytes(t)的相关定义,继续打下断点 https://i-blog.csdnimg.cn/img_convert/1dd5d49186ced672d1211f282df4c257.png 找到r.stringToBytes(t)的相关定义 https://i-blog.csdnimg.cn/img_convert/afddec2f1164d464501290c93c09d43f.png 接着查看s的内容:在console里查看s的内容,发现s的值跟之前请求的参数类似。而且下标为0和下标为12的值跟u的值相同 https://i-blog.csdnimg.cn/img_convert/72f5a5abb2a1c07eebbec4098e14ad52.png 往上查找u的定义,发现u的值是固定的 https://i-blog.csdnimg.cn/img_convert/6736d2fc6c2f7c74205073e07948b46a.png

2 代码实现

那么就开始实现生成signature的代码(以下为JavaScript代码 ) function bytesToWords(t) { for (var n = [], r = 0, e = 0; r < t.length; r++, e += 8) n[e »> 5] |= t[r] « 24 - e % 32; return n } function rotl(t, n) { return t « n | t »> 32 - n } function endian(t) { if (t.constructor == Number) return 16711935 & rotl(t, 8) | 4278255360 & rotl(t, 24); for (var n = 0; n < t.length; n++) t[n] = endian(t[n]); return t } function i(t, c) { var l = { utf8: { stringToBytes: function (t) { return l.bin.stringToBytes(unescape(encodeURIComponent(t))) }, bytesToString: function (t) { return decodeURIComponent(escape(l.bin.bytesToString(t))) } }, bin: { stringToBytes: function (t) { for (var n = [], r = 0; r < t.length; r++) n.push(255 & t.charCodeAt(r)); return n }, bytesToString: function (t) { for (var n = [], r = 0; r < t.length; r++) n.push(String.fromCharCode(t[r])); return n.join("") } } }; i._ff = function (t, n, r, e, o, i, c) { var s = t + (n & r | ~n & e) + (o »> 0) + c; return (s « i | s »> 32 - i) + n }, i._gg = function (t, n, r, e, o, i, c) { var s = t + (n & e | r & ~e) + (o »> 0) + c; return (s « i | s »> 32 - i) + n }, i._hh = function (t, n, r, e, o, i, c) { var s = t + (n ^ r ^ e) + (o »> 0) + c; return (s « i | s »> 32 - i) + n }, i._ii = function (t, n, r, e, o, i, c) { var s = t + (r ^ (n | ~e)) + (o »> 0) + c; return (s « i | s »> 32 - i) + n }; t.constructor == String ? t = c && “binary” === c.encoding ? o.stringToBytes(t) : l.utf8.stringToBytes(t) : e(t) ? t = Array.prototype.slice.call(t, 0) : Array.isArray(t) || (t = t.toString()); for (var s = bytesToWords(t), a = 8 * t.length, l = 1732584193, u = -271733879, f = -1732584194, d = 271733878, g = 0; g < s.length; g++) s[g] = 16711935 & (s[g] « 8 | s[g] »> 24) | 4278255360 & (s[g] « 24 | s[g] »> 8); s[a »> 5] |= 128 « a % 32, s[14 + (a + 64 »> 9 « 4)] = a; for (var b = i._ff, p = i._gg, h = i._hh, m = i._ii, g = 0; g < s.length; g += 16) { var y = l , j = u , S = f , v = d; u = m(u = m(u = m(u = m(u = h(u = h(u = h(u = h(u = p(u = p(u = p(u = p(u = b(u = b(u = b(u = b(u, f = b(f, d = b(d, l = b(l, u, f, d, s[g + 0], 7, -680876936), u, f, s[g + 1], 12, -389564586), l, u, s[g + 2], 17, 606105819), d, l, s[g + 3], 22, -1044525330), f = b(f, d = b(d, l = b(l, u, f, d, s[g + 4], 7, -176418897), u, f, s[g + 5], 12, 1200080426), l, u, s[g + 6], 17, -1473231341), d, l, s[g + 7], 22, -45705983), f = b(f, d = b(d, l = b(l, u, f, d, s[g + 8], 7, 1770035416), u, f, s[g + 9], 12, -1958414417), l, u, s[g + 10], 17, -42063), d, l, s[g + 11], 22, -1990404162), f = b(f, d = b(d, l = b(l, u, f, d, s[g + 12], 7, 1804603682), u, f, s[g + 13], 12, -40341101), l, u, s[g + 14], 17, -1502002290), d, l, s[g + 15], 22, 1236535329), f = p(f, d = p(d, l = p(l, u, f, d, s[g + 1], 5, -165796510), u, f, s[g + 6], 9, -1069501632), l, u, s[g + 11], 14, 643717713), d, l, s[g + 0], 20, -373897302), f = p(f, d = p(d, l = p(l, u, f, d, s[g + 5], 5, -701558691), u, f, s[g + 10], 9, 38016083), l, u, s[g + 15], 14, -660478335), d, l, s[g + 4], 20, -405537848), f = p(f, d = p(d, l = p(l, u, f, d, s[g + 9], 5, 568446438), u, f, s[g + 14], 9, -1019803690), l, u, s[g + 3], 14, -187363961), d, l, s[g + 8], 20, 1163531501), f = p(f, d = p(d, l = p(l, u, f, d, s[g + 13], 5, -1444681467), u, f, s[g + 2], 9, -51403784), l, u, s[g + 7], 14, 1735328473), d, l, s[g + 12], 20, -1926607734), f = h(f, d = h(d, l = h(l, u, f, d, s[g + 5], 4, -378558), u, f, s[g + 8], 11, -2022574463), l, u, s[g + 11], 16, 1839030562), d, l, s[g + 14], 23, -35309556), f = h(f, d = h(d, l = h(l, u, f, d, s[g + 1], 4, -1530992060), u, f, s[g + 4], 11, 1272893353), l, u, s[g + 7], 16, -155497632), d, l, s[g + 10], 23, -1094730640), f = h(f, d = h(d, l = h(l, u, f, d, s[g + 13], 4, 681279174), u, f, s[g + 0], 11, -358537222), l, u, s[g + 3], 16, -722521979), d, l, s[g + 6], 23, 76029189), f = h(f, d = h(d, l = h(l, u, f, d, s[g + 9], 4, -640364487), u, f, s[g + 12], 11, -421815835), l, u, s[g + 15], 16, 530742520), d, l, s[g + 2], 23, -995338651), f = m(f, d = m(d, l = m(l, u, f, d, s[g + 0], 6, -198630844), u, f, s[g + 7], 10, 1126891415), l, u, s[g + 14], 15, -1416354905), d, l, s[g + 5], 21, -57434055), f = m(f, d = m(d, l = m(l, u, f, d, s[g + 12], 6, 1700485571), u, f, s[g + 3], 10, -1894986606), l, u, s[g + 10], 15, -1051523), d, l, s[g + 1], 21, -2054922799), f = m(f, d = m(d, l = m(l, u, f, d, s[g + 8], 6, 1873313359), u, f, s[g + 15], 10, -30611744), l, u, s[g + 6], 15, -1560198380), d, l, s[g + 13], 21, 1309151649), f = m(f, d = m(d, l = m(l, u, f, d, s[g + 4], 6, -145523070), u, f, s[g + 11], 10, -1120210379), l, u, s[g + 2], 15, 718787259), d, l, s[g + 9], 21, -343485551), l = l + y »> 0, u = u + j »> 0, f = f + S »> 0, d = d + v »> 0 } return endian([l, u, f, d]) } function wordsToBytes(t) { for (var n = [], r = 0; r < 32 * t.length; r += 8) n.push(t[r »> 5] »> 24 - r % 32 & 255); return n } function bytesToHex(t) { for (var n = [], r = 0; r < t.length; r++) n.push((t[r] »> 4).toString(16)), n.push((15 & t[r]).toString(16)); return n.join("") } function d(t, r) { if (void 0 === t || null === t) throw new Error(“Illegal argument " + t); var e = wordsToBytes(i(t, r)); return r && r.asBytes ? e : r && r.asString ? o.bytesToString(e) : bytesToHex(e) } function getsianature() { var s = [ “NVPh5oo715z5DIWAeQlhMDsWXXQV4hwt”, “appid=1014”, “clienttime=1741411989613”, “clientver=20000”, “dfid=3Mm61k0WDxvm033Epz2worRG”, “encode_album_audio_id=j410q60”, “mid=70789bebe63fb74c52e4a911853f5450”, “platid=4”, “srcappid=2919”, “token=cbfe2e174e4b97fd6aca35682cdba3d2b431c4ed95e2dbd1779e37a7975b672c”, “userid=2307902397”, “uuid=70789bebe63fb74c52e4a911853f5450”, “NVPh5oo715z5DIWAeQlhMDsWXXQV4hwt” ]; return d(s.join(”")) } console.log(getsianature()); 经过一系列调试,可以发现生成的结果与浏览器生成的值一样,那么生成signature的代码就没问题了。 https://i-blog.csdnimg.cn/img_convert/844d3ec89dd4a68d3dd84e59da39aa27.png 将一些会变的值改为变量,可以看到参数s的值里只有clienttime的值是会变的,因此修改上述代码中getsianature函数,将参数s的值放在python代码中,getsianature函数改完如下图所示: function getsianature(s) { return d(s.join("")) }

三、获取多首歌

1 分析过程

点击不同的歌,可以发现每首歌的参数encode_album_audio_id都不同,因此需要获取encode_album_audio_id https://i-blog.csdnimg.cn/img_convert/06f7106dce6fdf5cb13fd53be422ece2.png 搜索歌名,找到歌曲的id https://i-blog.csdnimg.cn/img_convert/524dcef08f11566cd6f85411a8753d7e.png 接着查看请求参数,同样有一个signature参数,刷新多次网页发现signature参数会变化,那么重复之前分析signature的步骤 https://i-blog.csdnimg.cn/img_convert/e0939435cedd67ed053adc1efbe1fd4c.png 打下断点,查看中断的位置,参数s有所变化 https://i-blog.csdnimg.cn/img_convert/0b59125f0153ef1a6707391c9939ae9c.png 打印s,查看s的内容,除此之外没有变化,那么就沿用先前的代码 https://i-blog.csdnimg.cn/img_convert/9e6ee92daf5c1cdc6090c968e2da24be.png

2 代码实现

完整代码如下,注意这里调用了JavaScript代码,需要安装PyExecJS模块:pip install PyExecJS -i https://pypi.tuna.tsinghua.edu.cn/simple。 本代码中JavaScript文件名为kugou.js,JavaScript代码在参数signature的分析中有写到,以下为python代码: import json import re import time import requests import execjs class kugou_music: def init(self): self.headers = { ‘User-Agent’: ‘Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36’, ‘Referer’: ‘ , } def get_signature(self, s): with open(“kugou.js”, “r”, encoding=“utf-8”) as f: js = f.read() ctx = execjs.compile(js) signature = ctx.call(“getsianature”, s) return signature def get_one_song_url(self, audio_id): timestamp = str(int(time.time() * 1000)) s = [ “NVPh5oo715z5DIWAeQlhMDsWXXQV4hwt”, “appid=1014”, f"clienttime={timestamp}", “clientver=20000”, “dfid=3Mm61k0WDxvm033Epz2worRG”, f"encode_album_audio_id={audio_id}", “mid=70789bebe63fb74c52e4a911853f5450”, “platid=4”, “srcappid=2919”, “token=cbfe2e174e4b97fd6aca35682cdba3d2b431c4ed95e2dbd1779e37a7975b672c”, “userid=2307902397”, “uuid=70789bebe63fb74c52e4a911853f5450”, “NVPh5oo715z5DIWAeQlhMDsWXXQV4hwt” ] signature = self.get_signature(s) params = { ‘srcappid’: ‘2919’, ‘clientver’: ‘20000’, ‘clienttime’: timestamp, ‘mid’: ‘70789bebe63fb74c52e4a911853f5450’, ‘uuid’: ‘70789bebe63fb74c52e4a911853f5450’, ‘dfid’: ‘3Mm61k0WDxvm033Epz2worRG’, ‘appid’: ‘1014’, ‘platid’: ‘4’, ’encode_album_audio_id’: audio_id, ’token’: ‘cbfe2e174e4b97fd6aca35682cdba3d2b431c4ed95e2dbd1779e37a7975b672c’, ‘userid’: ‘2307902397’, ‘signature’: signature } one_song_url = ‘ response = requests.get(one_song_url, headers=self.headers, params=params) song_url = response.json()[‘data’][‘play_url’] return song_url def get_signal_music(self, audio_id, audio_name): song_url = self.get_one_song_url(audio_id) response = requests.get(song_url, headers=self.headers) with open(f’{audio_name}.mp3’, ‘wb’) as f: f.write(response.content) print(f’{audio_name}.mp3下载完成’) def get_song_id(self,keyword): timestamp = str(int(time.time() * 1000)) s = [ “NVPh5oo715z5DIWAeQlhMDsWXXQV4hwt”, “appid=1014”, “bitrate=0”, “callback=callback123”, f"clienttime={timestamp}", “clientver=1000”, “dfid=3Mm61k0WDxvm033Epz2worRG”, “filter=10”, “inputtype=0”, “iscorrection=1”, “isfuzzy=0”, f"keyword={keyword}", “mid=70789bebe63fb74c52e4a911853f5450”, “page=1”, “pagesize=30”, “platform=WebFilter”, “privilege_filter=0”, “srcappid=2919”, “token=cbfe2e174e4b97fd6aca35682cdba3d2b431c4ed95e2dbd1779e37a7975b672c”, “userid=2307902397”, “uuid=70789bebe63fb74c52e4a911853f5450”, “NVPh5oo715z5DIWAeQlhMDsWXXQV4hwt” ] signature = self.get_signature(s) params = { ‘callback’: ‘callback123’, ‘srcappid’: ‘2919’, ‘clientver’: ‘1000’, ‘clienttime’: timestamp, ‘mid’: ‘70789bebe63fb74c52e4a911853f5450’, ‘uuid’: ‘70789bebe63fb74c52e4a911853f5450’, ‘dfid’: ‘3Mm61k0WDxvm033Epz2worRG’, ‘keyword’: keyword, ‘page’: ‘1’, ‘pagesize’: ‘30’, ‘bitrate’: ‘0’, ‘isfuzzy’: ‘0’, ‘inputtype’: ‘0’, ‘platform’: ‘WebFilter’, ‘userid’: ‘2307902397’, ‘iscorrection’: ‘1’, ‘privilege_filter’: ‘0’, ‘filter’: ‘10’, ’token’: ‘cbfe2e174e4b97fd6aca35682cdba3d2b431c4ed95e2dbd1779e37a7975b672c’, ‘appid’: ‘1014’, ‘signature’: signature } song_id_url = ‘ response = requests.get(song_id_url, headers=self.headers, params=params, verify=False) temp = re.findall(r’callback123(.*)’, response.text)[0][1:-1] temp = json.loads(temp) song = temp[‘data’][’lists’] return song def get_all_song(self,keyword): song = self.get_song_id(keyword) for i in song: song_id = i.get(‘EMixSongID’) song_name = i.get(‘FileName’)

print(song_name, song_id)

try: self.get_signal_music(song_id, song_name) except Exception as e: print(f’{song_name}下载失败:’, e) if name == ‘main’: kugou = kugou_music()

kugou.get_signal_music(‘j410q60’)

keyword=‘周杰伦’ kugou.get_all_song(keyword)