A Qidian (起点中文网) Crawler


A Python 3 crawler for Qidian's alternate-history (架空歷史) novel category.

This crawler attempts to collect some information about every alternate-history novel on Qidian.
The fields collected are: the book's URL, title, author, synopsis, rating, review count, and word count.

First, go to the listing page for Qidian's alternate-history novels: https://www.qidian.com/all?chanId=5&subCateId=22&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=1

Step 1: parse the listing page and extract each book's URL

On the listing page, right-click a book title and inspect it to find the book's URL in the markup.

We fetch the page source with requests.get(), extract each book's id with a regular expression, and then assemble the ids into full URLs.

# Get the book ids on one listing page
import re
import requests

def book_list():
    url = 'https://www.qidian.com/all?chanId=5&subCateId=22&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=4'
    # Fetch the page source for this listing page
    html = requests.get(url).text
    print(html)
    ren = r' <h4><a href="//book.qidian.com/info/(.*?)"'
    ren_url = re.compile(ren)
    book_url = ren_url.findall(html)
    return book_url


book_url is a list of book ids; joining each id onto the site's info-page URL pattern gives the full book URL.
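For example, assembling the full URLs might look like this (using book_list() as defined above):

# Join each id onto the info-page prefix seen in the href attribute
urls = ['https://book.qidian.com/info/' + book_id for book_id in book_list()]
print(urls[:3])  # the first three book URLs on the page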

Step 2: get the total page count

In the URL, page=1 is the first page, and the listing currently shows 1016 pages in total. Let's scrape that total instead of hard-coding it!

# Get the total page count
def get_page_Count(url):
    html = requests.get(url).text
    pageCount = re.compile(r'data-page="(.*?)"').findall(html)[-1]
    return pageCount

This pageCount is the total number of pages, i.e. the 1016 currently shown.
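Note that findall() returns strings, so cast the result to int before using it as a loop bound. A quick usage sketch:

page_count = int(get_page_Count('https://www.qidian.com/all?chanId=5&subCateId=22'))
for page in range(1, page_count + 1):  # pages 1..1016; the +1 keeps the last page
    print(page)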

Step 3: open each book's URL and scrape its details

Opening a book's URL brings up its info page.

The fields highlighted in red on that page are the ones to scrape.
First, the book title. The same old method: requests plus a regex.

# Get the book title
import re
import requests

def book_name():
    url = 'https://book.qidian.com/info/1010136878'
    html = requests.get(url).text
    ren = r' <h1>.*?<em>(.*?)</em>'
    ren_name = re.compile(ren)
    name = ren_name.findall(html)
    name = ".".join(name)
    print("書名:" + name)
    return name

book_name()

The author is extracted the same way.

# Get the author
import re
import requests

def book_authorname():
    url = 'https://book.qidian.com/info/1010136878'
    html = requests.get(url).text
    ren = r"authorName: '(.*?)'"
    ren_autorname = re.compile(ren)
    authorname = ren_autorname.findall(html)
    authorname = ".".join(authorname)
    print("作者:" + authorname)
    return authorname

book_authorname()

Then the synopsis, with the same method; just note that a synopsis contains a lot of punctuation, so the regex needs some care.

# Get the synopsis
import re
import requests

def book_summary():
    url = 'https://book.qidian.com/info/1010136878'
    html = requests.get(url).text
    ren = r'<div class="book-intro">\s+<p>\s+(.*?)\s+</p>'
    ren_symmary = re.compile(ren)
    summary = ren_symmary.findall(html)
    summary = ".".join(summary)
    # Strip everything except CJK characters, digits and common punctuation
    summary = re.sub('[^\u4e00-\u9fa51-9,。?!.、:;‘’"《》()—]', '', summary)
    print("簡介:" + summary)
    return summary

book_summary()
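One caveat with the cleanup regex: the character class uses 1-9, so any digit 0 in a synopsis is silently dropped; 0-9 is probably what was intended. A corrected cleanup line might be:

# Keep CJK characters, all ten digits, and common punctuation
summary = re.sub('[^\u4e00-\u9fa50-9,。?!.、:;‘’"《》()—]', '', summary)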

The easy parts are done. Then came the first snag: the rating is filled in by JavaScript, so it cannot be seen in the raw page source.
Stuck? Time to see how other people solved it.
I won't go into the details here; this article explains the approach: https://blog.csdn.net/ridicuturing/article/details/81123587
Here is my own code:

# Get the rating (request is urllib.request; bookrate is a module-level list, see the full code below)
def book_rate(id):
    id = "".join(id)
    url = 'https://book.qidian.com/ajax/comment/index?_csrfToken=nJO0N8zar6LMkYrhA9rwSTraUEIPhtcKkxyyF4mz&bookId=' + id + '&pageSize=15'
    rsp = request.urlopen(url)
    html = rsp.read()
    html = html.decode()
    ren = r'"rate":(.*?),'
    ren_symmary = re.compile(ren)
    rate = ren_symmary.findall(html)
    rate = ".".join(rate)
    print("評分:" + rate)
    bookrate.append(rate)
    return rate

The review count is fetched the same way as the rating.

# Get the review count (same ajax endpoint as the rating)
def book_userCount(id):
    id = "".join(id)
    url = 'https://book.qidian.com/ajax/comment/index?_csrfToken=nJO0N8zar6LMkYrhA9rwSTraUEIPhtcKkxyyF4mz&bookId=' + id + '&pageSize=15'
    rsp = request.urlopen(url)
    html = rsp.read()
    html = html.decode()
    ren = r'"userCount":(.*?),'
    ren_symmary = re.compile(ren)
    userCount = ren_symmary.findall(html)
    userCount = ".".join(userCount)
    print("評論數:" + userCount)
    bookuserCount.append(userCount)
    return userCount
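Two caveats, plus a possible refactor that is not part of the original code. First, the hardcoded _csrfToken in these URLs is presumably session-specific, so it may need refreshing if the requests start failing. Second, book_rate() and book_userCount() fetch the same ajax response twice; since both fields live in the same body, one request can serve both. A sketch, reusing the same regexes and token:

import re
import requests

# Fetch the comment ajax response once and extract both fields from it
def book_rate_and_count(book_id):
    book_id = "".join(book_id)
    url = ('https://book.qidian.com/ajax/comment/index'
           '?_csrfToken=nJO0N8zar6LMkYrhA9rwSTraUEIPhtcKkxyyF4mz'
           '&bookId=' + book_id + '&pageSize=15')
    html = requests.get(url).text
    rate = ".".join(re.findall(r'"rate":(.*?),', html))
    user_count = ".".join(re.findall(r'"userCount":(.*?),', html))
    return rate, user_count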

One problem solved, great. Then came a much harder one......
Qidian uses font-based anti-scraping: the word-count digits render as little boxes, so the number can't be read. Absolutely maddening.

Stuck again with no idea how to proceed?! Keep looking at how others solved it!
I finally found the solution, in two parts:
1. Converting the boxes into their Unicode code points: https://blog.csdn.net/qq_42336573/article/details/80698580
2. Mapping those code points to digits through the woff font file: https://blog.csdn.net/qq_35741999/article/details/82018049
After reading those two articles the cause was clear: the digits on the page are set in a custom font the computer doesn't have, which is why they can't be displayed.

So we scrape the placeholder characters that stand for the digits, then use the woff file to translate them into digits the program can recognize.
My code:

# Get the name of the woff font file
def get_woff_previous(id):
    id = "".join(id)
    start_url = "https://book.qidian.com/info/" + id
    response = requests.get(start_url).text
    doc = pq(response)
    doc = doc.text()
    a = re.compile(r'(.*?).woff').findall(doc)[0]
    font = re.compile(r'\w+').findall(a)[4]
    return font

# Get the code points of the obfuscated digit glyphs
def get_code(id):
    id = "".join(id)
    start_url = "https://book.qidian.com/info/" + id
    response = requests.get(start_url).text
    doc = pq(response)
    doc = doc.text()
    num = re.compile(r'(.*?)萬字').findall(doc)[0]
    for i in num:
        numlist.append(ord(i))  # numlist holds the code points; the boxes are converted one by one, hence a list
    return numlist

# Download the font and get its character-to-glyph-name map
def get_font():
    url = "https://qidian.gtimg.com/qd_anti_spider/" + get_woff_previous(id) + ".woff"
    response = requests.get(url)
    font = TTFont(BytesIO(response.content))
    cmap = font.getBestCmap()
    font.close()
    return cmap

# Map glyph names back to digits
def get_encode(cmap, values):
    WORD_MAP = {'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5', 'six': '6',
                'seven': '7', 'eight': '8', 'nine': '9', 'period': '.'}
    word_count = ''
    for value in values:
        key = cmap[int(value)]
        word_count += WORD_MAP[key]
    return word_count

# Get the book's word count
def get_num():
    global s, numlist
    for i in get_code(id):
        s = s + "&#" + str(i) + ";"
    s = re.compile(r'[0-9]+').findall(s)
    cmap = get_font()
    word_count = get_encode(cmap, s)
    booknum.append(word_count)
    print("字數:" + word_count + "萬字")
    s = ""        # s is the string of code points being converted
    numlist = []  # reset for the next book
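To make the decoding pipeline concrete, here is a hypothetical walk-through with made-up values; the dictionary shape matches what fontTools' getBestCmap() returns (code point to glyph name):

# Hypothetical values for illustration only
cmap = {100300: 'one', 100301: 'period', 100302: 'two'}  # from TTFont(...).getBestCmap()
codes = ['100300', '100301', '100302']                   # ord() of each placeholder, as strings
WORD_MAP = {'one': '1', 'period': '.', 'two': '2'}
print(''.join(WORD_MAP[cmap[int(c)]] for c in codes))    # prints 1.2, i.e. 1.2萬字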

With all the needed functions written, the results can now go into the database.

db = pymysql.connect(host='localhost', port=3306, user='root', password='123', db='spider', charset='utf8')
cursor = db.cursor()
sql1 = "insert into bookspider(book_url,book_name,book_author,book_summary,book_id,book_rate,book_userCount,book_num) values ('%s','%s','%s','%s','%s','%s','%s','%s')" % (
    bookurl[0], bookname[0], bookauthor[0], booksummary[0], bookid[0], bookrate[0], bookuserCount[0], booknum[0])
cursor.execute(sql1)
db.commit()
print("存入數據庫成功!")
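A side note on the insert: %-formatting the SQL string breaks as soon as a title or synopsis contains a quote character. pymysql supports parameterized queries, so a safer variant (same table and columns as above) could be:

# Let pymysql escape the values instead of %-formatting them into the SQL string
sql = ("insert into bookspider(book_url,book_name,book_author,book_summary,"
       "book_id,book_rate,book_userCount,book_num) values (%s,%s,%s,%s,%s,%s,%s,%s)")
cursor.execute(sql, (bookurl[0], bookname[0], bookauthor[0], booksummary[0],
                     bookid[0], bookrate[0], bookuserCount[0], booknum[0]))
db.commit()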

The complete code:

from urllib import request
import re
import time
import pymysql
import requests
from pyquery import PyQuery as pq
from fontTools.ttLib import TTFont
from io import BytesIO

# These lists hold the scraped fields until they are written to the database
bookurl = []
bookname = []
bookauthor = []
booksummary = []
bookid = []
bookrate = []
bookuserCount = []
booknum = []

numlist = []
s = ""
book_url_list = []

# Get the total page count
def get_page_Count(url):
    html = requests.get(url).text
    pageCount = re.compile(r'data-page="(.*?)"').findall(html)[-1]
    return pageCount

# Get the book title
def book_name(url):
    rsp = request.urlopen(url)
    html = rsp.read()
    html = html.decode()
    ren = r' <h1>.*?<em>(.*?)</em>'
    ren_name = re.compile(ren)
    name = ren_name.findall(html)
    name = ".".join(name)
    print("書名:" + name)
    bookname.append(name)
    return name

# Get the author
def book_authorname(url):
    rsp = request.urlopen(url)
    html = rsp.read()
    html = html.decode()
    ren = r"authorName: '(.*?)'"
    ren_autorname = re.compile(ren)
    authorname = ren_autorname.findall(html)
    authorname = ".".join(authorname)
    print("作者:" + authorname)
    bookauthor.append(authorname)
    return authorname

# Get the synopsis
def book_summary(url):
    rsp = request.urlopen(url)
    html = rsp.read()
    html = html.decode()
    ren = r'<div class="book-intro">\s+<p>\s+(.*?)\s+</p>'
    ren_symmary = re.compile(ren)
    summary = ren_symmary.findall(html)
    summary = ".".join(summary)
    # Keep only CJK characters, digits and common punctuation
    summary = re.sub('[^\u4e00-\u9fa51-9,。?!.、:;‘’"《》()—]', '', summary)
    print("簡介:" + summary)
    booksummary.append(summary)
    return summary

# Get the rating (from the comment ajax endpoint)
def book_rate(id):
    id = "".join(id)
    url = 'https://book.qidian.com/ajax/comment/index?_csrfToken=nJO0N8zar6LMkYrhA9rwSTraUEIPhtcKkxyyF4mz&bookId=' + id + '&pageSize=15'
    rsp = request.urlopen(url)
    html = rsp.read()
    html = html.decode()
    ren = r'"rate":(.*?),'
    ren_symmary = re.compile(ren)
    rate = ren_symmary.findall(html)
    rate = ".".join(rate)
    print("評分:" + rate)
    bookrate.append(rate)
    return rate

# Get the review count (same endpoint as the rating)
def book_userCount(id):
    id = "".join(id)
    url = 'https://book.qidian.com/ajax/comment/index?_csrfToken=nJO0N8zar6LMkYrhA9rwSTraUEIPhtcKkxyyF4mz&bookId=' + id + '&pageSize=15'
    rsp = request.urlopen(url)
    html = rsp.read()
    html = html.decode()
    ren = r'"userCount":(.*?),'
    ren_symmary = re.compile(ren)
    userCount = ren_symmary.findall(html)
    userCount = ".".join(userCount)
    print("評論數:" + userCount)
    bookuserCount.append(userCount)
    return userCount

# Get the book ids on one listing page
def book_list(i):
    url = "https://www.qidian.com/all?chanId=5&subCateId=22&orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=" + str(i)
    # Open the url and get the page back (Ctrl-click urlopen to view its documentation, parameters and usage)
    rsp = request.urlopen(url)
    # Read out the response
    html = rsp.read()
    # Decode it
    html = html.decode()
    print(html)
    ren = r' <h4><a href="//book.qidian.com/info/(.*?)"'
    ren_url = re.compile(ren)
    book_url = ren_url.findall(html)
    return book_url

# Get the name of the woff font file
def get_woff_previous(id):
    id = "".join(id)
    start_url = "https://book.qidian.com/info/" + id
    response = requests.get(start_url).text
    doc = pq(response)
    doc = doc.text()
    a = re.compile(r'(.*?).woff').findall(doc)[0]
    font = re.compile(r'\w+').findall(a)[4]
    return font

# Get the code points of the obfuscated digit glyphs
def get_code(id):
    id = "".join(id)
    start_url = "https://book.qidian.com/info/" + id
    response = requests.get(start_url).text
    doc = pq(response)
    doc = doc.text()
    num = re.compile(r'(.*?)萬字').findall(doc)[0]
    for i in num:
        numlist.append(ord(i))
    return numlist

# Download the font and get its character-to-glyph-name map
def get_font():
    url = "https://qidian.gtimg.com/qd_anti_spider/" + get_woff_previous(id) + ".woff"
    response = requests.get(url)
    font = TTFont(BytesIO(response.content))
    cmap = font.getBestCmap()
    font.close()
    return cmap

# Map glyph names back to digits
def get_encode(cmap, values):
    WORD_MAP = {'zero': '0', 'one': '1', 'two': '2', 'three': '3', 'four': '4', 'five': '5', 'six': '6',
                'seven': '7', 'eight': '8', 'nine': '9', 'period': '.'}
    word_count = ''
    for value in values:
        key = cmap[int(value)]
        word_count += WORD_MAP[key]
    return word_count

# Get the book's word count
def get_num():
    global s, numlist
    for i in get_code(id):
        s = s + "&#" + str(i) + ";"
    s = re.compile(r'[0-9]+').findall(s)
    cmap = get_font()
    word_count = get_encode(cmap, s)
    booknum.append(word_count)
    print("字數:" + word_count + "萬字")
    s = ""        # s is the string of code points being converted
    numlist = []  # reset for the next book

# Use urllib.request to fetch each page and print out its contents
if __name__ == '__main__':
    # The listing to crawl
    page_url = 'https://www.qidian.com/all?chanId=5&subCateId=22'
    page_Count = int(get_page_Count(page_url))
    for i in range(1, page_Count + 1):  # +1 so the last listing page is included
        for j in book_list(i):
            url = 'https://book.qidian.com/info/' + j
            print(url)  # print the book URL
            bookurl.append(url)
            book_name(url)
            book_authorname(url)
            book_summary(url)
            id = re.compile(r'[0-9]+').findall(url)
            print("編號:" + "".join(id))
            bookid.append("".join(id))
            book_rate(id)
            book_userCount(id)
            get_num()
            # Write this book's record to the database
            db = pymysql.connect(host='localhost', port=3306, user='root', password='123', db='spider', charset='utf8')
            cursor = db.cursor()
            sql1 = "insert into bookspider(book_url,book_name,book_author,book_summary,book_id,book_rate,book_userCount,book_num) values ('%s','%s','%s','%s','%s','%s','%s','%s')" % (
                bookurl[0], bookname[0], bookauthor[0], booksummary[0], bookid[0], bookrate[0], bookuserCount[0], booknum[0])
            cursor.execute(sql1)
            db.commit()
            print("存入數據庫成功!")
            print()  # blank line between books
            # Clear the per-book lists
            bookurl = []
            bookname = []
            bookauthor = []
            booksummary = []
            bookid = []
            bookrate = []
            bookuserCount = []
            booknum = []
            # time.sleep(10)  # polite delay between requests; best to enable this


In the end, this was the first crawler I completed entirely on my own, start to finish. It took two whole days, which honestly is pretty slow... but the payoff was big! I learned a lot, especially about anti-scraping techniques. It was also my first time using the requests library; before this I had only used urllib, which is why the code mixes requests and urllib. requests feels much more convenient. As for the font-mapping part, I still don't entirely understand how it was written, but at least I can use it. This simple crawler still has a lot of room for improvement: there is no try/except yet, no live updating, and it isn't distributed. A long road ahead!
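For instance, the missing error handling could wrap the per-book work in the main loop so that one bad page doesn't kill the whole crawl. A sketch using the functions defined in the complete code above:

for j in book_list(i):
    url = 'https://book.qidian.com/info/' + j
    try:
        book_name(url)
        book_authorname(url)
        book_summary(url)
        # ...rating, review count, word count, and database insert as above...
    except Exception as e:
        # Log the failure and move on to the next book
        print("failed on " + url + ": " + str(e))
        continue
    time.sleep(10)  # the polite delay the comment above recommends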
If anyone reading this article finds something unclear, feel free to ask me, though I can't promise I'll be able to answer, hahaha~

When you hit a problem you can't solve, search around online; plenty of people have stepped into the same pit before you, so a solution is almost certainly out there. Don't give up!
