Scraping the Four Great Classical Novels

Published: 2023/12/16, in Programming Q&A
生活随笔 collected this article, which covers scraping the Four Great Classical Novels. The editor found it useful and is sharing it here for reference.
'''
shicimingju.com (詩詞名句網)
1. Scrape a fixed book
2. Scrape the book title
3. Scrape the book's chapter list
4. Generalize: scrape the chapter list of any book
5. Add exception handling
6. Scrape any complete book
'''
import re

import requests


def bookSpider(oldurl, bookName):
    # The index page is <oldurl>.html; chapter pages are <oldurl>/<n>.html.
    url = oldurl + ".html"
    html = loadPage(url)
    if html is None:
        return
    try:
        with open("demo.txt", 'w', encoding='utf-8') as f:
            f.write(html)
    except OSError:
        print("FILE OPERATION ERROR")
        return
    findTitle("demo.txt", bookName)
    cnt = findTileOfPages("demo.txt", bookName)
    getWholeBook(oldurl, bookName, cnt)


def findTitle(filename, bookName):
    # Write the book title (the 《...》 part of <title>) as the first line of book.txt.
    pattern = re.compile(r'<title>(《.{0,10}》)')
    try:
        with open(filename, encoding='utf-8') as f, \
                open("book.txt", 'w', encoding='utf-8') as book:
            for line in f:
                match = pattern.search(line)
                if match:
                    print("Title:", match.group(1))
                    book.write(match.group(1) + '\n')
    except OSError:
        print("FILE OPERATION ERROR")


def findTileOfPages(filename, bookName):
    # Append the chapter titles to book.txt and return how many were found.
    cnt = 0
    pattern = re.compile(
        r'<li><a href="/book/' + re.escape(bookName)
        + r'/\d+\.html">(.{10,40})</a></li>'
    )
    try:
        with open(filename, encoding='utf-8') as f, \
                open("book.txt", 'a', encoding='utf-8') as book:
            book.write("Contents:\n")
            for line in f:
                for title in pattern.findall(line):
                    cnt += 1
                    print(title)
                    book.write(title + '\n')
    except OSError:
        print("FILE OPERATION ERROR")
    return cnt


def getWholeBook(url, bookName, cnt):
    # Download chapters 1..cnt and append each heading and paragraph to book.txt.
    print("Downloading the whole book, please wait...")
    patternOfTitle = re.compile(r'<h1>.+</h1>')
    # Text that follows the four (or more) leading &nbsp; of a paragraph, up to the next tag.
    pattern = re.compile(r'<p>(?:&nbsp;){4,}([^<]+)')
    for i in range(1, cnt + 1):
        newUrl = url + '/' + str(i) + ".html"
        print(newUrl)
        html = loadPage(newUrl)
        if html is None:
            continue
        try:
            with open("bookHtml.txt", 'w', encoding='utf-8') as f:
                f.write(html)
            with open("bookHtml.txt", encoding='utf-8') as f, \
                    open("book.txt", 'a', encoding='utf-8') as bookContent:
                for line in f:
                    for heading in patternOfTitle.findall(line):
                        # Strip the tags so only the heading text is written.
                        bookContent.write(re.sub(r'<[^>]*>', '', heading) + '\n')
                    for paragraph in pattern.findall(line):
                        bookContent.write(paragraph + '\n')
        except OSError:
            print("FILE OPERATION ERROR")


def loadPage(url):
    # Return the page decoded as UTF-8, or None if the request or decoding fails.
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    try:
        response = requests.get(url, headers=header, timeout=10)
        response.raise_for_status()
        return response.content.decode('utf-8')
    except (requests.RequestException, UnicodeDecodeError):
        print("PAGE LOAD ERROR")
        return None


if __name__ == "__main__":
    bookName = input("Enter the name of the book to read (full pinyin): ")
    url = "http://www.shicimingju.com/book/" + bookName
    bookSpider(url, bookName)
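The two index-page patterns in the script (the `<title>《...》` match and the chapter-link match) can be tried offline before hitting the site. The sketch below is only an illustration: `SAMPLE_HTML` is a hand-written fragment, and `parse_title` and `parse_chapters` are hypothetical helper names, so the real page markup may well differ.

```python
import re

# Hand-written sample, assumed to resemble the index page; not fetched from the site.
SAMPLE_HTML = (
    '<title>《三国演义》全文在线阅读</title>\n'
    '<li><a href="/book/sanguoyanyi/1.html">第一回 宴桃园豪杰三结义 斩黄巾英雄首立功</a></li>\n'
    '<li><a href="/book/sanguoyanyi/2.html">第二回 张翼德怒鞭督邮 何国舅谋诛宦竖</a></li>\n'
)


def parse_title(html):
    """Return the 《...》 part of the <title> tag, or None if it is absent."""
    match = re.search(r'<title>(《.{0,10}》)', html)
    return match.group(1) if match else None


def parse_chapters(html, book_name):
    """Return (chapter number, chapter title) pairs for the given book."""
    pattern = re.compile(
        r'<li><a href="/book/' + re.escape(book_name)
        + r'/(\d+)\.html">(.{10,40})</a></li>'
    )
    return [(int(number), title) for number, title in pattern.findall(html)]


if __name__ == "__main__":
    print(parse_title(SAMPLE_HTML))
    for number, title in parse_chapters(SAMPLE_HTML, "sanguoyanyi"):
        print(number, title)
```

Capturing the chapter number along with the title means the chapter URLs could be built from the links actually present on the page, rather than assuming they run from 1 to the chapter count without gaps.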


Reposted from: https://www.cnblogs.com/TheSilverMoon/p/11143203.html
