
Scraping CNKI Articles with Python 3

Published: 2024/1/1


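Analysis

First, look at the search entry page. Its URL is the same before and after a keyword search, so requesting that URL directly will not return the articles.
(Screenshot: before searching)

(Screenshot: after searching)

So we need a different approach. With the browser's developer tools open, the following request shows up.

Comparing it with what the page displays, we can be fairly sure this is the URL we want to scrape. Its URL is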
https://kns.cnki.net/kns/brief/brief.aspx?pagename=ASP.brief_default_result_aspx&isinEn=1&dbPrefix=SCDB&dbCatalog=%e4%b8%ad%e5%9b%bd%e5%ad%a6%e6%9c%af%e6%96%87%e7%8c%ae%e7%bd%91%e7%bb%9c%e5%87%ba%e7%89%88%e6%80%bb%e5%ba%93&ConfigFile=SCDBINDEX.xml&research=off&t=1572329280069&keyValue=%E8%AE%A1%E7%AE%97%E6%9C%BA%E5%9B%BE%E5%BD%A2%E5%AD%A6&S=1&sorttype=

Its parameters are as follows.
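As a rough illustration, the parameters can be listed by parsing the query string of the URL above with the standard library. This is only a sketch: the variable names are my own, and nothing here is specific to CNKI beyond the URL itself.

```python
from urllib.parse import urlsplit, parse_qsl

# The result URL captured in the developer tools (copied from above).
RESULT_URL = (
    "https://kns.cnki.net/kns/brief/brief.aspx"
    "?pagename=ASP.brief_default_result_aspx&isinEn=1&dbPrefix=SCDB"
    "&dbCatalog=%e4%b8%ad%e5%9b%bd%e5%ad%a6%e6%9c%af%e6%96%87%e7%8c%ae"
    "%e7%bd%91%e7%bb%9c%e5%87%ba%e7%89%88%e6%80%bb%e5%ba%93"
    "&ConfigFile=SCDBINDEX.xml&research=off&t=1572329280069"
    "&keyValue=%E8%AE%A1%E7%AE%97%E6%9C%BA%E5%9B%BE%E5%BD%A2%E5%AD%A6"
    "&S=1&sorttype="
)

# keep_blank_values=True keeps empty parameters such as sorttype.
# parse_qsl also percent-decodes the values (as UTF-8 by default).
params = dict(parse_qsl(urlsplit(RESULT_URL).query, keep_blank_values=True))

for name, value in params.items():
    print(f"{name} = {value!r}")
```

The decoded keyValue is the search keyword, and dbCatalog is the same database name that appears later in the form data.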

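So can we reach the articles just by constructing these parameters? Not necessarily. Opening the link above directly may return a message like this.

To get the content straight from that link, you first have to run a keyword search, just as in a normal visit to CNKI. Only then does CNKI's server treat you as a known user (the search is tied to your session) and return data. So we need to find the POST request and send it a set of parameters that includes the keyword. Back in the developer tools, we find the following request.

From its name and the fields it transmits, it is clear that this is the URL we need to POST to.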

Implementation

1. Send the POST request

The CNKI form data:

    zhiwangFormdata = {
        'action': '',
        'ua': '1.11',
        'isinEn': '1',
        'PageName': 'ASP.brief_default_result_aspx',
        'DbPrefix': 'SCDB',
        'DbCatalog': '中国学术文献网络出版总库',
        'ConfigFile': 'SCDBINDEX.xml',
        'db_opt': 'CJFQ,CDFD,CMFD,CPFD,IPFD,CCND,CCJD',
        'txt_1_sel': 'SU$%=|',
        'txt_1_value1': '',
        'txt_1_special1': '%',
        'his': '0',
    }

Send the POST:

    import requests

    # One session for every request, so the search stays tied to our cookies.
    session = requests.session()
    post_url = 'https://kns.cnki.net/kns/request/SearchHandler.ashx'
    # keyWord is the search term entered by the user.
    zhiwangFormdata['txt_1_value1'] = keyWord
    response = session.post(post_url, data=zhiwangFormdata)
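The snippet above writes the keyword into the shared dictionary. A small variation, sketched here under the assumption that you may want to run several searches in one process, is to copy the template for each search. The names FORM_TEMPLATE and build_form_data are hypothetical helpers, not part of the original code.

```python
# Template copied from the form data above; 'txt_1_value1' carries the keyword.
FORM_TEMPLATE = {
    'action': '',
    'ua': '1.11',
    'isinEn': '1',
    'PageName': 'ASP.brief_default_result_aspx',
    'DbPrefix': 'SCDB',
    'DbCatalog': '中国学术文献网络出版总库',
    'ConfigFile': 'SCDBINDEX.xml',
    'db_opt': 'CJFQ,CDFD,CMFD,CPFD,IPFD,CCND,CCJD',
    'txt_1_sel': 'SU$%=|',
    'txt_1_value1': '',
    'txt_1_special1': '%',
    'his': '0',
}


def build_form_data(keyword):
    """Return a fresh form-data dict for one search (hypothetical helper)."""
    data = dict(FORM_TEMPLATE)  # a shallow copy is enough: all values are strings
    data['txt_1_value1'] = keyword
    return data


form = build_form_data('计算机图形学')
print(form['txt_1_value1'])
```

The returned dictionary can be passed as data= to session.post exactly as in the snippet above, and the template itself is never modified.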

2. Fetch the first page of article information with Requests

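Note: pq here is PyQuery from the pyquery package, and quote comes from urllib.parse. The snippet runs inside the get_all(keyWord) function, which is why it can use return.

    # The POST response text is used as the pagename parameter.
    url = ('https://kns.cnki.net/kns/brief/brief.aspx?pagename=' + response.text
           + '&keyValue=' + quote(keyWord) + '&S=1&sorttype=')
    response = session.get(url)
    html = pq(response.text)
    # The pager text reads like "找到 N 条结果"; strip the words to keep N.
    totalNumber = (html.find('div.pageBar_min>div.pagerTitleCell').text()
                   .replace('找到', '').replace('条结果', '').strip())
    # No papers found
    if totalNumber == '0':
        print('Nothing found')
        return
    print('Crawling page 1')
    get_detail(html)

Get the article details:

    def get_detail(html):
        allItems = html.find('table.GridTableContent>tr').items()
        count = 0
        for item in allItems:
            # Skip the first row, which is the table header.
            if count == 0:
                count = 1
                continue
            n_l = item('td:nth-child(2)>a')
            # Get the article name and link
            article = {
                'name': n_l.text().replace('\n', ''),
                'link': 'https://kns.cnki.net' + n_l.attr('href').replace('/kns', '/KCMS'),
            }
            print(article)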
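The total is extracted above by stripping the surrounding words and comparing the remainder with the string '0'. As a hedged alternative, the number can be pulled out with a regular expression and returned as an integer. The helper name parse_total is hypothetical, and the exact pager wording and the possibility of a thousands separator are assumptions about the page, not something the original code confirms.

```python
import re


def parse_total(pager_text):
    """Extract the result count from pager text such as '找到 1,234 条结果'.

    Returns 0 when the text contains no number (hypothetical helper).
    """
    match = re.search(r'\d[\d,]*', pager_text)
    if match is None:
        return 0
    # Drop any thousands separators before converting.
    return int(match.group().replace(',', ''))


print(parse_total('找到 1,234 条结果'))
```

With an integer result, the "nothing found" check becomes a plain comparison with 0, and it no longer depends on the exact words around the number.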

3. Build the URLs of the remaining pages and crawl them

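The HTML of the first page also contains the link to the next page. By extracting that link and rewriting its page number, we can keep crawling page after page.

    # Find the total number of pages (the page mark reads "current/total")
    totalPage = html.find('span.countPageMark').text().split('/')[1]
    totalPage = int(totalPage)
    # Start from page 2 and crawl through totalPage
    count = 2
    # Get the generic link (it points at page 2)
    nextUrl = ('https://kns.cnki.net/kns/brief/brief.aspx'
               + html.find('div.TitleLeftCell>a').attr('href'))
    while count <= totalPage:
        print('Crawling page ' + str(count))
        # Build the URL for this page
        trueUrl = nextUrl.replace('curpage=2', 'curpage=' + str(count))
        response = session.get(trueUrl)
        # A redirect means a CAPTCHA appeared: handle it, then retry the same page
        if response.url != trueUrl:
            yanzhenma(session, response.url)
            continue
        html = pq(response.text)
        get_detail(html)
        count += 1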
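The page URL above is built with a string replace on 'curpage=2'. A sketch of the same step using urllib.parse follows; it rewrites the curpage parameter by name instead of relying on that literal substring. The helper name set_page and the example query string are hypothetical; only the curpage parameter comes from the original code.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def set_page(url, page):
    """Return url with its curpage query parameter set to page."""
    parts = urlsplit(url)
    query = [
        (name, str(page) if name == 'curpage' else value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical "next page" link; the real href comes from div.TitleLeftCell>a.
next_url = ('https://kns.cnki.net/kns/brief/brief.aspx'
            '?curpage=2&RecordsPerPage=20&QueryID=0&turnpage=1')
print(set_page(next_url, 12))
```

One caveat: urlencode re-encodes every value, so a link with unusual characters in its query string may come back encoded differently from the original. If the server is sensitive to that, the plain string replace is the safer choice.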

4. CAPTCHA handling

While crawling CNKI we also have to deal with CAPTCHAs. My approach is to download the CAPTCHA image, save it locally, recognize it (I use Baidu AI Cloud's OCR), and then send the server a request that carries the recognized text. To get the CAPTCHA image, we build a URL containing a random number and request it. The verification request is simple to build: append the recognized text to the CAPTCHA page URL, and that is the URL to request.

    # CAPTCHA handling function
    def yanzhenma(session, url):
        print('Recognizing CAPTCHA...')
        session.get(url)
        response = session.get('https://kns.cnki.net/kns/checkcode.aspx?t='
                               + quote("'" + str(random())))
        with open('image.jpg', 'wb') as image:
            image.write(response.content)
        result = verify('image.jpg').lower()
        # Or open image.jpg and type the text in manually:
        # result = input().lower()
        print('CAPTCHA is ' + result)
        requestUrl = url + '&vericode=' + quote(result)
        session.get(requestUrl)

CAPTCHA recognition functions:

    def convertimg(path):
        img = Image.open(path)
        width, height = img.size
        # At this threshold the compressed image is roughly 200-odd KB
        while width * height > 4000000:
            width = width // 2
            height = height // 2
        new_img = img.resize((width, height), Image.BILINEAR)
        format = path.split('.')[1]
        new_img.convert('RGB').save('temp.' + format)

    def baiduOCR(path):
        APP_ID = 'your APP_ID'
        API_KEY = 'your API_KEY'
        SECRET_KEY = 'your SECRET_KEY'
        client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
        format = path.split('.')[1]
        # Read the image that convertimg saved
        with open('temp.' + format, 'rb') as i:
            img = i.read()
        message = client.basicGeneral(img)  # general text recognition, 50,000 free calls per day
        # message = client.basicAccurate(img)  # high-accuracy text recognition, 800 free calls per day
        words = message.get('words_result') or []
        if len(words) == 0:
            return ''
        return words[0]["words"]

    def verify(path):
        convertimg(path)
        result = baiduOCR(path)
        return result
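OCR can misread a CAPTCHA, in which case the crawl loop would keep retrying the same page indefinitely. The sketch below caps the number of attempts. It is an illustration only: fetch_page and its two callable parameters are hypothetical, chosen so the retry logic can be shown without any network access. It uses the same redirect check as the crawl loop (the final URL differing from the requested one).

```python
def fetch_page(get, solve_captcha, url, max_attempts=3):
    """Fetch url, solving a CAPTCHA and retrying when redirected.

    get(url) must return a (final_url, text) pair.
    solve_captcha(final_url) is called when final_url differs from url.
    Both callables are hypothetical stand-ins for the session calls.
    """
    for _ in range(max_attempts):
        final_url, text = get(url)
        if final_url == url:
            # No redirect: this is the real result page.
            return text
        # Redirected to the CAPTCHA page: solve it, then try again.
        solve_captcha(final_url)
    raise RuntimeError(
        f'still blocked by CAPTCHA after {max_attempts} attempts: {url}'
    )
```

With the code from this article, get could wrap session.get and return (response.url, response.text), and solve_captcha could call yanzhenma(session, final_url).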

Finally, here is the complete code.

    # CNKI
    from pyquery import PyQuery as pq
    from urllib.parse import quote
    from random import random
    from aip import AipOcr
    from PIL import Image
    import requests

    zhiwangFormdata = {
        'action': '',
        'ua': '1.11',
        'isinEn': '1',
        'PageName': 'ASP.brief_default_result_aspx',
        'DbPrefix': 'SCDB',
        'DbCatalog': '中国学术文献网络出版总库',
        'ConfigFile': 'SCDBINDEX.xml',
        'db_opt': 'CJFQ,CDFD,CMFD,CPFD,IPFD,CCND,CCJD',
        'txt_1_sel': 'SU$%=|',
        'txt_1_value1': '',
        'txt_1_special1': '%',
        'his': '0',
    }

    def convertimg(path):
        img = Image.open(path)
        width, height = img.size
        # At this threshold the compressed image is roughly 200-odd KB
        while width * height > 4000000:
            width = width // 2
            height = height // 2
        new_img = img.resize((width, height), Image.BILINEAR)
        format = path.split('.')[1]
        new_img.convert('RGB').save('temp.' + format)

    def baiduOCR(path):
        APP_ID = 'your APP_ID'
        API_KEY = 'your API_KEY'
        SECRET_KEY = 'your SECRET_KEY'
        client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
        format = path.split('.')[1]
        with open('temp.' + format, 'rb') as i:
            img = i.read()
        message = client.basicGeneral(img)  # general text recognition, 50,000 free calls per day
        # message = client.basicAccurate(img)  # high-accuracy text recognition, 800 free calls per day
        words = message.get('words_result') or []
        if len(words) == 0:
            return ''
        return words[0]["words"]

    def verify(path):
        convertimg(path)
        result = baiduOCR(path)
        return result

    def get_all(keyWord):
        session = requests.session()
        post_url = 'https://kns.cnki.net/kns/request/SearchHandler.ashx'
        zhiwangFormdata['txt_1_value1'] = keyWord
        response = session.post(post_url, data=zhiwangFormdata)
        url = ('https://kns.cnki.net/kns/brief/brief.aspx?pagename=' + response.text
               + '&keyValue=' + quote(keyWord) + '&S=1&sorttype=')
        response = session.get(url)
        html = pq(response.text)
        totalNumber = (html.find('div.pageBar_min>div.pagerTitleCell').text()
                       .replace('找到', '').replace('条结果', '').strip())
        # No papers found
        if totalNumber == '0':
            print('Nothing found')
            return
        print('Crawling page 1')
        get_detail(html)
        totalPage = html.find('span.countPageMark').text().split('/')[1]
        totalPage = int(totalPage)
        # Start from page 2 and crawl through totalPage
        count = 2
        # Get the generic link (it points at page 2)
        nextUrl = ('https://kns.cnki.net/kns/brief/brief.aspx'
                   + html.find('div.TitleLeftCell>a').attr('href'))
        while count <= totalPage:
            print('Crawling page ' + str(count))
            # Build the URL for this page
            trueUrl = nextUrl.replace('curpage=2', 'curpage=' + str(count))
            response = session.get(trueUrl)
            # A redirect means a CAPTCHA appeared: handle it, then retry the same page
            if response.url != trueUrl:
                yanzhenma(session, response.url)
                continue
            html = pq(response.text)
            get_detail(html)
            count += 1

    def yanzhenma(session, url):
        print('Recognizing CAPTCHA...')
        session.get(url)
        response = session.get('https://kns.cnki.net/kns/checkcode.aspx?t='
                               + quote("'" + str(random())))
        with open('image.jpg', 'wb') as image:
            image.write(response.content)
        # result = verify('image.jpg').lower()
        # Or open image.jpg and type the text in manually
        result = input().lower()
        print('CAPTCHA is ' + result)
        requestUrl = url + '&vericode=' + quote(result)
        session.get(requestUrl)

    def get_detail(html):
        allItems = html.find('table.GridTableContent>tr').items()
        count = 0
        for item in allItems:
            # Skip the first row, which is the table header.
            if count == 0:
                count = 1
                continue
            n_l = item('td:nth-child(2)>a')
            article = {
                'name': n_l.text().replace('\n', ''),
                'link': 'https://kns.cnki.net' + n_l.attr('href').replace('/kns', '/KCMS'),
            }
            print(article)

    def run():
        keyWord = input('Enter a search term: ')
        get_all(keyWord)

    if __name__ == '__main__':
        run()

Test results
