[Python Crawler] A Simple Test of Scraping CNKI Information with BeautifulSoup and Selenium

Published: 2024/5/28 | python | 117 | 豆豆


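The author has recently been studying complex networks and knowledge graphs, and wanted to scrape paper metadata from CNKI (知网) for analysis: title, abstract, publisher, year, download count, citation count, author information, and so on. Scraping CNKI papers ran into the following problems:
  1. The scraped content was always empty. The cause is that the data is loaded dynamically and could not be located, so the author switched to the CNKI3.0 site and scraped that instead;
  2. The CNKI3.0 results, however, contain no author information. Getting it requires navigating to each detail page and reading the authors there, which led to a new problem.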

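I. Site Analysis and Element Location

The CNKI site is at: http://nvsm.cnki.net/kns/brief/default_result.aspx
For example, searching for the keyword Python returns the page shown below, with 2,681 articles.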

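However, the paper content located and scraped with Selenium was always empty. I later came across a blog post by qiuqingyun that pointed to another CNKI interface (CNKI3.0 Knowledge Search: http://search.cnki.net/).
I strongly recommend reading his original post: http://qiuqingyu.cn/2017/04/27/python實現CNKI知網爬蟲/
A search for python there returns the page shown below, with 3,428 papers in total.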

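Next, a brief walk through the analysis. The approach is the usual one: locate elements by analyzing the nodes of the DOM tree. Right-clicking in the browser and choosing Inspect Element shows the structure below: each page contains 15 papers, and each paper sits under a <div class="wz_tab"> tag.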

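Inspecting a single entry, as shown below, gives the following locators:
  1. Title: the <h3> tag under <div class="wz_content">, which also provides the URL;
  2. Abstract: the content of <div class="width715">;
  3. Source: the title attribute of the node under <span class="year-count">; the year is extracted with a regular expression;
  4. Download and citation counts: <span class="count">; they are the first and second numbers extracted from its text.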
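The year and count extraction in items 3 and 4 can be sketched in isolation, without touching the network. This is a minimal Python 3 sketch, not the code used in this post: the function names `parse_year` and `parse_counts` are hypothetical, the source string is the example quoted in this post's own listing, and the assumption that the counts appear inside parentheses (as in `(100) ()`) is inferred from a comment in the Selenium listing and may not match the live page.

```python
import re

# Four digits immediately followed by 年, so digits inside a journal
# name are not mistaken for the year.
YEAR_RE = re.compile(r"(\d{4})年")
# One parenthesized group per counter; ASCII and full-width parentheses
# are both accepted. An empty group means the number is missing.
COUNT_RE = re.compile(r"[(（]\s*(\d*)\s*[)）]")


def parse_year(source_text):
    """Return the publication year as an int, or None if none is found."""
    match = YEAR_RE.search(source_text)
    return int(match.group(1)) if match else None


def parse_counts(count_text):
    """Return (downloads, citations); a missing number becomes None."""
    values = [int(g) if g else None for g in COUNT_RE.findall(count_text)]
    values += [None] * (2 - len(values))  # pad when fewer than two groups
    return values[0], values[1]


if __name__ == "__main__":
    print(parse_year("《北方文學(下旬)》 2017年第06期"))  # 2017
    print(parse_counts("(100) ()"))                      # (100, None)
```

Matching each parenthesized group separately keeps the position of a missing number, whereas collecting all digits in the string cannot tell a missing download count from a missing citation count.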

The next two sections present the crawler in two forms, BeautifulSoup and Selenium.

II. BeautifulSoup Crawler

The complete BeautifulSoup code (Python 2) is as follows:

    # -*- coding: utf-8 -*-
    import time
    import re
    import urllib
    from bs4 import BeautifulSoup

    # Main
    if __name__ == '__main__':

        url = "http://search.cnki.net/Search.aspx?q=python&rank=relevant&cluster=all&val=&p=0"
        content = urllib.urlopen(url).read()
        soup = BeautifulSoup(content, "html.parser")

        # Locate the paper entries
        wz_tab = soup.find_all("div", class_="wz_tab")
        num = 0
        for tab in wz_tab:
            # Title
            title = tab.find("h3")
            print title.get_text()
            urls = tab.find("h3").find_all("a")
            # Hyperlink to the detail page
            flag = 0
            for u in urls:
                if flag==0:    # only take the first URL
                    print u.get('href')
                    flag += 1
            # Abstract
            abstract = tab.find(attrs={"class":"width715"}).get_text()
            print abstract
            # Other information
            other = tab.find(attrs={"class":"year-count"})
            content = other.get_text().split("\n")
            """
            The text cannot be split on the two spaces, e.g.:
                《懷化學院學報》  2017年第09期
            so the title attribute is read to get the journal name:
                <span title="北方文學(下旬)">《北方文學(下旬)》  2017年第06期</span>
            """
            # Journal + year
            cb_from = other.find_all("span")
            flag = 0
            for u in cb_from:
                if flag==0:    # read the title attribute
                    print u.get("title")
                    flag += 1
            mode = re.compile(r'\d+\.?\d*')
            number = mode.findall(content[0])
            print number[0]    # year
            # Download count and citation count
            mode = re.compile(r'\d+\.?\d*')
            number = mode.findall(content[1])
            if len(number)==1:
                print number[0]
            elif len(number)==2:
                print number[0], number[1]
            num = num + 1

The output is shown in the figure below:

However, the scraped URLs do not lead to the detail page; they always show the login page. One example is "http://epub.cnki.net/kns/detail/detail.aspx?filename=DZRU2017110705G&dbname=CAPJLAST&dbcode=cjfq", whereas the URL that displays correctly is "http://www.cnki.net/KCMS/detail/detail.aspx?filename=DZRU2017110705G&dbname=CAPJLAST&dbcode=CJFQ&urlid=&yx=&v=MTc2ODltUm42ajU3VDNmbHFXTTBDTEw3UjdxZVlPZHVGeTdsVXJ6QUpWZz1JVGZaZbzlDWk81NFl3OU16".
The result is shown in the figure below:

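Solution: I plan to use Selenium to locate the hyperlink and navigate by simulating a mouse click, so as to reach the detail page and read the author or keyword information.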
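The difference between the two detail URLs quoted above can be made concrete by parsing their query strings. The following Python 3 sketch uses only the standard library; the helper names `describe` and `diff_urls` are hypothetical. It only reports how the URLs differ. It does not show that rewriting one into the other avoids the login page, and the guess that `v` is a server-issued token is an assumption.

```python
from urllib.parse import urlsplit, parse_qs

# The two URLs quoted in this post.
SCRAPED = ("http://epub.cnki.net/kns/detail/detail.aspx"
           "?filename=DZRU2017110705G&dbname=CAPJLAST&dbcode=cjfq")
WORKING = ("http://www.cnki.net/KCMS/detail/detail.aspx"
           "?filename=DZRU2017110705G&dbname=CAPJLAST&dbcode=CJFQ"
           "&urlid=&yx=&v=MTc2ODltUm42ajU3VDN"
           "mbHFXTTBDTEw3UjdxZVlPZHVGeTdsVXJ6QUpWZz1JVGZaZbzlDWk81NFl3OU16")


def describe(url):
    """Split a URL into host, path, and a flat dict of query parameters."""
    parts = urlsplit(url)
    # keep_blank_values keeps empty parameters such as "urlid=".
    query = parse_qs(parts.query, keep_blank_values=True)
    return parts.netloc, parts.path, {k: v[0] for k, v in query.items()}


def diff_urls(first, second):
    """Report host, path, and query-parameter differences between two URLs."""
    host_a, path_a, query_a = describe(first)
    host_b, path_b, query_b = describe(second)
    return {
        "host": (host_a, host_b),
        "path": (path_a, path_b),
        "only_in_second": sorted(set(query_b) - set(query_a)),
        "changed": {k: (query_a[k], query_b[k])
                    for k in query_a
                    if k in query_b and query_a[k] != query_b[k]},
    }


if __name__ == "__main__":
    for key, value in diff_urls(SCRAPED, WORKING).items():
        print(key, value)
```

The two URLs share `filename` and `dbname`, differ in host, path, and the case of `dbcode`, and the working one carries three extra parameters (`urlid`, `yx`, `v`). Since `v` presumably cannot be derived from the scraped link alone, clicking through in a real browser session is a reasonable fallback.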

III. Selenium Crawler

The scraping code is as follows:

    # -*- coding: utf-8 -*-
    import time
    import re
    import sys
    import codecs
    import urllib
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys

    # Main
    if __name__ == '__main__':

        url = "http://search.cnki.net/Search.aspx?q=python&rank=relevant&cluster=all&val=&p=0"
        driver = webdriver.Firefox()
        driver.get(url)
        # Titles
        content = driver.find_elements_by_xpath("//div[@class='wz_content']/h3")
        # Abstracts
        abstracts = driver.find_elements_by_xpath("//div[@class='width715']")
        # Journal + year
        other = driver.find_elements_by_xpath("//span[@class='year-count']/span[1]")
        mode = re.compile(r'\d+\.?\d*')
        # Download count and citation count
        num = driver.find_elements_by_xpath("//span[@class='count']")

        # Print the content
        i = 0
        for tag in content:
            print tag.text
            print abstracts[i].text
            print other[i].get_attribute("title")
            number = mode.findall(other[i].text)
            print number[0]    # year
            number = mode.findall(num[i].text)
            if len(number)==1:    # a number may be missing, e.g. (100) ()
                print number[0]
            elif len(number)==2:
                print number[0],number[1]
            print ''
            i = i + 1
            tag.click()
            time.sleep(1)

The output is as follows:
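    >>>
    網絡資源輔助下的Python程序設計教學
    This paper classifies the online resources for learning Python, describes the characteristics of each class, introduces several distinctive learning websites in detail, and discusses Python learning assisted by online resources. It explains that high-quality online resources can improve classroom teaching and make teaching more vivid, intuitive, and interactive. It also notes that using these resources facilitates students' programming practice and gives students more time and opportunity for hands-on programming, achieving in programming teaching...
    電子技術與軟件工程
    2017
    11 0

    Python虛擬機內存管理的研究
    The simplicity and ease of learning of dynamic languages shorten development cycles, which makes them popular with developers. They are widely used in machine learning, scientific computing, web development, and other fields. Among the many dynamic languages, Python has one of the larger user bases. This paper studies how Python manages memory resources. Python offers high development efficiency, but its runtime efficiency is often criticized, mainly because of a language implementation in which everything is an object...
    南京大學
    2014
    156 0

The next step is to click through to each detail page and switch windows to capture its information. The code is as follows: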
    # -*- coding: utf-8 -*-
    import time
    import re
    import sys
    import codecs
    import urllib
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys

    # Main
    if __name__ == '__main__':

        url = "http://search.cnki.net/Search.aspx?q=python&rank=relevant&cluster=all&val=&p=0"
        driver = webdriver.Firefox()
        driver.get(url)
        # Titles
        content = driver.find_elements_by_xpath("//div[@class='wz_content']/h3")
        # Abstracts
        abstracts = driver.find_elements_by_xpath("//div[@class='width715']")
        # Journal + year
        other = driver.find_elements_by_xpath("//span[@class='year-count']/span[1]")
        mode = re.compile(r'\d+\.?\d*')
        # Download count and citation count
        num = driver.find_elements_by_xpath("//span[@class='count']")

        # Handle of the current (main) window
        now_handle = driver.current_window_handle

        # Print the content
        i = 0
        for tag in content:
            print tag.text
            print abstracts[i].text
            print other[i].get_attribute("title")
            number = mode.findall(other[i].text)
            print number[0]    # year
            number = mode.findall(num[i].text)
            if len(number)==1:    # a number may be missing, e.g. (100) ()
                print number[0]
            elif len(number)==2:
                print number[0],number[1]
            print ''
            i = i + 1
            tag.click()
            time.sleep(2)

            # After the jump, get all window handles
            all_handles = driver.window_handles

            # Two windows are open; switch to the one that is not the main window
            for handle in all_handles:
                if handle!=now_handle:
                    # Print the handle of the window being selected
                    print handle
                    driver.switch_to_window(handle)
                    time.sleep(1)

                    print u'Popup window info'
                    print driver.current_url
                    print driver.title

                    # Read the summary block of the detail page
                    elem_sub = driver.find_element_by_xpath("//div[@class='summary pad10']")
                    print u"Authors", elem_sub.text
                    print ''

                    # Close the current window
                    driver.close()

            # Print the main window handle
            print now_handle
            driver.switch_to_window(now_handle)    # back to the main window for the next jump

However, some sites still could not be accessed, as the figure below shows.
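Because some detail pages fail to load, the window-switching loop is a natural place to add error handling, so that one bad page neither stops the crawl nor leaves the driver attached to a closed window. The Python 3 sketch below is one possible arrangement, not the code used in this post. `visit_popups` and its `scrape` callback are hypothetical names, and the sketch assumes a recent Selenium release in which `driver.switch_to.window(handle)` replaces the deprecated `switch_to_window`. It uses only the `window_handles`, `switch_to.window`, and `close` members of the driver, so it can be exercised with a stand-in object.

```python
def visit_popups(driver, main_handle, scrape):
    """Run scrape(driver) in every window except main_handle.

    Each popup is closed after it is visited, whether or not scraping
    succeeded, and the driver is switched back to main_handle at the end.
    Returns one result per popup, with None for popups where scraping failed.
    """
    results = []
    # Copy the handle list: closing windows changes driver.window_handles.
    for handle in list(driver.window_handles):
        if handle == main_handle:
            continue
        driver.switch_to.window(handle)
        try:
            results.append(scrape(driver))
        except Exception:
            # For example a login page without the expected element. With
            # Selenium installed, catching NoSuchElementException instead
            # would be narrower.
            results.append(None)
        finally:
            driver.close()
    driver.switch_to.window(main_handle)
    return results


# Possible use with a real browser (assumes Selenium 4 and a WebDriver binary):
#
#   from selenium.webdriver.common.by import By
#   summaries = visit_popups(
#       driver, now_handle,
#       lambda d: d.find_element(By.XPATH, "//div[@class='summary pad10']").text)
```

Returning `None` for failed pages keeps the results aligned with the order in which the popups were visited, which makes it easy to see afterwards which papers still lack author information.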

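In the end, the author plans to scrape Wanfang Data (万方数据) for the analysis instead.
I hope this article is helpful. If it contains errors or shortcomings, please bear with me.
(By: Eastmount, 2017-11-17, midnight, http://blog.csdn.net/eastmount/)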
