Scraping the Qidian Monthly Ticket Ranking
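Contents: Reconnaissance, Fetching the page source, Extracting data with XPath, Beating the font-based anti-scraping, Extracting and saving the data, Fetching every page, Full code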
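Reconnaissance
Start by opening the Qidian monthly ticket ranking at https://www.qidian.com/rank/yuepiao and deciding what to collect. Here we extract each novel's title, author, genre, status, synopsis, latest chapter, update time, and monthly ticket count.
With that settled, right-click and choose Inspect (or press F12) to open the developer tools, then use the element picker to locate a novel's title. All of the information we want turns out to sit inside the same block of markup.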
Fetching the page source
First call requests.get to fetch the page source and write it to a file:
import requests

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
response = requests.get('https://www.qidian.com/rank/yuepiao', headers=headers)
# Dump the raw page source so it can be searched in a text editor.
with open("M:/a.txt", 'w') as f:
    f.write(response.text)
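Searching the saved file with Ctrl+F confirms that all of the information is present in the page source. Wrap the fetching code in a function: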
def getHtml(url):
    """Return the page source for url, or None if the status code is not 200."""
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'}
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None
Extracting data with XPath
from lxml import etree

html = getHtml('https://www.qidian.com/rank/yuepiao')
# etree.HTML() already returns an element that supports xpath(),
# so no tostring()/fromstring() round trip is needed.
html = etree.HTML(html)

# Each query returns one list; print the lengths to confirm the counts agree.
name = html.xpath('//li//div[@class="book-mid-info"]//h4//a[@data-eid="qd_C40"]//text()')
print(len(name), name)
author = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C41"]//text()')
print(len(author), author)
types = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C42"]//text()')
print(len(types), types)
status = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//span/text()')
print(len(status), status)
intro = html.xpath('//li//div[@class="book-mid-info"]//p[@class="intro"]//text()')
intro = [i.strip() for i in intro]
print(len(intro), intro)
update = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//a[@data-eid="qd_C43"]//text()')
update = [i.strip() for i in update]
print(len(update), update)
date = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//span//text()')
print(len(date), date)
The printed results show that we have extracted the data we wanted.
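Beating the font-based anti-scraping
The monthly ticket count is next. But when we inspect the ticket numbers, the markup contains garbled characters.
What is displayed is just small boxes, so we can't tell what they are. Let's look in the page source we saved earlier.
There we find sequences of the form &#....;. This usually means font-based anti-scraping. I didn't expect a novel ranking page to have it, but since it does, let's deal with it.
Scrolling up a little, the cause turns up almost immediately: the @font-face rule contains several URLs, and those are the font files. Open https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.woff in a new tab to download it (the .ttf variant at https://qidian.gtimg.com/qd_anti_spider/jUlcIiMg.ttf works too; I downloaded both).
With the font in hand, open the .ttf file in the online font editor at http://fontstore.baidu.com/static/editor/index.html. The font turns out to contain only the digits 0-9. We then process the font file with the Python library fontTools, which can be installed with pip install fontTools: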
from fontTools.ttLib import TTFont

font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
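The code above converts a .woff or .ttf file to an .xml file, which we can then open in a browser. Its contents match what the font editor site showed, so this is the right font. Next we call fontTools' getBestCmap() method to get the mapping from code points to glyph names.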
from fontTools.ttLib import TTFont

font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
print(font.getBestCmap())
Output:
{100293: 'eight', 100295: 'four', 100296: 'three', 100297: 'one', 100298: 'period', 100299: 'two', 100300: 'nine', 100301: 'five', 100302: 'zero', 100303: 'six', 100304: 'seven'}
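Wondering why this differs from what you saw? The font editor site and the XML file show the code points in hexadecimal (prefixed with 0x), while fontTools prints them in decimal. A calculator will confirm they are the same values.
With the mapping in hand, we convert it by hand: turn the English digit names into Arabic digits and drop the useless 100298: 'period' entry. Since the glyphs appear in the page source as &#××××××; entities, we also rewrite the keys in that form to make substitution easy: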
font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
print(font.getBestCmap())

# Map English digit names to digits.
camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}
cp = {}
for k, v in font.getBestCmap().items():
    try:
        # Skip glyphs that are not digits, e.g. 100298: 'period'.
        cp['&#' + str(k) + ';'] = camp[v]
    except KeyError:
        pass
print(cp)
Output:
{'&#100293;': 8, '&#100295;': 4, '&#100296;': 3, '&#100297;': 1, '&#100299;': 2, '&#100300;': 9, '&#100301;': 5, '&#100302;': 0, '&#100303;': 6, '&#100304;': 7}
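The relationship between the decimal code points printed by fontTools, the hexadecimal values shown by the font editor, and the &#...; entities in the page source can be checked with the standard library alone. This is a minimal sketch using the code point 100293 from the output above; the variable names are only illustrative.

```python
import html

code_point = 100293                # decimal value printed by getBestCmap()

# The same value in hexadecimal, as a font editor or the XML dump would show it.
as_hex = hex(code_point)
print(as_hex)

# The page source writes the glyph as a decimal numeric character reference.
entity = '&#' + str(code_point) + ';'
print(entity)

# Decoding the entity yields exactly the character with that code point,
# which is why a browser shows a box when the custom font is not applied.
decoded = html.unescape(entity)
print(decoded == chr(code_point))
```

Because the entity text is plain ASCII, it can be replaced in the raw page source before any HTML parsing takes place.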
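We now have the glyph mapping, so we can use regex substitution to replace these entities in the fetched page source with ordinary Arabic digits according to the mapping: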
import re

import requests
from fontTools.ttLib import TTFont


def getHtml(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'}
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        return response.text
    return None


font = TTFont('M:/jUlcIiMg.woff')
font.saveXML('M:/font.xml')
print(font.getBestCmap())
camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}
cp = {}
for k, v in font.getBestCmap().items():
    try:
        cp['&#' + str(k) + ';'] = camp[v]
    except KeyError:
        pass
print(cp)

html = getHtml('https://www.qidian.com/rank/yuepiao')
# Save the page before and after substitution so the two can be compared.
with open('M:/html.txt', 'w') as f:
    f.write(html)
for key in cp.keys():
    html = re.sub(key, str(cp[key]), html)
with open('M:/html_change.txt', 'w') as f:
    f.write(html)
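Hmm, why didn't the substitution work? Was re.sub written wrong? No: the entities in this page don't match a single key in the mapping we just built. Looking up at the @font-face rule, the font has changed. We used jUlcIiMg.woff, but this response references OMkqwDTS.woff. Apparently every request is served a different font, so downloading a single .woff file once is not enough.
Instead, each time we fetch the page we extract the font URL with a regular expression, download the font, parse it, and then do the substitution. To that end we change the fetching function as follows: after fetching the page, it extracts the font URL and saves the font as font.woff.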
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}


def getHtml(url):
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    # Pull the .woff URL out of the @font-face rule.
    woff = re.search(r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", response.text, re.S)
    if woff is None:
        return None
    fontfile = requests.get(woff.group(1), headers=headers)
    if fontfile.status_code != 200:
        return None
    # Save the font that belongs to this particular response.
    with open('M:/font.woff', 'wb') as f:
        f.write(fontfile.content)
    return response.text

# The rest of the script, using the headers and getHtml() defined above.
# Fetch the page first so that font.woff is the font this response refers to.
html = getHtml('https://www.qidian.com/rank/yuepiao')

font = TTFont('M:/font.woff')
print(font.getBestCmap())
camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}
cp = {}
for k, v in font.getBestCmap().items():
    try:
        cp['&#' + str(k) + ';'] = camp[v]
    except KeyError:
        pass
print(cp)

with open('M:/html.txt', 'w') as f:
    f.write(html)
for key in cp.keys():
    html = re.sub(key, str(cp[key]), html)
with open('M:/html_change.txt', 'w') as f:
    f.write(html)
The font is downloaded and the substitution succeeds!
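The font URL extraction can be tried offline against a sample string. The sketch below is an illustration only: SAMPLE_CSS is a made-up @font-face rule shaped after what the regular expression in getHtml expects (the OMkqwDTS URLs are assumed to follow the same pattern as the jUlcIiMg ones), and find_woff_url is a hypothetical helper with a slightly looser pattern than the original.

```python
import re

# Hypothetical @font-face rule; the real page may format it differently.
SAMPLE_CSS = (
    "@font-face { font-family: OMkqwDTS; "
    "src: url('https://qidian.gtimg.com/qd_anti_spider/OMkqwDTS.eot?') format('eot'); "
    "src: url('https://qidian.gtimg.com/qd_anti_spider/OMkqwDTS.woff') format('woff'), "
    "url('https://qidian.gtimg.com/qd_anti_spider/OMkqwDTS.ttf') format('truetype'); }"
)

# Match any quoted URL ending in .woff that is followed by format('woff').
WOFF_URL = re.compile(r"url\('([^']+?\.woff)'\)\s*format\('woff'\)")


def find_woff_url(page_text):
    """Return the first .woff URL in page_text, or None if there is none."""
    match = WOFF_URL.search(page_text)
    return match.group(1) if match else None


print(find_woff_url(SAMPLE_CSS))
print(find_woff_url('<html>no font here</html>'))
```

Returning None instead of raising lets the caller decide what to do when a response carries no font rule, for example skipping the page.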
Let's wrap the font handling in a function to tidy things up.
def fontProc(text):
    """Replace the obfuscated digit entities in text using the downloaded font."""
    font = TTFont('M:/font.woff')
    camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}
    cp = {}
    for k, v in font.getBestCmap().items():
        try:
            cp['&#' + str(k) + ';'] = camp[str(v)]
        except KeyError:
            pass
    for key in cp.keys():
        text = re.sub(key, str(cp[key]), text)
    return text
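Because fontProc reads the font from disk, it is awkward to try without a download. As a rough sketch, the same logic can be split into two pure functions that take the cmap as an argument. The names build_entity_map and replace_entities are hypothetical, the sample cmap is the one printed earlier for jUlcIiMg.woff, and the single compiled pattern is a swapped-in alternative to the loop of re.sub calls, not the article's method.

```python
import re

DIGIT_NAMES = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4,
               'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}


def build_entity_map(cmap):
    """Map '&#<code point>;' to a digit string, skipping non-digit glyphs."""
    return {'&#%d;' % code: str(DIGIT_NAMES[name])
            for code, name in cmap.items() if name in DIGIT_NAMES}


def replace_entities(text, entity_map):
    """Replace every known entity in one pass over the text."""
    if not entity_map:
        return text
    pattern = re.compile('|'.join(re.escape(key) for key in entity_map))
    return pattern.sub(lambda match: entity_map[match.group(0)], text)


# The cmap shown earlier in the article, including the unused 'period' glyph.
cmap = {100293: 'eight', 100295: 'four', 100296: 'three', 100297: 'one',
        100298: 'period', 100299: 'two', 100300: 'nine', 100301: 'five',
        100302: 'zero', 100303: 'six', 100304: 'seven'}
entity_map = build_entity_map(cmap)
sample = '<span>&#100297;&#100299;&#100296;&#100295;</span>'
print(replace_entities(sample, entity_map))
```

In the real script the cmap argument would come from TTFont(...).getBestCmap() on the font saved for the current response.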
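Extracting and saving the data
With the font substitution working, XPath can now extract the monthly ticket counts as well. The extraction function becomes: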
def getBook(html):
    """Return an iterator of per-novel tuples extracted from the page source."""
    # etree.HTML() already returns an element that supports xpath().
    html = etree.HTML(html)
    name = html.xpath('//li//div[@class="book-mid-info"]//h4//a//text()')
    author = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@class="name"]//text()')
    types = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C42"]//text()')
    status = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//span//text()')
    intro = html.xpath('//li//div[@class="book-mid-info"]//p[@class="intro"]//text()')
    intro = [i.strip() for i in intro]
    update = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//a[@data-eid="qd_C43"]//text()')
    update = [i.strip() for i in update]
    date = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//span//text()')
    tickets = html.xpath('//li//div[@class="book-right-info"]//div[@class="total"]//p//span//span//text()')
    # Combine the parallel lists into one tuple per novel.
    book = zip(name, author, types, status, intro, update, date, tickets)
    return book
def saveInfo(url):
    html = getHtml(url)
    if html is None:
        return  # the page or its font could not be fetched
    html = fontProc(html)
    book = getBook(html)
    # Append one record per novel.
    with open('M:/novels.txt', 'a+') as f:
        for name, author, types, status, intro, update, date, tickets in book:
            f.write('Title: ' + name + '\n')
            f.write('Author: ' + author + ' Genre: ' + types + ' Status: ' + status + '\n')
            f.write('Synopsis: ' + intro + '\n')
            f.write(update + ' Updated: ' + date + '\n')
            f.write('Monthly tickets: ' + tickets + '\n')
            f.write('\n\n')


saveInfo('https://www.qidian.com/rank/yuepiao')
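One thing to watch with getBook: zip pairs the lists purely by position, so if any query returns more or fewer text nodes than there are novels (a synopsis split by a line break, say), the fields silently drift out of alignment. A possible alternative, sketched here against made-up markup that only mimics the class names used above, is to select each novel's block first and query relative to it. The helper names joined_text and parse_rows are hypothetical.

```python
from lxml import etree

# Made-up markup that mimics the class names used in the article's XPath queries.
SAMPLE = """
<ul>
  <li><div class="book-mid-info">
    <h4><a>Book A</a></h4>
    <p class="author"><a class="name">Author A</a><span>Ongoing</span></p>
    <p class="intro"> First line<br/>second line </p>
  </div></li>
  <li><div class="book-mid-info">
    <h4><a>Book B</a></h4>
    <p class="author"><a class="name">Author B</a><span>Completed</span></p>
    <p class="intro"> Only line </p>
  </div></li>
</ul>
"""


def joined_text(node, path):
    """Join every non-empty text node matched by a relative XPath."""
    parts = (part.strip() for part in node.xpath(path))
    return ' '.join(part for part in parts if part)


def parse_rows(page_text):
    """Return one dict per novel, querying relative to each novel's own block."""
    root = etree.HTML(page_text)
    rows = []
    for info in root.xpath('//li//div[@class="book-mid-info"]'):
        rows.append({
            'name': joined_text(info, './/h4//a//text()'),
            'author': joined_text(info, './/p[@class="author"]//a[@class="name"]//text()'),
            'status': joined_text(info, './/p[@class="author"]//span//text()'),
            'intro': joined_text(info, './/p[@class="intro"]//text()'),
        })
    return rows


rows = parse_rows(SAMPLE)
# The page-wide query sees three synopsis text nodes for only two novels.
flat_intro = etree.HTML(SAMPLE).xpath('//p[@class="intro"]//text()')
print(len(rows), len(flat_intro))
```

The ticket counts live in a sibling block (book-right-info), so a full version would iterate over the enclosing li elements instead; that part is left out of this sketch.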
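Fetching every page
The steps above retrieve everything we want, but only for the first page. Let's scrape all the pages. Clicking page 2 changes the URL to https://www.qidian.com/rank/yuepiao?page=2, and clicking page 3 changes it to https://www.qidian.com/rank/yuepiao?page=3. The pattern is clear: the page parameter is simply the page number. There are only five pages in total, so we write: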
for page in range(1, 5 + 1):
    url = 'https://www.qidian.com/rank/yuepiao?page=%d' % page
    saveInfo(url)
Running this fails. The culprit is \xa0, a character from the Latin-1 extended range that represents the non-breaking space (&nbsp;). Replacing it with an ordinary space fixes the problem.
In getBook(), change:
update = [i.strip() for i in update]
to:
update = [i.strip().replace('\xa0', ' ') for i in update]
Likewise, in getBook(), change:
intro = [i.strip() for i in intro]
to:
intro = [i.strip().replace('\u2022', ' ') for i in intro]
Then add one more replacement:
intro = [i.strip().replace('\u2022', ' ').replace('\u2003', ' ') for i in intro]
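The article does not show the error itself. Assuming it was a UnicodeEncodeError raised because the output file was opened with the platform's default encoding (GBK on a Chinese-locale Windows machine, for example), an alternative to replacing characters one at a time is to open the file with an explicit encoding. The sketch below illustrates that assumption with a made-up line of text and a hypothetical helper, can_encode.

```python
import os
import tempfile


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Made-up record containing the three characters replaced above.
line = 'Chapter 1\xa0Title \u2022 synopsis\u2003text'

print(can_encode(line, 'gbk'))     # the non-breaking space has no GBK encoding
print(can_encode(line, 'utf-8'))   # UTF-8 can represent every character

# Writing with an explicit encoding keeps the characters intact.
path = os.path.join(tempfile.mkdtemp(), 'novels.txt')
with open(path, 'a+', encoding='utf-8') as f:
    f.write(line + '\n')
with open(path, encoding='utf-8') as f:
    saved = f.read()
print(saved == line + '\n')
```

If this assumption holds, passing encoding='utf-8' to the open() call in saveInfo would make the replace() calls optional, though they remain useful for normalizing whitespace in the output.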
Full code
import re

import requests
from fontTools.ttLib import TTFont
from lxml import etree

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
woffDir = './font.woff'
novelsDir = './novels.txt'


def getHtml(url):
    """Fetch the page, save the font it references, and return the page source."""
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        return None
    # Pull the .woff URL out of the @font-face rule.
    woff = re.search(r"format\('eot'\); src: url\('(.+?)'\) format\('woff'\)", response.text, re.S)
    if woff is None:
        return None
    fontfile = requests.get(woff.group(1), headers=headers)
    if fontfile.status_code != 200:
        return None
    with open(woffDir, 'wb') as f:
        f.write(fontfile.content)
    response.encoding = response.apparent_encoding
    return response.text


def fontProc(text):
    """Replace the obfuscated digit entities in text using the downloaded font."""
    font = TTFont(woffDir)
    camp = {'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9}
    cp = {}
    for k, v in font.getBestCmap().items():
        try:
            cp['&#' + str(k) + ';'] = camp[str(v)]
        except KeyError:
            pass
    for key in cp.keys():
        text = re.sub(key, str(cp[key]), text)
    return text


def getBook(html):
    """Return an iterator of per-novel tuples extracted from the page source."""
    # etree.HTML() already returns an element that supports xpath().
    html = etree.HTML(html)
    name = html.xpath('//li//div[@class="book-mid-info"]//h4//a//text()')
    author = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@class="name"]//text()')
    types = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//a[@data-eid="qd_C42"]//text()')
    status = html.xpath('//li//div[@class="book-mid-info"]//p[@class="author"]//span//text()')
    intro = html.xpath('//li//div[@class="book-mid-info"]//p[@class="intro"]//text()')
    intro = [i.strip().replace('\u2022', ' ').replace('\u2003', ' ') for i in intro]
    update = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//a[@data-eid="qd_C43"]//text()')
    update = [i.strip().replace('\xa0', ' ') for i in update]
    date = html.xpath('//li//div[@class="book-mid-info"]//p[@class="update"]//span//text()')
    tickets = html.xpath('//li//div[@class="book-right-info"]//div[@class="total"]//p//span//span//text()')
    book = zip(name, author, types, status, intro, update, date, tickets)
    return book


def saveInfo(url):
    html = getHtml(url)
    if html is None:
        return  # the page or its font could not be fetched
    html = fontProc(html)
    book = getBook(html)
    # Append one record per novel.
    with open(novelsDir, 'a+') as f:
        for name, author, types, status, intro, update, date, tickets in book:
            f.write('Title: ' + name + '\n')
            f.write('Author: ' + author + ' Genre: ' + types + ' Status: ' + status + '\n')
            f.write('Synopsis: ' + intro + '\n')
            f.write(update + ' Updated: ' + date + '\n')
            f.write('Monthly tickets: ' + tickets + '\n')
            f.write('\n\n')


# The ranking has five pages; the page parameter is the page number.
for page in range(1, 5 + 1):
    url = 'https://www.qidian.com/rank/yuepiao?page=%d' % page
    saveInfo(url)