
A Brief Look at a Crawler for the Tianya Community "工薪一族" (Wage Earners) Board

Published: 2024/3/13 47 豆豆



1. Determine the data structure

First, settle one question: what do we need to store?

Below is the data structure that my final code produces.

{
    "time": "2022-08-04 10:25:07",  // time the crawl started
    "pages": 3,  // number of pages crawled
    "posts": [  // outer list, one entry per page
        {
            "page": 1,  // which page the following posts came from
            "posts": [  // list of the posts on that page
                {
                    "title": "历史学习记录",  // title
                    "post_time": "2022-08-04 03:37:49",  // time the post was published
                    "author_id": "潘妮sun",  // author
                    "url": "http://bbs.tianya.cn/post-170-917565-1.shtml",  // post link
                    "author_url": "http://www.tianya.cn/112795571",  // author link
                    "read_num": "8",  // read count
                    "reply_num": "4",  // reply count
                    "content": "黄帝和炎帝其实并不是皇帝，而是古书记载中黄河流域远古..."  // post body (truncated here because the text is long)
                },
                ......
            ]
        }
    ]
}

From this, the things we need to store are:

Crawl time and page count
Post title and link
Post publication time
Post author and link
Read count and reply count
Post body
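As a minimal sketch of how the nested record described above could be assembled, the snippet below builds the top-level dictionary and appends one page to it. The helper names `new_crawl_record` and `add_page` are hypothetical and are not part of the original code, which builds the same shape inline in its main function.

```python
import json
import time


def new_crawl_record():
    """Return an empty top-level record: crawl time, page count, page list."""
    return {
        "time": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),
        "pages": 0,
        "posts": [],
    }


def add_page(record, page_number, posts):
    """Append one page of posts and keep the page counter in sync."""
    record["posts"].append({"page": page_number, "posts": posts})
    record["pages"] = len(record["posts"])
    return record


record = new_crawl_record()
# The post dict here is a placeholder, not real crawled data.
add_page(record, 1, [{"title": "example", "content": "..."}])
print(json.dumps(record, ensure_ascii=False, indent=4))
```

Keeping the page counter derived from the list length is one way to make sure "pages" and "posts" cannot drift apart.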

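2. Page analysis

2.1 Analyzing the listing page

Open the target page: http://bbs.tianya.cn/list.jsp?item=170

Press F12 to open the developer tools and inspect the page structure.

The main part of the page consists of 9 tbody elements. The first is the table header; each of the other eight holds 10 posts, for 80 posts in total.

Each tbody contains 10 tr rows, and each row records the post title and link, the author and link, the click count, the reply count, and the time of the last reply.

At the end of every page there is a link to the next page, much like the pointer in a linked list.

Note that on the first page the "next page" button is the second link, while on all other pages it is the third.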
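Because the position of the "next page" button differs between the first page and later pages, one alternative is to select the link by its text instead of its index. The following is only a sketch under stated assumptions: the HTML fragments are simplified and hypothetical (including the `nextid` parameter), and it uses the standard library's `xml.etree.ElementTree` on well-formed input so it runs without third-party packages. The function `find_next_href` is a hypothetical helper; it relies only on `iter`, `text`, and `get`, which lxml elements also provide.

```python
import xml.etree.ElementTree as ET

BASE = "http://bbs.tianya.cn"


def find_next_href(root, label="下一页"):
    """Return the href of the first <a> whose text equals label, else None."""
    for a in root.iter("a"):
        if (a.text or "").strip() == label:
            return a.get("href")
    return None


# Simplified, hypothetical pager fragments (not copied from the site).
first_page = ET.fromstring(
    '<div><a href="/list.jsp?item=170">首页</a>'
    '<a href="/list.jsp?item=170&amp;nextid=1">下一页</a></div>'
)
later_page = ET.fromstring(
    '<div><a href="/list.jsp?item=170">首页</a>'
    '<a href="/list.jsp?item=170&amp;nextid=0">上一页</a>'
    '<a href="/list.jsp?item=170&amp;nextid=2">下一页</a></div>'
)

print(BASE + find_next_href(first_page))
print(BASE + find_next_href(later_page))
```

Matching on the link text avoids the index check entirely, at the cost of depending on the exact button label.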

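2.2 Analyzing the post page

Open any post, for example http://bbs.tianya.cn/post-170-878768-1.shtml

Press F12 to open the developer tools and inspect the page structure.

The head tag of the HTML contains the post title (the reason for pointing this out comes up later).

The publication time appears in one of two layouts.

In the first, it is stored as plain text in its own span tag inside a div.

In the second, it is stored in a single block together with the click and reply counts.

The post body is stored in the "bbs-content" div, with paragraphs separated by <br>.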
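Since the publication time can arrive either on its own or mixed with the click and reply counts, a single regular expression can pull the timestamp out of both. This is a sketch only: the two sample strings are hypothetical stand-ins for the two layouts, and `extract_post_time` is an illustrative helper rather than a function from the original code.

```python
import re

# Matches timestamps of the form YYYY-MM-DD HH:MM:SS.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def extract_post_time(raw_text):
    """Return the first timestamp found in raw_text, or None if there is none."""
    match = TIMESTAMP.search(raw_text)
    return match.group(0) if match else None


# Hypothetical raw strings standing in for the two layouts.
standalone = "时间：2022-08-04 03:37:49"
combined = "时间：2022-08-04 03:37:49 点击：8 回复：4"

print(extract_post_time(standalone))
print(extract_post_time(combined))
```

Returning `None` instead of indexing into an empty result lets the caller decide what to do with a post whose time could not be found.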

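3. Choose the tools

Fetching the HTML: the requests library

Parsing and extracting from the HTML: XPath, through the lxml library

Saving the data as formatted text: the json library

Recording the time: the time library

Extracting text data may call for regular expressions, so import the re library (optional)

*Note: it helps to copy your browser's User-Agent at this point and build the headers from it

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49'
}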

4. Start extracting

4.1 Extracting the listing page

import requests
from lxml import etree

url = 'http://bbs.tianya.cn/list.jsp?item=170'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49'
}
posts = []  # collected posts
next = ''   # link to the next page

raw = requests.get(url, headers=headers)  # fetch the page
html = etree.HTML(raw.text)  # parse into an element tree for XPath queries

# Take the next-page link, assuming first that it is the second button
next = "http://bbs.tianya.cn" + html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[2]/@href')[0]
# Check whether the second button really is the next-page button; if not, use the third.
# On every page except the first it is a[3], not a[2].
# (Do not swap the order of these two lookups, or the first page raises an error.)
if html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[2]/text()')[0] != '下一页':
    next = "http://bbs.tianya.cn" + html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[3]/@href')[0]

tbodys = html.xpath('//*[@id="main"]/div[@class="mt5"]/table/tbody')  # the 9 tbody elements
tbodys.remove(tbodys[0])  # drop the table header (title, author, clicks, replies, reply time)
for tbody in tbodys:
    items = tbody.xpath("./tr")
    for item in items:
        # Titles contain line breaks and similar characters that must be removed
        title = item.xpath("./td[1]/a/text()")[0].replace('\r', '').replace('\n', '').replace('\t', '')
        post_url = "http://bbs.tianya.cn" + item.xpath("./td[1]/a/@href")[0]  # post link
        author_id = item.xpath("./td[2]/a/text()")[0]  # author id
        author_url = item.xpath("./td[2]/a/@href")[0]  # author link
        read_num = item.xpath("./td[3]/text()")[0]  # read count
        reply_num = item.xpath("./td[4]/text()")[0]  # reply count
        post = {
            'title': title,
            'author_id': author_id,
            'url': post_url,
            'author_url': author_url,
            'read_num': read_num,
            'reply_num': reply_num,
        }
        posts.append(post)
        print(post)  # show the result, for debugging
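The listing code builds absolute links by prepending the site root to each relative href. A small sketch of the same step using the standard library's `urllib.parse.urljoin` is shown here; `absolute_url` is a hypothetical helper name. The two sample hrefs are the post and author links from the data structure example earlier in the article.

```python
from urllib.parse import urljoin

BASE = "http://bbs.tianya.cn"


def absolute_url(href, base=BASE):
    """Resolve href against base; already-absolute URLs are returned unchanged."""
    return urljoin(base, href)


print(absolute_url("/post-170-917565-1.shtml"))
print(absolute_url("http://www.tianya.cn/112795571"))
```

Unlike plain string concatenation, `urljoin` leaves a link alone when it already carries its own scheme and host, which is why the author link passes through untouched.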

4.2 Extracting a single post

import re

post_time = ''     # publication time
post_content = ''  # post body
post_url = 'http://bbs.tianya.cn/post-170-917511-1.shtml'
postraw = requests.get(post_url, headers=headers)
posthtml = etree.HTML(postraw.text)
# Tianya stores the time in two layouts; handle each in turn
try:
    posttimeraw = posthtml.xpath('//*[@id="post_head"]/div[2]/div[2]/span[2]/text()')[0]  # publication time
except IndexError:
    posttimeraw = posthtml.xpath('//*[@id="container"]/div[2]/div[3]/span[2]/text()[2]')[0]  # publication time
# Use a regular expression to normalize the time text to YYYY-MM-DD HH:mm:ss
post_time = re.findall(r'\d+-\d+-\d+ \d+:\d+:\d+', posttimeraw)[0]
# title comes from the listing loop in 4.1
if len(title) == 0:  # some posts have special formatting and yield no title from the listing
    title = posthtml.xpath('/html/head/title/text()')[0].replace('_工薪一族_论坛_天涯社区', '')
contents = posthtml.xpath('//*[@id="bd"]/div[4]/div[1]/div/div[2]/div[1]/text()')  # post body (a list, one item per paragraph)
post_content = ''
for string in contents:  # go through each paragraph of the body
    # Remove line breaks and similar characters, then add a newline between paragraphs
    string = string.replace('\r', '').replace('\n', '').replace('\t', '').replace('\u3000', '') + '\n'
    post_content += string  # concatenate the paragraphs
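The paragraph clean-up at the end of this step can be isolated into a small function that is easy to test without any network access. The version below is a sketch and deliberately differs from the loop above in two labeled ways: it skips fragments that are empty after cleaning, and it does not leave a trailing newline. The name `join_paragraphs` is hypothetical.

```python
def join_paragraphs(fragments):
    """Clean each text fragment and join the non-empty ones with newlines."""
    cleaned = []
    for fragment in fragments:
        text = (
            fragment.replace("\r", "")
            .replace("\n", "")
            .replace("\t", "")
            .replace("\u3000", "")  # full-width space used for indentation
            .strip()
        )
        if text:  # assumption: whitespace-only fragments carry no content
            cleaned.append(text)
    return "\n".join(cleaned)


sample = ["\r\n\t\u3000\u3000first", "\n", "second\r\n"]
print(join_paragraphs(sample))
```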

4.3 Building the function

The goal here is to combine the single-post code with the listing-page code so that all the data on one page is extracted (titles, bodies, and counts).

Below is my implementation. It takes the page address url and headers as inputs, and returns posts, a list holding all the data from that page, together with next, the link to the next page.

def get_posts(url, headers):
    raw = requests.get(url, headers=headers)
    code = raw.status_code
    posts = []
    next = ''  # if loading fails, return empty values instead of raising
    if code == 200:
        html = etree.HTML(raw.text)
        # Take the next-page link
        next = "http://bbs.tianya.cn" + html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[2]/@href')[0]
        # On every page except the first it is a[3], not a[2].
        # (Do not swap the order of these two lookups, or the first page raises an error.)
        if html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[2]/text()')[0] != '下一页':  # is the second button the next-page button?
            next = "http://bbs.tianya.cn" + html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[3]/@href')[0]
        tbodys = html.xpath('//*[@id="main"]/div[@class="mt5"]/table/tbody')
        tbodys.remove(tbodys[0])  # drop the table header (title, author, clicks, replies, reply time)
        for tbody in tbodys:
            items = tbody.xpath("./tr")
            for item in items:
                # Titles contain line breaks and similar characters that must be removed
                title = item.xpath("./td[1]/a/text()")[0].replace('\r', '').replace('\n', '').replace('\t', '')
                url = "http://bbs.tianya.cn" + item.xpath("./td[1]/a/@href")[0]  # post link
                author_id = item.xpath("./td[2]/a/text()")[0]  # author id
                author_url = item.xpath("./td[2]/a/@href")[0]  # author link
                read_num = item.xpath("./td[3]/text()")[0]  # read count
                reply_num = item.xpath("./td[4]/text()")[0]  # reply count
                # Fetch the post itself
                postraw = requests.get(url, headers=headers)
                postcode = postraw.status_code
                if postcode == 200:
                    posthtml = etree.HTML(postraw.text)
                    try:
                        posttimeraw = posthtml.xpath('//*[@id="post_head"]/div[2]/div[2]/span[2]/text()')[0]  # publication time
                    except IndexError:
                        posttimeraw = posthtml.xpath('//*[@id="container"]/div[2]/div[3]/span[2]/text()[2]')[0]  # publication time
                    post_time = re.findall(r'\d+-\d+-\d+ \d+:\d+:\d+', posttimeraw)[0]
                    if len(title) == 0:  # some posts have special formatting and yield no title from the listing
                        title = posthtml.xpath('/html/head/title/text()')[0].replace('_工薪一族_论坛_天涯社区', '')
                    contents = posthtml.xpath('//*[@id="bd"]/div[4]/div[1]/div/div[2]/div[1]/text()')  # post body (a list, one item per paragraph)
                    post_content = ''
                    for string in contents:
                        # Remove line breaks and similar characters, then add a newline between paragraphs
                        string = string.replace('\r', '').replace('\n', '').replace('\t', '').replace('\u3000', '') + '\n'
                        post_content += string  # concatenate the paragraphs
                    post = {
                        'title': title,
                        'post_time': post_time,
                        'author_id': author_id,
                        'url': url,
                        'author_url': author_url,
                        'read_num': read_num,
                        'reply_num': reply_num,
                        'content': post_content
                    }
                    posts.append(post)
                    print(title)  # print the post title, for debugging
    return posts, next
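The function above gives up on a page or post as soon as the status code is not 200. If transient failures turn out to be common, a retry wrapper is one possible extension. The sketch below is not part of the original crawler: `fetch_with_retry`, `FakeResponse`, and `make_flaky_fetch` are hypothetical names, and the fetch function is passed in as a parameter so the example can run with a fake instead of a real network call.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) until it returns status 200 or the retries run out.

    Returns the successful response, or None if every attempt failed.
    """
    for attempt in range(retries):
        response = fetch(url)
        if response.status_code == 200:
            return response
        if attempt < retries - 1:
            time.sleep(delay)  # pause before the next attempt
    return None


class FakeResponse:
    """Stand-in for a response object; only status_code is modeled."""

    def __init__(self, status_code):
        self.status_code = status_code


def make_flaky_fetch(statuses):
    """Return a fake fetch that yields the given status codes in order."""
    remaining = list(statuses)
    return lambda url: FakeResponse(remaining.pop(0))


result = fetch_with_retry(make_flaky_fetch([500, 503, 200]), "http://example.invalid", delay=0)
print(result.status_code)
```

With requests, the fetch argument could be something like `lambda u: requests.get(u, headers=headers)`. Note that this sketch only reacts to status codes and does not catch exceptions raised by the fetch itself.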

4.4 Saving the data

Goal of this step: build the main function and save the result as formatted JSON.

import json
import time

def main():
    url = 'http://bbs.tianya.cn/list.jsp?item=170'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49'
    }
    postss = {
        'time': time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),
        'pages': 0,
        'posts': []
    }
    for i in range(3):  # crawl only the first three pages
        print("page: " + str(i + 1))  # print the page number, for debugging
        posts, next = get_posts(url, headers)
        pages = {
            'page': i + 1,
            'posts': posts
        }
        postss['posts'].append(pages)
        url = next
        postss['pages'] += 1
        # Save after every page, so that a crash loses as little as possible
        with open('tianya.json', 'w', encoding='utf-8') as f:
            json.dump(postss, f, ensure_ascii=False, indent=4)
    with open('tianya.json', 'w', encoding='utf-8') as f:
        json.dump(postss, f, ensure_ascii=False, indent=4)  # indent=4 pretty-prints the JSON
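Saving after every page protects against losing earlier pages, but opening the target file in 'w' mode truncates it first, so a crash in the middle of a write could still leave a broken file. One common way to reduce that risk is to write to a temporary file and then rename it over the target. The following is a sketch of that idea using only the standard library; `save_json_atomic` is a hypothetical helper, not something the original code defines.

```python
import json
import os
import tempfile


def save_json_atomic(data, path):
    """Write data as JSON to a temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False, indent=4)
        os.replace(tmp_path, path)  # replaces the old file in one step
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Inside the page loop, a call such as `save_json_atomic(postss, 'tianya.json')` could take the place of the `with open(...)` block.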

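5. Things to watch out for

Title text taken straight from the page contains stray characters (line breaks, tabs, and the like) that need to be removed.

Some titles on the listing page use special styling and cannot be extracted there. For those, open the post and take the title from its head tag instead.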
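Both notes can be folded into one small helper: clean the title from the listing, and fall back to the text of the post's title tag when the listing yields nothing. This is a sketch; `clean_title` and `resolve_title` are hypothetical names, the suffix is the one the crawler strips from the page title, and the extra `strip()` call is an addition that the original code does not make.

```python
TITLE_SUFFIX = "_工薪一族_论坛_天涯社区"


def clean_title(raw):
    """Remove line breaks and tabs, then trim surrounding whitespace."""
    return raw.replace("\r", "").replace("\n", "").replace("\t", "").strip()


def resolve_title(list_title, head_title):
    """Prefer the listing title; otherwise use the <title> text minus the site suffix."""
    title = clean_title(list_title)
    if title:
        return title
    fallback = clean_title(head_title)
    if fallback.endswith(TITLE_SUFFIX):
        fallback = fallback[: -len(TITLE_SUFFIX)]
    return fallback


print(resolve_title("\r\n\t历史学习记录\r\n", ""))
print(resolve_title("\r\n\t", "历史学习记录_工薪一族_论坛_天涯社区"))
```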

6. Final code

import requests
from lxml import etree
import json
import re
import time


def get_posts(url, headers):
    raw = requests.get(url, headers=headers)
    code = raw.status_code
    posts = []
    next = ''  # if loading fails, return empty values instead of raising
    if code == 200:
        html = etree.HTML(raw.text)
        # Take the next-page link
        next = "http://bbs.tianya.cn" + html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[2]/@href')[0]
        # On every page except the first it is a[3], not a[2].
        # (Do not swap the order of these two lookups, or the first page raises an error.)
        if html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[2]/text()')[0] != '下一页':  # is the second button the next-page button?
            next = "http://bbs.tianya.cn" + html.xpath('//*[@id="main"]/div[@class="short-pages-2 clearfix"]/div/a[3]/@href')[0]
        tbodys = html.xpath('//*[@id="main"]/div[@class="mt5"]/table/tbody')
        tbodys.remove(tbodys[0])  # drop the table header (title, author, clicks, replies, reply time)
        for tbody in tbodys:
            items = tbody.xpath("./tr")
            for item in items:
                # Titles contain line breaks and similar characters that must be removed
                title = item.xpath("./td[1]/a/text()")[0].replace('\r', '').replace('\n', '').replace('\t', '')
                url = "http://bbs.tianya.cn" + item.xpath("./td[1]/a/@href")[0]  # post link
                author_id = item.xpath("./td[2]/a/text()")[0]  # author id
                author_url = item.xpath("./td[2]/a/@href")[0]  # author link
                read_num = item.xpath("./td[3]/text()")[0]  # read count
                reply_num = item.xpath("./td[4]/text()")[0]  # reply count
                # Fetch the post itself
                postraw = requests.get(url, headers=headers)
                postcode = postraw.status_code
                if postcode == 200:
                    posthtml = etree.HTML(postraw.text)
                    try:
                        posttimeraw = posthtml.xpath('//*[@id="post_head"]/div[2]/div[2]/span[2]/text()')[0]  # publication time
                    except IndexError:
                        posttimeraw = posthtml.xpath('//*[@id="container"]/div[2]/div[3]/span[2]/text()[2]')[0]  # publication time
                    post_time = re.findall(r'\d+-\d+-\d+ \d+:\d+:\d+', posttimeraw)[0]
                    if len(title) == 0:  # some posts have special formatting and yield no title from the listing
                        title = posthtml.xpath('/html/head/title/text()')[0].replace('_工薪一族_论坛_天涯社区', '')
                    contents = posthtml.xpath('//*[@id="bd"]/div[4]/div[1]/div/div[2]/div[1]/text()')  # post body (a list, one item per paragraph)
                    post_content = ''
                    for string in contents:
                        # Remove line breaks and similar characters, then add a newline between paragraphs
                        string = string.replace('\r', '').replace('\n', '').replace('\t', '').replace('\u3000', '') + '\n'
                        post_content += string  # concatenate the paragraphs
                    post = {
                        'title': title,
                        'post_time': post_time,
                        'author_id': author_id,
                        'url': url,
                        'author_url': author_url,
                        'read_num': read_num,
                        'reply_num': reply_num,
                        'content': post_content
                    }
                    posts.append(post)
                    print(title)  # print the post title, for debugging
    return posts, next


def main():
    url = 'http://bbs.tianya.cn/list.jsp?item=170'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.114 Safari/537.36 Edg/103.0.1264.49'
    }
    postss = {
        'time': time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()),
        'pages': 0,
        'posts': []
    }
    for i in range(3):  # crawl only the first three pages
        print("page: " + str(i + 1))  # print the page number, for debugging
        posts, next = get_posts(url, headers)
        pages = {
            'page': i + 1,
            'posts': posts
        }
        postss['posts'].append(pages)
        url = next
        postss['pages'] += 1
        # Save after every page, so that a crash loses as little as possible
        with open('tianya.json', 'w', encoding='utf-8') as f:
            json.dump(postss, f, ensure_ascii=False, indent=4)
    with open('tianya.json', 'w', encoding='utf-8') as f:
        json.dump(postss, f, ensure_ascii=False, indent=4)  # indent=4 pretty-prints the JSON


if __name__ == '__main__':
    main()
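Once the crawl has finished, the saved file can be read back and the page-by-page nesting flattened for simple analysis. The sketch below works on an in-memory record of the same shape so that it runs on its own; the second post and the helper name `iter_posts` are hypothetical. For a real run, the record could be loaded with `json.load` from tianya.json instead.

```python
def iter_posts(record):
    """Yield (page_number, post) pairs from the nested crawl record."""
    for page in record["posts"]:
        for post in page["posts"]:
            yield page["page"], post


# In-memory sample in the same shape as the saved file; the second post is made up.
record = {
    "time": "2022-08-04 10:25:07",
    "pages": 1,
    "posts": [
        {
            "page": 1,
            "posts": [
                {"title": "历史学习记录", "read_num": "8", "reply_num": "4"},
                {"title": "example", "read_num": "12", "reply_num": "0"},
            ],
        }
    ],
}

flat = list(iter_posts(record))
# Counts are stored as strings, so convert them before doing arithmetic.
total_reads = sum(int(post["read_num"]) for _, post in flat)
print(len(flat), total_reads)
```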
