"Six Degrees of Separation on Baidu Baike" (Simple Version)

Published: 2023/12/20

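Most readers have probably heard of the "Six Degrees of Separation on Wikipedia" idea. This article covers only the first stage of that idea: building a crawler that hops from one page to another. The Baidu Baike entry for finance (金融) is used as the test case.

Preparation

Handling garbled-looking URLs: Baidu Baike URLs are percent-encoded, so the Chinese entry name shows up as sequences such as %E9%87%91. The following function decodes them.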

    # Example URL: https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860
    from urllib.parse import unquote

    url = 'https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860'

    def new_url(url):
        # Decode the percent-encoded UTF-8 sequences into readable characters
        return unquote(url, 'utf8')
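As a quick check of what unquote does here, the sketch below decodes the article's start URL and then re-encodes it with quote. The variable names and the safe=':/' argument are assumptions made for this illustration; they keep the scheme and path separators unescaped so the round trip can be compared directly.

```python
from urllib.parse import quote, unquote

encoded = 'https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860'

# Percent-encoded UTF-8 bytes become readable characters.
decoded = unquote(encoded, 'utf8')
print(decoded)

# quote() reverses the step; safe=':/' leaves ':' and '/' unescaped.
reencoded = quote(decoded, safe=':/')
print(reencoded == encoded)
```

The decoded form is only for display. Requests should still be made with the percent-encoded form.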

Implementation

First, find every link on the page. The links turn out to live in a tags.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    from urllib.parse import unquote

    url = 'https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860'

    def new_url(url):
        return unquote(url, 'utf8')

    html = urlopen(url)
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a'):
        if 'href' in link.attrs:
            print(link.attrs['href'])
    # Both wanted and unwanted links are printed, so a further filtering step is needed.
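To see why this first pass picks up too much, the same idea can be tried offline. This is a minimal sketch that swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages or network access. The HrefCollector class and the sample markup are made up for this illustration and are not taken from Baidu Baike.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)

# Made-up sample markup, not copied from Baidu Baike.
sample_html = '''
<a href="/item/%E4%BC%9A%E8%AE%A1/88436">entry</a>
<a href="https://www.baidu.com/">home</a>
<a href="javascript:;">button</a>
<a name="top">anchor without href</a>
'''

collector = HrefCollector()
collector.feed(sample_html)
for href in collector.hrefs:
    print(href)
```

Only the first of the three collected values is an entry link, which is the problem the filtering step addresses.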

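Next, narrow the results down to entry links. The entry links share a common pattern:

They all look like /item/%E4%BC%9A%E8%AE%A1/88436, that is, /item/, then the percent-encoded entry name, then a numeric ID.

A regular expression can filter for that pattern: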

    # ^(/item/).*?/[0-9]*$
    # https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    from urllib.parse import unquote
    import re

    url = 'https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860'

    def new_url(url):
        return unquote(url, 'utf8')

    html = urlopen(url)
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/item/).*?/[0-9]*$')):
        if 'href' in link.attrs:
            print(link.attrs['href'])
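It can help to probe the pattern on a few strings before running it against a live page. The sketch below uses the article's regular expression unchanged; the sample href values are hypothetical and chosen only to show which shapes pass.

```python
import re

# Same pattern as in the article.
entry_link = re.compile('^(/item/).*?/[0-9]*$')

# Hypothetical href values chosen to probe the pattern.
samples = [
    '/item/%E4%BC%9A%E8%AE%A1/88436',         # entry name plus numeric ID
    '/item/%E4%BC%9A%E8%AE%A1',               # no trailing "/<digits>" part
    '/item/%E4%BC%9A%E8%AE%A1/',              # trailing slash, zero digits
    '/item/%E4%BC%9A%E8%AE%A1/88436?from=x',  # query string after the ID
    'https://baike.baidu.com/item/x/1',       # absolute URL
]

matches = [href for href in samples if entry_link.search(href)]
for href in matches:
    print(href)
```

Two things stand out. Because [0-9]* also accepts zero digits, a path ending in a bare slash passes; replacing it with [0-9]+ would be a stricter variant if that is unwanted. And because of the $ anchor, any href that carries a query string after the ID is rejected.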

Wrap the logic in a function to tidy up the structure

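    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import random
    import re

    def getLinks(articleUrl):
        # Fetch one entry page and return its links to other entries
        html = urlopen('https://baike.baidu.com{}'.format(articleUrl))
        bs = BeautifulSoup(html, 'html.parser')
        return bs.find_all('a', href=re.compile('^(/item/).*?/[0-9]*$'))

    links = getLinks('/item/%E9%87%91%E8%9E%8D/860')
    while len(links) > 0:
        # Pick a random entry link and follow it
        newArticle = links[random.randint(0, len(links) - 1)].attrs['href']
        print(newArticle)
        links = getLinks(newArticle)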
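getLinks builds the full address by string formatting and sends the request with urllib's default headers. Some servers reject the default Python User-Agent; the article does not say whether Baidu Baike does, so the following is only a sketch of how a header could be attached if needed. The build_request helper, the BASE_URL constant, and the 'Mozilla/5.0' value are assumptions for this illustration.

```python
from urllib.parse import urljoin
from urllib.request import Request

BASE_URL = 'https://baike.baidu.com'

def build_request(article_path, user_agent='Mozilla/5.0'):
    """Return a Request for an entry path such as '/item/.../860'."""
    full_url = urljoin(BASE_URL, article_path)
    return Request(full_url, headers={'User-Agent': user_agent})

req = build_request('/item/%E9%87%91%E8%9E%8D/860')
print(req.full_url)
print(req.get_header('User-agent'))
# urlopen(req, timeout=10) would then send it; the call is left out here
# so the sketch runs without network access.
```

Inside getLinks, the object returned by build_request could be passed to urlopen in place of the formatted string.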

Full code:

    # https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860
    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    from urllib.parse import unquote
    import random
    import re

    # Seed from system entropy; passing a datetime object raises TypeError on Python 3.11+
    random.seed()

    def new_url(url):
        return unquote(url, 'utf8')

    def getLinks(articleUrl):
        html = urlopen('https://baike.baidu.com{}'.format(articleUrl))
        bs = BeautifulSoup(html, 'html.parser')
        return bs.find_all('a', href=re.compile('^(/item/).*?/[0-9]*$'))

    links = getLinks('/item/%E9%87%91%E8%9E%8D/860')
    while len(links) > 0:
        newArticle = links[random.randint(0, len(links) - 1)].attrs['href']
        print(newArticle)
        links = getLinks(newArticle)
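The loop above keeps running for as long as each page has at least one entry link, and it can revisit pages it has already seen. One possible refinement, sketched here under assumptions, is to cap the number of hops and skip visited pages. The random_walk function, its parameters, and the fake_graph data are hypothetical; the link source is passed in as a function so the sketch can run without network access.

```python
import random

def random_walk(start, get_links, max_steps=10, rng=random):
    """Follow random links from start, visiting each page at most once."""
    path = [start]
    visited = {start}
    current = start
    for _ in range(max_steps):
        # Skip pages already visited so the walk cannot loop forever.
        candidates = [link for link in get_links(current) if link not in visited]
        if not candidates:
            break
        current = rng.choice(candidates)
        visited.add(current)
        path.append(current)
    return path

# Made-up link graph standing in for real pages.
fake_graph = {
    '/item/a/1': ['/item/b/2', '/item/c/3'],
    '/item/b/2': ['/item/a/1', '/item/c/3'],
    '/item/c/3': ['/item/a/1'],
}

path = random_walk('/item/a/1', lambda page: fake_graph.get(page, []))
print(path)
```

To use it with the crawler, get_links could be a small wrapper such as lambda page: [a.attrs['href'] for a in getLinks(page)].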
