[Python] Web Scraping with Python (《Python网络爬虫权威指南》), Chapter 3 task: testing the six degrees of separation theory
Task description
Can one wiki page be reached from another by following only internal links, in at most six jumps? For this book, the task is to get from https://en.wikipedia.org/wiki/Eric_Idle to https://en.wikipedia.org/wiki/Kevin_Bacon.
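To make the task concrete, here is a minimal sketch on a toy link graph. The graph `TOY_LINKS` and the helper `find_paths` are hypothetical names made up for illustration, not taken from the book; the sketch only shows what "reachable in at most six jumps" means as a depth-limited search.

```python
from typing import Dict, List

# Hypothetical toy link graph: page name -> pages it links to.
TOY_LINKS: Dict[str, List[str]] = {
    'A': ['B', 'C'],
    'B': ['D'],
    'C': ['D', 'A'],
    'D': ['E'],
    'E': [],
}


def find_paths(links: Dict[str, List[str]], start: str, target: str,
               max_jumps: int = 6) -> List[List[str]]:
    """Return every loop-free path from start to target that uses at most max_jumps links."""
    found: List[List[str]] = []

    def walk(page: str, path: List[str]) -> None:
        if page == target:
            found.append(path)
            return
        # len(path) - 1 is the number of jumps taken so far.
        if len(path) - 1 >= max_jumps:
            return
        for nxt in links.get(page, []):
            if nxt not in path:  # never revisit a page on the current path
                walk(nxt, path + [nxt])

    walk(start, [start])
    return found


for p in find_paths(TOY_LINKS, 'A', 'E'):
    print(' -> '.join(p))
```

On Wikipedia the graph is not known in advance, so each page's outgoing links have to be fetched and parsed on the way.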
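Approach
The book already covers it, so I won't repeat it here.
Process notes
Since the pandemic had me stuck at home with nothing better to do, I left my laptop running for three days. The final results: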
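- Crawled more than 80,000 pages and saved them locally (10 GB+);
- Analyzed more than 200,000 internal links;
- Found more than ten viable paths;
- Did not actually find every viable path; in the end I did not want to keep it running.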
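Code
Fetch a wiki page and save it locally (there is the firewall to deal with, after all, so a local copy makes it easy to rerun after a failure)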
```python
# CH3_GetWikipedia.py (the module name that the second script imports)
import hashlib
import time
from http.client import HTTPResponse
from typing import Tuple
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

storage_directory = 'D:/MyResources/爬蟲數據/Wiki Pages'

# Characters that are not allowed in Windows file names.
FORBIDDEN_CHARS = '"?*<>:/\\|'


def process_filename(filename: str) -> str:
    """Map a page name to a safe file path inside storage_directory."""
    # Use a digest rather than the built-in hash(): str hashes are randomized
    # per process, so hash() would give a different fallback name on every run.
    digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
    for ch in FORBIDDEN_CHARS:
        filename = filename.replace(ch, '')
    # Fall back to the digest if nothing usable is left (empty or only dots).
    if len(filename) == 0 or len(filename) == filename.count('.'):
        filename = digest
    return storage_directory + '/' + filename


def get_and_store_page(url: str, filename: str) -> bool:
    """Download url and save it under filename; return True on success."""
    try:
        with urlopen(url) as response:  # type: HTTPResponse
            html = response.read().decode(encoding='utf-8')
    except HTTPError as e:
        print(f'HTTPError: {e}')
        return False
    except URLError as e:
        print(f'URLError: {e}')
        return False
    try:
        with open(file=process_filename(filename), mode='w', encoding='utf-8') as f:
            f.write(html)
    except FileNotFoundError as e:
        print(f'check your file name: {e}')
        return False
    time.sleep(1)  # pause one second between requests
    return True


def load_stored_html(filename: str) -> Tuple[str, bool]:
    """Return (html, True) if the page is stored locally, else ('', False)."""
    try:
        with open(file=process_filename(filename), mode='r', encoding='utf-8') as f:
            return f.read(), True
    except FileNotFoundError as e:
        print(f'check your filename: {e}')
        return '', False


if __name__ == '__main__':
    for name in ('Kevin_Bacon', 'Eric_Idle'):
        url = 'https://en.wikipedia.org/wiki/' + name
        if get_and_store_page(url=url, filename=name + '.html'):
            print(f'success: {url}')
        else:
            print(f'fail: {url}')
```
Testing the six degrees of separation theory
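The crawler in this section follows only internal article links: `href` values that start with `/wiki/` and contain no colon. A minimal sketch of that filter in isolation, using the same regular expression; the helper name `is_article_link` and the sample `href` values are made up for illustration.

```python
import re

# Same pattern as in the crawler: starts with /wiki/ and no ':' anywhere after it.
ARTICLE_LINK = re.compile('^(/wiki/)((?!:).)*$')


def is_article_link(href: str) -> bool:
    """True if href looks like an internal article link."""
    return ARTICLE_LINK.match(href) is not None


# Hypothetical sample hrefs of the kinds that appear on a wiki page.
samples = [
    '/wiki/Kevin_Bacon',           # article link
    '/wiki/Eric_Idle#Career',      # article link with a fragment
    '/wiki/Category:Actors',       # namespace page, contains a colon
    '/wiki/File:Example.jpg',      # namespace page, contains a colon
    '/w/index.php?title=Example',  # does not start with /wiki/
    '#cite_note-1',                # in-page anchor
]

for href in samples:
    print(f'{href!r}: {is_article_link(href)}')
```

Note that a link with a fragment passes the filter, so `/wiki/Eric_Idle` and `/wiki/Eric_Idle#Career` are treated as two different paths.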
```python
import re
import time

from bs4 import BeautifulSoup
from bs4.element import Tag
from CH3_GetWikipedia import load_stored_html, get_and_store_page

host = 'https://en.wikipedia.org'
visited_url = dict()  # path -> smallest jump count at which it was reached
jump_path = [''] * 7  # current chain of URLs: the start page plus up to six jumps
results = []
# Internal article links: start with /wiki/ and contain no colon.
article_link = re.compile('^(/wiki/)((?!:).)*$')


def find_kevin_bacon(path: str, jumps: int) -> None:
    """Depth-first search over article links, at most six jumps deep."""
    jump_path[jumps] = host + path
    if path.split('/')[-1] == 'Kevin_Bacon':
        print('!!!! it\'s found!')
        # Keep only the entries of this path; later slots may hold stale
        # URLs left behind by deeper branches visited earlier.
        found = jump_path[:jumps + 1]
        results.append(found)
        with open(file='./result.txt', mode='a', encoding='utf-8') as f:
            for u in found:
                print(u)
                f.write(u + '\n')
            f.write('--------------------\n')
        return
    # Skip pages already reached in the same number of jumps or fewer.
    if path in visited_url and visited_url[path] <= jumps:
        return
    visited_url[path] = jumps
    now = time.strftime('%H:%M:%S')
    print(f'---> {now} jump time: {jumps}, visited: {len(visited_url)}, now visit: {path}.')
    if jumps >= 6:
        return
    filename = path.split('/')[-1] + '.html'
    html, success = load_stored_html(filename=filename)
    if not success:
        # Not stored yet: download it, then read it back from disk.
        if not get_and_store_page(url=host + path, filename=filename):
            return
        html, success = load_stored_html(filename=filename)
    bs = BeautifulSoup(markup=html, features='html.parser')
    body = bs.find(name='div', attrs={'id': 'bodyContent'})
    if body is None:  # page without the expected content container
        return
    for link in body.find_all(name='a', attrs={'href': article_link}):  # type: Tag
        find_kevin_bacon(path=link['href'], jumps=jumps + 1)


if __name__ == '__main__':
    find_kevin_bacon(path='/wiki/Eric_Idle', jumps=0)
    print(f'Found {len(results)} paths in total:')
    for res in results:
        print(' -> '.join(res))
```
Viable paths I found
```
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Time_zone
https://en.wikipedia.org/wiki/Nome,_Alaska
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/England
https://en.wikipedia.org/wiki/Michael_Caine
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/England
https://en.wikipedia.org/wiki/Gary_Oldman
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/England
https://en.wikipedia.org/wiki/Daniel_Day-Lewis
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/New_town
https://en.wikipedia.org/wiki/Edmund_Bacon_(architect)
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Stoke-on-Trent
https://en.wikipedia.org/wiki/Hugh_Dancy
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Coventry
https://en.wikipedia.org/wiki/Bon_Jovi
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Blackpool
https://en.wikipedia.org/wiki/Pleasure_Beach_Blackpool
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Blackpool
https://en.wikipedia.org/wiki/Blackpool_Pleasure_Beach
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Blackpool
https://en.wikipedia.org/wiki/Frasier
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Brighton_and_Hove
https://en.wikipedia.org/wiki/Lewes
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/Isle_of_Wight
https://en.wikipedia.org/wiki/Jeremy_Irons
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Telford_and_Wrekin
https://en.wikipedia.org/wiki/South_Gloucestershire
https://en.wikipedia.org/wiki/EE_Limited
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/Conservative_Party_(UK)
https://en.wikipedia.org/wiki/Early_1990s
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/Margaret_Thatcher
https://en.wikipedia.org/wiki/Meryl_Streep
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/History_of_local_government_in_England
https://en.wikipedia.org/wiki/Cleveland
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
https://en.wikipedia.org/wiki/Eric_Idle
https://en.wikipedia.org/wiki/South_Shields
https://en.wikipedia.org/wiki/Tyne_and_Wear
https://en.wikipedia.org/wiki/Metropolitan_county
https://en.wikipedia.org/wiki/Urban_area
https://en.wikipedia.org/wiki/Empire_State_Building
https://en.wikipedia.org/wiki/Kevin_Bacon
--------------------
```
Thoughts
2020: witnessing history!