Python crawler: notes on a simple crawler that scrapes every country and its capital

Published: 2023/12/20, python, 豆豆


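I expected this to be trivial, but extracting the data took a surprising amount of work. First, the page itself:

[Figure: screenshot of the source page]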

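It looks simple, doesn't it? There are 6 tables in total, holding 217 country-capital pairs. All of the key data sits inside the <table> elements, so a loop is enough to collect it; the hard part is how to pull the values out. Here is my approach.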

from bs4 import BeautifulSoup


def get_data(response):
    # Parse the page.
    soup = BeautifulSoup(response, 'lxml')
    result = soup.find_all('h2')
    state = []  # continent names, taken from the <h2> headings
    for s in result:
        # The headings contain many '\t' and '\n' characters; strip them out.
        # get_text() is used instead of .string, which is None when the
        # heading contains nested tags.
        state.append(s.get_text().replace("\t", '').replace("\n", '').replace(" ", ''))
    # All countries live inside <table> tags; each <table> has <tr> rows
    # made of <td> cells. The <td> contents look like this:
    #   1 中國 北京    18 以色列 耶路撒冷   35 巴勒斯坦 耶路撒冷
    #   2 韓國 首爾    19 也門 薩那        36 卡塔爾 多哈
    #   3 朝鮮 平壤    20 菲律賓 馬尼拉    37 黎巴嫩 貝魯特
    # It took quite a bit of trial and error to arrive at the extraction below.
    country = {}
    filter_string = ["序號", "國家", "首都", "首都或首府", "國家或地區"]
    table = soup.find_all('table')
    index = 0
    for tag in table:
        td_data = []
        for tr in tag.find_all("tr"):
            for td in tr.find_all("td"):
                # Skip the header cells (index, country, capital) and the
                # numeric index cells; keep everything else that is non-empty.
                value = td.get_text().replace("\t", '').replace("\n", '').replace("\xa0", '')
                if value in filter_string or value.isdigit():
                    continue
                if value != "":
                    td_data.append(value)
        # The i-th table is stored under the i-th <h2> heading.
        country.update({state[index]: td_data})
        index = index + 1
    return country

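country now holds the scraped continent-country-capital data. This function is the core of the whole crawler. Analysing the page structure took a lot of effort, and I honestly could not come up with a better approach. If anyone knows a better way, pointers are welcome.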
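The filtering step inside the inner loop can be pulled out into a small pure function, which makes it easy to check without fetching the page. The following is only a sketch: the names HEADER_WORDS, clean_cells and to_pairs are hypothetical, the header words are copied from filter_string in get_data, and the sample cells are made up to resemble the table contents described above.

```python
# Hypothetical helper names; the header words are copied from filter_string.
HEADER_WORDS = {"序號", "國家", "首都", "首都或首府", "國家或地區"}


def clean_cells(raw_cells):
    """Strip layout whitespace and drop header, index, and empty cells."""
    cleaned = []
    for raw in raw_cells:
        value = raw.replace("\t", "").replace("\n", "").replace("\xa0", "")
        if not value or value in HEADER_WORDS or value.isdigit():
            continue
        cleaned.append(value)
    return cleaned


def to_pairs(cells):
    """Group a flat [country, capital, country, capital, ...] list into tuples."""
    # zip stops at the shorter slice, so an unpaired trailing cell is dropped.
    return list(zip(cells[0::2], cells[1::2]))


# Made-up sample resembling the <td> texts of one table.
sample = ["序號", "國家", "首都", "1", "\n中國\t", "北京\xa0", "2", "韓國", "首爾", ""]
pairs = to_pairs(clean_cells(sample))
print(pairs)
```

Keeping the cleaning separate from the BeautifulSoup traversal means the same function could be reused if the page layout changes and only the traversal needs rewriting.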

Saving the data to CSV

import csv


def save_csv(data):
    # Further processing: all the data has been collected, but it has not
    # yet been split into rows such as
    #   亞洲, 中國, 北京
    #   亞洲, 以色列, 耶路撒冷
    # Walk the dictionary and store the rows in a new list.
    country = []
    for key, values in data.items():
        # Step through (country, capital) pairs; stopping at len - 1 avoids
        # an IndexError if a table yields an odd number of cells.
        for index in range(0, len(values) - 1, 2):
            country.append([key, values[index], values[index + 1]])
    # Write the rows to country.csv; mode 'w' overwrites an existing file.
    with open('country.csv', 'w', newline='', encoding='utf-8') as f:
        fieldnames = ['州', '國家', '首都']
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval='None')
        # Writing dictionaries requires the header row to be written first.
        writer.writeheader()
        for item in country:
            writer.writerow({'州': item[0], '國家': item[1], '首都': item[2]})
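Since every row has the same three columns, a plain csv.writer would also do the job. The sketch below is an alternative to the DictWriter version, not a replacement for it: the function name write_rows and the English header names are assumptions, and it writes to any file-like object so the result can be inspected in memory.

```python
import csv
import io


def write_rows(data, out, header=("continent", "country", "capital")):
    """Flatten {continent: [country, capital, ...]} and write CSV rows to out."""
    writer = csv.writer(out)
    writer.writerow(header)
    count = 0
    for continent, values in data.items():
        # zip stops at the shorter slice, so an unpaired trailing value is skipped.
        for country, capital in zip(values[0::2], values[1::2]):
            writer.writerow([continent, country, capital])
            count += 1
    return count


# Made-up sample in the same shape that get_data returns.
buffer = io.StringIO()
written = write_rows({"亞洲": ["中國", "北京", "韓國", "首爾"]}, buffer)
print(written)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open('country.csv', 'w', newline='', encoding='utf-8') instead of the in-memory buffer.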

That is the whole crawler, built with urllib and BeautifulSoup, with the data saved to a CSV file.
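The download step is not shown in the post. A minimal sketch of how the page might be fetched with urllib follows; the function name fetch_html, the User-Agent header, the timeout and the UTF-8 fallback are all assumptions, and the target URL is deliberately left out because the post does not give it.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download a page and return it as text (hypothetical helper)."""
    # Some sites reject urllib's default User-Agent; this value is an assumption.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the response headers, else fall back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Intended use, with page_url standing in for the address of the source page:
#   country = get_data(fetch_html(page_url))
#   save_csv(country)
```

The returned string can be passed straight to get_data, since BeautifulSoup accepts either text or bytes.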
