Crawler Project (4) --- Collecting Nationwide Provincial Epidemic Data Since January 22
This time we collect nationwide provincial epidemic data since January 22.
As before, the data source is the Dingxiangyuan (丁香園) real-time COVID-19 dashboard homepage:
url:https://ncov.dxy.cn/ncovh5/view/pneumonia
Analysis
Determine the URL for the nationwide provincial epidemic data since January 22.
From project (3) we can already obtain the nationwide provincial epidemic data; each province's historical data can be downloaded as JSON, for example: https://file1.dxycdn.com/2020/0223/331/3398299755968040033-135.json
Approach
1. Collect nationwide provincial epidemic data since January 22
- Load the most recent day's nationwide epidemic data
- Iterate over it to get each province's epidemic data URL
- Send a request to fetch each province's epidemic JSON string
- Parse each province's JSON string and append the records to a list
- Save the epidemic data in JSON format
2. Refactor the code
- Extract repeated code into new methods to improve reuse
import requests
import re
import json
from bs4 import BeautifulSoup
from tqdm import tqdm


class CoronaSpider(object):

    def __init__(self):
        self.home_url = 'https://ncov.dxy.cn/ncovh5/view/pneumonia'

    def get_content_from_url(self, url):
        # Send the request and return the decoded response body
        response = requests.get(url)
        return response.content.decode()

    def parse_home_page(self, home_page, tag_id):
        # Parse the homepage and pull out the JSON embedded in the <script> tag with the given id
        soup = BeautifulSoup(home_page, 'lxml')
        script = soup.find(id=tag_id)
        text = script.text
        # The brackets are special regex characters and must be escaped;
        # findall returns a list, so [0] takes the complete JSON string
        json_str = re.findall(r'\[.+\]', text)[0]
        # Convert the JSON string into Python data
        data = json.loads(json_str)
        return data

    def parse_corona_virus(self, last_day_corona_virus_of_china, desc):
        corona_virus = []
        # Iterate over the last-day records to get each region's statisticsData URL
        for country in tqdm(last_day_corona_virus_of_china, desc):
            statustics_data_url = country['statisticsData']
            # Request the region's historical data as a JSON string
            statustics_data_json_str = self.get_content_from_url(statustics_data_url)
            # Parse the JSON string and keep only the 'data' list
            statustics_data = json.loads(statustics_data_json_str)['data']
            for one_day in statustics_data:
                # The daily records carry no region information, so add it here
                one_day['provinceName'] = country['provinceName']
                if country.get('countryShortCode'):
                    one_day['countryShortCode'] = country['countryShortCode']
            # Append every region's records to one big list
            corona_virus.extend(statustics_data)
        return corona_virus

    def load(self, path):
        # Load previously saved JSON data
        with open(path) as fp:
            data = json.load(fp)
        return data

    def save(self, data, path):
        # Save data in JSON format
        with open(path, 'w') as fp:
            json.dump(data, fp)

    def crawl_last_day_corona_virus(self):
        # Crawl the most recent day's data for every country
        home_page = self.get_content_from_url(self.home_url)
        last_data_corona_virus = self.parse_home_page(home_page, tag_id='getListByCountryTypeService2true')
        self.save(last_data_corona_virus, r'E:\Jupyter_workspace\study\python\爬蟲\last_day_nature_num111.json')

    def crawl_corona_virus(self):
        # Crawl every country's history since January 23, using the last-day file saved earlier
        last_day_corona_virus = self.load(r'E:\Jupyter_workspace\study\python\爬蟲\last_day_nature_num.json')
        corona_virus = self.parse_corona_virus(last_day_corona_virus, desc='獲取從01月23號以來的世界各國疫情信息')
        self.save(corona_virus, r'E:\Jupyter_workspace\study\python\爬蟲\corona_virus.json')

    def craw_last_day_corona_virus_of_china(self):
        # Crawl the most recent day's data for every province of China
        home_page = self.get_content_from_url(self.home_url)
        craw_last_day_corona_virus_of_china = self.parse_home_page(home_page, tag_id='getAreaStat')
        self.save(craw_last_day_corona_virus_of_china, r'E:\Jupyter_workspace\study\python\爬蟲\craw_last_day_corona_virus_of_china.json')

    def crawl_corona_virus_of_china(self):
        # Crawl every province's history since January 22, using the last-day file saved above
        last_day_corona_virus_of_china = self.load(r'E:\Jupyter_workspace\study\python\爬蟲\craw_last_day_corona_virus_of_china.json')
        corona_virus = self.parse_corona_virus(last_day_corona_virus_of_china, '采集01月22號以來各省疫情信息')
        self.save(corona_virus, r'E:\Jupyter_workspace\study\python\爬蟲\last_day_corona_virus_of_china.json')

    def run(self):
        # Assumes the last-day JSON files were already produced by the
        # crawl_last_day_* methods above (or by the earlier projects)
        self.crawl_corona_virus()
        self.crawl_corona_virus_of_china()


if __name__ == '__main__':
    spider = CoronaSpider()
    spider.run()
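A quick way to verify the saved results, assuming the same output path used above:

import json

# Load the saved provincial history and take a quick look at it
with open(r'E:\Jupyter_workspace\study\python\爬蟲\last_day_corona_virus_of_china.json') as fp:
    records = json.load(fp)

print(len(records))   # total number of daily records collected
print(records[0])     # one day's record, now carrying provinceName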
From the previous four crawler projects we can draw the following conclusions (a rough skeleton illustrating them follows the list):
1. Every crawler task consists of three parts: sending requests and getting responses, parsing data, and saving data
2. The request/response code and the data-saving code can be extracted into shared methods, improving reusability and maintainability
3. When the pages visited have a similar structure, the parsing code can also be extracted, improving reusability and maintainability
4. Wrapping each piece of functionality in its own method makes the code easier to extend and maintain
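As a rough illustration of these points, a reusable crawler skeleton might look like the sketch below (the names here are illustrative, not part of the project):

import json
import requests


class BaseSpider(object):
    """Shared request/response and save logic (points 1, 2 and 4)."""

    def get_content_from_url(self, url):
        # Part 1: send the request and get the response
        return requests.get(url).content.decode()

    def save(self, data, path):
        # Part 3: save the data as JSON
        with open(path, 'w') as fp:
            json.dump(data, fp, ensure_ascii=False)

    def parse(self, content):
        # Part 2: parsing depends on the page structure (point 3),
        # so concrete spiders override this method
        raise NotImplementedError

    def run(self, url, path):
        self.save(self.parse(self.get_content_from_url(url)), path)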