Data Visualization: World Bank Data (1960-2017)
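I selected and downloaded the dataset The World Bank Data by Indicators 1960-2017 for this assignment, and chose Jupyter Notebooks (Python) as my visualization tool.
The dataset is very large but well structured (organized as tables), and it contains more than 20 cleaned sub-datasets. I chose two of them, climate-change and health, for exploratory analysis. Python now offers many excellent tools for computation and visualization, such as NumPy, Pandas, Matplotlib and Pyecharts.
1. Carbon dioxide emissions
The file climate-change.csv contains the emissions of various greenhouse gases for countries around the world from 1960 to 2014.
To explore emissions of carbon dioxide, I decided to first visualize the CO2 emissions of the world's major countries.
First, select and clean the data, for example by choosing countries and correcting or removing obviously erroneous values: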
import pandas as pd
import matplotlib.pyplot as plt

# Use a font with CJK glyphs so that any Chinese text in labels renders correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

raw_data = pd.read_csv('climate-change.csv')
dismiss_years = [1960, 1970, 1980, 1990]  # years dropped during cleaning
co2_columns = ['Country Name', 'Year', 'CO2 emissions (kt)']

def select_co2(country_name):
    """Return the cleaned CO2 rows of one country, sorted by year."""
    data = raw_data.loc[raw_data['Country Name'] == country_name, co2_columns]
    data = data.loc[data['CO2 emissions (kt)'] != 0]    # drop zero entries
    data = data.loc[~data['Year'].isin(dismiss_years)]  # drop the dismissed years
    return data.sort_values('Year')

China_CO2_emission_data = select_co2('China')
US_CO2_emission_data = select_co2('United States')
India_CO2_emission_data = select_co2('India')
Japan_CO2_emission_data = select_co2('Japan')
UK_CO2_emission_data = select_co2('United Kingdom')
France_CO2_emission_data = select_co2('France')
Russia_CO2_emission_data = select_co2('Russian Federation')
Germany_CO2_emission_data = select_co2('Germany')

Next, visualize the CO2 emissions of the world's major countries:
plt.close('all')
plt.figure(figsize=(10.0, 8.0))

# (data, line color, legend label) for each country
lines = [
    (China_CO2_emission_data, 'red', 'China'),
    (US_CO2_emission_data, 'blue', 'United States'),
    (India_CO2_emission_data, 'green', 'India'),
    (Russia_CO2_emission_data, 'pink', 'Russia'),
    (Japan_CO2_emission_data, 'purple', 'Japan'),
    (Germany_CO2_emission_data, 'black', 'Germany'),
    (UK_CO2_emission_data, 'brown', 'United Kingdom'),
    (France_CO2_emission_data, 'orange', 'France'),
]
for data, color, label in lines:
    plt.plot(data['Year'], data['CO2 emissions (kt)'], lw=2, ls='-', color=color, label=label)

# Data from 1961 to 2014
plt.title('CO2 emissions of major countries')
plt.xlabel('Year')
plt.ylabel('CO2 emissions / kt')
plt.xlim(1960, 2015)
plt.ylim(0, 1.2e7)
plt.grid(which='major', axis='both', color='black', linestyle='--', alpha=0.2)
plt.legend(loc='upper left')

plt.savefig('img1.jpg')
plt.show()

Visualization result:
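The figure shows that some Western countries' CO2 emissions have been fairly stable, while others grew gradually during the last century and began to level off in this century. Emissions of most Asian countries other than Japan have kept growing, and the growth has accelerated since the start of this century.
China's CO2 emissions began to grow rapidly after 2000. By 2014 China had become the largest CO2 emitter, with emissions nearly twice those of the second-largest emitter, the United States. Since 2012, however, the growth of China's CO2 emissions has slowed markedly.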
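One way to check the slowdown claim is to look at year-over-year growth rates. The following is a minimal sketch, not the author's code: `yearly_growth` is a hypothetical helper, the toy numbers are made up rather than taken from climate-change.csv, and the frame is assumed to have the `Year` and `CO2 emissions (kt)` columns used above.

```python
import pandas as pd

def yearly_growth(df, value_col='CO2 emissions (kt)'):
    """Relative change of value_col from one available year to the next."""
    series = df.sort_values('Year').set_index('Year')[value_col]
    return series.pct_change().dropna()

# Made-up values for illustration only (hypothetical, not from the dataset)
toy = pd.DataFrame({
    'Year': [2012, 2010, 2011],
    'CO2 emissions (kt)': [121.0, 100.0, 110.0],
})
growth = yearly_growth(toy)
print(growth)
```

Applied to a cleaned country frame, visibly smaller values after 2012 would support the slowdown. Because the cleaning step drops 1970, 1980 and 1990, the figure for the year after each of those gaps would span two years rather than one.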
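Does economic growth necessarily mean higher CO2 emissions?
Because the dataset lacks GDP, a key indicator of economic conditions, I additionally downloaded a GDP dataset covering countries worldwide from 1960 to 2019 from the World Bank's official website.
After selecting and cleaning the data, I visualized the CO2 emissions and GDP of major countries in the same figure: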
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Use a font with CJK glyphs so that any Chinese text in labels renders correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

raw_emission_data = pd.read_csv('climate-change.csv')
dismiss_years = [1960, 1970, 1980, 1990]  # years dropped during cleaning
co2_columns = ['Country Name', 'Year', 'CO2 emissions (kt)']

raw_gdp_data = pd.read_csv('gdp.csv')
# Drop the metadata columns and the years outside 1961-2014
raw_gdp_data = raw_gdp_data.drop(['Country Code', 'Indicator Name', 'Indicator Code', '1960', '2015', '2016', '2017', '2018', '2019', '2020'], axis=1)
years = np.arange(1961, 2015)

def select_gdp(country_name):
    """Return one country's GDP values for 1961-2014 as a 1-D array."""
    row = raw_gdp_data.loc[raw_gdp_data['Country Name'] == country_name]
    return row.drop(['Country Name'], axis=1).values.ravel()

def select_co2(country_name):
    """Return the cleaned CO2 rows of one country, sorted by year."""
    data = raw_emission_data.loc[raw_emission_data['Country Name'] == country_name, co2_columns]
    data = data.loc[data['CO2 emissions (kt)'] != 0]    # drop zero entries
    data = data.loc[~data['Year'].isin(dismiss_years)]  # drop the dismissed years
    return data.sort_values('Year')

# (name in the data, legend label, line color)
countries = [
    ('China', 'China', 'red'),
    ('United States', 'US', 'blue'),
    ('India', 'India', 'green'),
    ('Japan', 'Japan', 'purple'),
]

plt.close('all')
fig = plt.figure(figsize=(10.0, 8.0))

# Left axis: CO2 emissions (solid lines)
ax1 = fig.add_subplot(111)
for name, label, color in countries:
    data = select_co2(name)
    ax1.plot(data['Year'], data['CO2 emissions (kt)'], lw=2, ls='-', color=color, label=label + ' CO2 emissions')
ax1.set_xlabel('Year')
ax1.set_xlim(1960, 2015)
ax1.set_ylabel('CO2 emissions / kt')
ax1.set_ylim(0, 1.6e7)
ax1.grid(which='major', axis='both', color='black', linestyle='--', alpha=0.2)
ax1.legend(loc='upper left')

# Right axis: GDP (dashed lines)
ax2 = ax1.twinx()
for name, label, color in countries:
    ax2.plot(years, select_gdp(name), lw=1, ls='--', color=color, label=label + ' GDP')
ax2.set_ylabel('GDP / US dollars')
ax2.set_ylim(0, 2.0e13)
ax2.grid(which='major', axis='both', color='black', linestyle='--', alpha=0.2)
ax2.legend(loc='upper left', bbox_to_anchor=(0.25, 1))

# Data from 1961 to 2014
plt.title('CO2 emissions and GDP of major countries')

plt.savefig('img2.jpg')
plt.show()

Visualization result:
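The figure shows that in China and India, CO2 emissions and GDP grew markedly together, whereas in the United States and Japan, CO2 emissions stayed at a fairly stable level while GDP continued to grow markedly.
In light of history, my explanation is as follows. Since the start of this century, rising labor costs in Western countries have pushed most of their industrial production to Asian countries with cheap labor and resources, while less polluting service and high-tech industries have gradually become the pillars of Western economies. The result is that "developing countries grow their economies through polluting industry, while developed countries grow theirs through services and high-tech industries."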
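The contrast between the two groups of countries can also be expressed as a single number per year: emissions divided by GDP. The sketch below is an illustration under assumptions, not part of the original analysis. `carbon_intensity` is a hypothetical helper, both inputs are assumed to be pandas Series indexed by year, and the toy values are made up.

```python
import pandas as pd

def carbon_intensity(co2_kt, gdp_usd):
    """kt of CO2 per billion US dollars of GDP, for years present in both Series."""
    co2, gdp = co2_kt.align(gdp_usd, join='inner')  # keep only shared years
    return co2 / (gdp / 1e9)

# Made-up values for illustration only (hypothetical, not from the datasets)
toy_co2 = pd.Series({2000: 100.0, 2001: 110.0, 2002: 120.0})
toy_gdp = pd.Series({2001: 2.0e9, 2002: 3.0e9, 2003: 4.0e9})
intensity = carbon_intensity(toy_co2, toy_gdp)
print(intensity)
```

A falling intensity would correspond to the pattern described for the United States and Japan (GDP rising while emissions stay flat), whereas emissions and GDP rising in step would keep it roughly constant.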
2. China's population
The file health.csv contains China's male and female population from 1960 to 2017.
To explore how China's population has grown, I cleaned and visualized the data:
import pandas as pd
import matplotlib.pyplot as plt

# Use a font with CJK glyphs so that any Chinese text in labels renders correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

raw_data = pd.read_csv('health.csv')
change_years = [1970, 1980, 1990]  # years whose values are replaced during cleaning
population_columns = ['Population, female', 'Population, male']

# China's population data, without 1960
China_population_data = raw_data.loc[raw_data['Country Name'] == 'China', ['Year'] + population_columns]
China_population_data = China_population_data.loc[China_population_data['Year'] != 1960]

# Replace each value in change_years with the mean of the two neighbouring years
for year in change_years:
    pre_year_data = China_population_data.loc[China_population_data['Year'] == year - 1, population_columns].values
    next_year_data = China_population_data.loc[China_population_data['Year'] == year + 1, population_columns].values
    China_population_data.loc[China_population_data['Year'] == year, population_columns] = (pre_year_data + next_year_data) / 2

China_population_data.sort_values('Year', inplace=True)

plt.close('all')
plt.figure(figsize=(10.0, 8.0))

years = China_population_data['Year'].values
China_male_population = China_population_data['Population, male'].values
China_female_population = China_population_data['Population, female'].values

# Stacked bars: male at the bottom, female on top
width = 0.8
male_bar = plt.bar(years, China_male_population, width, color='royalblue', label='Male')
female_bar = plt.bar(years, China_female_population, width, bottom=China_male_population, color='hotpink', label='Female')

# Data from 1961 to 2017
plt.title("China's population")
plt.xlabel('Year')
plt.ylabel('Population')
plt.grid(which='major', axis='both', color='black', linestyle='--', alpha=0.2)
plt.legend(loc='upper left')

plt.savefig('img3.jpg')
plt.show()

Visualization result:
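The figure clearly shows two turning points in China's population growth: one between 1970 and 1980, the other between 1990 and 2000.
From the literature I learned that in the Fourth Five-Year Plan (1971 to 1975) the government put forward the slogan "one is not too few, two is just right, three is too many", and that from 1995 the government advocated "late marriage and late childbearing, fewer and better births". The timing of the two policies roughly matches the two turning points in the figure. Family planning policy has evidently had a huge effect on China's population.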
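Turning points are easier to locate in the yearly increase than in the stacked totals. The following is a small sketch under assumptions: `annual_increase` is a hypothetical helper, and the toy totals are made up, not taken from health.csv.

```python
import pandas as pd

def annual_increase(years, totals):
    """Population added in each year relative to the previous year."""
    series = pd.Series(list(totals), index=list(years)).sort_index()
    return series.diff().dropna()

# Made-up totals for illustration only (hypothetical, not from the dataset)
toy_increase = annual_increase([2001, 2000, 2002, 2003], [110, 100, 125, 130])
print(toy_increase)
```

With the real data, the inputs would be the year column and the sum of the male and female columns; a sustained drop in the yearly increase would mark a turning point of the kind described above.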
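3. World population by country
The file health.csv contains the population of countries around the world from 1960 to 2017.
To present the population of the world's countries more clearly, I first used a histogram to summarize country populations in 2017: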
import pandas as pd
import matplotlib.pyplot as plt

# Use a font with CJK glyphs so that any Chinese text in labels renders correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

raw_data = pd.read_csv('health.csv')
country_codes = pd.read_csv('country code.csv')

# True for rows whose 'Country Code' appears anywhere in country_codes
is_country = raw_data['Country Code'].isin(country_codes.values.ravel())

population_data = raw_data.loc[is_country & (raw_data['Year'] == 2017), ['Population, total']]
population_data = population_data.loc[population_data['Population, total'] != 0]

plt.hist(population_data['Population, total'].values, bins=10, edgecolor='black', facecolor='royalblue')

plt.title('Histogram of country populations in 2017')
plt.xlabel('Population')
plt.ylabel('Frequency')
plt.grid(which='major', axis='both', color='black', linestyle='--', alpha=0.2)

plt.savefig('img6.jpg')
plt.show()

Result:
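The histogram shows that the vast majority of countries (more than 200) have fewer than 200 million people. Moreover, the smallest population (Republic of Nauru, 13,649) and the largest (China, 1,386,395,000) differ by a factor of more than one hundred thousand, so the gap between country populations is extremely wide.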
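The size of that gap, and why a 10-bin histogram crowds almost everything into its first bar, can be checked with a few lines of arithmetic. This sketch uses only the two figures quoted above and assumes they are the extremes of the plotted data.

```python
# Figures quoted in the text above
smallest = 13_649            # Nauru
largest = 1_386_395_000      # China

# How many times larger the largest population is
ratio = largest / smallest

# Width of each bin when the data range is split into 10 equal-width bins,
# which is what plt.hist(..., bins=10) does by default
bin_width = (largest - smallest) / 10

print(f"ratio: {ratio:,.0f}")
print(f"bin width: {bin_width:,.0f}")
```

Under that assumption each bin is roughly 139 million wide, so every country below that size falls into the first bin.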
Next, I visualized the 30 most populous countries together with the combined population of the remaining countries:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Use a font with CJK glyphs so that any Chinese text in labels renders correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

raw_data = pd.read_csv('health.csv')
country_codes = pd.read_csv('country code.csv')

# True for rows whose 'Country Code' appears anywhere in country_codes
is_country = raw_data['Country Code'].isin(country_codes.values.ravel())

population_data = raw_data.loc[is_country & (raw_data['Year'] == 2017), ['Country Name', 'Population, total']]
population_data = population_data.loc[population_data['Population, total'] != 0]
population_data.sort_values('Population, total', ascending=False, inplace=True)

plt.close('all')
plt.figure(figsize=(15.0, 10.0))

width = 0.8
number = 30  # number of countries shown individually

names = population_data['Country Name'].values
populations = population_data['Population, total'].values

# Top `number` countries plus one bar for all remaining countries.
# The slice [number:] includes the last country in the remainder.
country_names = np.append(names[:number], 'Other countries')
other_population = np.sum(populations[number:])
country_populations = np.append(populations[:number], other_population)

# Flip so that the largest country is drawn at the top
plt.barh(np.flipud(country_names), np.flipud(country_populations), width, color='royalblue')

plt.gca().xaxis.set_ticks_position('top')
plt.xticks(fontname='Arial', fontsize=10)
plt.yticks(fontname='Arial', fontsize=10)

plt.title('Population of major countries in 2017')
plt.grid(which='major', axis='both', color='black', linestyle='--', alpha=0.2)

plt.savefig('img4.jpg')
plt.show()

Visualization result:
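The two most populous countries, China and India, each have more than 1.3 billion people, while the third, the United States, has fewer than 400 million; each of the first two is more than four times as large. It is easy to imagine that the gap is even larger for less populous countries.
Such extreme disparity in the data is bad for visualization (how to address it is discussed later).
The bar chart above makes it easy to compare the populations of the top 30 countries, but each country's share of world population is less obvious. So I also drew a pie chart (for a better visual result I took the top 10 countries):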
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Use a font with CJK glyphs so that any Chinese text in labels renders correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

raw_data = pd.read_csv('health.csv')
country_codes = pd.read_csv('country code.csv')

# True for rows whose 'Country Code' appears anywhere in country_codes
is_country = raw_data['Country Code'].isin(country_codes.values.ravel())

population_data = raw_data.loc[is_country & (raw_data['Year'] == 2017), ['Country Name', 'Population, total']]
population_data = population_data.loc[population_data['Population, total'] != 0]
population_data.sort_values('Population, total', ascending=False, inplace=True)

plt.close('all')
plt.figure(figsize=(10.0, 10.0))

number = 10  # number of countries shown individually

names = population_data['Country Name'].values
populations = population_data['Population, total'].values

# Top `number` countries plus one slice for all remaining countries.
# The slice [number:] includes the last country in the remainder.
country_names = np.append(names[:number], 'Other countries')
other_population = np.sum(populations[number:])
country_populations = np.append(populations[:number], other_population)

colors = ['red', 'orange', 'royalblue', 'chocolate', 'green', 'gold', 'tomato', 'darkgreen', 'blue', 'peru', 'lightgray']

# Pull the two largest slices slightly out of the pie
explode = np.zeros(country_names.shape[0])
explode[0] = 0.1
explode[1] = 0.1

patches, labels, percents = plt.pie(country_populations, colors=colors, labels=country_names, explode=explode, autopct='%1.1f%%', shadow=True)

for label in labels:
    label.set_fontname('Arial')
    label.set_fontsize(12)
for percent in percents:
    percent.set_fontname('Arial')
    percent.set_fontsize(10)

plt.title('Share of world population by major country in 2017')

plt.savefig('img5.jpg')
plt.show()

Visualization result:
This makes each country's share of world population easy to see. China and India each account for more than 1/6 of world population, and together for more than 1/3.
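The shares that the pie chart prints can also be computed directly. The sketch below is illustrative only: `top_n_shares` is a hypothetical helper, the input is assumed to be a pandas Series of populations indexed by country name, and the toy values are made up.

```python
import pandas as pd

def top_n_shares(populations, n):
    """Shares of the n largest entries plus one 'Other countries' entry."""
    ordered = populations.sort_values(ascending=False)
    top = ordered.iloc[:n]
    # iloc[n:] takes every remaining entry, including the last one
    other = pd.Series({'Other countries': ordered.iloc[n:].sum()})
    combined = pd.concat([top, other])
    return combined / combined.sum()

# Made-up values for illustration only (hypothetical, not from the dataset)
toy_populations = pd.Series({'A': 50, 'B': 30, 'C': 15, 'D': 5})
shares = top_n_shares(toy_populations, 2)
print(shares)
```

With the real 2017 data and n = 10, the first two entries would be the figures behind the 1/6 and 1/3 statements above.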
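Finally, I used Pyecharts to show each country's population as a color on a world map (Code A):
But the result is poor: apart from China and India, which are red, most other countries are blue-green or green.
This is caused by the extreme values in the data: on the color bar, China and India sit at the top while the other countries are "crowded" at the bottom, so the color differences between countries are small and the colors in the middle are barely used.
This suggested taking the logarithm of the population, which greatly reduces the differences between countries.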
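The effect of the logarithm on the color bar can be illustrated with the two extremes quoted earlier. This is a sketch under assumptions: the color bar is taken to run from the smallest to the largest value, and the 10-million country is a hypothetical example, not a row from the dataset.

```python
import math

smallest = 13_649            # Nauru, from the text
largest = 1_386_395_000      # China, from the text
example = 10_000_000         # hypothetical mid-sized country, for illustration

def linear_position(value):
    """Position on a color bar that maps [smallest, largest] linearly to [0, 1]."""
    return (value - smallest) / (largest - smallest)

def log_position(value):
    """Position on the same color bar after taking log10 of every value."""
    low, high = math.log10(smallest), math.log10(largest)
    return (math.log10(value) - low) / (high - low)

print(f"linear: {linear_position(example):.4f}")
print(f"log10:  {log_position(example):.4f}")
```

On the linear scale the example country sits below 1% of the bar, whereas after the log transform it lands a little past the middle, which is why the middle colors start being used.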
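The two figures below compare country populations before and after taking the logarithm (Code B):
Original data:
Log-transformed:
After the log transform, the differences between countries become much smoother. I therefore visualized the log-transformed populations:
This visualization is much better than the previous one.
But I still see a clear shortcoming: most of the map is red, orange or yellow, while blue and green cover very little of it.
Analysis shows the reason: most countries with a large territory also tend to have a large population, so they are colored yellow, orange or red, which makes the map look almost entirely yellow, orange or red.
The improvement is to base the color bar on the data of only the top few dozen countries and to give the remaining countries the color of the smallest value. Although the remaining countries then share one color, their populations are mostly on the order of 10^4 to 10^6 (tens of thousands to a few million), and the differences among them are small compared with countries of tens or hundreds of millions, so the approach is reasonably justified.
Visualization result:
After repeated comparison, I settled on using the top 150 countries for the color bar distribution. This further improved the map.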
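The same idea can be written as an explicit clipping step on the log-transformed values. This is a sketch rather than the code used for the maps (which instead passes a minimum to the color bar): `clamp_to_top_k` is a hypothetical helper, the input is assumed to be sorted in descending order, and the toy values are made up.

```python
import numpy as np

def clamp_to_top_k(values_desc, k):
    """Raise every value below the k-th largest up to the k-th largest.

    values_desc must be sorted in descending order.
    """
    floor = values_desc[k - 1]
    return np.clip(values_desc, floor, None)

# Made-up log10 populations for illustration only (hypothetical values)
toy_logs = np.array([9.0, 8.0, 7.0, 6.0, 5.0])
clamped = clamp_to_top_k(toy_logs, 3)
print(clamped)
```

After clipping, the full color range is spent on the top k countries, and everything below shares the color of the lowest kept value.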
Code A:

from pyecharts.charts import Map
from pyecharts import options as opts
import pandas as pd
import numpy as np

raw_data = pd.read_csv('health.csv')
country_codes = pd.read_csv('country code.csv')

# True for rows whose 'Country Code' appears anywhere in country_codes
is_country = raw_data['Country Code'].isin(country_codes.values.ravel())

population_data = raw_data.loc[is_country & (raw_data['Year'] == 2017), ['Country Name', 'Population, total']]
population_data = population_data.loc[population_data['Population, total'] != 0]
population_data.sort_values('Population, total', ascending=False, inplace=True)

# Names in the data mapped to the names used by the Pyecharts world map
name_map = {
    'Russian Federation': 'Russia',
    'Egypt, Arab Rep.': 'Egypt',
    'Congo, Dem. Rep.': 'Dem. Rep. Congo',
    'Iran, Islamic Rep.': 'Iran',
    'Czech Republic': 'Czech Rep.',
    'Slovak Republic': 'Slovakia',
    'Yemen, Rep.': 'Yemen',
    'Korea, Rep.': 'Korea',
    'Korea, Dem. People’s Rep.': 'Dem. Rep. Korea',
    'Kyrgyz Republic': 'Kyrgyzstan',
    'Bosnia and Herzegovina': 'Bosnia and Herz.',
    'Macedonia, FYR': 'Macedonia',
    'South Sudan': 'S. Sudan',
    'Central African Republic': 'Central African Rep.',
    'Congo, Rep.': 'Congo',
    'Venezuela, RB': 'Venezuela',
    'Dominican Republic': 'Dominican Rep.',
    'Syrian Arab Republic': 'Syria',
    'Equatorial Guinea': 'Eq. Guinea',
    "Cote d'Ivoire": "Côte d'Ivoire",
}

# All countries, sorted from most to least populous
country_names = [name_map.get(name, name) for name in population_data['Country Name']]
country_populations = population_data['Population, total'].tolist()
log_populations = np.log10(population_data['Population, total']).tolist()

def render_world_map(values, min_value, filename):
    """Color each country by its value; the color bar runs from min_value to values[0]."""
    data = list(zip(country_names, values))
    m = (
        Map()
        .add('', data, maptype='world', is_map_symbol_show=False)
        .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        .set_global_opts(
            title_opts=opts.TitleOpts(title=''),
            visualmap_opts=opts.VisualMapOpts(max_=values[0], min_=min_value),
        )
    )
    m.render(filename)

# Raw populations, color bar from 0 to the maximum
render_world_map(country_populations, 0, 'World.html')

# First improvement: log10 of the population, color bar from the minimum to the maximum
render_world_map(log_populations, log_populations[-1], 'World_improved_v1.html')

# Second improvement: the color bar covers only the most populous countries
render_world_map(log_populations, log_populations[150], 'World_improved_v2.html')

Code B:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Use a font with CJK glyphs so that any Chinese text in labels renders correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

raw_data = pd.read_csv('health.csv')
country_codes = pd.read_csv('country code.csv')

# True for rows whose 'Country Code' appears anywhere in country_codes
is_country = raw_data['Country Code'].isin(country_codes.values.ravel())

population_data = raw_data.loc[is_country & (raw_data['Year'] == 2017), ['Country Name', 'Population, total']]
population_data = population_data.loc[population_data['Population, total'] != 0]
population_data.sort_values('Population, total', ascending=False, inplace=True)

# Show every country; for only the top 100, use number = 100 and figsize = (15.0, 30.0)
number = len(population_data)
figsize = (15.0, 40.0)
width = 0.8

country_names = population_data['Country Name'].values[:number]
country_populations = population_data['Population, total'].values[:number]

def plot_bars(values, title, filename):
    """Horizontal bar chart with the largest country at the top."""
    plt.close('all')
    plt.figure(figsize=figsize)
    plt.barh(np.flipud(country_names), np.flipud(values), width, color='royalblue')
    plt.gca().xaxis.set_ticks_position('top')
    plt.xticks(fontname='Arial', fontsize=12)
    plt.yticks(fontname='Arial', fontsize=12)
    plt.title(title)
    plt.grid(which='major', axis='both', color='black', linestyle='--', alpha=0.2)
    plt.savefig(filename)
    plt.show()

# Original data
plot_bars(country_populations, 'Population of major countries in 2017', 'img7_1.jpg')

# Log-transformed data
plot_bars(np.log10(country_populations), 'Order of magnitude (log10) of the population of major countries in 2017', 'img7_2.jpg')