
Head First Statistics, Chapter 1: Visualizing Data

Published: 2025/3/21 · 豆豆


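Preface

Chapter 1 of Head First Statistics introduces four kinds of chart:
1. Comparing basic proportions -> pie chart
2. Comparing the size of values -> bar chart (basic, stacked, and split-category bar charts)
3. Comparing continuous data -> histogram (equal-width bins plot frequency; unequal-width bins plot frequency density)
4. Running total up to a given point in time -> cumulative frequency graph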

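There are two ways to draw these charts in Python: matplotlib and Pandas. In most cases Pandas alone is enough, so this post gives the Pandas implementation first and then adds a few supplements. The data is again the UCI red wine quality dataset used in my earlier data exploration article.

The code and text for the area plot, scatter plot, and hexbin plot near the end of this post come almost directly from the Pandas documentation, with only minor changes. See the references at the end of the article.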

Reading the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Helper functions for reading the data
def ReadAndSaveDataByPandas(target_url=None, file_save_path=None, save=False):
    # Fall back to the default URL and path only when none is supplied
    if target_url is None:
        target_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    if file_save_path is None:
        file_save_path = "/home/fonttian/Data/UCI/Glass/glass.csv"
    wine = pd.read_csv(target_url, header=0, sep=";")
    if save:
        wine.to_csv(file_save_path, index=False)
    return wine

def GetDataByPandas():
    wine = pd.read_csv("/home/font/Data/UCI/WINE/wine.csv")
    y = np.array(wine.quality)
    X = np.array(wine.drop("quality", axis=1))
    # X = np.array(wine)
    columns = np.array(wine.columns)
    return X, y, columns

# X, y, names = GetDataByPandas()
# wine = pd.DataFrame(X)
wine = pd.read_csv("/home/font/Data/UCI/WINE/wine.csv")
print(list(wine.columns))

# Pie chart of how many wines received each quality score
print(set(wine['quality'].value_counts()))
print(wine['quality'].value_counts())
wine['quality'].value_counts().plot.pie(subplots=True, figsize=(8, 8))
plt.show()

Output:

['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']
{199, 681, 10, 18, 53, 638}
5    681
6    638
7    199
4     53
8     18
3     10
Name: quality, dtype: int64

df = pd.DataFrame(3 * np.random.rand(4, 2), index=['a', 'b', 'c', 'd'], columns=['x', 'y'])
print(df)
# One pie per column
df.plot.pie(subplots=True, figsize=(16, 8))
# Same data with custom labels, colors, and percentage annotations
df.plot.pie(subplots=True, labels=['AA', 'BB', 'CC', 'DD'], colors=['r', 'g', 'b', 'c'],
            autopct='%.2f', fontsize=20, figsize=(16, 8))
plt.show()

Output:

          x         y
a  0.357865  0.423390
b  2.318759  2.089677
c  0.464072  0.502673
d  1.140500  2.779330

# Basic bar chart of the first 10 quality scores
wine['quality'][0:10].plot(kind='bar', figsize=(16, 8))
plt.axhline(0, color='k')
plt.show()

# Grouped bars, stacked bars, and horizontal stacked bars
wine[['residual sugar', 'pH', 'quality']][0:10].plot.bar(figsize=(16, 8))
wine[['residual sugar', 'pH']][0:10].plot.bar(figsize=(16, 8), stacked=True)
wine[['residual sugar', 'pH']][0:10].plot.barh(figsize=(16, 8), stacked=True)
plt.show()

# Standard approach: both columns share one set of axes
wine[['residual sugar', 'alcohol']].plot.hist(bins=20, figsize=(16, 10), alpha=0.5)
# Drawn separately: one subplot per column
wine[['residual sugar', 'alcohol']].hist(color='k', alpha=0.5, bins=50, figsize=(16, 10))
plt.show()

df_hist = pd.DataFrame({'a': np.random.randn(1000) + 1,
                        'b': np.random.randn(1000),
                        'c': np.random.randn(1000) - 1},
                       columns=['a', 'b', 'c'])
df_hist.plot.hist(alpha=0.5)
plt.show()

# orientation='horizontal': draw the bars horizontally
# cumulative=True: accumulate the counts from bin to bin
df_hist['b'].plot.hist(orientation='horizontal', cumulative=True)
plt.show()
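The histograms above all use equal-width bins, so bar height is simply frequency. For the unequal-width case mentioned in the preface, bar height should be frequency density, that is, frequency divided by class width, so that bar area stays proportional to frequency. A minimal sketch follows; the `frequency_density` helper, the `hours` data, and the bin edges are hypothetical and chosen only for illustration.

```python
import numpy as np

def frequency_density(values, bin_edges):
    """Return (frequencies, widths, densities) for possibly unequal bins."""
    counts, edges = np.histogram(values, bins=bin_edges)
    widths = np.diff(edges)        # class width of each bin
    densities = counts / widths    # frequency density = frequency / width
    return counts, widths, densities

# Hypothetical data, not taken from the wine dataset
hours = np.array([0.5, 0.7, 1.2, 1.5, 1.8, 2.5, 3.0, 4.5, 6.0, 9.0])
edges = [0, 1, 2, 5, 10]           # unequal class widths: 1, 1, 3, 5

freq, width, density = frequency_density(hours, edges)
print(freq)      # frequency per class
print(density)   # bar heights for the unequal-width histogram
```

To draw the result, one option is `plt.bar(edges[:-1], density, width=width, align='edge')`, which places each bar at its left edge with the correct width.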

# Raw values of the first 15 rows, then their running totals
wine[['fixed acidity', 'residual sugar', 'alcohol', 'quality']][0:15].plot()
wine[['fixed acidity', 'residual sugar', 'alcohol', 'quality']][0:15].cumsum().plot()
plt.show()
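The `cumsum()` call above accumulates raw column values row by row. A cumulative frequency graph in the book's sense accumulates counts instead: how many observations fall at or below each value. The sketch below assumes the quality counts printed in the "Reading the data" output and rebuilds them by hand so it runs without the CSV file; the variable names are illustrative.

```python
import pandas as pd

# Counts per quality score, copied from the value_counts() output shown earlier
quality_counts = pd.Series({5: 681, 6: 638, 7: 199, 4: 53, 8: 18, 3: 10},
                           name="quality")

# Sort by score first, then accumulate the counts
cum_freq = quality_counts.sort_index().cumsum()

# Share of all wines rated at or below each score
cum_share = cum_freq / cum_freq.iloc[-1]

print(cum_freq)
print(cum_share.round(3))
```

With the real data, `wine['quality'].value_counts().sort_index().cumsum().plot()` should give the corresponding line chart.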

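Other commonly used plots

Besides the chart types above, Pandas offers many other visualization methods. This section covers a few of the simpler and more common ones. The rest will be covered as part of data exploration for machine learning; I have already written a short introduction, and the remaining material will be added later -> https://blog.csdn.net/FontThrone/article/details/78188401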

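What the three kinds of plot are for

1. Box plot ---> show the distribution of the data and spot outliers
2. The two kinds of area plot ---> compare the size of values
3. Scatter plot and hexbin plot ---> show the distribution and trend of the data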

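1. Box plot

In a vertical box plot, the five horizontal lines from top to bottom are the upper bound, the upper quartile, the median, the lower quartile, and the lower bound. Points beyond the upper and lower bounds can be treated as candidate outliers. Chapter 3 of the book explains this chart in more detail.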
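The five lines can also be computed directly, which helps when checking what a box plot is showing. This is a minimal sketch assuming the common 1.5 × IQR rule for the bounds (to my knowledge also matplotlib's default whisker setting); the `box_plot_summary` helper and the sample data are hypothetical.

```python
import numpy as np

def box_plot_summary(values, whisker=1.5):
    """Five box-plot lines plus outliers, using the whisker * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {
        "upper": inside.max(),   # furthest point still inside the upper fence
        "q3": q3,
        "median": median,
        "q1": q1,
        "lower": inside.min(),   # furthest point still inside the lower fence
        "outliers": outliers,
    }

# Hypothetical sample with one obvious outlier
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
summary = box_plot_summary(sample)
print(summary)
```

Note that the bounds are drawn at actual data points inside the fences, not at the fences themselves, which is why the upper bound here is 9 rather than 14.5.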

2. Area plot

Area plots come in stacked and unstacked forms. Their advantage is that they make it easier to compare the size of the areas (x * y).

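3. Scatter plot and hexbin plot

These plots show, to some extent, how the data is distributed: where the points concentrate and which way they trend. That is a real help when locating dense regions or looking for a relationship to fit. If necessary, a three-dimensional scatter plot can be used as well, but that has to be done with matplotlib.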
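For the three-dimensional case, a minimal matplotlib sketch might look like the following. It assumes a reasonably recent matplotlib in which `projection="3d"` is available without an explicit `mpl_toolkits` import; the random `points` array is a hypothetical stand-in for three numeric columns.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 50 random points in the unit cube
rng = np.random.default_rng(0)
points = rng.random((50, 3))

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection="3d")

# Color each point by its z value so depth is easier to read
ax.scatter(points[:, 0], points[:, 1], points[:, 2], c=points[:, 2])
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
plt.show()
```

With the wine data, the three coordinate arrays could be replaced by three columns such as `wine['alcohol']`, `wine['pH']`, and `wine['quality']`.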

wine[['fixed acidity', 'residual sugar', 'alcohol', 'quality']].plot.box()

# vert=False: draw the boxes horizontally
# positions=[1, 4, 6, 8]: position of each box along the y axis
# color=color: colors of the box components
# sym='r+': marker style for outliers
color = dict(boxes='DarkGreen', whiskers='DarkOrange', medians='DarkBlue', caps='Gray')
wine[['fixed acidity', 'residual sugar', 'alcohol', 'quality']].plot.box(
    vert=False, positions=[1, 4, 6, 8], color=color, sym='r+')
plt.show()

# pandas also has its own boxplot() interface; it is not demonstrated here

# Stacked area plot (the default), then unstacked
wine[['fixed acidity', 'residual sugar', 'alcohol', 'quality']][0:15].plot.area()
wine[['fixed acidity', 'residual sugar', 'alcohol', 'quality']][0:15].plot.area(stacked=False)
plt.show()

# Scatter plots can be drawn with DataFrame.plot.scatter(). They require numeric
# columns for the x and y axes, specified with the x and y keywords.
df = pd.DataFrame(np.random.rand(50, 4), columns=['a', 'b', 'c', 'd'])
df.plot.scatter(x='a', y='b')

# To plot several column groups on a single axes, call the plot method again with
# the target ax. Using the color and label keywords to tell the groups apart is recommended.
ax = df.plot.scatter(x='a', y='b', color='black', label='Group 1')
df.plot.scatter(x='c', y='d', color='red', label='Group 2', ax=ax)

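# The keyword c can be given as a column name to supply a color for each point.
df.plot.scatter(x='a', y='b', c='c', s=50)

# Other keywords supported by matplotlib scatter can be passed too. This example
# draws a bubble chart that uses a DataFrame column as the bubble size.
df.plot.scatter(x='a', y='b', s=df['c']*200)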

Hexagonal bin plots can be created with DataFrame.plot.hexbin(). If the data is too dense to plot each point individually, a hexbin plot can be a useful alternative to a scatter plot.

df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df['b'] = df['b'] + np.arange(1000)
df.plot.hexbin(x='a', y='b', gridsize=25, figsize=(16, 10))
plt.show()

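A useful keyword argument is gridsize. It controls the number of hexagons in the x direction and defaults to 100. A larger gridsize means more, smaller bins.

By default, a histogram of the counts around each (x, y) point is computed. An alternative aggregation can be specified by passing values to the C and reduce_C_function arguments. C specifies the value at each (x, y) point, and reduce_C_function is a function of one argument that reduces all the values in a bin to a single number (e.g. mean, max, sum, std). In this example the positions are given by columns a and b, and the value is given by column z. The bins are aggregated with numpy's max function.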

df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df['b'] = df['b'] + np.arange(1000)
df['z'] = np.random.uniform(0, 3, 1000)
df.plot.hexbin(x='a', y='b', C='z', reduce_C_function=np.max, gridsize=25, figsize=(16, 10))
plt.show()

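References

1. Pandas 0.19.2 Chinese documentation: Visualization

2. Latest Pandas documentation: Visualization

3. My introductory article on data exploration, written last year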
