Pandas Chinese Official Docs ~ Basic Usage 2

Published: 2024/9/15

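呆鳥 (Dainiao) says: "Translating is not easy. Sometimes it means agonizing over a single word, sometimes proofreading and revising tens of thousands of characters again and again, all so that you get a more accurate translation and a more comfortable read. Dainiao asks for nothing in return. If you find this article useful, please give it a like or share it with a friend who needs it; that is the best encouragement for the hard work of translating."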

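Descriptive statistics

Series and DataFrame support a large number of methods and operations for computing descriptive statistics. Most of them are aggregations such as sum(), mean(), and quantile(), which produce a result smaller than the original data set. Others, such as cumsum() and cumprod(), produce a result of the same size as the original data set. In general these methods accept an axis argument, just like ndarray.{sum, std, …}, but here the axis can be specified by name or by integer: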

Series: no axis argument needed
DataFrame:
- "index", i.e. axis=0, the default
- "columns", i.e. axis=1

For example:

In [77]: df
Out[77]:
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [78]: df.mean(0)
Out[78]:
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [79]: df.mean(1)
Out[79]:
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64
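The name and integer forms of axis are interchangeable. The sketch below shows this on a small made-up frame; the values and variable names are illustrative assumptions, not the data from the session above.

```python
import numpy as np
import pandas as pd

# Illustrative data only; not the frame used in the session above.
df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, np.nan], "two": [4.0, 5.0, 6.0, 7.0]},
    index=list("abcd"),
)

# Reduce down the rows: one result per column.
col_means_by_int = df.mean(axis=0)
col_means_by_name = df.mean(axis="index")

# Reduce across the columns: one result per row.
row_means_by_int = df.mean(axis=1)
row_means_by_name = df.mean(axis="columns")

print(col_means_by_name)
print(row_means_by_name)
```

Spelling the axis out by name tends to read more clearly than a bare 0 or 1, especially for readers who do not remember which integer maps to which direction.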

All of these methods accept skipna, a keyword that controls whether missing data is excluded. It defaults to True.

In [80]: df.sum(0, skipna=False)
Out[80]:
one           NaN
two      5.442353
three         NaN
dtype: float64

In [81]: df.sum(axis=1, skipna=True)
Out[81]:
a    3.167498
b    2.204786
c    3.401050
d   -0.333828
dtype: float64
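The effect of skipna is easiest to see on a single Series. This is a minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

total_skipping_nan = s.sum()               # skipna=True is the default
total_keeping_nan = s.sum(skipna=False)    # any NaN makes the result NaN
mean_skipping_nan = s.mean()               # averages only the two valid values

print(total_skipping_nan, total_keeping_nan, mean_skipping_nan)
```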

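Combined with broadcasting and arithmetic operations, these methods describe many statistical procedures very concisely. Standardization, i.e. rescaling data to zero mean and a standard deviation of 1, is one example: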

In [82]: ts_stand = (df - df.mean()) / df.std()

In [83]: ts_stand.std()
Out[83]:
one      1.0
two      1.0
three    1.0
dtype: float64

In [84]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [85]: xs_stand.std(1)
Out[85]:
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64
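The same two standardizations can be checked on deterministic data. The frame below is an assumption made for illustration; the point is that column-wise standardization relies on default broadcasting, while row-wise standardization needs sub() and div() with axis=0 so that the per-row statistics align on the row index.

```python
import numpy as np
import pandas as pd

# Illustrative data only.
df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0],
        "y": [10.0, 20.0, 40.0, 80.0],
        "z": [-1.0, 0.0, 1.0, 5.0],
    }
)

# Column-wise: df.mean() and df.std() are indexed by column name,
# so plain arithmetic broadcasts them across the rows.
col_standardized = (df - df.mean()) / df.std()

# Row-wise: the per-row statistics are indexed by row label,
# so they must be aligned explicitly with axis=0.
row_standardized = df.sub(df.mean(axis=1), axis=0).div(df.std(axis=1), axis=0)

print(col_standardized.std())
print(row_standardized.std(axis=1))
```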

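Note: methods such as cumsum() and cumprod() preserve the position of NaN values. This differs somewhat from expanding() and rolling(); see the corresponding section of the documentation for details.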

In [86]: df.cumsum()
Out[86]:
        one       two     three
a  1.394981  1.772517       NaN
b  1.738035  3.684640 -0.050390
c  2.433281  5.163008  1.177045
d       NaN  5.442353  0.563873
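The difference from expanding() can be seen side by side. In this sketch (made-up values), cumsum() leaves a NaN where the input has one, while expanding().sum() with its default settings reports the running total of the valid values seen so far:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0, 3.0])

running_total = s.cumsum()              # NaN stays at position 1
expanding_total = s.expanding().sum()   # position 1 carries the total so far

print(pd.DataFrame({"cumsum": running_total, "expanding": expanding_total}))
```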

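The table below summarizes the common functions. Each also accepts an optional level argument, which applies only when the object has a hierarchical Index.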

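Function	Description

count	Number of non-NA observations
sum	Sum of values
mean	Mean of values
mad	Mean absolute deviation
median	Arithmetic median of values
min	Minimum
max	Maximum
mode	Mode
abs	Absolute value
prod	Product of values
std	Bessel-corrected sample standard deviation
var	Unbiased variance
sem	Standard error of the mean
skew	Sample skewness (3rd moment)
kurt	Sample kurtosis (4th moment)
quantile	Sample quantile (value at %)
cumsum	Cumulative sum
cumprod	Cumulative product
cummax	Cumulative maximum
cummin	Cumulative minimum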

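Note: some NumPy functions, such as mean, std, and sum, exclude NA values by default when given a Series, but not when given the underlying array: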

In [87]: np.mean(df['one'])
Out[87]: 0.8110935116651192

In [88]: np.mean(df['one'].to_numpy())
Out[88]: nan
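When working on the raw array, NumPy's NaN-aware functions give the Series-like behavior back. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

mean_on_series = np.mean(s)                # skips the NaN, like s.mean()
mean_on_array = np.mean(s.to_numpy())      # plain ndarray: NaN propagates
nanmean_on_array = np.nanmean(s.to_numpy())  # explicit NaN-aware variant

print(mean_on_series, mean_on_array, nanmean_on_array)
```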

Series.nunique() returns the number of unique non-NA values in a Series:

In [89]: series = pd.Series(np.random.randn(500))

In [90]: series[20:500] = np.nan

In [91]: series[10:20] = 5

In [92]: series.nunique()
Out[92]: 11
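A deterministic variant makes the count easy to verify by hand. The values are made up; dropna=False is shown to contrast with the default, which ignores NaN.

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, 5.0, 1.5, np.nan, np.nan])

n_unique = s.nunique()                       # counts 5.0 and 1.5
n_unique_with_nan = s.nunique(dropna=False)  # NaN counted as one extra value
distinct_values = s.unique()                 # the values themselves, NaN included

print(n_unique, n_unique_with_nan, distinct_values)
```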

Summarizing data: describe

The describe() function computes a variety of summary statistics for a Series or for the columns of a DataFrame. NA values are excluded:

In [93]: series = pd.Series(np.random.randn(1000))

In [94]: series[::2] = np.nan

In [95]: series.describe()
Out[95]:
count    500.000000
mean      -0.021292
std        1.015906
min       -2.683763
25%       -0.699070
50%       -0.069718
75%        0.714483
max        3.160915
dtype: float64

In [96]: frame = pd.DataFrame(np.random.randn(1000, 5),
   ....:                      columns=['a', 'b', 'c', 'd', 'e'])
   ....:

In [97]: frame.iloc[::2] = np.nan

In [98]: frame.describe()
Out[98]:
                a           b           c           d           e
count  500.000000  500.000000  500.000000  500.000000  500.000000
mean     0.033387    0.030045   -0.043719   -0.051686    0.005979
std      1.017152    0.978743    1.025270    1.015988    1.006695
min     -3.000951   -2.637901   -3.303099   -3.159200   -3.188821
25%     -0.647623   -0.576449   -0.712369   -0.691338   -0.691115
50%      0.047578   -0.021499   -0.023888   -0.032652   -0.025363
75%      0.729907    0.775880    0.618896    0.670047    0.649748
max      2.740139    2.752332    3.004229    2.728702    3.240991

You can also choose which percentiles to include in the output:

In [99]: series.describe(percentiles=[.05, .25, .75, .95])
Out[99]:
count    500.000000
mean      -0.021292
std        1.015906
min       -2.683763
5%        -1.645423
25%       -0.699070
50%       -0.069718
75%        0.714483
95%        1.711409
max        3.160915
dtype: float64

By default, the median is always included.
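The percentile rows can be checked on data whose quantiles are easy to work out by hand. The sketch below uses the integers 1 to 100 (an assumption for illustration) and only inspects the percentiles it explicitly requests:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 101), dtype="float64")

# Request only the 5th and 95th percentiles instead of the default quartiles.
summary = s.describe(percentiles=[0.05, 0.95])

print(summary)
```

With linear interpolation over 100 evenly spaced values, the 5% row lands at 5.95, which is a quick sanity check that the labels map to the requested fractions.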

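For a non-numeric Series, describe() returns the number of non-NA values, the number of unique values, the most frequent value, and how often that value occurs: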

In [100]: s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [101]: s.describe()
Out[101]:
count     9
unique    4
top       a
freq      5
dtype: object

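Note: for a mixed-type DataFrame, describe() restricts the summary to the numeric columns only. If there are no numeric columns, it summarizes the categorical columns instead: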

In [102]: frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'], 'b': range(4)})

In [103]: frame.describe()
Out[103]:
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

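This behavior can be controlled with the include and exclude arguments, which take a list of data types to include or leave out. The special value 'all' is also accepted: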

In [104]: frame.describe(include=['object'])
Out[104]:
          a
count     4
unique    2
top     Yes
freq      2

In [105]: frame.describe(include=['number'])
Out[105]:
              b
count  4.000000
mean   1.500000
std    1.290994
min    0.000000
25%    0.750000
50%    1.500000
75%    2.250000
max    3.000000

In [106]: frame.describe(include='all')
Out[106]:
          a         b
count     4  4.000000
unique    2       NaN
top     Yes       NaN
freq      2       NaN
mean    NaN  1.500000
std     NaN  1.290994
min     NaN  0.000000
25%     NaN  0.750000
50%     NaN  1.500000
75%     NaN  2.250000
max     NaN  3.000000
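The session above demonstrates include; exclude works the same way in the other direction. The sketch below uses a slightly altered frame (three 'Yes' and one 'No', an assumption made so the most frequent value is unambiguous) and selects the text column by excluding numbers rather than by naming its dtype:

```python
import pandas as pd

# Illustrative frame; 'Yes' appears three times so the top value is not a tie.
frame = pd.DataFrame({"a": ["Yes", "Yes", "Yes", "No"], "b": range(4)})

non_numeric = frame.describe(exclude=["number"])  # summary of column 'a' only
numeric = frame.describe(include=["number"])      # summary of column 'b' only
everything = frame.describe(include="all")        # union of both sets of rows

print(non_numeric)
print(everything)
```

In the include='all' result, statistics that do not apply to a column (for example the mean of a text column) are filled with NaN.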

This feature relies on select_dtypes. Refer to its documentation for the inputs it accepts.

Index of min/max values

The idxmin() and idxmax() functions on Series and DataFrame compute the index labels of the minimum and maximum values:

In [107]: s1 = pd.Series(np.random.randn(5))

In [108]: s1
Out[108]:
0    1.118076
1   -0.352051
2   -1.242883
3   -1.277155
4   -0.641184
dtype: float64

In [109]: s1.idxmin(), s1.idxmax()
Out[109]: (3, 0)

In [110]: df1 = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

In [111]: df1
Out[111]:
          A         B         C
0 -0.327863 -0.946180 -0.137570
1 -0.186235 -0.257213 -0.486567
2 -0.507027 -0.871259 -0.111110
3  2.000339 -2.430505  0.089759
4 -0.321434 -0.033695  0.096271

In [112]: df1.idxmin(axis=0)
Out[112]:
A    2
B    3
C    1
dtype: int64

In [113]: df1.idxmax(axis=1)
Out[113]:
0    C
1    A
2    C
3    A
4    C
dtype: object

When several rows (or columns) match the minimum or maximum value, idxmin() and idxmax() return the first matching index label:

In [114]: df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [115]: df3
Out[115]:
     A
e  2.0
d  1.0
c  1.0
b  3.0
a  NaN

In [116]: df3['A'].idxmin()
Out[116]: 'd'

::: tip Note

idxmin and idxmax correspond to argmin and argmax in NumPy.

:::
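The correspondence with NumPy has two practical differences worth seeing: pandas returns a label while NumPy returns a position, and pandas skips NaN by default while plain argmin does not. The sketch reuses the values of df3 from the session above as a Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 1, 1, 3, np.nan], index=list("edcba"))

label_of_min = s.idxmin()   # first label holding the minimum, NaN skipped
label_of_max = s.idxmax()

values = s.to_numpy()
position_plain = np.argmin(values)         # NaN wins the comparison here
position_nan_aware = np.nanargmin(values)  # ignores NaN, returns a position

print(label_of_min, label_of_max, position_plain, position_nan_aware)
print(s.index[position_nan_aware])         # position mapped back to a label
```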

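Value counts (histogramming) and mode

The Series method value_counts(), which is also available as a top-level function, computes a histogram of a one-dimensional array of values. It can be applied to regular arrays as well: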

In [117]: data = np.random.randint(0, 7, size=50)

In [118]: data
Out[118]:
array([6, 6, 2, 3, 5, 3, 2, 5, 4, 5, 4, 3, 4, 5, 0, 2, 0, 4, 2, 0, 3, 2,
       2, 5, 6, 5, 3, 4, 6, 4, 3, 5, 6, 4, 3, 6, 2, 6, 6, 2, 3, 4, 2, 1,
       6, 2, 6, 1, 5, 4])

In [119]: s = pd.Series(data)

In [120]: s.value_counts()
Out[120]:
6    10
2    10
4     9
5     8
3     8
0     3
1     2
dtype: int64

In [121]: pd.value_counts(data)
Out[121]:
6    10
2    10
4     9
5     8
3     8
0     3
1     2
dtype: int64
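A small deterministic sketch of the Series method, with made-up values. It also shows normalize=True, which returns relative frequencies instead of counts. The example sticks to the method form and wraps plain arrays in a Series first; as far as I know, newer pandas releases deprecate the top-level pd.value_counts, so treat that as something to check against your installed version.

```python
import numpy as np
import pandas as pd

data = np.array([6, 6, 2, 3, 6, 2])

counts = pd.Series(data).value_counts()                # sorted by count, descending
shares = pd.Series(data).value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```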

Similarly, you can get the mode, i.e. the most frequently occurring value or values, of a Series or DataFrame:

In [122]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [123]: s5.mode()
Out[123]:
0    3
1    7
dtype: int64

In [124]: df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
   .....:                     "B": np.random.randint(-10, 15, size=50)})
   .....:

In [125]: df5.mode()
Out[125]:
     A   B
0  1.0  -9
1  NaN  10
2  NaN  13
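The NaN padding in the DataFrame result above comes from columns having different numbers of modes. A deterministic sketch (made-up values) makes that visible: column A has a single mode, column B has three tied values, so A is padded to the same length.

```python
import pandas as pd

s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])
series_modes = s5.mode()   # 3 and 7 both occur three times

df5 = pd.DataFrame({"A": [1, 1, 2], "B": [5, 6, 7]})
frame_modes = df5.mode()   # one row per mode; shorter columns padded with NaN

print(series_modes)
print(frame_modes)
```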

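Discretization and quantiling

Continuous values can be discretized with the cut() function (bins based on values) and the qcut() function (bins based on sample quantiles):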

In [126]: arr = np.random.randn(20)

In [127]: factor = pd.cut(arr, 4)

In [128]: factor
Out[128]:
[(-0.251, 0.464], (-0.968, -0.251], (0.464, 1.179], (-0.251, 0.464], (-0.968, -0.251], ..., (-0.251, 0.464], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251]]
Length: 20
Categories (4, interval[float64]): [(-0.968, -0.251] < (-0.251, 0.464] < (0.464, 1.179] <
                                    (1.179, 1.893]]

In [129]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [130]: factor
Out[130]:
[(0, 1], (-1, 0], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (-1, 0], (-1, 0]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
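With explicit bin edges, each bin is open on the left and closed on the right by default, as the interval notation above shows. The sketch below uses made-up values and the labels argument to give the bins readable names; the names themselves are arbitrary.

```python
import pandas as pd

values = [1, 7, 5, 4, 6, 3]

# Bins are (0, 3], (3, 6], (6, 9]; a value equal to an edge goes to the lower bin.
binned = pd.cut(values, bins=[0, 3, 6, 9], labels=["low", "mid", "high"])

print(list(binned))
print(binned.categories)
```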

qcut() computes sample quantiles. For example, the following code slices normally distributed data into equal-size quartiles:

In [131]: arr = np.random.randn(30)

In [132]: factor = pd.qcut(arr, [0, .25, .5, .75, 1])

In [133]: factor
Out[133]:
[(0.569, 1.184], (-2.278, -0.301], (-2.278, -0.301], (0.569, 1.184], (0.569, 1.184], ..., (-0.301, 0.569], (1.184, 2.346], (1.184, 2.346], (-0.301, 0.569], (-2.278, -0.301]]
Length: 30
Categories (4, interval[float64]): [(-2.278, -0.301] < (-0.301, 0.569] < (0.569, 1.184] <
                                    (1.184, 2.346]]

In [134]: pd.value_counts(factor)
Out[134]:
(1.184, 2.346]      8
(-2.278, -0.301]    8
(0.569, 1.184]      7
(-0.301, 0.569]     7
dtype: int64
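Because qcut() picks its edges from the data, every bin ends up with roughly the same number of observations. On 20 evenly spaced integers (an assumption for illustration) the split is exact:

```python
import numpy as np
import pandas as pd

arr = np.arange(1, 21)

quartiles = pd.qcut(arr, 4)                     # four quantile-based intervals
quartile_codes = pd.qcut(arr, 4, labels=False)  # integer bin numbers 0..3 instead

bin_sizes = pd.Series(quartiles).value_counts()

print(bin_sizes)
print(quartile_codes)
```

Passing an integer is shorthand for evenly spaced quantiles; the explicit list [0, .25, .5, .75, 1] used in the session above requests the same quartile edges.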

Infinite values can also be passed when defining the bins:

In [135]: arr = np.random.randn(20)

In [136]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [137]: factor
Out[137]:
[(-inf, 0.0], (0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (0.0, inf], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
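Infinite edges are a convenient way to build open-ended bins that can never miss a value. In this sketch (made-up values and label names), zero itself falls into the lower bin because bins are closed on the right:

```python
import numpy as np
import pandas as pd

values = [-2.5, 0.0, 0.1, 100.0]

signs = pd.cut(
    values,
    bins=[-np.inf, 0, np.inf],
    labels=["non-positive", "positive"],  # illustrative label names
)

print(list(signs))
```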
