The Three Musketeers of Python Data Analysis, Pandas (5): Statistical Computation and Descriptive Statistics

Published: 2023/12/10

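Articles in the Pandas series:

The Three Musketeers of Python Data Analysis, Pandas (1): Getting to know Pandas and its Series and DataFrame objects
The Three Musketeers of Python Data Analysis, Pandas (2): The Index object and indexing operations
The Three Musketeers of Python Data Analysis, Pandas (3): Arithmetic operations and handling missing values
The Three Musketeers of Python Data Analysis, Pandas (4): Function application, mapping, sorting, and hierarchical indexing
The Three Musketeers of Python Data Analysis, Pandas (5): Statistical computation and descriptive statistics
The Three Musketeers of Python Data Analysis, Pandas (6): GroupBy: splitting, applying, and combining data
The Three Musketeers of Python Data Analysis, Pandas (7): Merging datasets
The Three Musketeers of Python Data Analysis, Pandas (8): Reshaping data, handling duplicates, and replacing data
The Three Musketeers of Python Data Analysis, Pandas (9): Time series
The Three Musketeers of Python Data Analysis, Pandas (10): Reading and writing data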

The NumPy and Matplotlib series are also complete:

NumPy series: https://itrhx.blog.csdn.net/category_9780393.html
Matplotlib series: https://itrhx.blog.csdn.net/category_9780418.html

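Recommended learning resources and sites (the author contributed to some of the documentation translations):

NumPy Chinese site: https://www.numpy.org.cn/
Pandas Chinese site: https://www.pypandas.cn/
Matplotlib Chinese site: https://www.matplotlib.org.cn/
NumPy, Matplotlib and Pandas quick reference tables: https://github.com/TRHX/Python-quick-reference-table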

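Contents

- 【01x00】Statistical computation
  - 【01x01】sum(): sum
  - 【01x02】min(): minimum
  - 【01x03】max(): maximum
  - 【01x04】mean(): mean
  - 【01x05】idxmin(): index of the minimum
  - 【01x06】idxmax(): index of the maximum
- 【02x00】Descriptive statistics
- 【03x00】Common statistical methods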

This article was originally published on CSDN by TRHX. Blog home: https://itrhx.blog.csdn.net/ Article link: https://itrhx.blog.csdn.net/article/details/106788501 Reproduction without permission is prohibited.

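【01x00】Statistical computation

Pandas objects come with a set of common mathematical and statistical methods. Most of them are reductions or summary statistics: they extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the corresponding NumPy array methods, they have built-in handling for missing data: NaN values are skipped by default unless skipna=False is passed.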
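The difference in missing-data handling can be seen with a short sketch. The values below are illustrative and not taken from the article.

```python
import numpy as np
import pandas as pd

values = [1.5, np.nan, 2.5]

# A plain NumPy reduction propagates NaN into the result.
numpy_total = np.array(values).sum()

# The pandas reduction skips NaN by default (skipna=True).
pandas_total = pd.Series(values).sum()

# Passing skipna=False gives the NumPy-like behaviour back.
strict_total = pd.Series(values).sum(skipna=False)

print(numpy_total, pandas_total, strict_total)
```

Only the default pandas call produces a number here; the other two results are NaN.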

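【01x01】sum(): sum

The sum() method returns the sum of the values over the requested axis. It is equivalent to numpy.sum(), apart from the handling of missing values.

The basic syntax for Series and DataFrame is:

Series.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
DataFrame.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)

Official documentation: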

https://pandas.pydata.org/docs/reference/api/pandas.Series.sum.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html

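The commonly used parameters are:

Parameter	Description

axis	Axis to sum over: 0 or 'index', 1 or 'columns'; 1 or 'columns' is only available for a DataFrame
skipna	bool; whether to exclude missing values (NA/null) when summing; default True
level	If the axis is a MultiIndex (hierarchical), sum along the given level. This keyword was deprecated in pandas 1.3 and removed in 2.0, where groupby(level=...) replaces it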

Usage with a Series:

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([['warm', 'warm', 'cold', 'cold'], ['dog', 'falcon', 'fish', 'spider']], names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>>
>>> obj.sum()
14
>>>
>>> obj.sum(level='blooded')
blooded
warm    6
cold    8
Name: legs, dtype: int64
>>>
>>> obj.sum(level=0)
blooded
warm    6
cold    8
Name: legs, dtype: int64
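On pandas 2.0 and later the level keyword is no longer accepted by sum(), so the per-level totals have to go through groupby(). The following is a minimal sketch of that equivalent, reusing the same data; the variable names are illustrative.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['warm', 'warm', 'cold', 'cold'], ['dog', 'falcon', 'fish', 'spider']],
    names=['blooded', 'animal'],
)
legs = pd.Series([4, 2, 0, 8], name='legs', index=idx)

# sort=False keeps the groups in order of first appearance (warm, cold);
# the default sort=True would order the group keys alphabetically instead.
legs_by_blood = legs.groupby(level='blooded', sort=False).sum()
print(legs_by_blood)
```

The same pattern applies to min(), max() and mean(): replace the final .sum() with the reduction you need.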

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>>
>>> obj.sum()
one    9.25
two   -5.80
dtype: float64
>>>
>>> obj.sum(axis=1)
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
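Row c sums to 0.00 because every value in it is skipped as NaN, which leaves an empty sum. The sketch below shows how the min_count and skipna parameters from the signature change that result, using the same DataFrame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two'],
)

# Default: NaN is skipped, so the all-NaN row 'c' sums to 0.0.
row_sums = frame.sum(axis=1)

# min_count=1 requires at least one non-NaN value, so row 'c' becomes NaN.
row_sums_min1 = frame.sum(axis=1, min_count=1)

# skipna=False lets a single NaN turn the whole row sum into NaN.
row_sums_strict = frame.sum(axis=1, skipna=False)

print(row_sums_min1)
print(row_sums_strict)
```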

【01x02】min(): minimum

The min() method returns the minimum of the values over the requested axis.

The basic syntax for Series and DataFrame is:

Series.min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.min.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html

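The commonly used parameters are:

Parameter	Description

axis	Axis to take the minimum over: 0 or 'index', 1 or 'columns'; 1 or 'columns' is only available for a DataFrame
skipna	bool; whether to exclude missing values (NA/null) when taking the minimum; default True
level	If the axis is a MultiIndex (hierarchical), take the minimum along the given level. This keyword was deprecated in pandas 1.3 and removed in 2.0, where groupby(level=...) replaces it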

Usage with a Series:

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([['warm', 'warm', 'cold', 'cold'], ['dog', 'falcon', 'fish', 'spider']], names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>>
>>> obj.min()
0
>>>
>>> obj.min(level='blooded')
blooded
warm    2
cold    0
Name: legs, dtype: int64
>>>
>>> obj.min(level=0)
blooded
warm    2
cold    0
Name: legs, dtype: int64

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>>
>>> obj.min()
one    0.75
two   -4.50
dtype: float64
>>>
>>> obj.min(axis=1)
a    1.4
b   -4.5
c    NaN
d   -1.3
dtype: float64
>>>
>>> obj.min(axis='columns', skipna=False)
a    NaN
b   -4.5
c    NaN
d   -1.3
dtype: float64

【01x03】max(): maximum

The max() method returns the maximum of the values over the requested axis.

The basic syntax for Series and DataFrame is:

Series.max(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.max(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.max.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html

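The commonly used parameters are:

Parameter	Description

axis	Axis to take the maximum over: 0 or 'index', 1 or 'columns'; 1 or 'columns' is only available for a DataFrame
skipna	bool; whether to exclude missing values (NA/null) when taking the maximum; default True
level	If the axis is a MultiIndex (hierarchical), take the maximum along the given level. This keyword was deprecated in pandas 1.3 and removed in 2.0, where groupby(level=...) replaces it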

Usage with a Series:

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([['warm', 'warm', 'cold', 'cold'], ['dog', 'falcon', 'fish', 'spider']], names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>>
>>> obj.max()
8
>>>
>>> obj.max(level='blooded')
blooded
warm    4
cold    8
Name: legs, dtype: int64
>>>
>>> obj.max(level=0)
blooded
warm    4
cold    8
Name: legs, dtype: int64

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>>
>>> obj.max()
one    7.1
two   -1.3
dtype: float64
>>>
>>> obj.max(axis=1)
a    1.40
b    7.10
c     NaN
d    0.75
dtype: float64
>>>
>>> obj.max(axis='columns', skipna=False)
a     NaN
b    7.10
c     NaN
d    0.75
dtype: float64
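When both extremes are needed, min() and max() can be requested together through agg(). This is a small sketch on the same DataFrame, not something the article itself covers; the names extremes and spread are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two'],
)

# One row per statistic, one column per original column; NaN is skipped.
extremes = frame.agg(['min', 'max'])
print(extremes)

# The range of each column follows directly from the two rows.
spread = extremes.loc['max'] - extremes.loc['min']
print(spread)
```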

【01x04】mean(): mean

The mean() method returns the mean of the values over the requested axis.

The basic syntax for Series and DataFrame is:

Series.mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
DataFrame.mean(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs)

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.mean.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html

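The commonly used parameters are:

Parameter	Description

axis	Axis to take the mean over: 0 or 'index', 1 or 'columns'; 1 or 'columns' is only available for a DataFrame
skipna	bool; whether to exclude missing values (NA/null) when taking the mean; default True
level	If the axis is a MultiIndex (hierarchical), take the mean along the given level. This keyword was deprecated in pandas 1.3 and removed in 2.0, where groupby(level=...) replaces it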

Usage with a Series:

>>> import pandas as pd
>>> idx = pd.MultiIndex.from_arrays([['warm', 'warm', 'cold', 'cold'], ['dog', 'falcon', 'fish', 'spider']], names=['blooded', 'animal'])
>>> obj = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> obj
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>>
>>> obj.mean()
3.5
>>>
>>> obj.mean(level='blooded')
blooded
warm    3
cold    4
Name: legs, dtype: int64
>>>
>>> obj.mean(level=0)
blooded
warm    3
cold    4
Name: legs, dtype: int64

Usage with a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
>>> obj
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
>>>
>>> obj.mean()
one    3.083333
two   -2.900000
dtype: float64
>>>
>>> obj.mean(axis=1)
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64
>>>
>>> obj.mean(axis='columns', skipna=False)
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
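With skipna=True the skipped NaN values are left out of the divisor as well as the sum, which is why column one averages to 9.25 / 3 rather than 9.25 / 4. A quick sketch checking this on the same DataFrame (variable names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two'],
)

col_sum = frame.sum()      # NaN skipped
col_count = frame.count()  # number of non-NaN values per column

# Dividing the two reproduces what mean() returns by default.
manual_mean = col_sum / col_count
print(manual_mean)
```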

【01x05】idxmin(): index of the minimum

The idxmin() method returns the index label of the minimum value.

The basic syntax for Series and DataFrame is:

Series.idxmin(self, axis=0, skipna=True, *args, **kwargs)
DataFrame.idxmin(self, axis=0, skipna=True)

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.idxmin.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmin.html

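The commonly used parameters are:

Parameter	Description

axis	Axis to search: 0 or 'index', 1 or 'columns'; 1 or 'columns' is only available for a DataFrame
skipna	bool; whether to exclude missing values (NA/null); default True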

Usage with a Series:
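A minimal sketch of idxmin() on a Series follows. The data is illustrative and not taken from the article; it contains a NaN and a repeated minimum to show the default behaviour.

```python
import numpy as np
import pandas as pd

# Illustrative data: a NaN at 'b' and the minimum 1.0 at both 'c' and 'e'.
prices = pd.Series([3.0, np.nan, 1.0, 4.0, 1.0], index=['a', 'b', 'c', 'd', 'e'])

# NaN is skipped by default, and ties resolve to the first occurrence.
lowest_label = prices.idxmin()
print(lowest_label)
```

The result is the label 'c', not an integer position, so it can be passed straight to prices[...] or prices.loc[...].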

Usage with a DataFrame:
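A sketch of idxmin() on a DataFrame, assuming the same data as in the earlier sections. An all-NaN row has no minimum, and how pandas reports that case has changed between versions, so the row-wise call here drops such rows first.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two'],
)

# axis=0 (default): for each column, the row label holding the minimum.
col_labels = frame.idxmin()

# axis=1: for each row, the column label holding the minimum.
# Row 'c' is entirely NaN, so it is removed before the lookup.
row_labels = frame.dropna(how='all').idxmin(axis=1)

print(col_labels)
print(row_labels)
```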

【01x06】idxmax(): index of the maximum

The idxmax() method returns the index label of the maximum value.

The basic syntax for Series and DataFrame is:

Series.idxmax(self, axis=0, skipna=True, *args, **kwargs)
DataFrame.idxmax(self, axis=0, skipna=True)

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.idxmax.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html

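The commonly used parameters are:

Parameter	Description

axis	Axis to search: 0 or 'index', 1 or 'columns'; 1 or 'columns' is only available for a DataFrame
skipna	bool; whether to exclude missing values (NA/null); default True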

Usage with a Series:
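A minimal sketch of idxmax() on a Series, with argmax() alongside for contrast. The data is illustrative, and the argmax() behaviour shown assumes pandas 1.0 or later, where it returns an integer position.

```python
import pandas as pd

# Illustrative data, not taken from the article.
scores = pd.Series([3.0, 1.0, 4.0, 1.0], index=['a', 'b', 'c', 'd'])

highest_label = scores.idxmax()     # index label of the maximum
highest_position = scores.argmax()  # integer position of the maximum

print(highest_label, highest_position)
```

Use the label with scores.loc[...] and the position with scores.iloc[...].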

Usage with a DataFrame:
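A sketch of idxmax() on a DataFrame, assuming the same data as in the earlier sections. It also shows a common follow-up: using the returned label to pull out the whole row.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two'],
)

# For each column, the row label holding the maximum (NaN skipped).
col_labels = frame.idxmax()

# The label can be fed back into .loc to fetch the full row.
top_row = frame.loc[frame['one'].idxmax()]

print(col_labels)
print(top_row)
```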

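【02x00】Descriptive statistics

The describe() method produces a quick set of summary statistics: count, mean, standard deviation, minimum and maximum, quartiles, and so on. Its parameters control which data types are included in or excluded from the summary.

The basic syntax for Series and DataFrame is:

Series.describe(self: ~ FrameOrSeries, percentiles=None, include=None, exclude=None)
DataFrame.describe(self: ~ FrameOrSeries, percentiles=None, include=None, exclude=None)

Official documentation: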

https://pandas.pydata.org/docs/reference/api/pandas.Series.describe.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

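Parameter	Description

percentiles	Optional list of numbers; the percentiles to include in the output. All values should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th and 75th percentiles
include	Data types to include in the result, given as a list of dtypes; default None. See the official documentation for the accepted values
exclude	Data types to omit from the result, given as a list of dtypes; default None. See the official documentation for the accepted values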

Describing a numeric Series:

>>> import pandas as pd
>>> obj = pd.Series([1, 2, 3])
>>> obj
0    1
1    2
2    3
dtype: int64
>>>
>>> obj.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64
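The percentiles parameter replaces the default quartile rows. The sketch below asks for the 10th, 50th and 90th percentiles on the same Series; the choice of percentiles is illustrative.

```python
import pandas as pd

obj = pd.Series([1, 2, 3])

# Each requested percentile becomes one row, labelled as a percentage.
summary = obj.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

The count, mean, std, min and max rows are unaffected; only the percentile rows change.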

Describing categorical data:

>>> import pandas as pd
>>> obj = pd.Series(['a', 'a', 'b', 'c'])
>>> obj
0    a
1    a
2    b
3    c
dtype: object
>>>
>>> obj.describe()
count     4
unique    3
top       a
freq      2
dtype: object

Describing timestamps:

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.Series([np.datetime64("2000-01-01"), np.datetime64("2010-01-01"), np.datetime64("2010-01-01")])
>>> obj
0   2000-01-01
1   2010-01-01
2   2010-01-01
dtype: datetime64[ns]
>>>
>>> obj.describe()
count                       3
unique                      2
top       2010-01-01 00:00:00
freq                        2
first     2000-01-01 00:00:00
last      2010-01-01 00:00:00
dtype: object

Describing a DataFrame:

Showing every summary regardless of data type:

>>> import pandas as pd
>>> obj = pd.DataFrame({'categorical': pd.Categorical(['d', 'e', 'f']), 'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})
>>> obj
  categorical  numeric object
0           d        1      a
1           e        2      b
2           f        3      c
>>>
>>> obj.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      c
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Including only the category columns:
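A minimal sketch of restricting describe() by data type, assuming the same three-column DataFrame as in the include='all' example. The exclude call is added for contrast.

```python
import numpy as np
import pandas as pd

obj = pd.DataFrame({
    'categorical': pd.Categorical(['d', 'e', 'f']),
    'numeric': [1, 2, 3],
    'object': ['a', 'b', 'c'],
})

# Keep only columns whose dtype is category.
cat_summary = obj.describe(include=['category'])
print(cat_summary)

# The opposite direction: drop every numeric column from the summary.
non_numeric_summary = obj.describe(exclude=[np.number])
print(non_numeric_summary)
```

Both results contain the count, unique, top and freq rows, since no numeric column remains in either.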

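【03x00】Common statistical methods

Other commonly used statistical methods are listed below:

Method	Description	Official docs

count	Number of non-NA values	Series | DataFrame
describe	Summary statistics for a Series or for each DataFrame column	Series | DataFrame
min	Minimum value	Series | DataFrame
max	Maximum value	Series | DataFrame
argmin	Integer position at which the minimum value is found	Series
argmax	Integer position at which the maximum value is found	Series
idxmin	Index label at which the minimum value is found	Series | DataFrame
idxmax	Index label at which the maximum value is found	Series | DataFrame
quantile	Sample quantile (from 0 to 1)	Series | DataFrame
sum	Sum of the values	Series | DataFrame
mean	Mean of the values	Series | DataFrame
median	Arithmetic median (50% quantile) of the values	Series | DataFrame
mad	Mean absolute deviation from the mean (deprecated in pandas 1.5 and removed in 2.0)	Series | DataFrame
var	Sample variance of the values	Series | DataFrame
std	Sample standard deviation of the values	Series | DataFrame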
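A short sketch exercising several of the methods in the table on one illustrative Series (the data is not from the article). Because mad() is unavailable on pandas 2.0 and later, the mean absolute deviation is computed by hand here.

```python
import numpy as np
import pandas as pd

data = pd.Series([2.0, 4.0, 4.0, 5.0, np.nan])

n_valid = data.count()        # non-NA values only
middle = data.median()        # 50% quantile
lower_quartile = data.quantile(0.25)
variance = data.var()         # sample variance (divides by n - 1)
std_dev = data.std()          # square root of the sample variance

# Hand-written replacement for the removed mad(): mean absolute
# deviation around the mean, with NaN skipped at every step.
mean_abs_dev = (data - data.mean()).abs().mean()

print(n_valid, middle, lower_quartile, variance, std_dev, mean_abs_dev)
```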
