
The Python Data Analysis Trio: Pandas (Part 4): Function Application, Mapping, Sorting, and Hierarchical Indexing

Published: 2023/12/10


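Articles in the Pandas series:

The Python Data Analysis Trio: Pandas (Part 1): Introducing Pandas and Its Series and DataFrame Objects
The Python Data Analysis Trio: Pandas (Part 2): The Index Object and Indexing Operations
The Python Data Analysis Trio: Pandas (Part 3): Arithmetic and Handling Missing Values
The Python Data Analysis Trio: Pandas (Part 4): Function Application, Mapping, Sorting, and Hierarchical Indexing
The Python Data Analysis Trio: Pandas (Part 5): Statistical Computation and Descriptive Statistics
The Python Data Analysis Trio: Pandas (Part 6): GroupBy: Split, Apply, and Combine
The Python Data Analysis Trio: Pandas (Part 7): Merging Datasets
The Python Data Analysis Trio: Pandas (Part 8): Reshaping, Handling Duplicates, and Replacing Data
The Python Data Analysis Trio: Pandas (Part 9): Time Series
The Python Data Analysis Trio: Pandas (Part 10): Reading and Writing Data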

The NumPy and Matplotlib series are also complete:

NumPy series: https://itrhx.blog.csdn.net/category_9780393.html
Matplotlib series: https://itrhx.blog.csdn.net/category_9780418.html

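Recommended learning resources and sites (the author contributed to some of the documentation translations):

NumPy Chinese documentation site: https://www.numpy.org.cn/
Pandas Chinese documentation site: https://www.pypandas.cn/
Matplotlib Chinese documentation site: https://www.matplotlib.org.cn/
NumPy, Matplotlib, and Pandas cheat sheets: https://github.com/TRHX/Python-quick-reference-table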

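Contents

- 【01x00】Function Application and Mapping
- 【02x00】Sorting
  - 【02x01】sort_index(): Sorting by Index
  - 【02x02】sort_values(): Sorting by Value
  - 【02x03】rank(): Ranking Values
- 【03x00】Hierarchical Indexing
  - 【03x01】Introducing Hierarchical Indexes
  - 【03x02】The MultiIndex Object
  - 【03x03】Extracting Values
  - 【03x04】Swapping Levels and Sorting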

This article was originally published on CSDN by TRHX. Blog home: https://itrhx.blog.csdn.net/ Article link: https://itrhx.blog.csdn.net/article/details/106758103 Reproduction without authorization is prohibited.

【01x00】Function Application and Mapping

Pandas objects can be passed directly to NumPy ufuncs (element-wise array functions):

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame(np.random.randn(5,4) - 1)
>>> obj
          0         1         2         3
0 -0.228107  1.377709 -1.096528 -2.051001
1 -2.477144 -0.500013 -0.040695 -0.267452
2 -0.485999 -1.232930 -0.390701 -1.947984
3 -0.839161 -0.702802 -1.756359 -1.873149
4  0.853121 -1.540105  0.621614 -0.583360
>>>
>>> np.abs(obj)
          0         1         2         3
0  0.228107  1.377709  1.096528  2.051001
1  2.477144  0.500013  0.040695  0.267452
2  0.485999  1.232930  0.390701  1.947984
3  0.839161  0.702802  1.756359  1.873149
4  0.853121  1.540105  0.621614  0.583360
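Because the session above uses random numbers, its output changes on every run. As a reproducible sketch of the same idea (the frame `df` and its labels are made up for illustration), a ufunc applied to a DataFrame works element by element and keeps the row and column labels:

```python
import numpy as np
import pandas as pd

# Hypothetical fixed data so the result is reproducible.
df = pd.DataFrame([[-1.5, 2.0], [3.0, -4.5]],
                  index=["r1", "r2"], columns=["x", "y"])

abs_df = np.abs(df)        # element-wise ufunc; the result is still a DataFrame
sqrt_df = np.sqrt(abs_df)  # ufuncs can be chained, labels are preserved

print(abs_df)
print(sqrt_df)
```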

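Function mapping: the apply method applies a function along the columns or rows of a DataFrame. The axis parameter selects the direction; the default, axis=0, passes each column to the function, while axis=1 passes each row: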

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame(np.random.randn(5,4) - 1)
>>> obj
          0         1         2         3
0 -0.707028 -0.755552 -2.196480 -0.529676
1 -0.772668  0.127485 -2.015699 -0.283654
2  0.248200 -1.940189 -1.068028 -1.751737
3 -0.872904 -0.465371 -1.327951 -2.883160
4 -0.092664  0.258351 -1.010747 -2.313039
>>>
>>> obj.apply(lambda x : x.max())
0    0.248200
1    0.258351
2   -1.010747
3   -0.283654
dtype: float64
>>>
>>> obj.apply(lambda x : x.max(), axis=1)
0   -0.529676
1    0.127485
2    0.248200
3   -0.465371
4    0.258351
dtype: float64
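A small deterministic sketch of apply, assuming a made-up frame `df` and two illustrative helper functions (`value_range` and `min_max` are not part of pandas). It also shows that the applied function may return a Series, in which case the result is a DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [10, 2, 8]})

def value_range(s):
    """Spread between the largest and smallest value of a Series."""
    return s.max() - s.min()

col_range = df.apply(value_range)          # axis=0: one result per column
row_range = df.apply(value_range, axis=1)  # axis=1: one result per row

def min_max(s):
    """Return several statistics at once as a labelled Series."""
    return pd.Series({"min": s.min(), "max": s.max()})

summary = df.apply(min_max)  # rows "min"/"max", one column per input column

print(col_range)
print(row_range)
print(summary)
```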

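In addition, applymap applies a function to every individual element of a DataFrame. (In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which is used the same way.)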

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame(np.random.randn(5,4) - 1)
>>> obj
          0         1         2         3
0 -0.772463 -1.597008 -3.196100 -1.948486
1 -1.765108 -1.646421 -0.687175 -0.401782
2  0.275699 -3.115184 -1.429063 -1.075610
3 -0.251734 -0.448399 -3.077677 -0.294674
4 -1.495896 -1.689729 -0.560376 -1.808794
>>>
>>> obj.applymap(lambda x : '%.2f' % x)
       0      1      2      3
0  -0.77  -1.60  -3.20  -1.95
1  -1.77  -1.65  -0.69  -0.40
2   0.28  -3.12  -1.43  -1.08
3  -0.25  -0.45  -3.08  -0.29
4  -1.50  -1.69  -0.56  -1.81
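One way to get the same element-wise behavior without depending on the pandas version is to combine apply with Series.map, which is also element-wise. This is only a sketch; the frame `df` and the helper `two_decimals` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.234, -5.678], "b": [0.5, 2.0]})

def two_decimals(x):
    """Format a single number with two decimal places."""
    return f"{x:.2f}"

# apply hands each column (a Series) to the lambda;
# Series.map then calls two_decimals on every element of that column.
formatted = df.apply(lambda col: col.map(two_decimals))

print(formatted)
```

Note that the result holds strings rather than floats, just like the applymap example above.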

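【02x00】Sorting

【02x01】sort_index(): Sorting by Index

Sorting a dataset by some criterion is another important built-in operation. To sort by the row or column labels (in lexicographic order), use the sort_index method, which returns a new, sorted object.

The basic syntax for Series and DataFrame is as follows: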

Series.sort_index(self, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index: bool = False)
DataFrame.sort_index(self, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index: bool = False)

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_index.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html

The commonly used parameters are described below:

Parameter	Description

axis	Axis to sort along: 0 or 'index', 1 or 'columns'; 1 or 'columns' is only available for a DataFrame
ascending	True sorts in ascending order (default); False sorts in descending order
kind	Sorting algorithm: 'quicksort' (default), 'mergesort', or 'heapsort'; see numpy.sort() for details

Application to a Series (sorting by the index):

>>> import pandas as pd
>>> obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
>>> obj
d    0
a    1
b    2
c    3
dtype: int64
>>>
>>> obj.sort_index()
a    1
b    2
c    3
d    0
dtype: int64

Application to a DataFrame (sorting by either the row index or the column labels):

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
>>> obj
       d  a  b  c
three  0  1  2  3
one    4  5  6  7
>>>
>>> obj.sort_index()
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
>>>
>>> obj.sort_index(axis=1)
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
>>>
>>> obj.sort_index(axis=1, ascending=False)
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
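As noted earlier, sort_index returns a new object by default. The following sketch (with a made-up Series `s`) checks that the original is left untouched and shows the descending variant:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["c", "a", "b"])

sorted_s = s.sort_index()               # new object sorted by label
desc_s = s.sort_index(ascending=False)  # labels in reverse order

print(sorted_s)
print(desc_s)
print(s)  # the original keeps its order: c, a, b
```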

【02x02】sort_values(): Sorting by Value

The basic syntax for Series and DataFrame is as follows:

Series.sort_values(self, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False)
DataFrame.sort_values(self, by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False)

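Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_values.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

The commonly used parameters are described below:

Parameter	Description

by	Required for a DataFrame: the label or list of labels whose values are sorted (column labels when axis=0, row labels when axis=1); a Series has no such parameter
axis	Axis to sort along: 0 or 'index', 1 or 'columns'; 1 or 'columns' is only available for a DataFrame
ascending	True sorts in ascending order (default); False sorts in descending order
kind	Sorting algorithm: 'quicksort' (default), 'mergesort', or 'heapsort'; see numpy.sort() for details

Application to a Series: values are sorted, and any missing values are placed at the end of the Series by default: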

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.Series([4, 7, -3, 2])
>>> obj
0    4
1    7
2   -3
3    2
dtype: int64
>>>
>>> obj.sort_values()
2   -3
3    2
0    4
1    7
dtype: int64
>>>
>>> obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
>>> obj
0    4.0
1    NaN
2    7.0
3    NaN
4   -3.0
5    2.0
dtype: float64
>>>
>>> obj.sort_values()
4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
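Where the missing values end up is controlled by the na_position parameter that appears in the signature above. A short sketch with a made-up Series `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series([4, np.nan, 7, -3])

default_order = s.sort_values()                 # NaN last (na_position='last')
nan_first = s.sort_values(na_position="first")  # NaN moved to the front
desc = s.sort_values(ascending=False)           # NaN still last when descending

print(default_order)
print(nan_first)
print(desc)
```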

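Application to a DataFrame: you will often want to sort by the values in one or more columns. Pass the column name, or a list of names, to the by parameter of sort_values(). When several columns are given, rows are sorted by the first column; ties in the first column are broken by the second, and so on. With axis=1, by names a row label instead, and the columns are reordered according to the values in that row: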

>>> import pandas as pd
>>> obj = pd.DataFrame({'a': [4, 4, -3, 2], 'b': [0, 1, 0, 1], 'c': [6, 4, 1, 3]})
>>> obj
   a  b  c
0  4  0  6
1  4  1  4
2 -3  0  1
3  2  1  3
>>>
>>> obj.sort_values(by='c')
   a  b  c
2 -3  0  1
3  2  1  3
1  4  1  4
0  4  0  6
>>>
>>> obj.sort_values(by='c', ascending=False)
   a  b  c
0  4  0  6
1  4  1  4
3  2  1  3
2 -3  0  1
>>>
>>> obj.sort_values(by=['a', 'b'])
   a  b  c
2 -3  0  1
3  2  1  3
0  4  0  6
1  4  1  4

>>> import pandas as pd
>>> obj = pd.DataFrame({'a': [4, 4, -3, 2], 'b': [0, 1, 0, 1], 'c': [6, 4, 1, 3]}, index=['A', 'B', 'C', 'D'])
>>> obj
   a  b  c
A  4  0  6
B  4  1  4
C -3  0  1
D  2  1  3
>>>
>>> obj.sort_values(by='B', axis=1)
   b  a  c
A  0  4  6
B  1  4  4
C  0 -3  1
D  1  2  3
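When sorting by several columns, ascending may also be given as a list with one flag per column. The sketch below reuses the values from the example above and sorts by 'a' ascending while breaking ties on 'b' in descending order:

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 4, -3, 2],
                   "b": [0, 1, 0, 1],
                   "c": [6, 4, 1, 3]})

# 'a' ascending; rows with equal 'a' are ordered by 'b' descending.
mixed = df.sort_values(by=["a", "b"], ascending=[True, False])

print(mixed)
```

Compared with sort_values(by=['a', 'b']), only the two rows with a == 4 change places.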

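【02x03】rank(): Ranking Values

rank() returns an object with the same shape and labels as the original, in which each value is replaced by its rank, that is, its 1-based position in the sorted order.

The basic syntax for Series and DataFrame is as follows: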

Series.rank(self: ~FrameOrSeries, axis=0, method: str = 'average', numeric_only: Union[bool, NoneType] = None, na_option: str = 'keep', ascending: bool = True, pct: bool = False)
DataFrame.rank(self: ~FrameOrSeries, axis=0, method: str = 'average', numeric_only: Union[bool, NoneType] = None, na_option: str = 'keep', ascending: bool = True, pct: bool = False)

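Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.rank.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html

The commonly used parameters are described below:

Parameter	Description

axis	Axis to rank along: 0 or 'index', 1 or 'columns'; 1 or 'columns' is only available for a DataFrame
method	How tied values are ranked: 'average' (default) assigns the mean of the ranks in the tie group; 'min' assigns the lowest rank in the group; 'max' assigns the highest rank in the group; 'first' assigns ranks in order of appearance; 'dense' is like 'min', but the rank always increases by exactly 1 from one group to the next (see the example below)
ascending	True ranks in ascending order (default); False ranks in descending order

Application to a Series: each value is ranked, and tied values are handled according to method: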

>>> import pandas as pd
>>> obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
>>> obj
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64
>>>
>>> obj.rank()
0    6.5    # items 0 and 2 occupy ranks 6 and 7 in ascending order; the default takes their mean, 6.5
1    1.0
2    6.5
3    4.5    # items 3 and 6 occupy ranks 4 and 5 in ascending order; the default takes their mean, 4.5
4    3.0
5    2.0
6    4.5
dtype: float64
>>>
>>> obj.rank(method='first')
0    6.0    # items 0 and 2 occupy ranks 6 and 7; in order of appearance they get 6 and 7
1    1.0
2    7.0
3    4.0    # items 3 and 6 occupy ranks 4 and 5; in order of appearance they get 4 and 5
4    3.0
5    2.0
6    5.0
dtype: float64
>>>
>>> obj.rank(method='dense')
0    5.0    # items 0 and 2 share the lowest rank of their group, but dense ranks step by 1 between groups, so both get 5
1    1.0
2    5.0
3    4.0    # items 3 and 6 occupy ranks 4 and 5; both get the lowest rank, 4
4    3.0
5    2.0
6    4.0
dtype: float64
>>>
>>> obj.rank(method='min')
0    6.0    # items 0 and 2 occupy ranks 6 and 7; both get the lowest rank, 6
1    1.0
2    6.0
3    4.0    # items 3 and 6 occupy ranks 4 and 5; both get the lowest rank, 4
4    3.0
5    2.0
6    4.0
dtype: float64
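The remaining options from the signature can be explored the same way. The sketch below reuses the values from the session above and adds a second, made-up Series `with_nan` to show that missing values keep a NaN rank under the default na_option='keep':

```python
import numpy as np
import pandas as pd

s = pd.Series([7, -5, 7, 4, 2, 0, 4])

max_rank = s.rank(method="max")                    # ties get the highest rank of their group
desc_rank = s.rank(method="min", ascending=False)  # rank 1 goes to the largest value
pct_rank = s.rank(pct=True)                        # average ranks divided by the number of values

with_nan = pd.Series([3.0, np.nan, 1.0]).rank()    # NaN stays NaN

print(max_rank.tolist())
print(desc_rank.tolist())
print(pct_rank.tolist())
print(with_nan.tolist())
```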

For a DataFrame, the axis parameter selects whether values are ranked within each column (the default) or within each row:

>>> import pandas as pd
>>> obj = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1], 'c': [-2, 5, 8, -2.5]})
>>> obj
     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5
>>>
>>> obj.rank()
     b    a    c
0  3.0  1.5  2.0
1  4.0  3.5  3.0
2  1.0  1.5  4.0
3  2.0  3.5  1.0
>>>
>>> obj.rank(axis='columns')
     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0

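【03x00】Hierarchical Indexing

【03x01】Introducing Hierarchical Indexes

A Series with a hierarchical index can be created by passing a list of two sub-lists as the index: the first sub-list supplies the outer level and the second sub-list supplies the inner level.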
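A minimal sketch of such a Series, with made-up labels and values chosen so the output is reproducible:

```python
import pandas as pd

outer = ["a", "a", "b", "b"]  # outer level labels
inner = [0, 1, 0, 1]          # inner level labels

# A list of two equal-length lists produces a two-level index.
s = pd.Series([10, 20, 30, 40], index=[outer, inner])

print(s)
print(s.index.nlevels)  # number of index levels
```

When printed, repeated outer labels are shown only once per group, which makes the hierarchy easy to read.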

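【03x02】The MultiIndex Object

Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html

Printing the type of the index of a hierarchically indexed Series shows that it is a MultiIndex object. Its levels attribute lists the labels available at each level, and its codes attribute records, for every position, which label of each level is used there: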

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.Series(np.random.randn(12),index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> obj
a  0    0.035946
   1   -0.867215
   2   -0.053355
b  0   -0.986616
   1    0.026071
   2   -0.048394
c  0    0.251274
   1    0.217790
   2    1.137674
d  0   -1.245178
   1    1.234972
   2   -0.035624
dtype: float64
>>>
>>> type(obj.index)
<class 'pandas.core.indexes.multi.MultiIndex'>
>>>
>>> obj.index
MultiIndex([('a', 0),
            ('a', 1),
            ('a', 2),
            ('b', 0),
            ('b', 1),
            ('b', 2),
            ('c', 0),
            ('c', 1),
            ('c', 2),
            ('d', 0),
            ('d', 1),
            ('d', 2)],
           )
>>> obj.index.levels
FrozenList([['a', 'b', 'c', 'd'], [0, 1, 2]])
>>>
>>> obj.index.codes
FrozenList([[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])

A MultiIndex is commonly built from a list of arrays with the from_arrays() method:

>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
MultiIndex([(1,  'red'),
            (1, 'blue'),
            (2,  'red'),
            (2, 'blue')],
           names=['number', 'color'])

Other commonly used methods are listed below (see the official documentation for more):

Method	Description

from_arrays(arrays[, sortorder, names])	Convert arrays to a MultiIndex
from_tuples(tuples[, sortorder, names])	Convert a list of tuples to a MultiIndex
from_product(iterables[, sortorder, names])	Build a MultiIndex from the Cartesian product of several iterables
from_frame(df[, sortorder, names])	Convert a DataFrame to a MultiIndex
set_levels(self, levels[, level, inplace, …])	Set new levels on a MultiIndex
set_codes(self, codes[, level, inplace, …])	Set new codes on a MultiIndex
sortlevel(self[, level, ascending, …])	Sort by the given level
droplevel(self[, level])	Remove the specified level
swaplevel(self[, i, j])	Swap level i with level j, for example the outer and inner levels
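A few of the methods listed above can be sketched as follows. The data mirrors the from_arrays example, so both constructors here are expected to produce the same four entries:

```python
import pandas as pd

names = ["number", "color"]

# One tuple per index entry.
from_tuples = pd.MultiIndex.from_tuples(
    [(1, "red"), (1, "blue"), (2, "red"), (2, "blue")], names=names)

# Every combination of the two iterables, in order.
from_product = pd.MultiIndex.from_product(
    [[1, 2], ["red", "blue"]], names=names)

# Dropping a level of a two-level index leaves an ordinary Index.
numbers_only = from_product.droplevel("color")

print(from_tuples)
print(from_product)
print(numbers_only)
```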

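【03x03】Extracting Values

For an object with a multi-level index, passing a single key selects on the outer level and returns every inner-level entry that belongs to it. Passing two keys selects a single value: the first key refers to the outer level and the second to the inner level.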
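The selection rules just described can be sketched with a small made-up Series. The example uses .loc for label-based selection, and the last line is an extra illustration of selecting on the inner level only:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40],
              index=[["a", "a", "b", "b"], [0, 1, 0, 1]])

outer_a = s.loc["a"]     # one key: all inner entries under outer label 'a'
single = s.loc["a", 1]   # two keys: outer label, then inner label
inner_0 = s.loc[:, 0]    # all outer labels, inner label 0 only

print(outer_a)
print(single)
print(inner_0)
```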

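【03x04】Swapping Levels and Sorting

The swaplevel() method exchanges the outer and inner index levels; called on a Series, as in the example, it returns a new Series whose index levels are swapped. The sortlevel() method of a MultiIndex sorts by the outer level first and then by the inner level, in ascending order by default; set ascending=False to sort in descending order. It returns a tuple containing the sorted index and the integer positions that produce that order. For example: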

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.Series(np.random.randn(12),index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> obj
a  0   -0.110215
   1    0.193075
   2   -1.101706
b  0   -1.325743
   1    0.528418
   2   -0.127081
c  0   -0.733822
   1    1.665262
   2    0.127073
d  0    1.262022
   1   -1.170518
   2    0.966334
dtype: float64
>>>
>>> obj.swaplevel()
0  a   -0.110215
1  a    0.193075
2  a   -1.101706
0  b   -1.325743
1  b    0.528418
2  b   -0.127081
0  c   -0.733822
1  c    1.665262
2  c    0.127073
0  d    1.262022
1  d   -1.170518
2  d    0.966334
dtype: float64
>>>
>>> obj.swaplevel().index.sortlevel()
(MultiIndex([(0, 'a'),
             (0, 'b'),
             (0, 'c'),
             (0, 'd'),
             (1, 'a'),
             (1, 'b'),
             (1, 'c'),
             (1, 'd'),
             (2, 'a'),
             (2, 'b'),
             (2, 'c'),
             (2, 'd')],
            ), array([ 0,  3,  6,  9,  1,  4,  7, 10,  2,  5,  8, 11], dtype=int32))
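sortlevel() sorts only the index. To reorder the data together with its labels, one option is to combine swaplevel() with the sort_index() method from section 【02x01】, whose level parameter picks the level to sort by first. A sketch with a made-up, deliberately unsorted Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40],
              index=[["b", "b", "a", "a"], [1, 0, 1, 0]])

swapped = s.swaplevel()                    # numbers become the outer level
sorted_s = swapped.sort_index()            # sort by outer level, then inner
by_letter = swapped.sort_index(level=1)    # sort by the letter level first

print(sorted_s)
print(by_letter)
```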
