The Python Data Analysis Trio: Pandas (Part 8): Data Reshaping, Handling Duplicates, and Replacing Values

Published: 2023/12/10

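Pandas series:

The Python Data Analysis Trio: Pandas (Part 1): Introducing Pandas and Its Series and DataFrame Objects
The Python Data Analysis Trio: Pandas (Part 2): The Index Object and Indexing Operations
The Python Data Analysis Trio: Pandas (Part 3): Arithmetic Operations and Handling Missing Values
The Python Data Analysis Trio: Pandas (Part 4): Function Application, Mapping, Sorting, and Hierarchical Indexing
The Python Data Analysis Trio: Pandas (Part 5): Statistical Computation and Descriptive Statistics
The Python Data Analysis Trio: Pandas (Part 6): GroupBy: Split, Apply, and Combine
The Python Data Analysis Trio: Pandas (Part 7): Merging Datasets
The Python Data Analysis Trio: Pandas (Part 8): Data Reshaping, Handling Duplicates, and Replacing Values
The Python Data Analysis Trio: Pandas (Part 9): Time Series
The Python Data Analysis Trio: Pandas (Part 10): Reading and Writing Data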

The NumPy and Matplotlib series are also complete:

NumPy series: https://itrhx.blog.csdn.net/category_9780393.html
Matplotlib series: https://itrhx.blog.csdn.net/category_9780418.html

Recommended learning resources and sites (the author contributed to some of the documentation translations):

NumPy Chinese documentation: https://www.numpy.org.cn/
Pandas Chinese documentation: https://www.pypandas.cn/
Matplotlib Chinese documentation: https://www.matplotlib.org.cn/
NumPy, Matplotlib, and Pandas quick-reference tables: https://github.com/TRHX/Python-quick-reference-table

Contents

- 【01x00】Data Reshaping
  - 【01x01】stack
  - 【01x02】unstack
- 【02x00】Handling Duplicates
  - 【02x01】duplicated
  - 【02x02】drop_duplicates
- 【03x00】Replacing Values
  - 【03x01】replace
  - 【03x02】where
  - 【03x03】mask

This article was first published on CSDN by TRHX. Blog home: https://itrhx.blog.csdn.net/ Article link: https://itrhx.blog.csdn.net/article/details/106900748 Reposting without permission is prohibited.

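【01x00】Data Reshaping

pandas offers several basic operations for rearranging tabular data, known as reshape or pivot operations. Two methods do most of the work when reshaping a hierarchical index:

stack: pivots columns into rows;
unstack: pivots rows into columns.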
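
As a minimal sketch of how the two methods relate (the frame `df` below is a made-up example, not data from this article), stacking a frame and then unstacking the result recovers the original values:

```python
import pandas as pd

# Hypothetical example frame: two animals, two measurements.
df = pd.DataFrame([[0, 1], [2, 3]],
                  index=['cat', 'dog'],
                  columns=['weight', 'height'])

# stack: the column labels become the inner level of a two-level row index.
stacked = df.stack()

# unstack: the inner row level moves back out to the columns.
restored = stacked.unstack()

# Reindex defensively so the comparison does not depend on label order.
restored = restored.reindex(index=df.index, columns=df.columns)

print(stacked.index.nlevels)   # number of row-index levels after stacking
print(restored.equals(df))     # whether the round trip preserved the data
```

The reindex step is only a precaution: label ordering after a round trip can vary between pandas versions, so the comparison is made order-independent.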

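【01x01】stack

The stack method pivots a DataFrame's column labels into the innermost level of the row index.

Syntax (as of the pandas version used when this article was written; newer releases may differ): DataFrame.stack(self, level=-1, dropna=True)

Official documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.stack.html

Parameter	Description

level	Which column level(s) to pivot into the row index: a level number, a level name, or a list of them; default -1 (the innermost level)
dropna	bool; whether to drop rows of the reshaped result whose values are all NaN; default True

Single-level columns: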

>>> import pandas as pd
>>> obj = pd.DataFrame([[0, 1], [2, 3]], index=['cat', 'dog'], columns=['weight', 'height'])
>>> obj
     weight  height
cat       0       1
dog       2       3
>>>
>>> obj.stack()
cat  weight    0
     height    1
dog  weight    2
     height    3
dtype: int64

Multi-level columns:

>>> import pandas as pd
>>> multicol = pd.MultiIndex.from_tuples([('weight', 'kg'), ('weight', 'pounds')])
>>> obj = pd.DataFrame([[1, 2], [2, 4]], index=['cat', 'dog'], columns=multicol)
>>> obj
    weight
        kg pounds
cat      1      2
dog      2      4
>>>
>>> obj.stack()
            weight
cat kg           1
    pounds       2
dog kg           2
    pounds       4

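Missing values:

The level parameter selects which column level(s) to stack. When the levels do not share the same labels, the combinations that do not exist in the original frame show up as NaN in the result: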

>>> import pandas as pd
>>> multicol = pd.MultiIndex.from_tuples([('weight', 'kg'), ('height', 'm')])
>>> obj = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=['cat', 'dog'], columns=multicol)
>>> obj
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>>
>>> obj.stack(level=0)
             kg    m
cat height  NaN  2.0
    weight  1.0  NaN
dog height  NaN  4.0
    weight  3.0  NaN
>>>
>>> obj.stack(level=1)
        height  weight
cat kg     NaN     1.0
    m      2.0     NaN
dog kg     NaN     3.0
    m      4.0     NaN
>>>
>>> obj.stack(level=[0, 1])
cat  height  m     2.0
     weight  kg    1.0
dog  height  m     4.0
     weight  kg    3.0
dtype: float64
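
Levels can also be selected by name rather than by position, which tends to read more clearly. The sketch below assumes a made-up frame whose column levels are named `measure` and `stat` (both names are illustrative, not from the article), and every combination of labels exists, so no NaN is introduced:

```python
import pandas as pd

# Hypothetical two-level columns; the level names are chosen for illustration.
cols = pd.MultiIndex.from_product([['weight', 'height'], ['min', 'max']],
                                  names=['measure', 'stat'])
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  index=['cat', 'dog'], columns=cols)

# Stack the inner level by name: 'stat' moves into the row index,
# leaving 'measure' as the only column level.
by_stat = df.stack(level='stat')

# Stack the outer level by name instead: 'measure' moves into the row index.
by_measure = df.stack(level='measure')

print(by_stat.loc[('cat', 'max'), 'weight'])    # cat's maximum weight
print(by_measure.loc[('dog', 'height'), 'min']) # dog's minimum height
```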

If every value in a row of the reshaped result is NaN, that row is dropped by default. Pass dropna=False to keep it:

>>> import pandas as pd
>>> multicol = pd.MultiIndex.from_tuples([('weight', 'kg'), ('height', 'm')])
>>> obj = pd.DataFrame([[None, 1.0], [2.0, 3.0]], index=['cat', 'dog'], columns=multicol)
>>> obj
    weight height
        kg      m
cat    NaN    1.0
dog    2.0    3.0
>>>
>>> obj.stack(dropna=False)
        height  weight
cat kg     NaN     NaN
    m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN
>>>
>>> obj.stack(dropna=True)
        height  weight
cat m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN

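【01x02】unstack

The unstack method pivots a level of the row index into the columns.

Syntax:

Series.unstack(self, level=-1, fill_value=None)
DataFrame.unstack(self, level=-1, fill_value=None)

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.unstack.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html

Parameter	Description

level	Which row-index level(s) to pivot into the columns; default -1 (the innermost level)
fill_value	Value used in place of the NaN that unstacking would otherwise introduce

Applied to a Series: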

>>> import pandas as pd
>>> obj = pd.Series([1, 2, 3, 4], index=pd.MultiIndex.from_product([['one', 'two'], ['a', 'b']]))
>>> obj
one  a    1
     b    2
two  a    3
     b    4
dtype: int64
>>>
>>> obj.unstack()
     a  b
one  1  2
two  3  4
>>>
>>> obj.unstack(level=0)
   one  two
a    1    3
b    2    4

As with stack, any combination of labels that does not exist in the original data appears as a missing value (NaN) in the result:

>>> import pandas as pd
>>> obj1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
>>> obj2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
>>> obj3 = pd.concat([obj1, obj2], keys=['one', 'two'])
>>> obj3
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
>>>
>>> obj3.unstack()
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0
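
The fill_value parameter described above can plug those gaps at reshape time. A minimal sketch, assuming a made-up Series in which the ('two', 'b') entry is deliberately missing:

```python
import pandas as pd

# Hypothetical data: the combination ('two', 'b') does not exist.
idx = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'), ('two', 'a')])
s = pd.Series([1, 2, 3], index=idx)

# Default behaviour: the missing cell becomes NaN.
with_nan = s.unstack()

# With fill_value, the missing cell is filled with 0 instead.
filled = s.unstack(fill_value=0)

print(with_nan)
print(filled)
```

Whether 0 is a sensible fill depends on the data; for counts it usually is, for measurements it may hide the fact that a value was never recorded.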

Applied to a DataFrame:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame(np.arange(6).reshape((2, 3)),
...                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
...                    columns=pd.Index(['one', 'two', 'three'], name='number'))
>>> obj
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5
>>>
>>> obj2 = obj.stack()
>>> obj2
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32
>>>
>>> obj3 = pd.DataFrame({'left': obj2, 'right': obj2 + 5},
...                     columns=pd.Index(['left', 'right'], name='side'))
>>> obj3
side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10
>>>
>>> obj3.unstack('state')
side   left          right
state  Ohio Colorado  Ohio Colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10
>>>
>>> obj3.unstack('state').stack('side')
state         Colorado  Ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7

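【02x00】Handling Duplicates

duplicated: flags whether each value is a duplicate;
drop_duplicates: removes duplicates.

【02x01】duplicated

The duplicated method returns a boolean Series that flags which values (or rows) are duplicates.

Syntax:

Series.duplicated(self, keep='first')
DataFrame.duplicated(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first') → 'Series'

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.duplicated.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html

Parameter	Description

keep	How duplicates are marked; default 'first'
	'first': unique values and the first occurrence of each duplicate are marked False; later occurrences are marked True
	'last': unique values and the last occurrence of each duplicate are marked False; earlier occurrences are marked True
	False: every occurrence of a duplicated value is marked True; unique values are marked False
subset	A column label or sequence of labels (DataFrame only). Only these columns are compared when looking for duplicates; by default all columns are compared

By default (keep='first'), the first occurrence in each group of duplicated values is marked False, every later occurrence is marked True, and unique values are marked False: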

>>> import pandas as pd
>>> obj = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> obj
0      lama
1       cow
2      lama
3    beetle
4      lama
dtype: object
>>>
>>> obj.duplicated()
0    False
1    False
2     True
3    False
4     True
dtype: bool
>>>
>>> obj.duplicated(keep='first')
0    False
1    False
2     True
3    False
4     True
dtype: bool

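With keep='last', unique values and the last occurrence in each group of duplicates are marked False, and the earlier occurrences are marked True. With keep=False, every occurrence of a duplicated value is marked True and only unique values are marked False: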

>>> import pandas as pd
>>> obj = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> obj
0      lama
1       cow
2      lama
3    beetle
4      lama
dtype: object
>>>
>>> obj.duplicated(keep='last')
0     True
1    False
2     True
3    False
4    False
dtype: bool
>>>
>>> obj.duplicated(keep=False)
0     True
1    False
2     True
3    False
4     True
dtype: bool

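On a DataFrame, the subset parameter restricts the comparison to the given column(s); by default whole rows are compared. The data2 column below is random, so the values will differ from run to run: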

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
...                     'data2': np.random.randint(0, 4, 8)})
>>> obj
  data1  data2
0     a      0
1     a      0
2     a      0
3     a      3
4     b      3
5     b      3
6     b      0
7     b      2
>>>
>>> obj.duplicated()
0    False
1     True
2     True
3    False
4    False
5     True
6    False
7    False
dtype: bool
>>>
>>> obj.duplicated(subset='data1')
0    False
1     True
2     True
3     True
4    False
5     True
6     True
7     True
dtype: bool
>>>
>>> obj.duplicated(subset='data2', keep='last')
0     True
1     True
2     True
3     True
4     True
5    False
6    False
7    False
dtype: bool
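
Because the example above relies on random data, here is a deterministic sketch of the same idea. The frame and its column names (`name`, `city`) are made up for illustration; it also shows that subset accepts a list of columns:

```python
import pandas as pd

# Hypothetical records; the first two rows are identical.
df = pd.DataFrame({
    'name': ['ann', 'ann', 'bob', 'bob', 'cy'],
    'city': ['NY', 'NY', 'LA', 'SF', 'LA'],
})

# Compare whole rows: only the second ('ann', 'NY') row repeats an earlier one.
full_row = df.duplicated()

# Compare the 'name' column only: the second 'ann' and second 'bob' are flagged.
by_name = df.duplicated(subset='name')

# Compare both columns and flag every member of a duplicate group.
by_both = df.duplicated(subset=['name', 'city'], keep=False)

print(full_row.tolist())
print(by_name.tolist())
print(by_both.tolist())
```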

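【02x02】drop_duplicates

The drop_duplicates method returns the Series or DataFrame with duplicate values (or rows) removed.

Syntax:

Series.drop_duplicates(self, keep='first', inplace=False)
DataFrame.drop_duplicates(self, subset: Union[Hashable, Sequence[Hashable], NoneType] = None, keep: Union[str, bool] = 'first', inplace: bool = False, ignore_index: bool = False) → Union[ForwardRef('DataFrame'), NoneType]

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.drop_duplicates.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html

Parameter	Description

keep	Which occurrences to keep; default 'first'
	'first': keep unique values and the first occurrence of each duplicate; drop later occurrences
	'last': keep unique values and the last occurrence of each duplicate; drop earlier occurrences
	False: drop every occurrence of a duplicated value; keep only unique values
inplace	bool; default False, which returns a new object. If True, the original object is modified directly and nothing is returned
subset	A column label or sequence of labels (DataFrame only in the signatures above). Only these columns are compared when looking for duplicates; by default all columns are compared
ignore_index	bool (DataFrame only in the signatures above); default False. If True, the original index labels are discarded and the result is relabeled 0, 1, 2, …, n-1

Using the keep parameter: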

>>> import pandas as pd
>>> obj = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'], name='animal')
>>> obj
0      lama
1       cow
2      lama
3    beetle
4      lama
5     hippo
Name: animal, dtype: object
>>>
>>> obj.drop_duplicates()
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
>>>
>>> obj.drop_duplicates(keep='last')
1       cow
3    beetle
4      lama
5     hippo
Name: animal, dtype: object
>>>
>>> obj.drop_duplicates(keep=False)
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
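
One way to see how drop_duplicates relates to duplicated: dropping duplicates gives the same result as keeping the entries that duplicated marks False. The sketch below checks this for keep='last' on the same animal data; it illustrates the relationship and makes no claim about how pandas implements it internally:

```python
import pandas as pd

s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'], name='animal')

# Keep the entries that are NOT flagged as duplicates...
via_mask = s[~s.duplicated(keep='last')]

# ...which matches what drop_duplicates returns for the same keep setting.
via_method = s.drop_duplicates(keep='last')

print(via_mask.equals(via_method))
```

The boolean-mask form is handy when the flags are needed for something else as well, for example to inspect the rows that would be dropped.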

With inplace=True nothing is returned (the call evaluates to None), but the original object itself is modified:

>>> import pandas as pd
>>> obj1 = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'], name='animal')
>>> obj1
0      lama
1       cow
2      lama
3    beetle
4      lama
5     hippo
Name: animal, dtype: object
>>>
>>> obj2 = obj1.drop_duplicates()
>>> obj2    # a new object is returned
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
>>>
>>> obj3 = obj1.drop_duplicates(inplace=True)
>>> obj3    # nothing is returned (None)
>>>
>>> obj1    # the original object has been modified
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object

Applied to a DataFrame (data2 is random, so the values will differ from run to run):

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
...                     'data2': np.random.randint(0, 4, 8)})
>>> obj
  data1  data2
0     a      2
1     a      1
2     a      1
3     a      2
4     b      1
5     b      2
6     b      0
7     b      0
>>>
>>> obj.drop_duplicates()
  data1  data2
0     a      2
1     a      1
4     b      1
5     b      2
6     b      0
>>>
>>> obj.drop_duplicates(subset='data2')
  data1  data2
0     a      2
1     a      1
6     b      0
>>>
>>> obj.drop_duplicates(subset='data2', ignore_index=True)
  data1  data2
0     a      2
1     a      1
2     b      0
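
A common use of these parameters together is keeping only the most recent record per key. The sketch below assumes a made-up frame whose rows are already in chronological order; the column names `user` and `score` are illustrative:

```python
import pandas as pd

# Hypothetical log, oldest row first; u1 and u2 each appear twice.
df = pd.DataFrame({
    'user': ['u1', 'u2', 'u1', 'u3', 'u2'],
    'score': [10, 20, 30, 40, 50],
})

# Compare on 'user' only, keep the last occurrence of each user,
# and relabel the surviving rows 0..n-1.
latest = df.drop_duplicates(subset='user', keep='last', ignore_index=True)

print(latest)
```

This relies on row order standing in for time. If the rows are not already ordered, sort them first on whatever column defines "latest".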

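【03x00】Replacing Values

【03x01】replace

The replace method substitutes values based on their content.

Syntax:

Series.replace(self, to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
DataFrame.replace(self, to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

Commonly used parameters:

Parameter	Description

to_replace	How to find the values to replace. It can be a string, regular expression, list, dict, integer, float, Series, or None; see the official documentation for how each form behaves
value	The value that replaces each match. For a DataFrame, a dict can specify a different value per column; regular expressions, strings, and lists or dicts of such objects are also allowed
inplace	bool; whether to modify the original data directly and return nothing; default False
regex	bool, or the same types as to_replace. Set regex=True when to_replace is a regular expression, or pass the pattern through this parameter instead of to_replace

Passing a single value for both to_replace and value replaces one value with another: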

>>> import pandas as pd
>>> obj = pd.Series([0, 1, 2, 3, 4])
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>>
>>> obj.replace(0, 5)
0    5
1    1
2    2
3    3
4    4
dtype: int64
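
A typical application of single-value replacement is turning a sentinel into a proper missing value. The sketch below assumes, purely for illustration, that -999 was used to mean "no reading":

```python
import numpy as np
import pandas as pd

# Hypothetical readings where -999 stands for a missing measurement.
readings = pd.Series([1, -999, 3])

# Swap the sentinel for NaN so that pandas treats it as missing.
cleaned = readings.replace(-999, np.nan)

print(cleaned)
print(cleaned.mean())  # NaN is skipped by default in reductions such as mean
```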

Passing a list for to_replace and a single value for value replaces several values with one:

>>> import pandas as pd
>>> obj = pd.Series([0, 1, 2, 3, 4])
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>>
>>> obj.replace([0, 1, 2, 3], 4)
0    4
1    4
2    4
3    4
4    4
dtype: int64

Passing lists of equal length for both to_replace and value replaces each value with its counterpart at the same position:

>>> import pandas as pd
>>> obj = pd.Series([0, 1, 2, 3, 4])
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>>
>>> obj.replace([0, 1, 2, 3], [4, 3, 2, 1])
0    4
1    3
2    2
3    1
4    4
dtype: int64

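Passing a dict for to_replace. A flat dict {old: new} maps values anywhere in the frame; a dict {column: old} combined with a separate value replaces the given value in each named column; a nested dict {column: {old: new}} limits the mapping to that column: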

>>> import pandas as pd
>>> obj = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                     'B': [5, 6, 7, 8, 9],
...                     'C': ['a', 'b', 'c', 'd', 'e']})
>>> obj
   A  B  C
0  0  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
>>>
>>> obj.replace(0, 5)
   A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e
>>>
>>> obj.replace({0: 10, 1: 100})
     A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>>
>>> obj.replace({'A': 0, 'B': 5}, 100)
     A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>>
>>> obj.replace({'A': {0: 100, 4: 400}})
     A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

Passing a regular expression for to_replace. The pattern ^ba.$ matches 'bat' and 'bar' but not 'bait':

>>> import pandas as pd
>>> obj = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                     'B': ['abc', 'bar', 'xyz']})
>>> obj
      A    B
0   bat  abc
1   foo  bar
2  bait  xyz
>>>
>>> obj.replace(to_replace=r'^ba.$', value='new', regex=True)
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>>
>>> obj.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
      A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>>
>>> obj.replace(regex=r'^ba.$', value='new')
      A    B
0   new  abc
1   foo  new
2  bait  xyz
>>>
>>> obj.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
      A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>>
>>> obj.replace(regex=[r'^ba.$', 'foo'], value='new')
      A    B
0   new  abc
1   new  new
2  bait  xyz
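
One detail that the anchored pattern ^ba.$ hides: without regex=True, replace compares whole values, so a string is only replaced when the entire element equals to_replace. With regex=True and an unanchored pattern, the matching part of each string is substituted. A small sketch on made-up data:

```python
import pandas as pd

s = pd.Series(['bat', 'foo', 'bait'])

# Literal mode: no element is exactly 'ba', so nothing changes.
literal = s.replace('ba', 'X')

# Regex mode: the 'ba' portion of each matching string is substituted.
pattern = s.replace('ba', 'X', regex=True)

print(literal.tolist())
print(pattern.tolist())
```

Anchoring the pattern with ^ and $, as in the examples above, is what makes a regex replacement swap whole values rather than fragments.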

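【03x02】where

The where method keeps values where a condition is True and replaces values where it is False.

Syntax:

Series.where(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)
DataFrame.where(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.where.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html

Commonly used parameters:

Parameter	Description

cond	The condition. Where cond is True the original value is kept; where it is False the value is replaced by the corresponding value from other
other	The replacement. Where cond is False, the value is taken from this parameter; default NaN
inplace	bool; whether to modify the original data directly and return nothing; default False

Applied to a Series: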

>>> import pandas as pd
>>> obj = pd.Series(range(5))
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>>
>>> obj.where(obj > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>>
>>> obj.where(obj > 1, 10)
0    10
1    10
2     2
3     3
4     4
dtype: int64
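
The outputs above also show a dtype effect worth noting: when other is left at its default, the replaced entries become NaN and an integer Series is converted to float64, whereas supplying an integer replacement keeps the integer values. A minimal sketch on made-up data that clamps negative numbers to zero:

```python
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2])

# Keep values that satisfy the condition; replace the rest with 0.
non_negative = s.where(s >= 0, 0)

# Without `other`, the replaced entries become NaN and the dtype becomes float64.
as_nan = s.where(s >= 0)

print(non_negative.tolist())
print(as_nan.dtype)
```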

Applied to a DataFrame (DataFrame.where(m, other) selects element-wise in the same way as np.where(m, df, other)):

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> obj
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>>
>>> m = obj % 3 == 0
>>> obj.where(m, -obj)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>>
>>> obj.where(m, -obj) == np.where(m, obj, -obj)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

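【03x03】mask

The mask method is the inverse of where: it replaces values where the condition is True and keeps values where it is False.

Syntax:

Series.mask(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)
DataFrame.mask(self, cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.mask.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html

Commonly used parameters:

Parameter	Description

cond	The condition. Where cond is False the original value is kept; where it is True the value is replaced by the corresponding value from other
other	The replacement. Where cond is True, the value is taken from this parameter; default NaN
inplace	bool; whether to modify the original data directly and return nothing; default False

Applied to a Series: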

>>> import pandas as pd
>>> obj = pd.Series(range(5))
>>> obj
0    0
1    1
2    2
3    3
4    4
dtype: int64
>>>
>>> obj.mask(obj > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>>
>>> obj.mask(obj > 1, 10)
0     0
1     1
2    10
3    10
4    10
dtype: int64
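
Since mask and where differ only in which side of the condition they replace, masking with a condition should give the same result as calling where with the negated condition. A minimal sketch on made-up data (the replacement value -1 is arbitrary):

```python
import pandas as pd

s = pd.Series(range(5))
cond = s > 2

# mask replaces the entries where the condition holds...
masked = s.mask(cond, -1)

# ...and where with the negated condition replaces exactly the same entries.
via_where = s.where(~cond, -1)

print(masked.tolist())
print(masked.equals(via_where))
```

Which of the two to use is mostly a readability choice: pick the one that lets the condition be written without a negation.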

Applied to a DataFrame (the last comparison confirms that where(m, ...) and mask(~m, ...) agree):

>>> import numpy as np
>>> import pandas as pd
>>> obj = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> obj
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>>
>>> m = obj % 3 == 0
>>>
>>> obj.mask(m, -obj)
   A  B
0  0  1
1  2 -3
2  4  5
3 -6  7
4  8 -9
>>>
>>> obj.where(m, -obj) == obj.mask(~m, -obj)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True