
[Python Basics] How to Process Text Data with Pandas

Published: 2025/3/8

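Author: 耿遠昊, Datawhale member, East China Normal University

Text data refers to any characters that cannot take part in arithmetic operations, also known as character data: English letters, Chinese characters, digits that are not used as numeric values (entered with a leading single quote), and any other typeable characters. Text data is high-dimensional, large in volume, and semantically complex, which makes it a relatively difficult data type to work with. In this article we look at how to process text data with Pandas.

Contents

1. Properties of the string type
    1.1 Differences between string and object
    1.2 Converting to the string type
2. Splitting and concatenation
    2.1 The str.split method
    2.2 The str.cat method
3. Replacement
    3.1 Common usage of str.replace
    3.2 Sub-group and function replacement
    3.3 Notes on str.replace
4. Substring matching and extraction
    4.1 The str.extract method
    4.2 The str.extractall method
    4.3 str.contains and str.match
5. Common string methods
    5.1 Filtering methods
    5.2 The isnumeric method
6. Questions and exercises
    6.1 Questions
    6.2 Exercises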

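1. Properties of the string type

1.1 Differences between string and object

The string type differs from object in three ways:

① String accessor methods (such as str.count) return the corresponding nullable type on string, whereas on object the return type changes depending on whether missing values are present;

② Some Series.str methods cannot be used on string, for example Series.str.decode(), because the stored data are strings rather than bytes;

③ When missing values are stored or take part in operations, the string type represents them as pd.NA rather than the float np.nan.

Everything else behaves identically in the current version, but in keeping with the direction Pandas is taking, we use string for all string operations in this article.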
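The first and third differences can be checked directly. The following is a minimal sketch on made-up sample values (the variable names are illustrative, not from the original article):

```python
import numpy as np
import pandas as pd

# Hypothetical sample values with one missing entry.
values = ['aa', 'b', np.nan]
s_string = pd.Series(values, dtype='string')
s_object = pd.Series(values, dtype='object')

# (1) On string, str.count yields the nullable Int64 type.
count_string = s_string.str.count('a')
# On object, the result type depends on whether a missing value is present.
count_object = s_object.str.count('a')
count_object_no_na = s_object.dropna().str.count('a')
print(count_string.dtype, count_object.dtype, count_object_no_na.dtype)
# Int64 float64 int64

# (3) The missing entry of the string Series is pd.NA.
print(s_string.iloc[2] is pd.NA)  # True
```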

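1.2 Converting to the string type

First, import the packages we need:

import pandas as pd
import numpy as np

Converting a container of another type directly to string may raise an error. This is the behavior of pandas 1.0, where the string type was introduced; pandas 1.1 and later accept the direct conversion.

#pd.Series([1,'1.']).astype('string') # raises an error
#pd.Series([1,2]).astype('string') # raises an error
#pd.Series([True,False]).astype('string') # raises an error

The reliable approach for now is to convert in two steps: first to an object Series of str, then to string:

pd.Series([1,'1.']).astype('str').astype('string')

0     1
1    1.
dtype: string

pd.Series([1,2]).astype('str').astype('string')

0    1
1    2
dtype: string

pd.Series([True,False]).astype('str').astype('string')

0     True
1    False
dtype: string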

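2. Splitting and concatenation

2.1 The str.split method

(a) Separators and positional selection with str

s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")
s

0    a_b_c
1    c_d_e
2     <NA>
3    f_g_h
dtype: string

Split on a given separator; the default is whitespace:

s.str.split('_')

0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

Note that the result of split has type object: the elements of the Series are now lists rather than strings, and the string type can only hold strings.

The str accessor also supports element selection. If a cell holds a list, str[i] returns its i-th element; if it holds a single string, the string is treated as a sequence of characters and the i-th character is returned.

s.str.split('_').str[1]

0       b
1       d
2    <NA>
3       g
dtype: object

pd.Series(['a_b_c', ['a','b','c']], dtype="object").str[1]
# the first element is treated as ['a','_','b','_','c']

0    _
1    b
dtype: object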

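(b) Other parameters

The expand parameter controls whether the pieces are spread into separate columns, and the n parameter sets the maximum number of splits.

s.str.split('_',expand=True)

s.str.split('_',n=1)

0    [a, b_c]
1    [c, d_e]
2        <NA>
3    [f, g_h]
dtype: object

s.str.split('_',expand=True,n=1)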
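To make the effect of expand and n concrete, here is a small sketch that rebuilds the same Series and inspects the shape of each result (the names wide and wide_n1 are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string")

# expand=True spreads the pieces over columns; the missing row stays missing.
wide = s.str.split('_', expand=True)
print(wide.shape)              # (4, 3)
print(wide.iloc[0].tolist())   # ['a', 'b', 'c']

# n=1 stops after the first split, so only two columns are produced.
wide_n1 = s.str.split('_', expand=True, n=1)
print(wide_n1.shape)             # (4, 2)
print(wide_n1.iloc[0].tolist())  # ['a', 'b_c']
```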

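2.2 The str.cat method

(a) Concatenation modes for different objects

The cat method behaves differently depending on what is being concatenated: a single column, two columns, or multiple columns.

① For a single Series, all elements are joined into one string:

s = pd.Series(['ab',None,'d'],dtype='string')
s

0      ab
1    <NA>
2       d
dtype: string

s.str.cat()

'abd'

The optional sep parameter sets a separator, and na_rep sets the string that stands in for missing values:

s.str.cat(sep=',')

'ab,d'

s.str.cat(sep=',',na_rep='*')

'ab,*,d'

② For two Series, elements with matching index labels are joined:

s2 = pd.Series(['24',None,None],dtype='string')
s2

0      24
1    <NA>
2    <NA>
dtype: string

s.str.cat(s2)

0    ab24
1    <NA>
2    <NA>
dtype: string

The same parameters are available here. Note that missing values on both sides are replaced:

s.str.cat(s2,sep=',',na_rep='*')

0    ab,24
1      *,*
2      d,*
dtype: string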

③ Multi-column concatenation can be done either with a table or with several Series

Concatenating with a table:

s.str.cat(pd.DataFrame({0:['1','3','5'],1:['5','b',None]},dtype='string'),na_rep='*')

0    ab15
1     *3b
2     d5*
dtype: string

Concatenating with several Series:

s.str.cat([s+'0',s*2])

0    abab0abab
1         <NA>
2        dd0dd
dtype: string

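(b) Index alignment in cat

In the current version, if the two sides have different indexes and the join parameter is not given, a left join is used by default, i.e. join='left'.

s2 = pd.Series(list('abc'),index=[1,2,3],dtype='string')
s2

1    a
2    b
3    c
dtype: string

s.str.cat(s2,na_rep='*')

0    ab*
1     *a
2     db
dtype: string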
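For comparison, the sketch below runs the same concatenation with each value that the join parameter accepts and collects the results in a dictionary (results is an illustrative name):

```python
import pandas as pd

s = pd.Series(['ab', None, 'd'], dtype='string')
s2 = pd.Series(list('abc'), index=[1, 2, 3], dtype='string')

# Collect {join mode: {index label: concatenated string}} for all four modes.
results = {}
for how in ['left', 'right', 'inner', 'outer']:
    results[how] = s.str.cat(s2, join=how, na_rep='*').to_dict()
    print(how, results[how])
# left  {0: 'ab*', 1: '*a', 2: 'db'}
# right {1: '*a', 2: 'db', 3: '*c'}
# inner {1: '*a', 2: 'db'}
# outer {0: 'ab*', 1: '*a', 2: 'db', 3: '*c'}
```

The left join keeps the index of the calling Series, which is why label 3 of s2 does not appear in the default result.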

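3. Replacement

In the broad sense, replacement means applying the str.replace function; fillna, which replaces missing values, was covered in the previous chapter.

Replacement inevitably involves regular expressions. We assume the reader already knows the common regular-expression constructs; readers who do not should work through an introductory regular-expression reference first.

3.1 Common usage of str.replace

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca','', np.nan, 'CABA', 'dog', 'cat'],dtype="string")
s

0       A
1       B
2       C
3    Aaba
4    Baca
5        
6    <NA>
7    CABA
8     dog
9     cat
dtype: string

The first argument is a regular expression written as a raw string (r prefix), and the second is the replacement string. Passing regex=True makes the pattern mode explicit, which pandas 2.0 and later require because the default there is regex=False.

s.str.replace(r'^[AB]','***',regex=True)

0       ***
1       ***
2         C
3    ***aba
4    ***aca
5          
6      <NA>
7      CABA
8       dog
9       cat
dtype: string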

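3.2 Sub-group and function replacement

Sub-groups are accessed by positive integers (0 returns the whole match; sub-groups start at 1). A callable replacement requires regex=True in pandas 2.0 and later.

s.str.replace(r'([ABC])(\w+)',lambda x:x.group(2)[1:]+'*',regex=True)

0       A
1       B
2       C
3     ba*
4     ca*
5        
6    <NA>
7     BA*
8     dog
9     cat
dtype: string

The ?P<....> syntax names a sub-group so that it can be accessed by name:

s.str.replace(r'(?P<one>[ABC])(?P<two>\w+)',lambda x:x.group('two')[1:]+'*',regex=True)

0       A
1       B
2       C
3     ba*
4     ca*
5        
6    <NA>
7     BA*
8     dog
9     cat
dtype: string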

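3.3 Notes on str.replace

First, be clear that str.replace and replace are not the same thing:

str.replace works on object or string Series and, in the version used here, treats the pattern as a regular expression by default (since pandas 2.0 the default is regex=False); it is not currently available on a DataFrame;
replace works on a Series or DataFrame of any type; to replace with a regular expression you must set regex=True, and it supports multi-column replacement through a dictionary.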
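The contrast between the two can be sketched as follows. The frame, its column names, and the patterns are made up for illustration; object columns are used so the example does not depend on the string-type issues discussed in this section.

```python
import pandas as pd

# Hypothetical two-column frame used only for illustration.
df = pd.DataFrame({'code': ['A1', 'B2', 'C3'],
                   'tag': ['x-1', 'y-2', 'z-3']}, dtype=object)

# Series.str.replace handles one string column at a time.
code_clean = df['code'].str.replace(r'^[AB]', '*', regex=True)
print(code_clean.tolist())  # ['*1', '*2', 'C3']

# DataFrame.replace accepts a nested dict {column: {pattern: replacement}}
# and needs regex=True to treat the inner keys as regular expressions.
df_clean = df.replace({'code': {r'^[AB]': '*'}, 'tag': {r'-\d$': ''}}, regex=True)
print(df_clean['code'].tolist())  # ['*1', '*2', 'C3']
print(df_clean['tag'].tolist())   # ['x', 'y', 'z']
```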

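However, because the string type has only just been introduced, a few problems show up in practice; these issues will hopefully be fixed in later versions.

(a) The replacement argument of str.replace must not be pd.NA

This sounds unreasonable. For example, replacing the strings that match some regular expression with a missing value raises an error in the current version:

#pd.Series(['A','B'],dtype='string').str.replace(r'[A]',pd.NA) # raises an error
#pd.Series(['A','B'],dtype='O').str.replace(r'[A]',pd.NA) # raises an error

In this case you can take a detour: convert to object first, replace, then convert back:

pd.Series(['A','B'],dtype='string').astype('O').replace(r'[A]',pd.NA,regex=True).astype('string')

0    <NA>
1       B
dtype: string

As for why the regex mode of replace is not applied directly to the string Series (non-regex replace on string works fine), the reason is given in the next item.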

(b) For a Series of string type

The replace function cannot perform regular-expression replacement on it; this bug has not been fixed yet.

pd.Series(['A','B'],dtype='string').replace(r'[A]','C',regex=True)

0    A
1    B
dtype: string

pd.Series(['A','B'],dtype='O').replace(r'[A]','C',regex=True)

0    C
1    B
dtype: object

(c) A string Series that contains missing values cannot be processed with replace

#pd.Series(['A',np.nan],dtype='string').replace('A','B') # raises an error
pd.Series(['A',np.nan],dtype='string').str.replace('A','B')

0       B
1    <NA>
dtype: string

In summary: unless you need to set elements to missing values (convert to object and back), use the str.replace method.

4. Substring matching and extraction

4.1 The str.extract method

(a) Common usage

pd.Series(['10-87', '10-88', '10-89'],dtype="string").str.extract(r'([\d]{2})-([\d]{2})')

Use sub-group names as column names:

pd.Series(['10-87', '10-88', '-89'],dtype="string").str.extract(r'(?P<name_1>[\d]{2})-(?P<name_2>[\d]{2})')

Use the ? regex quantifier to make a group optional, so that a partial match is still extracted:

pd.Series(['10-87', '10-88', '-89'],dtype="string").str.extract(r'(?P<name_1>[\d]{2})?-(?P<name_2>[\d]{2})')

pd.Series(['10-87', '10-88', '10-'],dtype="string").str.extract(r'(?P<name_1>[\d]{2})-(?P<name_2>[\d]{2})?')
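The difference between mandatory and optional groups is easiest to see by checking which cells come back missing. A minimal sketch on the same data (strict and loose are illustrative names):

```python
import pandas as pd

s = pd.Series(['10-87', '10-88', '-89'], dtype="string")

# Both groups mandatory: the third row does not match, so the whole row is missing.
strict = s.str.extract(r'(?P<name_1>\d{2})-(?P<name_2>\d{2})')
# First group optional: the third row now matches and only name_1 is missing.
loose = s.str.extract(r'(?P<name_1>\d{2})?-(?P<name_2>\d{2})')

print(strict.columns.tolist())            # ['name_1', 'name_2']
print(strict.isna().to_numpy().tolist())  # [[False, False], [False, False], [True, True]]
print(loose.isna().to_numpy().tolist())   # [[False, False], [False, False], [True, False]]
print(loose.loc[2, 'name_2'])             # 89
```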

(b) The expand parameter (default True)

For a Series and a pattern with one sub-group, expand=False returns a Series; with more than one sub-group the expand parameter has no effect and a DataFrame is always returned.

For an Index and a pattern with one sub-group, expand=False returns the extracted Index; with more than one sub-group and expand=False, an error is raised.

s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string")
s.index

Index(['A11', 'B22', 'C33'], dtype='object')

s.str.extract(r'([\w])')

s.str.extract(r'([\w])',expand=False)

A11    a
B22    b
C33    c
dtype: string

s.index.str.extract(r'([\w])')

s.index.str.extract(r'([\w])',expand=False)

Index(['A', 'B', 'C'], dtype='object')

s.index.str.extract(r'([\w])([\d])')

#s.index.str.extract(r'([\w])([\d])',expand=False) # raises an error

4.2 The str.extractall method

Unlike extract, which returns only the first match, extractall finds every match in each string and builds a multi-level index (even when only one match is found).

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"],dtype="string")
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])'
s.str.extract(two_groups, expand=True)

s.str.extractall(two_groups)

s['A']='a1'
s.str.extractall(two_groups)

To look at the i-th match of every row, use the xs method:

s = pd.Series(["a1a2", "b1b2", "c1c2"], index=["A", "B", "C"],dtype="string")
s.str.extractall(two_groups).xs(1,level='match')
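The multi-level index that extractall builds can be inspected as in the sketch below, which reuses the first Series of this subsection (all_matches and second are illustrative names):

```python
import pandas as pd

s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")
two_groups = r'(?P<letter>[a-z])(?P<digit>[0-9])'

all_matches = s.str.extractall(two_groups)
# The outer level is the original index; the inner level, named 'match',
# numbers the matches within each string.
print(all_matches.index.tolist())
# [('A', 0), ('A', 1), ('B', 0), ('C', 0)]
print(list(all_matches.index.names))   # [None, 'match']
print(all_matches['digit'].tolist())   # ['1', '2', '1', '1']

# Only row 'A' has a second match (match == 1).
second = all_matches.xs(1, level='match')
print(second.index.tolist())           # ['A']
```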

4.3 str.contains and str.match

str.contains checks whether a string contains a given regular-expression pattern:

pd.Series(['1', None, '3a', '3b', '03c'], dtype="string").str.contains(r'[0-9][a-z]')

0    False
1     <NA>
2     True
3     True
4     True
dtype: boolean

The optional na parameter sets the result returned for missing values:

pd.Series(['1', None, '3a', '3b', '03c'], dtype="string").str.contains('a', na=False)

0    False
1    False
2     True
3    False
4    False
dtype: boolean

str.match differs in that it relies on Python's re.match, so it checks whether the pattern matches starting from the beginning of the string:

pd.Series(['1', None, '3a_', '3b', '03c'], dtype="string").str.match(r'[0-9][a-z]',na=False)

0    False
1    False
2     True
3     True
4    False
dtype: boolean

pd.Series(['1', None, '_3a', '3b', '03c'], dtype="string").str.match(r'[0-9][a-z]',na=False)

0    False
1    False
2    False
3     True
4    False
dtype: boolean

5. Common string methods

5.1 Filtering methods

(a) str.strip

Commonly used to strip whitespace:

pd.Series(list('abc'),index=[' space1 ','space2 ',' space3'],dtype="string").index.str.strip()

Index(['space1', 'space2', 'space3'], dtype='object')

(b) str.lower and str.upper

pd.Series('A',dtype="string").str.lower()

0    a
dtype: string

pd.Series('a',dtype="string").str.upper()

0    A
dtype: string

(c) str.swapcase and str.capitalize

These swap the case of every letter and capitalize the first letter, respectively:

pd.Series('abCD',dtype="string").str.swapcase()

0    ABcd
dtype: string

pd.Series('abCD',dtype="string").str.capitalize()

0    Abcd
dtype: string

5.2 The isnumeric method

isnumeric checks whether every character is a digit. How, then, would you check whether a string represents a numeric value? (See Question 2.)

pd.Series(['1.2','1','-0.3','a',np.nan],dtype="string").str.isnumeric()

0    False
1     True
2    False
3    False
4     <NA>
dtype: boolean
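One possible way to approach the question above, offered as a sketch rather than the article's intended answer, is to let pd.to_numeric attempt the conversion and see which cells survive. The Series is first cast to object so that the result is a plain float64 Series; is_number is an illustrative name.

```python
import numpy as np
import pandas as pd

s = pd.Series(['1.2', '1', '-0.3', 'a', np.nan], dtype="string")

# errors='coerce' turns unparseable strings into NaN instead of raising,
# so notna() marks the cells that hold a valid number.
is_number = pd.to_numeric(s.astype(object), errors='coerce').notna()
print(is_number.tolist())  # [True, True, True, False, False]
```

Note that this treats missing values as non-numeric, whereas isnumeric propagates them as <NA>.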

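6. Questions and exercises

6.1 Questions

[Question 1] What is the difference between str accessor methods and the methods of DataFrame/Series objects?

[Question 2] Given a column of string type, how do you determine whether each cell holds numeric data?

[Question 3] What does the rsplit method do? In what situations is it useful?

[Question 4] Sections 2 to 4 of this chapter introduced five kinds of string operations. Think about which scenarios each of them is suited to.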
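As a starting point for Question 3, the sketch below compares split and rsplit on made-up identifiers in which only the last underscore separates a numeric suffix (the data and names are hypothetical):

```python
import pandas as pd

# Hypothetical identifiers: the part after the last '_' is a numeric suffix.
ids = pd.Series(['sensor_a_01', 'sensor_b_02', 'plain'], dtype="string")

# split works from the left, rsplit from the right;
# the two differ only when n limits the number of splits.
left = [list(x) for x in ids.str.split('_', n=1)]
right = [list(x) for x in ids.str.rsplit('_', n=1)]
print(left)   # [['sensor', 'a_01'], ['sensor', 'b_02'], ['plain']]
print(right)  # [['sensor_a', '01'], ['sensor_b', '02'], ['plain']]
```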

6.2 Exercises

[Exercise 1] A dataset of strings is provided. Solve the following problems:

(a) Encode each person's information as a string (add an ID column after the person number) in the following format: “×××（名字）：×國人，性別×，生于×年×月×日”, that is, name, country code, sex, and year, month and day of birth.

# Method 1
> ex1_ori = pd.read_csv('data/String_data_one.csv',index_col='人員編號')
> ex1_ori.head()

人員編號  姓名     國籍  性別  出生年  出生月  出生日
1        aesfd   2    男    1942   8      10
2        fasefa  5    女    1985   10     4
3        aeagd   4    女    1946   10     15
4        aef     4    男    1999   5      13
5        eaf     1    女    2010   6      24

> ex1 = ex1_ori.copy()
> # helper columns that hold the fixed pieces of the ID string
> ex1['冒號'] = '：'
> ex1['逗號'] = '，'
> ex1['國人'] = '國人'
> ex1['性別2'] = '性別'
> ex1['生于'] = '生于'
> ex1['年'] = '年'
> ex1['月'] = '月'
> ex1['日'] = '日'
> ID = ex1['姓名'].str.cat([ex1['冒號'],
                            ex1['國籍'].astype('str'),
                            ex1['國人'],
                            ex1['逗號'],
                            ex1['性別2'],
                            ex1['性別'],
                            ex1['逗號'],
                            ex1['生于'],
                            ex1['出生年'].astype('str'),
                            ex1['年'],
                            ex1['出生月'].astype('str'),
                            ex1['月'],
                            ex1['出生日'].astype('str'),
                            ex1['日']])
> ex1_ori['ID'] = ID
> ex1_ori

人員編號  姓名     國籍  性別  出生年  出生月  出生日  ID
1        aesfd   2    男    1942   8      10     aesfd：2國人，性別男，生于1942年8月10日
2        fasefa  5    女    1985   10     4      fasefa：5國人，性別女，生于1985年10月4日
3        aeagd   4    女    1946   10     15     aeagd：4國人，性別女，生于1946年10月15日
4        aef     4    男    1999   5      13     aef：4國人，性別男，生于1999年5月13日
5        eaf     1    女    2010   6      24     eaf：1國人，性別女，生于2010年6月24日

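(b) Rewrite the birthday part of the ID from (a) in Chinese numerals (for example 一九七四年十月二十三日), leaving the rest of the format unchanged.

(c) Split the ID column from (b) back into the five corresponding columns of the original table, and use equals to check that the result matches.

# Reference answer (df_new is the table produced in part (b))
> # lookup tables from Chinese numerals to digits
> dic_year = {i[0]:i[1] for i in zip(list('零一二三四五六七八九'),list('0123456789'))}
> dic_two = {i[0]:i[1] for i in zip(list('十一二三四五六七八九'),list('0123456789'))}
> dic_one = {'十':'1','二十':'2','三十':'3',None:''}
> # the ID format uses the full-width colon '：'
> df_res = df_new['ID'].str.extract(r'(?P<姓名>[a-zA-Z]+)：(?P<國籍>[\d])國人，性別(?P<性別>[\w])，生于(?P<出生年>[\w]{4})年(?P<出生月>[\w]+)月(?P<出生日>[\w]+)日')
> # year: translate each of the four characters digit by digit
> df_res['出生年'] = df_res['出生年'].str.replace(r'(\w)+',lambda x:''.join([dic_year[x.group(0)[i]] for i in range(4)]),regex=True)
> # month and day: optional tens part plus a units part; a lone 十 becomes '0' and is then fixed to '10'
> df_res['出生月'] = df_res['出生月'].str.replace(r'(?P<one>\w?十)?(?P<two>[\w])',lambda x:dic_one[x.group('one')]+dic_two[x.group('two')],regex=True).str.replace(r'0','10',regex=True)
> df_res['出生日'] = df_res['出生日'].str.replace(r'(?P<one>\w?十)?(?P<two>[\w])',lambda x:dic_one[x.group('one')]+dic_two[x.group('two')],regex=True).str.replace(r'^0','10',regex=True)
> df_res.head()

人員編號  姓名     國籍  性別  出生年  出生月  出生日
1        aesfd   2    男    1942   8      10
2        fasefa  5    女    1985   10     4
3        aeagd   4    女    1946   10     15
4        aef     4    男    1999   5      13
5        eaf     1    女    2010   6      24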

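[Exercise 2] A semi-synthetic dataset is provided whose first column contains news headlines about the novel coronavirus. Solve the following problems:

(a) Select all rows whose headlines concern Beijing (北京市) or Shanghai (上海市).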
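A possible approach to part (a) is str.contains with a regex alternation. Since the exercise data is not reproduced in this article, the sketch below uses a hypothetical stand-in frame; the headlines and the column name col1 are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the exercise data; the real contents may differ.
ex2_demo = pd.DataFrame({
    'col1': ['北京市新增病例', '上海市发布通告', '广州市复工安排', None],
    'col2': [1.0, 2.0, 3.0, 4.0],
})

# '|' is regex alternation; na=False treats missing headlines as non-matching.
mask = ex2_demo['col1'].str.contains('北京|上海', regex=True, na=False)
selected = ex2_demo[mask]
print(selected.index.tolist())  # [0, 1]
```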

(b) Compute the mean of col2.

ex2.col2.str.rstrip('-`').str.lstrip('/').astype(float).mean()

-0.984

(c) Compute the mean of col3.

ex2.columns = ex2.columns.str.strip(' ')

## !!! used to find the dirty data
def is_number(x):
    try:
        float(x)
        return True
    except (SyntaxError, ValueError) as e:
        return False

ex2[~ex2.col3.map(is_number)]

ex2.col3.str.replace(r'[`\\{]', '', regex=True).astype(float).mean()

24.707484999999988
