June: Hands-On Data Analysis (task02)
Review: In the earlier sections we learned the basics of Pandas. Chapter 2 starts on the business side of data analysis. In section 1 of chapter 2 we covered data cleaning, a very important step: only once the data is reasonably clean can the later analysis be convincing. In this section we restructure the data, which still falls under data understanding (preparation).
# Time: 2021-06-16
# This post adds a few notes and tidies up the content
# The goal is to become an [excellent learner]
Contents
- [task 02] Data cleaning and feature processing
- Chapter 2: Data cleaning and feature processing
- 2.1 Merging data
- 2.1.1 Task 1: Load the four split data files
- 2.1.2 Task 2: Use concat to merge two CSV files
- 2.1.3 Task 3: Use concat to merge two tables vertically
- 2.1.4 Task 4: Use join and append to complete tasks 2 and 3
- 2.1.5 Task 5: Use pd.merge and append to complete tasks 2 and 3
- 2.1.6 Task 6: Save the finished data as result.csv
- 2.2 Looking at the data from another angle
- 2.2.1 Task 1: Convert our data to Series type
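[task 02] Data cleaning and feature processing
<--------Thanks to the comments section for the corrections; the content has been updated!--------->
Chapter 2: Data cleaning and feature processing
import numpy as np
import pandas as pd
2.1 Merging data
2.1.1 Task 1: Load the four split data files
Load all the data in the data folder. There are four files, which cut the complete data from the previous lesson by rows and by columns:
- left_up: upper-left part
- left_down: lower-left part
- right_up: upper-right part
- right_down: lower-right part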
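left_up (first five rows):
| PassengerId | Survived | Pclass | Name |
|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... |
| 3 | 1 | 3 | Heikkinen, Miss. Laina |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
| 5 | 0 | 3 | Allen, Mr. William Henry |
- [Passenger ID] [Survived] [Ticket class] [Name]
left_down (first five rows):
| PassengerId | Survived | Pclass | Name |
|---|---|---|---|
| 440 | 0 | 2 | Kvillner, Mr. Johan Henrik Johannesson |
| 441 | 1 | 2 | Hart, Mrs. Benjamin (Esther Ada Bloomfield) |
| 442 | 0 | 3 | Hampe, Mr. Leon |
| 443 | 0 | 3 | Petterson, Mr. Johan Emil |
| 444 | 1 | 2 | Reynaldo, Ms. Encarnacion |
- [Passenger ID] [Survived] [Ticket class] [Name]
right_down (first five rows):
| Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|
| male | 31.0 | 0 | 0 | C.A. 18723 | 10.500 | NaN | S |
| female | 45.0 | 1 | 1 | F.C.C. 13529 | 26.250 | NaN | S |
| male | 20.0 | 0 | 0 | 345769 | 9.500 | NaN | S |
| male | 25.0 | 1 | 0 | 347076 | 7.775 | NaN | S |
| female | 28.0 | 0 | 0 | 230434 | 13.000 | NaN | S |
- [Sex] [Age] [Number of siblings/spouses] [Number of parents/children] [Ticket info] [Fare] [Cabin] [Port of embarkation]
right_up (first five rows):
| Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|
| male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
- [Sex] [Age] [Number of siblings/spouses] [Number of parents/children] [Ticket info] [Fare] [Cabin] [Port of embarkation]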
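The loading code for this task is not shown in the post. Below is a minimal sketch of what it might look like. The variable names text_left_up, text_right_up, text_left_down and text_right_down are the ones used in the later tasks. The tiny table, the temporary folder and the .csv extension on the two "down" files are assumptions made only so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the full table: 6 rows, 4 columns.
full = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
})

# A temporary directory stands in for the data folder (assumption).
data_dir = Path(tempfile.mkdtemp())

# Cut the table into four pieces: rows 0-2 / 3-5, columns 0-1 / 2-3.
pieces = {
    "train-left-up.csv": full.iloc[:3, :2],
    "train-right-up.csv": full.iloc[:3, 2:],
    "train-left-down.csv": full.iloc[3:, :2],
    "train-right-down.csv": full.iloc[3:, 2:],
}
for name, piece in pieces.items():
    piece.to_csv(data_dir / name, index=False)

# Load the four pieces back, as task 1 asks.
text_left_up = pd.read_csv(data_dir / "train-left-up.csv")
text_right_up = pd.read_csv(data_dir / "train-right-up.csv")
text_left_down = pd.read_csv(data_dir / "train-left-down.csv")
text_right_down = pd.read_csv(data_dir / "train-right-down.csv")

print(text_left_up.shape, text_right_down.shape)
```

Note that each file gets its own fresh 0-based row index when it is read, which matters later when the upper and lower halves are stacked.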
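2.1.2 Task 2: Use concat to merge two CSV files
- pd.concat()
Parameter meanings
- objs: a sequence or mapping of Series or DataFrame objects (older pandas versions also accepted Panel, which has since been removed). If a dict is passed, its keys are used as the keys argument unless keys is passed explicitly, in which case the values are selected. Any None objects are silently dropped unless they are all None, in which case a ValueError is raised.
- axis: {0, 1, …}, default 0. The axis to concatenate along: 0 stacks along the rows (index), 1 along the columns.
- join: {'inner', 'outer'}, default 'outer'. How to handle the indexes on the other axis: outer takes the union, inner the intersection.
- ignore_index: boolean, default False. If True, do not use the index values on the concatenation axis; the resulting axis is labeled 0, …, n-1. This is useful when concatenating objects whose concatenation axis has no meaningful index information. Index values on the other axes are still respected in the join.
- join_axes: list of Index objects. Specific indexes to use for the other n-1 axes instead of performing the inner/outer set logic. This parameter was deprecated in pandas 0.25 and removed in pandas 1.0; reindexing the result gives the same effect.
- keys: sequence, default None. Builds a hierarchical index with the passed keys as the outermost level. If multiple levels are passed, they should be tuples.
- levels: list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they are inferred from the keys.
- names: list, default None. Names for the levels in the resulting hierarchical index.
- verify_integrity: boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
- copy: boolean, default True. If False, do not copy data unnecessarily.
[Default form]
By default, concat appends rows (axis=0) and aligns the columns by name.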
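The default behaviour can be seen on three small tables. This is only a sketch: df1, df2, df3 and their contents are made up for illustration, and frames is assumed to be a plain list of DataFrames.

```python
import pandas as pd

# Three hypothetical tables with the same columns and different row labels.
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[2, 3])
df3 = pd.DataFrame({"A": ["A4", "A5"], "B": ["B4", "B5"]}, index=[4, 5])

frames = [df1, df2, df3]

# Defaults: axis=0 (rows are appended) and join="outer" (columns are unioned).
result = pd.concat(frames)
print(result)
```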
[Use keys to mark which table each row came from]
result = pd.concat(frames, keys=['x', 'y', 'z'])
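A short sketch of what keys adds, again with made-up tables (frames and its contents are assumptions): each source table becomes its own outer index level, so it can be pulled back out by its key.

```python
import pandas as pd

# Hypothetical input tables.
df1 = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
df2 = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[2, 3])
df3 = pd.DataFrame({"A": ["A4", "A5"], "B": ["B4", "B5"]}, index=[4, 5])
frames = [df1, df2, df3]

# keys become the outermost level of a two-level row index.
result = pd.concat(frames, keys=["x", "y", "z"])

# Selecting by key recovers the rows that came from the second table.
from_y = result.loc["y"]
print(from_y)
```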
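[Merging along columns, axis=1]
- The default join='outer' takes the union of the row indexes: rows with the same index label are joined side by side (row indexes 2 and 3 in the original figure), and missing values are filled with NaN.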
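The figure is missing here, so the following sketch reproduces the idea with two hypothetical tables that share only row labels 2 and 3.

```python
import pandas as pd

# Hypothetical tables: only row labels 2 and 3 appear in both.
df1 = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"]}, index=[0, 1, 2, 3])
df4 = pd.DataFrame({"D": ["D2", "D3", "D6", "D7"]}, index=[2, 3, 6, 7])

# axis=1 places the tables side by side; join="outer" keeps every row label.
result = pd.concat([df1, df4], axis=1)
print(result)
```

Rows 0 and 1 have no match in df4 and rows 6 and 7 have none in df1, so those cells come out as NaN.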
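[Merging along columns, join='inner' takes the intersection]
result = pd.concat([df1, df4], axis=1, join='inner')
[join_axes]
Passing join_axes specifies which index the data is aligned to. Note that join_axes only exists in pandas versions before 1.0.
result = pd.concat([df1, df4], axis=1, join_axes=[df1.index])
- Column merge: df4 is joined to df1 using df1's index, and missing values are filled with NaN.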
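Since join_axes is gone from current pandas, here is a sketch of the inner join next to one way of getting the join_axes behaviour on newer versions, namely reindexing the outer result to df1's index. The tables are hypothetical.

```python
import pandas as pd

# Hypothetical tables: only row labels 2 and 3 appear in both.
df1 = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"]}, index=[0, 1, 2, 3])
df4 = pd.DataFrame({"D": ["D2", "D3", "D6", "D7"]}, index=[2, 3, 6, 7])

# Inner join: keep only row labels present in both tables.
inner = pd.concat([df1, df4], axis=1, join="inner")

# Stand-in for join_axes=[df1.index]: outer concat, then keep df1's rows.
aligned = pd.concat([df1, df4], axis=1).reindex(df1.index)

print(inner)
print(aligned)
```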
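[Task requirement] Merge train-left-up.csv and train-right-up.csv horizontally into one table and save it as result_up.
list_up = [text_left_up, text_right_up]
result_up = pd.concat(list_up, axis=1)
result_up
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |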
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 435 | 0 | 1 | Silvey, Mr. William Baird | male | 50.0 | 1 | 0 | 13507 | 55.9000 | E44 | S |
| 436 | 1 | 1 | Carter, Miss. Lucile Polk | female | 14.0 | 1 | 2 | 113760 | 120.0000 | B96 B98 | S |
| 437 | 0 | 3 | Ford, Miss. Doolina Margaret "Daisy" | female | 21.0 | 2 | 2 | W./C. 6608 | 34.3750 | NaN | S |
| 438 | 1 | 2 | Richards, Mrs. Sidney (Emily Hocking) | female | 24.0 | 2 | 3 | 29106 | 18.7500 | NaN | S |
| 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
439 rows × 12 columns
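- First put the tables into a list, then pass the list to concat.
2.1.3 Task 3: Use concat to merge two tables vertically
Using concat: merge train-left-down and train-right-down horizontally into one table and save it as result_down. Then merge result_up and result_down vertically into result.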
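list_down = [text_left_down, text_right_down]
result_down = pd.concat(list_down, axis=1)
result_down
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 440 | 0 | 2 | Kvillner, Mr. Johan Henrik Johannesson | male | 31.0 | 0 | 0 | C.A. 18723 | 10.500 | NaN | S |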
| 441 | 1 | 2 | Hart, Mrs. Benjamin (Esther Ada Bloomfield) | female | 45.0 | 1 | 1 | F.C.C. 13529 | 26.250 | NaN | S |
| 442 | 0 | 3 | Hampe, Mr. Leon | male | 20.0 | 0 | 0 | 345769 | 9.500 | NaN | S |
| 443 | 0 | 3 | Petterson, Mr. Johan Emil | male | 25.0 | 1 | 0 | 347076 | 7.775 | NaN | S |
| 444 | 1 | 2 | Reynaldo, Ms. Encarnacion | female | 28.0 | 0 | 0 | 230434 | 13.000 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.000 | NaN | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.000 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.450 | NaN | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.000 | C148 | C |
| 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.750 | NaN | Q |
452 rows × 12 columns
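result = pd.concat([result_up, result_down])
result
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |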
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
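result.loc[1].head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 441 | 1 | 2 | Hart, Mrs. Benjamin (Esther Ada Bloomfield) | female | 45.0 | 1 | 1 | F.C.C. 13529 | 26.2500 | NaN | S |
- The tables are joined, but the row index is broken: the labels restart from 0 in the lower half, so a single label such as 1 matches two rows.
[Solution] Reset the index, using the drop parameter
- drop=True discards the original index and resets it.
- drop=False keeps the original index as a column and adds the reset index.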
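| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |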
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
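The code for the index fix is not shown in the post. A minimal sketch, assuming it was done with reset_index (the two small tables are hypothetical), including the ignore_index option of concat as an alternative:

```python
import pandas as pd

# Hypothetical upper and lower halves, each with its own 0-based index.
result_up = pd.DataFrame({"PassengerId": [1, 2], "Name": ["Braund", "Cumings"]})
result_down = pd.DataFrame({"PassengerId": [440, 441], "Name": ["Kvillner", "Hart"]})

result = pd.concat([result_up, result_down])
print(result.index.tolist())   # the labels 0 and 1 each appear twice
print(len(result.loc[1]))      # so label 1 selects two rows

# drop=True throws the old labels away and renumbers from 0.
fixed = result.reset_index(drop=True)

# drop=False keeps the old labels in a new column named "index".
kept = result.reset_index(drop=False)

# Alternative: let concat renumber the rows while concatenating.
also_fixed = pd.concat([result_up, result_down], ignore_index=True)
```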
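2.1.4 Task 4: Use join and append to complete tasks 2 and 3
result_up = text_left_up.join(text_right_up)
result_up
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |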
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 435 | 0 | 1 | Silvey, Mr. William Baird | male | 50.0 | 1 | 0 | 13507 | 55.9000 | E44 | S |
| 436 | 1 | 1 | Carter, Miss. Lucile Polk | female | 14.0 | 1 | 2 | 113760 | 120.0000 | B96 B98 | S |
| 437 | 0 | 3 | Ford, Miss. Doolina Margaret "Daisy" | female | 21.0 | 2 | 2 | W./C. 6608 | 34.3750 | NaN | S |
| 438 | 1 | 2 | Richards, Mrs. Sidney (Emily Hocking) | female | 24.0 | 2 | 3 | 29106 | 18.7500 | NaN | S |
| 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
439 rows × 12 columns
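result_down = text_left_down.join(text_right_down)
# DataFrame.append was removed in pandas 2.0; there, pd.concat([result_up, result_down]) does the same job
result = result_up.append(result_down)
result
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |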
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
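DataFrame.append no longer exists in pandas 2.0 and later, so on a current install the task 4 code stops at the append call. A sketch of the same steps with join for the horizontal part and pd.concat standing in for append (the four small tables are hypothetical):

```python
import pandas as pd

# Hypothetical four pieces, each with a default 0-based index.
text_left_up = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
text_right_up = pd.DataFrame({"Sex": ["male", "female"], "Age": [22.0, 38.0]})
text_left_down = pd.DataFrame({"PassengerId": [440, 441], "Survived": [0, 1]})
text_right_down = pd.DataFrame({"Sex": ["male", "female"], "Age": [31.0, 45.0]})

# join aligns on the row index and puts the columns side by side.
result_up = text_left_up.join(text_right_up)
result_down = text_left_down.join(text_right_down)

# pd.concat stacks the halves as append used to; ignore_index=True
# additionally renumbers the rows, which plain append did not do.
result = pd.concat([result_up, result_down], ignore_index=True)
print(result)
```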
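2.1.5 Task 5: Use pd.merge and append to complete tasks 2 and 3
- pd.merge()
To join on the index, set both left_index=True and right_index=True, or set left_index=True together with right_on naming a key column. In short, the join keys for both the left and right tables must be specified, and they can both be columns, both be indexes, or a mix of the two.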
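The two key styles can be compared on a small example. This is only a sketch: left, right and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical tables: left is indexed by passenger id,
# right carries the id as an ordinary column.
left = pd.DataFrame({"Name": ["Braund", "Cumings", "Heikkinen"]}, index=[1, 2, 3])
right = pd.DataFrame({"PassengerId": [1, 2, 3], "Age": [22.0, 38.0, 26.0]})

# Index on both sides: move the id into right's index first.
by_index = pd.merge(
    left, right.set_index("PassengerId"), left_index=True, right_index=True
)

# Mixed: left's index against right's PassengerId column.
mixed = pd.merge(left, right, left_index=True, right_on="PassengerId")

print(by_index)
print(mixed)
```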
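result_up = pd.merge(text_left_up, text_right_up, left_index=True, right_index=True)
result_down = pd.merge(text_left_down, text_right_down, left_index=True, right_index=True)
# DataFrame.append was removed in pandas 2.0; there, pd.concat([result_up, result_down]) does the same job
result = result_up.append(result_down)
result
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |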
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
[Think] Compare how merge, join, and concat differ and what they have in common. In tasks 4 and 5, why is the DataFrame append method required each time? Could tasks 4 and 5 be completed using only merge or only join?
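One possible answer, offered as a sketch rather than the intended solution: merge and join are built to match rows by key and widen the table, which is why a row-stacking step is asked for separately. An outer merge on all shared columns can imitate stacking, but only when no row appears in both halves. The two small tables are hypothetical.

```python
import pandas as pd

# Hypothetical upper and lower halves with no rows in common.
result_up = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"]})
result_down = pd.DataFrame({"PassengerId": [440, 441], "Sex": ["male", "female"]})

# With no `on` given, merge uses every shared column as the key.
# how="outer" keeps rows found in either table, so two disjoint
# tables end up stacked on top of each other.
stacked = pd.merge(result_up, result_down, how="outer")

# Reference result from the tool meant for stacking.
expected = pd.concat([result_up, result_down], ignore_index=True)

print(stacked)
```

If the same row occurred in both halves, the merge would match it and keep one copy, whereas concat would keep both, so this is a workaround and not a general replacement.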
2.1.6 Task 6: Save the finished data as result.csv
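One thing to keep in mind for this step: to_csv writes the row index as an extra, unnamed column by default, which is why an Unnamed: 0 column shows up when the file is read back in section 2.2. A minimal sketch of the difference (the tiny table, the temporary folder and the file names with_index.csv and no_index.csv are assumptions for illustration):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical finished table.
result = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
out_dir = Path(tempfile.mkdtemp())

# Default: the index is written as the first, unnamed column.
result.to_csv(out_dir / "with_index.csv")
with_index = pd.read_csv(out_dir / "with_index.csv")

# index=False leaves the index out of the file.
result.to_csv(out_dir / "no_index.csv", index=False)
no_index = pd.read_csv(out_dir / "no_index.csv")

print(list(with_index.columns))
print(list(no_index.columns))
```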
result.to_csv('result_task02.csv')
2.2 Looking at the data from another angle
2.2.1 Task 1: Convert our data to Series type
# Load the complete data
text = pd.read_csv('result_task02.csv')
text.head()
# Write your code here
unit_result = text.stack().head(30)
unit_result
0  Unnamed: 0     0
   PassengerId    1
   Survived       0
   Pclass         3
   Name           Braund, Mr. Owen Harris
   Sex            male
   Age            22.0
   SibSp          1
   Parch          0
   Ticket         A/5 21171
   Fare           7.25
   Embarked       S
1  Unnamed: 0     1
   PassengerId    2
   Survived       1
   Pclass         1
   Name           Cumings, Mrs. John Bradley (Florence Briggs Th...
   Sex            female
   Age            38.0
   SibSp          1
   Parch          0
   Ticket         PC 17599
   Fare           71.2833
   Cabin          C85
   Embarked       C
2  Unnamed: 0     2
   PassengerId    3
   Survived       1
   Pclass         3
   Name           Heikkinen, Miss. Laina
dtype: object
# Save the result as unit_result.csv
unit_result.to_csv('unit_result.csv')
test = pd.read_csv('unit_result.csv')
test
| 0 | Unnamed: 0 | 0 |
| 0 | PassengerId | 1 |
| 0 | Survived | 0 |
| 0 | Pclass | 3 |
| 0 | Name | Braund, Mr. Owen Harris |
| 0 | Sex | male |
| 0 | Age | 22.0 |
| 0 | SibSp | 1 |
| 0 | Parch | 0 |
| 0 | Ticket | A/5 21171 |
| 0 | Fare | 7.25 |
| 0 | Embarked | S |
| 1 | Unnamed: 0 | 1 |
| 1 | PassengerId | 2 |
| 1 | Survived | 1 |
| 1 | Pclass | 1 |
| 1 | Name | Cumings, Mrs. John Bradley (Florence Briggs Th... |
| 1 | Sex | female |
| 1 | Age | 38.0 |
| 1 | SibSp | 1 |
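- What does the stack function do?
stack literally means a pile. DataFrame.stack() moves the column labels into the innermost level of the row index, so the fields of each record are laid end to end and the result is a Series with a two-level index (row label, column name). In the output above, missing values are dropped, which is why row 0 has no Cabin entry.
The prototype stack(arrays, axis=0), where arrays can be a sequence of arrays or lists, belongs to numpy.stack, which is a different function. The method called above is DataFrame.stack(), invoked on the DataFrame itself; its level parameter defaults to -1, the innermost column level.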
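A small sketch of stack and its inverse unstack, using a hypothetical three-column table with no missing values:

```python
import pandas as pd

# Hypothetical table with three columns and two rows.
text = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female"],
})

# stack turns the wide table into one long Series whose index
# is the pair (row label, column name).
stacked = text.stack()
print(type(stacked).__name__)
print(stacked.loc[(1, "Name")])

# unstack moves the inner index level back into columns.
wide = stacked.unstack()
print(wide.shape)
```

Each value is now addressed by a pair of labels instead of by a row and a column, which is the "other angle" this section refers to.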