

Datawhale Data Analysis: Titanic, Unit 1

Published: 2023/12/8

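1 Chapter 1: Loading the data and a first look

1.1 Loading the data

Dataset download: https://www.kaggle.com/c/titanic/overview

1.1.1 Task 1: Import numpy and pandas

    # your code here
    import numpy as np
    import pandas as pd
    import os

[Hint] If the import fails, learn how to install the numpy and pandas libraries in your Python environment.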

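1.1.2 Task 2: Load the data

(1) Load the data using a relative path
(2) Load the data using an absolute path

    # your code here
    test_data = pd.read_csv('test_1.csv')  # relative path
    # absolute path: the file is opened first and the file object is passed to read_csv
    f = open('E://study//master3//數據分析//DataWhale//Titanic//hands-on-data-analysis-master//hands-on-data-analysis-master//第一單元項目集合/train.csv')
    train_data = pd.read_csv(f)
    # test_data_t = pd.read_table('./test_1.csv')
    # os.getcwd()
    # test_data_t
    train_data.head(5)

     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
    0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
    1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
    2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
    3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
    4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S

    # your code here
    test_data.head(3)

     | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | a
    0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 100
    1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 100
    2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 100

[Hint] If loading with a relative path raises an error, use os.getcwd() to check the current working directory.
[Think] Now that you know how to load data, compare pd.read_csv() and pd.read_table(). What would you have to do to make them give the same result? Look into the difference between '.tsv' and '.csv' files: how would you load each?
[Summary] Loading data is the first step of all the work. We will meet different data formats (e.g. .csv, .tsv, .xlsx), but the loading approach is the same. When you run into a problem you have not seen before, look things up, use Google, understand the business logic, and be clear about what the input and the output are.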
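One way to explore the read_csv / read_table question is the minimal sketch below. It uses small in-memory strings (made up for illustration, not the Titanic files) and relies on the two functions differing only in their default separator: a comma for read_csv and a tab for read_table.

```python
import io
import pandas as pd

# Illustrative in-memory data, not the real Titanic files
csv_text = "PassengerId,Name,Fare\n1,Braund,7.25\n2,Cumings,71.2833\n"
tsv_text = csv_text.replace(",", "\t")

from_csv = pd.read_csv(io.StringIO(csv_text))                   # default sep=','
from_tsv = pd.read_table(io.StringIO(tsv_text))                 # default sep='\t'
csv_via_table = pd.read_table(io.StringIO(csv_text), sep=",")   # read_table on a .csv
tsv_via_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")      # read_csv on a .tsv

# All four calls produce the same DataFrame
print(from_csv.equals(from_tsv))
print(from_csv.equals(csv_via_table))
print(from_csv.equals(tsv_via_csv))
```

So passing the right sep is enough to make either function read either format.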

1.1.3 Task 3: Read the data in chunks of 1000 rows

    # your code here
    chunker = pd.read_csv('train.csv', chunksize=1000)
    for piece in chunker:
        print(type(piece))
        print(len(piece))
        print(piece)

    <class 'pandas.core.frame.DataFrame'>
    891
    (printed chunk: the whole train.csv table with its 12 columns PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked; pandas abbreviates the printout to rows 0-29 and 861-890)
    [891 rows x 12 columns]

Because train.csv has only 891 rows, a chunksize of 1000 yields a single chunk.

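[Think] What is chunked reading, and why read in chunks?
Chunked reading splits the file into pieces and processes chunksize rows at a time. pd.read_csv(..., chunksize=...) returns a TextFileReader object; iterating over it yields one DataFrame per chunk, so per-chunk statistics can be computed and then merged.
It is useful when the file is too large to handle at once, when only part of the data is needed, or when the data has to be processed piece by piece.
[Hint] What type is chunker (the chunk reader)? What does each piece look like when you print it in a for loop?
chunker itself is a TextFileReader; each piece it yields is a DataFrame.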
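The idea of merging per-chunk statistics can be sketched as follows. The ten-row table is generated in memory purely for illustration, and the chunk size of 4 is chosen so that several chunks actually appear.

```python
import io
import pandas as pd

# Illustrative 10-row table built in memory (not train.csv)
rows = "\n".join(f"{i},{i % 2}" for i in range(1, 11))
csv_text = "PassengerId,Survived\n" + rows + "\n"

chunker = pd.read_csv(io.StringIO(csv_text), chunksize=4)
reader_type = type(chunker).__name__   # the reader object, not a DataFrame
print(reader_type)

sizes = []
survived_total = 0
for piece in chunker:                  # each piece is a DataFrame of up to 4 rows
    sizes.append(len(piece))
    survived_total += int(piece["Survived"].sum())

print(sizes)            # rows per chunk
print(survived_total)   # statistic accumulated across chunks
```

Only one chunk is held in memory at a time, which is the point of the technique for files that do not fit in memory.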

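1.1.4 Task 4: Change the header to Chinese and use the passenger ID as the index [for English-language material, translating the labels can make the data more intuitive to read]

PassengerId => 乘客ID (passenger ID)
Survived => 是否幸存 (survived or not)
Pclass => 乘客等級 (passenger class: 1st/2nd/3rd)
Name => 乘客姓名 (passenger name)
Sex => 性別 (sex)
Age => 年齡 (age)
SibSp => 兄弟姐妹個數 (number of siblings/spouses aboard)
Parch => 父母與小孩個數 (number of parents/children aboard)
Ticket => 船票信息 (ticket information)
Fare => 票價 (fare)
Cabin => 客艙 (cabin)
Embarked => 登船港口 (port of embarkation)

The code below uses slightly different labels for some columns (倉位等級 for Pclass, 姓名 for Name, 父母子女個數 for Parch).

    # your code here
    train_data = pd.read_csv('train.csv',
                             names=['乘客ID', '是否幸存', '倉位等級', '姓名', '性別', '年齡', '兄弟姐妹個數', '父母子女個數', '船票信息', '票價', '客艙', '登船港口'],
                             index_col='乘客ID', header=0)
    train_data.head(3)

    乘客ID | 是否幸存 | 倉位等級 | 姓名 | 性別 | 年齡 | 兄弟姐妹個數 | 父母子女個數 | 船票信息 | 票價 | 客艙 | 登船港口
    1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
    2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
    3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S

[Think] One way to turn the header into Chinese is to replace the English column names while reading the file. Are there other ways?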
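One possible answer is to load the file with its original header and rename afterwards. The sketch below assumes a tiny three-column table built in memory for illustration; the mapping dictionary covers only those three columns.

```python
import io
import pandas as pd

# Illustrative three-column excerpt, not the full train.csv
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
df = pd.read_csv(io.StringIO(csv_text))

# Map each English column name to its Chinese label
mapping = {"PassengerId": "乘客ID", "Survived": "是否幸存", "Pclass": "倉位等級"}

# rename() returns a new DataFrame; set_index() then makes 乘客ID the index
renamed = df.rename(columns=mapping).set_index("乘客ID")
print(renamed)
```

Compared with passing names= to read_csv, a mapping does not depend on column order and leaves unlisted columns untouched.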

1.2 A first look

After importing the data, you will usually want an overview of its overall structure and some sample rows: how large it is, how many columns it has, what type each column is, whether it contains nulls, and so on.

1.2.1 Task 1: View the basic information about the data

    # your code here
    train_data.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 891 entries, 1 to 891
    Data columns (total 11 columns):
    是否幸存      891 non-null int64
    倉位等級      891 non-null int64
    姓名        891 non-null object
    性別        891 non-null object
    年齡        714 non-null float64
    兄弟姐妹個數    891 non-null int64
    父母子女個數    891 non-null int64
    船票信息      891 non-null object
    票價        891 non-null float64
    客艙        204 non-null object
    登船港口      889 non-null object
    dtypes: float64(2), int64(4), object(5)
    memory usage: 83.5+ KB

[Hint] Several functions can do this; try to summarize them.

    train_data.describe()

     | 是否幸存 | 倉位等級 | 年齡 | 兄弟姐妹個數 | 父母子女個數 | 票價
    count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000
    mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208
    std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429
    min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000
    25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400
    50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200
    75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000
    max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200
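As a partial summary of the functions the hint refers to, here is a minimal sketch on a three-row frame invented for illustration. It shows shape, dtypes and count() next to info() and describe().

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with one missing value
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan],
    "Sex": ["male", "female", "female"],
    "Survived": [0, 1, 1],
})

print(df.shape)      # (rows, columns)
print(df.dtypes)     # type of each column
print(df.count())    # non-null values per column
df.info()            # index, columns, non-null counts, dtypes, memory in one report
print(df.describe(include="all"))  # statistics for numeric and non-numeric columns
```

info() is the quickest overall view, while shape, dtypes and count() return values that can be used in later code.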

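1.2.2 Task 2: Look at the first 10 rows and the last 15 rows of the table

    # your code here
    train_data.head(10)

    乘客ID | 是否幸存 | 倉位等級 | 姓名 | 性別 | 年齡 | 兄弟姐妹個數 | 父母子女個數 | 船票信息 | 票價 | 客艙 | 登船港口
    1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
    2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
    3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
    4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
    5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S
    6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q
    7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S
    8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S
    9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S
    10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C

    # your code here
    train_data.tail(15)

    乘客ID | 是否幸存 | 倉位等級 | 姓名 | 性別 | 年齡 | 兄弟姐妹個數 | 父母子女個數 | 船票信息 | 票價 | 客艙 | 登船港口
    877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S
    878 | 0 | 3 | Petroff, Mr. Nedelio | male | 19.0 | 0 | 0 | 349212 | 7.8958 | NaN | S
    879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S
    880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C
    881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | NaN | S
    882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S
    883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S
    884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S
    885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S
    886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q
    887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S
    888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S
    889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S
    890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C
    891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q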

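1.2.3 Task 3: Check whether the data is null, returning True where a value is missing and False elsewhere

    # your code here
    train_data.isnull().head()

    乘客ID | 是否幸存 | 倉位等級 | 姓名 | 性別 | 年齡 | 兄弟姐妹個數 | 父母子女個數 | 船票信息 | 票價 | 客艙 | 登船港口
    1 | False | False | False | False | False | False | False | False | False | True | False
    2 | False | False | False | False | False | False | False | False | False | False | False
    3 | False | False | False | False | False | False | False | False | False | True | False
    4 | False | False | False | False | False | False | False | False | False | False | False
    5 | False | False | False | False | False | False | False | False | False | True | False

[Summary] All of the operations above are ways of observing the data itself.

[Think] From what other angles can a dataset be examined? Look for answers; they will help a lot in the analysis that follows.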
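A few further angles, sketched under the assumption that counting is more useful than a True/False grid: missing values per column, the share of missing values, frequency counts, and the number of distinct values. The five Cabin and Embarked values are the first five rows shown earlier in this section.

```python
import numpy as np
import pandas as pd

# Cabin and Embarked for the first five passengers shown above
df = pd.DataFrame({
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})

missing_per_column = df.isnull().sum()     # number of missing values per column
missing_share = df.isnull().mean()         # fraction of missing values per column
port_counts = df["Embarked"].value_counts()  # how often each port occurs
distinct = df.nunique()                    # distinct non-null values per column

print(missing_per_column)
print(missing_share)
print(port_counts)
print(distinct)
```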

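1.3 Saving the data

1.3.1 Task 1: Save the data you loaded and modified as a new file, train_Chinese.csv, in the working directory

    # your code here
    # Note: depending on the operating system, the saved file may show garbled characters.
    # Try passing encoding='GBK' or encoding='utf-8'.
    train_data.to_csv('train_Chinese.csv', encoding='utf-8')

[Summary] That covers loading data and getting started. Next we turn to computing on the data itself, focusing on how numpy and pandas are used in real work and projects.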
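A round trip is an easy way to check that the chosen encoding preserves Chinese labels. The sketch below writes to a temporary directory with a hypothetical file name; 'utf-8-sig' is an assumption on my part (it is UTF-8 with a byte-order mark, which some spreadsheet programs use to detect the encoding), and plain 'utf-8' works the same way for pandas itself.

```python
import os
import tempfile
import pandas as pd

# Two-row illustrative frame with Chinese labels
df = pd.DataFrame({"是否幸存": [0, 1]}, index=pd.Index([1, 2], name="乘客ID"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_Chinese_demo.csv")  # hypothetical file name
    df.to_csv(path, encoding="utf-8-sig")               # the index is written as the first column
    restored = pd.read_csv(path, encoding="utf-8-sig", index_col="乘客ID")

print(restored)
```

Reading with the same encoding that was used for writing, and naming the index column on load, gives back the same table.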

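1 Chapter 1: Loading the data and a first look

1.4 Know what your data is called

We are learning the basic pandas operations. So what is the data type of the data we loaded with pandas in the previous section?

Before starting, import numpy and pandas

    import numpy as np
    import pandas as pd

1.4.1 Task 1: pandas has two data types, DataFrame and Series. Look them up briefly, then write a small example of each 🌰 [open-ended]

https://www.cnblogs.com/lavender1221/p/12664641.html#
Pandas is built around three core data structures: Series, DataFrame and Index. Most operations revolve around these three.

A Series is a one-dimensional array object that holds a sequence of values and a matching sequence of index labels. A NumPy one-dimensional array accesses elements through an implicit integer index, whereas a Series associates each element with an explicitly defined index. The explicit index makes a Series more powerful: labels no longer have to be integers (they can be strings, for example), they do not have to be contiguous, and they may repeat.

A DataFrame is the central pandas data structure. It represents a two-dimensional table, similar to a table in a relational database, in which each column can hold a different value type (numbers, strings, booleans and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dictionary of Series that share the same index.

There are many ways to create a DataFrame; the most common is from a dictionary of equal-length lists or NumPy arrays. The columns and index attributes of a DataFrame show its column labels and row labels.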

    # your code here
    sdata_1 = [7, -2, 567, 8]
    example_1 = pd.Series(sdata_1, index=['a', 'b', 'c', 'd'])
    example_1

    a      7
    b     -2
    c    567
    d      8
    dtype: int64

    sdata_2 = {'a': 7, 'b': -2, 'c': 567, 'd': 8}
    example_2 = pd.Series(sdata_2)
    example_2

    a      7
    b     -2
    c    567
    d      8
    dtype: int64

    sdata_3 = {'city': ['nanjing', 'wuxi', 'wuhan', 'changsha'], 'code': ['001', '002', '003', '004']}
    example_3 = pd.DataFrame(sdata_3)
    example_3

     | city | code
    0 | nanjing | 001
    1 | wuxi | 002
    2 | wuhan | 003
    3 | changsha | 004

    '''
    # the course's example
    sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
    example_1 = pd.Series(sdata)
    example_1
    '''
    '''
    # the course's example
    data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002, 2003],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
    example_2 = pd.DataFrame(data)
    example_2
    '''

1.4.2 Task 2: Load the "train.csv" file using the method from the previous section

    # your code here
    train_chinese = pd.read_csv('train_Chinese.csv')
    train_chinese.head()
    train_data = pd.read_csv('train.csv')

You can also load the "train_Chinese.csv" file saved in the previous section. Having used the translated train_Chinese.csv to get familiar with the dataset, we now work on train.csv.

1.4.3 Task 3: View the name of each column of the DataFrame

    # your code here
    train_chinese.columns

    Index(['乘客ID', '是否幸存', '倉位等級', '姓名', '性別', '年齡', '兄弟姐妹個數', '父母子女個數', '船票信息', '票價', '客艙', '登船港口'], dtype='object')

    train_data.columns

    Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')

    train_data.head()

     | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
    0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
    1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
    2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
    3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
    4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S

1.4.4 Task 4: View the values of the "Cabin" column [there are several ways]

Both forms below select the same column; head() limits the display to the first five values.

    # your code here
    train_data['Cabin'].head()

    0     NaN
    1     C85
    2     NaN
    3    C123
    4     NaN
    Name: Cabin, dtype: object

    # your code here
    train_data.Cabin.head()

    0     NaN
    1     C85
    2     NaN
    3    C123
    4     NaN
    Name: Cabin, dtype: object

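1.4.5 Task 5: Load "test_1.csv", compare it with "train.csv" to see which columns are extra, and delete the extra column

On inspection, the test file test_1.csv has a redundant column, a, that train.csv does not have, and it needs to be removed. (It also carries an Unnamed: 0 column holding the old row index, which the code below leaves in place.)

    # your code here
    test_data = pd.read_csv('test_1.csv')
    test_data.head()

     | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | a
    0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 100
    1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 100
    2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 100
    3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 100
    4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 100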
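    # your code here
    test_data.pop('a').head()  # pop() removes column 'a' in place and returns it as a Series
    test_data

     | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
    0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
    1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
    2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
    3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
    4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S
    5 | 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q
    6 | 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S
    7 | 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S
    8 | 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S
    9 | 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C
    10 | 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S
    11 | 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S
    12 | 12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S
    13 | 13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S
    14 | 14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S
    15 | 15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S
    16 | 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q
    17 | 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S
    18 | 18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | NaN | S
    19 | 19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C
    20 | 20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.0000 | NaN | S
    21 | 21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0000 | D56 | S
    22 | 22 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.0 | 0 | 0 | 330923 | 8.0292 | NaN | Q
    23 | 23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | A6 | S
    24 | 24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.0750 | NaN | S
    25 | 25 | 26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... | female | 38.0 | 1 | 5 | 347077 | 31.3875 | NaN | S
    26 | 26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C
    27 | 27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S
    28 | 28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q
    29 | 29 | 30 | 0 | 3 | Todoroff, Mr. Lalio | male | NaN | 0 | 0 | 349216 | 7.8958 | NaN | S
    ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
    861 | 861 | 862 | 0 | 2 | Giles, Mr. Frederick Edward | male | 21.0 | 1 | 0 | 28134 | 11.5000 | NaN | S
    862 | 862 | 863 | 1 | 1 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 0 | 0 | 17466 | 25.9292 | D17 | S
    863 | 863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S
    864 | 864 | 865 | 0 | 2 | Gill, Mr. John William | male | 24.0 | 0 | 0 | 233866 | 13.0000 | NaN | S
    865 | 865 | 866 | 1 | 2 | Bystrom, Mrs. (Karolina) | female | 42.0 | 0 | 0 | 236852 | 13.0000 | NaN | S
    866 | 866 | 867 | 1 | 2 | Duran y More, Miss. Asuncion | female | 27.0 | 1 | 0 | SC/PARIS 2149 | 13.8583 | NaN | C
    867 | 867 | 868 | 0 | 1 | Roebling, Mr. Washington Augustus II | male | 31.0 | 0 | 0 | PC 17590 | 50.4958 | A24 | S
    868 | 868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S
    869 | 869 | 870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4.0 | 1 | 1 | 347742 | 11.1333 | NaN | S
    870 | 870 | 871 | 0 | 3 | Balkic, Mr. Cerin | male | 26.0 | 0 | 0 | 349248 | 7.8958 | NaN | S
    871 | 871 | 872 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 1 | 1 | 11751 | 52.5542 | D35 | S
    872 | 872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S
    873 | 873 | 874 | 0 | 3 | Vander Cruyssen, Mr. Victor | male | 47.0 | 0 | 0 | 345765 | 9.0000 | NaN | S
    874 | 874 | 875 | 1 | 2 | Abelson, Mrs. Samuel (Hannah Wizosky) | female | 28.0 | 1 | 0 | P/PP 3381 | 24.0000 | NaN | C
    875 | 875 | 876 | 1 | 3 | Najib, Miss. Adele Kiamie "Jane" | female | 15.0 | 0 | 0 | 2667 | 7.2250 | NaN | C
    876 | 876 | 877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S
    877 | 877 | 878 | 0 | 3 | Petroff, Mr. Nedelio | male | 19.0 | 0 | 0 | 349212 | 7.8958 | NaN | S
    878 | 878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S
    879 | 879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C
    880 | 880 | 881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | NaN | S
    881 | 881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S
    882 | 882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S
    883 | 883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S
    884 | 884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S
    885 | 885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q
    886 | 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S
    887 | 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S
    888 | 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S
    889 | 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C
    890 | 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q

    891 rows × 13 columns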

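[Think] Are there other ways to delete the extra column?

    # answer: del also removes a column in place
    # (a column can only be removed once, so run this instead of pop('a'), or on a freshly loaded test_data)
    del test_data['a']
    test_data.head()

     | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
    0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
    1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
    2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
    3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
    4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S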

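1.4.6 Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and look only at the remaining columns

    # your code here
    test_data.drop(['PassengerId', 'Name', 'Age', 'Ticket'], axis=1).head()

     | Unnamed: 0 | Survived | Pclass | Sex | SibSp | Parch | Fare | Cabin | Embarked
    0 | 0 | 0 | 3 | male | 1 | 0 | 7.2500 | NaN | S
    1 | 1 | 1 | 1 | female | 1 | 0 | 71.2833 | C85 | C
    2 | 2 | 1 | 3 | female | 0 | 0 | 7.9250 | NaN | S
    3 | 3 | 1 | 1 | female | 1 | 0 | 53.1000 | C123 | S
    4 | 4 | 0 | 3 | male | 0 | 0 | 8.0500 | NaN | S

[Think] Compare Task 5 and Task 6: they used different methods (functions). How could the same function meet both requirements?

[Answer]

Both tasks can be done with drop(). By default drop() returns a new DataFrame and leaves the original untouched, which is all that is needed to hide columns for display (Task 6). To really delete a column from the original data (Task 5), pass inplace=True or assign the result back. Because inplace=True overwrites the original data, it was not used here.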
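The difference between hiding and deleting with drop() can be sketched as follows, on a two-row frame invented for illustration (the column a mimics the redundant column in test_1.csv).

```python
import pandas as pd

# Illustrative frame; 'a' stands in for the redundant column
df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["Braund", "Cumings"],
    "Pclass": [3, 1],
    "a": [100, 100],
})

# Hiding: drop() returns a new frame, df itself keeps all four columns
view = df.drop(columns=["PassengerId", "Name"])
columns_after_hiding = list(df.columns)

# Deleting: inplace=True modifies df and returns None
df.drop(columns=["a"], inplace=True)
columns_after_deleting = list(df.columns)

print(list(view.columns))
print(columns_after_hiding)
print(columns_after_deleting)
```

Assigning the result back (df = df.drop(columns=["a"])) has the same effect as inplace=True and is often easier to read in a chain of operations.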

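1.5 The logic of filtering

One of the most important things to be able to do with tabular data is to filter it: pick out the information you need and discard what you do not.

As before, we learn this pandas feature by practice.

1.5.1 Task 1: Using "Age" as the filter condition, show the passengers younger than 10.

    # your code here
    test_data[test_data['Age'] < 10].head()

     | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
    7 | 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S
    10 | 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S
    16 | 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q
    24 | 24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.0750 | NaN | S
    43 | 43 | 44 | 1 | 2 | Laroche, Miss. Simonne Marie Anne Andree | female | 3.0 | 1 | 2 | SC/Paris 2123 | 41.5792 | NaN | C

1.5.2 Task 2: Using "Age" as the condition, show the passengers older than 10 and younger than 50, and name the result midage

    # your code here
    midage = test_data[(test_data['Age'] > 10) & (test_data['Age'] < 50)]
    midage.head()

     | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
    0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
    1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
    2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
    3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
    4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S

[Hint] Learn how conditional filtering works in pandas and how to combine conditions as an intersection (&) or a union (|).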
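The intersection and union operators can be tried on a small frame first. The ages below are invented for illustration and include one missing value, because comparisons with NaN are always False and that affects what a negated mask selects.

```python
import numpy as np
import pandas as pd

# Illustrative ages, including one missing value
ages = pd.DataFrame({"Age": [22.0, 2.0, 54.0, np.nan, 10.0, 49.0]})

# Each condition needs its own parentheses: & and | bind tighter than < and >
mask = (ages["Age"] > 10) & (ages["Age"] < 50)           # intersection
between = ages[mask]
outside = ages[(ages["Age"] <= 10) | (ages["Age"] >= 50)]  # union

# Negating the mask also keeps the NaN row, since NaN failed both comparisons
not_between = ages[~mask]

print(list(between.index))
print(list(outside.index))
print(list(not_between.index))
```

Python's own and / or cannot be used here, because they try to reduce a whole Series to a single True or False.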

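1.5.3 Task 3: Show the "Pclass" and "Sex" values of row 100 of midage

    # your code here
    midage = midage.reset_index()
    midage.head()

     | index | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
    0 | 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
    1 | 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
    2 | 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
    3 | 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
    4 | 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S

[Hint] When extracting data we want the relative order of the rows to stay the same. Which function achieves this?
reset_index() returns a new DataFrame or Series whose index is renumbered 0, 1, 2, ... The old index is kept as a column named index (pass drop=True to discard it), and the rows keep their relative order. After the reset, the label 100 refers to the row at position 100, counting from 0.

    midage.loc[[100], ['Pclass', 'Sex']]

     | Pclass | Sex
    100 | 2 | male

1.5.4 Task 4: Use loc to show the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage

    # your code here
    # passing lists of row and column labels returns a DataFrame (a table)
    midage.loc[[100, 105, 108], ['Pclass', 'Name', 'Sex']]

     | Pclass | Name | Sex
    100 | 2 | Byles, Rev. Thomas Roussel Davids | male
    105 | 3 | Cribb, Mr. John Hatfield | male
    108 | 3 | Calic, Mr. Jovo | male

1.5.5 Task 5: Use iloc to show the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage

    # your code here
    # iloc takes integer positions for both rows and columns; column names are not accepted
    midage.iloc[[100, 105, 108], [4, 5, 6]]

     | Pclass | Name | Sex
    100 | 2 | Byles, Rev. Thomas Roussel Davids | male
    105 | 3 | Cribb, Mr. John Hatfield | male
    108 | 3 | Calic, Mr. Jovo | male

[Think] Compare iloc and loc: how are they alike and how do they differ?
iloc selects by integer position (row and column numbers), while loc selects by index label and column name.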
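The distinction is easiest to see when the index labels are not 0, 1, 2. The sketch below uses an invented three-row frame with labels 10, 20 and 30.

```python
import pandas as pd

# Illustrative frame whose index labels differ from the row positions
df = pd.DataFrame(
    {"Pclass": [3, 1, 2], "Sex": ["male", "female", "male"]},
    index=[10, 20, 30],
)

by_label = df.loc[[20], ["Pclass", "Sex"]]   # label 20, columns by name
by_position = df.iloc[[1], [0, 1]]           # second row, first two columns
print(by_label.equals(by_position))

# After reset_index(drop=True) labels and positions coincide
reset = df.reset_index(drop=True)
print(reset.loc[1, "Sex"] == reset.iloc[1, 1])
```

This is why the tasks above call reset_index() before asking for "row 100": on the filtered frame, label 100 and position 100 would otherwise be different rows.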

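Review: earlier we covered the pandas basics, including how to read CSV data and how to add, delete, look up and modify it. Today we turn to exploratory data analysis: sorting with pandas, arithmetic, and the describe() function for descriptive statistics.

1 Chapter 1: Exploratory data analysis

Before starting, import numpy, pandas and the data

    # load the required libraries
    import numpy as np
    import pandas as pd
    # load the train_Chinese.csv saved earlier; this is the data used for the Titanic tasks
    train_data = pd.read_csv('train_Chinese.csv')

1.6 Do you know your data?

Reference: Python for Data Analysis, Chapter 5

1.6.1 Task 1: Sort sample data with pandas in ascending order

    # see Python for Data Analysis, Chapter 5, the section on sorting and ranking
    # build your own all-numeric DataFrame
    '''
    The course's example:
    pd.DataFrame(): creates a DataFrame object
    np.arange(8).reshape((2, 4)): builds a 2x4 two-dimensional array; first row 0, 1, 2, 3, second row 4, 5, 6, 7
    index=[2, 1]: the row labels of the DataFrame
    columns=['d', 'a', 'b', 'c']: the column labels of the DataFrame
    '''
    frame = pd.DataFrame(np.arange(8).reshape(2, 4), index=[2, 1], columns=['d', 'a', 'b', 'c'])
    frame

     | d | a | b | c
    2 | 0 | 1 | 2 | 3
    1 | 4 | 5 | 6 | 7

[Code walkthrough]

pd.DataFrame(): creates a DataFrame object

np.arange(8).reshape((2, 4)): builds a 2x4 two-dimensional array; first row 0, 1, 2, 3, second row 4, 5, 6, 7

index=[2, 1]: the row labels of the DataFrame

columns=['d', 'a', 'b', 'c']: the column labels of the DataFrame

[Question] Most of the time we want to sort by the values in a column, so sort the DataFrame you built by one of its columns, in ascending order

    # answer
    frame.sort_values(by='c', ascending=True)

     | d | a | b | c
    2 | 0 | 1 | 2 | 3
    1 | 4 | 5 | 6 | 7

[Think] Using the book, can you name other ways pandas can sort a DataFrame?
sort_index() sorts by the index; with axis=1 it sorts the column labels.

    frame.sort_index()

     | d | a | b | c
    1 | 4 | 5 | 6 | 7
    2 | 0 | 1 | 2 | 3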
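[Summary] The different ways of sorting are summarized below

1. Sort the row index in ascending order

    # code
    frame.sort_index()

     | d | a | b | c
    1 | 4 | 5 | 6 | 7
    2 | 0 | 1 | 2 | 3

2. Sort the column index in ascending order

    # code
    frame.sort_index(axis=1)

     | a | b | c | d
    2 | 1 | 2 | 3 | 0
    1 | 5 | 6 | 7 | 4

3. Sort the column index in descending order

    # code
    frame.sort_index(axis=1, ascending=False)

     | d | c | b | a
    2 | 0 | 3 | 2 | 1
    1 | 4 | 7 | 6 | 5

4. Sort by any two columns at once, in descending order

    # code
    frame.sort_values(['a', 'c'], ascending=False)

     | d | a | b | c
    1 | 4 | 5 | 6 | 7
    2 | 0 | 1 | 2 | 3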
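Two further options of sort_values that the summary does not cover are sketched here on an invented four-row frame: a separate direction per column, and control over where missing values end up.

```python
import numpy as np
import pandas as pd

# Illustrative fares and ages, with a tie in Fare and one missing Age
df = pd.DataFrame({
    "Fare": [7.25, 71.2833, 7.25, 53.1],
    "Age": [22.0, 38.0, 35.0, np.nan],
})

# Fare descending, and within equal fares Age ascending
ordered = df.sort_values(["Fare", "Age"], ascending=[False, True])

# Missing values go last by default; na_position='first' moves them to the top
age_first = df.sort_values("Age", na_position="first")

print(list(ordered.index))
print(list(age_first.index))
```

The second key only matters where the first key ties, which is why rows 0 and 2 (both with fare 7.25) are the ones ordered by age.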

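1.6.2 Task 2: Sort the Titanic data (train.csv) by fare and age together (descending). What can you learn from the result?

    '''
    train_Chinese.csv was loaded at the start, and the loading process was covered earlier,
    so we can sort the target columns directly.
    head(20): shows the first 20 rows
    '''
    train_data.head(20)

     | 乘客ID | 是否幸存 | 倉位等級 | 姓名 | 性別 | 年齡 | 兄弟姐妹個數 | 父母子女個數 | 船票信息 | 票價 | 客艙 | 登船港口
    0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
    1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
    2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
    3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S
    4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S
    5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q
    6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S
    7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S
    8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S
    9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C
    10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S
    11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S
    12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S
    13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S
    14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S
    15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S
    16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q
    17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S
    18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | NaN | S
    19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C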
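    # code
    train_data.sort_values(['票價', '年齡'], ascending=False)

     | 乘客ID | 是否幸存 | 倉位等級 | 姓名 | 性別 | 年齡 | 兄弟姐妹個數 | 父母子女個數 | 船票信息 | 票價 | 客艙 | 登船港口
    679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.00 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C
    258 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35.00 | 0 | 0 | PC 17755 | 512.3292 | NaN | C
    737 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.00 | 0 | 0 | PC 17755 | 512.3292 | B101 | C
    438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.00 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S
    341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S
    88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S
    27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S
    742 | 743 | 1 | 1 | Ryerson, Miss. Susan Parker "Suzette" | female | 21.00 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C
    311 | 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18.00 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C
    299 | 300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.00 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C
    118 | 119 | 0 | 1 | Baxter, Mr. Quigg Edmond | male | 24.00 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C
    380 | 381 | 1 | 1 | Bidois, Miss. Rosalie | female | 42.00 | 0 | 0 | PC 17757 | 227.5250 | NaN | C
    716 | 717 | 1 | 1 | Endres, Miss. Caroline Louise | female | 38.00 | 0 | 0 | PC 17757 | 227.5250 | C45 | C
    700 | 701 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18.00 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C
    557 | 558 | 0 | 1 | Robbins, Mr. Victor | male | NaN | 0 | 0 | PC 17757 | 227.5250 | NaN | C
    527 | 528 | 0 | 1 | Farthing, Mr. John | male | NaN | 0 | 0 | PC 17483 | 221.7792 | C95 | S
    377 | 378 | 0 | 1 | Widener, Mr. Harry Elkins | male | 27.00 | 0 | 2 | 113503 | 211.5000 | C82 | C
    779 | 780 | 1 | 1 | Robert, Mrs. Edward Scott (Elisabeth Walton Mc... | female | 43.00 | 0 | 1 | 24160 | 211.3375 | B3 | S
    730 | 731 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.00 | 0 | 0 | 24160 | 211.3375 | B5 | S
    689 | 690 | 1 | 1 | Madill, Miss. Georgette Alexandra | female | 15.00 | 0 | 1 | 24160 | 211.3375 | B5 | S
    856 | 857 | 1 | 1 | Wick, Mrs. George Dennick (Mary Hitchcock) | female | 45.00 | 1 | 1 | 36928 | 164.8667 | NaN | S
    318 | 319 | 1 | 1 | Wick, Miss. Mary Natalie | female | 31.00 | 0 | 2 | 36928 | 164.8667 | C7 | S
    268 | 269 | 1 | 1 | Graham, Mrs. William Thompson (Edith Junkins) | female | 58.00 | 0 | 1 | PC 17582 | 153.4625 | C125 | S
    609 | 610 | 1 | 1 | Shutes, Miss. Elizabeth W | female | 40.00 | 0 | 0 | PC 17582 | 153.4625 | C125 | S
    332 | 333 | 0 | 1 | Graham, Mr. George Edward | male | 38.00 | 0 | 1 | PC 17582 | 153.4625 | C91 | S
    498 | 499 | 0 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S
    708 | 709 | 1 | 1 | Cleaver, Miss. Alice | female | 22.00 | 0 | 0 | 113781 | 151.5500 | NaN | S
    297 | 298 | 0 | 1 | Allison, Miss. Helen Loraine | female | 2.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S
    305 | 306 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S
    195 | 196 | 1 | 1 | Lurette, Miss. Elise | female | 58.00 | 0 | 0 | PC 17569 | 146.5208 | B80 | C
    ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
    611 | 612 | 0 | 3 | Jardin, Mr. Jose Neto | male | NaN | 0 | 0 | SOTON/O.Q. 3101305 | 7.0500 | NaN | S
    477 | 478 | 0 | 3 | Braund, Mr. Lewis Richard | male | 29.00 | 1 | 0 | 3460 | 7.0458 | NaN | S
    129 | 130 | 0 | 3 | Ekstrom, Mr. Johan | male | 45.00 | 0 | 0 | 347061 | 6.9750 | NaN | S
    804 | 805 | 1 | 3 | Hedman, Mr. Oskar Arvid | male | 27.00 | 0 | 0 | 347089 | 6.9750 | NaN | S
    825 | 826 | 0 | 3 | Flynn, Mr. John | male | NaN | 0 | 0 | 368323 | 6.9500 | NaN | Q
    411 | 412 | 0 | 3 | Hart, Mr. Henry | male | NaN | 0 | 0 | 394140 | 6.8583 | NaN | Q
    143 | 144 | 0 | 3 | Burke, Mr. Jeremiah | male | 19.00 | 0 | 0 | 365222 | 6.7500 | NaN | Q
    654 | 655 | 0 | 3 | Hegarty, Miss. Hanora "Nora" | female | 18.00 | 0 | 0 | 365226 | 6.7500 | NaN | Q
    202 | 203 | 0 | 3 | Johanson, Mr. Jakob Alfred | male | 34.00 | 0 | 0 | 3101264 | 6.4958 | NaN | S
    371 | 372 | 0 | 3 | Wiklund, Mr. Jakob Alfred | male | 18.00 | 1 | 0 | 3101267 | 6.4958 | NaN | S
    818 | 819 | 0 | 3 | Holm, Mr. John Fredrik Alexander | male | 43.00 | 0 | 0 | C 7075 | 6.4500 | NaN | S
    843 | 844 | 0 | 3 | Lemberopolous, Mr. Peter L | male | 34.50 | 0 | 0 | 2683 | 6.4375 | NaN | C
    326 | 327 | 0 | 3 | Nysveen, Mr. Johan Hansen | male | 61.00 | 0 | 0 | 345364 | 6.2375 | NaN | S
    872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.00 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S
    378 | 379 | 0 | 3 | Betros, Mr. Tannous | male | 20.00 | 0 | 0 | 2648 | 4.0125 | NaN | C
    597 | 598 | 0 | 3 | Johnson, Mr. Alfred | male | 49.00 | 0 | 0 | LINE | 0.0000 | NaN | S
    263 | 264 | 0 | 1 | Harrison, Mr. William | male | 40.00 | 0 | 0 | 112059 | 0.0000 | B94 | S
    806 | 807 | 0 | 1 | Andrews, Mr. Thomas Jr | male | 39.00 | 0 | 0 | 112050 | 0.0000 | A36 | S
    822 | 823 | 0 | 1 | Reuchlin, Jonkheer. John George | male | 38.00 | 0 | 0 | 19972 | 0.0000 | NaN | S
    179 | 180 | 0 | 3 | Leonard, Mr. Lionel | male | 36.00 | 0 | 0 | LINE | 0.0000 | NaN | S
    271 | 272 | 1 | 3 | Tornquist, Mr. William Henry | male | 25.00 | 0 | 0 | LINE | 0.0000 | NaN | S
    302 | 303 | 0 | 3 | Johnson, Mr. William Cahoone Jr | male | 19.00 | 0 | 0 | LINE | 0.0000 | NaN | S
    277 | 278 | 0 | 2 | Parkes, Mr. Francis "Frank" | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S
    413 | 414 | 0 | 2 | Cunningham, Mr. Alfred Fleming | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S
    466 | 467 | 0 | 2 | Campbell, Mr. William | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S
    481 | 482 | 0 | 2 | Frost, Mr. Anthony Wood "Archie" | male | NaN | 0 | 0 | 239854 | 0.0000 | NaN | S
    633 | 634 | 0 | 1 | Parr, Mr. William Henry Marsh | male | NaN | 0 | 0 | 112052 | 0.0000 | NaN | S
    674 | 675 | 0 | 2 | Watson, Mr. Ennis Hastings | male | NaN | 0 | 0 | 239856 | 0.0000 | NaN | S
    732 | 733 | 0 | 2 | Knight, Mr. Robert J | male | NaN | 0 | 0 | 239855 | 0.0000 | NaN | S
    815 | 816 | 0 | 1 | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0000 | B102 | S

    891 rows × 12 columns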

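[Think] After sorting, suppose we look only at the age and fare columns. Common sense says that a higher fare should mean a better cabin, and the pattern is easy to see: 14 of the 20 passengers with the highest fares survived, which is a high proportion. Could we go on to analyze the relationship between fare and survival, and between age and survival? Once you start noticing relationships in the data, the data analysis has begun.

Of course, this is only my view. You may have more ideas; write them in your study notes.
For example: the relationship between survival and sex.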
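The "14 of the top 20" observation can be checked with a couple of lines. The sketch below types in the 是否幸存 values of the first 20 rows of the sorted table, in the order shown, and compares their mean with the overall mean of 0.383838 reported by describe() earlier.

```python
import pandas as pd

# 是否幸存 for the 20 highest fares, copied from the sorted table in order
top20_survived = pd.Series(
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
    name="是否幸存",
)

survivors = int(top20_survived.sum())   # number of survivors among the top 20
top20_rate = top20_survived.mean()      # their survival rate
overall_rate = 0.383838                 # mean of 是否幸存 from train_data.describe()

print(survivors, top20_rate, overall_rate)

# With the full data loaded, the same figure would come from something like:
# train_data.sort_values(['票價', '年齡'], ascending=False).head(20)['是否幸存'].mean()
```

Because 是否幸存 is coded 0/1, its mean is the survival rate, so any group's rate can be compared with the overall one in the same way.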

Try sorting by a few more columns

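    # code
    train_data.sort_values(['兄弟姐妹個數', '父母子女個數', '性別'], ascending=False).head(20)

     | 乘客ID | 是否幸存 | 倉位等級 | 姓名 | 性別 | 年齡 | 兄弟姐妹個數 | 父母子女個數 | 船票信息 | 票價 | 客艙 | 登船港口
    159 | 160 | 0 | 3 | Sage, Master. Thomas Henry | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S
    201 | 202 | 0 | 3 | Sage, Mr. Frederick | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S
    324 | 325 | 0 | 3 | Sage, Mr. George John Jr | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S
    846 | 847 | 0 | 3 | Sage, Mr. Douglas Bullen | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S
    180 | 181 | 0 | 3 | Sage, Miss. Constance Gladys | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S
    792 | 793 | 0 | 3 | Sage, Miss. Stella Anna | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S
    863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S
    59 | 60 | 0 | 3 | Goodwin, Master. William Frederick | male | 11.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S
    386 | 387 | 0 | 3 | Goodwin, Master. Sidney Leonard | male | 1.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S
    480 | 481 | 0 | 3 | Goodwin, Master. Harold Victor | male | 9.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S
    683 | 684 | 0 | 3 | Goodwin, Mr. Charles Edward | male | 14.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S
    71 | 72 | 0 | 3 | Goodwin, Miss. Lillian Amy | female | 16.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S
    182 | 183 | 0 | 3 | Asplund, Master. Clarence Gustaf Hugo | male | 9.0 | 4 | 2 | 347077 | 31.3875 | NaN | S
    261 | 262 | 1 | 3 | Asplund, Master. Edvin Rojj Felix | male | 3.0 | 4 | 2 | 347077 | 31.3875 | NaN | S
    850 | 851 | 0 | 3 | Andersson, Master. Sigvard Harald Elias | male | 4.0 | 4 | 2 | 347082 | 31.2750 | NaN | S
    68 | 69 | 1 | 3 | Andersson, Miss. Erna Alexandra | female | 17.0 | 4 | 2 | 3101281 | 7.9250 | NaN | S
    119 | 120 | 0 | 3 | Andersson, Miss. Ellis Anna Maria | female | 2.0 | 4 | 2 | 347082 | 31.2750 | NaN | S
    233 | 234 | 1 | 3 | Asplund, Miss. Lillian Gertrud | female | 5.0 | 4 | 2 | 347077 | 31.3875 | NaN | S
    541 | 542 | 0 | 3 | Andersson, Miss. Ingeborg Constanzia | female | 9.0 | 4 | 2 | 347082 | 31.2750 | NaN | S
    542 | 543 | 0 | 3 | Andersson, Miss. Sigrid Elisabeth | female | 11.0 | 4 | 2 | 347082 | 31.2750 | NaN | S

    # write down your thoughts
    # passengers with more siblings aboard seem to have a lower survival rate, and men may have survived less often than women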

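1.6.3 Task 3: Do arithmetic with pandas: compute the sum of two DataFrames

    # see Python for Data Analysis, Chapter 5, the section on arithmetic and data alignment
    # build two all-numeric DataFrames of your own
    """
    The course's example:
    frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3), columns=['a', 'b', 'c'], index=['one', 'two', 'three'])
    frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=['a', 'e', 'c'], index=['first', 'one', 'two', 'second'])
    frame1_a
    """
    # code
    frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3), columns=['a', 'b', 'c'], index=['one', 'two', 'three'])
    frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3), columns=['a', 'e', 'c'], index=['first', 'one', 'two', 'second'])

Add frame1_a and frame1_b

    # code
    frame1_a

     | a | b | c
    one | 0.0 | 1.0 | 2.0
    two | 3.0 | 4.0 | 5.0
    three | 6.0 | 7.0 | 8.0

[Reminder] Adding two DataFrames returns a new DataFrame. Values whose row and column labels match are added together; positions that have no counterpart in the other frame become the missing value NaN.

DataFrame supports many other arithmetic operations as well, such as subtraction and division. If you are interested, see the section on arithmetic and data alignment in Chapter 5 of Python for Data Analysis, and look for further material online.

    frame1_b

     | a | e | c
    first | 0.0 | 1.0 | 2.0
    one | 3.0 | 4.0 | 5.0
    two | 6.0 | 7.0 | 8.0
    second | 9.0 | 10.0 | 11.0

    frame1_a + frame1_b

     | a | b | c | e
    first | NaN | NaN | NaN | NaN
    one | 3.0 | NaN | 7.0 | NaN
    second | NaN | NaN | NaN | NaN
    three | NaN | NaN | NaN | NaN
    two | 9.0 | NaN | 13.0 | NaN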
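If the NaN cells are not wanted, the add() method with fill_value is one option. The sketch below reuses the same two frames; fill_value=0 treats a value missing from one frame as 0, while cells missing from both frames remain NaN.

```python
import numpy as np
import pandas as pd

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'], index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'], index=['first', 'one', 'two', 'second'])

plain = frame1_a + frame1_b                    # NaN wherever either side is missing
filled = frame1_a.add(frame1_b, fill_value=0)  # a missing side counts as 0

print(filled)
print(int(plain.isna().sum().sum()), int(filled.isna().sum().sum()))
```

Whether substituting 0 is appropriate depends on what a missing label means in the data; it is a choice, not a default to apply blindly.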

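1.6.4 Task 4: Using the Titanic data, how can we work out how many people were in the largest family on board?

    '''
    Still using the train_Chinese.csv loaded earlier.
    To see how large the largest family on board was ('兄弟姐妹個數' + '父母子女個數'), what should we do?
    '''
    max(train_data['兄弟姐妹個數'] + train_data['父母子女個數'])

    10

[Reminder] We only need to find the largest sum of 兄弟姐妹個數 and 父母子女個數. Note that this sum counts a passenger's relatives aboard and does not include the passenger. You can of course come up with other methods and angles; feel free to share your view.

Try adding a few more columns together and see what you can learn.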
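A small extension of the same idea is to find who the maximum belongs to and to count the passenger as part of the family. The three rows below are taken from the tables shown earlier (two of the largest families and the first passenger); the family-size column is an illustrative addition.

```python
import pandas as pd

# Three passengers copied from the tables above
df = pd.DataFrame({
    "姓名": ["Sage, Master. Thomas Henry",
             "Goodwin, Master. William Frederick",
             "Braund, Mr. Owen Harris"],
    "兄弟姐妹個數": [8, 5, 1],
    "父母子女個數": [2, 2, 0],
})

relatives = df["兄弟姐妹個數"] + df["父母子女個數"]  # relatives aboard, passenger excluded
largest = int(relatives.max())
who = df.loc[relatives.idxmax(), "姓名"]              # row holding the maximum
family_size = relatives + 1                           # include the passenger

print(largest, who, int(family_size.max()))
```

Series.max() does the same job as the built-in max() used above, and idxmax() adds the row label of the maximum.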

1.6.5 Task 5: Learn to use the pandas describe() function to view basic statistics

    # (1) work through the key example once (simple data)
    # see Python for Data Analysis, Chapter 5, the section on summarizing and computing descriptive statistics
    # build your own DataFrame with numbers and missing values
    """
    The course's example:
    frame2 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                          index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
    frame2
    """
    # code
    frame2 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                          index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
    frame2

     | one | two
    a | 1.40 | NaN
    b | 7.10 | -4.5
    c | NaN | NaN
    d | 0.75 | -1.3

Call describe() and look at the basic statistics of frame2

    # code
    frame2.describe()

     | one | two
    count | 3.000000 | 2.000000
    mean | 3.083333 | -2.900000
    std | 3.493685 | 2.262742
    min | 0.750000 | -4.500000
    25% | 1.075000 | -3.700000
    50% | 1.400000 | -2.900000
    75% | 4.250000 | -2.100000
    max | 7.100000 | -1.300000
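The describe() output above reflects that pandas statistics skip missing values by default. The sketch below rebuilds frame2 and reproduces a few of the entries by hand to make that visible.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                      index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
summary = frame2.describe()

count_one = frame2['one'].count()                 # non-null values only
mean_one = frame2['one'].mean()                   # NaN skipped: (1.4 + 7.1 + 0.75) / 3
q25_one = frame2['one'].quantile(0.25)            # matches the 25% row
mean_with_nan = frame2['one'].mean(skipna=False)  # NaN once missing values are included

print(count_one, mean_one, q25_one, mean_with_nan)
```

This is also why count differs between the two columns: column one has three usable values and column two has two.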

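1.6.6 Task 6: Look at the basic statistics of the fare column and the parents/children column of the Titanic dataset. What do you notice?

    '''
    Look at the basic statistics of the 票價 column of the Titanic dataset
    '''
    # code
    train_data['票價'].describe()

    count    891.000000
    mean      32.204208
    std       49.693429
    min        0.000000
    25%        7.910400
    50%       14.454200
    75%       31.000000
    max      512.329200
    Name: 票價, dtype: float64

    train_data['父母子女個數'].describe()

    count    891.000000
    mean       0.381594
    std        0.806057
    min        0.000000
    25%        0.000000
    50%        0.000000
    75%        0.000000
    max        6.000000
    Name: 父母子女個數, dtype: float64

[Think] What can we see from the data above? Try writing down your own view first, then compare it with the answer given here.
From the data above we can see that:
there are 891 fare values in total,
the mean is about 32.20,
the standard deviation is about 49.69, which means fares vary widely,
25% of passengers paid no more than 7.91, 50% paid no more than 14.45, and 75% paid no more than 31.00,
the maximum fare is about 512.33 and the minimum is 0.

The 75th percentile of 父母子女個數 is 0, so at least 75% of passengers had no parents or children on board; most passengers were travelling without them.

Of course, the answer is only my view. You may have more ideas; write them in your study notes.

Compute statistics for a few more groups of data and see what you can learn.

    # write down your other analyses

[Summary] In this section we used some of pandas' built-in functions for a first statistical look at the data. The most important thing is not memorizing the functions but understanding the numbers they produce and building your own way of thinking about data analysis. That is also the key point of Chapter 1: by the end of it you should have a basic feel for the data and understand what you are doing and why. In the following chapters we start cleaning the data and analyzing it further.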
