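[Algorithm Competition Study] Heartbeat Signal Classification Prediction: Data Analysis
Task 2 Data Analysis
Tip: This part is Task 2, EDA (exploratory data analysis), of the beginner-level introduction to data mining. It helps you understand the data, get familiar with it, and make friends with it. Follow-up discussion is welcome.
Competition task: multi-class prediction of ECG heartbeat signals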
2.1 EDA Goals
- The main value of EDA lies in getting familiar with the dataset, understanding it, and validating it to confirm that it can be used for the machine learning or deep learning steps that follow.
- Once the dataset is understood, the next step is to examine the relationships among the variables and between the variables and the prediction target.
- EDA guides data science practitioners through the data processing and feature engineering steps, so that the structure of the dataset and the feature set make the subsequent prediction problem more reliable.
- Complete the exploratory analysis of the data, summarize it with charts or text, and check in.
2.2 Contents
- Data science libraries: pandas, numpy, scipy;
- Visualization libraries: matplotlib, seaborn;
- Load the training and test sets;
- Take a quick look at the data (head() + shape);
- Use describe() to get familiar with the summary statistics;
- Use info() to get familiar with the data types;
- Check each column for NaN values;
- Outlier detection;
- Overall distribution;
- Check skewness and kurtosis;
- Check the frequency of each target value.
2.3 Code Examples
2.3.1 Load the data science and visualization libraries
    # coding: utf-8
    # Import the warnings package and use a filter to suppress warning messages.
    import warnings
    warnings.filterwarnings('ignore')
    import missingno as msno
    import pandas as pd
    from pandas import DataFrame
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
2.3.2 Load the training and test sets
Import the training set train.csv
    import pandas as pd
    from pandas import DataFrame, Series
    import matplotlib.pyplot as plt
    Train_data = pd.read_csv('./train.csv')
Import the test set testA.csv
    import pandas as pd
    from pandas import DataFrame, Series
    import matplotlib.pyplot as plt
    Test_data = pd.read_csv('./testA.csv')
All features have been anonymized (listed here for easy reference):
- id - unique identifier assigned to each heartbeat signal
- heartbeat_signals - heartbeat signal sequence
- label - heartbeat signal class (0, 1, 2, 3)
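Since heartbeat_signals arrives as one comma-separated string per row, it usually has to be converted to numbers before any numeric analysis. The following is a minimal sketch of that conversion, not the author's method: the toy frame stands in for train.csv with made-up values, and parse_signal is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.99,0.94,0.76,0.0", "1.0,0.5,0.25,0.0"],
    "label": [0.0, 2.0],
})


def parse_signal(text):
    """Convert one comma-separated signal string into a 1-D float array."""
    return np.array(text.split(","), dtype=float)


# One float array per row.
signals = toy["heartbeat_signals"].apply(parse_signal)

# Check whether every signal has the same number of samples.
lengths = signals.apply(len)
print(lengths.value_counts())

# If the lengths agree, the signals can be stacked into a 2-D matrix.
signal_matrix = np.stack(signals.tolist())
print(signal_matrix.shape)
```

On the real data, replacing toy with Train_data should work the same way, provided every row uses the same comma-separated layout.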
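pd.concat([data.head(), data.tail()]): view the first and last rows (written as data.head().append(data.tail()) before pandas 2.0, which removed DataFrame.append)
data.shape: view the number of rows and columns of the dataset
View the first and last rows of train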
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
    pd.concat([Train_data.head(), Train_data.tail()])

              id                                  heartbeat_signals  label
    0          0  0.9912297987616655,0.9435330436439665,0.764677...    0.0
    1          1  0.9714822034884503,0.9289687459588268,0.572932...    0.0
    2          2  1.0,0.9591487564065292,0.7013782792997189,0.23...    2.0
    3          3  0.9757952826275774,0.9340884687738161,0.659636...    0.0
    4          4  0.0,0.055816398940721094,0.26129357194994196,0...    2.0
    99995  99995  1.0,0.677705342021188,0.22239242747868546,0.25...    0.0
    99996  99996  0.9268571578157265,0.9063471198026871,0.636993...    2.0
    99997  99997  0.9258351628306013,0.5873839035878395,0.633226...    3.0
    99998  99998  1.0,0.9947621698382489,0.8297017704865509,0.45...    2.0
    99999  99999  0.9259994004527861,0.916476635326053,0.4042900...    0.0
View the shape of the train dataset
    Train_data.shape
    (100000, 3)
View the first and last rows of testA
    pd.concat([Test_data.head(), Test_data.tail()])

               id                                  heartbeat_signals
    0      100000  0.9915713654170097,1.0,0.6318163407681274,0.13...
    1      100001  0.6075533139615096,0.5417083883163654,0.340694...
    2      100002  0.9752726292239277,0.6710965234906665,0.686758...
    3      100003  0.9956348033996116,0.9170249621481004,0.521096...
    4      100004  1.0,0.8879490481178918,0.745564725322326,0.531...
    19995  119995  1.0,0.8330283177934747,0.6340472606311671,0.63...
    19996  119996  1.0,0.8259705825857048,0.4521053488322387,0.08...
    19997  119997  0.951744840752379,0.9162611283848351,0.6675251...
    19998  119998  0.9276692903808186,0.6771898159607004,0.242906...
    19999  119999  0.6653212231837624,0.527064114047737,0.5166625...
View the shape of the testA dataset
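    Test_data.shape
    (20000, 2)
Make a habit of checking head() and shape on every dataset. It makes each step more dependable and helps prevent a chain of errors further down the line. If you are unsure about a pandas operation, run it one step at a time and inspect the result; this makes the functions easier to understand and to use.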
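The head-and-shape habit can be wrapped in a small helper so that it costs one call per step. This is only a sketch under assumptions: preview is a hypothetical function name, and the toy frame is made up so the block runs without the competition files.

```python
import pandas as pd


def preview(df, n=5):
    """Return the first and last n rows of df as one frame, plus its shape."""
    # If df has fewer than 2 * n rows, some rows appear in both halves.
    ends = pd.concat([df.head(n), df.tail(n)])
    return ends, df.shape


# Toy frame standing in for a real dataset.
toy = pd.DataFrame({"id": range(100), "value": [i * 0.5 for i in range(100)]})

ends, shape = preview(toy, n=3)
print(ends)
print(shape)
```

Calling preview(Train_data) after each transformation gives a quick check that the row count and the columns are still what you expect.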
2.3.3 Overview of the data
data.describe(): get the summary statistics of the data
data.info(): get the data types
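By default describe() summarizes only numeric columns, so a string column such as heartbeat_signals is left out. A minimal sketch of the difference, using a made-up toy frame rather than the competition data, passes include="all" to describe():

```python
import pandas as pd

# Toy frame with the same column layout as train.csv; values are made up.
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "heartbeat_signals": ["1.0,0.5", "0.9,0.4", "1.0,0.5"],
    "label": [0.0, 2.0, 0.0],
})

# Default: numeric columns only, so heartbeat_signals is absent.
numeric_summary = toy.describe()
print(numeric_summary.columns.tolist())

# include="all" adds count / unique / top / freq for object columns.
full_summary = toy.describe(include="all")
print(full_summary.loc[["count", "unique", "top", "freq"], "heartbeat_signals"])
```

For a string column, unique and freq give a quick hint about duplicated signals, which the numeric summary cannot show.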
Get the summary statistics of train
    Train_data.describe()

                      id          label
    count  100000.000000  100000.000000
    mean    49999.500000       0.856960
    std     28867.657797       1.217084
    min         0.000000       0.000000
    25%     24999.750000       0.000000
    50%     49999.500000       0.000000
    75%     74999.250000       2.000000
    max     99999.000000       3.000000
Get the data types of train
    Train_data.info()
The output recorded here is the repr of the bound method, which is what is displayed when info is referenced without parentheses. Calling Train_data.info() prints the column dtypes and non-null counts, in the same form as the testA output below.
    <bound method DataFrame.info of           id                                  heartbeat_signals  label
    0          0  0.9912297987616655,0.9435330436439665,0.764677...    0.0
    1          1  0.9714822034884503,0.9289687459588268,0.572932...    0.0
    2          2  1.0,0.9591487564065292,0.7013782792997189,0.23...    2.0
    3          3  0.9757952826275774,0.9340884687738161,0.659636...    0.0
    4          4  0.0,0.055816398940721094,0.26129357194994196,0...    2.0
    ...      ...                                                ...    ...
    99995  99995  1.0,0.677705342021188,0.22239242747868546,0.25...    0.0
    99996  99996  0.9268571578157265,0.9063471198026871,0.636993...    2.0
    99997  99997  0.9258351628306013,0.5873839035878395,0.633226...    3.0
    99998  99998  1.0,0.9947621698382489,0.8297017704865509,0.45...    2.0
    99999  99999  0.9259994004527861,0.916476635326053,0.4042900...    0.0

    [100000 rows x 3 columns]>
Get the summary statistics of testA
    Test_data.describe()

                      id
    count   20000.000000
    mean   109999.500000
    std      5773.647028
    min    100000.000000
    25%    104999.750000
    50%    109999.500000
    75%    114999.250000
    max    119999.000000
Get the data types of testA
    Test_data.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 20000 entries, 0 to 19999
    Data columns (total 2 columns):
     #   Column             Non-Null Count  Dtype
    ---  ------             --------------  -----
     0   id                 20000 non-null  int64
     1   heartbeat_signals  20000 non-null  object
    dtypes: int64(1), object(1)
    memory usage: 312.6+ KB
2.3.4 Check for missing values and anomalies
data.isnull().sum(): count the NaN values in each column
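isnull() only sees whole cells, so it cannot detect a missing or implausible sample inside a signal string. The sketch below is one possible per-row check, not part of the original tutorial: it assumes that valid samples lie in [0, 1] (suggested by the values printed earlier, but not confirmed by the source), signal_flags is a hypothetical helper, and the toy rows are made up.

```python
import numpy as np
import pandas as pd

# Toy rows: a clean one, one with an out-of-range sample, one with a NaN sample.
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "heartbeat_signals": ["1.0,0.5,0.0", "0.9,1.7,0.1", "0.8,nan,0.2"],
})


def signal_flags(text, low=0.0, high=1.0):
    """Flag NaN samples and samples outside [low, high] (range is an assumption)."""
    values = np.array(text.split(","), dtype=float)
    has_nan = bool(np.isnan(values).any())
    finite = values[~np.isnan(values)]
    out_of_range = bool(((finite < low) | (finite > high)).any())
    return pd.Series({"has_nan": has_nan, "out_of_range": out_of_range})


# One row of flags per signal, joined back to the ids.
flags = toy["heartbeat_signals"].apply(signal_flags)
report = pd.concat([toy[["id"]], flags], axis=1)
print(report)
```

Rows flagged here are candidates for closer inspection rather than automatic removal, since the valid range is only an assumption.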
Check each column of train for NaN values
    Train_data.isnull().sum()

    id                   0
    heartbeat_signals    0
    label                0
    dtype: int64
Check each column of testA for NaN values
    Test_data.isnull().sum()

    id                   0
    heartbeat_signals    0
    dtype: int64
2.3.5 Understand the distribution of the target
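Because label is a class index rather than a continuous quantity, a table of counts and proportions is often a useful first look at the target. The sketch below uses a made-up label series, not the competition data, and imbalance_ratio is just an illustrative name for the largest class count divided by the smallest.

```python
import pandas as pd

# Made-up labels for illustration only.
labels = pd.Series([0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 3.0, 1.0], name="label")

# Absolute and relative class frequencies, ordered by class value.
counts = labels.value_counts().sort_index()
shares = labels.value_counts(normalize=True).sort_index()
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)

# A simple imbalance indicator: largest class count over smallest class count.
imbalance_ratio = counts.max() / counts.min()
print(imbalance_ratio)
```

Replacing labels with Train_data['label'] gives the same table for the real training set.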
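    Train_data['label']

    0        0.0
    1        0.0
    2        4.0
    3        0.0
    4        0.0
            ...
    99995    4.0
    99996    0.0
    99997    0.0
    99998    0.0
    99999    1.0
    Name: label, Length: 100000, dtype: float64

    Train_data['label'].value_counts()

    0.0    58883
    4.0    19660
    2.0    12994
    1.0     6522
    3.0     1941
    Name: label, dtype: int64
The listing above includes a label 4.0, which does not match the label set (0, 1, 2, 3) given earlier or the describe() output (max 3.0). Run value_counts() on your own copy of train.csv to confirm the class frequencies.
    ## 1) Overall distribution (unbounded Johnson distribution, etc.)
    import scipy.stats as st
    y = Train_data['label']
    # sns.distplot has been deprecated since seaborn 0.11. histplot and displot are
    # the suggested replacements, though neither takes the fit= argument used here.
    plt.figure(1); plt.title('Default')
    sns.distplot(y, rug=True, bins=20)
    plt.figure(2); plt.title('Normal')
    sns.distplot(y, kde=False, fit=st.norm)
    plt.figure(3); plt.title('Log Normal')
    sns.distplot(y, kde=False, fit=st.lognorm)

    ## 2) Check skewness and kurtosis
    sns.distplot(Train_data['label']);
    print("Skewness: %f" % Train_data['label'].skew())
    print("Kurtosis: %f" % Train_data['label'].kurt())

    Skewness: 0.917596
    Kurtosis: -0.825276

    # numeric_only=True skips the string column heartbeat_signals.
    Train_data.skew(numeric_only=True), Train_data.kurt(numeric_only=True)

    (id       0.000000
     label    0.917596
     dtype: float64,
     id      -1.200000
     label   -0.825276
     dtype: float64)

    sns.distplot(Train_data.kurt(numeric_only=True), color='orange', axlabel='Kurtness')

    ## 3) Check the frequency of each target value
    plt.hist(Train_data['label'], orientation='vertical', histtype='bar', color='red')
    plt.show()
2.3.7 Generate a data report with pandas_profiling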
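    # The package has since been renamed ydata-profiling (import ydata_profiling).
    import pandas_profiling
    # Build the report from the training frame loaded in 2.3.2.
    pfr = pandas_profiling.ProfileReport(Train_data)
    pfr.to_file("./example.html")
2.4 Summary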
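Exploratory data analysis is the stage in which we first get to know the data and become familiar with it in preparation for feature engineering. Quite often, features extracted during EDA can even be used directly as rules, which shows how important EDA is. The main work at this stage is to use simple summary statistics to build an overall picture of the data, analyze the relationships among the different types of variables, and visualize them with suitable plots for direct inspection. We hope this section helps beginners, and we look forward to suggestions from learners on its shortcomings.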