[Algorithm Competition Study] Used-Car Transaction Price Prediction - Task2: Data Analysis

Published: 2023/12/15

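Used-Car Transaction Price Prediction - Task2: Data Analysis

II. EDA: Exploratory Data Analysis

Tip: This is the Task2 (EDA, exploratory data analysis) part of the beginner's introduction to data mining. It helps you understand the data, get familiar with it, and "make friends" with it. Further discussion is welcome.

Competition: Introduction to Data Mining for Beginners - Used-Car Transaction Price Prediction

URL: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX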

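2.1 EDA Goals

  • The main value of EDA is to get familiar with the dataset, understand it, and validate it, so that it can be used with confidence in the machine learning or deep learning work that follows.

  • Once the dataset is understood, the next step is to examine the relationships among the variables and between each variable and the prediction target.

  • EDA guides practitioners through the data processing and feature engineering steps, so that the structure of the dataset and the feature set make the subsequent prediction problem more reliable.

  • Complete the exploratory analysis of the data, summarize it with charts or text, and check in the result.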

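2.2 Contents

  • Load the data science and visualization libraries:
    • data science libraries: pandas, numpy, scipy;
    • visualization libraries: matplotlib, seaborn;
    • others;
  • Load the data:
    • load the training and test sets;
    • take a quick look at the data (head() + shape);
  • Data overview:
    • use describe() to get familiar with the summary statistics;
    • use info() to get familiar with the data types;
  • Check for missing and abnormal data:
    • check the NaN count of each column;
    • outlier detection;
  • Understand the distribution of the prediction target:
    • overall distribution (unbounded Johnson distribution, etc.);
    • skewness and kurtosis;
    • frequency counts of the target values;
  • Split the features into categorical and numeric features, and check the unique-value distribution of the categorical ones
  • Numeric feature analysis:
    • correlation analysis;
    • skewness and kurtosis of several features;
    • distribution plot of each numeric feature;
    • pairwise relationships between numeric features;
    • regression relationships between multiple variables;
  • Categorical feature analysis:
    • unique-value distribution;
    • box plots;
    • violin plots;
    • bar plots;
    • per-category frequency plots (count_plot);
  • Generate a data report with pandas_profiling

2.3 Code Examples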

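2.3.1 Load the data science and visualization libraries

All of the libraries below can be installed with pip install; any special cases are noted separately.
For example: pip install pandas -i https://pypi.tuna.tsinghua.edu.cn/simple

    # coding: utf-8
    # Import the warnings package and use its filter to suppress warning messages.
    import warnings
    warnings.filterwarnings('ignore')

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import missingno as msno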

2.3.2 Load the data

    ## 1) Load the training and test sets
    # The files are space-separated, hence sep=' '.
    Train_data = pd.read_csv('train.csv', sep=' ')
    Test_data = pd.read_csv('testA.csv', sep=' ')
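The `sep=' '` argument matters because the competition files are space-separated rather than comma-separated. A minimal sketch of the difference, using an invented three-row sample held in memory instead of the real `train.csv`:

```python
import io
import pandas as pd

# Invented three-row sample in the same space-separated layout.
raw = "SaleID name power\n0 736 60\n1 2262 0\n2 14874 163\n"

wrong = pd.read_csv(io.StringIO(raw))            # default sep=',' -> one fused column
right = pd.read_csv(io.StringIO(raw), sep=" ")   # matches the file layout

print(wrong.shape, right.shape)
```

If a freshly loaded frame has a single column, a wrong separator is a likely cause.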

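All features have been anonymized. The fields are listed here for reference:

  • name - car code
  • regDate - registration date
  • model - model code
  • brand - brand
  • bodyType - body type
  • fuelType - fuel type
  • gearbox - gearbox (transmission)
  • power - engine power
  • kilometer - distance driven
  • notRepairedDamage - whether the car has damage that has not yet been repaired
  • regionCode - code of the region where the car can be viewed
  • seller - seller
  • offerType - offer type
  • creatDate - date the listing was posted
  • price - car price
  • 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14' (embedding vectors derived from a large amount of information such as car reviews and tags) [artificially constructed anonymous features]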
    ## 2) Take a quick look at the data (head() + shape)
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
    pd.concat([Train_data.head(), Train_data.tail()])

            SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  ...
    0            0     736  20040402   30.0      6       1.0       0.0      0.0     60       12.5  ...
    1            1    2262  20030301   40.0      1       2.0       0.0      0.0      0       15.0  ...
    2            2   14874  20040403  115.0     15       1.0       0.0      0.0    163       12.5  ...
    3            3   71865  19960908  109.0     10       0.0       0.0      1.0    193       15.0  ...
    4            4  111080  20120103  110.0      5       1.0       0.0      0.0     68        5.0  ...
    149995  149995  163978  20000607  121.0     10       4.0       0.0      1.0    163       15.0  ...
    149996  149996  184535  20091102  116.0     11       0.0       0.0      0.0    125       10.0  ...
    149997  149997  147587  20101003   60.0     11       1.0       1.0      0.0     90        6.0  ...
    149998  149998   45907  20060312   34.0     10       3.0       1.0      0.0    156       15.0  ...
    149999  149999  177672  19990204   19.0     28       6.0       0.0      1.0    193       12.5  ...

                  v_5        v_6        v_7        v_8        v_9       v_10       v_11       v_12       v_13       v_14
    0        0.235676   0.101988   0.129549   0.022816   0.097462  -2.881803   2.804097  -2.420821   0.795292   0.914762
    1        0.264777   0.121004   0.135731   0.026597   0.020582  -4.900482   2.096338  -1.030483  -1.722674   0.245522
    2        0.251410   0.114912   0.165147   0.062173   0.027075  -4.846749   1.803559   1.565330  -0.832687  -0.229963
    3        0.274293   0.110300   0.121964   0.033395   0.000000  -4.509599   1.285940  -0.501868  -2.438353  -0.478699
    4        0.228036   0.073205   0.091880   0.078819   0.121534  -1.896240   0.910783   0.931110   2.834518   1.923482
    149995   0.280264   0.000310   0.048441   0.071158   0.019174   1.988114  -2.983973   0.589167  -1.304370  -0.302592
    149996   0.253217   0.000777   0.084079   0.099681   0.079371   1.839166  -2.774615   2.553994   0.924196  -0.272160
    149997   0.233353   0.000705   0.118872   0.100118   0.097914   2.439812  -1.630677   2.290197   1.891922   0.414931
    149998   0.256369   0.000252   0.081479   0.083558   0.081498   2.075380  -2.633719   1.414937   0.431981  -1.659014
    149999   0.284475   0.000000   0.040072   0.062543   0.025819   1.978453  -3.179913   0.031724  -1.483350  -0.342674

    10 rows × 31 columns

    Train_data.shape

    (150000, 31)

    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
    pd.concat([Test_data.head(), Test_data.tail()])

           SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  ...
    0      150000   66932  20111212  222.0      4       5.0       1.0      1.0    313       15.0  ...
    1      150001  174960  19990211   19.0     21       0.0       0.0      0.0     75       12.5  ...
    2      150002    5356  20090304   82.0     21       0.0       0.0      0.0    109        7.0  ...
    3      150003   50688  20100405    0.0      0       0.0       0.0      1.0    160        7.0  ...
    4      150004  161428  19970703   26.0     14       2.0       0.0      0.0     75       15.0  ...
    49995  199995   20903  19960503    4.0      4       4.0       0.0      0.0    116       15.0  ...
    49996  199996     708  19991011    0.0      0       0.0       0.0      0.0     75       15.0  ...
    49997  199997    6693  20040412   49.0      1       0.0       1.0      1.0    224       15.0  ...
    49998  199998   96900  20020008   27.0      1       0.0       0.0      1.0    334       15.0  ...
    49999  199999  193384  20041109  166.0      6       1.0       NaN      1.0     68        9.0  ...

                 v_5        v_6        v_7        v_8        v_9       v_10       v_11       v_12       v_13       v_14
    0       0.264405   0.121800   0.070899   0.106558   0.078867  -7.050969  -0.854626   4.800151   0.620011  -3.664654
    1       0.261745   0.000000   0.096733   0.013705   0.052383   3.679418  -0.729039  -3.796107  -1.541230  -0.757055
    2       0.260216   0.112081   0.078082   0.062078   0.050540  -4.926690   1.001106   0.826562   0.138226   0.754033
    3       0.260466   0.106727   0.081146   0.075971   0.048268  -4.864637   0.505493   1.870379   0.366038   1.312775
    4       0.250999   0.000000   0.077806   0.028600   0.081709   3.616475  -0.673236  -3.197685  -0.025678  -0.101290
    49995   0.284664   0.130044   0.049833   0.028807   0.004616  -5.978511   1.303174  -1.207191  -1.981240  -0.357695
    49996   0.268101   0.108095   0.066039   0.025468   0.025971  -3.913825   1.759524  -2.075658  -1.154847   0.169073
    49997   0.269432   0.105724   0.117652   0.057479   0.015669  -4.639065   0.654713   1.137756  -1.390531   0.254420
    49998   0.261152   0.000490   0.137366   0.086216   0.051383   1.833504  -2.828687   2.465630  -0.911682  -2.057353
    49999   0.228730   0.000300   0.103534   0.080625   0.124264   2.914571  -1.135270   0.547628   2.094057  -1.552150

    10 rows × 30 columns

    Test_data.shape

    (50000, 30)

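Make a habit of checking the head() and shape of a dataset. It makes every step more reassuring and prevents a chain of downstream errors. If you are not confident in your pandas operations, run them one step at a time and inspect the result; this is an effective way to understand what each function does.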
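A minimal sketch of this habit on a tiny made-up frame (the values are invented for illustration and are not the competition data). It uses `pd.concat`, since `DataFrame.append` is no longer available in pandas 2.x:

```python
import pandas as pd

# Toy stand-in for Train_data; values are invented for illustration.
toy = pd.DataFrame({
    "SaleID": range(12),
    "power": [60, 0, 163, 193, 68, 75, 90, 110, 125, 156, 193, 224],
})

# First and last rows together, plus the overall shape.
preview = pd.concat([toy.head(3), toy.tail(3)])
print(preview)
print(toy.shape)
```

Running the same two checks after every transformation makes it easy to notice a dropped column or an unexpected row count early.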

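2.3.3 Data overview

  • describe() gives the summary statistics of each column: count, mean, standard deviation (std), minimum (min), the quartiles (25%, 50%, 75%; the 50% value is the median), and maximum (max). The point is to grasp the approximate range of the data at a glance and to spot abnormal values in each column. For example, values such as 999, 9999, or -1 are sometimes just another way of encoding NaN, so they deserve attention.
  • info() shows the type of each column, which helps reveal special symbols other than NaN that mark abnormal values.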
    ## 1) Use describe() to get familiar with the summary statistics
    Train_data.describe()

                   SaleID           name        regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer  ...
    count   150000.000000  150000.000000   1.500000e+05  149999.000000  150000.000000  145494.000000  141320.000000  144019.000000  150000.000000  150000.000000  ...
    mean     74999.500000   68349.172873   2.003417e+07      47.129021       8.052733       1.792369       0.375842       0.224943     119.316547      12.597160  ...
    std      43301.414527   61103.875095   5.364988e+04      49.536040       7.864956       1.760640       0.548677       0.417546     177.168419       3.919576  ...
    min          0.000000       0.000000   1.991000e+07       0.000000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000  ...
    25%      37499.750000   11156.000000   1.999091e+07      10.000000       1.000000       0.000000       0.000000       0.000000      75.000000      12.500000  ...
    50%      74999.500000   51638.000000   2.003091e+07      30.000000       6.000000       1.000000       0.000000       0.000000     110.000000      15.000000  ...
    75%     112499.250000  118841.250000   2.007111e+07      66.000000      13.000000       3.000000       1.000000       0.000000     150.000000      15.000000  ...
    max     149999.000000  196812.000000   2.015121e+07     247.000000      39.000000       7.000000       6.000000       1.000000   19312.000000      15.000000  ...

                      v_5            v_6            v_7            v_8            v_9           v_10           v_11           v_12           v_13           v_14
    count   150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000
    mean         0.248204       0.044923       0.124692       0.058144       0.061996      -0.001000       0.009035       0.004813       0.000313      -0.000688
    std          0.045804       0.051743       0.201410       0.029186       0.035692       3.772386       3.286071       2.517478       1.288988       1.038685
    min          0.000000       0.000000       0.000000       0.000000       0.000000      -9.168192      -5.558207      -9.639552      -4.153899      -6.546556
    25%          0.243615       0.000038       0.062474       0.035334       0.033930      -3.722303      -1.951543      -1.871846      -1.057789      -0.437034
    50%          0.257798       0.000812       0.095866       0.057014       0.058484       1.624076      -0.358053      -0.130753      -0.036245       0.141246
    75%          0.265297       0.102009       0.125243       0.079382       0.087491       2.844357       1.255022       1.776933       0.942813       0.680378
    max          0.291838       0.151420       1.404936       0.160791       0.222787      12.357011      18.819042      13.847792      11.147669       8.658418

    8 rows × 30 columns

    Test_data.describe()

                   SaleID           name        regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer  ...
    count    50000.000000   50000.000000   5.000000e+04   50000.000000   50000.000000   48587.000000   47107.000000   48090.000000   50000.000000   50000.000000  ...
    mean    174999.500000   68542.223280   2.003393e+07      46.844520       8.056240       1.782185       0.373405       0.224350     119.883620      12.595580  ...
    std      14433.901067   61052.808133   5.368870e+04      49.469548       7.819477       1.760736       0.546442       0.417158     185.097387       3.908979  ...
    min     150000.000000       0.000000   1.991000e+07       0.000000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000  ...
    25%     162499.750000   11203.500000   1.999091e+07      10.000000       1.000000       0.000000       0.000000       0.000000      75.000000      12.500000  ...
    50%     174999.500000   52248.500000   2.003091e+07      29.000000       6.000000       1.000000       0.000000       0.000000     109.000000      15.000000  ...
    75%     187499.250000  118856.500000   2.007110e+07      65.000000      13.000000       3.000000       1.000000       0.000000     150.000000      15.000000  ...
    max     199999.000000  196805.000000   2.015121e+07     246.000000      39.000000       7.000000       6.000000       1.000000   20000.000000      15.000000  ...

                      v_5            v_6            v_7            v_8            v_9           v_10           v_11           v_12           v_13           v_14
    count    50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000
    mean         0.248669       0.045021       0.122744       0.057997       0.062000      -0.017855      -0.013742      -0.013554      -0.003147       0.001516
    std          0.044601       0.051766       0.195972       0.029211       0.035653       3.747985       3.231258       2.515962       1.286597       1.027360
    min          0.000000       0.000000       0.000000       0.000000       0.000000      -9.160049      -5.411964      -8.916949      -4.123333      -6.112667
    25%          0.243762       0.000044       0.062644       0.035084       0.033714      -3.700121      -1.971325      -1.876703      -1.060428      -0.437920
    50%          0.257877       0.000815       0.095828       0.057084       0.058764       1.613212      -0.355843      -0.142779      -0.035956       0.138799
    75%          0.265328       0.102025       0.125438       0.079077       0.087489       2.832708       1.262914       1.764335       0.941469       0.681163
    max          0.291618       0.153265       1.358813       0.156355       0.214775      12.338872      18.856218      12.950498       5.913273       2.624622

    8 rows × 29 columns

    ## 2) Use info() to get familiar with the data types
    Train_data.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 150000 entries, 0 to 149999
    Data columns (total 31 columns):
    SaleID               150000 non-null int64
    name                 150000 non-null int64
    regDate              150000 non-null int64
    model                149999 non-null float64
    brand                150000 non-null int64
    bodyType             145494 non-null float64
    fuelType             141320 non-null float64
    gearbox              144019 non-null float64
    power                150000 non-null int64
    kilometer            150000 non-null float64
    notRepairedDamage    150000 non-null object
    regionCode           150000 non-null int64
    seller               150000 non-null int64
    offerType            150000 non-null int64
    creatDate            150000 non-null int64
    price                150000 non-null int64
    v_0                  150000 non-null float64
    v_1                  150000 non-null float64
    v_2                  150000 non-null float64
    v_3                  150000 non-null float64
    v_4                  150000 non-null float64
    v_5                  150000 non-null float64
    v_6                  150000 non-null float64
    v_7                  150000 non-null float64
    v_8                  150000 non-null float64
    v_9                  150000 non-null float64
    v_10                 150000 non-null float64
    v_11                 150000 non-null float64
    v_12                 150000 non-null float64
    v_13                 150000 non-null float64
    v_14                 150000 non-null float64
    dtypes: float64(20), int64(10), object(1)
    memory usage: 35.5+ MB

    Test_data.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 50000 entries, 0 to 49999
    Data columns (total 30 columns):
    SaleID               50000 non-null int64
    name                 50000 non-null int64
    regDate              50000 non-null int64
    model                50000 non-null float64
    brand                50000 non-null int64
    bodyType             48587 non-null float64
    fuelType             47107 non-null float64
    gearbox              48090 non-null float64
    power                50000 non-null int64
    kilometer            50000 non-null float64
    notRepairedDamage    50000 non-null object
    regionCode           50000 non-null int64
    seller               50000 non-null int64
    offerType            50000 non-null int64
    creatDate            50000 non-null int64
    v_0                  50000 non-null float64
    v_1                  50000 non-null float64
    v_2                  50000 non-null float64
    v_3                  50000 non-null float64
    v_4                  50000 non-null float64
    v_5                  50000 non-null float64
    v_6                  50000 non-null float64
    v_7                  50000 non-null float64
    v_8                  50000 non-null float64
    v_9                  50000 non-null float64
    v_10                 50000 non-null float64
    v_11                 50000 non-null float64
    v_12                 50000 non-null float64
    v_13                 50000 non-null float64
    v_14                 50000 non-null float64
    dtypes: float64(20), int64(9), object(1)
    memory usage: 11.4+ MB
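The remark at the start of this subsection, that values such as 999, 9999, or -1 can be NaN in disguise, can be checked mechanically. The following is only a sketch on invented numbers; the `SENTINELS` list is an assumption that has to be adapted after looking at the real describe() output:

```python
import numpy as np
import pandas as pd

# Invented sample: -1 and 9999 are assumed placeholders for "unknown".
df = pd.DataFrame({
    "power": [75, 110, 9999, 150, -1, 90],
    "kilometer": [12.5, 15.0, 15.0, -1, 5.0, 10.0],
})

# Hypothetical list of sentinel codes; adjust after inspecting describe().
SENTINELS = [-1, 999, 9999]

# Count sentinel hits per column, then convert them to NaN.
sentinel_counts = df.isin(SENTINELS).sum()
cleaned = df.mask(df.isin(SENTINELS))

print(sentinel_counts)
print(cleaned.describe())
```

After the conversion, the min and max rows of describe() reflect real measurements instead of placeholder codes.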

2.3.4 Check for missing and abnormal data

    ## 1) Check the NaN count of each column
    Train_data.isnull().sum()

    SaleID                  0
    name                    0
    regDate                 0
    model                   1
    brand                   0
    bodyType             4506
    fuelType             8680
    gearbox              5981
    power                   0
    kilometer               0
    notRepairedDamage       0
    regionCode              0
    seller                  0
    offerType               0
    creatDate               0
    price                   0
    v_0                     0
    v_1                     0
    v_2                     0
    v_3                     0
    v_4                     0
    v_5                     0
    v_6                     0
    v_7                     0
    v_8                     0
    v_9                     0
    v_10                    0
    v_11                    0
    v_12                    0
    v_13                    0
    v_14                    0
    dtype: int64

    Test_data.isnull().sum()

    SaleID                  0
    name                    0
    regDate                 0
    model                   0
    brand                   0
    bodyType             1413
    fuelType             2893
    gearbox              1910
    power                   0
    kilometer               0
    notRepairedDamage       0
    regionCode              0
    seller                  0
    offerType               0
    creatDate               0
    v_0                     0
    v_1                     0
    v_2                     0
    v_3                     0
    v_4                     0
    v_5                     0
    v_6                     0
    v_7                     0
    v_8                     0
    v_9                     0
    v_10                    0
    v_11                    0
    v_12                    0
    v_13                    0
    v_14                    0
    dtype: int64

    # Visualize the NaN counts
    missing = Train_data.isnull().sum()
    missing = missing[missing > 0]
    missing.sort_values(inplace=True)
    missing.plot.bar()

[Figure: bar chart of the number of missing values in each Train_data column that has any]

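The two calls above make it easy to see which columns contain NaN and to print how many. The main goal is to judge whether the number of NaNs is really large. If it is small, filling is the usual choice. With tree models such as LightGBM (lgb) the values can be left missing so that the tree handles them itself. If a column has too many NaNs, consider dropping it.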
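The decision described above can be sketched as a small report. The helper name `missing_report` and the 0.5 cut-off are assumptions made for this illustration, not part of the original tutorial:

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame, drop_threshold: float = 0.5) -> pd.DataFrame:
    """Summarise missing values per column (hypothetical helper).

    drop_threshold is an assumed cut-off, not a value from the article.
    """
    n_missing = df.isnull().sum()
    ratio = n_missing / len(df)
    report = pd.DataFrame({"n_missing": n_missing, "ratio": ratio})
    report = report[report["n_missing"] > 0].sort_values("ratio", ascending=False)
    report["suggestion"] = np.where(
        report["ratio"] > drop_threshold,
        "consider dropping",
        "fill or leave to tree model",
    )
    return report

# Invented toy frame standing in for Train_data.
toy = pd.DataFrame({
    "gearbox": [0.0, 1.0, np.nan, 0.0],
    "fuelType": [np.nan, np.nan, np.nan, 1.0],
    "power": [60, 75, 90, 110],
})
report = missing_report(toy)
print(report)
```

The suggestion column is only a starting point; whether to fill, keep, or drop still depends on the model and on what the feature means.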

    # Visualize the missing values
    msno.matrix(Train_data.sample(250))

[Figure: missingno matrix for a 250-row sample of Train_data]

    msno.bar(Train_data.sample(1000))

[Figure: missingno bar chart for a 1000-row sample of Train_data]

    # Visualize the missing values
    msno.matrix(Test_data.sample(250))

[Figure: missingno matrix for a 250-row sample of Test_data]

    msno.bar(Test_data.sample(1000))

[Figure: missingno bar chart for a 1000-row sample of Test_data]

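The missing-value pattern of the test set is similar to that of the training set. According to the counts above, four training-set columns contain NaN (model, bodyType, fuelType, gearbox), with fuelType missing most often. notRepairedDamage actually has the most missing entries, but they are encoded as '-' rather than NaN, so they only show up after the replacement performed in the next step.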

    ## 2) Outlier detection
    Train_data.info()

(The output is identical to the Train_data.info() output shown in 2.3.3.)

Every column except notRepairedDamage, which has object dtype, is numeric. Listing its distinct values shows why.

    Train_data['notRepairedDamage'].value_counts()

    0.0    111361
    -       24324
    1.0     14315
    Name: notRepairedDamage, dtype: int64

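As shown, '-' is also a missing value. Because many models handle NaN directly, no further processing is done here; the '-' is simply replaced with NaN.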

    # Assign the result back instead of calling replace(..., inplace=True) on a
    # selected column, which is unreliable in recent pandas versions.
    Train_data['notRepairedDamage'] = Train_data['notRepairedDamage'].replace('-', np.nan)
    Train_data['notRepairedDamage'].value_counts()

    0.0    111361
    1.0     14315
    Name: notRepairedDamage, dtype: int64

    Train_data.isnull().sum()

    SaleID                   0
    name                     0
    regDate                  0
    model                    1
    brand                    0
    bodyType              4506
    fuelType              8680
    gearbox               5981
    power                    0
    kilometer                0
    notRepairedDamage    24324
    regionCode               0
    seller                   0
    offerType                0
    creatDate                0
    price                    0
    v_0                      0
    v_1                      0
    v_2                      0
    v_3                      0
    v_4                      0
    v_5                      0
    v_6                      0
    v_7                      0
    v_8                      0
    v_9                      0
    v_10                     0
    v_11                     0
    v_12                     0
    v_13                     0
    v_14                     0
    dtype: int64

    Test_data['notRepairedDamage'].value_counts()

    0.0    37249
    -       8031
    1.0     4720
    Name: notRepairedDamage, dtype: int64

    Test_data['notRepairedDamage'] = Test_data['notRepairedDamage'].replace('-', np.nan)
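After the replacement the column still has object dtype. A small sketch of the full clean-up on invented values, assuming (as the value_counts output suggests) that the raw column stores numbers as strings next to the '-' placeholder:

```python
import numpy as np
import pandas as pd

# Invented values mimicking the raw column: numbers stored as strings plus '-'.
raw = pd.Series(["0.0", "-", "1.0", "0.0", "-"], name="notRepairedDamage")

# Replace the placeholder, then convert the remaining strings to floats.
as_nan = raw.replace("-", np.nan)
numeric = pd.to_numeric(as_nan)

print(numeric.isnull().sum())
print(numeric.dtype)
```

`pd.to_numeric(raw, errors="coerce")` would reach the same result in one call, but it silently turns every unparsable value into NaN, so the explicit replacement is easier to audit.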

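The following two categorical features are severely imbalanced and generally do not help prediction, so they are dropped here. You can of course dig further, but it is usually not worthwhile.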

    Train_data["seller"].value_counts()

    0    149999
    1         1
    Name: seller, dtype: int64

    Train_data["offerType"].value_counts()

    0    150000
    Name: offerType, dtype: int64

    del Train_data["seller"]
    del Train_data["offerType"]
    del Test_data["seller"]
    del Test_data["offerType"]
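Spotting such near-constant columns does not have to be done by eye. The sketch below measures the share of the most frequent value per column; the helper `dominant_share` and the 0.95 threshold are assumptions for illustration, and the toy data is invented:

```python
import pandas as pd

def dominant_share(series: pd.Series) -> float:
    """Fraction of rows taken by the most frequent value (hypothetical helper)."""
    return series.value_counts(normalize=True, dropna=False).iloc[0]

toy = pd.DataFrame({
    "seller": [0] * 99 + [1],
    "offerType": [0] * 100,
    "gearbox": [0] * 70 + [1] * 30,
})

# 0.95 is an assumed threshold for "almost constant".
shares = {col: dominant_share(toy[col]) for col in toy.columns}
to_drop = [col for col, share in shares.items() if share >= 0.95]
print(shares)
print(to_drop)
```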

2.3.5 Understand the distribution of the prediction target

    Train_data['price']

    0         1850
    1         3600
    2         6222
    3         2400
    4         5200
              ...
    149995    5900
    149996    9500
    149997    7500
    149998    4999
    149999    4700
    Name: price, Length: 150000, dtype: int64

    Train_data['price'].value_counts()

    500      2337
    1500     2158
    1200     1922
    1000     1850
    2500     1821
             ...
    25321       1
    8886        1
    8801        1
    37920       1
    8188        1
    Name: price, Length: 3763, dtype: int64

    ## 1) Overall distribution (unbounded Johnson distribution, etc.)
    # Note: sns.distplot is deprecated since seaborn 0.11.
    import scipy.stats as st
    y = Train_data['price']
    plt.figure(1); plt.title('Johnson SU')
    sns.distplot(y, kde=False, fit=st.johnsonsu)
    plt.figure(2); plt.title('Normal')
    sns.distplot(y, kde=False, fit=st.norm)
    plt.figure(3); plt.title('Log Normal')
    sns.distplot(y, kde=False, fit=st.lognorm)

[Figures: histogram of price with fitted curves, titled 'Johnson SU', 'Normal', and 'Log Normal']

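The price does not follow a normal distribution, so it must be transformed before regression. The log transform does a good job, but the best fit is the unbounded Johnson distribution.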
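The comparison above is made visually. One way to put a number on it, shown here only as a sketch on synthetic data, is to fit each candidate by maximum likelihood and compare the total log-likelihood (higher is better). The helper `log_likelihood` is hypothetical, and the synthetic prices are not the competition data:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices"; not the competition data.
y = rng.lognormal(mean=8.0, sigma=1.0, size=2000)

def log_likelihood(dist, data, **fit_kwargs):
    """Fit dist by maximum likelihood and return the total log-likelihood."""
    params = dist.fit(data, **fit_kwargs)
    return float(np.sum(dist.logpdf(data, *params)))

ll_norm = log_likelihood(st.norm, y)
# Fixing loc at 0 keeps the log-normal fit stable for strictly positive data.
ll_lognorm = log_likelihood(st.lognorm, y, floc=0)

print("normal    :", ll_norm)
print("log-normal:", ll_lognorm)
```

`st.johnsonsu` can be passed to the same helper; its fit has four free parameters and can take noticeably longer on large samples.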

    ## 2) Skewness and kurtosis
    sns.distplot(Train_data['price']);
    print("Skewness: %f" % Train_data['price'].skew())
    print("Kurtosis: %f" % Train_data['price'].kurt())

    Skewness: 3.346487
    Kurtosis: 18.995183

[Figure: distribution plot of price]

    # On recent pandas versions these calls may need numeric_only=True,
    # which leaves out the object column notRepairedDamage.
    Train_data.skew(), Train_data.kurt()

    (SaleID               6.017846e-17
     name                 5.576058e-01
     regDate              2.849508e-02
     model                1.484388e+00
     brand                1.150760e+00
     bodyType             9.915299e-01
     fuelType             1.595486e+00
     gearbox              1.317514e+00
     power                6.586318e+01
     kilometer           -1.525921e+00
     notRepairedDamage    2.430640e+00
     regionCode           6.888812e-01
     creatDate           -7.901331e+01
     price                3.346487e+00
     v_0                 -1.316712e+00
     v_1                  3.594543e-01
     v_2                  4.842556e+00
     v_3                  1.062920e-01
     v_4                  3.679890e-01
     v_5                 -4.737094e+00
     v_6                  3.680730e-01
     v_7                  5.130233e+00
     v_8                  2.046133e-01
     v_9                  4.195007e-01
     v_10                 2.522046e-02
     v_11                 3.029146e+00
     v_12                 3.653576e-01
     v_13                 2.679152e-01
     v_14                -1.186355e+00
     dtype: float64,
     SaleID                 -1.200000
     name                   -1.039945
     regDate                -0.697308
     model                   1.740483
     brand                   1.076201
     bodyType                0.206937
     fuelType                5.880049
     gearbox                -0.264161
     power                5733.451054
     kilometer               1.141934
     notRepairedDamage       3.908072
     regionCode             -0.340832
     creatDate            6881.080328
     price                  18.995183
     v_0                     3.993841
     v_1                    -1.753017
     v_2                    23.860591
     v_3                    -0.418006
     v_4                    -0.197295
     v_5                    22.934081
     v_6                    -1.742567
     v_7                    25.845489
     v_8                    -0.636225
     v_9                    -0.321491
     v_10                   -0.577935
     v_11                   12.568731
     v_12                    0.268937
     v_13                   -0.438274
     v_14                    2.393526
     dtype: float64)

    sns.distplot(Train_data.skew(), color='blue', axlabel='Skewness')

[Figure: distribution of the per-column skewness values]

    sns.distplot(Train_data.kurt(), color='orange', axlabel='Kurtosis')

[Figure: distribution of the per-column kurtosis values]

For an explanation of skew and kurt, see https://www.cnblogs.com/wyy1480/p/10474046.html
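To build intuition for the two statistics, the sketch below compares a symmetric sample with a right-skewed one (both synthetic) and checks that the pandas values correspond to scipy's bias-corrected estimators:

```python
import numpy as np
import pandas as pd
import scipy.stats as st

rng = np.random.default_rng(42)
symmetric = pd.Series(rng.normal(size=5000))
right_skewed = pd.Series(rng.exponential(size=5000))

for name, s in [("normal", symmetric), ("exponential", right_skewed)]:
    print(name, "skew=%.3f" % s.skew(), "kurt=%.3f" % s.kurt())

# pandas reports bias-corrected sample skewness and excess kurtosis,
# which correspond to scipy's estimators with bias=False.
scipy_skew = st.skew(right_skewed, bias=False)
scipy_kurt = st.kurtosis(right_skewed, fisher=True, bias=False)
```

A skewness near 0 indicates a roughly symmetric sample, while a clearly positive value, as for price above, indicates a long right tail. `kurt()` returns excess kurtosis, so a normal sample lands near 0.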

    ## 3) Frequency counts of the target values
    plt.hist(Train_data['price'], orientation='vertical', histtype='bar', color='red')
    plt.show()

[Figure: histogram of price]

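The frequency counts show that very few prices exceed 20000. These could be treated as special values (outliers) and either replaced by a fill value or dropped, in which case that should be done earlier in the pipeline.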
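The text picks 20000 by looking at the histogram. A common data-driven alternative, swapped in here purely as an illustration, is Tukey's rule based on the interquartile range (IQR). The prices below are invented:

```python
import pandas as pd

# Invented prices with one extreme value.
price = pd.Series([500, 1200, 1500, 2500, 3600, 5200, 9500, 99999])

# Tukey's rule: anything above Q3 + 1.5 * IQR is flagged.
q1, q3 = price.quantile(0.25), price.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)

is_outlier = price > upper
clipped = price.clip(upper=upper)   # cap instead of dropping
dropped = price[~is_outlier]        # or remove the rows

print(upper, is_outlier.sum())
```

Capping keeps the row count unchanged, which matters when other columns of the same rows are still useful. Dropping is simpler but should only be applied to the training set.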

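    # After a log transform the distribution is much more even, so the model can
    # predict log(price) instead; this is a common trick for prediction problems.
    plt.hist(np.log(Train_data['price']), orientation='vertical', histtype='bar', color='red')
    plt.show()

[Figure: histogram of log(price)]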
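When the model is trained on the log of the target, its predictions have to be mapped back to the original scale. A minimal sketch on synthetic data, using `np.log1p` and `np.expm1` as the transform pair (a variant of the plain `np.log` above that also tolerates zeros):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic right-skewed target standing in for price.
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5000))

log_price = np.log1p(price)        # log(1 + x), safe when x == 0
restored = np.expm1(log_price)     # inverse transform for predictions

print("skew before: %.3f" % price.skew())
print("skew after : %.3f" % log_price.skew())
```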

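2.3.6 Split the features into categorical and numeric features, and check the unique-value distribution of the categorical ones

Fields

  • name - car code
  • regDate - registration date
  • model - model code
  • brand - brand
  • bodyType - body type
  • fuelType - fuel type
  • gearbox - gearbox (transmission)
  • power - engine power
  • kilometer - distance driven
  • notRepairedDamage - whether the car has damage that has not yet been repaired
  • regionCode - code of the region where the car can be viewed
  • seller - seller [deleted]
  • offerType - offer type [deleted]
  • creatDate - date the listing was posted
  • price - car price
  • 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14' (embedding vectors derived from a large amount of information such as car reviews and tags) [artificially constructed anonymous features]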
    # Separate the label, i.e. the prediction target
    Y_train = Train_data['price']

    # The dtype-based split below works for data that has not been label-encoded.
    # It does not apply here, so the split is made by hand from the meaning of each feature.
    # Numeric features
    # numeric_features = Train_data.select_dtypes(include=[np.number])
    # numeric_features.columns
    # # Categorical features
    # categorical_features = Train_data.select_dtypes(include=[object])
    # categorical_features.columns

    numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
                        'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
    categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
                            'notRepairedDamage', 'regionCode']

    # nunique distribution of each categorical feature (training set)
    for cat_fea in categorical_features:
        print(cat_fea + " distribution:")
        print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
        print(Train_data[cat_fea].value_counts())

    name distribution:
    name has 99662 distinct values
    708       282
    387       282
    55        280
    1541      263
    203       233
             ...
    5074        1
    7123        1
    11221       1
    13270       1
    174485      1
    Name: name, Length: 99662, dtype: int64
    model distribution:
    model has 248 distinct values
    0.0      11762
    19.0      9573
    4.0       8445
    1.0       6038
    29.0      5186
             ...
    245.0        2
    209.0        2
    240.0        2
    242.0        2
    247.0        1
    Name: model, Length: 248, dtype: int64
    brand distribution:
    brand has 40 distinct values
    0     31480
    4     16737
    14    16089
    10    14249
    1     13794
    6     10217
    9      7306
    5      4665
    13     3817
    11     2945
    3      2461
    7      2361
    16     2223
    8      2077
    25     2064
    27     2053
    21     1547
    15     1458
    19     1388
    20     1236
    12     1109
    22     1085
    26      966
    30      940
    17      913
    24      772
    28      649
    32      592
    29      406
    37      333
    2       321
    31      318
    18      316
    36      228
    34      227
    33      218
    23      186
    35      180
    38       65
    39        9
    Name: brand, dtype: int64
    bodyType distribution:
    bodyType has 8 distinct values
    0.0    41420
    1.0    35272
    2.0    30324
    3.0    13491
    4.0     9609
    5.0     7607
    6.0     6482
    7.0     1289
    Name: bodyType, dtype: int64
    fuelType distribution:
    fuelType has 7 distinct values
    0.0    91656
    1.0    46991
    2.0     2212
    3.0      262
    4.0      118
    5.0       45
    6.0       36
    Name: fuelType, dtype: int64
    gearbox distribution:
    gearbox has 2 distinct values
    0.0    111623
    1.0     32396
    Name: gearbox, dtype: int64
    notRepairedDamage distribution:
    notRepairedDamage has 2 distinct values
    0.0    111361
    1.0     14315
    Name: notRepairedDamage, dtype: int64
    regionCode distribution:
    regionCode has 7905 distinct values
    419     369
    764     258
    125     137
    176     136
    462     134
           ...
    6414      1
    7063      1
    4239      1
    5931      1
    7267      1
    Name: regionCode, Length: 7905, dtype: int64

    # nunique distribution of each categorical feature (test set)
    for cat_fea in categorical_features:
        print(cat_fea + " distribution:")
        print("{} has {} distinct values".format(cat_fea, Test_data[cat_fea].nunique()))
        print(Test_data[cat_fea].value_counts())

    name distribution:
    name has 37453 distinct values
    55       97
    708      96
    387      95
    1541     88
    713      74
             ..
    22270     1
    89855     1
    42752     1
    48899     1
    11808     1
    Name: name, Length: 37453, dtype: int64
    model distribution:
    model has 247 distinct values
    0.0      3896
    19.0     3245
    4.0      3007
    1.0      1981
    29.0     1742
             ...
    242.0       1
    240.0       1
    244.0       1
    243.0       1
    246.0       1
    Name: model, Length: 247, dtype: int64
    brand distribution:
    brand has 40 distinct values
    0     10348
    4      5763
    14     5314
    10     4766
    1      4532
    6      3502
    9      2423
    5      1569
    13     1245
    11      919
    7       795
    3       773
    16      771
    8       704
    25      695
    27      650
    21      544
    15      511
    20      450
    19      450
    12      389
    22      363
    30      324
    17      317
    26      303
    24      268
    28      225
    32      193
    29      117
    31      115
    18      106
    2       104
    37       92
    34       77
    33       76
    36       67
    23       62
    35       53
    38       23
    39        2
    Name: brand, dtype: int64
    bodyType distribution:
    bodyType has 8 distinct values
    0.0    13985
    1.0    11882
    2.0     9900
    3.0     4433
    4.0     3303
    5.0     2537
    6.0     2116
    7.0      431
    Name: bodyType, dtype: int64
    fuelType distribution:
    fuelType has 7 distinct values
    0.0    30656
    1.0    15544
    2.0      774
    3.0       72
    4.0       37
    6.0       14
    5.0       10
    Name: fuelType, dtype: int64
    gearbox distribution:
    gearbox has 2 distinct values
    0.0    37301
    1.0    10789
    Name: gearbox, dtype: int64
    notRepairedDamage distribution:
    notRepairedDamage has 2 distinct values
    0.0    37249
    1.0     4720
    Name: notRepairedDamage, dtype: int64
    regionCode distribution:
    regionCode has 6971 distinct values
    419     146
    764      78
    188      52
    125      51
    759      51
           ...
    7753      1
    7463      1
    7230      1
    826       1
    112       1
    Name: regionCode, Length: 6971, dtype: int64
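The comment in the code above notes that a dtype-based split does not work for label-encoded data. The sketch below shows why on an invented frame, and adds a rough cardinality check as a starting point for the manual split. The cut-off of 5 is an assumption for this toy example only:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": [0, 4, 14, 0, 4, 0, 14, 4],            # label-encoded category
    "gearbox": [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "power": [60, 75, 163, 193, 68, 90, 125, 156],  # genuinely numeric
})

# Every column is numeric by dtype, so select_dtypes cannot separate them.
by_dtype = toy.select_dtypes(include="number").columns.tolist()

# Rough aid: low cardinality hints at a category. The cut-off of 5 is an
# assumption for this toy frame and still needs a human check.
cardinality = toy.nunique()
likely_categorical = cardinality[cardinality <= 5].index.tolist()
print(by_dtype, likely_categorical)
```

High-cardinality codes such as name or regionCode would slip through any such threshold, which is why the article assigns the feature lists by hand.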

2.3.7 Numeric feature analysis

    numeric_features.append('price')
    numeric_features

    ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8',
     'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14', 'price']

    Train_data.head()

       SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer  ...
    0       0     736  20040402   30.0      6       1.0       0.0      0.0     60       12.5  ...
    1       1    2262  20030301   40.0      1       2.0       0.0      0.0      0       15.0  ...
    2       2   14874  20040403  115.0     15       1.0       0.0      0.0    163       12.5  ...
    3       3   71865  19960908  109.0     10       0.0       0.0      1.0    193       15.0  ...
    4       4  111080  20120103  110.0      5       1.0       0.0      0.0     68        5.0  ...

             v_5        v_6        v_7        v_8        v_9       v_10       v_11       v_12       v_13       v_14
    0   0.235676   0.101988   0.129549   0.022816   0.097462  -2.881803   2.804097  -2.420821   0.795292   0.914762
    1   0.264777   0.121004   0.135731   0.026597   0.020582  -4.900482   2.096338  -1.030483  -1.722674   0.245522
    2   0.251410   0.114912   0.165147   0.062173   0.027075  -4.846749   1.803559   1.565330  -0.832687  -0.229963
    3   0.274293   0.110300   0.121964   0.033395   0.000000  -4.509599   1.285940  -0.501868  -2.438353  -0.478699
    4   0.228036   0.073205   0.091880   0.078819   0.121534  -1.896240   0.910783   0.931110   2.834518   1.923482

    5 rows × 29 columns

    ## 1) Correlation analysis
    price_numeric = Train_data[numeric_features]
    correlation = price_numeric.corr()
    print(correlation['price'].sort_values(ascending=False), '\n')

    price        1.000000
    v_12         0.692823
    v_8          0.685798
    v_0          0.628397
    power        0.219834
    v_5          0.164317
    v_2          0.085322
    v_6          0.068970
    v_1          0.060914
    v_14         0.035911
    v_13        -0.013993
    v_7         -0.053024
    v_4         -0.147085
    v_9         -0.206205
    v_10        -0.246175
    v_11        -0.275320
    kilometer   -0.440519
    v_3         -0.730946
    Name: price, dtype: float64

    f, ax = plt.subplots(figsize=(7, 7))
    plt.title('Correlation of Numeric Features with Price', y=1, size=16)
    sns.heatmap(correlation, square=True, vmax=0.8)

[Figure: heatmap titled 'Correlation of Numeric Features with Price']
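`DataFrame.corr()` computes the Pearson correlation coefficient by default. As a sanity check of what the numbers above mean, the sketch below recomputes it by hand on synthetic data in which price falls with mileage (the coefficients 20000 and 900 are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000
kilometer = rng.uniform(0.5, 15.0, size=n)
# Synthetic price that falls with mileage, plus noise.
price = 20000 - 900 * kilometer + rng.normal(scale=2000, size=n)
toy = pd.DataFrame({"kilometer": kilometer, "price": price})

# Pearson r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum_sq_x * sum_sq_y)
x, y = toy["kilometer"], toy["price"]
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum())
)
r_pandas = toy.corr()["price"]["kilometer"]
print(r_manual, r_pandas)
```

Pearson correlation only measures linear association, so a feature with a low coefficient can still be useful to a tree model.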

    del price_numeric['price']

    ## 2) Skewness and kurtosis of several features
    for col in numeric_features:
        print('{:15}'.format(col),
              'Skewness: {:05.2f}'.format(Train_data[col].skew()),
              '   ',
              'Kurtosis: {:06.2f}'.format(Train_data[col].kurt()))

    power           Skewness: 65.86     Kurtosis: 5733.45
    kilometer       Skewness: -1.53     Kurtosis: 001.14
    v_0             Skewness: -1.32     Kurtosis: 003.99
    v_1             Skewness: 00.36     Kurtosis: -01.75
    v_2             Skewness: 04.84     Kurtosis: 023.86
    v_3             Skewness: 00.11     Kurtosis: -00.42
    v_4             Skewness: 00.37     Kurtosis: -00.20
    v_5             Skewness: -4.74     Kurtosis: 022.93
    v_6             Skewness: 00.37     Kurtosis: -01.74
    v_7             Skewness: 05.13     Kurtosis: 025.85
    v_8             Skewness: 00.20     Kurtosis: -00.64
    v_9             Skewness: 00.42     Kurtosis: -00.32
    v_10            Skewness: 00.03     Kurtosis: -00.58
    v_11            Skewness: 03.03     Kurtosis: 012.57
    v_12            Skewness: 00.37     Kurtosis: 000.27
    v_13            Skewness: 00.27     Kurtosis: -00.44
    v_14            Skewness: -1.19     Kurtosis: 002.39
    price           Skewness: 03.35     Kurtosis: 019.00

    ## 3) Distribution plot of each numeric feature
    f = pd.melt(Train_data, value_vars=numeric_features)
    g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
    g = g.map(sns.distplot, "value")

[Figure: grid of distribution plots, one panel per numeric feature]

The plots show that the anonymous features are relatively evenly distributed.

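    ## 4) Pairwise relationships between numeric features
    sns.set()
    columns = ['price', 'v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
    # The size argument of pairplot was renamed to height in seaborn 0.9.
    sns.pairplot(Train_data[columns], height=2, kind='scatter', diag_kind='kde')
    plt.show()

[Figure: pair plot of price and the nine listed features]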

    Train_data.columns

    Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
           'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
           'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
           'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
          dtype='object')

    Y_train

    0         1850
    1         3600
    2         6222
    3         2400
    4         5200
              ...
    149995    5900
    149996    9500
    149997    7500
    149998    4999
    149999    4700
    Name: price, Length: 150000, dtype: int64

This part visualizes the relationships between multiple variables. For more on visualization, this article is a good reference: https://www.jianshu.com/p/6e18d21a4cad

    ## 5) Regression relationships between multiple variables
    fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
    features = ['v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14', 'v_13']
    # One regression plot of price against each feature, filled row by row.
    for fea, ax in zip(features, axes.flatten()):
        scatter_data = pd.concat([Y_train, Train_data[fea]], axis=1)
        sns.regplot(x=fea, y='price', data=scatter_data, scatter=True, fit_reg=True, ax=ax)

[Figure: 5 × 2 grid of regression plots of price against v_12, v_8, v_0, power, v_5, v_2, v_6, v_1, v_14, and v_13]

2.3.8 Categorical feature analysis

    ## 1) unique-value distribution
    for fea in categorical_features:
        print(Train_data[fea].nunique())

    99662
    248
    40
    8
    7
    2
    2
    7905

    categorical_features

    ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage', 'regionCode']

    ## 2) Box plots of the categorical features
    # name and regionCode are too sparse, so only the denser features are plotted.
    categorical_features = ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
    for c in categorical_features:
        Train_data[c] = Train_data[c].astype('category')
        if Train_data[c].isnull().any():
            Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
            Train_data[c] = Train_data[c].fillna('MISSING')

    def boxplot(x, y, **kwargs):
        sns.boxplot(x=x, y=y)
        x = plt.xticks(rotation=90)

    f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
    # The size argument of FacetGrid was renamed to height in seaborn 0.9.
    g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
    g = g.map(boxplot, "value", "price")

[Figure: grid of box plots of price, one panel per categorical feature]
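The loop above turns missing values into an explicit 'MISSING' level so that they get their own box in the plot. The step is easier to follow on a single invented column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"fuelType": [0.0, 1.0, np.nan, 0.0, np.nan]})

col = toy["fuelType"].astype("category")
if col.isnull().any():
    # A new level must be registered before it can be used as a fill value.
    col = col.cat.add_categories(["MISSING"]).fillna("MISSING")

print(col.value_counts())
```

Calling `fillna('MISSING')` on a categorical column without first adding the level raises an error, because 'MISSING' is not among the existing categories.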

    Train_data.columns

    Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
           'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
           'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
           'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
          dtype='object')

    ## 3) Violin plots of the categorical features
    catg_list = categorical_features
    target = 'price'
    for catg in catg_list:
        sns.violinplot(x=catg, y=target, data=Train_data)
        plt.show()

[Figures: one violin plot of price per categorical feature]

    categorical_features = ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']

    ## 4) Bar plots of the categorical features
    def bar_plot(x, y, **kwargs):
        sns.barplot(x=x, y=y)
        x = plt.xticks(rotation=90)

    f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
    # The size argument of FacetGrid was renamed to height in seaborn 0.9.
    g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
    g = g.map(bar_plot, "value", "price")

[Figure: grid of bar plots of price, one panel per categorical feature]

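    ## 5) Per-category frequency plots of the categorical features (count_plot)
    def count_plot(x, **kwargs):
        sns.countplot(x=x)
        x = plt.xticks(rotation=90)

    f = pd.melt(Train_data, value_vars=categorical_features)
    # The size argument of FacetGrid was renamed to height in seaborn 0.9.
    g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
    g = g.map(count_plot, "value")

[Figure: grid of count plots, one panel per categorical feature]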

2.3.9 Generate a data report with pandas_profiling

pandas_profiling produces a fairly comprehensive visual data report with little effort. Open the resulting HTML file to read it.

    # Newer releases of this package are published under the name ydata-profiling.
    import pandas_profiling

    pfr = pandas_profiling.ProfileReport(Train_data)
    pfr.to_file("./example.html")

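2.4 Lessons Learned

The EDA steps given here are the widely used, general ones. In real work, whether in engineering or in competitions, this is only the first step, and also the most basic one.

Next, the data usually has to be analyzed together with model performance and feature engineering to see how it actually behaves in modeling. Drawing on your own understanding and on the literature, form a judgment about the real problem and understand it in depth.

Finally, iterate between EDA and data processing and mining to reach a better data structure and distribution, and features that are more strongly correlated with the target.


In machine learning, data exploration is usually called EDA (Exploratory Data Analysis):

It is a data analysis method that explores existing data (especially raw data obtained from surveys or observation) under as few prior assumptions as possible, using plots, tables, equation fitting, summary statistics, and similar means to uncover the structure and regularities of the data.

Data exploration helps reveal properties of the data and relationships within it, which is very useful for the feature construction that follows.

  • For an initial analysis of the data (looking at the data directly, or using summary functions such as .sum(), .mean(), .describe()), consider: the number of samples, the size of the training set, whether there are time features, whether it is a time-series problem, what each feature means (for non-anonymous features), the feature types (string, int, float, time), how values are missing (note how missing values appear in the data: some are empty, some use a symbol such as "NAN"), and the mean and variance of each feature.

  • Analyze and record how to handle features for which more than 30% of the samples are missing; this helps later model validation and tuning. Decide whether such a feature should be filled (and how: mean, zero, mode, etc.), dropped, or whether the samples should first be split into groups that are predicted with different feature models.

  • Analyze outliers separately: check whether the label of a sample with abnormal feature values is itself abnormal (far from the mean, or a special symbol), whether the outlier should be removed or replaced with a normal value, and whether it stems from a recording error or from a fault in the machine itself.

  • Analyze the label separately, for example its distribution.

  • Further analysis can plot the features, and plot features jointly with the label (statistical plots, scatter plots), to get an intuitive view of the feature distributions. This step can also reveal outliers in the data. Box plots show how far feature values deviate, and joint plots of feature against feature and of feature against label help analyze the relationships between them.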
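The filling strategies mentioned in the list above can be sketched as follows. The data is invented, and the choice of median for a continuous feature and mode for a categorical code is a common convention rather than something prescribed by the article:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "power": [60.0, np.nan, 163.0, 193.0, np.nan, 75.0],
    "bodyType": [1.0, 2.0, np.nan, 1.0, 1.0, np.nan],
})

filled = toy.copy()
# Continuous feature: the median is less sensitive to outliers than the mean.
filled["power"] = toy["power"].fillna(toy["power"].median())
# Categorical code: fill with the most frequent value (mode).
filled["bodyType"] = toy["bodyType"].fillna(toy["bodyType"].mode().iloc[0])
# Keep a flag so a model can still see where values were missing.
filled["power_was_missing"] = toy["power"].isnull()
print(filled)
```

The fill values should be computed on the training set and reused for the test set, so that both are transformed in the same way.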
