[Algorithm Competition Study] Used Car Transaction Price Prediction - Baseline

Published: 2023/12/15

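Used Car Transaction Price Prediction - Baseline

Baseline v1.0

Tip: This is a first, minimal baseline, meant as a starting point. It provides a basic baseline and a basic introduction to the competition workflow. Discussion is welcome.

Competition: Introduction to Data Mining for Beginners - Used Car Transaction Price Prediction

Link: https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX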

    # List the data files in the datalab directory
    !ls datalab/231784

Step 1: Import the toolbox

    ## Basic tools
    import numpy as np
    import pandas as pd
    import warnings
    import matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.special import jn
    from IPython.display import display, clear_output
    import time

    warnings.filterwarnings('ignore')
    %matplotlib inline

    ## Models
    from sklearn import linear_model
    from sklearn import preprocessing
    from sklearn.svm import SVR
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

    ## Dimensionality reduction
    from sklearn.decomposition import PCA, FastICA, FactorAnalysis, SparsePCA

    import lightgbm as lgb
    import xgboost as xgb

    ## Parameter search and evaluation
    from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, train_test_split
    from sklearn.metrics import mean_squared_error, mean_absolute_error

Step 2: Read the data

    ## Read the data with pandas (a convenient library for loading tabular data).
    ## The files are space-separated, hence sep=' '.
    Train_data = pd.read_csv('datalab/231784/used_car_train_20200313.csv', sep=' ')
    TestA_data = pd.read_csv('datalab/231784/used_car_testA_20200313.csv', sep=' ')

    ## Print the size of each dataset
    print('Train data shape:', Train_data.shape)
    print('TestA data shape:', TestA_data.shape)

Output:

    Train data shape: (150000, 31)
    TestA data shape: (50000, 30)
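The effect of sep=' ' is easy to check on a small in-memory sample. The sketch below is only an illustration: the three rows are made-up stand-ins in the same space-separated layout, not the competition file.

```python
import io

import pandas as pd

# Hypothetical three-row sample in a space-separated layout.
# Column names are borrowed from the real file; the values are illustrative.
sample = io.StringIO(
    "SaleID name price\n"
    "0 736 1850\n"
    "1 2262 3600\n"
    "2 14874 6222\n"
)

# sep=' ' tells pandas to split fields on single spaces instead of commas.
df = pd.read_csv(sample, sep=' ')

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # every column here is parsed as an integer
```

If sep were left at its default (a comma), each line would be read as a single column, which is a quick way to spot a wrong separator.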

1) Quick look at the data

    ## Use .head() to take a quick look at the loaded data
    Train_data.head()

Output (the columns between kilometer and v_5 are elided in the original display, so the table is shown in two parts):

       SaleID    name   regDate  model  brand  bodyType  fuelType  gearbox  power  kilometer
    0       0     736  20040402   30.0      6       1.0       0.0      0.0     60       12.5
    1       1    2262  20030301   40.0      1       2.0       0.0      0.0      0       15.0
    2       2   14874  20040403  115.0     15       1.0       0.0      0.0    163       12.5
    3       3   71865  19960908  109.0     10       0.0       0.0      1.0    193       15.0
    4       4  111080  20120103  110.0      5       1.0       0.0      0.0     68        5.0

            v_5       v_6       v_7       v_8       v_9       v_10       v_11       v_12       v_13       v_14
    0  0.235676  0.101988  0.129549  0.022816  0.097462  -2.881803   2.804097  -2.420821   0.795292   0.914762
    1  0.264777  0.121004  0.135731  0.026597  0.020582  -4.900482   2.096338  -1.030483  -1.722674   0.245522
    2  0.251410  0.114912  0.165147  0.062173  0.027075  -4.846749   1.803559   1.565330  -0.832687  -0.229963
    3  0.274293  0.110300  0.121964  0.033395  0.000000  -4.509599   1.285940  -0.501868  -2.438353  -0.478699
    4  0.228036  0.073205  0.091880  0.078819  0.121534  -1.896240   0.910783   0.931110   2.834518   1.923482

5 rows × 31 columns

2) Inspect data info

    ## .info() shows the column names and the non-null counts, which reveal missing (NaN) values
    Train_data.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 150000 entries, 0 to 149999
    Data columns (total 31 columns):
    SaleID               150000 non-null int64
    name                 150000 non-null int64
    regDate              150000 non-null int64
    model                149999 non-null float64
    brand                150000 non-null int64
    bodyType             145494 non-null float64
    fuelType             141320 non-null float64
    gearbox              144019 non-null float64
    power                150000 non-null int64
    kilometer            150000 non-null float64
    notRepairedDamage    150000 non-null object
    regionCode           150000 non-null int64
    seller               150000 non-null int64
    offerType            150000 non-null int64
    creatDate            150000 non-null int64
    price                150000 non-null int64
    v_0                  150000 non-null float64
    v_1                  150000 non-null float64
    v_2                  150000 non-null float64
    v_3                  150000 non-null float64
    v_4                  150000 non-null float64
    v_5                  150000 non-null float64
    v_6                  150000 non-null float64
    v_7                  150000 non-null float64
    v_8                  150000 non-null float64
    v_9                  150000 non-null float64
    v_10                 150000 non-null float64
    v_11                 150000 non-null float64
    v_12                 150000 non-null float64
    v_13                 150000 non-null float64
    v_14                 150000 non-null float64
    dtypes: float64(20), int64(10), object(1)
    memory usage: 35.5+ MB

    ## Use .columns to list the column names
    Train_data.columns

Output:

    Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
           'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode',
           'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3',
           'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12',
           'v_13', 'v_14'],
          dtype='object')

    TestA_data.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 50000 entries, 0 to 49999
    Data columns (total 30 columns):
    SaleID               50000 non-null int64
    name                 50000 non-null int64
    regDate              50000 non-null int64
    model                50000 non-null float64
    brand                50000 non-null int64
    bodyType             48587 non-null float64
    fuelType             47107 non-null float64
    gearbox              48090 non-null float64
    power                50000 non-null int64
    kilometer            50000 non-null float64
    notRepairedDamage    50000 non-null object
    regionCode           50000 non-null int64
    seller               50000 non-null int64
    offerType            50000 non-null int64
    creatDate            50000 non-null int64
    v_0                  50000 non-null float64
    v_1                  50000 non-null float64
    v_2                  50000 non-null float64
    v_3                  50000 non-null float64
    v_4                  50000 non-null float64
    v_5                  50000 non-null float64
    v_6                  50000 non-null float64
    v_7                  50000 non-null float64
    v_8                  50000 non-null float64
    v_9                  50000 non-null float64
    v_10                 50000 non-null float64
    v_11                 50000 non-null float64
    v_12                 50000 non-null float64
    v_13                 50000 non-null float64
    v_14                 50000 non-null float64
    dtypes: float64(20), int64(9), object(1)
    memory usage: 11.4+ MB
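The non-null counts reported by .info() for the training set translate directly into missing-value counts. As a small worked example, the arithmetic below uses only the numbers printed above; on the real DataFrame, Train_data.isnull().sum() should report the same counts.

```python
# Non-null counts copied from the Train_data.info() output above.
total_rows = 150000
non_null = {
    "model": 149999,
    "bodyType": 145494,
    "fuelType": 141320,
    "gearbox": 144019,
}

# Missing count = total rows minus non-null rows.
missing = {col: total_rows - count for col, count in non_null.items()}

for col, count in missing.items():
    print(f"{col}: {count} missing ({count / total_rows:.2%})")
```

Only these four training columns have gaps, and fuelType has the most (8680 rows, a little under 6%).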

3) Summary statistics

    ## .describe() shows summary statistics for the numerical feature columns
    Train_data.describe()

Output (the columns between kilometer and v_5 are elided in the original display, so the table is shown in two parts):

                  SaleID           name        regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer
    count  150000.000000  150000.000000   1.500000e+05  149999.000000  150000.000000  145494.000000  141320.000000  144019.000000  150000.000000  150000.000000
    mean    74999.500000   68349.172873   2.003417e+07      47.129021       8.052733       1.792369       0.375842       0.224943     119.316547      12.597160
    std     43301.414527   61103.875095   5.364988e+04      49.536040       7.864956       1.760640       0.548677       0.417546     177.168419       3.919576
    min         0.000000       0.000000   1.991000e+07       0.000000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000
    25%     37499.750000   11156.000000   1.999091e+07      10.000000       1.000000       0.000000       0.000000       0.000000      75.000000      12.500000
    50%     74999.500000   51638.000000   2.003091e+07      30.000000       6.000000       1.000000       0.000000       0.000000     110.000000      15.000000
    75%    112499.250000  118841.250000   2.007111e+07      66.000000      13.000000       3.000000       1.000000       0.000000     150.000000      15.000000
    max    149999.000000  196812.000000   2.015121e+07     247.000000      39.000000       7.000000       6.000000       1.000000   19312.000000      15.000000

                     v_5            v_6            v_7            v_8            v_9           v_10           v_11           v_12           v_13           v_14
    count  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000  150000.000000
    mean        0.248204       0.044923       0.124692       0.058144       0.061996      -0.001000       0.009035       0.004813       0.000313      -0.000688
    std         0.045804       0.051743       0.201410       0.029186       0.035692       3.772386       3.286071       2.517478       1.288988       1.038685
    min         0.000000       0.000000       0.000000       0.000000       0.000000      -9.168192      -5.558207      -9.639552      -4.153899      -6.546556
    25%         0.243615       0.000038       0.062474       0.035334       0.033930      -3.722303      -1.951543      -1.871846      -1.057789      -0.437034
    50%         0.257798       0.000812       0.095866       0.057014       0.058484       1.624076      -0.358053      -0.130753      -0.036245       0.141246
    75%         0.265297       0.102009       0.125243       0.079382       0.087491       2.844357       1.255022       1.776933       0.942813       0.680378
    max         0.291838       0.151420       1.404936       0.160791       0.222787      12.357011      18.819042      13.847792      11.147669       8.658418

8 rows × 30 columns

    TestA_data.describe()

Output (the columns between kilometer and v_5 are elided in the original display, so the table is shown in two parts):

                  SaleID           name        regDate          model          brand       bodyType       fuelType        gearbox          power      kilometer
    count   50000.000000   50000.000000   5.000000e+04   50000.000000   50000.000000   48587.000000   47107.000000   48090.000000   50000.000000   50000.000000
    mean   174999.500000   68542.223280   2.003393e+07      46.844520       8.056240       1.782185       0.373405       0.224350     119.883620      12.595580
    std     14433.901067   61052.808133   5.368870e+04      49.469548       7.819477       1.760736       0.546442       0.417158     185.097387       3.908979
    min    150000.000000       0.000000   1.991000e+07       0.000000       0.000000       0.000000       0.000000       0.000000       0.000000       0.500000
    25%    162499.750000   11203.500000   1.999091e+07      10.000000       1.000000       0.000000       0.000000       0.000000      75.000000      12.500000
    50%    174999.500000   52248.500000   2.003091e+07      29.000000       6.000000       1.000000       0.000000       0.000000     109.000000      15.000000
    75%    187499.250000  118856.500000   2.007110e+07      65.000000      13.000000       3.000000       1.000000       0.000000     150.000000      15.000000
    max    199999.000000  196805.000000   2.015121e+07     246.000000      39.000000       7.000000       6.000000       1.000000   20000.000000      15.000000

                     v_5            v_6            v_7            v_8            v_9           v_10           v_11           v_12           v_13           v_14
    count   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000   50000.000000
    mean        0.248669       0.045021       0.122744       0.057997       0.062000      -0.017855      -0.013742      -0.013554      -0.003147       0.001516
    std         0.044601       0.051766       0.195972       0.029211       0.035653       3.747985       3.231258       2.515962       1.286597       1.027360
    min         0.000000       0.000000       0.000000       0.000000       0.000000      -9.160049      -5.411964      -8.916949      -4.123333      -6.112667
    25%         0.243762       0.000044       0.062644       0.035084       0.033714      -3.700121      -1.971325      -1.876703      -1.060428      -0.437920
    50%         0.257877       0.000815       0.095828       0.057084       0.058764       1.613212      -0.355843      -0.142779      -0.035956       0.138799
    75%         0.265328       0.102025       0.125438       0.079077       0.087489       2.832708       1.262914       1.764335       0.941469       0.681163
    max         0.291618       0.153265       1.358813       0.156355       0.214775      12.338872      18.856218      12.950498       5.913273       2.624622

8 rows × 29 columns

Step 3: Build features and labels

1) Extract the numerical and categorical column names

    numerical_cols = Train_data.select_dtypes(exclude='object').columns
    print(numerical_cols)

Output:

    Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
           'gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType',
           'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6',
           'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],
          dtype='object')

    categorical_cols = Train_data.select_dtypes(include='object').columns
    print(categorical_cols)

Output:

    Index(['notRepairedDamage'], dtype='object')

2) Build training and test samples

    ## Select the feature columns: drop IDs, dates, the label, and high-cardinality codes,
    ## then drop every column whose name contains 'Type'
    feature_cols = [col for col in numerical_cols if col not in ['SaleID', 'name', 'regDate', 'creatDate', 'price', 'model', 'brand', 'regionCode', 'seller']]
    feature_cols = [col for col in feature_cols if 'Type' not in col]

    ## Extract the feature columns and the label column to build the training and test samples
    X_data = Train_data[feature_cols]
    Y_data = Train_data['price']

    X_test = TestA_data[feature_cols]

    print('X train shape:', X_data.shape)
    print('X test shape:', X_test.shape)

Output:

    X train shape: (150000, 18)
    X test shape: (50000, 18)

    ## A small statistics helper, used for the summaries later on
    def Sta_inf(data):
        print('_min', np.min(data))
        print('_max:', np.max(data))
        print('_mean', np.mean(data))
        print('_ptp', np.ptp(data))   # peak to peak: max - min
        print('_std', np.std(data))
        print('_var', np.var(data))
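The feature selection in this step can be checked without the data, since it only operates on column names. The sketch below replays the two filters on the numerical column list printed in part 1) and confirms that 18 columns remain; merging the two list comprehensions into one is a presentational choice, not the original code.

```python
# Numerical columns as printed by select_dtypes(exclude='object') above.
numerical_cols = (
    ['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType',
     'gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType',
     'creatDate', 'price']
    + [f'v_{i}' for i in range(15)]
)

# Columns dropped explicitly by the baseline.
excluded = {'SaleID', 'name', 'regDate', 'creatDate', 'price',
            'model', 'brand', 'regionCode', 'seller'}

# Same result as the two consecutive list comprehensions in the baseline.
feature_cols = [c for c in numerical_cols
                if c not in excluded and 'Type' not in c]

print(len(feature_cols))
print(feature_cols)
```

The 'Type' filter removes bodyType, fuelType, and offerType, so the baseline trains on gearbox, power, kilometer, and the fifteen anonymous v_0 to v_14 features.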

3) Basic distribution statistics of the label

    print('Sta of label:')
    Sta_inf(Y_data)

Output:

    Sta of label:
    _min 11
    _max: 99999
    _mean 5923.32733333
    _ptp 99988
    _std 7501.97346988
    _var 56279605.9427

    ## Plot a histogram of the label to inspect its distribution
    plt.hist(Y_data)
    plt.show()
    plt.close()

4) Fill missing values with -1

    X_data = X_data.fillna(-1)
    X_test = X_test.fillna(-1)
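A toy example makes the effect of the fill visible. The frame below is hypothetical (two of the real column names with made-up values); the point is only that -1 lies outside the observed range of gearbox, whose minimum in the statistics above is 0, so filled rows stay distinguishable from real values.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with one missing gearbox value.
toy = pd.DataFrame({
    "gearbox": [0.0, np.nan, 1.0],
    "power": [60, 0, 163],
})

# Replace every NaN with the sentinel -1; other values are untouched.
filled = toy.fillna(-1)

print(filled)
print("Remaining NaNs:", int(filled.isnull().sum().sum()))
```

Tree-based models such as the ones used below can split on the sentinel, which is presumably why the baseline prefers it to dropping rows; for linear models a sentinel like this would usually be a poor choice.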

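Step 4: Model training and prediction

1) Use xgb with 5-fold cross-validation to check how the model parameters perform

    ## xgb model
    xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,
                           colsample_bytree=0.9, max_depth=7)  # , objective='reg:squarederror'

    scores_train = []
    scores = []

    ## 5-fold cross-validation.
    ## StratifiedKFold treats the integer prices as class labels;
    ## KFold is the more usual splitter for a regression target.
    sk = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for train_ind, val_ind in sk.split(X_data, Y_data):
        train_x = X_data.iloc[train_ind].values
        train_y = Y_data.iloc[train_ind]
        val_x = X_data.iloc[val_ind].values
        val_y = Y_data.iloc[val_ind]

        xgr.fit(train_x, train_y)
        pred_train_xgb = xgr.predict(train_x)
        pred_xgb = xgr.predict(val_x)

        score_train = mean_absolute_error(train_y, pred_train_xgb)
        scores_train.append(score_train)
        score = mean_absolute_error(val_y, pred_xgb)
        scores.append(score)

    ## Average over all five folds
    print('Train mae:', np.mean(scores_train))
    print('Val mae', np.mean(scores))

Output of the original run (there the train line printed np.mean(score_train), so the Train mae value is the last fold's score rather than the five-fold average):

    Train mae: 628.086664863
    Val mae 715.990013454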
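The cross-validation loop above can be sketched on synthetic data to show the bookkeeping in isolation. This is a minimal illustration under stated assumptions: LinearRegression stands in for XGBRegressor so the example has no extra dependencies, KFold replaces StratifiedKFold because the target is continuous, and the data is randomly generated rather than taken from the competition.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data (an assumption; not the competition data).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores_train, scores_val = [], []

for train_idx, val_idx in kf.split(X):
    # Fit on four folds, evaluate on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores_train.append(
        mean_absolute_error(y[train_idx], model.predict(X[train_idx])))
    scores_val.append(
        mean_absolute_error(y[val_idx], model.predict(X[val_idx])))

# Report the mean over all folds, not the score of the last fold.
print('Train MAE:', np.mean(scores_train))
print('Val MAE:', np.mean(scores_val))
```

Keeping one list per split and averaging at the end is what makes the train and validation numbers comparable; a gap between them is a rough indicator of overfitting.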

2) Define the xgb and lgb model functions

    def build_model_xgb(x_train, y_train):
        model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,
                                 colsample_bytree=0.9, max_depth=7)  # , objective='reg:squarederror'
        model.fit(x_train, y_train)
        return model

    def build_model_lgb(x_train, y_train):
        estimator = lgb.LGBMRegressor(num_leaves=127, n_estimators=150)
        # Grid search over the learning rate only
        param_grid = {
            'learning_rate': [0.01, 0.05, 0.1, 0.2],
        }
        gbm = GridSearchCV(estimator, param_grid)
        gbm.fit(x_train, y_train)
        return gbm
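The grid search in build_model_lgb passes no scoring argument, so GridSearchCV falls back to the estimator's own score method, which for scikit-learn style regressors is R². Since the rest of this baseline reports MAE, one option is to select the learning rate by MAE as well. The sketch below shows that variant under assumptions: GradientBoostingRegressor stands in for LGBMRegressor, the data is synthetic, and cv=3 and n_estimators=30 are chosen only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (an assumption; not the competition data).
rng = np.random.RandomState(0)
X = rng.normal(size=(150, 4))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=150)

# Stand-in estimator; the baseline uses lgb.LGBMRegressor here.
estimator = GradientBoostingRegressor(n_estimators=30, random_state=0)
param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2]}

# scikit-learn maximizes scores, so MAE is exposed as its negative.
search = GridSearchCV(estimator, param_grid,
                      scoring='neg_mean_absolute_error', cv=3)
search.fit(X, y)

print('Best params:', search.best_params_)
print('Best CV MAE:', -search.best_score_)
```

After fitting, search.predict uses the best configuration refit on all the data passed to fit, which is why the baseline can return the GridSearchCV object and call predict on it directly.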

3) Split the dataset (Train, Val) for model training, evaluation, and prediction

    ## Split off a validation set
    x_train, x_val, y_train, y_val = train_test_split(X_data, Y_data, test_size=0.3)

    print('Train lgb...')
    model_lgb = build_model_lgb(x_train, y_train)
    val_lgb = model_lgb.predict(x_val)
    MAE_lgb = mean_absolute_error(y_val, val_lgb)
    print('MAE of val with lgb:', MAE_lgb)

    ## Retrain on the full training data before predicting on the test set
    print('Predict lgb...')
    model_lgb_pre = build_model_lgb(X_data, Y_data)
    subA_lgb = model_lgb_pre.predict(X_test)
    print('Sta of Predict lgb:')
    Sta_inf(subA_lgb)

Output:

    Train lgb...
    MAE of val with lgb: 689.084070621
    Predict lgb...
    Sta of Predict lgb:
    _min -519.150259864
    _max: 88575.1087721
    _mean 5922.98242599
    _ptp 89094.259032
    _std 7377.29714126
    _var 54424513.1104

    print('Train xgb...')
    model_xgb = build_model_xgb(x_train, y_train)
    val_xgb = model_xgb.predict(x_val)
    MAE_xgb = mean_absolute_error(y_val, val_xgb)
    print('MAE of val with xgb:', MAE_xgb)

    print('Predict xgb...')
    model_xgb_pre = build_model_xgb(X_data, Y_data)
    subA_xgb = model_xgb_pre.predict(X_test)
    print('Sta of Predict xgb:')
    Sta_inf(subA_xgb)

Output:

    Train xgb...
    MAE of val with xgb: 715.37757816
    Predict xgb...
    Sta of Predict xgb:
    _min -165.479
    _max: 90051.8
    _mean 5922.9
    _ptp 90217.3
    _std 7361.13
    _var 5.41862e+07

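4) Weighted fusion of the two models' results

    ## A simple weighted fusion: each model's weight is 1 minus its share of the total MAE,
    ## so the model with the lower validation MAE gets the larger weight
    val_Weighted = (1 - MAE_lgb/(MAE_xgb + MAE_lgb))*val_lgb + (1 - MAE_xgb/(MAE_xgb + MAE_lgb))*val_xgb
    # Some predictions are negative, but a real price can never be negative, so correct them after the fact
    val_Weighted[val_Weighted < 0] = 10
    print('MAE of val with Weighted ensemble:', mean_absolute_error(y_val, val_Weighted))

Output:

    MAE of val with Weighted ensemble: 687.275745703

    sub_Weighted = (1 - MAE_lgb/(MAE_xgb + MAE_lgb))*subA_lgb + (1 - MAE_xgb/(MAE_xgb + MAE_lgb))*subA_xgb

    ## Inspect the distribution of the predicted values
    plt.hist(sub_Weighted)
    plt.show()
    plt.close()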
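Plugging the validation MAEs reported above into the weighting formula shows how close the two weights are. The helper name mae_weights is hypothetical, introduced only for this worked example.

```python
def mae_weights(mae_a, mae_b):
    """Return (w_a, w_b), where each weight is 1 minus that model's share
    of the total MAE. For two models the weights always sum to 1."""
    total = mae_a + mae_b
    return 1 - mae_a / total, 1 - mae_b / total


# Validation MAEs copied from the outputs above.
MAE_lgb = 689.084070621
MAE_xgb = 715.37757816

w_lgb, w_xgb = mae_weights(MAE_lgb, MAE_xgb)
print(round(w_lgb, 4), round(w_xgb, 4))  # 0.5094 0.4906
```

With MAEs this similar the blend is almost a plain average, which is consistent with the ensemble's validation MAE (687.28) being only slightly better than lgb alone (689.08).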

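5) Output the results

    sub = pd.DataFrame()
    # X_test no longer has a SaleID column (it was excluded from feature_cols),
    # so X_test.SaleID would raise an AttributeError. The row index reproduces the
    # output below; use TestA_data['SaleID'] if the submission needs the IDs from the test file.
    sub['SaleID'] = X_test.index
    sub['price'] = sub_Weighted
    sub.to_csv('./sub_Weighted.csv', index=False)

    sub.head()

Output:

       SaleID         price
    0       0  39533.727414
    1       1    386.081960
    2       2   7791.974571
    3       3  11835.211966
    4       4    585.420407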