Used Car Price Prediction | Building an AI Model and Deploying a Web Application ⛵
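💡 Author: 韓信子@ShowMeAI
📘 Data Analysis in Practice series: https://www.showmeai.tech/tutorials/40
📘 Machine Learning in Practice series: https://www.showmeai.tech/tutorials/41
📘 Article URL: https://www.showmeai.tech/article-detail/300
📢 Notice: All rights reserved. To repost, please contact the platform and the author and credit the source.
📢 Bookmark ShowMeAI for more content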
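A used-car report from RESEARCH AND MARKETS projects that the global used-car market will grow at a compound annual growth rate of 6.1% from 2022 to 2030, reaching USD 2.67 trillion by 2030. The widespread use of artificial intelligence has increased transparency between owners and buyers, improved the purchasing experience, and strongly driven the growth of the used-car market.
Machine-learning-based price estimation is already widely used on used-car trading platforms. In this article, ShowMeAI builds a complete model for estimating used-car prices and deploys it as a web application.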
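💡 Data Analysis, Processing & Feature Engineering
The dataset used in this case study is available from the 🏆 Kaggle car price prediction dataset, or can be downloaded directly from ShowMeAI's Baidu Netdisk.
🏆 Dataset download (Baidu Netdisk): reply 『實戰』 to the official account 『ShowMeAI研究中心』, or click here to get the files for this article: [11] Building an AI Model and Deploying a Web App to Predict Used-Car Prices, 『CarPrice used-car price prediction dataset』
ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub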
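① Data Exploration
For the tools and skills used in data analysis and processing, see the corresponding ShowMeAI tutorials and cheat sheets.
- Illustrated Data Analysis: From Beginner to Expert tutorial series
- Data Science Tool Cheat Sheets | Pandas Cheat Sheet
- Data Science Tool Cheat Sheets | Seaborn Cheat Sheet
We first load the data and take a quick look at it.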

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import pickle

    # Jupyter magic: render plots inline in the notebook
    %matplotlib inline

    df = pd.read_csv('CarPrice_Assignment.csv')
    df.head()

A preview of the DataFrame is shown below:
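Let's analyze the attribute fields to see which are most correlated with price. We start by computing the correlation matrix:

    # Restrict to numeric columns; recent pandas versions raise an error on text columns otherwise
    df.corr(numeric_only=True)

Next, we visualize the correlations as a heatmap.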

    sns.set(rc={"figure.figsize": (20, 20)})
    sns.heatmap(df.corr(numeric_only=True), annot=True)

The correlation of each field with price is shown in the figure below. Some fields are very strongly correlated with the target.
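The same information can also be read off numerically. The following is a minimal sketch, assuming a small made-up DataFrame in place of `CarPrice_Assignment.csv` (the toy values and the helper name `corr_with_target` are illustrative assumptions, not part of the dataset or of pandas), of how numeric columns might be ranked by their correlation with `price`:

```python
import pandas as pd

def corr_with_target(frame: pd.DataFrame, target: str) -> pd.Series:
    """Return each numeric column's Pearson correlation with `target`,
    ordered by absolute value (strongest first)."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "enginesize": [90, 110, 130, 150, 180, 210],
    "citympg":    [38, 31, 27, 24, 19, 16],
    "noise":      [2, 3, 1, 1, 3, 2],
    "brand":      ["a", "b", "a", "c", "b", "c"],   # non-numeric, ignored
    "price":      [6000, 8000, 10500, 13000, 17000, 22000],
})

ranked = corr_with_target(toy, "price")
print(ranked)

# Columns whose |r| falls below a chosen threshold are candidates for removal
weak = ranked[ranked.abs() < 0.15].index.tolist()
print("weakly correlated:", weak)
```

Sorting by absolute value matters here: a strongly negative correlation (such as fuel economy against price) is just as informative as a strongly positive one.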
We can also plot each numeric field against the target field price for a closer look:

    for col in df.columns:
        if df[col].dtypes != 'object':
            sns.lmplot(data=df, x=col, y='price')

The resulting plots are shown below:
We drop the fields that are only weakly correlated with price (|r| < 0.15):

    df.drop(['car_ID'], axis=1, inplace=True)
    to_drop = ['peakrpm', 'compressionratio', 'stroke', 'symboling']
    # Pass the list of column names directly
    df.drop(to_drop, axis=1, inplace=True)

② Feature Engineering
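For the methods and skills used in feature engineering, see the corresponding ShowMeAI tutorial.
- Machine Learning in Practice | A Complete Guide to Feature Engineering
The CarName column contains both the brand and the model. We split it and keep only the brand:

    df['CarName'] = df['CarName'].apply(lambda x: x.split()[0])

Output: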
Some brands appear under aliases or with misspellings, so we clean the data as follows:

    df['CarName'] = df['CarName'].str.lower()
    df['CarName'] = df['CarName'].replace({
        'vw': 'volkswagen',
        'vokswagen': 'volkswagen',
        'toyouta': 'toyota',
        'maxda': 'mazda',
        'porcshce': 'porsche',
    })

Next, we plot the number of cars per brand:

    sns.set(rc={'figure.figsize': (30, 10)})
    sns.countplot(data=df, x='CarName')

③ Feature Encoding & Data Transformation
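Next we do some further feature engineering:
- Categorical features
Most machine learning models cannot handle categorical data directly, so we encode it ourselves. Categorical features can be encoded with ordinal encoding or one-hot encoding (see the ShowMeAI article Machine Learning in Practice | A Complete Guide to Feature Engineering for details). One-hot encoding is illustrated below:
- Numerical features
Different models call for different treatments, such as scaling and adjusting the distribution.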
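To make the two encodings named above concrete, here is a minimal sketch on a made-up `fueltype` column (the toy values are assumptions for illustration). Ordinal encoding maps each category to an integer, while one-hot encoding creates one indicator column per category:

```python
import pandas as pd

toy = pd.DataFrame({"fueltype": ["gas", "diesel", "gas", "gas"]})

# Ordinal encoding: one integer code per category (codes follow sorted category order)
ordinal = toy["fueltype"].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(toy["fueltype"], prefix="fueltype", dtype=int)

print(ordinal.tolist())
print(one_hot)
```

Ordinal codes imply an ordering between categories, which a linear model may misread for nominal fields such as brand or fuel type; one-hot encoding avoids this at the cost of more columns.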
Below, we first split the dataset's fields into two groups, categorical and numerical:

    categorical = []
    numerical = []
    for col in df.columns:
        if df[col].dtypes == 'object':
            categorical.append(col)
        else:
            numerical.append(col)

Next, we use the dummy-variable transformation in pandas to one-hot encode all features marked as "categorical".

    # One-hot encoding
    x1 = pd.get_dummies(df[categorical], drop_first=False)
    x2 = df[numerical]
    X = pd.concat([x2, x1], axis=1)
    X.drop('price', axis=1, inplace=True)

Now we turn to the numerical features. First, let's look at the label field price and plot its distribution:
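
    sns.histplot(data=df, x="price", kde=True)

The plot shows a skewed distribution. We apply a log transform to bring it closer to a normal distribution. (Another consideration: if we train on the log-transformed label, mapping predictions back to price uses the exponential function, which guarantees that the predicted price is always positive.) The code is as follows:

    # Correct the skewed distribution
    df["price_log"] = np.log(df["price"])
    sns.histplot(data=df, x="price_log", kde=True)

After the correction, the distribution is closer to normal. With this basic processing done, we are ready to start modeling.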
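The effect of the log transform can also be checked numerically. The sketch below uses a handful of made-up, right-skewed prices (assumed values, not taken from the dataset) to compare skewness before and after the transform, and to show that mapping log-scale predictions back with `exp` always yields positive prices:

```python
import numpy as np
import pandas as pd

# Right-skewed toy prices (hypothetical values, for illustration only)
price = pd.Series(
    [5200, 6100, 6900, 7500, 8200, 9900, 12500, 16500, 31000, 45000], dtype=float
)
price_log = np.log(price)

print("skew before:", round(price.skew(), 3))
print("skew after: ", round(price_log.skew(), 3))

# Predictions made on the log scale are mapped back with exp,
# which is positive for any real input, even a negative one.
fake_log_predictions = np.array([-2.0, 0.0, 9.1])
restored = np.exp(fake_log_predictions)
print(restored)
```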
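💡 Machine Learning Modeling
① Dataset Split & Data Transformation
Let's split the dataset into training and test sets and apply some basic data transformations: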
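
    # Split the data
    from sklearn.model_selection import train_test_split

    y = df['price_log']
    # Cast to float so the scaled values can be written back into the same columns
    X_train, X_test, y_train, y_test = train_test_split(
        X.astype(float), y, test_size=0.333, random_state=1)
    X_train = X_train.copy()
    X_test = X_test.copy()

    # Feature engineering: scaling
    from sklearn.preprocessing import StandardScaler

    sc = StandardScaler()
    # Scale only the numeric feature columns (price is the target, not a feature)
    num_cols = [c for c in numerical if c != 'price']
    # Fit on the training set only, then apply the same transform to the test set
    X_train[num_cols] = sc.fit_transform(X_train[num_cols])
    X_test[num_cols] = sc.transform(X_test[num_cols])

② Modeling & Tuning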
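For the methods and skills used in modeling, see the corresponding ShowMeAI tutorial.
- Machine Learning in Practice | A Complete Guide to SKLearn
Our dataset is small (few samples). Weighing model complexity against performance, we first test 4 models to see which performs best.
- Lasso regression
- Ridge regression
- Random forest regressor
- XGBoost regressor
We first import the corresponding models from scikit-learn and XGBoost:

    # Regression models
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.ensemble import RandomForestRegressor
    import xgboost as xgb

③ Modeling pipeline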
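To keep the whole modeling process compact and concise, we create a pipeline to train and tune the models. The steps are:
- Train and evaluate each model with arbitrary, untuned hyperparameters.
- Tune each model's hyperparameters with grid search.
- Retrain and evaluate each model with the best parameters found.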
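The three steps above can be sketched on synthetic data before applying them to the car dataset. This is a minimal illustration, assuming a single Ridge model and randomly generated features (the data, the coefficients, and the variable names are made up for the example):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (stand-in for the car features)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.7, 0.0]) + rng.normal(scale=0.1, size=120)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.333, random_state=1)

# Step 1: train and evaluate with a hand-picked hyperparameter
untuned = Ridge(alpha=100).fit(X_tr, y_tr)
r2_untuned = r2_score(y_te, untuned.predict(X_te))

# Step 2: grid search over candidate values
grid = GridSearchCV(Ridge(), {"alpha": [0.005, 0.05, 0.1, 1, 10, 100]}, cv=2)
grid.fit(X_tr, y_tr)

# Step 3: retrain with the best parameters and evaluate again
tuned = Ridge(**grid.best_params_).fit(X_tr, y_tr)
r2_tuned = r2_score(y_te, tuned.predict(X_te))

print("untuned R2:", round(r2_untuned, 4))
print("best params:", grid.best_params_)
print("tuned R2:", round(r2_tuned, 4))
```

The test set is touched only in steps 1 and 3; the grid search itself scores candidates by cross-validation on the training set.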
We first import grid search from scikit-learn:

    from sklearn.model_selection import GridSearchCV

Next, we build an evaluation function that prints the metrics of each fitted model (R², root mean squared error, and mean absolute error):

    from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

    def metrics(model):
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # R² on the log-price scale
        r2 = round(r2_score(y_test, y_pred), 4)
        print('R2_Score: ', r2)
        # RMSE in original price units (undo the log with exp)
        rmse = round(np.sqrt(mean_squared_error(np.exp(y_test), np.exp(y_pred))), 2)
        print("RMSE: ", rmse)
        # MAE in original price units
        mae = round(mean_absolute_error(np.exp(y_test), np.exp(y_pred)), 2)
        print("MAE: ", mae)
        return r2, rmse, mae

Now we build the pipeline:

    # Candidate models with hand-picked (untuned) hyperparameters
    models = {
        'rfr': RandomForestRegressor(bootstrap=False, max_depth=15, max_features='sqrt',
                                     min_samples_split=2, n_estimators=100),
        'lasso': Lasso(alpha=0.005, fit_intercept=True),
        'ridge': Ridge(alpha=10, fit_intercept=True),
        # bootstrap, max_features and min_samples_split are random-forest options, not XGBoost ones
        'xgb': xgb.XGBRegressor(max_depth=2, n_estimators=100),
    }

    # Search grids: each model family has its own hyperparameters
    rfr_params = {
        "n_estimators": [10, 100, 1000, 2000, 4000, 6000],
        "max_features": [1.0, "sqrt", "log2"],  # 1.0 means all features
        "max_depth": [2, 4, 8, 12, 15],
        "min_samples_split": [2, 4, 8],
        "bootstrap": [True, False],
    }
    xgb_params = {
        "n_estimators": [10, 100, 1000, 2000, 4000, 6000],
        "max_depth": [2, 4, 8, 12, 15],
    }
    linear_params = {
        "alpha": [0.005, 0.05, 0.1, 1, 10, 100, 290, 500],
        "fit_intercept": [True, False],
    }
    searches = {
        'rfr': (RandomForestRegressor(), rfr_params),
        'xgb': (xgb.XGBRegressor(), xgb_params),
        'lasso': (Lasso(), linear_params),
        'ridge': (Ridge(), linear_params),
    }

    # Evaluate each untuned model, then run its grid search
    for name, model in models.items():
        print('Untuned metrics for: ', name)
        metrics(model)
        print('\n')
        print('Starting grid search for: ', name)
        estimator, params = searches[name]
        grid = GridSearchCV(estimator, params, verbose=5, cv=2)
        grid.fit(X_train, y_train)
        print("Best score: ", grid.best_score_)
        print("Best params: ", grid.best_params_)

The results for the untuned models are as follows:
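Without hyperparameter tuning, the R² scores differ little across models, but Lasso has the lowest error.
Now let's look at the grid search results, which give the best parameters for each model:
Now let's apply these parameters to each model and look at the results:
After tuning, every model improves over its untuned hyperparameters, but Lasso regression still performs best (which is related to the sample size and feature correlations of this dataset). We keep the Lasso regression model as the final model and save it locally.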
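
    lasso_reg = Lasso(alpha=0.005, fit_intercept=True)
    # Fit before saving so that the pickled model can make predictions
    lasso_reg.fit(X_train, y_train)
    with open('model.pkl', 'wb') as f:
        pickle.dump(lasso_reg, f)

💡 Web Application Development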
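Next we deploy the model to a web page as an application that returns estimates in real time. We use the gradio library to develop the web application. Prediction in the web application involves the following steps:
- The user enters data in a web form
- The data is processed (feature encoding & transformation)
- The data is arranged to match the model's input format
- The price is predicted and shown to the user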
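The encoding and alignment steps are the easiest to get wrong, because a single form submission only produces dummy columns for the categories it actually contains. The sketch below illustrates one way to handle this, assuming toy training data and a hypothetical helper `predict_price` (neither is the article's actual app code): the record is one-hot encoded and then reindexed to the training column layout.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Toy training data (hypothetical values) with one categorical and one numeric field
train = pd.DataFrame({
    "fueltype":   ["gas", "diesel", "gas", "diesel", "gas", "gas"],
    "horsepower": [70, 90, 110, 120, 150, 200],
    "price":      [6000, 9500, 11000, 14500, 17000, 26000],
})
X_train = pd.get_dummies(train[["horsepower", "fueltype"]], dtype=float)
train_columns = X_train.columns          # column layout the model expects
model = Lasso(alpha=0.005).fit(X_train, np.log(train["price"]))

def predict_price(record: dict) -> float:
    """Encode one form submission, align it to the training columns, and predict."""
    row = pd.get_dummies(pd.DataFrame([record]), dtype=float)
    # Dummy columns for categories absent from this record are filled with 0
    row = row.reindex(columns=train_columns, fill_value=0.0)
    # The model was trained on log(price), so undo the log with exp
    return float(np.exp(model.predict(row)[0]))

print(round(predict_price({"fueltype": "diesel", "horsepower": 100}), 2))
```

Reindexing against the saved training columns keeps the feature order identical between training and serving, whichever categories the user happens to select.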
① Basic Development
First, we load the original dataset and the processed (one-hot encoded) dataset, and keep their respective columns.

    # Columns of the original df
    df = pd.read_csv('df_columns')
    df.drop(['Unnamed: 0', 'price'], axis=1, inplace=True)
    cols = df.columns

    # Columns of the one-hot encoded df
    dummy = pd.read_csv('dummy_df')
    dummy.drop('Unnamed: 0', axis=1, inplace=True)
    cols_to_use = dummy.columns

Next, for the categorical features, we build the dropdown options for the web application:

    # Build the candidate values used in the app
    # Car brands, capitalized
    cars = df['CarName'].unique().tolist()
    carNameCap = []
    for col in cars:
        carNameCap.append(col.capitalize())

    # fueltype field
    fuel = df['fueltype'].unique().tolist()
    fuelCap = []
    for fu in fuel:
        fuelCap.append(fu.capitalize())

    # carbody, engine type, and fuel system fields
    carb = df['carbody'].unique().tolist()
    engtype = df['enginetype'].unique().tolist()
    fuelsys = df['fuelsystem'].unique().tolist()

For the categorical fields that the model needs, we will build dropdowns populated with these candidate values.
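Next we define a function that processes the data and returns the estimated price:

    # Transform the input data to match the model
    def transform(data):
        # Scaler for the numeric features
        sc = StandardScaler()
        # Load the model
        with open('model.pkl', 'rb') as f:
            model = pickle.load(f)
        # New data as a DataFrame
        new_df = pd.DataFrame([data], columns=cols)
        # Separate categorical and numerical features
        cat = []
        num = []
        for col in new_df.columns:
            if new_df[col].dtypes == 'object':
                cat.append(col)
            else:
                num.append(col)
        x1_new = pd.get_dummies(new_df[cat], drop_first=False)
        x2_new = new_df[num]
        X_new = pd.concat([x2_new, x1_new], axis=1)
        # Align with the training columns; dummy columns absent from the input become 0
        final_df = X_new.reindex(columns=cols_to_use, fill_value=0)
        # Append the training rows so the scaler sees realistic means and variances
        final_df = pd.concat([final_df, dummy])
        X_new = final_df.astype(float).values
        # Scale the numeric columns (assumes dummy_df keeps them first, as in X)
        X_new[:, :len(num)] = sc.fit_transform(X_new[:, :len(num)])
        # The new record is the first row
        output = model.predict(X_new[:1])
        return "The price of the car " + str(round(np.exp(output)[0], 2)) + "$"

Next we create the elements of the gradio web application: dropdown menus or radio buttons for the categorical fields, and sliders for the numerical fields. Reference code: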

    # Categorical
    car = gr.Dropdown(label="Car brand", choices=carNameCap)
    # Numerical
    curbweight = gr.Slider(label="Weight of the car (in pounds)", minimum=500, maximum=6000)

Now, let's add everything to the interface:
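The full interface code appears in WebApp.py later in this article. As a reduced illustration, the sketch below wires just the two components above into a `gr.Interface`. The function `demo_predict`, its pricing rule, and the helper `build_app` are placeholders invented for this example and involve no trained model:

```python
def demo_predict(car: str, curbweight: float) -> str:
    """Placeholder for predict_price: formats a made-up estimate (no model involved)."""
    estimate = 2.5 * curbweight          # hypothetical rule, for illustration only
    return f"The price of the {car} is about {estimate:.2f}$"

def build_app(brands):
    """Assemble a two-input gradio interface around demo_predict."""
    import gradio as gr   # imported here so demo_predict can be tried without gradio installed

    car = gr.Dropdown(label="Car brand", choices=brands)
    curbweight = gr.Slider(label="Weight of the car (in pounds)", minimum=500, maximum=6000)
    return gr.Interface(
        title="Predict the price of a car based on its specs",
        fn=demo_predict,
        inputs=[car, curbweight],
        outputs=gr.Textbox(),
    )

print(demo_predict("Toyota", 2000))
```

Calling `build_app(carNameCap).launch()` would start the local server. The order of `inputs` must match the order of the function's parameters, which is why the real app lists all twenty components in the same order as the arguments of its prediction function.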
Everything is ready for deployment!
② Deployment
Now let's deploy the application. First, we configure the application's IP address and port:

    export GRADIO_SERVER_NAME=0.0.0.0
    export GRADIO_SERVER_PORT="$PORT"

Make sure the following dependencies are installed with pip:

    numpy
    pandas
    scikit-learn
    gradio
    Flask
    argparse
    gunicorn
    rq

Then run python WebApp.py to test the application. The contents of WebApp.py are as follows:

    import gradio as gr
    import numpy as np
    import pandas as pd
    import pickle
    from sklearn.preprocessing import StandardScaler

    # Lookup dictionaries: form labels -> dataset values
    asp = {'Standard': 'std', 'Turbo': 'turbo'}
    drivew = {'Rear wheel drive': 'rwd', 'Front wheel drive': 'fwd', '4 wheel drive': '4wd'}
    cylnum = {2: 'two', 3: 'three', 4: 'four', 5: 'five', 6: 'six', 8: 'eight', 12: 'twelve'}

    # Column names of the original df
    df = pd.read_csv('df_columns')
    df.drop(['Unnamed: 0', 'price'], axis=1, inplace=True)
    cols = df.columns

    # Column names after one-hot encoding
    dummy = pd.read_csv('dummy_df')
    dummy.drop('Unnamed: 0', axis=1, inplace=True)
    cols_to_use = dummy.columns

    # Car brand names
    cars = df['CarName'].unique().tolist()
    carNameCap = []
    for col in cars:
        carNameCap.append(col.capitalize())

    # Fuel types
    fuel = df['fueltype'].unique().tolist()
    fuelCap = []
    for fu in fuel:
        fuelCap.append(fu.capitalize())

    # Car body, engine type, fuel system
    carb = df['carbody'].unique().tolist()
    engtype = df['enginetype'].unique().tolist()
    fuelsys = df['fuelsystem'].unique().tolist()

    # Transform the input data to match the model
    def transform(data):
        # Scaler for the numeric features
        sc = StandardScaler()
        # Load the model
        with open('model.pkl', 'rb') as f:
            lasso_reg = pickle.load(f)
        # New data as a DataFrame
        new_df = pd.DataFrame([data], columns=cols)
        # Separate categorical and numerical fields
        cat = []
        num = []
        for col in new_df.columns:
            if new_df[col].dtypes == 'object':
                cat.append(col)
            else:
                num.append(col)
        # Build the data in the format the model expects
        x1_new = pd.get_dummies(new_df[cat], drop_first=False)
        x2_new = new_df[num]
        X_new = pd.concat([x2_new, x1_new], axis=1)
        # Align with the training columns; dummy columns absent from the input become 0
        final_df = X_new.reindex(columns=cols_to_use, fill_value=0)
        # Append the training rows so the scaler sees realistic means and variances
        final_df = pd.concat([final_df, dummy])
        X_new = final_df.astype(float).values
        # Scale the numeric columns (assumes dummy_df keeps them first, as in X)
        X_new[:, :len(num)] = sc.fit_transform(X_new[:, :len(num)])
        # The new record is the first row
        output = lasso_reg.predict(X_new[0].reshape(1, -1))
        return "The price of the car " + str(round(np.exp(output)[0], 2)) + "$"

    # Main function for price estimation
    def predict_price(car, fueltype, aspiration, doornumber, carbody, drivewheel,
                      enginelocation, wheelbase, carlength, carwidth, carheight, curbweight,
                      enginetype, cylindernumber, enginesize, fuelsystem, boreratio,
                      horsepower, citympg, highwaympg):
        new_data = [car.lower(), fueltype.lower(), asp[aspiration], doornumber.lower(),
                    carbody, drivew[drivewheel], enginelocation.lower(), wheelbase,
                    carlength, carwidth, carheight, curbweight, enginetype,
                    cylnum[cylindernumber], enginesize, fuelsystem, boreratio,
                    horsepower, citympg, highwaympg]
        return transform(new_data)

    # Input components
    car = gr.Dropdown(label="Car brand", choices=carNameCap)
    fueltype = gr.Radio(label="Fuel Type", choices=fuelCap)
    aspiration = gr.Radio(label="Aspiration type", choices=["Standard", "Turbo"])
    doornumber = gr.Radio(label="Number of doors", choices=["Two", "Four"])
    carbody = gr.Dropdown(label="Car body type", choices=carb)
    drivewheel = gr.Radio(label="Drive wheel", choices=['Rear wheel drive', 'Front wheel drive', '4 wheel drive'])
    enginelocation = gr.Radio(label="Engine location", choices=['Front', 'Rear'])
    wheelbase = gr.Slider(label="Distance between the wheels on the side of the car (in inches)", minimum=50, maximum=300)
    carlength = gr.Slider(label="Length of the car (in inches)", minimum=50, maximum=300)
    carwidth = gr.Slider(label="Width of the car (in inches)", minimum=50, maximum=300)
    carheight = gr.Slider(label="Height of the car (in inches)", minimum=50, maximum=300)
    curbweight = gr.Slider(label="Weight of the car (in pounds)", minimum=500, maximum=6000)
    enginetype = gr.Dropdown(label="Engine type", choices=engtype)
    cylindernumber = gr.Radio(label="Cylinder number", choices=[2, 3, 4, 5, 6, 8, 12])
    enginesize = gr.Slider(label="Engine size (swept volume of all the pistons inside the cylinders)", minimum=50, maximum=500)
    fuelsystem = gr.Dropdown(label="Fuel system (link to ressource: ", choices=fuelsys)
    boreratio = gr.Slider(label="Bore ratio (ratio between cylinder bore diameter and piston stroke)", minimum=1, maximum=6)
    horsepower = gr.Slider(label="Horse power of the car", minimum=25, maximum=400)
    citympg = gr.Slider(label="Mileage in city (in km)", minimum=0, maximum=100)
    highwaympg = gr.Slider(label="Mileage on highway (in km)", minimum=0, maximum=100)

    # Output component and interface
    Output = gr.Textbox()
    app = gr.Interface(title="Predict the price of a car based on its specs",
                       fn=predict_price,
                       inputs=[car, fueltype, aspiration, doornumber, carbody, drivewheel,
                               enginelocation, wheelbase, carlength, carwidth, carheight,
                               curbweight, enginetype, cylindernumber, enginesize, fuelsystem,
                               boreratio, horsepower, citympg, highwaympg],
                       outputs=Output)
    app.launch()

The final application looks as follows. You can select and enter feature values yourself and get an estimate from the model!
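References
- 🏆 Dataset download (Baidu Netdisk): reply 『實戰』 to the official account 『ShowMeAI研究中心』, or click here to get the files for this article: [11] Building an AI Model and Deploying a Web App to Predict Used-Car Prices, 『CarPrice used-car price prediction dataset』
- ShowMeAI official GitHub: https://github.com/ShowMeAI-Hub
- 📘 Illustrated Data Analysis: From Beginner to Expert tutorial series https://www.showmeai.tech/tutorials/33
- 📘 Data Science Tool Cheat Sheets | Pandas Cheat Sheet https://www.showmeai.tech/article-detail/101
- 📘 Data Science Tool Cheat Sheets | Seaborn Cheat Sheet https://www.showmeai.tech/article-detail/105
- 📘 Machine Learning in Practice | A Complete Guide to Feature Engineering https://www.showmeai.tech/article-detail/208
- 📘 Machine Learning in Practice | A Complete Guide to SKLearn https://www.showmeai.tech/article-detail/203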