iFLYTEK Developer Competition: Ambient Air Quality Evaluation Challenge baseline
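Preface
The iFLYTEK Developer Competition is in full swing. Every track has challenging problems, and anyone can take part.
Competition page: http://challenge.xfyun.cn/?ch=ds-sq-bm
Ambient Air Quality Evaluation Challenge
Data description
The data can only be downloaded after registering. The dataset is small: the preliminary-round training set and test set each contain only a few hundred rows.
Evaluation metric
Submitted result files are scored with root mean squared error (RMSE).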
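(1) The relative comprehensive pollution index of a sample, IPRC, is used to judge how polluted samples are relative to one another.
(2) RMSE is computed on IPRC: RMSE = sqrt((1/m) * Σ (y - y_pred)^2), where m is the number of samples, y is the true IPRC value, and y_pred is the predicted IPRC value.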
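As a quick check of the metric, the RMSE definition can be written out directly. This is a minimal sketch using only the standard library; the function name rmse and the sample values are made up for illustration and are not competition data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    m = len(y_true)  # number of samples
    squared_errors = ((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sum(squared_errors) / m)


# Made-up IPRC values, for illustration only
y = [0.5, 1.0, 1.5]
y_hat = [0.5, 1.0, 1.8]
print(rmse(y, y_hat))
```

Lower is better: a perfect prediction gives an RMSE of 0, and large individual errors are penalized more heavily than small ones because the errors are squared before averaging.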
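For beginners, a baseline makes it much easier to get started, so I picked an XGBoost model as a first baseline. Its score on the online leaderboard is 0.08247. The code is as follows: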
import math

import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder

# Load the training and test sets and stack them so both get the same preprocessing
train = pd.read_csv('保定2016年.csv')
test = pd.read_csv('石家莊20160701-20170701.csv')
data = pd.concat([train, test])

# Encode the categorical air-quality grade as integers
quality_le = LabelEncoder()
quality_le.fit(data['質量等級'].values)
data['質量等級'] = quality_le.transform(data['質量等級'].values)

# Simple date features
data['日期'] = pd.to_datetime(data['日期'], format='%Y-%m-%d')
data['month'] = data['日期'].dt.month
data['day'] = data['日期'].dt.day
data['weekday'] = data['日期'].dt.weekday

# Rows with an IPRC value are training rows; rows without one are test rows
train_new = data[data['IPRC'].notnull()]
test_new = data[data['IPRC'].isnull()]

train_x = train_new.drop(['日期', 'IPRC'], axis=1)  # training inputs
target = train_new['IPRC']                          # training labels
test_x = test_new.drop(['日期', 'IPRC'], axis=1)    # test inputs

# XGBoost
# Recent xgboost releases expect early_stopping_rounds in the constructor;
# on older releases, pass it to fit() instead.
xlf = xgb.XGBRegressor(max_depth=7, learning_rate=0.05, n_estimators=10000,
                       subsample=0.8, early_stopping_rounds=100)

answers = []
score = 0
n_fold = 5
folds = KFold(n_splits=n_fold, shuffle=True, random_state=1314)
for fold_n, (train_index, valid_index) in enumerate(folds.split(train_x)):
    # KFold yields positional indices, so select with .iloc on both inputs and labels
    X_train, X_valid = train_x.iloc[train_index], train_x.iloc[valid_index]
    y_train, y_valid = target.iloc[train_index], target.iloc[valid_index]
    xlf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=100)
    y_pre = xlf.predict(X_valid)
    fold_mse = mean_squared_error(y_valid, y_pre)
    print('Fold validation MSE: ' + str(fold_mse))
    score = score + fold_mse
    # Predict the test set with this fold's model
    y_pred_valid = xlf.predict(test_x)
    answers.append(y_pred_valid)

# Average the test predictions over all folds
xgb_pre = sum(answers) / n_fold
print('xgb validation RMSE: ' + str(math.sqrt(score / n_fold)))

# Save the results at the end
result = pd.DataFrame()
result['date'] = test['日期']
result['IPRC'] = xgb_pre
result.to_csv('空氣質量.csv', index=False)
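To see what the preprocessing step of the baseline produces without the competition files, the sketch below applies the same label encoding and date features to a tiny hand-made table. The helper name add_features, the grade labels 'A' and 'B', and the dates are assumptions for illustration; only the column names and the transformations come from the baseline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def add_features(df):
    """Return a copy with the grade column integer-encoded and date features added."""
    out = df.copy()
    # Map each distinct grade string to an integer code
    out['質量等級'] = LabelEncoder().fit_transform(out['質量等級'])
    # Parse the date strings and derive calendar features
    out['日期'] = pd.to_datetime(out['日期'], format='%Y-%m-%d')
    out['month'] = out['日期'].dt.month
    out['day'] = out['日期'].dt.day
    out['weekday'] = out['日期'].dt.weekday  # Monday is 0, Sunday is 6
    return out


# Hypothetical rows, not taken from the competition data
toy = pd.DataFrame({
    '日期': ['2016-01-01', '2016-01-02', '2016-07-04'],
    '質量等級': ['A', 'B', 'A'],
})
features = add_features(toy)
print(features)
```

The model is trained on numeric columns only, which is why the grade column is integer-encoded and the raw date column is dropped once month, day, and weekday have been extracted.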
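My knowledge is limited, so please bear with any mistakes and point them out. Questions are welcome for discussion. Good luck to everyone in the competition!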