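iFLYTEK Developer Competition: Ambient Air Quality Evaluation Challenge baseline
Preface
The iFLYTEK Developer Competition is in full swing. The problems in every track are challenging, and anyone can take part.
Competition URL: http://challenge.xfyun.cn/?ch=ds-sq-bm
Ambient Air Quality Evaluation Challenge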
Data description
The data can be downloaded only after registering. The dataset is small: the preliminary-round training set and test set each contain only a few hundred rows.
Evaluation metric
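Models are scored on the submitted result file using the root mean squared error (RMSE).
(1) Each sample has a relative comprehensive pollution coefficient, IPRC, which is used to judge how polluted samples are relative to one another.
(2) RMSE is computed on IPRC: RMSE = sqrt((1/m) * sum((y - y_pred)^2)), where m is the number of samples, y is the true IPRC value, and y_pred is the predicted IPRC value.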
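As a quick check of the metric described above, here is a minimal sketch of RMSE in plain Python. The function name `rmse` and the sample values are illustrative assumptions; they do not come from the competition's official scoring code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted IPRC values.

    Hypothetical helper for illustration; not the official scorer.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    m = len(y_true)  # number of samples
    squared_errors = ((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sum(squared_errors) / m)


if __name__ == "__main__":
    # Made-up values, only to show the call.
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Note that taking the square root of the average of per-fold MSE values, as a cross-validation script might do, matches the overall RMSE exactly only when all folds have the same size; otherwise it is a close approximation.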
For beginners, a baseline makes it easier to get started, so I chose an XGBoost model as a first baseline. It scores 0.08247 on the online leaderboard. The code is as follows:
import math

import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder

# Load the training data (Baoding) and the test data (Shijiazhuang)
train = pd.read_csv('保定2016年.csv')
test = pd.read_csv('石家庄20160701-20170701.csv')
data = pd.concat([train, test])

# Encode the categorical air-quality grade as integers
quality_le = LabelEncoder()
data['质量等级'] = quality_le.fit_transform(data['质量等级'].values)

# Simple date features
data['日期'] = pd.to_datetime(data['日期'], format='%Y-%m-%d')
data['month'] = data['日期'].dt.month
data['day'] = data['日期'].dt.day
data['weekday'] = data['日期'].dt.weekday

# Rows with an IPRC label are training rows; rows without one are test rows
train_new = data[data['IPRC'].notnull()]
test_new = data[data['IPRC'].isnull()]

train_x = train_new.drop(['日期', 'IPRC'], axis=1)  # training features
target = train_new['IPRC']                          # training labels
test_x = test_new.drop(['日期', 'IPRC'], axis=1)    # test features

# XGBoost regressor
xlf = xgb.XGBRegressor(max_depth=7, learning_rate=0.05,
                       n_estimators=10000, subsample=0.8)

answers = []  # test-set predictions, one array per fold
score = 0     # running sum of per-fold validation MSE
n_fold = 5
folds = KFold(n_splits=n_fold, shuffle=True, random_state=1314)

for fold_n, (train_index, valid_index) in enumerate(folds.split(train_x)):
    # KFold yields positional indices, so select rows with .iloc
    X_train, X_valid = train_x.iloc[train_index], train_x.iloc[valid_index]
    y_train, y_valid = target.iloc[train_index], target.iloc[valid_index]

    # Note: recent xgboost releases expect early_stopping_rounds in the
    # XGBRegressor constructor rather than in fit()
    xlf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
            verbose=100, early_stopping_rounds=100)

    y_pre = xlf.predict(X_valid)
    fold_mse = mean_squared_error(y_valid, y_pre)
    print('Validation MSE for this fold: ' + str(fold_mse))
    score = score + fold_mse

    # Predict the test set with this fold's model
    y_pred_valid = xlf.predict(test_x)
    answers.append(y_pred_valid)

# Average the test predictions across folds
xgb_pre = sum(answers) / n_fold
print('xgb validation RMSE: ' + str(math.sqrt(score / n_fold)))

# Save the results at the end
result = pd.DataFrame()
result['date'] = test['日期']
result['IPRC'] = xgb_pre
result.to_csv('空气质量.csv', index=False)
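To make the date handling in the baseline concrete, the sketch below derives the same three features (month, day, weekday) for a single date string using only the standard library. The helper name `date_features` is a hypothetical illustration; the baseline itself uses the pandas `.dt` accessors. Both conventions number weekdays from Monday = 0.

```python
from datetime import datetime


def date_features(date_str, fmt="%Y-%m-%d"):
    """Return month, day and weekday (Monday=0) for one date string.

    Hypothetical helper mirroring the pandas .dt.month / .dt.day /
    .dt.weekday features used in the baseline script.
    """
    d = datetime.strptime(date_str, fmt)
    return {"month": d.month, "day": d.day, "weekday": d.weekday()}


if __name__ == "__main__":
    # 2016-07-01 is the first date in the test file's name; it was a Friday.
    print(date_features("2016-07-01"))
    # → {'month': 7, 'day': 1, 'weekday': 4}
```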
My knowledge is limited, so please bear with any mistakes and point them out. Questions and discussion are welcome. I wish everyone good results in the competition!