
Analysis of US Traffic Accidents (2017) (Project Exercise 5)

Published: 2024/3/7

Contents

        • 1. Project Overview
        • 2. Data Processing (for analysis only; modeling preprocessing comes later)
        • 3. Data Visualization
        • 4. Predicting Severity with XGBoost
          • 4.1 Preprocessing for Modeling

1. Project Overview

Purpose: data-analysis practice.
Data source: Kaggle.
Source code, dataset and field descriptions (Baidu Cloud link):
Address: https://pan.baidu.com/s/1UD5HD69bNEsX2EkjaQ1IPg
Extraction code: 8gd8
Analysis goals:

  • Basic analysis and visualization: which states have the most accidents, when accidents tend to happen, weather conditions at the time of the accidents, and an overall picture of US traffic accidents in 2017
  • Use XGBoost to predict accident severity and identify which factors are most related to it

2. Data Processing (for analysis only; modeling preprocessing comes later)

The original dataset (US_Accidents_Dec19.csv) has 49 columns and roughly 3 million rows of accidents from 2016 to 2019. Given hardware and time constraints, only the 2017 accidents are analyzed here (see the source files for details).

```python
# Extract the 2017 records
import pandas as pd

data = pd.read_csv('./US_Accidents_Dec19.csv')
datacopy = data.copy()
datacopy['Start_Time'] = pd.to_datetime(datacopy['Start_Time'])
datacopy['year'] = datacopy['Start_Time'].apply(lambda x: x.year)
data1 = datacopy[datacopy['year'] == 2017]
data1.to_csv('./USaccident2017.csv')
```

Next, analyze USaccident2017.csv. Import the required packages:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import folium
import webbrowser
from pyecharts import options as opts
from pyecharts.charts import Page, Pie, Bar, Line, Scatter
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
import xgboost as xgb

data = pd.read_csv('./USaccident2017.csv')
data.shape   # (717483, 51)
data.head()
```

(`data.head()` prints the first five rows across all 51 columns — ID, Source, TMC, Severity, start/end times, coordinates, weather fields, road-feature flags, twilight fields, etc. — output omitted here for brevity.)

Field descriptions: https://www.jianshu.com/p/9e597dc8ae71

```python
# Check for missing values
data.isnull().sum()[data.isnull().sum() != 0]
```

```python
# Handle missing values
# Drop columns that are irrelevant or not analyzed
deletelist = ['Unnamed: 0', 'ID', 'TMC', 'End_Lat', 'End_Lng', 'Airport_Code',
              'Weather_Timestamp', 'Wind_Chill(F)', 'Civil_Twilight',
              'Nautical_Twilight', 'Astronomical_Twilight', 'year', 'Number']
data1 = data.drop(deletelist, axis=1)

# Drop rows with missing values in key columns
data1 = data1.dropna(axis=0, subset=['City', 'Zipcode', 'Timezone', 'Sunrise_Sunset'])

# Fill temperature, humidity, pressure and visibility with the mean
for col in ['Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)']:
    data1[col] = data1[col].fillna(data1[col].mean())

# Fill wind speed with nearest-neighbour interpolation
data1['Wind_Speed(mph)'] = data1['Wind_Speed(mph)'].interpolate(method='nearest')

# Fill weather condition and wind direction with the mode
# (note the [0]: Series.mode() returns a Series, and fillna with a Series
# aligns on the index instead of broadcasting a scalar)
data1['Weather_Condition'] = data1['Weather_Condition'].fillna(data1['Weather_Condition'].mode()[0])
data1['Wind_Direction'] = data1['Wind_Direction'].fillna(data1['Wind_Direction'].mode()[0])

# Missing precipitation means no rain: fill with 0
data1['Precipitation(in)'] = data1['Precipitation(in)'].fillna(0)

# Merge wind-direction labels that mean the same thing
occupation = {"CALM": "Calm", "N": "North", "S": "South",
              "W": "West", "E": "East", "VAR": "Variable"}
f = lambda x: occupation.get(x, x)  # look up x in occupation, fall back to x
data1['Wind_Direction'] = data1['Wind_Direction'].map(f)

# Finally reset the index, since rows were dropped
data1.index = range(len(data1))
```
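The `.mode()[0]` above matters: `Series.mode()` returns a Series, and `fillna` with a Series aligns on the index rather than filling every gap with the modal value. A minimal sketch of the difference:

```python
import pandas as pd

s = pd.Series(['a', None, 'a', None, 'b'])

# fillna with the mode Series aligns on index: mode() has index [0], and
# position 0 of s is not NaN, so the NaNs at positions 1 and 3 survive.
wrong = s.fillna(s.mode())

# Taking [0] gives the scalar 'a', which fills every NaN.
right = s.fillna(s.mode()[0])

print(wrong.isna().sum())  # 2
print(right.isna().sum())  # 0
```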

3. Data Visualization

```python
# States with the most accidents
a = (Bar(init_opts=opts.InitOpts(width="2000px", height="400px"))
     .add_xaxis(data1['State'].value_counts().index.tolist())
     .add_yaxis('Accidents per state', data1['State'].value_counts().tolist(), color='#499C9F')
     .set_series_opts(label_opts=opts.LabelOpts(is_show=False)))
a.render_notebook()
```


The five states with the most accidents are CA (California), TX (Texas), FL (Florida), NY (New York) and NC (North Carolina) — all relatively developed, populous regions.

```python
# Accidents by hour of day
x1 = pd.DatetimeIndex(data1["Start_Time"]).hour.value_counts().sort_index().index.tolist()
x = [str(i) for i in x1]  # pyecharts needs string labels
from pyecharts.charts import Line
b = (Line(init_opts=opts.InitOpts(width="1000px", height="400px"))
     .add_xaxis(x)
     .add_yaxis('Accidents per hour',
                pd.DatetimeIndex(data1["Start_Time"]).hour.value_counts().sort_index().tolist(),
                color='#F7BA0B', is_smooth=True)
     .set_series_opts(label_opts=opts.LabelOpts(is_show=False),
                      markarea_opts=opts.MarkAreaOpts(data=[
                          opts.MarkAreaItem(name="Morning rush", x=("6", "9")),
                          opts.MarkAreaItem(name="Evening rush", x=("15", "18"))]))
     .set_global_opts(xaxis_opts=opts.AxisOpts(name='Hour', name_location="center", name_gap=40)))
b.render_notebook()
```


The morning and evening rush hours stand out clearly — these are the most congested times with the heaviest traffic.

```python
# Accidents by month
x1 = pd.DatetimeIndex(data1["Start_Time"]).month.value_counts().sort_index().index.tolist()
x = [str(i) for i in x1]  # pyecharts needs string labels
from pyecharts.charts import Line
q = (Line(init_opts=opts.InitOpts(width="1000px", height="400px"))
     .add_xaxis(x)
     .add_yaxis('Accidents per month',
                pd.DatetimeIndex(data1["Start_Time"]).month.value_counts().sort_index().tolist(),
                color='#AED54C', is_smooth=True,
                areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
     .set_global_opts(xaxis_opts=opts.AxisOpts(name='month', name_location="center", name_gap=40)))
q.render_notebook()
```


下半年事故發生數明顯多于上半年,也許是因為下半年節假日較多且工作量較多

```python
# Accidents by weather condition (top 10)
weather10 = data1['Weather_Condition'].value_counts().head(10)
c = (Bar()
     .add_xaxis(weather10.index.tolist())
     .add_yaxis('Accidents by weather condition', weather10.tolist(), color='#48A43F')
     .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
     .set_global_opts(
         xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15))  # rotate long labels
     ))
c.render_notebook()
```


Most accidents happen in clear weather, but the third and fourth most common conditions are overcast or cloudy, so weather does have some influence on accidents.

```python
# Severity of accidents that happened in clear weather
Clear_weather = data1[data1['Weather_Condition'] == 'Clear'].copy()
occupation = {1: "Minor", 2: "Moderate", 3: "Serious", 4: "Severe"}
f = lambda x: occupation.get(x, x)  # look up x in occupation, fall back to x
Clear_weather['Severity'] = Clear_weather['Severity'].map(f)
counts = Clear_weather['Severity'].value_counts()
d = (Pie()
     .add("severity", [list(z) for z in zip(counts.index.tolist(), counts.tolist())])
     .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%")))
d.render_notebook()
```

```python
# Environmental conditions at the time of the accidents
# Day/night flag: 1 = day
m = lambda x: 1 if x == 'Day' else 0
data1['Sunrise_Sunset'] = data1['Sunrise_Sunset'].apply(m)
# Rain flag: 1 = any precipitation
m = lambda x: 1 if x > 0 else 0
data1['PrecipitationORnot'] = data1['Precipitation(in)'].apply(m)

df0 = pd.concat([data1['Crossing'].value_counts(),
                 data1['PrecipitationORnot'].value_counts(),
                 data1['Sunrise_Sunset'].value_counts().sort_index(),
                 data1['Traffic_Signal'].value_counts(),
                 data1['Give_Way'].value_counts(),
                 data1['Bump'].value_counts()], axis=1)

h = (Bar()
     .add_xaxis(['Crossing', 'Precipitation', 'Daytime', 'Traffic signal', 'Give way', 'Speed bump'])
     .add_yaxis("0", df0.loc[False].tolist(), stack="stack1", color='#992572')
     .add_yaxis("1", df0.loc[True].tolist(), stack="stack1", color='#4A203B')
     .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
     .set_global_opts(title_opts=opts.TitleOpts(title="Stacked bar (all records)")))
h.render_notebook()
```


Most accidents happened in daytime, away from crossings, with no precipitation, no traffic signal, no give-way sign and no speed bump.

```python
# Visibility at the time of the accident
# Binned according to the Baidu Baike visibility scale
data1["Visibility_bin"] = "Poor"
data1.loc[(data1["Visibility(mi)"] > 2.5) & (data1["Visibility(mi)"] <= 6.5), "Visibility_bin"] = "Moderate"
data1.loc[(data1["Visibility(mi)"] > 6.5) & (data1["Visibility(mi)"] <= 12), "Visibility_bin"] = "Good"
data1.loc[data1["Visibility(mi)"] > 12, "Visibility_bin"] = "Very good"
counts = data1["Visibility_bin"].value_counts()
d = (Pie()
     .add("visibility", [list(z) for z in zip(counts.index.tolist(), counts.tolist())])
     .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
     .set_global_opts(title_opts=opts.TitleOpts(title="Visibility at time of accident")))
d.render_notebook()
```


In most accidents visibility was good, i.e. between 6.5 and 12 miles (1 mile ≈ 1.6 km).
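The chained `.loc` assignments above can also be written as a single `pd.cut` call, which builds the same bins more compactly — a minimal sketch with the same bin edges:

```python
import pandas as pd

vis = pd.Series([1.0, 5.0, 10.0, 20.0])  # sample visibility values in miles
bins = [-float('inf'), 2.5, 6.5, 12, float('inf')]
labels = ['Poor', 'Moderate', 'Good', 'Very good']
vis_bin = pd.cut(vis, bins=bins, labels=labels)
print(vis_bin.tolist())  # ['Poor', 'Moderate', 'Good', 'Very good']
```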

```python
# Draw a US map with folium and plot 3000 randomly sampled accident locations
incidents = folium.map.FeatureGroup()
datasample = data1.sample(3000)

# Add a circle marker for each sampled accident
for lat, lng in zip(datasample.Start_Lat, datasample.Start_Lng):
    incidents.add_child(folium.CircleMarker(
        [lat, lng],
        radius=3,        # size of the circle marker
        color='yellow',
        fill=True,
        fill_color='red',
        fill_opacity=0.4))

# Add the markers to the map
US_map = folium.Map(location=[38, -100], zoom_start=4)
US_map.add_child(incidents)
```


Most accidents occur in the developed coastal regions.

4. Predicting Severity with XGBoost

4.1 Preprocessing for Modeling
```python
dataX = data1.copy()
# Extract month and hour
# (parse Start_Time first: it was read back from CSV as a string)
dataX['Start_Time'] = pd.to_datetime(dataX['Start_Time'])
dataX['month'] = dataX['Start_Time'].apply(lambda x: x.month)
dataX['hour'] = dataX['Start_Time'].apply(lambda x: x.hour)

# Drop features that are useless for modelling
deletelist2 = ['Source', 'Side', 'Start_Time', 'End_Time', 'Description', 'Street',
               'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone', 'Wind_Direction']
dataX = dataX.drop(deletelist2, axis=1)

# Convert False/True to 0/1
list3 = ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway',
         'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
m = lambda x: 1 if x else 0
for i in list3:
    dataX[i] = dataX[i].apply(m)

# Severity 1 is far too rare (extreme imbalance), so drop it
dataX = dataX.drop(index=dataX.loc[dataX['Severity'] == 1].index)
dataX.index = range(len(dataX))
dataX.head()
```

(`dataX.head()` shows the remaining columns: Severity, coordinates, Distance(mi), the weather features, the 0/1 road-feature flags, Sunrise_Sunset, month and hour — output omitted.)
```python
y = dataX['Severity']              # label column
Xw = dataX['Weather_Condition']    # column to one-hot encode
X = dataX.drop(['Severity', 'Weather_Condition'], axis=1)

# One-hot encoding
enc = OneHotEncoder(categories='auto', handle_unknown='ignore').fit(Xw.values.reshape(-1, 1))
result = enc.transform(Xw.values.reshape(-1, 1)).toarray()
Xw1 = pd.DataFrame(result)
Xw1.shape   # (716669, 78)
```

After one-hot encoding, the Weather_Condition column becomes a (716669, 78) matrix. So many sparse columns risk overfitting, so PCA is used to reduce the dimensionality.

```python
pca = PCA(n_components=5)
pca.fit(Xw1)
col = pca.transform(Xw1)
Xw1 = pd.DataFrame(col)
Xw1.head()
```

```
          0         1         2         3         4
0 -0.365071 -0.105292  0.643741 -0.661940 -0.141977
1 -0.544094  0.715639 -0.290545 -0.018498 -0.060453
2 -0.365071 -0.105292  0.643741 -0.661940 -0.141977
3 -0.482130 -0.685853 -0.468909 -0.024822 -0.071162
4 -0.482130 -0.685853 -0.468909 -0.024822 -0.071162
```
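Cutting 78 one-hot columns down to 5 components is fairly aggressive, so it is worth checking how much variance those components retain via `explained_variance_ratio_`. A minimal sketch on synthetic one-hot data (the real `Xw1` matrix from above would take the place of `onehot`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for a one-hot matrix: 1000 rows, 20 categories
labels = rng.integers(0, 20, size=1000)
onehot = np.eye(20)[labels]

pca = PCA(n_components=5)
pca.fit(onehot)
retained = pca.explained_variance_ratio_.sum()
print(f"variance retained by 5 components: {retained:.2%}")
```

If the retained fraction is low, more components (or skipping PCA and relying on the trees' own column subsampling) may be preferable.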
```python
# Standardize the remaining features
columns = X.columns.tolist()
robustS = RobustScaler()
X = pd.DataFrame(robustS.fit_transform(X), columns=columns)
X.head()
```

(`X.head()` now shows every feature on a robust-scaled range, e.g. Start_Lat ≈ 0.29 and Start_Lng ≈ −1.0 for the first row — full output omitted.)

Concatenate X and Xw1:

```python
X1 = pd.concat([X, Xw1], axis=1)
X1.shape   # (716669, 30)

# XGBoost expects class labels in 0..num_class-1, so map {2, 3, 4} -> {0, 1, 2}
def f(x):
    if x == 2:
        return 0
    elif x == 3:
        return 1
    else:
        return 2

y1 = y.apply(f)
y1.value_counts()
```

```
0    461657
1    230899
2     24113
Name: Severity, dtype: int64
```
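The class counts above are still quite skewed (class 2 is only about 3% of the data). One option, not used in the original, is to pass per-sample weights into `xgb.DMatrix(..., weight=...)`. A minimal sketch of computing inverse-frequency weights (the toy `y` below stands in for `y1`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the imbalanced label column
y = pd.Series([0] * 80 + [1] * 15 + [2] * 5)

# Inverse-frequency weight per class, normalized so the mean class weight is 1
freq = y.value_counts(normalize=True)
class_weight = 1.0 / freq
class_weight = class_weight / class_weight.mean()

# One weight per sample; rare classes get large weights. These would be
# supplied as xgb.DMatrix(X_train, label=y_train, weight=weights)
weights = y.map(class_weight).to_numpy()
print(class_weight.round(3).to_dict())
```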

Build the XGBoost model:

```python
param1 = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 3,                # number of classes, used with multi:softmax
    'gamma': 0.1,                  # post-pruning control; larger = more conservative (typically 0.1-0.2)
    'max_depth': 12,               # tree depth; larger trees overfit more easily
    'lambda': 2,                   # L2 regularization weight; larger = less overfitting
    'subsample': 0.7,              # row subsampling per tree
    'colsample_bytree': 0.7,       # column subsampling per tree
    'min_child_weight': 3,
    'silent': 1,                   # 1 = no runtime output (0 is usually better)
    'eta': 0.007,                  # learning rate
    'seed': 1000,
    'nthread': 4,                  # number of CPU threads
}
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, random_state=0)
xg_train = xgb.DMatrix(X_train, label=y_train)
xg_test = xgb.DMatrix(X_test, label=y_test)
bst1 = xgb.train(param1, xg_train)
pred1 = bst1.predict(xg_test)
print(accuracy_score(y_test, pred1))
```

0.7609220422230595 (accuracy)
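With class 2 at roughly 3% of the data, overall accuracy can hide poor minority-class recall, so per-class precision and recall (not computed in the original) are worth checking. A sketch with scikit-learn, using toy arrays in place of `y_test` and `pred1`:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Toy stand-ins for y_test and pred1
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 2, 0, 0])

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))
```

Here class 2 has recall 1/3 despite reasonable overall accuracy — exactly the pattern a single accuracy number can mask.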
Check feature importance:

```python
from xgboost import plot_importance
plot_importance(bst1)   # the trained model above is bst1 (the original wrote bst3)
plt.show()
```

Hyperparameter tuning (in progress)
