
Analysis of US Traffic Accidents (2017) (Project Exercise 5)

Published: 2024/3/7

Contents

        • 1. Project Overview
        • 2. Data Processing (for analysis only; modeling preprocessing comes later)
        • 3. Data Visualization
        • 4. Predicting Severity with XGBoost
          • 4.1 Preprocessing for Modeling

1. Project Overview

Purpose: data-analysis practice.
Data source: Kaggle.
Source code, dataset and field descriptions (Baidu Cloud link):
Address: https://pan.baidu.com/s/1UD5HD69bNEsX2EkjaQ1IPg
Extraction code: 8gd8
Analysis goals:

  • Basic analysis and visualization: which states have the most accidents, when accidents tend to happen, weather conditions at the time of the accidents, and an overall picture of US traffic accidents in 2017
  • Use XGBoost to predict accident severity and identify which factors are most related to it

2. Data Processing (for analysis only; modeling preprocessing comes later)

The original dataset (US_Accidents_Dec19.csv) has 49 columns and roughly 3 million rows of accidents from 2016 to 2019. Given hardware and time constraints, only the 2017 accidents are analyzed here (see the source files for details).

```python
# Extract the 2017 records
import pandas as pd

data = pd.read_csv('./US_Accidents_Dec19.csv')
datacopy = data.copy()
datacopy['Start_Time'] = pd.to_datetime(datacopy['Start_Time'])
datacopy['year'] = datacopy['Start_Time'].apply(lambda x: x.year)
data1 = datacopy[datacopy['year'] == 2017]
data1.to_csv('./USaccident2017.csv')
```

Next, analyze USaccident2017.csv. Import the required packages:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import folium
import webbrowser
from pyecharts import options as opts
from pyecharts.charts import Page, Pie, Bar, Line, Scatter
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
import xgboost as xgb

data = pd.read_csv('./USaccident2017.csv')
data.shape   # (717483, 51)
data.head()
```

(`data.head()` prints the first five rows across all 51 columns — ID, Source, TMC, Severity, start/end times, coordinates, weather fields, road-feature flags, twilight fields, etc. — output omitted here for brevity.)

Field descriptions: https://www.jianshu.com/p/9e597dc8ae71

```python
# Check for missing values
data.isnull().sum()[data.isnull().sum() != 0]
```

```python
# Handle missing values
# Drop columns that are irrelevant or not analyzed
deletelist = ['Unnamed: 0', 'ID', 'TMC', 'End_Lat', 'End_Lng', 'Airport_Code',
              'Weather_Timestamp', 'Wind_Chill(F)', 'Civil_Twilight',
              'Nautical_Twilight', 'Astronomical_Twilight', 'year', 'Number']
data1 = data.drop(deletelist, axis=1)

# Drop rows with missing values in key columns
data1 = data1.dropna(axis=0, subset=['City', 'Zipcode', 'Timezone', 'Sunrise_Sunset'])

# Fill temperature, humidity, pressure and visibility with the mean
for col in ['Temperature(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)']:
    data1[col] = data1[col].fillna(data1[col].mean())

# Fill wind speed with nearest-neighbour interpolation
data1['Wind_Speed(mph)'] = data1['Wind_Speed(mph)'].interpolate(method='nearest')

# Fill weather condition and wind direction with the mode
# (note the [0]: Series.mode() returns a Series, and fillna with a Series
# aligns on the index instead of broadcasting a scalar)
data1['Weather_Condition'] = data1['Weather_Condition'].fillna(data1['Weather_Condition'].mode()[0])
data1['Wind_Direction'] = data1['Wind_Direction'].fillna(data1['Wind_Direction'].mode()[0])

# Missing precipitation means no rain: fill with 0
data1['Precipitation(in)'] = data1['Precipitation(in)'].fillna(0)

# Merge wind-direction labels that mean the same thing
occupation = {"CALM": "Calm", "N": "North", "S": "South",
              "W": "West", "E": "East", "VAR": "Variable"}
f = lambda x: occupation.get(x, x)  # look up x in occupation, fall back to x
data1['Wind_Direction'] = data1['Wind_Direction'].map(f)

# Finally reset the index, since rows were dropped
data1.index = range(len(data1))
```
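The `.mode()[0]` above matters: `Series.mode()` returns a Series, and `fillna` with a Series aligns on the index rather than filling every gap with the modal value. A minimal sketch of the difference:

```python
import pandas as pd

s = pd.Series(['a', None, 'a', None, 'b'])

# fillna with the mode Series aligns on index: mode() has index [0], and
# position 0 of s is not NaN, so the NaNs at positions 1 and 3 survive.
wrong = s.fillna(s.mode())

# Taking [0] gives the scalar 'a', which fills every NaN.
right = s.fillna(s.mode()[0])

print(wrong.isna().sum())  # 2
print(right.isna().sum())  # 0
```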

3. Data Visualization

```python
# States with the most accidents
a = (Bar(init_opts=opts.InitOpts(width="2000px", height="400px"))
     .add_xaxis(data1['State'].value_counts().index.tolist())
     .add_yaxis('Accidents per state', data1['State'].value_counts().tolist(), color='#499C9F')
     .set_series_opts(label_opts=opts.LabelOpts(is_show=False)))
a.render_notebook()
```


The five states with the most accidents are CA (California), TX (Texas), FL (Florida), NY (New York) and NC (North Carolina) — all relatively developed, populous regions.

```python
# Accidents by hour of day
x1 = pd.DatetimeIndex(data1["Start_Time"]).hour.value_counts().sort_index().index.tolist()
x = [str(i) for i in x1]  # pyecharts needs string labels
from pyecharts.charts import Line
b = (Line(init_opts=opts.InitOpts(width="1000px", height="400px"))
     .add_xaxis(x)
     .add_yaxis('Accidents per hour',
                pd.DatetimeIndex(data1["Start_Time"]).hour.value_counts().sort_index().tolist(),
                color='#F7BA0B', is_smooth=True)
     .set_series_opts(label_opts=opts.LabelOpts(is_show=False),
                      markarea_opts=opts.MarkAreaOpts(data=[
                          opts.MarkAreaItem(name="Morning rush", x=("6", "9")),
                          opts.MarkAreaItem(name="Evening rush", x=("15", "18"))]))
     .set_global_opts(xaxis_opts=opts.AxisOpts(name='Hour', name_location="center", name_gap=40)))
b.render_notebook()
```


The morning and evening rush hours stand out clearly — these are the most congested times with the heaviest traffic.

```python
# Accidents by month
x1 = pd.DatetimeIndex(data1["Start_Time"]).month.value_counts().sort_index().index.tolist()
x = [str(i) for i in x1]  # pyecharts needs string labels
from pyecharts.charts import Line
q = (Line(init_opts=opts.InitOpts(width="1000px", height="400px"))
     .add_xaxis(x)
     .add_yaxis('Accidents per month',
                pd.DatetimeIndex(data1["Start_Time"]).month.value_counts().sort_index().tolist(),
                color='#AED54C', is_smooth=True,
                areastyle_opts=opts.AreaStyleOpts(opacity=0.5))
     .set_global_opts(xaxis_opts=opts.AxisOpts(name='month', name_location="center", name_gap=40)))
q.render_notebook()
```


下半年事故發生數明顯多于上半年,也許是因為下半年節假日較多且工作量較多

```python
# Accidents by weather condition (top 10)
weather10 = data1['Weather_Condition'].value_counts().head(10)
c = (Bar()
     .add_xaxis(weather10.index.tolist())
     .add_yaxis('Accidents by weather condition', weather10.tolist(), color='#48A43F')
     .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
     .set_global_opts(
         xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15))  # rotate long labels
     ))
c.render_notebook()
```


Most accidents happen in clear weather, but the third and fourth most common conditions are overcast or cloudy, so weather does have some influence on accidents.

```python
# Severity of accidents that happened in clear weather
Clear_weather = data1[data1['Weather_Condition'] == 'Clear'].copy()
occupation = {1: "Minor", 2: "Moderate", 3: "Serious", 4: "Severe"}
f = lambda x: occupation.get(x, x)  # look up x in occupation, fall back to x
Clear_weather['Severity'] = Clear_weather['Severity'].map(f)
counts = Clear_weather['Severity'].value_counts()
d = (Pie()
     .add("severity", [list(z) for z in zip(counts.index.tolist(), counts.tolist())])
     .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%")))
d.render_notebook()
```

```python
# Environmental conditions at the time of the accidents
# Day/night flag: 1 = day
m = lambda x: 1 if x == 'Day' else 0
data1['Sunrise_Sunset'] = data1['Sunrise_Sunset'].apply(m)
# Rain flag: 1 = any precipitation
m = lambda x: 1 if x > 0 else 0
data1['PrecipitationORnot'] = data1['Precipitation(in)'].apply(m)

df0 = pd.concat([data1['Crossing'].value_counts(),
                 data1['PrecipitationORnot'].value_counts(),
                 data1['Sunrise_Sunset'].value_counts().sort_index(),
                 data1['Traffic_Signal'].value_counts(),
                 data1['Give_Way'].value_counts(),
                 data1['Bump'].value_counts()], axis=1)

h = (Bar()
     .add_xaxis(['Crossing', 'Precipitation', 'Daytime', 'Traffic signal', 'Give way', 'Speed bump'])
     .add_yaxis("0", df0.loc[False].tolist(), stack="stack1", color='#992572')
     .add_yaxis("1", df0.loc[True].tolist(), stack="stack1", color='#4A203B')
     .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
     .set_global_opts(title_opts=opts.TitleOpts(title="Stacked bar (all records)")))
h.render_notebook()
```


Most accidents happened in daytime, away from crossings, with no precipitation, no traffic signal, no give-way sign and no speed bump.

```python
# Visibility at the time of the accident
# Binned according to the Baidu Baike visibility scale
data1["Visibility_bin"] = "Poor"
data1.loc[(data1["Visibility(mi)"] > 2.5) & (data1["Visibility(mi)"] <= 6.5), "Visibility_bin"] = "Moderate"
data1.loc[(data1["Visibility(mi)"] > 6.5) & (data1["Visibility(mi)"] <= 12), "Visibility_bin"] = "Good"
data1.loc[data1["Visibility(mi)"] > 12, "Visibility_bin"] = "Very good"
counts = data1["Visibility_bin"].value_counts()
d = (Pie()
     .add("visibility", [list(z) for z in zip(counts.index.tolist(), counts.tolist())])
     .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
     .set_global_opts(title_opts=opts.TitleOpts(title="Visibility at time of accident")))
d.render_notebook()
```


In most accidents visibility was good, i.e. between 6.5 and 12 miles (1 mile ≈ 1.6 km).
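The chained `.loc` assignments above can also be written as a single `pd.cut` call, which builds the same bins more compactly — a minimal sketch with the same bin edges:

```python
import pandas as pd

vis = pd.Series([1.0, 5.0, 10.0, 20.0])  # sample visibility values in miles
bins = [-float('inf'), 2.5, 6.5, 12, float('inf')]
labels = ['Poor', 'Moderate', 'Good', 'Very good']
vis_bin = pd.cut(vis, bins=bins, labels=labels)
print(vis_bin.tolist())  # ['Poor', 'Moderate', 'Good', 'Very good']
```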

```python
# Draw a US map with folium and plot 3000 randomly sampled accident locations
incidents = folium.map.FeatureGroup()
datasample = data1.sample(3000)

# Add a circle marker for each sampled accident
for lat, lng in zip(datasample.Start_Lat, datasample.Start_Lng):
    incidents.add_child(folium.CircleMarker(
        [lat, lng],
        radius=3,        # size of the circle marker
        color='yellow',
        fill=True,
        fill_color='red',
        fill_opacity=0.4))

# Add the markers to the map
US_map = folium.Map(location=[38, -100], zoom_start=4)
US_map.add_child(incidents)
```


Most accidents occur in the developed coastal regions.

4. Predicting Severity with XGBoost

4.1 Preprocessing for Modeling
```python
dataX = data1.copy()
# Extract month and hour
# (parse Start_Time first: it was read back from CSV as a string)
dataX['Start_Time'] = pd.to_datetime(dataX['Start_Time'])
dataX['month'] = dataX['Start_Time'].apply(lambda x: x.month)
dataX['hour'] = dataX['Start_Time'].apply(lambda x: x.hour)

# Drop features that are useless for modelling
deletelist2 = ['Source', 'Side', 'Start_Time', 'End_Time', 'Description', 'Street',
               'City', 'County', 'State', 'Zipcode', 'Country', 'Timezone', 'Wind_Direction']
dataX = dataX.drop(deletelist2, axis=1)

# Convert False/True to 0/1
list3 = ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway',
         'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']
m = lambda x: 1 if x else 0
for i in list3:
    dataX[i] = dataX[i].apply(m)

# Severity 1 is far too rare (extreme imbalance), so drop it
dataX = dataX.drop(index=dataX.loc[dataX['Severity'] == 1].index)
dataX.index = range(len(dataX))
dataX.head()
```

(`dataX.head()` shows the remaining columns: Severity, coordinates, Distance(mi), the weather features, the 0/1 road-feature flags, Sunrise_Sunset, month and hour — output omitted.)
```python
y = dataX['Severity']              # label column
Xw = dataX['Weather_Condition']    # column to one-hot encode
X = dataX.drop(['Severity', 'Weather_Condition'], axis=1)

# One-hot encoding
enc = OneHotEncoder(categories='auto', handle_unknown='ignore').fit(Xw.values.reshape(-1, 1))
result = enc.transform(Xw.values.reshape(-1, 1)).toarray()
Xw1 = pd.DataFrame(result)
Xw1.shape   # (716669, 78)
```

After one-hot encoding, the Weather_Condition column becomes a (716669, 78) matrix. So many sparse columns risk overfitting, so PCA is used to reduce the dimensionality.

```python
pca = PCA(n_components=5)
pca.fit(Xw1)
col = pca.transform(Xw1)
Xw1 = pd.DataFrame(col)
Xw1.head()
```

```
          0         1         2         3         4
0 -0.365071 -0.105292  0.643741 -0.661940 -0.141977
1 -0.544094  0.715639 -0.290545 -0.018498 -0.060453
2 -0.365071 -0.105292  0.643741 -0.661940 -0.141977
3 -0.482130 -0.685853 -0.468909 -0.024822 -0.071162
4 -0.482130 -0.685853 -0.468909 -0.024822 -0.071162
```
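Cutting 78 one-hot columns down to 5 components is fairly aggressive, so it is worth checking how much variance those components retain via `explained_variance_ratio_`. A minimal sketch on synthetic one-hot data (the real `Xw1` matrix from above would take the place of `onehot`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for a one-hot matrix: 1000 rows, 20 categories
labels = rng.integers(0, 20, size=1000)
onehot = np.eye(20)[labels]

pca = PCA(n_components=5)
pca.fit(onehot)
retained = pca.explained_variance_ratio_.sum()
print(f"variance retained by 5 components: {retained:.2%}")
```

If the retained fraction is low, more components (or skipping PCA and relying on the trees' own column subsampling) may be preferable.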
```python
# Standardize the remaining features
columns = X.columns.tolist()
robustS = RobustScaler()
X = pd.DataFrame(robustS.fit_transform(X), columns=columns)
X.head()
```

(`X.head()` now shows every feature on a robust-scaled range, e.g. Start_Lat ≈ 0.29 and Start_Lng ≈ −1.0 for the first row — full output omitted.)

Concatenate X and Xw1:

```python
X1 = pd.concat([X, Xw1], axis=1)
X1.shape   # (716669, 30)

# XGBoost expects class labels in 0..num_class-1, so map {2, 3, 4} -> {0, 1, 2}
def f(x):
    if x == 2:
        return 0
    elif x == 3:
        return 1
    else:
        return 2

y1 = y.apply(f)
y1.value_counts()
```

```
0    461657
1    230899
2     24113
Name: Severity, dtype: int64
```
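The class counts above are still quite skewed (class 2 is only about 3% of the data). One option, not used in the original, is to pass per-sample weights into `xgb.DMatrix(..., weight=...)`. A minimal sketch of computing inverse-frequency weights (the toy `y` below stands in for `y1`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the imbalanced label column
y = pd.Series([0] * 80 + [1] * 15 + [2] * 5)

# Inverse-frequency weight per class, normalized so the mean class weight is 1
freq = y.value_counts(normalize=True)
class_weight = 1.0 / freq
class_weight = class_weight / class_weight.mean()

# One weight per sample; rare classes get large weights. These would be
# supplied as xgb.DMatrix(X_train, label=y_train, weight=weights)
weights = y.map(class_weight).to_numpy()
print(class_weight.round(3).to_dict())
```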

Build the XGBoost model:

```python
param1 = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',  # multi-class classification
    'num_class': 3,                # number of classes, used with multi:softmax
    'gamma': 0.1,                  # post-pruning control; larger = more conservative (typically 0.1-0.2)
    'max_depth': 12,               # tree depth; larger trees overfit more easily
    'lambda': 2,                   # L2 regularization weight; larger = less overfitting
    'subsample': 0.7,              # row subsampling per tree
    'colsample_bytree': 0.7,       # column subsampling per tree
    'min_child_weight': 3,
    'silent': 1,                   # 1 = no runtime output (0 is usually better)
    'eta': 0.007,                  # learning rate
    'seed': 1000,
    'nthread': 4,                  # number of CPU threads
}
X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.3, random_state=0)
xg_train = xgb.DMatrix(X_train, label=y_train)
xg_test = xgb.DMatrix(X_test, label=y_test)
bst1 = xgb.train(param1, xg_train)
pred1 = bst1.predict(xg_test)
print(accuracy_score(y_test, pred1))
```

0.7609220422230595 (accuracy)
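With class 2 at roughly 3% of the data, overall accuracy can hide poor minority-class recall, so per-class precision and recall (not computed in the original) are worth checking. A sketch with scikit-learn, using toy arrays in place of `y_test` and `pred1`:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Toy stand-ins for y_test and pred1
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 2, 0, 0])

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))
```

Here class 2 has recall 1/3 despite reasonable overall accuracy — exactly the pattern a single accuracy number can mask.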
Check feature importance:

```python
from xgboost import plot_importance
plot_importance(bst1)   # the trained model above is bst1 (the original wrote bst3)
plt.show()
```

Hyperparameter tuning (in progress)
