The R extension package dplyr: data cleaning and wrangling
Tags: data, R, data cleaning, data wrangling | 2015-01-22 18:04 | 7357 views | Category: R Programming
Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.
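dplyr is mainly used for data cleaning and wrangling. Coursera course link: Getting and Cleaning Data
You can also load the swirl package and install the Getting and Cleaning Data course to follow along, like this:

    library(swirl)
    install_from_swirl("Getting and Cleaning Data")
    swirl()

This post mainly follows the introduction that ships with the package: Introduction to dplyr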
1. Sample data

    > library(dplyr)
    > library(nycflights13)
    > dim(flights)
    [1] 336776     16
    > head(flights, 3)
    Source: local data frame [3 x 16]

      year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time
    1 2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227
    2 2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227
    3 2013     1   1      542         2      923        33      AA  N619AA   1141    JFK  MIA      160
    Variables not shown: distance (dbl), hour (dbl), minute (dbl)

2. Converting an overly long data frame into a friendlier tbl_df
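
    flights_df <- tbl_df(flights)
    flights_df

Printing a tbl_df shows only the first rows and the columns that fit on screen. In current dplyr releases tbl_df() is deprecated in favor of as_tibble().
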
3. Filtering rows with filter()

    > filter(flights_df, month == 1, day == 1)
    Source: local data frame [842 x 16]

       year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time
    1  2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227
    2  2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227
    ... (output truncated)

This keeps the rows where month is 1 and day is 1; conditions separated by commas are combined with AND.

The same result with base R subsetting:

    flights_df[flights_df$month == 1 & flights_df$day == 1, ]

4. Selecting rows by position with slice()

    slice(flights_df, 1:10)

5. Sorting rows with arrange()

    arrange(flights_df, year, month, day)

This sorts flights_df by year, month and day in ascending order.

For descending order, wrap the column in desc():

    arrange(flights_df, year, desc(month), day)

The equivalent with base R indexing:

    # ascending by year, month, day
    flights_df[order(flights_df$year, flights_df$month, flights_df$day), ]
    # descending by arr_delay (desc() is dplyr's helper, not a base R function)
    flights_df[order(desc(flights_df$arr_delay)), ]

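To make the comparison between arrange() and order() concrete, here is a minimal sketch on a small made-up data frame. The name `toy` and its values are assumptions for illustration only, not part of nycflights13. It sorts ascending by one column and descending by another in both styles.

```r
library(dplyr)

# Hypothetical toy data standing in for flights_df
toy <- data.frame(
  month     = c(2, 1, 2, 1),
  day       = c(1, 15, 1, 3),
  arr_delay = c(5, 12, 40, -3)
)

# dplyr: ascending by month, then descending by arr_delay
by_dplyr <- arrange(toy, month, desc(arr_delay))

# base R: order() with a negated key for the descending numeric column
by_base <- toy[order(toy$month, -toy$arr_delay), ]

print(by_dplyr)
print(by_base)
```

Both results hold the rows in the same order. The base R version keeps the original row names, while arrange() renumbers them.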
6. Selecting columns with select()
Pick the columns you need by name:

    select(flights_df, year, month, day)

This returns the three named columns.
Use : to select a range of adjacent columns:

    select(flights_df, year:day)

Use - to drop the columns you do not want:

    select(flights_df, -(year:day))

7. Adding columns with mutate()

mutate() creates new columns from existing ones:

    mutate(flights_df,
           gain = arr_delay - dep_delay,
           speed = distance / air_time * 60)

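One detail worth seeing in action is that mutate() lets a later expression use a column created earlier in the same call. The sketch below assumes a small data frame `toy` built from the three sample rows shown in section 1; the column `gain_per_hour` is an illustrative name, not something from the original post.

```r
library(dplyr)

# Toy data: delays and air time copied from the three rows printed by head(flights, 3)
toy <- data.frame(
  arr_delay = c(11, 20, 33),
  dep_delay = c(2, 4, 2),
  air_time  = c(227, 227, 160)
)

out <- mutate(toy,
  gain          = arr_delay - dep_delay,
  # `gain` was defined on the previous line and is already usable here
  gain_per_hour = gain / (air_time / 60)
)

print(out)
```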
8. Summarising with summarize()

    summarise(flights,
              delay = mean(dep_delay, na.rm = TRUE))

This computes the mean of dep_delay, ignoring missing values. summarise() and summarize() are synonyms.

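The na.rm = TRUE argument matters because a single missing value makes mean() return NA. A minimal sketch, assuming a made-up one-column data frame `toy`:

```r
library(dplyr)

# Hypothetical delays with one missing value
toy <- data.frame(dep_delay = c(2, 4, NA, 10))

# Without na.rm the missing value propagates into the summary
with_na <- summarise(toy, delay = mean(dep_delay))

# With na.rm = TRUE the mean is taken over the three observed values
without_na <- summarise(toy, delay = mean(dep_delay, na.rm = TRUE))

print(with_na)
print(without_na)
```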
9. Drawing random samples

    sample_n(flights_df, 10)

This draws 10 rows at random.

    sample_frac(flights_df, 0.01)

This draws a random 1% of the rows.

10. Grouping with group_by()

    library(ggplot2)

    # Group by tailnum and store the result as by_tailnum
    by_tailnum <- group_by(flights, tailnum)
    # For each tailnum group, count the flights and average distance and arr_delay
    delay <- summarise(by_tailnum,
                       count = n(),
                       dist = mean(distance, na.rm = TRUE),
                       delay = mean(arr_delay, na.rm = TRUE))
    delay <- filter(delay, count > 20, dist < 2000)
    ggplot(delay, aes(dist, delay)) +
        geom_point(aes(size = count), alpha = 1/2) +
        geom_smooth() +
        scale_size_area()

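The grouped summary above can be hard to picture on 336776 rows, so here is the same pattern as a minimal sketch on a made-up data frame. `toy`, `by_plane` and `per_plane` are illustrative names and the values are invented.

```r
library(dplyr)

# Hypothetical flights for two planes
toy <- data.frame(
  tailnum   = c("N1", "N1", "N2", "N2", "N2"),
  distance  = c(100, 300, 500, 700, 900),
  arr_delay = c(10, NA, -5, 5, 30)
)

# group_by() only marks the groups; summarise() then returns one row per group
by_plane  <- group_by(toy, tailnum)
per_plane <- summarise(by_plane,
  count = n(),                            # number of rows in the group
  dist  = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)

print(per_plane)
```

The result has one row for each distinct tailnum, which is what the later filter(delay, count > 20, dist < 2000) step operates on in the full example.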
When the work is done step by step, every intermediate result has to be stored in a variable:

    a1 <- group_by(flights, year, month, day)
    a2 <- select(a1, arr_delay, dep_delay)
    a3 <- summarise(a2,
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE))
    a4 <- filter(a3, arr > 30 | dep > 30)

11. Introducing the pipe operator %>%

Start with the data name, then apply the operations to it one after another:

    flights %>%
        group_by(year, month, day) %>%
        select(arr_delay, dep_delay) %>%
        summarise(
            arr = mean(arr_delay, na.rm = TRUE),
            dep = mean(dep_delay, na.rm = TRUE)
        ) %>%
        filter(arr > 30 | dep > 30)

None of the steps needs the data name as its first argument any more, because %>% passes the result on its left as the first argument of the call on its right.

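As a minimal sketch of what the pipe does, the example below writes one small computation twice, first as nested calls and then as a pipeline. The data frame `toy` and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical arrival delays for two months
toy <- data.frame(
  month     = c(1, 1, 2, 2),
  arr_delay = c(40, 50, 5, NA)
)

# Nested calls: read from the inside out
nested <- filter(
  summarise(group_by(toy, month), arr = mean(arr_delay, na.rm = TRUE)),
  arr > 30
)

# Pipeline: x %>% f(y) is evaluated as f(x, y), so the steps read top to bottom
piped <- toy %>%
  group_by(month) %>%
  summarise(arr = mean(arr_delay, na.rm = TRUE)) %>%
  filter(arr > 30)

print(nested)
print(piped)
```

Both forms run the same three verbs in the same order; the pipeline only changes how the code reads.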
To learn more about the package, see its reference manual (60 pages): dplyr