當(dāng)前位置：首頁(yè) > 编程语言 > c/c++ >内容正文

c/c++

c++语言get:_用C++给R语言加速：Rcpp简单用法

發(fā)布時(shí)間：2023/12/15 c/c++ 36 豆豆

生活随笔收集整理的這篇文章主要介紹了 c++语言get:_用C++给R语言加速：Rcpp简单用法小編覺(jué)得挺不錯(cuò)的,現(xiàn)在分享給大家,幫大家做個(gè)參考.

作者：黃天元，復(fù)旦大學(xué)博士在讀，熱愛(ài)數(shù)據(jù)科學(xué)與開(kāi)源工具（R），致力于利用數(shù)據(jù)科學(xué)迅速積累行業(yè)經(jīng)驗(yàn)優(yōu)勢(shì)和科學(xué)知識(shí)發(fā)現(xiàn)，涉獵內(nèi)容包括但不限于信息計(jì)量、機(jī)器學(xué)習(xí)、數(shù)據(jù)可視化、應(yīng)用統(tǒng)計(jì)建模、知識(shí)圖譜等，著有《R語(yǔ)言高效數(shù)據(jù)處理指南》（《R語(yǔ)言數(shù)據(jù)高效處理指南》(黃天元)【摘要書(shū)評(píng) 試讀】- 京東圖書(shū)）。知乎專(zhuān)欄：R語(yǔ)言數(shù)據(jù)挖掘。郵箱：huang.tian-yuan@qq.com.歡迎合作交流。

最近要做一個(gè)小任務(wù)，它的描述非常簡(jiǎn)單，就是識(shí)別向量的變化。比如一個(gè)整數(shù)序列是“4,4,4,5,5,6,6,5,5,5,4,4”，那么我們要根據(jù)數(shù)字的連續(xù)關(guān)系來(lái)分組，輸出應(yīng)該是“1,1,1,2,2,3,3,4,4,4,5,5”。這個(gè)函數(shù)用R寫(xiě)起來(lái)非常簡(jiǎn)單，稍加思考草擬如下：

get_id = function(x){z = vector()y = NULLfor (i in seq_along(x)) {if(i == 1) y = 1else if(x[i] != x[i-1]) y = y + 1z = c(z,y)}z }

不妨做個(gè)小測(cè)試：

get_id = function(x){z = vector()y = NULLfor (i in seq_along(x)) {if(i == 1) y = 1else if(x[i] != x[i-1]) y = y + 1z = c(z,y)}z } c(rep(33L,3),rep(44L,4),rep(33L,3)) #> [1] 33 33 33 44 44 44 44 33 33 33 get_id(c(rep(33L,3),rep(44L,4),rep(33L,3))) #> [1] 1 1 1 2 2 2 2 3 3 3

得到的結(jié)果絕對(duì)是準(zhǔn)確的，而且按照這些代碼，基本可以識(shí)別不同的數(shù)據(jù)類(lèi)型，只要這些數(shù)據(jù)能夠用“==”來(lái)判斷是否相同（可能用setequal函數(shù)的健壯性更好）。

但是當(dāng)數(shù)據(jù)量很大的時(shí)候，這樣寫(xiě)是否足夠快，就很重要了。這意味著看你要等一小時(shí)、一天還是一個(gè)月。想起自己小時(shí)候還學(xué)過(guò)C++，就希望嘗試用Rcpp來(lái)加速，擬了代碼如下：

library(Rcpp)# 函數(shù)名稱(chēng)為get_id_c cppFunction('IntegerVector get_id_c(IntegerVector x){int n = x.size();IntegerVector out(n);for (int i = 0; i < n; i++) {if(i == 1) out[i] = 1;else if(x[i] == x[i-1]) out[i] = out[i-1];else out[i] = out[i-1] + 1;}return out; }')

需要聲明的是，C++需要定義數(shù)據(jù)類(lèi)型，因?yàn)槿蝿?wù)是正整數(shù)，所以函數(shù)就接受一個(gè)整數(shù)向量，輸出一個(gè)整數(shù)向量。多年不用C++，寫(xiě)這么一段代碼居然調(diào)試過(guò)程就出了3次錯(cuò)，慚愧。但是對(duì)性能的提升效果非常顯著，我們先做一些簡(jiǎn)單嘗試。先嘗試1萬(wàn)個(gè)整數(shù)：

library(pacman) p_load(tidyfst) sys_time_print({res1 = get_id(c(rep(33L,1e4),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e4),rep(44L,4),rep(33L,3))) })setequal(res1,res2)

0.14s和0.00s的區(qū)別，可能體會(huì)不夠深。那么來(lái)10萬(wàn)個(gè)整數(shù)試試：

sys_time_print({res1 = get_id(c(rep(33L,1e5),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e5),rep(44L,4),rep(33L,3))) })setequal(res1,res2)

13s vs 0s，有點(diǎn)不能忍了。那么，我們來(lái)100萬(wàn)個(gè)整數(shù)再來(lái)試試：

# 不要嘗試這個(gè) sys_time_print({res1 = get_id(c(rep(33L,1e6),rep(44L,4),rep(33L,3))) })# 可以嘗試這個(gè) sys_time_print({res2 = get_id_c(c(rep(33L,1e6),rep(44L,4),rep(33L,3))) })setequal(res1,res2)

好的，關(guān)于這段代碼：

sys_time_print({res1 = get_id(c(rep(33L,1e6),rep(44L,4),rep(33L,3))) })

可以不要嘗試了，因?yàn)橹苯涌ㄋ懒恕５侨绻肦cpp構(gòu)造的函數(shù)，那么就放心試吧，我們還遠(yuǎn)遠(yuǎn)沒(méi)有探知其算力上限。可以觀察一下結(jié)果：

我們可以看到，1億個(gè)整數(shù)，也就是0.69秒；10億是7.15秒。雖然想嘗試百億，但是我的計(jì)算機(jī)內(nèi)存已經(jīng)不夠了。

總結(jié)一下，永遠(yuǎn)不敢說(shuō)R的速度不夠快，只是自己代碼寫(xiě)得爛而已（盡管完成了實(shí)現(xiàn)，其實(shí)get_id這個(gè)函數(shù)優(yōu)化的空間是很多的），方法總比問(wèn)題多。不多說(shuō)了，去溫習(xí)C++，學(xué)習(xí)Rcpp去了。后面如果有閑暇，來(lái)做一個(gè)Rcpp的學(xué)習(xí)系列。放一個(gè)核心資料鏈接：

Seamless R and C++ Integration?rcpp.org

根據(jù)評(píng)論區(qū)提示，重新寫(xiě)了R的代碼再來(lái)比較。代碼如下：

library(pacman) p_load(Rcpp,tidyfst)get_id = function(x){z = vector()for (i in seq_along(x)) {if(i == 1) z[i] = 1else if(x[i] != x[i-1]) z[i] = z[i-1] + 1else z[i] = z[i-1]}z }cppFunction('IntegerVector get_id_c(IntegerVector x){int n = x.size();IntegerVector out(n);for (int i = 0; i < n; i++) {if(i == 1) out[i] = 1;else if(x[i] == x[i-1]) out[i] = out[i-1];else out[i] = out[i-1] + 1;}return out; }')sys_time_print({res1 = get_id(c(rep(33L,1e4),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e4),rep(44L,4),rep(33L,3))) })setequal(res1,res2)sys_time_print({res1 = get_id(c(rep(33L,1e5),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e5),rep(44L,4),rep(33L,3))) })setequal(res1,res2)sys_time_print({res1 = get_id(c(rep(33L,1e6),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e6),rep(44L,4),rep(33L,3))) })setequal(res1,res2)sys_time_print({res1 = get_id(c(rep(33L,1e7),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e7),rep(44L,4),rep(33L,3))) })setequal(res1,res2)sys_time_print({res1 = get_id(c(rep(33L,1e8),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e8),rep(44L,4),rep(33L,3))) })setequal(res1,res2)sys_time_print({res1 = get_id(c(rep(33L,1e9),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e9),rep(44L,4),rep(33L,3))) })setequal(res1,res2)

1萬(wàn)：

10萬(wàn)：

100萬(wàn)：

1000萬(wàn)：

1億：

結(jié)論：還是Rcpp香。

三更：R代碼提前設(shè)置向量長(zhǎng)度，再比較。

library(pacman) p_load(Rcpp,tidyfst)get_id = function(x){z = integer(length(x))for (i in seq_along(x)) {if(i == 1) z[i] = 1else if(x[i] != x[i-1]) z[i] = z[i-1] + 1else z[i] = z[i-1]}z }cppFunction('IntegerVector get_id_c(IntegerVector x){int n = x.size();IntegerVector out(n);for (int i = 0; i < n; i++) {if(i == 1) out[i] = 1;else if(x[i] == x[i-1]) out[i] = out[i-1];else out[i] = out[i-1] + 1;}return out; }')sys_time_print({res1 = get_id(c(rep(33L,1e4),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e4),rep(44L,4),rep(33L,3))) })setequal(res1,res2)sys_time_print({res1 = get_id(c(rep(33L,1e5),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e5),rep(44L,4),rep(33L,3))) })setequal(res1,res2)sys_time_print({res1 = get_id(c(rep(33L,1e6),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e6),rep(44L,4),rep(33L,3))) })setequal(res1,res2)sys_time_print({res1 = get_id(c(rep(33L,1e7),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e7),rep(44L,4),rep(33L,3))) })setequal(res1,res2)sys_time_print({res1 = get_id(c(rep(33L,1e8),rep(44L,4),rep(33L,3))) })sys_time_print({res2 = get_id_c(c(rep(33L,1e8),rep(44L,4),rep(33L,3))) })setequal(res1,res2)

四更：對(duì)于這個(gè)任務(wù)來(lái)講，data.table的rleid函數(shù)是最快的。R語(yǔ)言中的終極魔咒，能找到現(xiàn)成的千萬(wàn)不要自己寫(xiě)，總有巨佬在前頭。不過(guò)直到10億個(gè)整數(shù)才有難以忍受的差距。

1億

10億

總結(jié)

以上是生活随笔為你收集整理的c++语言get:_用C++给R语言加速：Rcpp简单用法的全部?jī)?nèi)容，希望文章能夠幫你解決所遇到的問(wèn)題。

如果覺(jué)得生活随笔網(wǎng)站內(nèi)容還不錯(cuò)，歡迎將生活随笔推薦給好友。

上一篇：索尼更新 PS VR 2 FAQ 页面：
下一篇： java调用c的sdk_如何使用java

日韩av黄I国产麻豆传媒I国产91av视频在线观看I日韩一区二区三区在线看I美女国产在线I麻豆视频国产在线观看I成人黄色短片

c/c++

c++语言get:_用C++给R语言加速：Rcpp简单用法

總結(jié)