[Repost] RHadoop in Practice, Part 2: Installing and Using RHadoop
The RHadoop in Practice series covers analyzing massive data by combining R with Hadoop. Hadoop is mainly used to store the data, while R implements the MapReduce algorithms, replacing the Java MapReduce implementation. With RHadoop, R enthusiasts gain a much more powerful tool for handling big data at the 1 GB, 10 GB, 100 GB, TB, or PB scale, and the single-machine performance bottlenecks caused by big data may well be gone for good.
RHadoop in Practice is a series of articles that includes "Setting up the Hadoop Environment", "Installing and Using RHadoop", "Implementing a Collaborative Filtering Algorithm with MapReduce in R", and "Installing and Using HBase and rhbase". For someone who is only an R enthusiast, a Java enthusiast, or a Hadoop enthusiast, mastering all three at once is not easy. Although this is an introductory article, you are expected to have basic knowledge of R, Java, and Hadoop beforehand.
About the author:
- Zhang Dan (Conan), programmer: Java, R, PHP, JavaScript
- weibo: @Conan_Z
- blog: http://blog.fens.me
- email: bsspirit@gmail.com
When reposting, please cite the source:
http://blog.fens.me/rhadoop-rhadoop/
Part 2, Installing and Using RHadoop, is divided into three chapters:
1. Environment preparation
2. Installing RHadoop
3. RHadoop examples
Each chapter has an "Explanation" part and a "Code" part, so that the narrative and the code stay consistent.
Note: for a detailed walkthrough of setting up the Hadoop environment, see the previous article in this series, "RHadoop in Practice: Setting up the Hadoop Environment".
Since the two articles were not written at the same time, the Hadoop version, operating system, and distributed environment differ slightly.
The two articles are independent of each other; please experiment hands-on with understanding, and do not rely entirely on the commands listed in either one.
Environment Preparation
Explanation:
First, prepare the environment. I chose the 64-bit version of Linux Ubuntu 12.04; you can pick whichever Linux distribution you are most comfortable with.
The JDK, however, must be the official Oracle (Sun) release, downloaded from the official site; the OpenJDK bundled with the operating system causes all kinds of incompatibilities. Choose a 1.6.x JDK; JDK 1.7 also has various incompatibility problems.
http://www.oracle.com/technetwork/java/javase/downloads/index.html
For installing the Hadoop environment itself, see the "Setting up the Hadoop Environment" article in the RHadoop in Practice series.
Install R 2.15 or later; version 2.14 cannot support RHadoop.
If you are also running Linux Ubuntu 12.04, update the package sources first, otherwise you will only be able to install R 2.14.
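You can quickly confirm that the running R version meets this requirement from inside R. A minimal sketch (it only checks the version, nothing here is specific to RHadoop):

# Print the running R version and stop early if it is older than 2.15,
# since the rmr2/rhdfs packages discussed below require R >= 2.15.
print(R.version.string)
if (getRversion() < "2.15.0") {
  stop("R >= 2.15 is required for RHadoop; please upgrade R first")
}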
Code:
1. Operating system: Ubuntu 12.04 x64

~ uname -a
Linux domU-00-16-3e-00-00-85 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

2. Java environment

~ java -version
java version "1.6.0_29"
Java(TM) SE Runtime Environment (build 1.6.0_29-b11)
Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode)

3. Hadoop environment (only hadoop is needed here)

hadoop-1.0.3 hbase-0.94.2 hive-0.9.0 pig-0.10.0 sqoop-1.4.2 thrift-0.8.0 zookeeper-3.4.4

4. R environment

R version 2.15.3 (2013-03-01) -- "Security Blanket"
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-pc-linux-gnu (64-bit)

4.1 If you are on Ubuntu 12.04, update the sources first, then install R 2.15.3

sh -c "echo deb http://mirror.bjtu.edu.cn/cran/bin/linux/ubuntu precise/ >> /etc/apt/sources.list"
apt-get update
apt-get install r-base
Installing RHadoop
Explanation:
RHadoop is a project from Revolution Analytics, with the open-source code available on GitHub. RHadoop consists of three R packages (rmr, rhdfs, rhbase), corresponding to the MapReduce, HDFS, and HBase parts of the Hadoop architecture respectively. Since these three packages cannot be found on CRAN, you need to download them yourself.
https://github.com/RevolutionAnalytics/RHadoop/wiki
Next, we need to install the dependencies of these three packages.
The first is rJava. In the previous chapter we already set up the JDK 1.6 environment; run the R CMD javareconf command so that R picks up the Java configuration from the system variables. Then start R and install rJava via install.packages.
After that, install the remaining dependencies: reshape2, Rcpp, iterators, itertools, digest, RJSONIO, and functional. All of them can be installed directly with install.packages.
Next, install the rhdfs package. Add the HADOOP_CMD and HADOOP_STREAMING variables to the environment; export works for the current shell session, but to save effort next time it is better to add them to the system-wide /etc/environment file. Then install the rhdfs package with R CMD INSTALL, which completes without problems.
Install the rmr package with R CMD INSTALL in the same way; it also completes without problems.
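If you prefer to stay inside the R console, the same source tarballs can also be installed with install.packages. A small sketch, assuming the tarballs were copied to /root/R as in the code part below:

# Install the downloaded source tarballs from within R instead of using R CMD INSTALL.
install.packages("/root/R/rhdfs_1.0.5.tar.gz", repos = NULL, type = "source")
install.packages("/root/R/rmr2_2.1.0.tar.gz", repos = NULL, type = "source")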
The rhbase package will be covered later in the article "Installing and Using HBase and rhbase", so we skip it here for now.
Finally, let's check which packages RHadoop has installed.
Because my hard disk is external and the R library directory is mounted via mount plus a symbolic link (ln -s), my R libraries live under /disk1/system:
/disk1/system/usr/local/lib/R/site-library/
Normally the R library directory is /usr/lib/R/site-library or /usr/local/lib/R/site-library; you can also run whereis R to find where the R libraries are installed on your own machine.
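The same check can be done from inside R. A minimal sketch (the paths it prints will of course depend on your own installation):

# Show the library paths R searches, then list the packages installed in the first one.
.libPaths()
rownames(installed.packages(lib.loc = .libPaths()[1]))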
Code:
1. Download the three RHadoop packages
https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
rmr-2.1.0 rhdfs-1.0.5 rhbase-1.1

2. Copy them to the /root/R directory

~/R# pwd
/root/R
~/R# ls
rhbase_1.1.tar.gz rhdfs_1.0.5.tar.gz rmr2_2.1.0.tar.gz

3. Install the dependencies

Run on the command line:
~ R CMD javareconf
~ R

Then, inside the R session:
install.packages("rJava")
install.packages("reshape2")
install.packages("Rcpp")
install.packages("iterators")
install.packages("itertools")
install.packages("digest")
install.packages("RJSONIO")
install.packages("functional")

4. Install the rhdfs package

~ export HADOOP_CMD=/root/hadoop/hadoop-1.0.3/bin/hadoop
~ export HADOOP_STREAMING=/root/hadoop/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar   (needed by rmr2)
~ R CMD INSTALL /root/R/rhdfs_1.0.5.tar.gz

4.1 It is best to put HADOOP_CMD into the system environment variables

~ vi /etc/environment
HADOOP_CMD=/root/hadoop/hadoop-1.0.3/bin/hadoop
HADOOP_STREAMING=/root/hadoop/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar

~ . /etc/environment
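As an alternative to editing /etc/environment, the two variables can also be set from inside an R session before the packages are loaded. A sketch, assuming the same Hadoop 1.0.3 paths as above:

# Set the Hadoop-related variables for the current R session only.
# rhdfs reads HADOOP_CMD, and rmr2 reads HADOOP_STREAMING when it runs jobs.
Sys.setenv(HADOOP_CMD = "/root/hadoop/hadoop-1.0.3/bin/hadoop")
Sys.setenv(HADOOP_STREAMING = "/root/hadoop/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar")
library(rhdfs)
hdfs.init()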
5. Install the rmr package

~ R CMD INSTALL rmr2_2.1.0.tar.gz

6. Install the rhbase package (skipped for now)

7. All installed packages

~ ls /disk1/system/usr/local/lib/R/site-library/
digest functional iterators itertools plyr Rcpp reshape2 rhdfs rJava RJSONIO rmr2 stringr

RHadoop Examples
Explanation:
With the rhdfs and rmr2 packages installed, we can try some Hadoop operations from R.
First, the basic HDFS file operations.
List an HDFS directory
Hadoop command: hadoop fs -ls /user
R function: hdfs.ls("/user/")
View an HDFS data file
Hadoop command: hadoop fs -cat /user/hdfs/o_same_school/part-m-00000
R function: hdfs.cat("/user/hdfs/o_same_school/part-m-00000")
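Besides listing and reading, rhdfs also wraps the usual file-transfer operations. A small sketch; the local and HDFS paths below are only illustrative, and the function names are those of rhdfs 1.0.x, so check the package help if your version differs:

library(rhdfs)
hdfs.init()

# Copy a local file into HDFS, then pull it back to the local disk.
hdfs.put("/tmp/local_sample.csv", "/user/hdfs/sample.csv")   # local -> HDFS
hdfs.get("/user/hdfs/sample.csv", "/tmp/sample_copy.csv")    # HDFS -> local

# Create and then remove a directory on HDFS.
hdfs.mkdir("/user/hdfs/testdir")
hdfs.rm("/user/hdfs/testdir")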
Next, we run an rmr MapReduce task.
A plain R program:

> small.ints = 1:10
> sapply(small.ints, function(x) x^2)

The same computation as a MapReduce program in R:

> small.ints = to.dfs(1:10)
> mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
> from.dfs("/tmp/RtmpWnzxl4/file5deb791fcbd5")

Because MapReduce can only access the HDFS file system, the data first has to be written to HDFS with to.dfs. The result of the MapReduce computation is then read back from HDFS with the from.dfs function.
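Note that the temporary output path above changes on every run. A more convenient pattern, sketched here with the same rmr2 calls, is to keep the object returned by mapreduce() and pass it straight to from.dfs():

library(rmr2)

small.ints <- to.dfs(1:10)
# mapreduce() returns a handle to its output; from.dfs() accepts it directly,
# so there is no need to copy the /tmp/Rtmp... path by hand.
out <- mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
result <- from.dfs(out)
result$val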
The second rmr example is wordcount, which counts the words in a file.

> input <- '/user/hdfs/o_same_school/part-m-00000'
> wordcount = function(input, output = NULL, pattern = " "){
    wc.map = function(., lines) {
      keyval(unlist(strsplit(x = lines, split = pattern)), 1)
    }
    wc.reduce = function(word, counts) {
      keyval(word, sum(counts))
    }
    mapreduce(input = input, output = output, input.format = "text",
              map = wc.map, reduce = wc.reduce, combine = T)
  }
> wordcount(input)
> from.dfs("/tmp/RtmpfZUFEa/file6cac626aa4a7")

I placed the data file /user/hdfs/o_same_school/part-m-00000 on HDFS in advance. We define the wordcount MapReduce function, run it, and finally fetch the result from HDFS with from.dfs.
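To make the result easier to read, the key/value pairs returned by from.dfs can be turned into an ordinary data frame and sorted by frequency. A sketch building on the wordcount function above:

# Run the wordcount job, fetch its output, and rank the words by count.
out <- from.dfs(wordcount(input))
freq <- data.frame(word = out$key, count = out$val, stringsAsFactors = FALSE)
head(freq[order(-freq$count), ], 10)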
Code:
1. Using the rhdfs package

Start R:
> library(rhdfs)
Loading required package: rJava
HADOOP_CMD=/root/hadoop/hadoop-1.0.3/bin/hadoop
Be sure to run hdfs.init()
> hdfs.init()

1.1 List the Hadoop directory from the command line

~ hadoop fs -ls /user
Found 4 items
drwxr-xr-x - root supergroup 0 2013-02-01 12:15 /user/conan
drwxr-xr-x - root supergroup 0 2013-03-06 17:24 /user/hdfs
drwxr-xr-x - root supergroup 0 2013-02-26 16:51 /user/hive
drwxr-xr-x - root supergroup 0 2013-03-06 17:21 /user/root

1.2 List the Hadoop directory with rhdfs

> hdfs.ls("/user/")
  permission owner      group size          modtime        file
1 drwxr-xr-x  root supergroup    0 2013-02-01 12:15 /user/conan
2 drwxr-xr-x  root supergroup    0 2013-03-06 17:24  /user/hdfs
3 drwxr-xr-x  root supergroup    0 2013-02-26 16:51  /user/hive
4 drwxr-xr-x  root supergroup    0 2013-03-06 17:21  /user/root

1.3 View the Hadoop data file from the command line
~ hadoop fs -cat /user/hdfs/o_same_school/part-m-00000
10,3,tsinghua university,2004-05-26 15:21:00.0
23,4007,北京第一七一中學,2004-05-31 06:51:53.0
51,4016,大連理工大學,2004-05-27 09:38:31.0
89,4017,Amherst College,2004-06-01 16:18:56.0
92,4017,斯坦福大學,2012-11-28 10:33:25.0
99,4017,Stanford University Graduate School of Business,2013-02-19 12:17:15.0
113,4017,Stanford University,2013-02-19 12:17:15.0
123,4019,St Paul's Co-educational College - Hong Kong,2004-05-27 18:04:17.0
138,4019,香港蘇浙小學,2004-05-27 18:59:58.0
172,4020,University,2004-05-27 19:14:34.0
182,4026,ff,2004-05-28 04:42:37.0
183,4026,ff,2004-05-28 04:42:37.0
189,4033,tsinghua,2011-09-14 12:00:38.0
195,4035,ba,2004-05-31 07:10:24.0
196,4035,ma,2004-05-31 07:10:24.0
197,4035,southampton university,2013-01-07 15:35:18.0
246,4067,美國史丹佛大學,2004-06-12 10:42:10.0
254,4067,美國史丹佛大學,2004-06-12 10:42:10.0
255,4067,美國休士頓大學,2004-06-12 10:42:10.0
257,4068,清華大學,2004-06-12 10:42:10.0
258,4068,北京八中,2004-06-12 17:34:02.0
262,4068,香港中文大學,2004-06-12 17:34:02.0
310,4070,首都師范大學初等教育學院,2004-06-14 15:35:52.0
312,4070,北京師范大學經濟學院,2004-06-14 15:35:52.0

1.4 View the Hadoop data file with rhdfs

> hdfs.cat("/user/hdfs/o_same_school/part-m-00000")
 [1] "10,3,tsinghua university,2004-05-26 15:21:00.0"
 [2] "23,4007,北京第一七一中學,2004-05-31 06:51:53.0"
 [3] "51,4016,大連理工大學,2004-05-27 09:38:31.0"
 [4] "89,4017,Amherst College,2004-06-01 16:18:56.0"
 [5] "92,4017,斯坦福大學,2012-11-28 10:33:25.0"
 [6] "99,4017,Stanford University Graduate School of Business,2013-02-19 12:17:15.0"
 [7] "113,4017,Stanford University,2013-02-19 12:17:15.0"
 [8] "123,4019,St Paul's Co-educational College - Hong Kong,2004-05-27 18:04:17.0"
 [9] "138,4019,香港蘇浙小學,2004-05-27 18:59:58.0"
[10] "172,4020,University,2004-05-27 19:14:34.0"
[11] "182,4026,ff,2004-05-28 04:42:37.0"
[12] "183,4026,ff,2004-05-28 04:42:37.0"
[13] "189,4033,tsinghua,2011-09-14 12:00:38.0"
[14] "195,4035,ba,2004-05-31 07:10:24.0"
[15] "196,4035,ma,2004-05-31 07:10:24.0"
[16] "197,4035,southampton university,2013-01-07 15:35:18.0"
[17] "246,4067,美國史丹佛大學,2004-06-12 10:42:10.0"
[18] "254,4067,美國史丹佛大學,2004-06-12 10:42:10.0"
[19] "255,4067,美國休士頓大學,2004-06-12 10:42:10.0"
[20] "257,4068,清華大學,2004-06-12 10:42:10.0"
[21] "258,4068,北京八中,2004-06-12 17:34:02.0"
[22] "262,4068,香港中文大學,2004-06-12 17:34:02.0"
[23] "310,4070,首都師范大學初等教育學院,2004-06-14 15:35:52.0"
[24] "312,4070,北京師范大學經濟學院,2004-06-14 15:35:52.0"

2. Using the rmr2 package
Start R:
> library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: digest
Loading required package: functional
Loading required package: stringr
Loading required package: plyr
Loading required package: reshape2

2.1 Run the task in plain R

> small.ints = 1:10
> sapply(small.ints, function(x) x^2)
 [1]   1   4   9  16  25  36  49  64  81 100

2.2 Run the task with rmr2

> small.ints = to.dfs(1:10)
13/03/07 12:12:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/07 12:12:55 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/03/07 12:12:55 INFO compress.CodecPool: Got brand-new compressor
> mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
packageJobJar: [/tmp/RtmpWnzxl4/rmr-local-env5deb2b300d03, /tmp/RtmpWnzxl4/rmr-global-env5deb398a522b, /tmp/RtmpWnzxl4/rmr-streaming-map5deb1552172d, /root/hadoop/tmp/hadoop-unjar7838617732558795635/] [] /tmp/streamjob4380275136001813619.jar tmpDir=null
13/03/07 12:12:59 INFO mapred.FileInputFormat: Total input paths to process : 1
13/03/07 12:12:59 INFO streaming.StreamJob: getLocalDirs(): [/root/hadoop/tmp/mapred/local]
13/03/07 12:12:59 INFO streaming.StreamJob: Running job: job_201302261738_0293
13/03/07 12:12:59 INFO streaming.StreamJob: To kill this job, run:
13/03/07 12:12:59 INFO streaming.StreamJob: /disk1/hadoop/hadoop-1.0.3/libexec/../bin/hadoop job -Dmapred.job.tracker=hdfs://r.qa.tianji.com:9001 -kill job_201302261738_0293
13/03/07 12:12:59 INFO streaming.StreamJob: Tracking URL: http://192.168.1.243:50030/jobdetails.jsp?jobid=job_201302261738_0293
13/03/07 12:13:00 INFO streaming.StreamJob: map 0% reduce 0%
13/03/07 12:13:15 INFO streaming.StreamJob: map 100% reduce 0%
13/03/07 12:13:21 INFO streaming.StreamJob: map 100% reduce 100%
13/03/07 12:13:21 INFO streaming.StreamJob: Job complete: job_201302261738_0293
13/03/07 12:13:21 INFO streaming.StreamJob: Output: /tmp/RtmpWnzxl4/file5deb791fcbd5
> from.dfs("/tmp/RtmpWnzxl4/file5deb791fcbd5")
$key
NULL

$val
       v
 [1,]  1   1
 [2,]  2   4
 [3,]  3   9
 [4,]  4  16
 [5,]  5  25
 [6,]  6  36
 [7,]  7  49
 [8,]  8  64
 [9,]  9  81
[10,] 10 100

2.3 Run the wordcount task with rmr2
> input <- '/user/hdfs/o_same_school/part-m-00000'
> wordcount = function(input, output = NULL, pattern = " "){
    wc.map = function(., lines) {
      keyval(unlist(strsplit(x = lines, split = pattern)), 1)
    }
    wc.reduce = function(word, counts) {
      keyval(word, sum(counts))
    }
    mapreduce(input = input, output = output, input.format = "text",
              map = wc.map, reduce = wc.reduce, combine = T)
  }
> wordcount(input)
packageJobJar: [/tmp/RtmpfZUFEa/rmr-local-env6cac64020a8f, /tmp/RtmpfZUFEa/rmr-global-env6cac73016df3, /tmp/RtmpfZUFEa/rmr-streaming-map6cac7f145e02, /tmp/RtmpfZUFEa/rmr-streaming-reduce6cac238dbcf, /tmp/RtmpfZUFEa/rmr-streaming-combine6cac2b9098d4, /root/hadoop/tmp/hadoop-unjar6584585621285839347/] [] /tmp/streamjob9195921761644130661.jar tmpDir=null
13/03/07 12:34:41 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/07 12:34:41 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/07 12:34:41 INFO mapred.FileInputFormat: Total input paths to process : 1
13/03/07 12:34:41 INFO streaming.StreamJob: getLocalDirs(): [/root/hadoop/tmp/mapred/local]
13/03/07 12:34:41 INFO streaming.StreamJob: Running job: job_201302261738_0296
13/03/07 12:34:41 INFO streaming.StreamJob: To kill this job, run:
13/03/07 12:34:41 INFO streaming.StreamJob: /disk1/hadoop/hadoop-1.0.3/libexec/../bin/hadoop job -Dmapred.job.tracker=hdfs://r.qa.tianji.com:9001 -kill job_201302261738_0296
13/03/07 12:34:41 INFO streaming.StreamJob: Tracking URL: http://192.168.1.243:50030/jobdetails.jsp?jobid=job_201302261738_0296
13/03/07 12:34:42 INFO streaming.StreamJob: map 0% reduce 0%
13/03/07 12:34:59 INFO streaming.StreamJob: map 100% reduce 0%
13/03/07 12:35:08 INFO streaming.StreamJob: map 100% reduce 17%
13/03/07 12:35:14 INFO streaming.StreamJob: map 100% reduce 100%
13/03/07 12:35:20 INFO streaming.StreamJob: Job complete: job_201302261738_0296
13/03/07 12:35:20 INFO streaming.StreamJob: Output: /tmp/RtmpfZUFEa/file6cac626aa4a7
> from.dfs("/tmp/RtmpfZUFEa/file6cac626aa4a7")
$key
 [1] "-"
 [2] "04:42:37.0"
 [3] "06:51:53.0"
 [4] "07:10:24.0"
 [5] "09:38:31.0"
 [6] "10:33:25.0"
 [7] "10,3,tsinghua"
 [8] "10:42:10.0"
 [9] "113,4017,Stanford"
[10] "12:00:38.0"
[11] "12:17:15.0"
[12] "123,4019,St"
[13] "138,4019,香港蘇浙小學,2004-05-27"
[14] "15:21:00.0"
[15] "15:35:18.0"
[16] "15:35:52.0"
[17] "16:18:56.0"
[18] "172,4020,University,2004-05-27"
[19] "17:34:02.0"
[20] "18:04:17.0"
[21] "182,4026,ff,2004-05-28"
[22] "183,4026,ff,2004-05-28"
[23] "18:59:58.0"
[24] "189,4033,tsinghua,2011-09-14"
[25] "19:14:34.0"
[26] "195,4035,ba,2004-05-31"
[27] "196,4035,ma,2004-05-31"
[28] "197,4035,southampton"
[29] "23,4007,北京第一七一中學,2004-05-31"
[30] "246,4067,美國史丹佛大學,2004-06-12"
[31] "254,4067,美國史丹佛大學,2004-06-12"
[32] "255,4067,美國休士頓大學,2004-06-12"
[33] "257,4068,清華大學,2004-06-12"
[34] "258,4068,北京八中,2004-06-12"
[35] "262,4068,香港中文大學,2004-06-12"
[36] "312,4070,北京師范大學經濟學院,2004-06-14"
[37] "51,4016,大連理工大學,2004-05-27"
[38] "89,4017,Amherst"
[39] "92,4017,斯坦福大學,2012-11-28"
[40] "99,4017,Stanford"
[41] "Business,2013-02-19"
[42] "Co-educational"
[43] "College"
[44] "College,2004-06-01"
[45] "Graduate"
[46] "Hong"
[47] "Kong,2004-05-27"
[48] "of"
[49] "Paul's"
[50] "School"
[51] "University"
[52] "university,2004-05-26"
[53] "university,2013-01-07"
[54] "University,2013-02-19"
[55] "310,4070,首都師范大學初等教育學院,2004-06-14"

$val
 [1] 1 2 1 2 1 1 1 4 1 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

When reposting, please cite the source:
http://blog.fens.me/rhadoop-rhadoop/
Reposted from: https://www.cnblogs.com/zhengrunjian/p/4530827.html