
Hadoop vs Spark Performance Comparison


The benchmarks are based on Spark-0.4 and Hadoop-0.20.2.

1. Kmeans

Data: synthetically generated 3-D points, clustered around the 8 vertices of a cube:

{0, 0, 0}, {0, 10, 0}, {0, 0, 10}, {0, 10, 10},

{10, 0, 0}, {10, 0, 10}, {10, 10, 0}, {10, 10, 10}
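The generator itself is not included in the original post. Below is a minimal sketch of how such data could be produced, assuming Gaussian noise around a randomly chosen vertex; the object name, output path, point count, and noise scale are all assumptions, not the original code.

import java.io.PrintWriter
import scala.util.Random

object GenCubePoints {
  def main(args: Array[String]) {
    // The 8 vertices of a cube with edge length 10
    val vertices = for (x <- Seq(0.0, 10.0); y <- Seq(0.0, 10.0); z <- Seq(0.0, 10.0)) yield (x, y, z)
    val rand = new Random()
    val out = new PrintWriter("Square-10GB.txt")  // hypothetical local path, uploaded to HDFS afterwards
    for (i <- 1 to 1000000) {                     // point count is an assumption; the real file holds ~190 million points
      val (x, y, z) = vertices(rand.nextInt(vertices.size))
      // Unit-variance Gaussian noise around the chosen vertex (noise scale is an assumption)
      out.println((x + rand.nextGaussian()) + " " + (y + rand.nextGaussian()) + " " + (z + rand.nextGaussian()))
    }
    out.close()
  }
}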

Point number: 189,918,082 (about 190 million 3-D points)
Capacity: 10GB
HDFS Location: /user/LijieXu/Kmeans/Square-10GB.txt

Program logic:

The blocks on HDFS are read into memory, forming an RDD of vectors (one partition per block).

A map over this RDD assigns each vector (point) to its nearest cluster, emitting (K, V) pairs of the form (class, (point, 1)) as a new RDD.

Before the reduce, a combine runs inside each partition to accumulate the per-class coordinate sums locally, so each partition emits at most K key-value pairs.

Finally, a reduce merges these partial sums into a new RDD keyed by class, whose values are the summed coordinates; a final map divides by the counts to obtain the new centers. A sketch of one such iteration follows.
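This is not the original SparkKMeans source, but a minimal sketch of one iteration under the description above; the object name and the closestPoint helper are assumptions, and reduceByKey is assumed to act as both the per-partition combiner and the final reducer.

import spark.RDD
import spark.SparkContext._  // brings reduceByKey into scope for (K, V) RDDs

object KMeansStepSketch {
  // Index of the center nearest to p (squared Euclidean distance).
  def closestPoint(p: Array[Double], centers: Array[Array[Double]]): Int =
    centers.indices.minBy(i => centers(i).zip(p).map { case (c, x) => (c - x) * (c - x) }.sum)

  // One iteration: classify each point, sum coordinates per class, then average to get new centers.
  def kmeansStep(points: RDD[Array[Double]], centers: Array[Array[Double]]): Map[Int, Array[Double]] =
    points
      .map(p => (closestPoint(p, centers), (p, 1)))                                // (class, (point, 1))
      .reduceByKey((a, b) => (a._1.zip(b._1).map(t => t._1 + t._2), a._2 + b._2))  // per-partition combine, then reduce
      .map { case (k, (sum, count)) => (k, sum.map(_ / count)) }                   // new center = coordinate sum / count
      .collect()
      .toMap
}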

The data is first uploaded to HDFS, and the job is then launched on the master:

root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/LijieXu/Kmeans/Square-10GB.txt 8 2.0

The KMeans algorithm runs iteratively.

There are 160 tasks in total (160 * 64MB HDFS blocks = 10GB).

The job uses 32 CPU cores and 18.9GB of memory.

Memory consumption is about 4.5GB per machine (roughly 40GB in total): the point data itself takes about 10GB * 2, and the intermediate map output (K, V) => (int, (vector, 1)) takes roughly another 10GB.

Final result:

0.505246194 s

Final centers: Map(5 -> (13.997101228817169, 9.208875044622895, -2.494072457488311), 8 -> (-2.33522333047955, 9.128892414676326, 1.7923150585737604), 7 -> (8.658031587043952, 2.162306996983008, 17.670646829079146), 3 -> (11.530154433698268, 0.17834347219956842, 9.224352885937776), 4 -> (12.722903153986868, 8.812883284216143, 0.6564509961064319), 1 -> (6.458644369071984, 11.345681702383024, 7.041924994173552), 6 -> (12.887793408866614, -1.5189406469928937, 9.526393664105957), 2 -> (2.3345459304412164, 2.0173098597285533, 1.4772489989976143))

For reference: scanning 10GB at 50MB/s takes about 3.5 min; at 10MB/s, about 15 min.

Testing on 20GB of data

Point number: 377,370,313 (about 377 million 3-D points)
Capacity: 20GB
HDFS Location: /user/LijieXu/Kmeans/Square-20GB.txt

Test command:

root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/LijieXu/Kmeans/Square-20GB.txt 8 2.0 | tee mylogs/sqaure-20GB-kmeans.log

Clustering result:

Final centers: Map(5 -> (-0.47785701742763115, -1.5901830956323306, -0.18453046159033773), 8 -> (1.1073911553593858, 9.051671594514225, -0.44722211311446924), 7 -> (1.4960397239284795, 10.173412443492643, -1.7932911100570954), 3 -> (-1.4771114031182642, 9.046878176063172, -2.4747981387714444), 4 -> (-0.2796747780312184, 0.06910629855122015, 10.268115903887612), 1 -> (10.467618592186486, -1.168580362309453, -1.0462842137817263), 6 -> (0.7569895433952736, 0.8615441990490469, 9.552726007309518), 2 -> (10.807948500515304, -0.5368803187391366, 0.04258123037074164))

These are essentially the 8 expected centers.

Memory consumption: about 5.8GB per node, roughly 50GB in total.

Memory breakdown: 20GB of raw input data plus 20GB of map output.

Iteration 1: 108 s
Iteration 2: 0.93 s

The second iteration is much faster because the RDD partitions are already cached, as the log shows:

12/06/05 11:11:08 INFO spark.CacheTracker: Looking for RDD partition 2:302

12/06/05 11:11:08 INFO spark.CacheTracker: Found partition in cache!

Testing on 20GB of data (with more iterations)

root@master:/opt/spark# ./run spark.examples.SparkKMeans master@master:5050 hdfs://master:9000/user/LijieXu/Kmeans/Square-20GB.txt 8 0.8

Number of tasks: 320

Times:

Iteration 1: 100.9 s
Iteration 2: 0.93 s
Iteration 3: 4.6 s
Iteration 4: 3.9 s
Iteration 5: 3.9 s
Iteration 6: 3.9 s

Effect of the number of iterations on memory usage: essentially none. The main memory consumers remain the 20GB input RDD and the 20GB of intermediate data.

Final centers: Map(5 -> (-4.728089224526789E-5, 3.17334874733142E-5, -2.0605806380414582E-4), 8 -> (1.1841686358289191E-4, 10.000062966002101, 9.999933240005394), 7 -> (9.999976672588097, 10.000199556926772, -2.0695123602840933E-4), 3 -> (-1.3506815993198176E-4, 9.999948270638338, 2.328148782609023E-5), 4 -> (3.2493629851483764E-4, -7.892413981250518E-5, 10.00002515017671), 1 -> (10.00004313126956, 7.431996896171192E-6, 7.590402882208648E-5), 6 -> (9.999982611661382, 10.000144597573051, 10.000037734639696), 2 -> (9.999958673426654, -1.1917651103354863E-4, 9.99990217533504))

Visualization of the clustering results (figure not included here).

2. HdfsTest

Test logic:

package spark.examples

import spark._

object HdfsTest {
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "HdfsTest")
    val file = sc.textFile(args(1))
    val mapped = file.map(s => s.length).cache()
    for (iter <- 1 to 10) {
      val start = System.currentTimeMillis()
      for (x <- mapped) { x + 2 }
      // println("Processing: " + x)
      val end = System.currentTimeMillis()
      println("Iteration " + iter + " took " + (end - start) + " ms")
    }
  }
}

The program first reads a text file from HDFS into file.

It then maps each line to its character count and caches the result in the in-memory RDD mapped.

Each iteration scans mapped, adds 2 to every value, and measures the time taken for the scan plus the addition.

There is only a map stage, no reduce.

Testing on the 10GB Wikipedia dump.

What is actually being measured is RDD read performance.

root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt

Results:

Iteration 1 took 12900 ms (≈ 13 s)

Iteration 2 took 388 ms

Iteration 3 took 472 ms

Iteration 4 took 490 ms

Iteration 5 took 459 ms

Iteration 6 took 492 ms

Iteration 7 took 480 ms

Iteration 8 took 501 ms

Iteration 9 took 479 ms

Iteration 10 took 432 ms

Memory consumption is about 2.7GB per node (about 9.4GB * 3 in total).


Testing on 90GB of RandomText data:

root@master:/opt/spark# ./run spark.examples.HdfsTest master@master:5050 hdfs://master:9000/user/LijieXu/RandomText90GB/RandomText90GB

Elapsed times:

Iteration 1: 111.905310882 s
Iteration 2: 4.681715228 s
Iteration 3: 4.469296148 s
Iteration 4: 4.441203887 s
Iteration 5: 1.999792125 s
Iteration 6: 2.151376037 s
Iteration 7: 1.889345699 s
Iteration 8: 1.847487668 s
Iteration 9: 1.827241743 s
Iteration 10: 1.747547323 s

Total memory consumption is about 30GB.

Per-node resource consumption: (chart not included here)

3. WordCount

寫程序:

import spark.SparkContext
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: wordcount <master> <jar>")
      System.exit(1)
    }
    val sp = new SparkContext(args(0), "wordcount", "/opt/spark", List(args(1)))
    val file = sp.textFile("hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt")
    val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://master:9000/user/Output/WikiResult3")
  }
}

Package it as mySpark.jar and upload it to /opt/spark/newProgram on the master.

Run the program:

root@master:/opt/spark# ./run -cp newProgram/mySpark.jar WordCount master@master:5050 newProgram/mySpark.jar

Mesos automatically copies the jar to the executor nodes and runs the job there.

Memory consumption: 10GB for the input file + 10GB of flatMap output + 15GB of intermediate map results (word, 1).

It is unclear where some of the remaining memory went.

Elapsed time: 50 sec (results not sorted).

Hadoop WordCount elapsed time: 120 to 140 sec (results likewise not sorted).

Per-node resource consumption: (chart not included here)

Hadoop tests

Kmeans

Run KMeans from Mahout:

root@master:/opt/mahout-distribution-0.6# bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -Dmapred.reduce.tasks=36 -i /user/LijieXu/Kmeans/Square-20GB.txt -o output -t1 3 -t2 1.5 -cd 0.8 -k 8 -x 6
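For readability, the flags in this command are glossed below to the best of my understanding of Mahout 0.6; treat these descriptions as assumptions rather than authoritative documentation.

-Dmapred.reduce.tasks=36 : number of reduce tasks per MapReduce job
-i  : input path on HDFS
-o  : output directory
-t1, -t2 : distance thresholds used by the canopy seeding step
-cd : convergence delta
-k  : number of clusters
-x  : maximum number of iterations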

Resource consumption on one slave while the job "Canopy Driver running buildClusters over input: output/data" (320 maps, 1 reduce) was running: (chart not included here)

Completed Jobs

Jobid | Name | Map Total | Reduce Total | Time
job_201206050916_0029 | Input Driver running over input: /user/LijieXu/Kmeans/Square-10GB.txt | 160 | 0 | 1 min 2 s
job_201206050916_0030 | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 160 | 1 | 1 min 6 s
job_201206050916_0031 | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 160 | 1 | 1 min 7 s
job_201206050916_0032 | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 160 | 1 | 1 min 7 s
job_201206050916_0033 | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 160 | 1 | 1 min 6 s
job_201206050916_0034 | KMeans Driver running runIteration over clustersIn: output/clusters-4 | 160 | 1 | 1 min 6 s
job_201206050916_0035 | KMeans Driver running runIteration over clustersIn: output/clusters-5 | 160 | 1 | 1 min 5 s
job_201206050916_0036 | KMeans Driver running clusterData over input: output/data | 160 | 0 | 55 s
job_201206050916_0037 | Input Driver running over input: /user/LijieXu/Kmeans/Square-20GB.txt | 320 | 0 | 1 min 31 s
job_201206050916_0038 | KMeans Driver running runIteration over clustersIn: output/clusters-0/part-randomSeed | 320 | 36 | 1 min 46 s
job_201206050916_0039 | KMeans Driver running runIteration over clustersIn: output/clusters-1 | 320 | 36 | 1 min 46 s
job_201206050916_0040 | KMeans Driver running runIteration over clustersIn: output/clusters-2 | 320 | 36 | 1 min 46 s
job_201206050916_0041 | KMeans Driver running runIteration over clustersIn: output/clusters-3 | 320 | 36 | 1 min 47 s
job_201206050916_0042 | KMeans Driver running clusterData over input: output/data | 320 | 0 | 1 min 34 s

Resource consumption across repeated KMeans runs on the 10GB and 20GB datasets: (charts not included here)

Hadoop WordCount test (charts not included here)

Running Spark interactively

Go to /opt/spark on the master and run:

MASTER=master@master:5050 ./spark-shell

This starts the Mesos-backed Spark shell; the framework then appears at master:8080:

Active Frameworks

ID | User | Name | Running Tasks | CPUs | MEM | Max Share | Connected
201206050924-0-0018 | root | Spark shell | 0 | 0 | 0.0 MB | 0.00 | 2012-06-06 21:12:56

scala> val file = sc.textFile("hdfs://master:9000/user/LijieXu/Wikipedia/txt/enwiki-20110405.txt")

scala> file.first

scala> val words = file.map(_.split(' ')).filter(_.size < 100) // yields RDD[Array[String]]

scala> words.cache

scala> words.filter(_.contains("Beijing")).count

12/06/06 22:12:33 INFO SparkContext: Job finished in 10.862765819 s

res1: Long = 855

scala> words.filter(_.contains("Beijing")).count

12/06/06 22:12:52 INFO SparkContext: Job finished in 0.71051464 s

res2: Long = 855

scala> words.filter(_.contains("Shanghai")).count

12/06/06 22:13:23 INFO SparkContext: Job finished in 0.667734427 s

res3: Long = 614

scala> words.filter(_.contains("Guangzhou")).count

12/06/06 22:13:42 INFO SparkContext: Job finished in 0.800617719 s

res4: Long = 134

Because of GC issues, very large datasets cannot be cached.
