Big Data in Practice, Lesson 16 (Part 1) - Spark-Core04
一、上次課回顧
二、Shuffle剖析
- 2.1 Shuffle簡(jiǎn)介
- 2.2 Shuffle背景
- 2.3 Shuffle Performance Impact(Shuffle 性能上的影響)
三、shuffle在Spark-shell操作
- 3.1 IDEA下進(jìn)行分組
- 3.2 coalesce和repartition 在生產(chǎn)中的使用
- 3.3 reduceByKey和groupByKey分析
- 3.4 圖解reduceByKey和groupByKey的shuffle過(guò)程
- 3.5 探究源碼reduceByKey和groupByKey的combiner
四、擴(kuò)展:aggregateByKey算子
- 4.1 collectAsMap
1. Review of the Last Lesson

Big Data in Practice, Lesson 15 (Part 1) - Spark-Core03:
https://blog.csdn.net/zhikanjiani/article/details/91045640#id_4.2

YARN and HADOOP_CONF_DIR

For YARN mode, do we need to edit $SPARK_HOME/conf/slaves and change localhost to hadoop002? No. When running on YARN, the machine only acts as a client; that is why Spark on YARN is said to need nothing more than a client.

Q: Does Spark on YARN require starting the Spark standalone daemons ($SPARK_HOME/sbin/start-all.sh, start-master.sh, start-slaves.sh, the slaves file)?

No. There is no need to bring up the Spark standalone nodes at all. A gateway machine plus spark-submit is all it takes; no extra processes are required.
2. Shuffle Analysis

2.1 Shuffle Overview

- Recap: an action triggers a job; within a job, each shuffle splits off a new stage, and a stage is a set of tasks.

See the official docs: http://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations

Requirement:

Given a pile of call records, count how many outgoing calls were made this month.
On a phone's call screen you see: contact, call time, call duration, call log.
Statistical analysis in Spark boils down to word count: emit (day + direction, 1), with "day + direction" as the key, then apply reduceByKey().
Records sharing the same "day + direction" key must be shuffled to the same reducer; without gathering them you cannot accumulate them.
This is the essence of a shuffle: data with a particular characteristic is gathered onto one node for computation -- here, a +1 accumulation.
Note: avoid shuffle operations whenever possible.
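The requirement above can be sketched as a word-count-style Spark job. This is a minimal sketch; the record layout, field values, and app name are made-up illustrations, not from the lesson:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CallRecordApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CallRecordApp").setMaster("local[2]"))

    // Hypothetical records: (day, direction); "out" marks an outgoing call
    val records = sc.parallelize(Seq(
      ("2019-06-01", "out"), ("2019-06-01", "out"),
      ("2019-06-01", "in"),  ("2019-06-02", "out")
    ))

    // (day + direction, 1) as the key, then reduceByKey: records with the
    // same key are shuffled to the same reducer and summed there
    val outgoingPerDay = records
      .filter(_._2 == "out")
      .map { case (day, dir) => (s"$day-$dir", 1) }
      .reduceByKey(_ + _)

    outgoingPerDay.collect().foreach(println)
    sc.stop()
  }
}
```

The reduceByKey step is exactly where the shuffle happens: all (day-out, 1) pairs for the same day must land on the same reducer before they can be summed.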
- Shuffle operations (from the official docs)

Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark's mechanism for re-distributing data so that it's grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

In short: the shuffle redistributes data across partitions, which typically means copying it between executors and machines, incurring disk I/O and network I/O; that is what makes the shuffle complex and expensive.
2.2 Shuffle Background

- We use reduceByKey to understand what happens during a shuffle.
- reduceByKey generates a new RDD where all values for a single key are combined into a tuple: the key and the result of running a reduce function against all values associated with that key (keys with the same characteristic end up on the same reducer).
- Not all values for a single key necessarily reside in the same partition, or even on the same machine. The challenge: the results span partitions, yet they must be co-located to compute the final value.
- Which operations can trigger a shuffle? Repartition operations such as repartition and coalesce, ByKey operations such as reduceByKey and groupByKey, and join operations such as cogroup and join.
2.3 Shuffle Performance Impact

- Spark generates sets of tasks: a job produces stages, each shuffle introduces a new stage, and each stage consists of a set of tasks.
- Internally, results from individual map tasks are kept in memory; on the reduce side, tasks read the relevant sorted blocks (the map-side output).
3. Shuffle Operations in spark-shell

1. Start spark-shell:
scala> val info = sc.textFile("hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt")
info: org.apache.spark.rdd.RDD[String] = hdfs://hadoop002:9000/wordcount/input/ruozeinput.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> info.partitions.length
res0: Int = 2

scala> val info1 = info.coalesce(1)
info1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[2] at coalesce at <console>:25

scala> info1.partitions.length
res1: Int = 1

scala> val info2 = info.coalesce(4)
info2: org.apache.spark.rdd.RDD[String] = CoalescedRDD[3] at coalesce at <console>:25

scala> info2.partitions.length
res2: Int = 2

scala> val info3 = info.coalesce(4,true)
info3: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at <console>:25

scala> info3.partitions.length
res3: Int = 4

scala> info3.collect
res4: Array[String] = Array(hello world, hello, hello world john)

The coalesce method, explained:
The full signature in RDD.scala:

def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null): RDD[T]

- You pass a partition count; the true/false shuffle flag is optional and defaults to false.

repartition is just a wrapper:

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

- It simply calls coalesce with shuffle = true, so repartition always shuffles.
scala> info3.collect
res4: Array[String] = Array(hello world, hello, hello world john)

scala> val info4 = info.repartition(5)
info4: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at <console>:25

scala> info4.collect
res6: Array[String] = Array(hello world john, hello world, hello)

scala> info.partitions.length
res7: Int = 2

When shrinking the partition count (e.g. going from 2 partitions down) to redistribute data, use coalesce so you avoid a shuffle.
3.1 Grouping in IDEA:

package spark01

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object RepartitionApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setAppName("LogApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val students = sc.parallelize(List("黃帆","梅宇豪","秦朗","楊朝珅","王乾","沈兆乘","沈其文","陳思文"), 3)
    students.mapPartitionsWithIndex((index, partition) => {
      val stus = new ListBuffer[String]
      while (partition.hasNext) { // iterate over this partition
        stus += ("~~~~" + partition.next() + ", group: " + (index + 1))
      }
      stus.iterator
    }).foreach(println) // print the results
    sc.stop()
  }
}

mapPartitionsWithIndex() tags each partition with a group number; the parallelism set in parallelize makes it explicitly 3 groups.

Requirement 1:
The department downsizes and three groups become two. Modify as follows:

- students.mapPartitionsWithIndex((index, partition) ...

becomes:

students.coalesce(2).mapPartitionsWithIndex((index, partition) ...

Requirement 2:

Before the layoff there were three groups; re-group them into 5 groups:

students.repartition(5).mapPartitionsWithIndex((index, partition) ...

To show the original partitioning and the repartition operation directly, run the following code:
package Sparkcore04

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object RepartitionApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    sparkConf.setAppName("LogApp").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val students = sc.parallelize(List("梅宇豪","黃帆","楊超神","薛思雨","朱昱璇","周一虹","王曉嵐","沈兆乘","陳思文"), 3)
    students.mapPartitionsWithIndex((index, partition) => {
      val stus = new ListBuffer[String]
      while (partition.hasNext) {
        stus += ("~~~~" + partition.next() + ", group: " + (index + 1))
      }
      stus.iterator
    }).foreach(println)

    println("---------------------------separator---------------------------")

    students.repartition(4).mapPartitionsWithIndex((index, partition) => {
      val stus = new ListBuffer[String]
      while (partition.hasNext) {
        stus += ("~~~" + partition.next() + ", new group " + (index + 1))
      }
      stus.iterator
    }).foreach(println)
    sc.stop()
  }
}

3.2 Using coalesce and repartition in Production:
Suppose an RDD has 300 partitions, each holding a single record "id=100".
A filter (id > 99) keeps everything: the result is still 300 partitions, each with one record.
Now change the starting condition so the filter drops most of the data: you are left with 300 partitions that are mostly empty, and 300 tasks doing almost nothing. coalesce collapses those small partitions without a shuffle.
- repartition use case: scatter the data to raise parallelism.
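A minimal sketch of the filter-then-coalesce scenario above; the app name and the concrete numbers are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CoalesceApp").setMaster("local[2]"))

    // 300 partitions, one record each
    val ids = sc.parallelize(1 to 300, 300)

    // filter keeps the partition layout: still 300 partitions,
    // most of them now empty
    val filtered = ids.filter(_ > 290)
    println(filtered.partitions.length)   // still 300

    // collapse the small partitions without a shuffle
    val compacted = filtered.coalesce(10)
    println(compacted.partitions.length)  // 10

    sc.stop()
  }
}
```

Without the coalesce, a downstream action would schedule 300 near-empty tasks; after it, only 10.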
3.3 reduceByKey and groupByKey Analysis

1. Hand-write a word count:

Start spark-shell --master local[2] (e.g. from SecureCRT) and run:

sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_).collect

Look at the DAG: the first operator is textFile, the second flatMap, the third map; hitting reduceByKey splits the job in two -- one stage before, one after.
Two stages: for reduceByKey, the (_,1) records are first written out, then read back in.
reduceByKey operates on [String, Int] pairs: the word and its count.
2. The data types of reduceByKey and groupByKey:

scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_,1)).reduceByKey(_+_)
res4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[9] at reduceByKey at <console>:25

scala> sc.textFile("file:///home/hadoop/data/ruozeinput.txt").flatMap(_.split("\t")).map((_,1)).groupByKey()
res5: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[14] at groupByKey at <console>:25
Word count via reduceByKey:
res10: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))

Word count via groupByKey (followed by a map such as .map(x => (x._1, x._2.sum)) to sum each key's Iterable):
res11: Array[(String, Int)] = Array((hello,3), (world,2), (john,1))
Summary:

Compare the two jobs in the UI: reduceByKey reads 53 B of input and shuffles 161 B; groupByKey reads the same 53 B but shuffles 172 B.
groupByKey ships all the data uncomputed.
reduceByKey performs local aggregation first (a map-side combiner); only the combined results go through the shuffle, so less data moves.
3.4 The Shuffle Process of reduceByKey and groupByKey, Illustrated

Suppose three map tasks hold the data: map 1: (a,1)(b,1); map 2: (a,1)(b,1)(a,1)(b,1); map 3: (a,1)(b,1)(a,1)(b,1)(a,1)(b,1).

groupByKey's shuffle: every (key, 1) pair crosses the network unchanged.
reduceByKey's shuffle: each map task aggregates locally first, so map 2 ships only (a,2)(b,2) and map 3 only (a,3)(b,3).

reduceByKey shuffles less data because the map-side aggregation shrinks it before the shuffle.
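The map-side combine idea can be simulated with plain Scala collections. This is a counting sketch of the three map tasks above, not Spark's actual implementation:

```scala
object CombinerDemo {
  // the three "map tasks" from the example above
  val maps: Seq[Seq[(String, Int)]] = Seq(
    Seq(("a", 1), ("b", 1)),
    Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1)),
    Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1), ("b", 1)))

  // groupByKey-style shuffle: every record crosses the "network"
  def shuffledWithoutCombine: Int = maps.map(_.size).sum

  // reduceByKey-style shuffle: each map task pre-aggregates per key first,
  // so it only ships one record per distinct key
  def shuffledWithCombine: Int =
    maps.map(m => m.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.size).sum

  def main(args: Array[String]): Unit = {
    println(shuffledWithoutCombine) // 12 records shuffled
    println(shuffledWithCombine)    // 6 records shuffled (2 keys x 3 map tasks)
  }
}
```

Twelve records shrink to six here; on real data with few distinct keys and many values, the savings are far larger.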
4. Extension: the aggregateByKey Operator

Some problems cannot be solved with reduceByKey alone -- its reduce function must take and return the value type -- which motivates this new operator.
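A minimal aggregateByKey sketch (the app name and sample data are made up): computing a per-key average is exactly the kind of problem reduceByKey cannot express, because the accumulator type (sum, count) differs from the value type. aggregateByKey takes a zero value plus separate within-partition and cross-partition merge functions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AggregateByKeyApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AggregateByKeyApp").setMaster("local[2]"))

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 5), ("b", 2), ("b", 4)), 2)

    // zero value (0, 0) = (sum, count); seqOp folds a value into the
    // accumulator inside a partition, combOp merges accumulators across
    // partitions -- note the accumulator type differs from the value type
    val avgPerKey = pairs
      .aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),
        (a, b) => (a._1 + b._1, a._2 + b._2))
      .mapValues { case (sum, cnt) => sum.toDouble / cnt }

    avgPerKey.collect().foreach(println) // (a,3.0) and (b,3.0)
    sc.stop()
  }
}
```

Like reduceByKey, aggregateByKey performs map-side combining with seqOp before the shuffle.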
There are no secrets in front of the source code.

3.5 Exploring the combiner in the reduceByKey and groupByKey source

The groupByKey method, defined in PairRDDFunctions.scala:
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}
Note that groupByKey explicitly passes mapSideCombine = false: no map-side combine is used.
The reduceByKey path (reduceByKey delegates to combineByKeyWithClassTag):
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0

Note that mapSideCombine defaults to true here, so reduceByKey performs a map-side combine.
4.1 collectAsMap

Note: all the data is loaded into the driver's memory; with a large dataset the driver cannot hold it and will crash.

Defined in PairRDDFunctions.scala:

/**
 * Return the key-value pairs in this RDD to the master as a Map.
 *
 * Warning: this doesn't return a multimap (so if you have multiple values to the same key, only
 * one value per key is preserved in the map returned)
 *
 * @note this method should only be used if the resulting data is expected to be small, as
 * all the data is loaded into the driver's memory.
 */
def collectAsMap(): Map[K, V] = self.withScope {
  val data = self.collect()
  val map = new mutable.HashMap[K, V]
  map.sizeHint(data.length)
  data.foreach { pair => map.put(pair._1, pair._2) }
  map
}
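A quick illustration of both warnings in that doc comment (app name and data are made up): a duplicate key keeps only one value, and the whole result must fit on the driver:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectAsMapApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CollectAsMapApp").setMaster("local[2]"))

    // duplicate key "a": collectAsMap keeps only one value per key
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val m = pairs.collectAsMap()

    // the entire map lives in driver memory -- only safe for small results
    println(m.size) // 2, not 3: "a" kept a single value
    sc.stop()
  }
}
```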
Remember: whenever you see runJob in the source, that method triggers an action. In RDD.scala:
/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

/**
 * Return an array that contains all of the elements in this RDD.
 *
 * @note This method should only be used if the resulting array is expected to be small, as
 * all the data is loaded into the driver's memory.
 */
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

Array.concat(results: _*) ==> the `: _*` at the call site is not itself the varargs declaration; it expands the sequence into varargs.
Click into concat for the next layer of source:

def concat[T: ClassTag](xss: Array[T]*): Array[T] // this is where the varargs parameter is declared

This came up in the Scala04 lesson:

println(sum(1.to(10): _*))
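A small self-contained illustration of varargs and `: _*` expansion; the `sum` helper here is hypothetical, written for this demo rather than taken from the lesson's code:

```scala
object VarargsDemo {
  // `xs: Int*` declares varargs: the callee receives the arguments as a Seq[Int]
  def sum(xs: Int*): Int = xs.foldLeft(0)(_ + _)

  def main(args: Array[String]): Unit = {
    println(sum(1, 2, 3))      // individual arguments: prints 6
    println(sum(1.to(10): _*)) // `: _*` expands the Range into varargs: prints 55
  }
}
```

Passing `1.to(10)` without `: _*` would not compile: a Seq is not automatically spread into a varargs parameter.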