

Spark Streaming Introduction, DStream, and DStream Operations (from study materials)

Published: 2024/9/27 · 豆豆

I. Introduction to Spark Streaming

1. Spark Streaming Overview

1.1. What is Spark Streaming

Spark Streaming, similar to Apache Storm, is used for processing streaming data. According to its official documentation, Spark Streaming offers high throughput and strong fault tolerance. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Ingested data can be processed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be saved to many destinations, such as HDFS or databases. Spark Streaming also integrates seamlessly with MLlib (machine learning) and GraphX.

1.2. Why Learn Spark Streaming

1. Easy to use
2. Fault tolerant
3. Integrates easily with the Spark ecosystem

1.3. Spark vs. Storm

- Development language: Spark is written in Scala; Storm is written in Clojure.
- Programming model: Spark Streaming uses the DStream; Storm uses Spouts and Bolts.

II. DStream

1. What is a DStream

Discretized Stream (DStream) is the basic abstraction of Spark Streaming. It represents a continuous stream of data: either the input stream itself or the result stream produced by applying Spark primitives to it. Internally, a DStream is represented as a sequence of consecutive RDDs, each containing the data from one batch interval, as shown in the figure below.

Operations on the data are performed at the granularity of these RDDs, and the actual computation is carried out by the Spark engine.
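The "sequence of RDDs" picture can be sketched in a few lines of plain Python. This is an illustrative simulation, not the Spark API: each "RDD" is an ordinary list, the DStream is a list of such batches, and the hypothetical helper `map_dstream` mimics how a DStream operation is applied to every batch in turn.

```python
# A DStream pictured as a sequence of RDDs, one per batch interval.
# Each "RDD" is modelled here as a plain Python list.
dstream = [
    ["a", "b", "a"],  # batch at t0
    ["b", "c"],       # batch at t1
    ["a"],            # batch at t2
]

def map_dstream(stream, func):
    """Apply func to every element of every batch, like DStream.map."""
    return [[func(x) for x in batch] for batch in stream]

upper = map_dstream(dstream, str.upper)
print(upper)  # [['A', 'B', 'A'], ['B', 'C'], ['A']]
```

The key point the sketch makes: a DStream operation is not applied to one unbounded stream, but repeatedly to each finite RDD as it arrives.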

2. DStream Operations

The primitives available on DStreams are similar to those on RDDs and fall into two classes: Transformations and Output Operations. Among the transformations there are also some special primitives, such as updateStateByKey(), transform(), and the various window-related operations.

2.1. Transformations on DStreams

map(func)
    Return a new DStream by passing each element of the source DStream through a function func.

flatMap(func)
    Similar to map, but each input item can be mapped to 0 or more output items.

filter(func)
    Return a new DStream by selecting only the records of the source DStream on which func returns true.

repartition(numPartitions)
    Changes the level of parallelism in this DStream by creating more or fewer partitions.

union(otherStream)
    Return a new DStream that contains the union of the elements in the source DStream and otherDStream.

count()
    Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.

reduce(func)
    Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.

countByValue()
    When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.

reduceByKey(func, [numTasks])
    When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: by default, this uses Spark's default number of parallel tasks (2 in local mode; in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.

join(otherStream, [numTasks])
    When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.

cogroup(otherStream, [numTasks])
    When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.

transform(func)
    Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.

updateStateByKey(func)
    Return a new "state" DStream where the state for each key is updated by applying the given function to the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
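The per-batch semantics of the core transformations can be sketched in pure Python (again a simulation, not the Spark API; `flat_map` and `reduce_by_key` are illustrative stand-ins). The sketch runs the classic word count over two micro-batches: note that reduceByKey aggregates within each batch independently.

```python
def flat_map(stream, func):
    """Like DStream.flatMap: each input item maps to zero or more outputs."""
    return [[y for x in batch for y in func(x)] for batch in stream]

def reduce_by_key(stream, func):
    """Like DStream.reduceByKey: aggregate values per key within each batch."""
    out = []
    for batch in stream:
        acc = {}
        for k, v in batch:
            acc[k] = func(acc[k], v) if k in acc else v
        out.append(sorted(acc.items()))
    return out

lines = [["hello world", "hello spark"], ["spark streaming"]]
words = flat_map(lines, str.split)                 # split lines into words
pairs = [[(w, 1) for w in batch] for batch in words]  # word -> (word, 1)
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)
# [[('hello', 2), ('spark', 1), ('world', 1)], [('spark', 1), ('streaming', 1)]]
```

Because each batch is reduced separately, "spark" is counted once in each batch rather than twice overall; carrying counts across batches is exactly what updateStateByKey (below) is for.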


Special Transformations

1. UpdateStateByKey Operation

The updateStateByKey primitive maintains history across batches; the Word Count example mentioned in the source material relies on this feature. Without updateStateByKey, the results of each batch are discarded once they have been computed and emitted.
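The stateful behaviour can be sketched as follows. This is a pure-Python simulation of the semantics, with the same shape of update function Spark expects (new values for a key plus the previous state); `update_state_by_key` and `running_count` are illustrative names, not Spark calls.

```python
def update_state_by_key(stream, update_func):
    """Carry state across batches: for each key, combine the batch's new
    values with the key's previous state, like updateStateByKey."""
    state = {}
    history = []
    for batch in stream:
        grouped = {}
        for k, v in batch:
            grouped.setdefault(k, []).append(v)
        for k in set(state) | set(grouped):
            state[k] = update_func(grouped.get(k, []), state.get(k))
        history.append(dict(state))  # snapshot after this batch
    return history

def running_count(new_values, old_state):
    # old_state is None the first time a key is seen
    return (old_state or 0) + sum(new_values)

batches = [[("a", 1), ("b", 1)], [("a", 1)], [("b", 1), ("b", 1)]]
snapshots = update_state_by_key(batches, running_count)
print(snapshots)  # cumulative counts after each batch
```

After the three batches the snapshots are {a:1, b:1}, then {a:2, b:1}, then {a:2, b:3}: counts accumulate instead of resetting per batch.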

2. Transform Operation

The transform primitive allows an arbitrary RDD-to-RDD function to be applied to a DStream, which makes it easy to use RDD operations that are not exposed in the DStream API. This is also the mechanism through which DStreams are combined with MLlib (machine learning) and GraphX.
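A sketch of the idea, in the same pure-Python simulation style (not the Spark API): `transform` hands each batch to an arbitrary user function, here a hypothetical blacklist filter that plain DStream primitives would not express as directly against external data.

```python
def transform(stream, rdd_func):
    """Like DStream.transform: apply an arbitrary batch-level function
    to each "RDD" (here, a list), one batch at a time."""
    return [rdd_func(batch) for batch in stream]

blacklist = {"spam"}  # illustrative static dataset joined against each batch
clean = transform(
    [["news", "spam", "sports"], ["spam"]],
    lambda batch: [w for w in batch if w not in blacklist],
)
print(clean)  # [['news', 'sports'], []]
```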

3. Window Operations

Window operations are somewhat similar to state in Storm: by setting a window length and a slide interval, you can dynamically compute over the current state of the stream.
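Window semantics can be sketched the same way. This simulation expresses the window length and slide interval in units of batches, ignores the partial windows Spark would emit at the start of the stream, and uses the illustrative name `window_count` (roughly in the spirit of countByWindow, not a real API call).

```python
def window_count(stream, window_length, slide_interval):
    """Count elements over a sliding window of batches."""
    results = []
    for end in range(window_length - 1, len(stream), slide_interval):
        window = stream[end - window_length + 1 : end + 1]
        results.append(sum(len(batch) for batch in window))
    return results

batches = [[1, 2], [3], [4, 5, 6], [7]]
# window of 2 batches, sliding forward 1 batch at a time:
print(window_count(batches, 2, 1))  # [3, 4, 4]
```

Each result covers the last two batches, so consecutive windows overlap by one batch; a slide interval equal to the window length would give non-overlapping (tumbling) windows instead.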

2.2. Output Operations on DStreams

Output operations write a DStream's data to an external database or file system. Only when an output operation is invoked (analogous to an RDD action) does the streaming program actually start the computation.

print()
    Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.

saveAsTextFiles(prefix, [suffix])
    Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".

saveAsObjectFiles(prefix, [suffix])
    Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".

saveAsHadoopFiles(prefix, [suffix])
    Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".

foreachRDD(func)
    The most generic output operator, which applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that force the computation of the streaming RDDs.
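The shape of foreachRDD can be sketched with the same pure-Python simulation (not the Spark API): a side-effecting function runs once per batch, and a plain list stands in for the external system a real job would write to.

```python
def foreach_rdd(stream, func):
    """Like DStream.foreachRDD: run a side-effecting function on each
    batch, e.g. pushing its records to an external store."""
    for batch in stream:
        func(batch)

sink = []  # stands in for an external database or file system
foreach_rdd([["a", "b"], ["c"]], lambda batch: sink.extend(batch))
print(sink)  # ['a', 'b', 'c']
```

Unlike the transformations above, nothing is returned: the value of an output operation is entirely in its side effects, which is why invoking one is what triggers the computation.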

