Spark Streaming in Practice (Part 1)

Published: 2024/1/23

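Contents of this section

Parts of this section are drawn from the official documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#mllib-operations

Introduction to Spark stream processing

Core classes in Spark Streaming

Introductory examples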

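1. Introduction to Spark Stream Processing

Hadoop MapReduce and Spark SQL are batch-oriented and can only perform offline computation, so they cannot serve workloads with strict latency requirements such as real-time recommendation or real-time website performance analysis. Stream processing addresses these cases. Three stream processing frameworks are in common use today: Storm, Spark Streaming, and Samza. For a comparison of the frameworks and how they are used, see http://www.csdn.net/article/2015-03-09/2824135. This section focuses on Spark Streaming. As one of Spark's core components, it natively supports ingesting data from many kinds of sources, and it can be combined with Spark MLlib and GraphX, which makes it straightforward to build online machine learning algorithms in a distributed environment. [Figure in the original post: input data sources and output destinations supported by Spark Streaming.]

Later examples cover these inputs and outputs. The "Spark Streaming" box in the middle of that figure processes the incoming data and emits the results. [Figure in the original post: how Spark Streaming works internally.]

Spark Streaming receives a live input data stream and divides it into batches. Each batch is handed to the Spark engine for processing, and the results are then written out to external systems such as file systems, databases, or dashboards.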
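The batching idea can be illustrated without Spark at all. The following is a conceptual sketch only: plain Scala collections stand in for a live stream, and the names Event, MicroBatchSketch, and toBatches are hypothetical, not part of the Spark API. It assumes each event carries a timestamp in milliseconds and assigns it to the batch interval it falls into.

```scala
// Conceptual sketch: NOT how Spark implements batching internally.
// Event and toBatches are hypothetical names used for illustration.
final case class Event(timestampMs: Long, payload: String)

object MicroBatchSketch {
  /** Groups events into consecutive batches of `batchIntervalMs` milliseconds.
    * Returns (batch start time, payloads in arrival order) sorted by start time. */
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs) // batch index of each event
      .toSeq
      .sortBy(_._1)
      .map { case (index, batch) =>
        (index * batchIntervalMs, batch.sortBy(_.timestampMs).map(_.payload))
      }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "a"), Event(900, "b"), Event(1200, "c"), Event(2500, "d"))
    // With a 1000 ms interval, "a" and "b" share a batch; "c" and "d" get their own.
    toBatches(events, batchIntervalMs = 1000).foreach { case (start, payloads) =>
      println(s"batch starting at $start ms: ${payloads.mkString(", ")}")
    }
  }
}
```

In real Spark Streaming the batch interval is the Duration passed to the StreamingContext constructor, and each batch becomes an RDD rather than a Seq.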

The following Spark Streaming word count is a good first look at stream processing:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("Usage: StreamingWordCount <directory>")
      System.exit(1)
    }

    // Create the SparkConf object
    val sparkConf = new SparkConf().setAppName("HdfsWordCount").setMaster("local[2]")
    // Create the StreamingContext, which talks to the cluster; batch interval is 20 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(20))

    // Create the FileInputDStream on the directory and use the
    // stream to count words in new files created there
    val lines = ssc.textFileStream(args(0))
    // Split each line into words
    val words = lines.flatMap(_.split(" "))
    // Count the occurrences of each word
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    // Print the result
    wordCounts.print()

    // Start Spark Streaming
    ssc.start()
    // Keep running until stopped manually
    ssc.awaitTermination()
  }
}

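After starting the program, copy files into the monitored directory from the command line. [Screenshot in the original post: copying files into the directory.]

At run time the program selects files by their timestamp: files created after the previous batch ran are processed. [Screenshot in the original post: program output.]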

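2. Core Classes in Spark Streaming

1. DStream (discretized stream)

DStream is the abstraction Spark Streaming provides for a data stream. It can be created from sources such as Kafka and Flume, mentioned above. Internally, a DStream is a sequence of RDDs, and each RDD holds the data that arrived during one time interval. [Figure in the original post: a DStream as a sequence of RDDs, one per interval.]

Every operation on a DStream is ultimately translated into operations on its RDDs. In the StreamingWordCount program above, for example, the flatMap operation is applied to every RDD in the DStream. [Figure in the original post: flatMap applied to each RDD of the lines DStream to produce the words DStream.]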
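One way to see this correspondence is to write the same step twice: once as a DStream operation and once with transform, which exposes each underlying RDD. This is a minimal sketch, assuming Spark Streaming is on the classpath; the object name TransformSketch and the helper tokenize are hypothetical, and a queue of RDDs stands in for a live source so that no external service is needed.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object TransformSketch {
  // Hypothetical helper: splits a line on single spaces.
  def tokenize(line: String): Seq[String] = line.split(" ").toSeq

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TransformSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Each RDD in the queue is consumed as one batch of the stream.
    val queue = mutable.Queue[RDD[String]](
      ssc.sparkContext.makeRDD(Seq("hello world", "hello spark")),
      ssc.sparkContext.makeRDD(Seq("spark streaming")))
    val lines = ssc.queueStream(queue)

    // DStream-level operation ...
    val wordsViaDStream = lines.flatMap(line => tokenize(line))
    // ... and the same operation spelled out against each underlying RDD.
    val wordsViaRdd = lines.transform((rdd: RDD[String]) => rdd.flatMap(line => tokenize(line)))

    wordsViaDStream.print()
    wordsViaRdd.print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run
    ssc.stop(stopSparkContext = true, stopGracefully = false)
  }
}
```

Both streams should print the same words for every batch, since flatMap on a DStream is applied RDD by RDD.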

2. StreamingContext
In Spark Streaming, StreamingContext is the entry point of the whole program. It can be created in several ways; the most common is from a SparkConf:

import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setAppName(appName).setMaster(master)
val ssc = new StreamingContext(conf, Seconds(1))

Creating a StreamingContext this way also creates a SparkContext from the SparkConf:

/**
 * Create a StreamingContext by providing the configuration necessary for a new SparkContext.
 * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
 * @param batchDuration the time interval at which streaming data will be divided into batches
 */
def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}

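In other words, StreamingContext wraps a SparkContext. It has several other constructors as well, which a later source-code walkthrough covers in detail. The batchDuration argument sets the batch interval and should be chosen according to the application and the available cluster resources. Once the StreamingContext has been created, proceed as follows:

Create an InputDStream from an input source.

Apply transformations and output operations to the DStream; together these define the logic of the streaming computation.

Call streamingContext.start() to begin receiving and processing data.

Call streamingContext.awaitTermination() to wait for processing to end (stopped manually or because of an error).

Alternatively, call streamingContext.stop() to end the program.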
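The steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than a recommended template: the object name LifecycleSketch and the helper bucket are hypothetical, a queue of RDDs replaces a real source, and the 10-second timeout is an arbitrary choice so that the program ends on its own.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object LifecycleSketch {
  // Hypothetical helper: keys a number by its last digit.
  def bucket(n: Int): (Int, Int) = (n % 10, 1)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LifecycleSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Step 1: create an input DStream (here from a queue of RDDs).
    val queue = mutable.Queue[RDD[Int]](ssc.sparkContext.makeRDD(1 to 100))
    val numbers = ssc.queueStream(queue)

    // Step 2: declare transformations and at least one output operation.
    numbers.map(n => bucket(n)).reduceByKey(_ + _).print()

    // Step 3: start receiving and processing.
    ssc.start()

    // Step 4: block until the context stops, fails, or 10 seconds have passed.
    val terminated = ssc.awaitTerminationOrTimeout(10000)

    // Step 5: stop explicitly, letting batches already received finish first.
    if (!terminated) ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```

Using awaitTerminationOrTimeout instead of awaitTermination is a choice made here only so the sketch terminates; a long-running job would normally block on awaitTermination.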

A few points about StreamingContext are worth noting:

1. Once a StreamingContext has been started, operations added afterwards have no effect. All computation logic must be defined before the context is started.
2. Once a StreamingContext has been stopped, it cannot be restarted. To run the computation again, the whole program has to be run again.
3. Only one StreamingContext can be active in a JVM at any given time.
4. Calling stop on a StreamingContext also stops the SparkContext. To keep the SparkContext alive when the StreamingContext shuts down, pass stopSparkContext = false to stop:

/**
 * Stop the execution of the streams immediately (does not wait for all received data
 * to be processed). By default, if stopSparkContext is not specified, the underlying
 * SparkContext will also be stopped. This implicit behavior can be configured using the
 * SparkConf configuration spark.streaming.stopSparkContextByDefault.
 *
 * @param stopSparkContext If true, stops the associated SparkContext. The underlying SparkContext
 *                         will be stopped regardless of whether this StreamingContext has been
 *                         started.
 */
def stop(
    stopSparkContext: Boolean = conf.getBoolean("spark.streaming.stopSparkContextByDefault", true)
): Unit = synchronized {
  stop(stopSparkContext, false)
}

5. A SparkContext can be reused by multiple StreamingContexts, provided the previous StreamingContext is stopped before the next one is created.
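Points 4 and 5 can be combined into a small experiment: run two StreamingContexts one after the other on a single SparkContext, stopping each with stopSparkContext = false. The sketch below assumes Spark Streaming is on the classpath; ReuseSparkContextSketch and runOnce are hypothetical names, and the timeout of three batch intervals is an arbitrary value chosen so that at least one batch is processed.

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object ReuseSparkContextSketch {
  /** Runs a short-lived StreamingContext on `sc` and returns how many elements it saw.
    * The SparkContext is left running so the caller can reuse it. */
  def runOnce(sc: SparkContext, batchSeconds: Int, data: Seq[Int]): Long = {
    val ssc = new StreamingContext(sc, Seconds(batchSeconds))
    val seen = new AtomicLong(0)
    val queue = mutable.Queue[RDD[Int]](sc.makeRDD(data))

    // foreachRDD bodies run on the driver, so a driver-side counter works here.
    ssc.queueStream(queue).foreachRDD { (rdd: RDD[Int]) =>
      seen.addAndGet(rdd.count())
      ()
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(batchSeconds * 3000L)
    // Keep the SparkContext alive; wait for pending batches before returning.
    ssc.stop(stopSparkContext = false, stopGracefully = true)
    seen.get()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ReuseSketch").setMaster("local[2]"))
    println(s"first context saw ${runOnce(sc, 1, 1 to 10)} elements")
    // The first StreamingContext is stopped, so a second one may now be created on `sc`.
    println(s"second context saw ${runOnce(sc, 1, 1 to 20)} elements")
    sc.stop()
  }
}
```

Creating the second StreamingContext while the first is still active would violate point 3, which is why runOnce stops its context before returning.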

3. InputDStreams and Receivers
An InputDStream is the stream of input data received from a streaming source. In the StreamingWordCount program above, val lines = ssc.textFileStream(args(0)) is one example. Every input DStream except a file stream is associated with a Receiver object, which receives the data from the source and stores it in memory for Spark to process later.

Spark Streaming provides two categories of built-in streaming sources:

1. Basic sources. These are created directly through the StreamingContext API, for example file systems (local and distributed), socket connections, and Akka actors.
File streams are created with:
a. streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
b. streamingContext.textFileStream(dataDirectory)
In fact, textFileStream itself ends up calling fileStream:
def textFileStream(directory: String): DStream[String] = withNamedScope("text file stream") {
  fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
}

Streams based on Akka actors are created with:
streamingContext.actorStream(actorProps, actor-name)

Streams based on sockets are created with:
ssc.socketTextStream(hostname: String, port: Int, storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2)

Streams based on a queue of RDDs are created with:
streamingContext.queueStream(queueOfRDDs)

2. Advanced sources. Sources such as Kafka, Flume, Kinesis, and Twitter are available through extra utility classes and need external dependencies at run time (covered in the next section).

3. Custom sources. Spark Streaming also supports user-defined sources, which require the user to implement a receiver. This is also covered in the next section.

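Finally, two points to keep in mind:

When running Spark Streaming locally, do not use "local" or "local[1]" as the master URL. An input DStream that is backed by a receiver (sockets, Kafka, Flume, and so on) needs one thread just to run the receiver, which would leave no thread to process the received data. Use "local[n]" instead, with n greater than the number of receivers.

When running Spark Streaming on a cluster, the number of CPU cores allocated to the application must likewise be greater than the number of receivers; otherwise the system will receive data but be unable to process it.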
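The thread-count rule can be made concrete with two socket receivers. This is a minimal sketch, assuming two text servers are listening on localhost ports 9999 and 10000 (for example two nc -lk sessions); the host, the ports, the object name TwoReceiversSketch, and the helper localMaster are all assumptions made for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TwoReceiversSketch {
  /** Smallest local master URL that still leaves one thread for processing. */
  def localMaster(numReceivers: Int): String = s"local[${numReceivers + 1}]"

  def main(args: Array[String]): Unit = {
    val numReceivers = 2
    val conf = new SparkConf()
      .setAppName("TwoReceiversSketch")
      .setMaster(localMaster(numReceivers)) // "local[3]": 2 receiver threads + 1 worker
    val ssc = new StreamingContext(conf, Seconds(1))

    // One receiver per socket; each occupies a thread for as long as the job runs.
    val streams = (0 until numReceivers).map { i =>
      ssc.socketTextStream("localhost", 9999 + i) // assumed host and ports
    }

    // Merge the receiver streams and count words across both.
    val lines = ssc.union(streams)
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With "local[2]" in this setup both threads would be taken by the receivers, so batches would queue up without ever being processed.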

3. Introductory Examples

To make the results easier to read, set the log level to Level.WARN:

import org.apache.spark.Logging
import org.apache.log4j.{Level, Logger}

/** Utility functions for Spark Streaming examples. */
object StreamingExamples extends Logging {

  /** Set reasonable logging levels for streaming if the user has not configured log4j. */
  def setStreamingLogLevels(): Unit = {
    val log4jInitialized = Logger.getRootLogger.getAllAppenders.hasMoreElements
    if (!log4jInitialized) {
      // We first log something to initialize Spark's default logging, then we override the
      // logging level.
      logInfo("Setting log level to [WARN] for streaming example." +
        " To override add a custom log4j.properties to the classpath.")
      Logger.getRootLogger.setLevel(Level.WARN)
    }
  }
}

NetworkWordCount
Stream data from a socket

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    // Set the log level to Level.WARN
    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[4]")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    // This creates a SocketInputDStream that receives the data sent from ip:port.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

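Configure the run-time arguments (the hostname and port expected by the program). [Screenshot in the original post: run configuration.]

Usage

// Start the netcat server
root@sparkmaster:~/streaming# nc -lk 9999

Run the NetworkWordCount program, then type any text into the console where the netcat server is running:

root@sparkmaster:~/streaming# nc -lk 9999
Hello WORLD HELLO WORLD WORLD TEWST NIMA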

QueueStream
Stream data from a queue of RDDs

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable

object QueueStream {
  def main(args: Array[String]): Unit = {
    StreamingExamples.setStreamingLogLevels()
    val sparkConf = new SparkConf().setAppName("QueueStream").setMaster("local[4]")
    // Create the context
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create the queue through which RDDs can be pushed to
    // a QueueInputDStream
    val rddQueue = new mutable.SynchronizedQueue[RDD[Int]]()

    // Create the QueueInputDStream and use it to do some processing
    val inputStream = ssc.queueStream(rddQueue)
    // Process the RDDs taken from the queue
    val mappedStream = inputStream.map(x => (x % 10, 1))
    val reducedStream = mappedStream.reduceByKey(_ + _)
    // Print the result
    reducedStream.print()
    // Start the computation
    ssc.start()

    // Create and push some RDDs into the queue, one per second
    for (i <- 1 to 30) {
      rddQueue += ssc.sparkContext.makeRDD(1 to 3000, 10)
      Thread.sleep(1000)
    }
    // Stop the StreamingContext from within the program
    ssc.stop()
  }
}
