spark-streaming first insight
1. Overview
Spark Streaming is built on top of the Spark core API and provides scalable, high-throughput, fault-tolerant stream processing.
1) It supports multiple data sources, such as Kafka, Flume, sockets, and files;
- Basic sources: sources directly available in the StreamingContext API. Examples: file systems and socket connections.
- Advanced sources: sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies.
2) Processed data can be written out to Kafka, HDFS, local files, and other destinations;

DStream:
Spark Streaming provides a high-level abstraction over continuously arriving data, called a DStream (discretized stream).
It represents a continuous stream of data.
Internally, a DStream is represented by a continuous series of RDDs; each RDD in a DStream contains data from a certain interval.
Any operation applied on a DStream translates to operations on the underlying RDDs.
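A minimal sketch of how this looks in code, assuming Spark 2.3.x with the Scala API, a local master, and a text server on port 9999 (e.g. started with nc -lk 9999); the object and host names are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // One batch (one RDD of the DStream) is produced every 5 seconds.
    val conf = new SparkConf().setMaster("local[2]").setAppName("SocketWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Basic source: a socket connection.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each DStream operation below is applied to every underlying RDD.
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}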
What is an RDD?
RDD is short for Resilient Distributed Dataset, the most important concept in Spark.
An RDD is a read-only, partitioned, fault-tolerant collection of data.

What does "resilient" mean?
An RDD can move freely between memory and disk.
An RDD can be transformed into other RDDs and can be produced from other RDDs.
An RDD can store data of any type.
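A small sketch of these properties with the core RDD API, assuming an existing SparkContext named sc (e.g. from spark-shell); the sample data is illustrative:

import org.apache.spark.storage.StorageLevel

// An RDD can hold data of any type; here, strings.
val lines = sc.parallelize(Seq("spark streaming", "spark core"))

// RDDs are read-only: transformations produce new RDDs from existing ones.
val words = lines.flatMap(_.split(" "))
val counts = words.map((_, 1)).reduceByKey(_ + _)

// "Resilient": data can be kept in memory and spilled to disk when needed,
// and a lost partition can be recomputed from the RDD's lineage.
counts.persist(StorageLevel.MEMORY_AND_DISK)
counts.collect().foreach(println)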
2. Basic concepts
1) Add the dependency
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.3.1</version>
</dependency>
Other related dependencies can be looked up here:
https://search.maven.org/search?q=g:org.apache.spark%20AND%20v:2.2.0
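For an sbt build, an equivalent declaration (a sketch for the same Spark 2.3.1 / Scala 2.11 combination) would be:

// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.1" % "provided"

// Advanced sources ship as separate artifacts, e.g. for Kafka:
// libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.3.1"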
2) How are files monitored when used as a DStream source? (A short sketch follows this list.)
1) All files must be in the same data format.
2) Files become part of the stream based on their modification time, not their creation time.
3) While a file is being processed, changes made to it are not picked up within the current window, i.e. the file is not re-read.
4) FileSystem.setTimes() can be called to change a file's timestamp so that it is processed in a later window, even if its contents have not been modified.
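A short sketch of a file-based DStream, assuming an existing StreamingContext named ssc; the directory path is illustrative:

// Only files whose modification time falls into the current window are picked up,
// and each file is read exactly once; later edits to an already-processed file are ignored.
val fileLines = ssc.textFileStream("hdfs:///data/incoming")
fileLines.count().print()

// To push an already-seen file into a later window without changing its contents,
// its modification time can be bumped via the Hadoop API (imports omitted):
// FileSystem.get(ssc.sparkContext.hadoopConfiguration)
//   .setTimes(new Path("hdfs:///data/incoming/old.log"), System.currentTimeMillis(), -1)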
3. Transform operations
Window operations

Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data.
Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream.
In other words, the RDDs that fall within one time window are merged into a single RDD for processing.
Any window operation needs to specify two parameters:
- window length: the duration of the window
- sliding interval: the interval at which the window operation is performed
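A sketch building on the counts DStream from the socket example above (batch interval 5 seconds; window length and sliding interval should be multiples of the batch interval):

// Word counts over the last 30 seconds of data, recomputed every 10 seconds.
val windowedCounts = counts.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // how values inside the window are combined
  Seconds(30),               // window length
  Seconds(10)                // sliding interval
)
windowedCounts.print()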
4. Output operations
Using foreachRDD

dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. However, it is important to understand how to use this primitive correctly and efficiently.
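A common pattern is to create any external connection once per partition on the executors, not per record and not on the driver. The sketch below assumes a hypothetical ConnectionPool helper; it is not part of Spark:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a hypothetical, lazily initialized pool living on the executor.
    // Creating a connection per record would be too expensive, and a connection
    // created on the driver would have to be serialized and shipped to the executors.
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record.toString))
    ConnectionPool.returnConnection(connection) // return to the pool for reuse
  }
}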
Checkpointing

Performance Tuning

Fault-tolerance Semantics

Reposted from: https://www.cnblogs.com/gm-201705/p/9533271.html