Spark Basics (10): Deduplication with Distinct

Published: 2023/12/3

1. Distinct

Extract every word from a text file, remove the duplicates, and print the result.

2. Maven configuration

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.mk</groupId>
    <artifactId>spark-test</artifactId>
    <version>1.0</version>

    <name>spark-test</name>
    <url>http://spark.mk.com</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.11.1</scala.version>
        <spark.version>2.4.4</spark.version>
        <hadoop.version>2.6.0</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.11</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <artifactId>maven-clean-plugin</artifactId>
                    <version>3.1.0</version>
                </plugin>
                <plugin>
                    <artifactId>maven-resources-plugin</artifactId>
                    <version>3.0.2</version>
                </plugin>
                <plugin>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.8.0</version>
                </plugin>
                <plugin>
                    <artifactId>maven-surefire-plugin</artifactId>
                    <version>2.22.1</version>
                </plugin>
                <plugin>
                    <artifactId>maven-jar-plugin</artifactId>
                    <version>3.0.2</version>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>
</project>

3. Code

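// DistinctApp.java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.SparkSession;

public class DistinctApp implements SparkConfInfo {

    public static void main(String[] args) {
        String filePath = "F:\\test\\log.txt";
        SparkSession sparkSession = new DistinctApp().getSparkConf("distinct");

        List<String> words = sparkSession.sparkContext()
                .textFile(filePath, 4)
                .toJavaRDD()
                // The brackets form a character class: split on any single whitespace
                // character or any of ( ) + ? , . 。 ' ’
                .flatMap(v -> Arrays.asList(v.split("[(\\s+)(\r?\n),.。'’]")).iterator())
                // Keep only tokens made of ASCII letters and hyphens
                // (this also drops the empty strings produced by the split)
                .filter(v -> v.matches("[a-zA-Z-]+"))
                .map(String::toLowerCase)
                .distinct()
                .collect();

        words.forEach(v -> System.out.println(v));
        sparkSession.stop();
    }
}

// SparkConfInfo.java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public interface SparkConfInfo {

    default SparkSession getSparkConf(String appName) {
        SparkConf sparkConf = new SparkConf();
        if (System.getProperty("os.name").toLowerCase().contains("win")) {
            // On Windows, run Spark locally with 4 threads
            sparkConf.setMaster("local[4]");
            System.out.println("Running Spark in local mode");
        } else {
            sparkConf.setMaster("spark://hadoop01:7077,hadoop02:7077,hadoop03:7077");
            // Local IP; it must be mutually reachable with the Spark cluster,
            // e.g. on the same LAN
            sparkConf.set("spark.driver.host", "192.168.150.1");
            // Path of the jar produced by the project build
            sparkConf.setJars(new String[] {".\\out\\artifacts\\spark_test\\spark-test.jar"});
        }
        SparkSession session = SparkSession.builder()
                .appName(appName)
                .config(sparkConf)
                .getOrCreate();
        return session;
    }
}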
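
To see what the split and filter steps do to individual tokens, the same logic can be tried without Spark. The following is only a sketch using java.util.stream rather than an RDD; the class name TokenizeSketch, the method distinctWords, and the sample string are made up for illustration. The regular expressions are the ones from the listing above, with the non-ASCII characters written as Unicode escapes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java sketch (no Spark): mirrors the split / filter / lower-case / distinct steps.
class TokenizeSketch {

    // Same character class as in the Spark job. \u3002 is the ideographic full stop
    // and \u2019 the right single quotation mark.
    static final String SPLIT_REGEX = "[(\\s+)(\r?\n),.\u3002'\u2019]";

    static List<String> distinctWords(String text) {
        return Arrays.stream(text.split(SPLIT_REGEX))
                // Drops empty tokens and anything containing non-letter characters
                .filter(v -> v.matches("[a-zA-Z-]+"))
                .map(String::toLowerCase)
                // Stream.distinct() keeps the first occurrence, in encounter order
                .distinct()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // \uFF0C is the full-width comma, which is NOT in the split character class
        String sample = "Data\uFF0C can be \uFF0Cingested. Spark\u2019s data";
        System.out.println(distinctWords(sample)); // prints [can, be, spark, s, data]
    }
}
```

Two effects are visible here. A token that touches a full-width comma is not split and then fails the letter filter, so it is discarded entirely. An apostrophe is a split character, so a possessive such as Spark’s yields the two tokens spark and s.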

File contents

Spark Streaming is an extension of the core Spark API that enables scalable,high-throughput, fault-tolerant stream processing of live 。data streams. Data， can be ，ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems,Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.This guide shows you how to start writing Spark Streaming programs with DStreams. You can write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2), all of which are presented in this guide. You will find tabs throughout this guide that let you choose between code snippets of different languages. databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.

Output

discretized stream created are dstream fault-tolerant guide sockets can writing between how python live complex operations sources window using map databases fact kinesis filesystems other enables algorithms applying throughout find spark different either a high-level all pushed tcp high-throughput in code scalable which introduced sequence an called rdds the scala this learning provides internally is processed apply tabs start snippets with will data abstraction as graph shows java streaming out input api from join presented reduce kafka s you represented extension that let or on many be such flume dstreams processing expressed to core finally languages of by like programs functions represents and choose continuous machine dashboards write streams
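
The words are not printed in the order they appear in the file, because distinct() redistributes the data across partitions before the result is collected. If a stable order is wanted, one option is to sort the collected list on the driver before printing. The sketch below assumes the result is small enough to sort in memory; the class name SortedOutputSketch and the helper sortedCopy are hypothetical, and the sample list is just the first few words of the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch: sort the words returned by collect() before printing them.
class SortedOutputSketch {

    // Sorts a copy so the list passed in is left untouched.
    static List<String> sortedCopy(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy); // natural (lexicographic) String order
        return copy;
    }

    public static void main(String[] args) {
        List<String> collected = Arrays.asList("discretized", "stream", "created", "are", "dstream");
        // prints: are, created, discretized, dstream, stream (one per line)
        sortedCopy(collected).forEach(System.out::println);
    }
}
```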
