Spark SQL Quick Start Series (5): Accessing Hive from SparkSQL
Table of Contents
- Accessing Hive
- Integrating SparkSQL with Hive
- Accessing Hive Tables
- Connecting SparkSQL to Hive from IDEA
Accessing Hive

Overview

1. Integrate SparkSQL with Hive, using Hive's MetaStore as the metadata store
2. Query Hive tables with SparkSQL
3. Work through examples using common HiveSQL
4. Write data into a Hive table
Integrating SparkSQL with Hive

Overview

1. Run Hive's MetaStore as a standalone process
2. Connect SparkSQL to Hive's MetaStore
Unlike a mere file format, Hive is an external data store and query engine, so before Spark can access Hive, the two must first be integrated.

What needs to be integrated?

To work out how SparkSQL integrates with Hive, start from what Hive provides; whatever Hive has is what needs integrating:
- MetaStore, the metadata store
  SparkSQL ships with its own MetaStore, backed by the embedded Derby database, but for production you should use Hive's MetaStore instead: it is more mature and more capable, and it lets Spark reuse Hive's existing metadata.
- Query engine
  SparkSQL has built-in support for HiveSQL, so no integration is needed on this side.
Why run Hive's MetaStore?

Hive's MetaStore is a component of Hive, a program Hive provides for storing and serving table metadata. Hive's overall architecture (shown as a diagram in the original; omitted here) boils down to three main components: HiveServer2 accepts query requests from external systems, for example over JDBC, and hands them to the Driver; the Driver first asks the MetaStore where the table's data is stored, then runs an MR job against HDFS and returns the result to the caller.

Hive's MetaStore therefore matters a great deal to SparkSQL: if SparkSQL can talk to Hive's MetaStore directly, it can in principle do everything Hive can, such as querying data through Hive tables.
Hive's MetaStore can run in three modes:

- Embedded Derby mode
  This one needs little discussion: it is for testing. An embedded database is unlikely to be used in production, both because it is less stable and because Derby accepts only a single connection, so there is no concurrency.
- Local mode
  Local and Remote both store the metadata in a MySQL database, but in Local mode the MetaStore has no process of its own; it lives inside the HiveServer2 process.
- Remote mode
  Like Local mode, the metadata is stored in MySQL, but the MetaStore runs in its own, independent process.

We clearly want Remote mode: the MetaStore must run on its own so that SparkSQL can reach it at any time.
Enabling the Hive MetaStore

Step 1: edit hive-site.xml
```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://Bigdata01:3306/metastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>000000</value>
        <description>password to use against metastore database</description>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
    </property>
    <property>
        <name>hive.metastore.local</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://Bigdata01:9083</value>
    </property>
</configuration>
```

Step 2: start the Hive MetaStore
```shell
nohup /opt/module/hive/bin/hive --service metastore >> /opt/module/hive/logs/log.log 2>&1 &
```

Integrating SparkSQL with Hive's MetaStore
Even without this integration, Spark has a built-in MetaStore that keeps its data in the embedded Derby database, but that setup is unsuitable for production: only one SparkSession can use it at a time. For production, Hive's MetaStore is the recommended choice.

Integrating SparkSQL with Hive's MetaStore boils down to two things: Spark must be able to reach the MetaStore, and it must be able to keep the warehouse in HDFS. The necessary settings live in the Hadoop and Hive configuration files, so the simplest approach is to copy those files into Spark's configuration directory:
```shell
cd /opt/module/hadoop/etc/hadoop
cp hive-site.xml core-site.xml hdfs-site.xml /opt/module/spark/conf/
scp -r /opt/module/spark/conf Bigdata02:`pwd`
scp -r /opt/module/spark/conf Bigdata03:`pwd`
```

Spark needs hive-site.xml so it can read Hive's configuration, chiefly the location of the metadata warehouse.

Spark needs core-site.xml so it can read security-related settings.

Spark needs hdfs-site.xml because table files may need to be placed in HDFS.
If you would rather not integrate Hive by copying files, you can instead specify the location of Hive's MetaStore when the SparkSession starts; copying the configuration files, however, is the recommended approach.
Accessing Hive Tables

Overview

1. Create a table in Hive
2. Query an existing Hive table from SparkSQL
3. Create a Hive table from SparkSQL
4. Modify the data in a Hive table from SparkSQL

Create a file named studenttab10k and populate it with sample data (only 150 rows are used here; the sample listing is omitted).
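The studenttab10k file is tab-separated, one record per line. As a quick, Spark-free sketch of the format (the `Student` case class and `parseLine` helper here are illustrative, not from the original code; the two sample rows are borrowed from the query output later in this post):

```scala
// Hypothetical minimal parser for the tab-separated studenttab10k format.
// Assumed column layout: name \t age \t gpa, matching the Hive DDL below.
case class Student(name: String, age: Int, gpa: Float)

def parseLine(line: String): Student = {
  val Array(name, age, gpa) = line.split("\t")
  Student(name, age.toInt, gpa.toFloat)
}

val sample = Seq(
  "ulysses thompson\t64\t1.90",
  "katie carson\t25\t3.65"
).map(parseLine)

sample.foreach(println)
```

This mirrors exactly what the `ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'` clause below tells Hive to do when reading the file.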
Creating the table in Hive

Step 1: upload the file to the cluster. The LOAD DATA statement below reads from /input/studenttab10k, so something like `hdfs dfs -put studenttab10k /input/` (adjust the path to your setup) puts it in HDFS.

Step 2: run the following SQL from Hive or Beeline:
```sql
CREATE DATABASE IF NOT EXISTS spark_integrition;
USE spark_integrition;

CREATE EXTERNAL TABLE student (
    name STRING,
    age  INT,
    gpa  string
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse';

LOAD DATA INPATH '/input/studenttab10k' OVERWRITE INTO TABLE student;
```

Querying Hive tables through SparkSQL
Hive tables can be queried directly through spark.sql(...), which goes through Hive's MetaStore, provided the Hive configuration files have first been copied into Spark's conf directory.
```
[root@Bigdata01 bin]# ./spark-shell --master local[6]
20/09/03 20:55:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://Bigdata01:4040
Spark context available as 'sc' (master = local[6], app id = local-1599137751998).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("use spark_integrition")
20/09/03 20:56:45 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from student limit 100")
res1: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]

scala> res1.show()
+-------------------+---+----+
|               name|age| gpa|
+-------------------+---+----+
|   ulysses thompson| 64|1.90|
|       katie carson| 25|3.65|
|          luke king| 65|0.73|
|     holly davidson| 57|2.43|
|        fred miller| 55|3.77|
|        holly white| 43|0.24|
|     luke steinbeck| 51|1.14|
|     nick underhill| 31|2.46|
|     holly davidson| 59|1.26|
|       calvin brown| 56|0.72|
|    rachel robinson| 62|2.25|
|         tom carson| 35|0.56|
|        tom johnson| 72|0.99|
|       irene garcia| 54|1.06|
|        oscar nixon| 39|3.60|
|        holly allen| 32|2.58|
|    oscar hernandez| 19|0.05|
|      alice ichabod| 65|2.25|
|     wendy thompson| 30|2.39|
|priscilla hernandez| 73|0.23|
+-------------------+---+----+
only showing top 20 rows
```

Creating Hive tables through SparkSQL
Through SparkSQL you can create Hive tables directly, and load data into them with LOAD DATA:
```
[root@Bigdata01 bin]# ./spark-shell --master local[6]
20/09/03 21:17:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://Bigdata01:4040
Spark context available as 'sc' (master = local[6], app id = local-1599139087222).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

val createTableStr =
  """
    |create EXTERNAL TABLE student
    |(
    |  name STRING,
    |  age INT,
    |  gpa string
    |)
    |ROW FORMAT DELIMITED
    |  FIELDS TERMINATED BY '\t'
    |  LINES TERMINATED BY '\n'
    |STORED AS TEXTFILE
    |LOCATION '/user/hive/warehouse'
  """.stripMargin

spark.sql("CREATE DATABASE IF NOT EXISTS spark_integrition1")
spark.sql("USE spark_integrition1")
spark.sql(createTableStr)
spark.sql("LOAD DATA INPATH '/input/studenttab10k' OVERWRITE INTO TABLE student")

// Exiting paste mode, now interpreting.

20/09/03 21:20:57 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/09/03 21:21:01 ERROR KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
createTableStr: String =
"
create EXTERNAL TABLE student
(
  name STRING,
  age INT,
  gpa string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse'
"
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from student limit 100")
res1: org.apache.spark.sql.DataFrame = [name: string, age: int ... 1 more field]

scala> res1.where('age > 50).show()
+-------------------+---+----+
|               name|age| gpa|
+-------------------+---+----+
|   ulysses thompson| 64|1.90|
|          luke king| 65|0.73|
|     holly davidson| 57|2.43|
|        fred miller| 55|3.77|
|     luke steinbeck| 51|1.14|
|     holly davidson| 59|1.26|
|       calvin brown| 56|0.72|
|    rachel robinson| 62|2.25|
|        tom johnson| 72|0.99|
|       irene garcia| 54|1.06|
|      alice ichabod| 65|2.25|
|priscilla hernandez| 73|0.23|
|gabriella van buren| 68|1.32|
|       yuri laertes| 60|1.16|
|     nick van buren| 68|1.75|
|        bob ichabod| 56|2.81|
|     zach steinbeck| 61|2.22|
|          fred polk| 66|3.69|
|        alice young| 75|0.31|
|         mike white| 57|0.69|
+-------------------+---+----+
only showing top 20 rows
```

SparkSQL currently supports the file formats sequencefile, rcfile, orc, parquet, textfile and avro, and a serde name can also be specified.
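For intuition, the `where('age > 50)` call above is just an ordinary row predicate. The same filter over a few of the sample rows can be sketched in plain Scala, with no Spark cluster involved (the tuples below merely mirror the (name, age, gpa) columns of the output above):

```scala
// Plain-Scala sketch of the predicate behind res1.where('age > 50).
// Each tuple stands in for one (name, age, gpa) row.
val rows = Seq(
  ("ulysses thompson", 64, 1.90f),
  ("katie carson",     25, 3.65f),
  ("luke king",        65, 0.73f),
  ("oscar hernandez",  19, 0.05f)
)

// Keep only the rows whose age column exceeds 50.
val over50 = rows.filter { case (_, age, _) => age > 50 }
over50.foreach { case (name, age, _) => println(s"$name ($age)") }
```

Spark applies the same predicate, but distributed over partitions and, as the job logs later show, pushed down to the file scan where possible.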
Connecting SparkSQL to Hive from IDEA

Processing data with SparkSQL and saving the result into a Hive table

So far we have accessed Hive and written SQL from spark-shell; a standalone Spark application can do the same, with a few prerequisites.

Step 1: add the Maven dependency

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.11</artifactId>
    <version>${spark.version}</version>
</dependency>
```

Step 2: configure the SparkSession
To access Hive from SparkSQL, three things are needed:

1. Enable Hive support on the SparkSession
   Only with this enabled will SparkSQL parse SQL statements as HiveSQL.
2. Set the warehouse location
   Although hive-site.xml configures the warehouse location, since Spark 2.0.0 the hive.metastore.warehouse.dir setting in hive-site.xml is deprecated, so the warehouse location must be set on the SparkSession.
3. Set the MetaStore location

```scala
val spark = SparkSession.builder()
  .appName("hive example")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")  // 1
  .config("hive.metastore.uris", "thrift://Bigdata01:9083")   // 2
  .enableHiveSupport()                                        // 3
  .getOrCreate()
```

1. Set the warehouse location
2. Set the MetaStore location
3. Enable Hive support
With that configured, you can process data with DataFrames and then push the results into a Hive table; when saving to the table, a save mode can be specified.
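The save modes map onto simple table-level behaviour. As a toy illustration only (this is not Spark's implementation; the object names merely mirror `org.apache.spark.sql.SaveMode`), the contract of Overwrite versus Append can be sketched against an in-memory "table":

```scala
// Toy model of save-mode semantics. A Vector[String] stands in for a table.
sealed trait Mode
case object Overwrite extends Mode // replace whatever the table held
case object Append    extends Mode // keep existing rows, add the new ones

def save(table: Vector[String], rows: Vector[String], mode: Mode): Vector[String] =
  mode match {
    case Overwrite => rows          // old contents discarded
    case Append    => table ++ rows // old contents kept
  }

val existing = Vector("old-row")
val incoming = Vector("new-1", "new-2")

println(save(existing, incoming, Overwrite)) // only the new rows survive
println(save(existing, incoming, Append))    // old row kept, new rows added
```

Spark's real SaveMode additionally offers ErrorIfExists (the default) and Ignore.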
The full code:

```scala
package com.spark.hive

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.types.{FloatType, IntegerType, StringType, StructField, StructType}

object HiveAccess {

  def main(args: Array[String]): Unit = {
    // 1. Create the SparkSession:
    //    - enable Hive support
    //    - point it at the MetaStore
    //    - set the warehouse location
    val spark = SparkSession.builder()
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport()
      .config("hive.metastore.uris", "thrift://Bigdata01:9083")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
      .getOrCreate()

    import spark.implicits._

    // 2. Read the data.
    //    The file goes into HDFS rather than onto a local disk: the job runs
    //    somewhere on the cluster and there is no telling on which machine,
    //    so a local file would have to be copied to every node, whereas HDFS,
    //    being an external system, is readable from every node.
    val schema = StructType(List(
      StructField("name", StringType),
      StructField("age", IntegerType),
      StructField("gpa", FloatType)
    ))

    val dataframe = spark.read
      .option("delimiter", "\t") // field separator
      .schema(schema)            // supply the schema explicitly
      .csv("hdfs:///input/studenttab10k")

    val resultDF = dataframe.where('age > 50)

    // 3. Write the result into Hive.
    resultDF.write.mode(SaveMode.Overwrite).saveAsTable("spark_integrition1.student")
  }
}
```

mode specifies the save mode, and saveAsTable writes the data into Hive.
Build the jar, put it in the spark directory, and rename it spark-sql.jar.

Submit it to the cluster (output like the following means the run succeeded):
```
[root@Bigdata01 spark]# bin/spark-submit --master spark://Bigdata01:7077 \
> --class com.spark.hive.HiveAccess \
> ./spark-sql.jar
20/09/03 22:28:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/09/03 22:28:55 INFO SparkContext: Running Spark version 2.4.6
20/09/03 22:28:55 INFO SparkContext: Submitted application: HiveAccess$
20/09/03 22:28:55 INFO SecurityManager: Changing view acls to: root
20/09/03 22:28:55 INFO SecurityManager: Changing modify acls to: root
20/09/03 22:28:55 INFO SecurityManager: Changing view acls groups to:
20/09/03 22:28:55 INFO SecurityManager: Changing modify acls groups to:
20/09/03 22:28:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/09/03 22:28:57 INFO Utils: Successfully started service 'sparkDriver' on port 40023.
20/09/03 22:28:57 INFO SparkEnv: Registering MapOutputTracker
20/09/03 22:28:57 INFO SparkEnv: Registering BlockManagerMaster
20/09/03 22:28:57 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/09/03 22:28:57 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/09/03 22:28:57 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-02e6d729-f8d9-4a26-a95d-3a019331e164
20/09/03 22:28:57 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/09/03 22:28:57 INFO SparkEnv: Registering OutputCommitCoordinator
20/09/03 22:28:58 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/09/03 22:28:58 INFO Utils: Successfully started service 'SparkUI' on port 4041.
20/09/03 22:28:58 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://Bigdata01:4041
20/09/03 22:28:59 INFO SparkContext: Added JAR file:/opt/module/spark/./spark-sql.jar at spark://Bigdata01:40023/jars/spark-sql.jar with timestamp 1599143339071
20/09/03 22:28:59 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://Bigdata01:7077...
20/09/03 22:29:00 INFO TransportClientFactory: Successfully created connection to Bigdata01/192.168.168.31:7077 after 331 ms (0 ms spent in bootstraps)
20/09/03 22:29:00 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20200903222900-0001
20/09/03 22:29:00 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200903222900-0001/0 on worker-20200903203039-192.168.168.31-54515 (192.168.168.31:54515) with 8 core(s)
20/09/03 22:29:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200903222900-0001/0 on hostPort 192.168.168.31:54515 with 8 core(s), 1024.0 MB RAM
20/09/03 22:29:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200903222900-0001/1 on worker-20200903203048-192.168.168.32-39304 (192.168.168.32:39304) with 6 core(s)
20/09/03 22:29:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200903222900-0001/1 on hostPort 192.168.168.32:39304 with 6 core(s), 1024.0 MB RAM
20/09/03 22:29:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200903222900-0001/2 on worker-20200903203050-192.168.168.33-35682 (192.168.168.33:35682) with 6 core(s)
20/09/03 22:29:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200903222900-0001/2 on hostPort 192.168.168.33:35682 with 6 core(s), 1024.0 MB RAM
20/09/03 22:29:01 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 58667.
20/09/03 22:29:01 INFO NettyBlockTransferService: Server created on Bigdata01:58667
20/09/03 22:29:01 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/09/03 22:29:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200903222900-0001/2 is now RUNNING
20/09/03 22:29:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200903222900-0001/0 is now RUNNING
20/09/03 22:29:01 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, Bigdata01, 58667, None)
20/09/03 22:29:01 INFO BlockManagerMasterEndpoint: Registering block manager Bigdata01:58667 with 366.3 MB RAM, BlockManagerId(driver, Bigdata01, 58667, None)
20/09/03 22:29:01 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, Bigdata01, 58667, None)
20/09/03 22:29:01 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, Bigdata01, 58667, None)
20/09/03 22:29:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200903222900-0001/1 is now RUNNING
20/09/03 22:29:12 INFO EventLoggingListener: Logging events to hdfs://Bigdata01:9000/spark_log/app-20200903222900-0001.lz4
20/09/03 22:29:13 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/09/03 22:29:14 INFO SharedState: loading hive config file: file:/opt/module/spark/conf/hive-site.xml
20/09/03 22:29:15 INFO SharedState: Setting hive.metastore.warehouse.dir ('/user/hive/warehouse') to the value of spark.sql.warehouse.dir ('/user/hive/warehouse').
20/09/03 22:29:15 INFO SharedState: Warehouse path is '/user/hive/warehouse'.
20/09/03 22:29:19 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.168.31:39316) with ID 0
20/09/03 22:29:19 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/09/03 22:29:24 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.168.31:44502 with 366.3 MB RAM, BlockManagerId(0, 192.168.168.31, 44502, None)
20/09/03 22:29:25 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.168.33:60974) with ID 2
20/09/03 22:29:25 INFO InMemoryFileIndex: It took 857 ms to list leaf files for 1 paths.
20/09/03 22:29:26 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.168.33:55821 with 366.3 MB RAM, BlockManagerId(2, 192.168.168.33, 55821, None)
20/09/03 22:29:32 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (192.168.168.32:35910) with ID 1
20/09/03 22:29:36 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
20/09/03 22:29:38 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.168.32:50317 with 366.3 MB RAM, BlockManagerId(1, 192.168.168.32, 50317, None)
20/09/03 22:29:39 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
20/09/03 22:29:40 INFO metastore: Trying to connect to metastore with URI thrift://Bigdata01:9083
20/09/03 22:29:40 INFO metastore: Connected to metastore.
20/09/03 22:29:43 INFO SessionState: Created local directory: /tmp/c21738d9-28fe-4780-a950-10d38e9e32ca_resources
20/09/03 22:29:43 INFO SessionState: Created HDFS directory: /tmp/hive/root/c21738d9-28fe-4780-a950-10d38e9e32ca
20/09/03 22:29:43 INFO SessionState: Created local directory: /tmp/root/c21738d9-28fe-4780-a950-10d38e9e32ca
20/09/03 22:29:43 INFO SessionState: Created HDFS directory: /tmp/hive/root/c21738d9-28fe-4780-a950-10d38e9e32ca/_tmp_space.db
20/09/03 22:29:43 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is /user/hive/warehouse
20/09/03 22:29:47 INFO FileSourceStrategy: Pruning directories with:
20/09/03 22:29:47 INFO FileSourceStrategy: Post-Scan Filters: isnotnull(age#1),(age#1 > 50)
20/09/03 22:29:47 INFO FileSourceStrategy: Output Data Schema: struct<name: string, age: int, gpa: float ... 1 more fields>
20/09/03 22:29:47 INFO FileSourceScanExec: Pushed Filters: IsNotNull(age),GreaterThan(age,50)
20/09/03 22:29:48 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
20/09/03 22:29:48 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
20/09/03 22:29:48 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
20/09/03 22:29:48 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
20/09/03 22:29:48 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
20/09/03 22:29:50 INFO CodeGenerator: Code generated in 1046.0442 ms
20/09/03 22:29:50 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 281.9 KB, free 366.0 MB)
20/09/03 22:29:51 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.2 KB, free 366.0 MB)
20/09/03 22:29:51 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on Bigdata01:58667 (size: 24.2 KB, free: 366.3 MB)
20/09/03 22:29:51 INFO SparkContext: Created broadcast 0 from saveAsTable at HiveAccess.scala:54
20/09/03 22:29:53 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
20/09/03 22:29:54 INFO SparkContext: Starting job: saveAsTable at HiveAccess.scala:54
20/09/03 22:29:54 INFO DAGScheduler: Got job 0 (saveAsTable at HiveAccess.scala:54) with 1 output partitions
20/09/03 22:29:54 INFO DAGScheduler: Final stage: ResultStage 0 (saveAsTable at HiveAccess.scala:54)
20/09/03 22:29:54 INFO DAGScheduler: Parents of final stage: List()
20/09/03 22:29:54 INFO DAGScheduler: Missing parents: List()
20/09/03 22:29:54 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at saveAsTable at HiveAccess.scala:54), which has no missing parents
20/09/03 22:29:55 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 153.1 KB, free 365.9 MB)
20/09/03 22:29:55 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 55.6 KB, free 365.8 MB)
20/09/03 22:29:55 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on Bigdata01:58667 (size: 55.6 KB, free: 366.2 MB)
20/09/03 22:29:55 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1163
20/09/03 22:29:55 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at saveAsTable at HiveAccess.scala:54) (first 15 tasks are for partitions Vector(0))
20/09/03 22:29:55 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/09/03 22:29:55 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 192.168.168.33, executor 2, partition 0, ANY, 8261 bytes)
20/09/03 22:29:57 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.168.33:55821 (size: 55.6 KB, free: 366.2 MB)
20/09/03 22:30:24 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.168.33:55821 (size: 24.2 KB, free: 366.2 MB)
20/09/03 22:30:28 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 32924 ms on 192.168.168.33 (executor 2) (1/1)
20/09/03 22:30:28 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/09/03 22:30:28 INFO DAGScheduler: ResultStage 0 (saveAsTable at HiveAccess.scala:54) finished in 34.088 s
20/09/03 22:30:28 INFO DAGScheduler: Job 0 finished: saveAsTable at HiveAccess.scala:54, took 34.592171 s
20/09/03 22:30:29 INFO FileFormatWriter: Write Job 3b048e0c-6b5e-43ea-aad2-b1e64f4d9657 committed.
20/09/03 22:30:29 INFO FileFormatWriter: Finished processing stats for write job 3b048e0c-6b5e-43ea-aad2-b1e64f4d9657.
20/09/03 22:30:30 INFO InMemoryFileIndex: It took 26 ms to list leaf files for 1 paths.
20/09/03 22:30:30 INFO HiveExternalCatalog: Persisting file based data source table `spark_integrition1`.`student` into Hive metastore in Hive compatible format.
20/09/03 22:30:32 INFO SparkContext: Invoking stop() from shutdown hook
20/09/03 22:30:32 INFO SparkUI: Stopped Spark web UI at http://Bigdata01:4041
20/09/03 22:30:32 INFO StandaloneSchedulerBackend: Shutting down all executors
20/09/03 22:30:32 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/09/03 22:30:32 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/09/03 22:30:32 INFO MemoryStore: MemoryStore cleared
20/09/03 22:30:32 INFO BlockManager: BlockManager stopped
20/09/03 22:30:32 INFO BlockManagerMaster: BlockManagerMaster stopped
20/09/03 22:30:32 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/09/03 22:30:32 INFO SparkContext: Successfully stopped SparkContext
20/09/03 22:30:32 INFO ShutdownHookManager: Shutdown hook called
20/09/03 22:30:32 INFO ShutdownHookManager: Deleting directory /tmp/spark-5d113d24-2e67-4d1c-a6aa-e75de128da16
20/09/03 22:30:32 INFO ShutdownHookManager: Deleting directory /tmp/spark-f4a4aed1-1746-4e87-9f62-bdaaf6eff438
```

Now query the table from the Hive CLI:
```
hive (spark_integrition1)> select * from student limit 10;
OK
student.name        student.age  student.gpa
ulysses thompson    64           1.9
luke king           65           0.73
holly davidson      57           2.43
fred miller         55           3.77
luke steinbeck      51           1.14
holly davidson      59           1.26
calvin brown        56           0.72
rachel robinson     62           2.25
tom johnson         72           0.99
irene garcia        54           1.06
Time taken: 0.245 seconds, Fetched: 10 row(s)
```
That's all for this post.