Big Data Series: Reading and Writing External Databases with PySpark
Using MySQL and HBase as examples, this article briefly introduces how Spark reads and writes external databases through PyMySQL and the Hadoop API operators.
1. PySpark Reading and Writing MySQL
For MySQL environment setup, refer to "Database Series: MySQL Master-Slave Replication Cluster Deployment".
1.1 The PyMySQL and MySQLdb Modules
PyMySQL is a library for connecting to a MySQL server from Python 3.x, while Python 2 traditionally uses MySQLdb; PyMySQL supports Python 2 as well. Install the PyMySQL module with the following command:
pip install PyMySQL
Connecting to a MySQL database:
import pymysql

# Open a database connection
db = pymysql.connect(host="localhost", user="testuser", password="test123", database="TESTDB")
# Create a cursor object with the cursor() method
cursor = db.cursor()
# Execute a SQL query with the execute() method
cursor.execute("SELECT VERSION()")
# Fetch a single row with the fetchone() method
data = cursor.fetchone()
print("Database version : %s " % data)
# Close the database connection
db.close()
1.2 Writing Spark Data to MySQL
1) Start the MySQL service and check it
[root@tango-01 bin]# ./mysqld_safe &
[root@tango-01 bin]# 180814 15:50:02 mysqld_safe Logging to '/usr/local/mysql/data/error.log'.
180814 15:50:02 mysqld_safe Starting mysqld daemon with databases from /usr/local/mysql/data
[root@tango-01 bin]# ps -ef|grep mysql
2) Create the MySQL table
[root@tango-01 bin]# ./mysql -u root -proot
mysql> use test;
mysql> create table test_spark(id int(4),info char(8),name char(20),sex char(2));
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| test_spark |
+----------------+
2 rows in set (0.00 sec)
3) Write data into MySQL
Start the IPython notebook, then establish a MySQL connection and write the data:
from pyspark import SparkConf
import pymysql

rawData = ['1,info1,tango,F', '2,info2,zhangsan,M']
conn = pymysql.connect(user="root", passwd="xxxxxx", host="192.168.112.10", db="test", charset="utf8")
cursor = conn.cursor()
for i in range(len(rawData)):
    retData = rawData[i].split(',')
    id = retData[0]
    info = retData[1]
    name = retData[2]
    sex = retData[3]
    sql = "insert into test_spark(id,info,name,sex) values('%s','%s','%s','%s')" % (id, info, name, sex)
    cursor.execute(sql)
conn.commit()
conn.close()
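The string-formatted SQL above is fine for a demo but is open to SQL injection; a safer sketch, assuming the same connection and rawData before conn.close() is called, passes the values as parameters so PyMySQL handles the escaping:
# Parameterized insert: PyMySQL escapes the values itself
rows = [tuple(line.split(',')) for line in rawData]
sql = "insert into test_spark(id,info,name,sex) values(%s,%s,%s,%s)"
cursor.executemany(sql, rows)
conn.commit()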
Query the MySQL table to verify that the data was written.
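For example, from the mysql client started in step 2:
mysql> select * from test_spark;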
1.3 Reading MySQL Data with Spark
1) Download the mysql-connector-java driver and place it in the jars directory under the Spark installation
2) Run pyspark and execute the following statements
[root@tango-spark01 jars]# pyspark
>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> dataframe_mysql = sqlContext.read.format("jdbc").\
... options(url="jdbc:mysql://192.168.112.10:3306/test", driver="com.mysql.jdbc.Driver",
... dbtable="test_spark", user="root", password="xxxxxx").load()
>>> dataframe_mysql.show()
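On Spark 2.x the same read can also go through the SparkSession entry point (the spark object predefined in the pyspark shell); a sketch with the same connection options:
>>> dataframe_mysql = spark.read.format("jdbc").\
...     options(url="jdbc:mysql://192.168.112.10:3306/test", driver="com.mysql.jdbc.Driver",
...     dbtable="test_spark", user="root", password="xxxxxx").load()
>>> dataframe_mysql.show()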
2. PySpark Reading and Writing HBase
For HBase environment setup, refer to "Big Data Series: HBase Cluster Environment Deployment". The versions used here are HBase 1.2.6, Hadoop 2.9.0, and Spark 2.3.0. Note: newer HBase releases such as 2.1.0 run into NoSuchMethodError interface problems.
2.1 Spark Modules for Reading and Writing HBase
1) The saveAsNewAPIHadoopDataset operator
The Spark operator saveAsNewAPIHadoopDataset outputs an RDD to any Hadoop-supported storage system using the new Hadoop API, taking a Hadoop Configuration object for that storage system. Its parameters are as follows:
saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
- conf: the HBase configuration
- keyConverter: converter class for the output key
- valueConverter: converter class for the output value
2) The newAPIHadoopRDD operator
Reads data using the new Hadoop API. Its parameters are as follows:
newAPIHadoopRDD(inputFormatClass, keyClass, valueClass, keyConverter=None, valueConverter=None, conf=None, batchSize=0)
- inputFormatClass: name of the Hadoop InputFormat class
- keyClass: name of the key Writable class
- valueClass: name of the value Writable class
- keyConverter: converter class for the input key
- valueConverter: converter class for the input value
- conf: the HBase configuration
- batchSize: the number of Python objects represented as a single Java object; the default of 0 chooses the batch size automatically
2.2 Writing Spark Data to HBase
1) Start the HBase service
[root@tango-spark01 hbase-2.1.0]# ./bin/start-hbase.sh
Check for the HMaster and HRegionServer processes with jps on the master and slave servers:
[root@tango-spark01 logs]# jps
1859 ResourceManager
1493 NameNode
4249 HMaster
5578 Jps
1695 SecondaryNameNode
[root@tango-spark02 conf]# jps
1767 NodeManager
3880 HRegionServer
1627 DataNode
4814 Jps
Note: the ZooKeeper cluster and the Hadoop cluster must be running before HBase is started.
2) Create the HBase table
hbase(main):027:0> create 'spark_hbase','userinfo'
Created table spark_hbase
Took 2.6556 seconds
=> Hbase::Table - spark_hbase
hbase(main):028:0> put 'spark_hbase','2018001','userinfo:name','zhangsan'
Took 0.0426 seconds
hbase(main):029:0> put 'spark_hbase','2018001','userinfo:age','16'
Took 0.0079 seconds
hbase(main):030:0> put 'spark_hbase','2018001','userinfo:sex','M'
3) Configure Spark
Spark 2.0 and later lacks the jar that converts HBase data into a form Python can read, so it has to be downloaded separately from https://mvnrepository.com/artifact/org.apache.spark/spark-examples_2.11/1.6.0-typesafe-001
Upload the jar into the Spark jars directory
/usr/local/spark/spark-2.3.0/jars
[root@tango-spark01 jars]# mkdir hbase
[root@tango-spark01 jars]# cd hbase
[root@tango-spark01 hbase]# ls
spark-examples_2.11-1.6.0-typesafe-001.jar
Edit spark-env.sh and add the following:
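The exact lines are not preserved here; a typical entry, assuming the /usr/local/spark/spark-2.3.0/jars/hbase directory created above, would be:
export SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:/usr/local/spark/spark-2.3.0/jars/hbase/*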
Copy the required jars from the HBase lib directory into the Spark directory
/usr/local/spark/hbase-2.1.0/lib
[root@tango-spark01 lib]# cp -f hbase-* /usr/local/spark/spark-2.3.0/jars/hbase/
[root@tango-spark01 lib]# cp -f guava-11.0.2.jar /usr/local/spark/spark-2.3.0/jars/hbase/
[root@tango-spark01 lib]# cp -f htrace-core-3.1.0-incubating.jar /usr/local/spark/spark-2.3.0/jars/hbase/
[root@tango-spark01 lib]# cp -f protobuf-java-2.5.0.jar /usr/local/spark/spark-2.3.0/jars/hbase/
Restart HBase
[root@tango-spark01 hbase-2.1.0]# ./bin/start-hbase.sh
4) Write data into HBase
Start the IPython notebook, then initialize the configuration:
table = "spark_hbase"
zk_host = "192.168.112.101"  # ZooKeeper quorum host (same host as in section 2.3)
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
conf = {"hbase.zookeeper.quorum": zk_host, "hbase.mapred.outputtable": table,
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
Initialize the data and convert it into an RDD:
# The rawData values below are assumed for illustration (the original values
# are not preserved); each entry is 'rowkey,columnFamily,qualifier,value',
# which is the list layout StringListToPutConverter expects.
rawData = ['2018002,userinfo,name,lisi', '2018002,userinfo,age,18', '2018002,userinfo,sex,F']
print(rawData)
rddRow = sc.parallelize(rawData).map(lambda x: (x[0:7], x.split(',')))
rddRow.take(5)
Call saveAsNewAPIHadoopDataset to write the RDD into HBase.
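Based on the parameter description in section 2.1, with the conf, keyConv, and valueConv defined above, the write call is:
rddRow.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)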
Query the table in HBase to confirm that the rows were inserted.
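For example, from the HBase shell:
hbase(main):031:0> scan 'spark_hbase'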
2.3 Reading HBase Data with Spark
Spark reads HBase data with the newAPIHadoopRDD operator.
1) Initialize the configuration
host = '192.168.112.101'
table = 'spark_hbase'
conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
2) Call newAPIHadoopRDD to read the HBase data
hbase_rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
                               "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                               "org.apache.hadoop.hbase.client.Result",
                               keyConverter=keyConv, valueConverter=valueConv, conf=conf)
count = hbase_rdd.count()
hbase_rdd.cache()
output = hbase_rdd.collect()
for (k, v) in output:
    print(k, v)
Each (k, v) pair printed above is a row key together with the string-converted cells of that row.
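With HBaseResultToStringConverter, the value for each row is a single string with one line per HBase cell; a post-processing sketch, assuming that newline-separated format:
# Split each row's multi-cell value string into one record per cell
# (assumes HBaseResultToStringConverter joins cells with "\n")
cells = hbase_rdd.flatMap(lambda kv: kv[1].split("\n"))
for cell in cells.collect():
    print(cell)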
References
http://spark.apache.org/docs/latest/api/python/pyspark.html
Database Series: MySQL Master-Slave Replication Cluster Deployment
Big Data Series: HBase Cluster Environment Deployment