Spark data types
RDD
Creating an RDD
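A minimal sketch of the two usual ways to create an RDD: parallelizing a local collection and reading a text file. The local-mode `SparkSession` and the file path are illustrative, not part of the original notes.

```scala
import org.apache.spark.sql.SparkSession

object CreateRddExample {
  def main(args: Array[String]): Unit = {
    // Local-mode session for illustration; on a cluster this comes from spark-submit
    val spark = SparkSession.builder().master("local[*]").appName("create-rdd").getOrCreate()
    val sc = spark.sparkContext

    // 1) From an in-memory collection
    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
    println(nums.count())

    // 2) From a text file (placeholder path; one RDD element per line)
    // val lines = sc.textFile("hdfs:///path/to/input.txt")

    spark.stop()
  }
}
```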
RDD operations
- union
- intersection
- distinct
- groupByKey
- reduceByKey
- sortByKey
- join / leftOuterJoin / rightOuterJoin
- aggregate
- reduce
- count
- first
- take
- takeSample
- takeOrdered
- saveAsTextFile
- countByKey
- foreach
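The operations above split into transformations (union, reduceByKey, sortByKey, join, ...) and actions (reduce, count, first, take, ...). A hedged sketch exercising a few of them on a toy pair RDD (the data and names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

object RddOpsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rdd-ops").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // reduceByKey: merge values per key -> ("a", 4), ("b", 2)
    val sums = pairs.reduceByKey(_ + _)

    // sortByKey orders by key; collect is an action that pulls results to the driver
    val sorted = sums.sortByKey().collect()
    println(sorted.mkString(", "))

    // count / first / take are also actions
    println(pairs.count())

    // join matches two pair RDDs on their keys -> ("a", (4, "x"))
    val other = sc.parallelize(Seq(("a", "x")))
    println(sums.join(other).collect().mkString(", "))

    spark.stop()
  }
}
```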
DataFrame
DataSet
Differences between RDD, DataFrame and DataSet

| | RDD | DataFrame | DataSet |
| --- | --- | --- | --- |
| Difference 1 | does not support Spark SQL | supported | supported |
| Difference 2 | - | a DataFrame is a DataSet[Row] | - |
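In Scala Spark 2.x, `DataFrame` is literally a type alias for `Dataset[Row]`, which can be checked directly; this minimal sketch assumes a local-mode session:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object DataFrameIsDatasetRow {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("df-is-ds-row").getOrCreate()
    import spark.implicits._

    val df: DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "name")
    // Because DataFrame = Dataset[Row], this assignment compiles with no conversion
    val ds: Dataset[Row] = df
    println(ds.count())

    spark.stop()
  }
}
```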
Converting between RDD, DataFrame and DataSet
| From \ To | RDD | DataFrame | DataSet |
| --- | --- | --- | --- |
| RDD | - | val rdd = sc.textFile(""); case class Person(name: String, age: String); val a = rdd.map(_.split(",")).map(line => Person(line(0), line(1))).toDF | val rdd = sc.textFile(""); case class Person(name: String, age: String); val a = rdd.map(_.split(",")).map(line => Person(line(0), line(1))).toDS |
| DataFrame | val rdd1 = testDF.rdd | - | val testDS = testDF.as[Coltest] |
| DataSet | val rdd2 = testDS.rdd | val testDF = testDS.toDF | - |
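The conversions in the table can be chained end to end; the `Person` case class and the sample data below are assumptions for illustration. Note that `toDF` / `toDS` / `as[...]` all require `import spark.implicits._`:

```scala
import org.apache.spark.sql.SparkSession

object ConversionExample {
  // Case class defined outside main so Spark can derive an Encoder for it
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("conversions").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(Person("ann", 30), Person("bob", 25)))

    val df   = rdd.toDF()      // RDD       -> DataFrame
    val ds   = df.as[Person]   // DataFrame -> DataSet
    val back = ds.rdd          // DataSet   -> RDD

    println(back.count())
    spark.stop()
  }
}
```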
Data types
- LabeledPoint to Libsvm
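An RDD of `LabeledPoint` can be written out in libsvm format with `MLUtils.saveAsLibSVMFile` from the RDD-based mllib API; the sample points and output directory below are placeholders (the directory must not already exist):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

object LibsvmExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("libsvm").getOrCreate()
    val sc = spark.sparkContext

    // A LabeledPoint is a label plus a feature vector; libsvm stores it
    // as "label index:value index:value ..." with 1-based indices
    val points = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.5, 0.0, 2.0)),
      LabeledPoint(0.0, Vectors.sparse(3, Array(1), Array(1.5)))
    ))

    // Placeholder output directory; fails if it already exists
    MLUtils.saveAsLibSVMFile(points, "/tmp/libsvm-out")

    spark.stop()
  }
}
```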
Supported input file formats

json, parquet, jdbc, orc, libsvm, csv, text
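All of these formats go through the same `DataFrameReader` / `DataFrameWriter` interface. A sketch that writes and reads parquet so it runs without external files; the `/tmp` path and the commented-out reads are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object ReadWriteFormats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("formats").getOrCreate()
    import spark.implicits._

    val df = Seq(("ann", 30), ("bob", 25)).toDF("name", "age")

    // Round-trip through parquet (placeholder path)
    val path = "/tmp/people-parquet"
    df.write.mode("overwrite").parquet(path)
    val back = spark.read.parquet(path)
    println(back.count())

    // The same reader handles the other formats, e.g.:
    // spark.read.option("header", "true").csv("people.csv")
    // spark.read.format("libsvm").load("sample.txt")
    spark.stop()
  }
}
```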
val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")

Reference

https://blog.csdn.net/gongpulin/article/details/77622107
Summary