Spark MapR Lab: Auction Data Analysis
I. Environment Setup
1. Install Hadoop
http://my.oschina.net/u/204498/blog/519789
2. Install Spark
3. Start Hadoop
4. Start Spark
II.
1. Data preparation
Download DEV360DATA.zip from the MapR website and upload it to the server.
[hadoop@hftclclw0001 spark-1.5.1-bin-hadoop2.6]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6
[hadoop@hftclclw0001 spark-1.5.1-bin-hadoop2.6]$ cd test-data/
[hadoop@hftclclw0001 test-data]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6/test-data/DEV360Data
[hadoop@hftclclw0001 DEV360Data]$ ll
total 337940
-rwxr-xr-x 1 hadoop root    575014 Jun 24 16:18 auctiondata.csv        => the data used in this exercise
-rw-r--r-- 1 hadoop root  57772855 Aug 18 20:11 sfpd.csv
-rwxrwxrwx 1 hadoop root 287692676 Jul 26 20:39 sfpd.json
[hadoop@hftclclw0001 DEV360Data]$ more auctiondata.csv
8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3
8213034705,115,2.943484,davidbresler2,1,95,117.5,xbox,3
8213034705,100,2.951285,gladimacowgirl,58,95,117.5,xbox,3
8213034705,117.5,2.998947,daysrus,10,95,117.5,xbox,3
8213060420,2,0.065266,donnie4814,5,1,120,xbox,3
8213060420,15.25,0.123218,myreeceyboy,52,1,120,xbox,3
... ...

# The columns are as follows
auctionid,bid,bidtime,bidder,bidrate,openbid,price,itemtype,daystolive

# Upload the data to HDFS
[hadoop@hftclclw0001 DEV360Data]$ hdfs dfs -mkdir -p /spark/exer/mapr
[hadoop@hftclclw0001 DEV360Data]$ hdfs dfs -put auctiondata.csv /spark/exer/mapr
[hadoop@hftclclw0001 DEV360Data]$ hdfs dfs -ls /spark/exer/mapr
Found 1 items
-rw-r--r--   2 hadoop supergroup     575014 2015-10-29 06:17 /spark/exer/mapr/auctiondata.csv

2. Run spark-shell
I use Scala here and work through the following tasks.
Tasks:
a. How many items were sold?
b. How many bids per item type?
c. How many different kinds of item type?
d. What was the minimum number of bids?
e. What was the maximum number of bids?
f. What was the average number of bids?
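Before working through the tasks, it can help to give the nine columns names. The following is a minimal sketch, not part of the lab material: `Bid` and `parseBid` are hypothetical names, and the numeric types (`Double` for amounts and times, `Int` for the bidder rating and the day count) are assumptions inferred from the sample rows shown during data preparation.

```scala
// Hypothetical helper for illustration; the article itself works on raw Array[String] rows.
object BidParsingSketch {
  // One field per CSV column, in the order of the column list shown earlier.
  // The field types are assumptions based on the sample rows.
  final case class Bid(
    auctionId: String,
    bid: Double,
    bidTime: Double,
    bidder: String,
    bidderRate: Int,
    openBid: Double,
    price: Double,
    itemType: String,
    daysToLive: Int
  )

  // Returns None when the line does not have exactly nine fields
  // or when a numeric field cannot be parsed.
  def parseBid(line: String): Option[Bid] =
    line.split(",", -1) match {
      case Array(id, bid, time, bidder, rate, open, price, item, days) =>
        try Some(Bid(id, bid.toDouble, time.toDouble, bidder, rate.toInt,
                     open.toDouble, price.toDouble, item, days.toInt))
        catch { case _: NumberFormatException => None }
      case _ => None
    }
}
```

For example, `BidParsingSketch.parseBid("8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3")` yields a `Some(Bid(...))` whose `itemType` is `"xbox"`, while a malformed line yields `None`. The walkthrough that follows keeps the simpler `Array[String]` representation and addresses columns by index.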
[hadoop@hftclclw0001 spark-1.5.1-bin-hadoop2.6]$ pwd
/home/hadoop/spark-1.5.1-bin-hadoop2.6
[hadoop@hftclclw0001 spark-1.5.1-bin-hadoop2.6]$ ./bin/spark-shell
... ...
scala>

// First, load the data from HDFS to create an RDD
scala> val originalRDD = sc.textFile("/spark/exer/mapr/auctiondata.csv")
... ...

// originalRDD has type RDD[String]; think of it as an Array[String] with one String per line
scala> originalRDD
res26: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

// Split each line on "," using map
scala> val auctionRDD = originalRDD.map(_.split(","))

// auctionRDD has type RDD[Array[String]]: still an array, but each element is itself an array, so think of it as Array[Array[String]]
scala> auctionRDD
res17: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:23

a. How many items were sold?
==> val count = auctionRDD.map(bid => bid(0)).distinct().count()
Deduplicating on auctionid is enough: split each record on ",", take the auctionid, remove duplicates, then count.
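The same split, project, de-duplicate, count pipeline can be tried without a cluster. The sketch below is an illustration on plain Scala collections rather than on an RDD; `DistinctAuctionsSketch` and `itemsSold` are hypothetical names, and the input is just the six sample rows shown in the `more auctiondata.csv` output.

```scala
// Local illustration only: Seq stands in for RDD, so nothing here is distributed.
object DistinctAuctionsSketch {
  // The six sample rows printed in the data preparation step.
  val sampleLines: Seq[String] = Seq(
    "8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3",
    "8213034705,115,2.943484,davidbresler2,1,95,117.5,xbox,3",
    "8213034705,100,2.951285,gladimacowgirl,58,95,117.5,xbox,3",
    "8213034705,117.5,2.998947,daysrus,10,95,117.5,xbox,3",
    "8213060420,2,0.065266,donnie4814,5,1,120,xbox,3",
    "8213060420,15.25,0.123218,myreeceyboy,52,1,120,xbox,3"
  )

  // Split each line on ",", keep field 0 (auctionid), drop duplicates, count.
  def itemsSold(lines: Seq[String]): Int =
    lines.map(_.split(",")).map(_(0)).distinct.size
}
```

On the six sample rows, `DistinctAuctionsSketch.itemsSold(DistinctAuctionsSketch.sampleLines)` returns 2, because only two distinct auction ids appear. On the full file the RDD version is the one to use, since `distinct()` there runs across partitions.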
// Take the first column, i.e. the auctionid, again using map.
// One way to read the next line: since auctionRDD is effectively Array[Array[String]], each argument passed to map is an Array[String].
// auctionid is the first element of that array, so we take element (0). Note that Scala indexes with (), not [].
scala> val auctionidRDD = auctionRDD.map(_(0))
... ...

// auctionidRDD has type RDD[String]; think of it as Array[String], the array of all auctionids
scala> auctionidRDD
res27: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at map at <console>:26

// Remove duplicates from auctionidRDD
scala> val auctionidDistinctRDD = auctionidRDD.distinct()

// Count
scala> auctionidDistinctRDD.count()
... ...

b. How many bids per item type?
===> auctionRDD.map(bid => (bid(7),1)).reduceByKey((x,y) => x + y).collect()
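To make the behaviour of `reduceByKey` concrete, here is a minimal local sketch of the same idea on an ordinary `Seq` of pairs. `reduceByKeyLocal` is a hypothetical helper written for this illustration, not a Spark API, and the item types in `sampleTypes` are made-up input rather than rows from the data file.

```scala
// Local illustration of the reduceByKey idea; not how Spark implements it.
object ReduceByKeySketch {
  // Folds all values that share a key into one value using `f`.
  // The first value seen for a key is stored as-is; later ones are combined with it.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.get(k) match {
        case Some(prev) => acc.updated(k, f(prev, v))
        case None       => acc.updated(k, v)
      }
    }

  // Made-up item types, standing in for column 7 of each row.
  val sampleTypes: Seq[String] = Seq("xbox", "xbox", "palm", "cartier", "xbox", "palm")

  // (itemtype, 1) pairs summed per key, mirroring map(...).reduceByKey(_ + _).
  val counts: Map[String, Int] = reduceByKeyLocal(sampleTypes.map(t => (t, 1)))(_ + _)
}
```

Here `ReduceByKeySketch.counts` is `Map(xbox -> 3, palm -> 2, cartier -> 1)` (map ordering aside). The real `reduceByKey` also merges partial results computed on different partitions, so the function passed to it should be associative and commutative; addition satisfies both.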
// Map each row: take the field at index 7 (the itemtype column) and emit (itemtype, 1)
// The result can be viewed as an array of (String, Int)
scala> auctionRDD.map(bid => (bid(7), 1))
res30: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[26] at map at <console>:26
...

// reduceByKey reduces the values key by key
// For values that share a key, it works like this:
// (xbox,1)(xbox,1)(xbox,1)(xbox,1)...(xbox,1) ==> reduceByKey ==> (xbox, (..(((1 + 1) + 1) + ... + 1)))
scala> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y)
// The type is still an array of (String, Int): the String is the itemtype, and the Int is now the total count for that itemtype
res31: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[28] at reduceByKey at <console>:26

// collect() converts the RDD into a local Array
scala> auctionRDD.map(bid => (bid(7), 1)).reduceByKey((x, y) => x + y).collect()
res32: Array[(String, Int)] = Array((palm,5917), (cartier,1953), (xbox,2784))

Reposted from: https://my.oschina.net/u/204498/blog/523576
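The walkthrough stops after task b. As a rough sketch of how tasks c through f could be approached, the example below uses plain Scala collections on the six sample rows from the data preparation step. It assumes that "number of bids" means the number of bids per auction; `RemainingTasksSketch` and its member names are hypothetical, and none of these figures describe the full data file.

```scala
// Local illustration on six sample rows; the names and the per-auction reading are assumptions.
object RemainingTasksSketch {
  // The six sample rows printed in the data preparation step.
  val sampleLines: Seq[String] = Seq(
    "8213034705,95,2.927373,jake7870,0,95,117.5,xbox,3",
    "8213034705,115,2.943484,davidbresler2,1,95,117.5,xbox,3",
    "8213034705,100,2.951285,gladimacowgirl,58,95,117.5,xbox,3",
    "8213034705,117.5,2.998947,daysrus,10,95,117.5,xbox,3",
    "8213060420,2,0.065266,donnie4814,5,1,120,xbox,3",
    "8213060420,15.25,0.123218,myreeceyboy,52,1,120,xbox,3"
  )

  private val rows: Seq[Array[String]] = sampleLines.map(_.split(","))

  // c. Number of different item types: distinct values of field 7.
  val itemTypeCount: Int = rows.map(_(7)).distinct.size

  // Bids per auction: group rows by field 0 (auctionid) and count each group.
  val bidsPerAuction: Map[String, Int] =
    rows.groupBy(_(0)).map { case (id, bids) => id -> bids.size }

  // d, e, f. Minimum, maximum and average of the per-auction bid counts.
  val minBids: Int = bidsPerAuction.values.min
  val maxBids: Int = bidsPerAuction.values.max
  val avgBids: Double = bidsPerAuction.values.sum.toDouble / bidsPerAuction.size
}
```

On the sample rows this gives one item type, a minimum of 2 bids, a maximum of 4, and an average of 3.0. On the RDD, the per-auction counts could be built the same way as in task b, keyed on `bid(0)` instead of `bid(7)`, before reducing them to a minimum, maximum, and mean.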