Big Data Offline Pipeline (Mini Exercise)
Looking at the raw data, we can see that the column separator is "!". A video may belong to multiple categories, which are separated by "&" with a space on each side; it may also have multiple related videos, which are again separated by "!". To make multi-valued fields easier to operate on during analysis, we first clean and restructure the data.

That is: for each record, keep the video categories separated by "&" but strip the surrounding spaces, and re-join the multiple related-video ids with "&".
Result [screenshot]:

Implementation code [screenshot]

Mapper code

The Reducer can be omitted here (adding one would be gilding the lily).

Driver code

Code:
```java
package com.czxy.MR;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

/**
 * Created by 一個蔡狗 on 2020/1/7.
 */
public class VideoRunner {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "VideoR");

        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("E:\\input\\video\\"));

        job.setMapperClass(VideoMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        // The Reducer is omitted: the mapper output is written out directly.
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("E:\\output\\video"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    static class VideoMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] split = value.toString().split("!");
            String head = "";   // the first 9 columns, re-joined with "!"
            String end = "";    // the related-video ids, re-joined with "&"
            for (int i = 0; i < split.length; i++) {
                if (i < 9) {
                    head += split[i] + "!";
                    // strip the spaces around "&" in the category column
                    if (head.contains("&")) {
                        head = head.replace(" & ", "&");
                    }
                } else {
                    end += split[i];
                    if (i != split.length - 1) {
                        end += "&";
                    }
                }
            }
            // robustness check: if there are no related videos,
            // drop the trailing "!" instead of appending "null"
            String line;
            if (end.isEmpty()) {
                line = head.substring(0, head.length() - 1);
            } else {
                line = head + end;
            }
            context.write(NullWritable.get(), new Text(line));
        }
    }

    // The Reducer below is unnecessary and is left out:
    // static class VideoReduce extends Reducer<NullWritable, Text, Text, NullWritable> { ... }
}
```

Load the preprocessed data into Hive.
Data loading stage
Create the database and tables:
Create a database named video:

```sql
create database video;
```

Create the raw-data tables: video table douyinvideo_ori and user table douyinvideo_user_ori. Then create the ORC-format tables: video table douyinvideo_orc and user table douyinvideo_user_orc.

DDL for the raw tables:

```sql
-- douyinvideo_ori video table
create table douyinvideo_ori(
    videoId string,
    uploader string,
    age int,
    category array<string>,
    length int,
    views int,
    rate float,
    ratings int,
    comments int,
    relatedId array<string>)
row format delimited
fields terminated by "!"
collection items terminated by "&"
stored as textfile;

-- douyinvideo_user_ori user table
create table douyinvideo_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited
fields terminated by ","
stored as textfile;
```
Loading result [screenshot]:

Loading commands:
2.1 Create the douyinvideo_orc table — write the statement on the answer sheet.
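A possible answer for 2.1, assuming the ORC tables simply mirror the schemas of the raw tables above (a sketch, not the official answer):

```sql
-- douyinvideo_orc: same columns as douyinvideo_ori, stored as ORC
create table douyinvideo_orc(
    videoId string, uploader string, age int, category array<string>,
    length int, views int, rate float, ratings int, comments int,
    relatedId array<string>)
row format delimited
fields terminated by "!"
collection items terminated by "&"
stored as orc;

-- likewise for the user table
create table douyinvideo_user_orc(
    uploader string, videos int, friends int)
row format delimited
fields terminated by ","
stored as orc;
```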
2.2 Write the statements that import the data into douyinvideo_ori (and douyinvideo_user_ori) — write them on the answer sheet.
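One way 2.2 could look. The HDFS paths below are placeholders for the MapReduce output and the user file, not paths given in the source:

```sql
-- load the cleaned video data (path is an assumption)
load data inpath '/video/output/' into table douyinvideo_ori;
-- load the user data (path is an assumption)
load data inpath '/video/user/' into table douyinvideo_user_ori;
```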
2.3 Query the data from the raw tables and insert it into the corresponding ORC tables.
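A sketch for 2.3 — copying everything from the textfile tables into the ORC tables:

```sql
insert into table douyinvideo_orc select * from douyinvideo_ori;
insert into table douyinvideo_user_orc select * from douyinvideo_user_ori;
```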
3.1 Script (the fragment given in the source; the query is truncated there):

```shell
#!/bin/bash
hive -e "select douyinvideo_ori.*
```
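Since the 3.1 fragment is cut off, here is one plausible shape for such a script — a `hive -e` query redirected to a local file. The query body and output path are assumptions, not the source's:

```shell
#!/bin/bash
# sketch: run a query over the raw video table and save the result locally
hive -e "select douyinvideo_ori.* from douyinvideo_ori" > /export/result.txt
```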
3.2 Find the ten uploaders with the most uploaded videos and, among their uploads, the top 20 videos by traffic; save the query result to /export/uploader.txt. Write the script:

```shell
#!/bin/bash
```
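One way to sketch the 3.2 script, assuming "traffic" means the views column and that the data has already been loaded into the ORC tables above:

```shell
#!/bin/bash
# top-10 uploaders by upload count, then their top-20 videos by views
hive -e "
select v.*
from (
  select uploader
  from douyinvideo_user_orc
  order by videos desc
  limit 10
) u
join douyinvideo_orc v on v.uploader = u.uploader
order by v.views desc
limit 20;" > /export/uploader.txt
```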
Table creation statements

4.1 Create the external tables in the Hive database.

Statement to create the ratings external table — write it on the answer sheet:
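A sketch of the ratings external table, assuming the step-3 result keeps the videoId and ratings columns (the delimiter and HDFS location are assumptions):

```sql
create external table ratings(
    videoId string,
    ratings int)
row format delimited
fields terminated by '\t'
location '/external/ratings';
```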
Statement to create the uploader external table — write it on the answer sheet:
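A sketch of the uploader external table, assuming the 3.2 output rows carry the same columns as douyinvideo_orc (the delimiters and HDFS location are assumptions):

```sql
create external table uploader(
    videoId string, uploader string, age int, category array<string>,
    length int, views int, rate float, ratings int, comments int,
    relatedId array<string>)
row format delimited
fields terminated by "!"
collection items terminated by "&"
location '/external/uploader';
```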
4.2 Data load statements

Load the result data from step 3 into the external tables — write the statement that loads into the uploader table on the answer sheet.
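A minimal sketch of the 4.2 load, taking the local file produced by the 3.2 script:

```sql
load data local inpath '/export/uploader.txt' into table uploader;
```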
4.3 Create the Hive-HBase mapping table

Create the hbase_ratings table and map it — write the statement on the answer sheet.
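A sketch of the 4.3 mapping table using the standard HBaseStorageHandler. The column set follows the ratings schema assumed above, with an explicit key column mapped to the HBase rowkey:

```sql
create table hbase_ratings(
    key string,
    videoId string,
    ratings int)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,cf:videoId,cf:ratings")
tblproperties ("hbase.table.name" = "hbase_ratings");
```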
4.4 Insert data

Write the insert overwrite … select statement that populates the hbase_ratings table on the answer sheet.
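A sketch of the hbase_ratings insert, assuming the videoId doubles as the rowkey (consistent with the HBase API query later, which filters on a videoId-shaped rowkey):

```sql
insert overwrite table hbase_ratings
select videoId, videoId, ratings from ratings;
```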
Write the insert overwrite … select statement that populates the hbase_uploader table:

```sql
insert overwrite table hbase_uploader select * from uploader;
```
Data query and display stage

1. Code [screenshot]:

Use the HBase API to query the hbase_ratings table: for a given rowkey, read the values of the videoId and ratings columns in the cf column family.

Code:
```java
package com.czxy.Api;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Created by 一個蔡狗 on 2020/1/7.
 */
public class HbaseAPI01 {
    public static void main(String[] args) throws Exception {
        // 1: query the videoId and ratings columns of the cf column family
        //    in hbase_ratings for a single rowkey
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node001,node002,node003");
        Connection connection = ConnectionFactory.createConnection(conf);
        Table ratings = connection.getTable(TableName.valueOf("hbase_ratings"));

        Scan scan = new Scan();
        // match one rowkey (the sample data uses a videoId as the rowkey)
        RowFilter rowFilter = new RowFilter(CompareFilter.CompareOp.EQUAL,
                new BinaryComparator(Bytes.toBytes("LVCb52iQrfo")));
        // match either the videoId or the ratings qualifier
        FilterList qualifiers = new FilterList(FilterList.Operator.MUST_PASS_ONE);
        qualifiers.addFilter(new QualifierFilter(CompareFilter.CompareOp.EQUAL,
                new BinaryComparator(Bytes.toBytes("videoId"))));
        qualifiers.addFilter(new QualifierFilter(CompareFilter.CompareOp.EQUAL,
                new BinaryComparator(Bytes.toBytes("ratings"))));
        // both must hold: the right row AND one of the two columns
        FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filterList.addFilter(rowFilter);
        filterList.addFilter(qualifiers);
        scan.setFilter(filterList);

        ResultScanner scanner = ratings.getScanner(scan);
        for (Result result : scanner) {
            for (Cell cell : result.rawCells()) {
                // values written through the Hive-HBase handler are stored
                // as strings, so decode them with Bytes.toString
                System.out.println(Bytes.toString(CellUtil.cloneRow(cell)) + "_"
                        + Bytes.toString(CellUtil.cloneQualifier(cell)) + "_"
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
        ratings.close();
        connection.close();
    }
}
```

2. Code [screenshot]:
Use the HBase API with a RowFilter on the hbase_uploader table to retrieve all rows whose rowkey is less than MdNyOfjnETI.

Code:
```java
package com.czxy.Api;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Created by 一個蔡狗 on 2020/1/7.
 */
public class HbaseAPI02 {
    // 2: use a RowFilter to scan every row of hbase_uploader whose
    //    rowkey sorts before "MdNyOfjnETI"
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "node001,node002,node003");
        Connection connection = ConnectionFactory.createConnection(conf);
        Table uploader = connection.getTable(TableName.valueOf("hbase_uploader"));

        Scan scan = new Scan();
        RowFilter rowFilter = new RowFilter(CompareFilter.CompareOp.LESS,
                new BinaryComparator(Bytes.toBytes("MdNyOfjnETI")));
        scan.setFilter(rowFilter);

        ResultScanner scanner = uploader.getScanner(scan);
        for (Result result : scanner) {
            for (Cell cell : result.rawCells()) {
                System.out.println(Bytes.toString(CellUtil.cloneRow(cell)) + "_"
                        + Bytes.toString(CellUtil.cloneQualifier(cell)) + "_"
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
        uploader.close();
        connection.close();
    }
}
```