20190328 - Several Data Cleaning Methods

Published: 2023/12/20

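Contents

- Cleaning rows with missing values (empty values, null values, etc.)
- Changing the delimiter
- Extracting the year, month, day, and other fields
- Removing the first and last characters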

Cleaning rows with missing values (empty values, null values, etc.)

Source data

    [yao@master data]$ head -2 tmall-201412-1w.csv
    13764633023 2014-12-01 02:20:42.000 全視目Allseelook 原宿風暴顯色美瞳彩色隱形藝術眼鏡1片拍2包郵 33.6 2 18067781305
    13377918580 2014-12-17 08:10:25.000 kilala可啦啦大美目大直徑混血美瞳年拋彩色近視隱形眼鏡2片包郵 19.8 2 17359010576

This data set has 10,000 rows and 9 columns in total, but some rows have empty values, null values, blanks, and so on in some columns.

Method 1:

Use an awk command to drop the rows with missing values.

    [yao@master data]$ cat tmall_filter.sh
    #!/bin/bash
    infile=$1
    outfile=$2
    # Keep a row only if fields 1-6 are not empty, not "null", and not a single space
    awk -F"\t" '{
        if ($1 != ""     && $2 != ""     && $3 != ""     && $4 != ""     && $5 != ""     && $6 != "" &&
            $1 != "null" && $2 != "null" && $3 != "null" && $4 != "null" && $5 != "null" && $6 != "null" &&
            $1 != " "    && $2 != " "    && $3 != " "    && $4 != " "    && $5 != " "    && $6 != " ")
            print $0
    }' "$infile" > "$outfile"
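The field-by-field comparison above can also be written as a loop. The following is a minimal sketch, not the author's script: the function name `drop_incomplete_rows` and the sample rows are made up for illustration, and it is slightly broader than the original check because it also treats `NULL` in any letter case and runs of several blanks as missing.

```shell
#!/bin/bash
# Sketch: keep only rows whose first N tab-separated fields are all present.
# drop_incomplete_rows is a hypothetical helper name, not part of the article.
drop_incomplete_rows() {
    # $1 = number of leading fields to check; reads stdin, writes stdout
    awk -F'\t' -v n="$1" '
    {
        for (i = 1; i <= n; i++) {
            v = $i
            gsub(/^[ \t]+|[ \t]+$/, "", v)         # trim surrounding blanks
            if (v == "" || tolower(v) == "null")   # empty, blank, or null/NULL
                next                               # skip the whole row
        }
        print
    }'
}

# Tiny demonstration on made-up rows (not the tmall data).
sample=$(printf 'a\tb\tc\nd\t\tf\ng\tnull\ti\nj\t \tl\nm\tn\to\n')
cleaned=$(printf '%s\n' "$sample" | drop_incomplete_rows 3)
printf '%s\n' "$cleaned"
```

Under these assumptions, the real file would be processed with something like `drop_incomplete_rows 6 < tmall-201412-1w.csv > cleaned.csv`. A row with fewer than N fields is dropped as well, because the missing fields compare equal to the empty string.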

Method 2:

1. Create a temporary table

    CREATE TABLE IF NOT EXISTS tmall.tmall_201412_uid_pid(
        uid STRING,
        pid STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

2. Clean the data
(1) Initial load, then check the result

    INSERT OVERWRITE TABLE tmall.tmall_201412_uid_pid
    SELECT uid, pid FROM tmall.tmall_201412;

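Download the HDFS file to the local machine:

    hadoop fs -get /user/hive/warehouse/tmall.db/tmall_201412_uid_pid/000000_0 .

Open the local file:

    vi 000000_0

Run the search command:

    /null

This shows quite a lot of noisy data that needs cleaning, such as rows containing null or fields equal to "".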

(2) Initial cleaning

    INSERT OVERWRITE TABLE tmall.tmall_201412_uid_pid
    SELECT regexp_extract(uid, '^[0-9]*$', 0),
           regexp_extract(pid, '^[0-9]*$', 0)
    FROM tmall.tmall_201412
    WHERE regexp_extract(uid, '^[0-9]*$', 0) IS NOT NULL
      AND regexp_extract(uid, '^[0-9]*$', 0) != 'NULL'
      AND regexp_extract(uid, '^[0-9]*$', 0) != ''
      AND regexp_extract(uid, '^[0-9]*$', 0) != ' '
      AND regexp_extract(uid, '^[0-9]*$', 0) != 'null'
      AND regexp_extract(pid, '^[0-9]*$', 0) IS NOT NULL
      AND regexp_extract(pid, '^[0-9]*$', 0) != 'NULL'
      AND regexp_extract(pid, '^[0-9]*$', 0) != ''
      AND regexp_extract(pid, '^[0-9]*$', 0) != ' '
      AND regexp_extract(pid, '^[0-9]*$', 0) != 'null';
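The same "digits only" rule can be tried on a local two-column file before running it in Hive. This is a rough sketch under stated assumptions: the helper name `keep_numeric_ids` is hypothetical, the input is assumed to hold uid in column 1 and pid in column 2 (like the downloaded `000000_0` file), and the pattern uses `+` rather than `*` so that an empty field fails the match directly.

```shell
#!/bin/bash
# Sketch: keep rows whose uid and pid are non-empty and purely numeric.
# keep_numeric_ids and the two-column layout are assumptions for illustration.
keep_numeric_ids() {
    awk -F'\t' '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $1 "\t" $2 }'
}

# Made-up rows: one clean row, one "null" uid, one empty pid, one non-numeric uid.
sample=$(printf '13764633023\t18067781305\nnull\t17359010576\n13377918580\t\n12ab\t99\n')
numeric_only=$(printf '%s\n' "$sample" | keep_numeric_ids)
printf '%s\n' "$numeric_only"
```

Because `^[0-9]+$` requires at least one digit, the separate comparisons against `''`, `' '`, `'null'`, and `'NULL'` are not needed in this variant.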

Changing the delimiter

Source data

    [yao@master product]$ head -5 1-5.csv
    香菜,2.8,4,4,4,2.2,山西汾陽市晉陽農副產品批發市場,山西,汾陽
    大蔥,2.8,2.8,2.8,2.8,2.6,山西汾陽市晉陽農副產品批發市場,山西,汾陽
    蔥頭,1.6,1.6,1.6,1.6,1.6,山西汾陽市晉陽農副產品批發市場,山西,汾陽
    大蒜,3.6,3.6,3.6,3.6,3,山西汾陽市晉陽農副產品批發市場,山西,汾陽
    蒜苔,6.2,6.4,6.4,6.4,5.2,山西汾陽市晉陽農副產品批發市場,山西,汾陽
    [yao@master product]$ cat china-province.csv
    河北省,山西省,遼寧省,吉林省,黑龍江省,江蘇省,浙江省,安徽省,福建省,江西省,山東省,河南省,湖北省,湖南省,廣東省,海南省,四川省,貴州省,云南省,陜西省,甘肅省,青海省,臺灣省,內蒙古自治區,廣西壯族自治區,西藏自治區,寧夏回族自治區,新疆維吾爾自治區,香港特別行政區,澳門特別行政區

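Method 1:
(1) Change the comma separators in 1-5.csv to \t separators.
Open the local file:

    vim 1-5.csv

Run the substitute command (the pattern is a bare comma, with no quotes around it):

    :%s/,/\t/g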
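For large files, the same substitution can be done without opening an editor. The sketch below uses `tr`; the function name `commas_to_tabs` is hypothetical, and it assumes that no field contains a comma of its own (no quoted CSV fields), which holds for the sample rows shown but is not guaranteed for other data.

```shell
#!/bin/bash
# Sketch: convert comma-separated text to tab-separated text without an editor.
# Assumes no field contains a comma of its own (no quoted CSV fields).
commas_to_tabs() {
    tr ',' '\t'
}

# Made-up rows shaped like 1-5.csv.
converted=$(printf 'a,2.8,4\nb,1.6,1.6\n' | commas_to_tabs)
printf '%s\n' "$converted"
```

With these assumptions, `commas_to_tabs < 1-5.csv > 1-5.tsv` would write a tab-separated copy and leave the original file untouched.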

(2) Clean the data in china-province.csv: split on commas so that each line holds one province.

    vi filter.sh
    #!/bin/bash
    infile=$1
    outfile=$2
    # Print every comma-separated field on its own line
    awk -F"," '{for(i=1;i<=NF;i++) print $i}' "$infile" > "$outfile"
    [yao@master product]$ bash filter.sh /home/yao/data/product/china-province.csv /home/yao/data/product/province.csv

Method 2

(1) Use the Linux iconv command to change the file's encoding (encoding conversion):

    # -f: original encoding; -t: new encoding; china-province01: original file name; -o: newly generated file name
    iconv -f GB2312 -t UTF-8 china-province01 -o china-province1

    # Fields separated by \t
    awk -F "\t" '{print $1"\t"$2"\t2014/1/1\t"$7"\t"$8"\t"$9}' 1-5.csv > data

    # Replace commas with newlines
    sed 's/,/\n/g' china-province.txt > china-province

In the vim editor, remove suffixes such as 省 and 自治區 from china-province:

    :%s/,/\r/g
    :%s/省//g
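The two vim substitutions can be combined into one non-interactive pipeline. This is only a sketch: `strip_suffixes` is a hypothetical helper name, the input is assumed to be UTF-8 already, and only three suffixes are handled. Ethnic-group names inside autonomous-region names (for example 廣西壯族) are left in place and would need their own rules.

```shell
#!/bin/bash
# Sketch: split a comma-separated list into one name per line, then strip
# administrative suffixes. strip_suffixes is a hypothetical helper name.
strip_suffixes() {
    tr ',' '\n' |
    sed -e 's/特別行政區$//' -e 's/自治區$//' -e 's/省$//'
}

names=$(printf '河北省,內蒙古自治區,香港特別行政區\n' | strip_suffixes)
printf '%s\n' "$names"
```

Anchoring each pattern with `$` removes the suffix only at the end of a name, so a character such as 省 that appears in the middle of a line is not touched.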

Extracting the year, month, day, and other fields

Source data

    [yao@master hw]$ head -10 sogou_10w_ext
    20171230000005 57375476989eea12893c0c3811607bcf 奇藝高清 1 1 http://www.qiyi.com/ 2017 12 30 00
    20171230000005 66c5bb7774e31d0a22278249b26bc83a 凡人修仙傳 3 1 http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1 2017 12 30 00
    20171230000007 b97920521c78de70ac38e3713f524b50 本本聯盟 1 1 http://www.bblianmeng.com/ 2017 12 30 00
    20171230000008 6961d0c97fe93701fc9c0d861d096cd9 華南師范大學圖書館 1 1 http://lib.scnu.edu.cn/ 2017 12 30 00
    20171230000008 f2f5a21c764aebde1e8afcc2871e086f 在線代理 2 1 http://proxyie.cn/ 2017 12 30 00
    20171230000009 96994a0480e7e1edcaef67b20d8816b7 偉大導演 1 1 http://movie.douban.com/review/1128960/ 2017 12 30 00
    20171230000009 698956eb07815439fe5f46e9a4503997 youku 1 1 http://www.youku.com/ 2017 12 30 00
    20171230000009 599cd26984f72ee68b2b6ebefccf6aed 安徽合肥365房產網 1 1 http://hf.house365.com/ 2017 12 30 00
    20171230000010 f577230df7b6c532837cd16ab731f874 哈薩克網址大全 1 1 http://www.kz321.com/ 2017 12 30 00
    20171230000010 285f88780dd0659f5fc8acc7cc4949f2 IQ數碼 1 1 http://www.iqshuma.com/ 2017 12 30 00

This data set has 5 million rows. The first column is a time, but it is not in a standard time format such as yyyy-MM-dd HH:mm:ss. Some values are also missing, for example null values and empty values.

Extending the data
Use substr() to cut the year, month, day, and hour out of the time in the first column.

    [yao@master data]$ cat sogou-log-extend.sh
    #!/bin/bash
    #infile=/sogou.500w.utf8
    infile=$1
    #outfile=/sogou.500w.utf8.final
    outfile=$2
    # awk's substr() is 1-indexed, so the year starts at position 1
    awk -F '\t' '{print $0"\t"substr($1,1,4)"\t"substr($1,5,2)"\t"substr($1,7,2)"\t"substr($1,9,2)}' "$infile" > "$outfile"
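A note on the start index: awk's substr() counts characters from 1. A start of 0 is accepted, but awk implementations handle it differently, and in some it can shorten the result (for example 201 instead of 2017). The short sketch below prints both variants for one timestamp so the behaviour of the local awk can be checked; the variable names are made up for illustration.

```shell
#!/bin/bash
# Sketch: compare substr() with a start of 0 and a start of 1 on one timestamp.
ts='20171230000005'
from_zero=$(printf '%s\n' "$ts" | awk '{ print substr($1, 0, 4) }')   # implementation-dependent
from_one=$(printf '%s\n' "$ts" | awk '{ print substr($1, 1, 4) }')    # first four characters
printf 'start=0: %s\nstart=1: %s\n' "$from_zero" "$from_one"
```

If the two lines differ on a given machine, the start index of 1 is the one that yields the full four-digit year.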

The last four columns are the year, month, day, and hour, respectively.

    [yao@master hw]$ head -1 sogou_10w_ext
    20171230000005 57375476989eea12893c0c3811607bcf 奇藝高清 1 1 http://www.qiyi.com/ 2017 12 30 00
    20171230000005 66c5bb7774e31d0a22278249b26bc83a 凡人修仙傳 3 1 http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1 2017 12 30 00
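The same substr() calls can also rewrite the first column into the standard yyyy-MM-dd HH:mm:ss form. This is a minimal sketch and not part of the original workflow: `to_standard_time` is a hypothetical name, the first column is assumed to be a 14-digit yyyyMMddHHmmss value, and rows that do not match that shape are passed through unchanged rather than dropped.

```shell
#!/bin/bash
# Sketch: rewrite a 14-digit yyyyMMddHHmmss first column as yyyy-MM-dd HH:mm:ss.
# to_standard_time is a hypothetical helper; other rows pass through unchanged.
to_standard_time() {
    awk -F'\t' 'BEGIN { OFS = "\t" }
    length($1) == 14 && $1 ~ /^[0-9]+$/ {
        d = substr($1, 1, 4) "-" substr($1, 5, 2) "-" substr($1, 7, 2)     # date part
        t = substr($1, 9, 2) ":" substr($1, 11, 2) ":" substr($1, 13, 2)   # time part
        $1 = d " " t
    }
    { print }'
}

# Made-up rows: one valid timestamp and one row with a null in the first column.
standardized=$(printf '20171230000005\tyouku\nnull\tx\n' | to_standard_time)
printf '%s\n' "$standardized"
```

Setting OFS to a tab matters here: assigning to `$1` makes awk rebuild the line, and without it the remaining fields would be rejoined with spaces.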

Cleaning the data
Remove empty values and blanks

    [yao@master data]$ cat sogou-log-filter.sh
    #!/bin/bash
    #infile=/data/sogou-data/sogou.500w.utf8
    infile=$1
    #outfile=/data/sogou-data/sogou.500w.utf8.final
    outfile=$2
    # Keep a row only if fields 2 and 3 are neither empty nor a single space
    awk -F"\t" '{if($2 != "" && $3 != "" && $2 != " " && $3 != " ") print $0}' "$infile" > "$outfile"

Removing the first and last characters

Test data

    [yao@master data]$ cat test.txt
    [12345]
    [12345]
    [12345]
    [12345]

Remove the first character

    [yao@master data]$ sed 's/.//' test.txt
    12345]
    12345]
    12345]
    12345]

Remove the last character (the final command removes the last two characters)

    [yao@master data]$ sed -r 's/]$//g' test.txt
    [12345
    [12345
    [12345
    [12345
    [yao@master data]$ sed 's/.$//' test.txt
    [12345
    [12345
    [12345
    [12345
    [yao@master data]$ awk '{sub(/.$/,"")}1' test.txt
    [12345
    [12345
    [12345
    [12345
    [yao@master data]$ awk '{sub(/.{2}$/,"")}1' test.txt
    [1234
    [1234
    [1234
    [1234
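Both ends can be trimmed in a single sed call by passing two expressions. The following is a small sketch using made-up input in the shape of test.txt; `strip_ends` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: remove the first and the last character of every line in one pass.
# strip_ends is a hypothetical helper name.
strip_ends() {
    sed -e 's/^.//' -e 's/.$//'
}

trimmed=$(printf '[12345]\n[12345]\n' | strip_ends)
printf '%s\n' "$trimmed"
```

This removes whatever characters sit at the two ends, so it gives the intended result only when the brackets really are the first and last characters of each line.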

Source data

    [{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387159495","commentCount":"1419","content":"分享圖片","createTime":"1386981067","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww3.sinaimg.cn/thumbnail/40d61044jw1ebixhnsiknj20qo0qognx.jpg"],"praiseCount":"5265","reportCount":"1285","source":"iPad客戶端","userId":"1087770692","videourl":[],"weiboId":"3655325888057474","weiboUrl":"http://weibo.com/1087770692/AndhixO7g"}]
    [{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387159495","commentCount":"91","content":"行走：#去遠方發現自己#@費勇主編，跨界明星聯合執筆，分享他們觀行思趣的心發現、他們的成長與心路歷程，當當網限量贈送出品人@陳坤抄誦印刷版《心經》，贈完不再加印哦！詳情請戳：http://t.cn/8k622Sj","createTime":"1386925242","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww4.sinaimg.cn/thumbnail/b2336177jw1ebi6j4twk7j20m80tkgra.jpg"],"praiseCount":"1","reportCount":"721","source":"","userId":"2989711735","videourl":[],"weiboId":"3655091741442099","weiboUrl":"http://weibo.com/2989711735/An7bE639F"}]
    [{"beCommentWeiboId":"","beForwardWeiboId":"3655091741442099","catchTime":"1387159495","commentCount":"838","content":"抄誦“心經”為大家祈福，字不好，見諒！","createTime":"1386926798","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":[],"praiseCount":"2586","reportCount":"693","source":"iPad客戶端","userId":"1087770692","videourl":[],"weiboId":"3655098267456453","weiboUrl":"http://weibo.com/1087770692/An7mavhLn"}]

This data needs to be parsed as JSON, but each record is wrapped in square brackets, which is not the form the JSON parsing step expects. In addition, after decompression there are multiple files in one directory.

Method 1:

    #!/bin/bash
    # Script that removes the first bracket
    dir=`ls /home/yao/data/weibo/619893`        # list this directory
    dir_input="/home/yao/data/weibo/619893/"
    dir_out="/home/yao/data/weibo/test/"
    for i in $dir                               # run for every file in the directory
    do
        infile=$dir_input$i                     # each file under dir_input, from first to last
        sed 's/.//' "$infile" >> "${dir_out}test.json"    # append the result under dir_out
    done

    # Remove the last bracket
    awk '{sub(/.{2}$/,"")}1' test.json > data.json

After cleaning, load the data into a Hive table and then parse the JSON.
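The directory loop can also be written with a shell glob instead of parsing `ls` output, and both brackets can be stripped in the same pass. This is a sketch under stated assumptions: `unwrap_dir` and the temporary paths are hypothetical, every record is assumed to occupy one line, and the closing bracket is assumed to be followed by nothing but optional whitespace.

```shell
#!/bin/bash
# Sketch: strip the wrapping brackets from every file in a directory.
# unwrap_dir and the temporary paths below are hypothetical, for illustration only.
unwrap_dir() {
    # $1 = input directory, $2 = output file
    : > "$2"                                   # start with an empty output file
    for f in "$1"/*; do
        [ -f "$f" ] || continue                # skip anything that is not a regular file
        # drop a leading "[" and a trailing "]" plus any whitespace or CR after it
        sed -e 's/^\[//' -e 's/][[:space:]]*$//' "$f" >> "$2"
    done
}

# Demonstration on two made-up files; the second has a trailing space.
workdir=$(mktemp -d)
mkdir "$workdir/in"
printf '[{"weiboId":"1"}]\n' > "$workdir/in/a.json"
printf '[{"weiboId":"2"}] \n' > "$workdir/in/b.json"
unwrap_dir "$workdir/in" "$workdir/out.json"
unwrapped=$(cat "$workdir/out.json")
printf '%s\n' "$unwrapped"
rm -rf "$workdir"
```

Matching the brackets explicitly, rather than "any first character" and "any last two characters", means a line that lacks the wrapper is left as it is instead of losing real data.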

Method 2:
Create the table in Hive and, while loading, use substr() to take the text from the second character through the second-to-last character.

    CREATE TABLE rate_weibo AS
    SELECT get_json_object(b.line, '$.catchTime') AS catchTime
    FROM (SELECT substr(line, 2, length(line) - 2) AS line FROM weibo_json) b;

Or create the table and load the data:

    CREATE TABLE weibo_json(line STRING) ROW FORMAT DELIMITED;
    LOAD DATA LOCAL INPATH '/home/yao/data/weibo/test/data.json' OVERWRITE INTO TABLE weibo_json;

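Summary
In many repeated experiments, removing only the last character of the source data did not remove the closing square bracket. A space or some similar character after the bracket may be the cause, so the requirement was met only by removing the last two characters.
The substr() function, by contrast, extracts exactly the characters we want, which makes it more practical.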
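One possible explanation, which the source does not confirm, is an invisible character such as a carriage return after the closing bracket. The sketch below reproduces that situation on a made-up line and shows a pattern that removes the bracket whether or not such a character follows it.

```shell
#!/bin/bash
# Sketch: a line ending in "]" followed by an invisible carriage return.
# The sample line is made up; the real files may end differently.
line=$(printf '[12345]\r')
one_char=$(printf '%s\n' "$line" | sed 's/.$//')             # removes only the CR
robust=$(printf '%s\n' "$line" | sed 's/][[:space:]]*$//')   # removes "]" and what follows
printf '%s\n' "$one_char" "$robust"
# To inspect the real bytes at the end of a file's first line:
#   head -n 1 test.json | od -c | tail -n 3
```

If `od -c` shows `\r` or a space after `]` in the real files, that would account for the last-two-characters workaround; if it does not, the cause lies elsewhere.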
