31. E-commerce Analysis and Service Recommendation
1. Data Preprocessing
1.2 The Data Preprocessing Workflow
Building on the exploratory analysis of the raw data, this case identifies data that is irrelevant to the analysis goal or that the model cannot use directly, and processes it accordingly. The processing techniques involved are:
- Data cleaning
- Data transformation
- Attribute reduction
2. Data Cleaning
2.1 Data Cleaning Rules
- The exploratory analysis surfaced data that is irrelevant to the analysis goal; these records follow recognizable patterns: URLs of intermediate pages, "consultation posted successfully" pages, lawyer login-helper pages, and so on. These patterns were compiled into deletion rules, summarized in the table below. Lawyer users account for roughly 22% of all records; each of the other categories is small, around 5%.
- Even after this cleaning, the records still contain many directory pages (the paths users follow while browsing for information). These pages contribute little once the data enters the recommender and can even distort its results, so a further filter keeps only pages whose URL ends in .html.
- The analysis goal and the exploration results show that consultation and knowledge pages are the site's main source of business, so the records related to consultation and knowledge are selected as the data for model analysis.
數(shù)據(jù)清洗操作的實現(xiàn)
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('all_gzdata', engine, chunksize=10000)  # read the raw table in chunks
for i in sql:
    d = i[['realIP', 'fullURL']]  # keep only the IP and URL columns
    d = d[d['fullURL'].str.contains(r'\.html')].copy()  # keep only URLs containing .html
    # save to the cleaned_gzdata table (created automatically if it does not exist)
    d.to_sql('cleaned_gzdata', engine, index=False, if_exists='append')
```

3. Data Transformation
3.1 Handling User Pagination
- Some articles are split across several pages, and users flip through them, producing multiple URLs for the same content. These URLs therefore need to be restored to their original form: first identify the paginated URLs, then restore each one to its base URL, and finally deduplicate the pages visited by each user.
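A minimal illustration of the restore-and-deduplicate idea on a few made-up URLs: the pagination suffix `_N` is stripped, after which the paginated views collapse to a single record.

```python
import pandas as pd

# three views of the same article: the base page plus two paginated continuations
urls = pd.Series([
    'http://www.lawtime.cn/info/hunyin/hynews/201312.html',
    'http://www.lawtime.cn/info/hunyin/hynews/201312_2.html',
    'http://www.lawtime.cn/info/hunyin/hynews/201312_3.html',
])

# strip the pagination suffix _N to recover the base article URL
restored = urls.str.replace(r'_\d{0,2}\.html', '.html', regex=True)
print(restored.nunique())  # 1
```

After restoring, `drop_duplicates()` on the user/URL pairs (as in the implementation below) removes the now-identical records.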
3.2 Implementing the Pagination Handling
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('cleaned_gzdata', engine, chunksize=10000)
for i in sql:  # transform and deduplicate chunk by chunk
    d = i.copy()
    # strip the pagination suffix (e.g. _2.html) to normalize to the base URL
    d['fullURL'] = d['fullURL'].str.replace(r'_\d{0,2}\.html', '.html', regex=True)
    d = d.drop_duplicates()  # drop duplicate records
    d.to_sql('changed_gzdata', engine, index=False, if_exists='append')  # save
```

3.3 URL Classification
- The exploration stage found that some pages carry the wrong category, so the URLs need to be reclassified. Since the analysis targets the consultation and knowledge categories, the URLs are classified by hand-written rules: records whose URL contains the keyword "ask" or "askzt" are assigned to the consultation category, and URLs containing "info" or "zhishiku" are assigned to the knowledge category.
3.4 Implementing the URL Classification
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('changed_gzdata', engine, chunksize=10000)
for i in sql:
    d = i.copy()
    d['type_l'] = d['fullURL']
    d['type_l_1'] = None
    d['type_l_2'] = None
    # assign a category by URL keywords; .loc avoids chained-assignment pitfalls
    d.loc[d['fullURL'].str.contains('ask|askzt'), 'type_l'] = 'zixun'                  # consultation
    d.loc[d['fullURL'].str.contains('info|zhishiku'), 'type_l'] = 'zhishi'             # knowledge
    d.loc[d['fullURL'].str.contains('faguizt|lifadongtai'), 'type_l'] = 'fagui'        # laws and regulations
    d.loc[d['fullURL'].str.contains('fayuan|gongan|jianyu|gongzhengchu'), 'type_l'] = 'jigou'  # institutions
    d.loc[d['fullURL'].str.contains('interview'), 'type_l'] = 'fangtan'                # interviews
    d.loc[d['fullURL'].str.contains(r'd\d+(?:_\d)?(?:_p\d+)?\.html'), 'type_l'] = 'zhengce'    # policy
    d.loc[d['fullURL'].str.contains('baike'), 'type_l'] = 'baike'                      # encyclopedia
    d.loc[d['type_l'].str.len() > 15, 'type_l'] = 'etc'  # anything still holding a full URL is unmatched
    # extract the two sub-category levels from knowledge URLs
    pattern = (r'http://www\.lawtime\.cn/(?:info|zhishiku)/'
               r'(?P<type_l_1>[A-Za-z]+)/(?P<type_l_2>[A-Za-z]+)/\d+\.html')
    d[['type_l_1', 'type_l_2']] = d['fullURL'].str.extract(pattern)
    d.to_sql('splited_gzdata', engine, index=False, if_exists='append')
```

4. Attribute Reduction
6. Complete Code
6.1 Code Directory Structure
6.2 Complete Code
1. sql_clean_save.py
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('all_gzdata', engine, chunksize=10000)  # read the raw table in chunks
for i in sql:
    d = i[['realIP', 'fullURL']]  # keep only the IP and URL columns
    d = d[d['fullURL'].str.contains(r'\.html')].copy()  # keep only URLs containing .html
    # save to the cleaned_gzdata table (created automatically if it does not exist)
    d.to_sql('cleaned_gzdata', engine, index=False, if_exists='append')
```

2. sql_data_change.py
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('cleaned_gzdata', engine, chunksize=10000)
for i in sql:  # transform and deduplicate chunk by chunk
    d = i.copy()
    # strip the pagination suffix (e.g. _2.html) to normalize to the base URL
    d['fullURL'] = d['fullURL'].str.replace(r'_\d{0,2}\.html', '.html', regex=True)
    d = d.drop_duplicates()  # drop duplicate records
    d.to_sql('changed_gzdata', engine, index=False, if_exists='append')  # save
```

3. sql_data_split.py
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('changed_gzdata', engine, chunksize=10000)
for i in sql:
    d = i.copy()
    d['type_l'] = d['fullURL']
    d['type_l_1'] = None
    d['type_l_2'] = None
    # assign a category by URL keywords; .loc avoids chained-assignment pitfalls
    d.loc[d['fullURL'].str.contains('ask|askzt'), 'type_l'] = 'zixun'                  # consultation
    d.loc[d['fullURL'].str.contains('info|zhishiku'), 'type_l'] = 'zhishi'             # knowledge
    d.loc[d['fullURL'].str.contains('faguizt|lifadongtai'), 'type_l'] = 'fagui'        # laws and regulations
    d.loc[d['fullURL'].str.contains('fayuan|gongan|jianyu|gongzhengchu'), 'type_l'] = 'jigou'  # institutions
    d.loc[d['fullURL'].str.contains('interview'), 'type_l'] = 'fangtan'                # interviews
    d.loc[d['fullURL'].str.contains(r'd\d+(?:_\d)?(?:_p\d+)?\.html'), 'type_l'] = 'zhengce'    # policy
    d.loc[d['fullURL'].str.contains('baike'), 'type_l'] = 'baike'                      # encyclopedia
    d.loc[d['type_l'].str.len() > 15, 'type_l'] = 'etc'  # anything still holding a full URL is unmatched
    # extract the two sub-category levels from knowledge URLs
    pattern = (r'http://www\.lawtime\.cn/(?:info|zhishiku)/'
               r'(?P<type_l_1>[A-Za-z]+)/(?P<type_l_2>[A-Za-z]+)/\d+\.html')
    d[['type_l_1', 'type_l_2']] = d['fullURL'].str.extract(pattern)
    d.to_sql('splited_gzdata', engine, index=False, if_exists='append')
```
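After the three scripts have run, a quick sanity check on the category distribution can catch misclassified URLs. The sketch below uses an in-memory sample DataFrame in place of a chunk read back from the `splited_gzdata` table:

```python
import pandas as pd

# stand-in for a chunk read back from splited_gzdata
d = pd.DataFrame({'type_l': ['zixun', 'zixun', 'zhishi', 'etc']})

# share of each category; the dominant classes should be zixun and zhishi
share = d['type_l'].value_counts(normalize=True)
print(share['zixun'])  # 0.5
```

A large `etc` share would suggest the keyword rules are missing common URL patterns.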