30. E-commerce Analysis and Service Recommendation: Analysis Methods and Process
1. Analysis Methods and Process
1.1 Objectives
The goal of this case study is to make recommendations to users, that is, to establish links between users and items in some way. To better help users quickly discover pages of interest in massive data, it supplements the currently rather monolithic recommendation system. The main contents of the analysis methods and process for e-commerce service recommendation include:
- Data extraction
- Exploratory data analysis
2. Data Extraction
- The recommendation algorithm used by the recommender system
- This project uses collaborative filtering, whose key idea is to find similar users or pages from historical data. During data extraction, as much data as possible should be selected: this reduces the randomness of the recommendations, improves their accuracy, and better surfaces pages in the long tail that interest users.
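To make the idea concrete, here is a minimal sketch of item-based collaborative filtering. The tiny user-page matrix, the page names, and the `item_similarity` helper are all made up for this illustration; they are not part of the project code.

```python
import numpy as np
import pandas as pd

def item_similarity(hits):
    """Cosine similarity between the page columns of a 0/1 user-page matrix."""
    co = hits.T.dot(hits).astype(float)   # page-page co-occurrence counts
    norms = np.sqrt(np.diag(co))          # per-page vector norms
    sim = co / np.outer(norms, norms)     # cosine similarity
    for page in sim.columns:
        sim.loc[page, page] = 0.0         # a page should not recommend itself
    return sim

# Toy data: rows are users, columns are pages, 1 = the user clicked the page.
hits = pd.DataFrame([[1, 1, 0],
                     [1, 0, 1],
                     [1, 1, 0]],
                    index=['u1', 'u2', 'u3'],
                    columns=['pageA', 'pageB', 'pageC'])
sim = item_similarity(hits)
scores = hits.loc['u2'].dot(sim)  # recommendation scores for user u2
print(sim.round(3))
print(scores.round(3))
```

For user u2, who clicked pageA and pageC, the unvisited pageB receives the highest score, so it would be recommended first. With more historical data the similarity estimates become more stable, which is why the extraction step above favors a large dataset.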
Characteristics of the user-access data
- Using access time as the selection criterion, three months of user-access data were taken as the raw dataset. It contains 837,450 records in total, with attributes including user id, access time, source website, visited page, page title, referring page, tag, page category, and keywords.
Flow of the intelligent recommender system
- Build the database
- Import the data
- Set up the Python environment
- Analyze the data
- Build the model
Python code for accessing the database

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('all_gzdata', engine, chunksize=10000)
```

3. Exploratory Data Analysis
Page-type analysis
- First, count the page types that users clicked in the raw data. The page type is the first 3 digits of the "URL type" field (the field itself has 6 or 7 digits).
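Extracting the three-digit prefix can be done with pandas `str.extract`; a minimal sketch on made-up `fullURLId` values:

```python
import pandas as pd

# Made-up fullURLId values with 6 or 7 digits.
ids = pd.Series(['101003', '199005', '1070101'])
types = ids.str.extract(r'(\d{3})', expand=False)  # first three digits = page type
print(types.tolist())  # ['101', '199', '107']
```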
Page-type statistics
- Clicks related to consultation (page type 101) account for 49.16% of the records, followed by the "other" type (page type 199) at around 24%, and then knowledge-related pages (page type 107) at around 22%.
- The ranking of clicked page types is: consultation-related, knowledge-related, other pages, statutes (type 310), and lawyer-related pages (type 102). A preliminary conclusion is that, compared with long knowledge articles, users prefer to read or post consultations.
- Internal statistics of the knowledge type
Code for the page-type analysis

```python
counts = [i['fullURLId'].value_counts() for i in sql]  # count chunk by chunk
counts = pd.concat(counts).groupby(level=0).sum()  # merge the chunk results (group by index and sum)
counts = counts.reset_index()  # reset the index, keeping the old index as a column
counts.columns = ['index', 'num']  # rename the columns (the second defaults to 0)
counts['type'] = counts['index'].str.extract(r'(\d{3})')  # first three digits = category id
counts['percent'] = counts['num'] / counts['num'].sum() * 100
counts_ = counts[['type', 'num', 'percent']].groupby('type').sum()  # merge by category
counts_ = counts_.sort_values('num', ascending=False)  # sort in descending order
print(counts_)
```

Click-count analysis
- Count how many pages each user browsed in the raw data (users are distinguished by real IP). The results are shown in the table below: users who browsed only once account for about 58% of all users, most users browsed 2 to 7 pages, and the average number of pages browsed is 3.
Implementation of the click-count analysis code
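The core of this step is two nested `value_counts` calls: one to count views per IP, one to count users per view count. A minimal sketch on a made-up `realIP` sample (the full chunked version is `web_click_counts.py` in section 7):

```python
import pandas as pd

# Made-up realIP sample: each element is one page view by that IP.
views = pd.Series([1, 1, 1, 2, 3, 4, 4], name='realIP')

clicks_per_user = views.value_counts()                          # page views per IP
users_per_clicks = clicks_per_user.value_counts().sort_index()  # users who viewed exactly N pages
print(users_per_clicks)  # 1 view: 2 users, 2 views: 1 user, 3 views: 1 user
```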
Page ranking
- According to the analysis objectives, personalized recommendation mainly targets pages with an .html suffix. From the raw data, count the clicks on pages ending in .html.
- The table shows that among the top 20 pages by clicks, "statute topics" take the largest share, followed by "knowledge" and then "consultation".
Click counts by type
Statistics on paginated pages
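No code for the pagination statistics survives in this section; the sketch below shows one way paginated URLs might be detected. The `_N.html` suffix pattern and the example URLs are assumptions based on the regexes used elsewhere in this case study.

```python
import pandas as pd

# Made-up URLs: follow-up pages of an article carry an _N suffix before .html.
urls = pd.Series([
    'http://example.com/info/hunyin/123.html',
    'http://example.com/info/hunyin/123_2.html',
    'http://example.com/info/laodong/456_3.html',
])
is_paged = urls.str.contains(r'_\d+\.html')  # True for pages 2, 3, ...
print(is_paged.sum())  # number of follow-up (paged) views
```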
6. Summary
Analysis methods and process:
- Data extraction
  1. Build the database → import the data → set up the Python environment → analyze the data → build the model
- Exploratory data analysis
  2. Page-type analysis
  3. Page click-count analysis
  4. Page-ranking analysis
7. Complete Code
7.1 Code directory structure
7.2 Complete code
1 sql_value_counts.py
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('all_gzdata', engine, chunksize=10000)
'''
create_engine builds the connection. The connection string reads, in order,
"database format (mysql) + driver (pymysql) + user:password@host:port/database name",
with the encoding set to utf8 at the end.
all_gzdata is the table name, engine is the connection engine, and chunksize
reads 10,000 records at a time; sql is then an iterator, so no data is read yet.
'''

counts = [i['fullURLId'].value_counts() for i in sql]  # count chunk by chunk
counts = pd.concat(counts).groupby(level=0).sum()  # merge the chunk results (group by index and sum)
counts = counts.reset_index()  # reset the index, keeping the old index as a column
counts.columns = ['index', 'num']  # rename the columns (the second defaults to 0)
counts['type'] = counts['index'].str.extract(r'(\d{3})')  # first three digits = category id
counts['percent'] = counts['num'] / counts['num'].sum() * 100
counts_ = counts[['type', 'num', 'percent']].groupby('type').sum()  # merge by category
counts_ = counts_.sort_values('num', ascending=False)  # sort in descending order
print(counts_)
```

2 ask_value_counts.py
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('all_gzdata', engine, chunksize=10000)

# Statistics for category 101
def count101(i):  # custom counting function
    j = i[['fullURLId']][i['fullURLId'].str.contains('101')].copy()  # rows whose category contains 101
    return j['fullURLId'].value_counts()

counts2 = [count101(i) for i in sql]  # count chunk by chunk
counts2 = pd.concat(counts2).groupby(level=0).sum()  # merge the chunk results
counts2 = pd.DataFrame(counts2)
counts2.columns = ['num']
counts2['percent'] = counts2['num'] / counts2['num'].sum() * 100
counts2 = counts2.sort_values('num', ascending=False)  # sort in descending order
print(counts2)
```

3 know_value_counts.py
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('all_gzdata', engine, chunksize=10000)

# Statistics for category 107
def count107(i):  # custom counting function
    j = i[['fullURL']][i['fullURLId'].str.contains('107')].copy()  # rows whose category contains 107
    j['type'] = None  # add an empty column
    j.loc[j['fullURL'].str.contains('info/.+?/'), 'type'] = 'knowledge home page'
    j.loc[j['fullURL'].str.contains('info/.+?/.+?'), 'type'] = 'knowledge list page'
    j.loc[j['fullURL'].str.contains(r'/\d+?_*\d+?\.html'), 'type'] = 'knowledge content page'
    return j['type'].value_counts()

counts2 = [count107(i) for i in sql]  # count chunk by chunk
counts2 = pd.concat(counts2).groupby(level=0).sum()  # merge the chunk results
counts2 = pd.DataFrame(counts2)
counts2.columns = ['num']
counts2['percent'] = counts2['num'] / counts2['num'].sum() * 100
print(counts2)
```

4 other_value_counts.py
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('all_gzdata', engine, chunksize=10000)

# Statistics for category 1999001
def count1999001(i):  # custom counting function
    j = i[['pageTitle']][i['fullURLId'].str.contains('1999001')].copy()  # rows whose category contains 1999001
    j['type'] = 'other'
    j.loc[(j['pageTitle'] != '') & j['pageTitle'].str.contains('快車-律師助手', na=False), 'type'] = '快車-律師助手'
    j.loc[(j['pageTitle'] != '') & j['pageTitle'].str.contains('免費(fèi)發(fā)布法律咨詢', na=False), 'type'] = 'free consultation posting'
    j.loc[(j['pageTitle'] != '') & j['pageTitle'].str.contains('咨詢發(fā)布成功', na=False), 'type'] = 'consultation posted successfully'
    j.loc[(j['pageTitle'] != '') & j['pageTitle'].str.contains('快搜', na=False), 'type'] = 'quick search'
    return j['type'].value_counts()

counts2 = [count1999001(i) for i in sql]  # count chunk by chunk
counts2 = pd.concat(counts2).groupby(level=0).sum()  # merge the chunk results
counts2 = pd.DataFrame(counts2)
counts2.columns = ['num']
counts2['percent'] = counts2['num'] / counts2['num'].sum() * 100
counts2 = counts2.sort_values('num', ascending=False)  # sort in descending order
print(counts2)
```

5 web_click_counts.py
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('all_gzdata', engine, chunksize=10000)

# Count clicks: value_counts gives the frequency of each value
c = [i['realIP'].value_counts() for i in sql]
count3 = pd.concat(c).groupby(level=0).sum()
count3 = pd.DataFrame(count3)
count3[1] = 1  # one row per user
count3 = count3.groupby('realIP').sum()  # users per click count

# Collapse everything above 7 clicks into one row
count3_ = pd.concat([count3.iloc[:7, :],
                     count3.iloc[7:, :].sum().to_frame().T], ignore_index=True)
count3_.index = list(range(1, 8)) + ['more than 7']
print(count3_)

# Analyzing users with more than 7 views shows most of them viewed 8-100 pages:
counts3_7 = pd.concat([count3.iloc[7:100, :].sum(),
                       count3.iloc[100:300, :].sum(),
                       count3.iloc[300:, :].sum()])
counts3_7.index = ['8-100', '101-300', 'more than 300']
counts3_7df = pd.DataFrame(counts3_7)
counts3_7df.index.name = 'clicks'
counts3_7df.columns = ['users']
print(counts3_7df)
```

6 web_sort.py
```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://root:222850@127.0.0.1:3306/7law?charset=utf8')
sql = pd.read_sql('all_gzdata', engine, chunksize=10000)

counts4 = [i[['realIP', 'fullURL', 'fullURLId']] for i in sql]
counts4_ = pd.concat(counts4)
a = counts4_[counts4_['fullURL'].str.contains(r'\.html')]  # keep only .html pages
print(a.head())
```