
section8


The goal of this chapter is to [identify the target user group], in order to better serve existing users.

[Key Points]

1. Plotting

  • Displaying Chinese text

plt.rcParams['font.sans-serif'] = ['SimHei']  # Step 1: replace the sans-serif font
plt.rcParams['axes.unicode_minus'] = False    # Step 2: fix rendering of minus signs on the axes
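SimHei ships with Windows; on systems where it is absent, matplotlib silently falls back to a font without CJK glyphs. A hedged sketch of substitutions, assuming one of these fonts is installed:

# plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']  # macOS (assumption: this font is present)
# plt.rcParams['font.sans-serif'] = ['Noto Sans CJK SC']  # Linux with the Noto CJK fonts installed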

2. Database access

  • The sqlalchemy engine

engine = create_engine('mysql+pymysql://root:123456@localhost:3306/datascience')

3. Reading files in batches

  • Usage of os.walk() and os.path.join()

for root, dirs, files in os.walk(path):
    for file in files:
        rfile = os.path.join(root, file)
        if rfile.split('.')[-1] == 'tsv':
            rdf = pd.read_csv(rfile, sep='\t')
            df = df.append(rdf)

4. Combining groupby() and agg() to apply a different function to each column

  • Monthly aggregation

affc = {'payment': 'sum', 'log_date': 'count'}
dfm = df.groupby(['log_month', 'user_id']).agg(affc).reset_index()

  • Renaming columns

renam = {'log_date': 'access_days'}
dfm.rename(columns=renam, inplace=True)

5. Using KMeans clustering

  • Clustering a single column (reshape it into one column with reshape(-1, 1) first)

from sklearn.cluster import KMeans
a47 = action['A47'].values.reshape(-1, 1)  # .values avoids the deprecated Series.reshape
kms = KMeans(n_clusters=3).fit(a47)

  • Cluster labels via the labels_ attribute

cluster = kms.labels_

  • Attach the labels to the source data and inspect the groups with groupby()

action['cluster'] = cluster
action.groupby(['cluster'])['user_id'].count()

  • Visualizing the groups

snsdf = action[['user_id','A47','cluster']].sort_values(by='A47', ascending=False)
plt.figure(figsize=(8,5))
snsdf1 = snsdf.reset_index()
snsdf1[snsdf1['cluster']==2]['A47'].plot(color='r', label='2:重度用戶')  # heavy users
snsdf1[snsdf1['cluster']==1]['A47'].plot(color='g', label='1:中度用戶')  # moderate users
snsdf1[snsdf1['cluster']==0]['A47'].plot(color='b', label='0:輕度用戶')  # light users
plt.legend()
plt.xlabel('用戶分布')    # user distribution
plt.ylabel('排行榜得分')  # leaderboard score

6. Principal component analysis

  • Data preprocessing

    • Extract the columns to run PCA on
      paction = acc.iloc[:, 3:(len(acc.columns)-1)]
    • Drop columns dominated by zeros
      cc = paction[paction==0].count(axis=0)/len(paction)
      cc.plot()
      dd = cc[cc<.9]  # drop columns in which more than 90% of the values are 0
      paction = paction[dd.index]
      paction.head()
    • Drop highly correlated columns

      # Data overview
      corp = paction.corr()
      sns.heatmap(corp)
      mask = np.array(corp)
      mask[np.tril_indices_from(mask)] = False  # how to draw a lower-triangle heatmap
      sns.heatmap(corp, mask=mask)

      # Use the lower-triangular matrix to drop strongly correlated columns
      coll = corp.columns
      corp = pd.DataFrame(np.tril(corp, -1))  # np.tril(m, -1) keeps the lower triangle and zeroes the upper
      corp.columns = coll
      pac2 = paction.loc[:, (corp.abs()<.8).all()]  # keep columns whose correlations are all below 0.8, via all()
      pac2.head()
    • Run the principal component analysis

      from sklearn.decomposition import PCA
      pca = PCA()
      pca.fit(pac2)
      redio = pca.explained_variance_ratio_  # the explained-variance ratio of each component after PCA
      print(redio)
      print(pca.singular_values_)  # singular_values_ holds the singular values
    • Cumulative explained-variance curve of the components

      recu = redio.cumsum()  # cumsum() accumulates the ratios step by step
      plt.plot(recu)
    • Obtain the reduced data for the next step

      pca.set_params(n_components=10)  # set the number of components to 10
      pac3 = pd.DataFrame(pca.fit_transform(pac2))  # fit_transform() fits the model and returns the reduced data
      pac3.head()
    • Apply KMeans again to classify all users, then average every behavioral column per class
    • Again drop strongly correlated columns to obtain the final key indicators
    • Present the key indicators on a radar chart (see the code below)

      # First, standardize the data
      from sklearn.preprocessing import scale
      ccccc = pd.DataFrame(scale(cccc))
      ccccc.columns = cccc.columns

      # Plot
      plt.figure(figsize=(8,8))
      N = ccccc.shape[1]  # number of sectors in the polar plot
      angles = np.linspace(0, 2*np.pi, N, endpoint=False)  # angles that split the circle evenly
      angles = np.concatenate((angles, [angles[0]]))  # close the radar loop
      for i in range(len(ccccc)):
          values = ccccc.loc[i, :]  # build the data for one class
          values = np.concatenate((values, [values[0]]))  # close the loop
          plt.polar(angles, values, 'o-', linewidth=2)  # draw
      plt.legend(ccccc.index, loc='lower right')
      plt.thetagrids(angles * 180/np.pi, labels=list(ccccc.columns))  # polar axis labels
      plt.title('重要指標雷達圖呈現')  # radar chart of the key indicators

I. Library imports and displaying Chinese in matplotlib

import pandas as pd
import numpy as np
import pymysql
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import os

plt.rcParams['font.sans-serif'] = ['SimHei']  # Step 1: replace the sans-serif font
plt.rcParams['axes.unicode_minus'] = False    # Step 2: fix the minus sign on axis labels
%matplotlib inline

The database engine:

engine = create_engine('mysql+pymysql://root:123456@localhost:3306/datascience')

II. Reading files in batches

def read_files(path):
    df = pd.DataFrame()
    for root, dirs, files in os.walk(path):
        for file in files:
            rfile = os.path.join(root, file)
            if rfile.split('.')[-1] == 'tsv':
                rdf = pd.read_csv(rfile, sep='\t')
                df = df.append(rdf)
    return df

action_path = 'data/sample-data/section8/daily/action/'
dau_path = 'data/sample-data/section8/daily/dau/'
dpu_path = 'data/sample-data/section8/daily/dpu/'

action = read_files(action_path)
dau = read_files(dau_path)
dpu = read_files(dpu_path)
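A side note: DataFrame.append was deprecated in later pandas releases and removed in pandas 2.0, so on a current environment the same loop is more idiomatically written by collecting the frames and concatenating once. A minimal sketch of that variant (read_files_concat is a hypothetical name):

def read_files_concat(path):
    # Gather every .tsv under path, then concatenate in a single pass
    frames = []
    for root, dirs, files in os.walk(path):
        for file in files:
            rfile = os.path.join(root, file)
            if rfile.split('.')[-1] == 'tsv':
                frames.append(pd.read_csv(rfile, sep='\t'))
    return pd.concat(frames, ignore_index=True)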

Check data completeness and preview the first rows:

print(action.isnull().sum().sum())   # 0
print(action.shape)                  # (2653, 57)
# print(action.info())
action.head()

[action.head() output: 5 rows × 57 columns — log_date, app_name, user_id, A1 … A54]

print(dau.isnull().sum().sum())   # 0
print(dau.shape)                  # (509754, 3)
print(dau.info())
dau.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 509754 entries, 0 to 2410
Data columns (total 3 columns):
log_date    509754 non-null object
app_name    509754 non-null object
user_id     509754 non-null int64
dtypes: int64(1), object(2)
memory usage: 15.6+ MB
None

     log_date app_name  user_id
0  2013-05-01  game-01   608801
1  2013-05-01  game-01   712453
2  2013-05-01  game-01   776853
3  2013-05-01  game-01   823486
4  2013-05-01  game-01   113600
print(dpu.isnull().sum().sum())   # 0
print(dpu.shape)                  # (3532, 4)
print(dpu.info())
dpu.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3532 entries, 0 to 7
Data columns (total 4 columns):
log_date    3532 non-null object
app_name    3532 non-null object
user_id     3532 non-null int64
payment     3532 non-null int64
dtypes: int64(2), object(2)
memory usage: 138.0+ KB
None

     log_date app_name  user_id  payment
0  2013-05-01  game-01   804005      571
1  2013-05-01  game-01   793537       81
2  2013-05-01  game-01   317717       81
3  2013-05-01  game-01   317717       81
4  2013-05-01  game-01   426525      324
# Write to the database
# action.to_sql('s8_action', engine, index=False)
# dau.to_sql('s8_dau', engine, index=False)
# dpu.to_sql('s8_dpu', engine, index=False)
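Once those writes have been run, the tables can be read back through the same engine in a later session; a minimal sketch, assuming the s8_* tables exist in the datascience database:

action = pd.read_sql('SELECT * FROM s8_action', engine)
dau = pd.read_sql('SELECT * FROM s8_dau', engine)
dpu = pd.read_sql('SELECT * FROM s8_dpu', engine)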

III. Data preprocessing

1. Merging DAU and DPU

df = pd.merge(dau, dpu[['log_date','user_id','payment']], how='left', on=['user_id','log_date'])
df.head()

     log_date app_name  user_id  payment
0  2013-05-01  game-01   608801      NaN
1  2013-05-01  game-01   712453      NaN
2  2013-05-01  game-01   776853      NaN
3  2013-05-01  game-01   823486      NaN
4  2013-05-01  game-01   113600      NaN
# Set payment to 0 for records with no purchase
print(df.payment.isnull().sum())   # 507151
df['payment'].fillna(0, inplace=True)
print(df.payment.isnull().sum())   # 0

# Add a paid-user flag
df['is_pay'] = df['payment'].apply(lambda x: 1 if x > 0 else 0)
df.head()

     log_date app_name  user_id  payment  is_pay
0  2013-05-01  game-01   608801      0.0       0
1  2013-05-01  game-01   712453      0.0       0
2  2013-05-01  game-01   776853      0.0       0
3  2013-05-01  game-01   823486      0.0       0
4  2013-05-01  game-01   113600      0.0       0

2. Monthly aggregation

# Add a month column
df['log_month'] = df['log_date'].apply(lambda x: x[0:7])
df.head()

     log_date app_name  user_id  payment  is_pay log_month
0  2013-05-01  game-01   608801      0.0       0   2013-05
1  2013-05-01  game-01   712453      0.0       0   2013-05
2  2013-05-01  game-01   776853      0.0       0   2013-05
3  2013-05-01  game-01   823486      0.0       0   2013-05
4  2013-05-01  game-01   113600      0.0       0   2013-05

By combining the groupby and agg functions, we compute each user's monthly payment total and number of active days in one pass.

# Aggregate by month
affc = {'payment': 'sum', 'log_date': 'count'}
dfm = df.groupby(['log_month', 'user_id']).agg(affc).reset_index()

# Rename the columns
renam = {'log_date': 'access_days'}
dfm.rename(columns=renam, inplace=True)
dfm.head()

  log_month  user_id  payment  access_days
0   2013-05       65      0.0            1
1   2013-05      115      0.0            1
2   2013-05      194      0.0            1
3   2013-05      426      0.0            4
4   2013-05      539      0.0            1

3. Using KMeans to classify users and identify the top-ranked ones: heavy, moderate, and light users

Column A47 is the leaderboard score. The distribution plot shows that most users score very low, consistent with a power-law curve.
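A quick way to eyeball power-law-like behavior is a rank-score plot on log-log axes, which should come out roughly linear; a minimal sketch (the +1 offsets are only there to avoid log(0)):

scores = action['A47'].sort_values(ascending=False).reset_index(drop=True)
plt.loglog(scores.index + 1, scores + 1, 'b.')
plt.xlabel('rank')
plt.ylabel('A47')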

# An alternative: action['A47'].hist(bins=50, figsize=(6,4))

sns.distplot(action['A47'], bins=50, kde=True)

Cluster column A47 into 3 groups:

from sklearn.cluster import KMeans

a47 = action['A47'].reshape(-1, 1)
kms = KMeans(n_clusters=3).fit(a47)

FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead

cluster = kms.labels_
kms.cluster_centers_

array([[  9359.84787792],
       [ 69386.11297071],
       [185857.17948718]])

action['cluster'] = cluster
action.head()

[action.head() output: 5 rows × 58 columns — the original 57 columns plus the new cluster column]

action.groupby(['cluster'])['user_id'].count()

cluster
0    2096
1     479
2      78
Name: user_id, dtype: int64

The clustering splits users into 3 groups: 0 marks light users with the lowest leaderboard scores; 1 marks moderate users with mid-range scores; 2 marks heavy users with high scores and a small population, which matches expectations.
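One caveat: KMeans numbers its labels arbitrarily, so 0/1/2 are not guaranteed to be ordered by score — here the centers happened to come out in ascending order (see cluster_centers_ above), which makes the reading above valid. A hedged sketch that remaps labels to follow the cluster-center order, a no-op in this particular run:

order = kms.cluster_centers_.ravel().argsort().argsort()  # rank of each label's center
action['cluster'] = order[kms.labels_]                    # relabel so 0 < 1 < 2 by score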

snsdf = action[['user_id','A47','cluster']].sort_values(by='A47', ascending=False)
snsdf['user'] = range(len(snsdf))
sns.scatterplot(x='user', y='A47', hue='cluster', data=snsdf, palette='rainbow', alpha=.2)

snsdf = action[['user_id','A47','cluster']].sort_values(by='A47', ascending=False)
snsdf['user'] = range(len(snsdf))

plt.figure(figsize=(8,5))
snsdf1 = snsdf.reset_index()
snsdf1[snsdf1['cluster']==2]['A47'].plot(color='r', label='2:重度用戶')
snsdf1[snsdf1['cluster']==1]['A47'].plot(color='g', label='1:中度用戶')
snsdf1[snsdf1['cluster']==0]['A47'].plot(color='b', label='0:輕度用戶')
plt.legend()
plt.xlabel('用戶分布')
plt.ylabel('排行榜得分')

We now restrict the data to the top-ranked users, i.e. the higher-scoring heavy and moderate users, for the analysis that follows.

acc = action[action['cluster'] >= 1]
acc.head()

[acc.head() output: 5 rows × 58 columns]

4. Principal component analysis

Extracting the key variables:

paction = acc.iloc[:, 3:(len(acc.columns)-1)]
paction.index = acc.user_id
paction.head()

[paction.head() output: 5 rows × 54 columns — A1 … A54, indexed by user_id]

1. Drop columns dominated by zeros

cc = paction[paction==0].count(axis=0)/len(paction)
print(cc.head())
cc.plot()

A1    1.000000
A2    0.926391
A3    1.000000
A4    0.994614
A5    0.055655
dtype: float64

# cc[cc>.8]
dd = cc[cc < .95]
paction = paction[dd.index]
paction.head()

[paction.head() output: 5 rows × 32 columns]

2. Drop highly correlated columns

corp = paction.corr()
plt.figure(figsize=(15,8))
sns.heatmap(corp)

Drawing a lower-triangle heatmap:

mask = np.array(corp)
mask[np.tril_indices_from(mask)] = False
fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
sns.heatmap(corp, mask=mask)

np.tril(m, -1) extracts the lower triangle of a matrix; for the upper triangle, use np.triu(m, 1) instead.
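A tiny demo of the two helpers on a 3×3 matrix, to make the diagonal offset k concrete:

m = np.arange(1, 10).reshape(3, 3)
print(np.tril(m, -1))  # strictly lower triangle: [[0 0 0], [4 0 0], [7 8 0]]
print(np.triu(m, 1))   # strictly upper triangle: [[0 2 3], [0 0 6], [0 0 0]]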

coll = corp.columns
corp = pd.DataFrame(np.tril(corp, -1))
corp.columns = coll
corp.head()

[corp.head() output: 5 rows × 32 columns — the lower-triangular correlation matrix, upper triangle zeroed]

pac2 = paction.loc[:, (corp.abs() < .7).all()]  # keep columns whose correlations are all below 0.7
pac2.head()

[pac2.head() output: 5 rows × 16 columns — A2, A11, A12, A13, A20, A23, A24, A43, A44, A46, A48, A49, A50, A51, A53, A54]

Running the principal component analysis:

from sklearn.decomposition import PCA

pca = PCA()
pca.fit(pac2)

redio = pca.explained_variance_ratio_
print(redio)
print(pca.singular_values_)

[9.97843804e-01 1.92024564e-03 1.20120771e-04 5.57014208e-05
 2.67905481e-05 1.54533752e-05 9.31262940e-06 4.38846214e-06
 3.02317261e-06 8.36725295e-07 1.31874979e-07 9.78197162e-08
 3.86464536e-08 2.94647596e-08 1.82272465e-08 7.54580333e-09]
[3.96183910e+04 1.73797668e+03 4.34684952e+02 2.96004755e+02
 2.05284590e+02 1.55911168e+02 1.21032418e+02 8.30848288e+01
 6.89599635e+01 3.62791414e+01 1.44027941e+01 1.24044853e+01
 7.79687146e+00 6.80796010e+00 5.35458829e+00 3.44523057e+00]

recu = redio.cumsum()
print(recu)
x = np.arange(len(recu))
plt.plot(recu, color='r')

[0.9978438  0.99976405 0.99988417 0.99993987 0.99996666 0.99998212
 0.99999143 0.99999582 0.99999884 0.99999968 0.99999981 0.99999991
 0.99999994 0.99999997 0.99999999 1.        ]
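Since the first component alone explains about 99.8% of the variance here, the component count does not have to be hard-coded: scikit-learn's PCA also accepts a float in (0, 1) as n_components and keeps just enough components to reach that explained-variance fraction. A minimal sketch (pca95 and pac95 are hypothetical names):

pca95 = PCA(n_components=0.95)  # keep enough components to explain at least 95% of the variance
pac95 = pca95.fit_transform(pac2)
print(pca95.n_components_)      # the number of components actually retained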

Obtaining the reduced data:

pca.set_params(n_components=10)
pac3 = pd.DataFrame(pca.fit_transform(pac2))
pacsse = pac3.copy()
pac3.head()

[pac3.head() output: 5 rows × 10 principal components]

5. Clustering with KMeans

from sklearn.cluster import KMeans

km = KMeans(n_clusters=5)
km.fit(pac3)

clu = km.labels_
pac3['clu'] = clu
pac3.head()

[pac3.head() output: 5 rows × 11 columns — the 10 components plus the clu label]

pac3.groupby('clu')[2].count()

clu
0     90
1    113
2    122
3    109
4    123
Name: 2, dtype: int64

#### Available palette color styles:
Accent, Accent_r, Blues, Blues_r, BrBG, BrBG_r, BuGn, BuGn_r, BuPu, BuPu_r, CMRmap, CMRmap_r, Dark2, Dark2_r, GnBu, GnBu_r, Greens, Greens_r, Greys, Greys_r, OrRd, OrRd_r, Oranges, Oranges_r, PRGn, PRGn_r, Paired, Paired_r, Pastel1, Pastel1_r, Pastel2, Pastel2_r, PiYG, PiYG_r, PuBu, PuBuGn, PuBuGn_r, PuBu_r, PuOr, PuOr_r, PuRd, PuRd_r, Purples, Purples_r, RdBu, RdBu_r, RdGy, RdGy_r, RdPu, RdPu_r, RdYlBu, RdYlBu_r, RdYlGn, RdYlGn_r, Reds, Reds_r, Set1, Set1_r, Set2, Set2_r, Set3, Set3_r, Spectral, Spectral_r, Vega10, Vega10_r, Vega20, Vega20_r, Vega20b, Vega20b_r, Vega20c, Vega20c_r, Wistia, Wistia_r, YlGn, YlGnBu, YlGnBu_r, YlGn_r, YlOrBr, YlOrBr_r, YlOrRd, YlOrRd_r, afmhot, afmhot_r, autumn, autumn_r, binary, binary_r, bone, bone_r, brg, brg_r, bwr, bwr_r, cool, cool_r, coolwarm, coolwarm_r, copper, copper_r, cubehelix, cubehelix_r, flag, flag_r, gist_earth, gist_earth_r, gist_gray, gist_gray_r, gist_heat, gist_heat_r, gist_ncar, gist_ncar_r, gist_rainbow, gist_rainbow_r, gist_stern, gist_stern_r, gist_yarg, gist_yarg_r, gnuplot, gnuplot2, gnuplot2_r, gnuplot_r, gray, gray_r, hot, hot_r, hsv, hsv_r, icefire, icefire_r, inferno, inferno_r, jet, jet_r, magma, magma_r, mako, mako_r, nipy_spectral, nipy_spectral_r, ocean, ocean_r, pink, pink_r, plasma, plasma_r, prism, prism_r, rainbow, rainbow_r, rocket, rocket_r, seismic, seismic_r, spectral, spectral_r, spring, spring_r, summer, summer_r, tab10, tab10_r, tab20, tab20_r, tab20b, tab20b_r, tab20c, tab20c_r, terrain, terrain_r, viridis, viridis_r, vlag, vlag_r, winter, winter_r

plt.figure(figsize=(13,7))
sns.scatterplot(x=0, y=1, data=pac3, style='clu', hue='clu', palette='autumn')

Attach the cluster labels back to the original data:

pac4 = pac2.copy()
pac4['cluster'] = list(pac3.clu)
pac4.head()

[pac4.head() output: 5 rows × 17 columns — the 16 selected features plus cluster, indexed by user_id]
# Compute the mean of each cluster
clu5 = pac4.groupby('cluster').mean()

# Drop a highly correlated column
clu5.drop(columns='A53', inplace=True)
c5cor = clu5.corr()
plt.figure(figsize=(15,8))
sns.heatmap(c5cor, annot=True)

ccrp = pd.DataFrame(np.tril(c5cor, -1))
ccrp.columns = clu5.columns
cccc = clu5.loc[:, (ccrp.abs() < .95).all()]
cccc

               A2       A20       A23       A24       A44        A46        A50       A51         A54
cluster
0        0.022222  0.322222  0.655556  0.167691  0.858193  27.600000  10.666667  2.011111  166.711111
1        0.079646  0.274336  0.362832  0.095231  0.844027  20.159292   3.008850  1.469027  102.106195
2        0.073770  0.377049  0.336066  0.070628  0.849343  24.737705   4.286885  1.844262  121.909836
3        0.018349  0.229358  0.284404  0.098252  0.845981  24.119266   5.266055  1.733945  146.871560
4        0.203252  0.292683  0.243902  0.063686  0.775076  18.983740   2.130081  0.975610   84.032520
from sklearn.preprocessing import scale

ccccc = pd.DataFrame(scale(cccc))
ccccc.columns = cccc.columns
ccccc

         A2       A20       A23       A24       A44       A46       A50       A51       A54
0 -0.855590  0.468859  1.918400  1.862020  0.785882  1.422970  1.867773  1.118457  1.424282
1  0.002962 -0.503392 -0.094337 -0.104961  0.315530 -0.940402 -0.688647 -0.381093 -0.746672
2 -0.084884  1.582038 -0.278379 -0.772826  0.492038  0.513827 -0.261998  0.656909 -0.081200
3 -0.913505 -1.416613 -0.633601 -0.022944  0.380387  0.317394  0.064879  0.351742  0.757602
4  1.851016 -0.130892 -0.912083 -0.961289 -1.973837 -1.313789 -0.982007 -1.746015 -1.354012
plt.figure(figsize=(8,8))
# Number of sectors in the polar plot
N = ccccc.shape[1]
# Angles that split the circle evenly
angles = np.linspace(0, 2*np.pi, N, endpoint=False)
# Close the radar loop
angles = np.concatenate((angles, [angles[0]]))
for i in range(len(ccccc)):
    # Build the data for one cluster
    values = ccccc.loc[i, :]
    # Close the loop
    values = np.concatenate((values, [values[0]]))
    # Draw
    plt.polar(angles, values, 'o-', linewidth=2)
plt.legend(ccccc.index, loc='lower right')
# Polar axis labels
plt.thetagrids(angles * 180/np.pi, labels=list(ccccc.columns))
plt.title('重要指標雷達圖呈現')  # radar chart of the key indicators

Dimensionality reduction without preprocessing

dfp = acc.iloc[:, 3:(len(acc.columns)-1)]
dfp.index = acc.user_id
dfp.head()

[dfp.head() output: 5 rows × 54 columns — A1 … A54, indexed by user_id]

from sklearn.decomposition import PCA

pca = PCA(whiten=False)
pca.fit(dfp)

retio = pca.explained_variance_ratio_
# print(retio)
# print(pca.singular_values_)
rec = retio.cumsum()
print(rec)
x = np.arange(len(rec))
plt.plot(rec, color='r')

[0.9996008  0.99995245 0.99997489 0.99999016 0.9999933  0.99999564
 0.99999759 0.99999838 0.99999897 0.9999995  0.99999962 0.99999972
 0.99999979 0.99999986 0.9999999  0.99999993 0.99999996 0.99999997
 0.99999997 0.99999998 0.99999998 0.99999999 0.99999999 0.99999999
 0.99999999 1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.
 1.         1.         1.         1.         1.         1.        ]

pca.set_params(n_components=10)
pacsse = pd.DataFrame(pca.fit_transform(dfp))
pacsse.head()

[pacsse.head() output: 5 rows × 10 principal components]

Finding the optimal K with the elbow method

from sklearn.cluster import KMeans

df_features = pacsse  # the input data

# Choose k by SSE (the elbow method)
SSE = []  # the sum of squared errors for each k
for k in range(1, 9):
    estimator = KMeans(n_clusters=k)  # build the clusterer
    estimator.fit(df_features)
    SSE.append(estimator.inertia_)
X = range(1, 9)
plt.xlabel('k')
plt.ylabel('SSE')
plt.plot(X, SSE, 'o-')

Clearly, standardizing the data first is not appropriate here: scaling flattens the large variance differences between the principal components, which is exactly the structure the elbow curve needs to expose.

# Clearly, standardizing the data first is not appropriate
df_features = pd.DataFrame(scale(pacsse))
SSE = []
for k in range(1, 9):
    estimator = KMeans(n_clusters=k)
    estimator.fit(df_features)
    SSE.append(estimator.inertia_)
X = range(1, 9)
plt.xlabel('k')
plt.ylabel('SSE')
plt.plot(X, SSE, 'o-')

km = KMeans(n_clusters=4)
km.fit(pacsse)
clu = km.labels_
pacsse['clu'] = clu
pacsse.head()

[pacsse.head() output: 5 rows × 11 columns — the 10 components plus the clu label]
pacsse.groupby('clu')[2].count()

clu
0    153
1    344
2     54
3      6
Name: 2, dtype: int64

plt.figure(figsize=(13,7))
sns.scatterplot(x=0, y=1, data=pacsse, style='clu', hue='clu', palette='autumn')

Clearly, clustering the unpreprocessed data is problematic: the scatterplot shows a strong dependence between the first and second principal components.

pac4 = pac2.copy()
pac4['cluster'] = list(pacsse.clu)
pac4.head()

clu5 = pac4.groupby('cluster').mean()
clu5.drop(columns='A53', inplace=True)
c5cor = clu5.corr()
plt.figure(figsize=(15,8))
sns.heatmap(c5cor, annot=True)

ccrp = pd.DataFrame(np.tril(c5cor, -1))
ccrp.columns = clu5.columns
cccc = clu5.loc[:, (ccrp.abs() < .95).all()]
cccc

              A12       A20       A51         A54
cluster
0        3.398693  0.228758  1.810458  146.287582
1        1.938953  0.316860  1.433140  101.531977
2        4.592593  0.407407  1.870370  169.777778
3        2.166667  0.166667  1.666667  213.833333
from sklearn.preprocessing import scale

ccccc = pd.DataFrame(scale(cccc))
ccccc.columns = cccc.columns
ccccc

        A12       A20       A51       A54
0  0.352533 -0.562784  0.684599 -0.285229
1 -1.021705  0.406288 -1.555764 -1.388557
2  1.476502  1.402249  1.040338  0.293858
3 -0.807330 -1.245753 -0.169173  1.379928
plt.figure(figsize=(8,8))
# Number of sectors in the polar plot
N = ccccc.shape[1]
# Angles that split the circle evenly
angles = np.linspace(0, 2*np.pi, N, endpoint=False)
# Close the radar loop
angles = np.concatenate((angles, [angles[0]]))
for i in range(len(ccccc)):
    # Build the data for one cluster
    values = ccccc.loc[i, :]
    # Close the loop
    values = np.concatenate((values, [values[0]]))
    # Draw
    plt.polar(angles, values, 'o-', linewidth=2)
plt.legend(ccccc.index, loc='lower right')
# Polar axis labels
plt.thetagrids(angles * 180/np.pi, labels=list(ccccc.columns))
plt.title('重要指標雷達圖呈現')  # radar chart of the key indicators

Reposted from: https://www.cnblogs.com/cvlas/p/9537532.html
