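Inventory Monitoring - Anomaly Detection Algorithm
Contents
Background
Part 1 - Data Preparation
1. Loading the Data
2. Cleaning the Data
3. Adjusting and Aggregating the Data
Part 2 - Exploring the Inventory Data (EDA)
1. Overview of the Inventory
2. Visualization
Part 3 - Building the Model
1. Data Preprocessing
2. Building the Model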
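Background
In retail, whether cross-border e-commerce or traditional brick-and-mortar, the level of inventory determines the level of a company's current assets, and the speed at which inventory turns over determines whether its cash flow is healthy. Different companies apply different standards, and the same company may be in a different state at different times. Finance offers many metrics for monitoring inventory levels across the whole company, but a differentiated analysis at the SKU level raises many questions about which standard to apply and creates a heavy workload. This article explores how unsupervised learning can be used to monitor retail inventory, from data preparation through to the final visualization.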
Part 1 - Data Preparation
1. Loading the Data

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from matplotlib import gridspec
    from math import ceil

    # Read the data
    df = pd.read_csv('inventory.csv')

    # The data contains duplicates: per-warehouse detail rows plus summary rows
    # whose Warehouse is 'All'. Drop the summary rows.
    df = df[df['Warehouse'] != 'All']

    # Show the first 5 rows
    df.head()

Output:

            Part Number        Item Number    Warehouse  Total On Hand  \
    0    BAILLIE WALNUT  BF481825.39901727  KY - Hebron              0
    1     BAILLIE BEECH  BF481825.39901731  KY - Hebron              0
    2  NIELSEN BROWN PU  BF763074.46849385  KY - Hebron              0
    3     Scargill Mint  BF767982.46854251  KY - Hebron              0
    4     Scargill Blue  BF767982.46854254  KY - Hebron             34

       Total Available  Total Unavailable  Total On Transfer  Qty Allocated  \
    0                0                  0                  0              0
    1                0                  0                  0              0
    2                0                  0                  0              0
    3                0                  0                  0              0
    4                0                  1                 33              0

       Total On Order  Qty Received(12 Months)  Qty Shipped (30 Days)  \
    0               0                      201                      0
    1               0                        0                      0
    2               0                        0                      0
    3               0                        0                      0
    4               0                      170                      0

       Qty Reserved  Qty Unpickable  Qty On Hold  Qty Unprocessed Cycle Count  \
    0             0               0            0                            0
    1             0               0            0                            0
    2             0               0            0                            0
    3             0               0            0                            0
    4             0               0            1                            0

       Qty Unprocessed Adjustment
    0                           0
    1                           0
    2                           0
    3                           0
    4                           0

2. Cleaning the Data
    # Inspect the data: column types, null counts and null percentages.
    # DataFrame.append was removed in pandas 2.0, so the rows are stacked with pd.concat.
    tab_info = pd.DataFrame(df.dtypes).T.rename(index={0: 'column Type'})
    tab_info = pd.concat([
        tab_info,
        pd.DataFrame(df.isnull().sum()).T.rename(index={0: 'null values (nb)'}),
        pd.DataFrame(df.isnull().sum() / df.shape[0] * 100).T.rename(index={0: 'null values (%)'}),
    ])
    tab_info

Output:

                     Part Number Item Number Warehouse Total On Hand  \
    column Type           object      object    object         int64
    null values (nb)           3           0         0             0
    null values (%)    0.0883132           0         0             0

                     Total Available Total Unavailable Total On Transfer  \
    column Type                int64             int64             int64
    null values (nb)               0                 0                 0
    null values (%)                0                 0                 0

                     Qty Allocated Total On Order Qty Received(12 Months)  \
    column Type              int64          int64                   int64
    null values (nb)             0              0                       0
    null values (%)              0              0                       0

                     Qty Shipped (30 Days) Qty Reserved Qty Unpickable  \
    column Type                      int64        int64          int64
    null values (nb)                     0            0              0
    null values (%)                      0            0              0

                     Qty On Hold Qty Unprocessed Cycle Count  \
    column Type            int64                       int64
    null values (nb)           0                           0
    null values (%)            0                           0

                     Qty Unprocessed Adjustment
    column Type                           int64
    null values (nb)                          0
    null values (%)                           0

The columns used in the analysis are:
- Total On Hand: the quantity of the product in stock
- Qty Shipped (30 Days): the total quantity sold over the past 30 days
The null check looks fine: only Part Number has missing values (3 rows, about 0.09%), and that column is not used in the analysis, so the data can be used as is.
3. Adjusting and Aggregating the Data

    # Aggregate the data to the Item Number level.
    # Multiple columns must be selected with a list (double brackets);
    # tuple-style selection on a groupby is no longer supported by recent pandas.
    df_alinv = df.groupby(['Item Number'])[['Total On Hand', 'Qty Shipped (30 Days)']].sum().reset_index()
    df_alinv = df_alinv.sort_values(by='Total On Hand', ascending=False).reset_index(drop=True)

    # Drop products with zero stock (.copy() so new columns can be added later without warnings)
    df_alinv = df_alinv[df_alinv['Total On Hand'] > 0].copy()

    print(df_alinv.head())

Output:

             Item Number  Total On Hand  Qty Shipped (30 Days)
    0  TNFI1501.41585776          10737                    340
    1  TNFI1499.41585774           7723                    161
    2  TNFI1081.33297150           3171                    620
    3  TNFI1263.38188681           3135                      1
    4  TNFI1087.33298045           2530                    101

Part 2 - Exploring the Inventory Data (EDA)
1. Overview of the Inventory

    # Number of SKUs, total stock and how long it would last
    no_sku = len(df['Item Number'].unique())
    qty_all_inve = df['Total On Hand'].sum()
    qtySold30days = df_alinv["Qty Shipped (30 Days)"].sum()
    # Days of stock = total stock / 30-day sales * 30, rounded up
    daysOfGoodSold = ceil(qty_all_inve / qtySold30days * 30)

    print(f'Number of SKUs (including SKUs with zero stock): {no_sku}')
    print(f'Number of SKUs with stock: {len(df_alinv)}')
    print(f'Total stock: {qty_all_inve} units')
    print(f'Units sold in the past 30 days: {qtySold30days}')
    print(f'Current stock would last: {daysOfGoodSold} days')

Output:

    Number of SKUs (including SKUs with zero stock): 833
    Number of SKUs with stock: 463
    Total stock: 116142 units
    Units sold in the past 30 days: 18665
    Current stock would last: 187 days

2. Visualization
    # Cumulative share of total stock held by the top-N SKUs
    def get_pct(top_n):
        return df_alinv.iloc[:top_n]['Total On Hand'].sum() / qty_all_inve

    acc_pct = [get_pct(n) for n in range(len(df_alinv))]

    # New variable: turnover days
    df_alinv['turnover days'] = df_alinv['Total On Hand'] / df_alinv['Qty Shipped (30 Days)'] * 30
    # Cap turnover days at 365 so that extreme values do not dominate the charts
    df_alinv['turnover days (max=365)'] = df_alinv['turnover days'].clip(upper=365)

    # Inventory distribution charts
    fig = plt.figure(figsize=(10, 8))
    gs = gridspec.GridSpec(4, 2, width_ratios=[10, 1])

    ax0 = plt.subplot(gs[0])
    ax0.set_title('SKU Total On Hand distribution')
    ax0.bar(df_alinv.index, df_alinv['Total On Hand'])

    ax1 = plt.subplot(gs[1])
    ax1.set_title('boxplot')
    ax1.boxplot(df_alinv['Total On Hand'])

    ax2 = plt.subplot(gs[2])
    ax2.set_title('SKU TOH acc% evolution')
    ax2.plot(range(len(df_alinv)), acc_pct)

    ax3 = plt.subplot(gs[4])
    ax3.set_title('SKU Qty Sold in 30 days')
    ax3.bar(df_alinv.index, df_alinv['Qty Shipped (30 Days)'])

    ax4 = plt.subplot(gs[5])
    ax4.set_title('boxplot')
    ax4.boxplot(df_alinv['Qty Shipped (30 Days)'])

    ax5 = plt.subplot(gs[6])
    ax5.set_title('SKU Turnover Days (max=365)')
    ax5.set_xlabel('SKU ranked by Total On Hand')
    ax5.bar(df_alinv.index, df_alinv['turnover days (max=365)'])

    ax6 = plt.subplot(gs[7])
    ax6.set_title('boxplot')
    ax6.boxplot(df_alinv['turnover days (max=365)'])

    plt.tight_layout()
    plt.show()

    # Top 5 SKUs by stock
    print(df_alinv.sort_values(by='Total On Hand', ascending=False).head())

Output:

             Item Number  Total On Hand  Qty Shipped (30 Days)  turnover days  \
    0  TNFI1501.41585776          10737                    340     947.382353
    1  TNFI1499.41585774           7723                    161    1439.068323
    2  TNFI1081.33297150           3171                    620     153.435484
    3  TNFI1263.38188681           3135                      1   94050.000000
    4  TNFI1087.33298045           2530                    101     751.485149

       turnover days (max=365)
    0               365.000000
    1               365.000000
    2               153.435484
    3               365.000000
    4               365.000000

From the charts above:
Part 3 - Building the Model
Outliers are commonly separated out with statistical methods, or with machine learning algorithms applied to one-dimensional or multi-dimensional data. Widely used anomaly detection algorithms include Isolation Forest, DBSCAN and One-Class SVM. Choosing suitable dimensions matters a great deal in anomaly detection, and the choice depends on the business scenario. For inventory, what we care about is which products have a lot of stock and are not selling, so stock quantity (or stock value) and turnover days make a sensible pair of starting dimensions.
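As a point of comparison for the statistical route mentioned above, here is a minimal sketch of the common interquartile-range (IQR) rule on a single dimension. It is not the method used in the rest of this article; the function name `iqr_outliers`, the multiplier `k=1.5` and the sample values are assumptions made purely for illustration.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical turnover days for seven SKUs (illustrative values only)
turnover_days = [30, 45, 50, 60, 75, 90, 1400]
mask = iqr_outliers(turnover_days)
print(mask)  # only the last value (1400) is flagged
```

A rule like this looks at one column at a time, which is why a multi-dimensional method is attractive when stock quantity and turnover days need to be judged together.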
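1. Data Preprocessing

    from sklearn.ensemble import IsolationForest
    from sklearn.cluster import KMeans
    from sklearn import preprocessing

    # Split the data into two parts
    '''
    One where "Qty Shipped (30 Days)" is 0 and one where it is not 0.
    With stock on hand in both cases, these are the SKUs without sales
    and the SKUs with sales over the past 30 days, respectively.
    '''
    data_q0 = df_alinv[df_alinv['Qty Shipped (30 Days)'] == 0]
    data_q = df_alinv[df_alinv['Qty Shipped (30 Days)'] != 0]

    print(f'Total abnormal stock in this group: {data_q0["Total On Hand"].sum()}')
    print(f'SKUs in this group with more than 50 units on hand: {len(data_q0[data_q0["Total On Hand"] > 50])}')

    '''
    Total abnormal stock in this group: 3825
    SKUs in this group with more than 50 units on hand: 21
    '''

- SKUs with stock on hand and no sales in the past 30 days can essentially be classified as abnormal inventory outright.
- SKUs with stock on hand and sales in the past 30 days are examined further for abnormal stock levels.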
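One practical reason for treating the two groups separately is that turnover days is a ratio: with zero sales the division yields infinity, which cannot be passed to a model. The sketch below shows this on a tiny hypothetical table; the three SKUs and their values are made up for illustration, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical SKUs (illustrative values only)
toy = pd.DataFrame({
    'Item Number': ['A', 'B', 'C'],
    'Total On Hand': [34, 120, 10],
    'Qty Shipped (30 Days)': [0, 60, 5],
})
toy['turnover days'] = toy['Total On Hand'] / toy['Qty Shipped (30 Days)'] * 30

# Same split as above: no sales vs. some sales in the past 30 days
no_sales = toy[toy['Qty Shipped (30 Days)'] == 0]
with_sales = toy[toy['Qty Shipped (30 Days)'] != 0]

print(no_sales[['Item Number', 'turnover days']])    # turnover days is inf
print(with_sales[['Item Number', 'turnover days']])  # finite values
```

Only the `with_sales` part has finite turnover days, so only that part is suitable as model input.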
Plot histograms to inspect the distributions:

    # Plot histograms to inspect the distributions
    fig, ax = plt.subplots(1, 2, figsize=(10, 3))
    ax[0].hist(data_q['Total On Hand'])
    ax[0].set_title('Total On Hand Hist')
    ax[0].set_xlabel('Total On Hand')
    ax[0].set_ylabel('frequency')
    ax[1].hist(data_q['turnover days'])
    ax[1].set_title('Turnover Days Hist')
    ax[1].set_xlabel('Turnover Days')
    plt.tight_layout()
    plt.show()

2. Building the Model
    # Select the features to train on (.copy() so new columns can be added safely)
    X = data_q[['Total On Hand', 'turnover days']].copy()

    # Build new features
    '''
    As the order increases, the weight of "Total On Hand" decreases
    and the weight of "turnover days" increases.
    '''
    X['1f'] = data_q['Total On Hand'] * data_q['turnover days']
    X['2f'] = data_q['Total On Hand'] * data_q['turnover days'] ** 2
    X['3f'] = data_q['Total On Hand'] * data_q['turnover days'] ** 3

    Xlist = [
        X[['turnover days', '1f']],
        X[['turnover days', '2f']],
        X[['turnover days', '3f']],
    ]

    # # Scaling (the data is not normally distributed, so min-max scaling is used)
    # scaler = preprocessing.MinMaxScaler()
    # X_minmax = scaler.fit_transform(X)

    def if_train(Xlist, contList):
        """Fit one Isolation Forest per feature set and contamination value."""
        result = dict()
        for i, X in enumerate(Xlist):
            result[f'{i+1}f'] = dict()
            for c in contList:
                model = IsolationForest(contamination=c)
                model.fit(X)
                label = model.predict(X)  # 1 = normal, -1 = anomaly
                result[f'{i+1}f'][c] = X.copy()
                result[f'{i+1}f'][c]['label'] = label.copy()
        return result

    dList = ['1f', '2f', '3f']
    pList = [0.1, 0.2, 0.3, 0.4]  # contamination values to try

    # The original passed an undefined contList here; pList is used because
    # the plotting loop below looks the results up by these same values.
    result = if_train(Xlist, pList)

    fig, ax = plt.subplots(4, 3, figsize=(14, 15))

    for i, p in enumerate(pList):
        for j, d in enumerate(dList):
            ax[i, j].scatter(data_q['Total On Hand'], data_q['turnover days'],
                             c=result[d][p]['label'], cmap=plt.cm.Spectral)
            ax[i, j].set_xlim(0, 4000)
            ax[i, j].set_ylim(0, 2000)
            if i == 0:
                ax[i, j].set_title(d)
            if j == 0:
                ax[i, j].set_ylabel(p)
    plt.show()

Across the three constructed features (1f, 2f, 3f), for SKUs with stock close to zero, the turnover days at which a SKU is flagged as an anomaly drops from about 2000 to about 300 as p (the contamination value) increases. From an operational point of view, SKUs with turnover days above 300 can essentially be judged to be slow-moving stock. The dataset therefore suggests that more than 40% of SKUs are slow-moving. Judging by how the anomalies are distributed, 2f gives a reasonable tolerance for the turnover days of high-stock SKUs.
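The 300-day rule of thumb can also be checked directly, without a model. The sketch below does so for the five highest-stock SKUs printed earlier in this article; it is only a cross-check on those five rows, not a result for the full dataset, and the names `THRESHOLD_DAYS`, `sku_share` and `unit_share` are introduced here for illustration.

```python
import pandas as pd

# The five highest-stock SKUs from the output shown earlier
top5 = pd.DataFrame({
    'Item Number': ['TNFI1501.41585776', 'TNFI1499.41585774', 'TNFI1081.33297150',
                    'TNFI1263.38188681', 'TNFI1087.33298045'],
    'Total On Hand': [10737, 7723, 3171, 3135, 2530],
    'Qty Shipped (30 Days)': [340, 161, 620, 1, 101],
})
top5['turnover days'] = top5['Total On Hand'] / top5['Qty Shipped (30 Days)'] * 30

THRESHOLD_DAYS = 300  # rule of thumb from the text, not a universal constant
slow = top5['turnover days'] > THRESHOLD_DAYS

sku_share = slow.mean()  # share of SKUs above the threshold
unit_share = top5.loc[slow, 'Total On Hand'].sum() / top5['Total On Hand'].sum()

print(f'Slow-moving SKUs: {sku_share:.0%} of SKUs, {unit_share:.1%} of units')
```

Running the same comparison on the full `data_q` table would show how closely the fixed threshold agrees with the Isolation Forest labels at each contamination value.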