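Population Stability Index (PSI)
The Population Stability Index (PSI) measures how far the actual distribution of a model's predictions has drifted from the expected distribution.
PSI = sum((actual share - expected share) * ln(actual share / expected share)), summed over all bins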
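As a quick sanity check of the formula, here is a minimal sketch that computes PSI directly from two lists of bin shares. The function name `psi_from_shares` and the four-bin shares are made up for illustration; they are not from the original post.

```python
import math

def psi_from_shares(actual, expected):
    """PSI from two equal-length lists of bin shares (each list sums to 1).

    Assumes every share is strictly positive, otherwise ln() is undefined.
    """
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))

# Hypothetical shares: 5% of the population moved from bin 1 to bin 2.
expected = [0.25, 0.25, 0.25, 0.25]
actual = [0.20, 0.30, 0.25, 0.25]

psi_value = psi_from_shares(actual, expected)
print(round(psi_value, 4))  # 0.0203
```

Bins whose share did not change contribute exactly zero, so only the two bins that moved add to the total.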
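Example:
Suppose you train a logistic regression model; at prediction time it outputs a probability p.
Call the output on the test set p1. Sort it in ascending order and cut it into 10 equal bins, e.g. 0-0.1, 0.1-0.2, ….
Now use the same model to score new samples. Call the result p2 and split it into 10 bins using the same intervals as p1.
The actual share is the fraction of users falling in each interval of p2, and the expected share is the fraction of users falling in each interval of p1.
The idea is that if the model is stable, the users in each interval of p1 and p2 should be similar: the shares will not move much, which means the predicted probabilities have not drifted far apart.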
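The steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the scores are simulated with `random`, the helper names `bin_shares` and `psi_equal_width` are hypothetical, and the small `eps` floor for empty bins is a convenience that the original post does not mention.

```python
import math
import random

def bin_shares(scores, n_bins=10):
    """Share of scores in each equal-width bin of [0, 1]."""
    counts = [0] * n_bins
    for p in scores:
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        counts[idx] += 1
    return [c / len(scores) for c in counts]

def psi_equal_width(p1, p2, n_bins=10, eps=1e-6):
    """PSI of p2 (actual) against p1 (expected) on the same intervals."""
    expected = bin_shares(p1, n_bins)
    actual = bin_shares(p2, n_bins)
    total = 0.0
    for a, e in zip(actual, expected):
        a, e = max(a, eps), max(e, eps)  # assumption: floor empty bins at eps
        total += (a - e) * math.log(a / e)
    return total

random.seed(0)
p1 = [random.random() for _ in range(5000)]              # simulated test-set scores
p2_same = [random.random() for _ in range(5000)]         # new data, same distribution
p2_shifted = [random.random() ** 2 for _ in range(5000)] # new data, skewed towards 0

psi_same = psi_equal_width(p1, p2_same)
psi_shifted = psi_equal_width(p1, p2_shifted)
print("same distribution:   ", psi_same)
print("shifted distribution:", psi_shifted)
```

The first value should come out close to zero and the second clearly larger, since squaring the scores pushes far more of them into the lowest interval.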
A PSI below 0.1 is generally taken to mean the model is very stable; 0.1-0.25 is moderate; above 0.25 means stability is poor and rebuilding the model is recommended.
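If you want to attach these labels in code, a small helper is enough. The function name `psi_stability` and the label strings are hypothetical, and the text above does not say which side the boundary values 0.1 and 0.25 belong to, so the choice below is an assumption.

```python
def psi_stability(psi_value):
    """Map a PSI value to the rule-of-thumb bands described above."""
    if psi_value < 0.1:
        return "stable"
    if psi_value <= 0.25:  # assumption: 0.1 and 0.25 both count as moderate
        return "moderate"
    return "unstable"

for value in (0.02, 0.18, 0.40):
    print(value, psi_stability(value))
```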
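PS: besides cutting the probability range into ten equal-width bins, you can also sort the probabilities and cut them into ten bins of equal count. The two methods may give somewhat different PSI values, but the difference is usually small.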
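Equal-count binning can be sketched as follows: take the cut points from the sorted p1 scores, then count p2 with the same cut points. The data is simulated, the helper names `quantile_edges`, `shares` and `psi_equal_count` are hypothetical, and the `eps` floor for empty bins is again an assumption.

```python
import bisect
import math
import random

def quantile_edges(p1, n_bins=10):
    """Interior cut points so that each bin holds about len(p1)/n_bins scores."""
    s = sorted(p1)
    return [s[len(s) * k // n_bins - 1] for k in range(1, n_bins)]

def shares(scores, edges):
    """Share of scores per bin; bin k covers (edges[k-1], edges[k]]."""
    counts = [0] * (len(edges) + 1)
    for p in scores:
        counts[bisect.bisect_left(edges, p)] += 1
    return [c / len(scores) for c in counts]

def psi_equal_count(p1, p2, n_bins=10, eps=1e-6):
    edges = quantile_edges(p1, n_bins)
    expected = shares(p1, edges)
    actual = shares(p2, edges)
    return sum(
        (max(a, eps) - max(e, eps)) * math.log(max(a, eps) / max(e, eps))
        for a, e in zip(actual, expected)
    )

random.seed(1)
p1 = [random.betavariate(2, 5) for _ in range(1000)]  # simulated test-set scores
p2 = [random.betavariate(2, 4) for _ in range(1000)]  # simulated new scores

edges = quantile_edges(p1)
print(shares(p1, edges))        # roughly 0.1 in every bin by construction
print(psi_equal_count(p1, p2))
```

With equal-count bins the expected share is about 10% everywhere, so no bin is nearly empty on the benchmark side even when the scores cluster in a narrow range.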
The above is reposted from: https://blog.csdn.net/Rango_lhl/article/details/81388051
Below is an example using code from a project I worked on recently:
Load libraries and data
import pandas as pd
import numpy as np
import math
import re

# sample data
df = pd.read_csv('dev.csv')
# holdout data (without target variable)
dfo = pd.read_csv('oot0.csv')

Define PSI function
def psi(bench, comp, group):
    """
    bench: sample[variable]
    comp: holdout[variable]
    group: how many groups within the variable
    suggestion: group = max(2, min(len(set(df[var_name])), 10))
    at least 2, at most 10, so a continuous variable gets 10 groups
    and a categorical variable with fewer than 10 categories
    gets one group per category
    """
    # get the number of rows of the sample and holdout
    ben_len = len(bench)
    comp_len = len(comp)
    # sort copies of the values so the caller's lists are not modified
    bench = sorted(bench)
    comp = sorted(comp)
    psi_cut = []
    # calculate sample_size / number_groups
    n = int(math.floor(ben_len / group))
    # from 1 to number_groups, inclusive, so the last group is not skipped
    for i in range(1, group + 1):
        # as bench has been sorted, the cut points can be read off by position;
        # group i covers (lowercut, uppercut], and the first and last groups
        # are open-ended so that no value in either dataset is dropped
        lowercut = None if i == 1 else bench[(i - 1) * n - 1]
        uppercut = None if i == group else bench[i * n - 1]

        def in_group(v):
            return ((lowercut is None or v > lowercut)
                    and (uppercut is None or v <= uppercut))

        # count the values in this interval in both datasets
        # (counting bench as well keeps the shares consistent when values tie)
        ben_cnt = len([v for v in bench if in_group(v)])
        comp_cnt = len([v for v in comp if in_group(v)])
        # calculate percentage of counts_in_interval / total_values
        ben_pct = (ben_cnt + 0.0) / ben_len
        comp_pct = (comp_cnt + 0.0) / comp_len
        # if both datasets have values in this interval, calculate its psi;
        # an empty interval is counted as 0, which understates the shift
        if ben_pct > 0.0 and comp_pct > 0.0:
            psi_cut.append((ben_pct - comp_pct) * math.log(ben_pct / comp_pct))
        else:
            psi_cut.append(0)
    # sum up the psi of all intervals and return the total
    return sum(psi_cut)

Run the psi function on every input variable
list_inputs = list()
# get all the input variables (column names starting with 'i')
for var_name in df.columns:
    if re.search('^i', var_name):
        list_inputs.append(var_name)
# iterate the function through them
for var_name in list_inputs:
    psi_value = psi(bench=list(df[var_name]),
                    comp=list(dfo[var_name]),
                    group=max(2, min(len(set(df[var_name])), 10)))
    print("psi for ", var_name, " = ", psi_value)