
PSI and CSI: Top 2 Model Monitoring Metrics

Published: 2023/12/15


Modeling Fundamentals

Once a model has been put into PROD (production), regular monitoring is required to make sure that it is still relevant and reliable. I have written a post on model validation vs. model monitoring and the importance of these two stages; you can read it as a prequel to this post.

Moving on to the subject of this post, we will learn about PSI and CSI, i.e. the Population Stability Index and the Characteristic Stability Index. They are among the most important monitoring metrics and are used in many domains, especially credit risk.

Both PSI and CSI focus on the shift in the POPULATION DISTRIBUTION.

These two monitoring metrics are based on the premise that a predictive model works best when the development sample and the validation/OOT (out-of-time) sample are not significantly different in terms of economic conditions, underlying assumptions, campaign style and focus, and so on.

For example, suppose we built a predictive model for the attrition rate of credit card customers under normal economic circumstances and then tested it against a sample from a recession-hit period. The model might not predict accurately, because the population distribution across income segments might have changed significantly. That shift might drive the actual attrition rate very high while the model fails to capture it, producing erroneous predictions. Knowing this, we can check the population distribution shift between the DEV time and the current time to get a fair idea of whether the model results are reliable. PSI and CSI are the monitoring metrics that accomplish this.

Population stability index (PSI)

PSI is a metric that measures how much the distribution of a variable has shifted between two samples over time. It is widely used for monitoring changes in the characteristics of a population and for diagnosing possible problems in model performance. A high PSI is often a good indication that the model has stopped predicting accurately because of significant changes in the population distribution.

The definition above is aptly explained in the research paper referenced at the end of this post.

Originally, the Population Stability Index (PSI) was developed to monitor changes in distribution between an OOT sample and the development-period sample in credit risk scorecard analytics. Its use has since become more flexible: it is now applied to distributional shifts in model-related attributes as well as in overall sample populations, including both dependent and independent variables. Applied to individual variables it is the CSI, which we cover in the next section.

PSI looks at the overall population shift, whereas CSI focuses on the individual model variables used.

A change in the population distribution can be due to:

Changes in the economic environment, such as a recession or Covid-19
Changes in the sources of data
Internal policy changes that impact the population distribution directly or indirectly
Issues in data integration that might result in erroneous data
Issues in programming/coding, such as errors in the model implementation or missing crucial steps of the scoring code

Since a distributional change need not involve a dependent variable, PSI can also be used to examine the similarity or dissimilarity between any samples, e.g. to compare education, income, and health status between two or more populations in socio-demographic studies.

STEPS TO CALCULATE PSI (Reference: Link)

1. Sort the scoring variable in descending order in the scoring sample.

2. Split the data into 10 or 20 groups (deciling).

3. Calculate the % of records in each group based on the scoring sample.

4. Calculate the % of records in each group based on the development sample.

5. Calculate the difference between Step 3 and Step 4.

6. Take the natural log of (Step 3 / Step 4).

7. Multiply Step 5 and Step 6 for each group. The PSI is the sum of these products over all groups: PSI = sum of (Step 3 - Step 4) x ln(Step 3 / Step 4).
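
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name `psi`, its parameters, and the small `eps` floor used to avoid taking the log of zero in empty groups are all assumptions of this sketch. The cut points are taken from the scoring sample, as in the steps listed; many implementations fix them on the development sample instead.

```python
import numpy as np

def psi(dev_scores, scoring_scores, n_groups=10, eps=1e-6):
    """Population Stability Index between two score samples (illustrative sketch)."""
    dev = np.asarray(dev_scores, dtype=float)
    cur = np.asarray(scoring_scores, dtype=float)

    # Steps 1-2: quantile cut points of the scoring sample define the groups.
    # Only interior cut points are kept, so the outer groups are open-ended.
    probs = np.linspace(0, 1, n_groups + 1)[1:-1]
    cuts = np.unique(np.quantile(cur, probs))  # unique() guards against tied cut points

    def group_pct(x):
        # Assign each record to a group and convert counts to proportions.
        idx = np.searchsorted(cuts, x, side="right")
        counts = np.bincount(idx, minlength=cuts.size + 1)
        # Assumption: floor empty groups at eps so the log below stays finite.
        return np.clip(counts / x.size, eps, None)

    cur_pct = group_pct(cur)  # Step 3: % of records per group, scoring sample
    dev_pct = group_pct(dev)  # Step 4: % of records per group, development sample

    # Steps 5-7: (difference) x ln(ratio), summed over all groups.
    return float(np.sum((cur_pct - dev_pct) * np.log(cur_pct / dev_pct)))
```

Each term of the sum is non-negative, because the difference and the log of the ratio always share the same sign, so the result is zero only when the two samples fall into the groups in identical proportions.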

EXCEL TABULATION OF PSI:

[Image by Author]

INTERPRETATION RULES (Reference: Link)

PSI < 0.1: No change. You can continue using the existing model.

0.1 <= PSI < 0.2: Slight change is required.

PSI >= 0.2: Significant change is required. Ideally, we should not use this model anymore. It should be recalibrated/redeveloped.

We can also use a conditional formatting range with Red-Amber-Green zones: Red is the alarming state where the PSI is 20% or more, Amber is 10–20%, where the model should be kept under observation, and Green is where the model is deemed fit for use, i.e. <10%.

Depending on the use case, these thresholds are adjusted according to business relevance, but the idea remains the same: keep a watch on population shifts.
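
The Red-Amber-Green rule can be written as a small helper. This is only a sketch: the function name `psi_zone` and its parameter names are hypothetical, and the default cutoffs are the 0.1 and 0.2 values quoted above, exposed as arguments because they are often tuned to the business context.

```python
def psi_zone(psi_value, amber_from=0.1, red_from=0.2):
    """Map a PSI value to a Red-Amber-Green zone (illustrative sketch)."""
    if psi_value < amber_from:
        return "green"  # no change: keep using the existing model
    if psi_value < red_from:
        return "amber"  # slight change: keep the model under observation
    return "red"        # significant change: recalibrate or redevelop
```

For instance, `psi_zone(0.15)` falls in the amber zone with the default cutoffs, while a stricter policy could pass `amber_from=0.05`.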

Characteristic stability index (CSI)

CSI answers the question of which variable is causing a shift in the population distribution. It compares the distribution of an independent variable in the scoring data set with its distribution in the development data set, and so detects shifts over time in the distributions of the input variables submitted for scoring.

It helps to determine which changing variable is mainly responsible for the shift in the model score.

When a model deteriorates in performance, checking distributional changes in the model variables can help identify possible causes. This step is generally taken after the PSI has been checked and found to be outside the green zone (<0.1 in general), to determine which variables account for most of the shift in the overall population distribution.

If at least one variable has changed significantly, or if several variables have moved to some extent, it might be time to recalibrate or rebuild the model.

CSI follows the same calculation steps as PSI. The only difference is that the deciling is done on the dev sample values of a particular variable: bin the variable and fix the bin limits as hard cutoffs. Then, for any validation/OOT sample, apply those same cutoffs to the data, calculate the frequency in each bin, and apply the same formula as for PSI.
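
A minimal Python sketch of this procedure is shown below. The names `csi` and `csi_report`, the dictionary-of-arrays input format, and the `eps` floor for empty bins are assumptions made for illustration; the sketch also assumes numeric variables with no missing values.

```python
import numpy as np

def csi(dev_values, scoring_values, n_bins=10, eps=1e-6):
    """Characteristic Stability Index for one variable (illustrative sketch)."""
    dev = np.asarray(dev_values, dtype=float)
    cur = np.asarray(scoring_values, dtype=float)

    # Bin limits come from the dev sample only and are then frozen as hard cutoffs.
    probs = np.linspace(0, 1, n_bins + 1)[1:-1]
    cuts = np.unique(np.quantile(dev, probs))

    def bin_pct(x):
        # Apply the frozen dev cutoffs to any sample and get bin proportions.
        idx = np.searchsorted(cuts, x, side="right")
        counts = np.bincount(idx, minlength=cuts.size + 1)
        # Assumption: floor empty bins at eps so the log below stays finite.
        return np.clip(counts / x.size, eps, None)

    dev_pct, cur_pct = bin_pct(dev), bin_pct(cur)
    # Same formula as PSI: (difference) x ln(ratio), summed over bins.
    return float(np.sum((cur_pct - dev_pct) * np.log(cur_pct / dev_pct)))

def csi_report(dev_vars, scoring_vars):
    """Rank variables by CSI, largest shift first.

    Both arguments are dicts mapping a variable name to an array of values.
    """
    scores = {name: csi(dev_vars[name], scoring_vars[name]) for name in dev_vars}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Sorting the report by CSI puts the variables that moved the most at the top, which is where one would start looking after the PSI leaves the green zone.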

EXCEL TABULATION OF CSI:

[Image by Author]

Thus, PSI helps us compare the overall population distribution with the DEV sample, and if it shows a significant shift, CSI helps us narrow the cause down to the few variables responsible for it.

https://www.lexjansen.com/wuss/2017/47_Final_Paper_PDF.pdf

Watch this space for more on Data Science, Machine learning, and Statistics!


Happy Learning, Happy Growing!


Translated from: https://towardsdatascience.com/psi-and-csi-top-2-model-monitoring-metrics-924a2540bed8
