Introduction to Survival Analysis
In our extremely competitive times, all businesses face the problem of customer churn/retention. To quickly give some context, churn happens when the customer stops using the services of a company (stops purchasing, cancels the subscription, etc.). Retention refers to keeping the clients of a business active (the definition of active highly depends on the business model).
Intuitively, companies want to increase retention by preventing churn. This way, their relationship with the customers is longer and thus potentially more profitable. What is more, in most cases the company’s cost of retaining a customer is much lower than that of acquiring a new customer, for example, via performance marketing. For businesses, the concept of retention is closely connected to customer lifetime value (CLV), which the businesses want to maximize. But that is a topic for another article.
With this article, I want to start a short series focusing on survival analysis, which is often an underestimated, yet very interesting branch of statistical learning. In this article, I provide a general introduction to survival analysis and its building blocks. First I explain the required concepts and then describe different approaches to analyzing time-to-event data. Let’s start!
Introduction to Survival Analysis
Survival analysis is a field of statistics that focuses on analyzing the expected time until a certain event happens. Originally, this branch of statistics developed around measuring the effects of medical treatment on patients’ survival in clinical trials. For example, imagine a group of cancer patients who are administered a certain new form of treatment. Survival analysis can be used for analyzing the results of that treatment in terms of the patients’ life expectancy.
However, survival analysis is not restricted to investigating deaths. It can just as well be used for determining the time until a machine fails or, what may at first sound a bit counterintuitive, the time until a user of a certain platform converts to a premium service. That is possible because survival analysis focuses on the time until an event happens, without requiring the event to be a negative one. The conditions that apply to the most popular methods of survival analysis are:
- the event of interest is clearly defined and well-specified, so there is no ambiguity about whether it happened or not,
- the event can occur only once for each subject. This is clear in the case of death, but applying the analysis to churn is more complicated, as a churned user might be reactivated and churn again.
We have already established that survival analysis is used for modeling time-to-event data, in other words, lifetimes (hence also the name of the Python library which is the go-to tool for this kind of analysis). Generally speaking, we can use survival analysis to try to answer questions like:
- what percentage of the population will survive past a certain time?
- of those who survive, at what rate will they die or fail?
- how do particular characteristics (for example, age, gender, or geographical location) affect the probability of survival?
Having briefly described the general idea of survival analysis, it is time to introduce a few concepts that are crucial for a thorough understanding of the subject.
Censoring
Censoring can be described as the missing data problem in the domain of survival analysis. Observations are censored when the information about their survival time is incomplete. There are different kinds of censoring, such as:
- right-censoring,
- interval-censoring,
- left-censoring.
To keep this section short, we discuss only the one that is encountered most frequently: right-censoring. Let’s come back to the cancer treatment example. Imagine that the study of the effects of the new medicine lasts 5 years (this is an arbitrary number, not actually based on anything). It can happen that after 5 years, some of the patients are still alive and thus have not experienced the death event. At the same time, the authors of the study lost contact with some patients: they might have relocated to another country, or they might have actually died without any confirmation ever being received. Those cases are affected by right-censoring, that is, their true survival time is equal to or greater than the observed survival time (the 5 years of the study for patients still alive at its end, or the time until the last contact for patients who were lost). The following image illustrates right-censoring.
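To make the idea concrete, here is a minimal sketch of how right-censored observations are commonly encoded: each subject gets an observed duration plus a flag saying whether the event was actually seen. The `Patient` class, the `observe` helper, and all the numbers are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

STUDY_LENGTH_YEARS = 5.0  # hypothetical study length, matching the example above


@dataclass
class Patient:
    # True time of death in years since enrolment (None if the patient never dies
    # within any horizon we care about). In a real study this is NOT known.
    death_time: Optional[float]
    # Time at which the study lost contact with the patient, if that happened.
    lost_contact_time: Optional[float] = None


def observe(patient: Patient) -> Tuple[float, bool]:
    """Return (observed_duration, event_observed) for one patient."""
    # Observation stops at the end of the study or at loss of contact,
    # whichever comes first.
    cutoff = STUDY_LENGTH_YEARS
    if patient.lost_contact_time is not None:
        cutoff = min(cutoff, patient.lost_contact_time)
    if patient.death_time is not None and patient.death_time <= cutoff:
        return patient.death_time, True   # death was observed
    return cutoff, False                  # right-censored at the cutoff


patients = [
    Patient(death_time=2.0),                         # died during the study
    Patient(death_time=None),                        # alive at the end of the study
    Patient(death_time=7.5),                         # died after the study ended
    Patient(death_time=4.0, lost_contact_time=3.0),  # died after contact was lost
]
observations = [observe(p) for p in patients]
print(observations)
```

For the three censored patients, all we know is that the true survival time is at least the recorded duration.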
The existence of censoring is also the reason why we cannot use simple OLS for problems in survival analysis. That is because OLS effectively draws a regression line that minimizes the sum of squared errors. But for censored data, the error terms are unknown and therefore we cannot minimize the MSE. Applying some simple solutions, such as using the censoring date as the date of the death event or dropping the censored observations, can severely bias the results.
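The bias from those two shortcuts can be seen in a small simulation. The sketch below assumes exponentially distributed lifetimes with a true mean of 10 years and a 5-year study; both numbers, and the choice of distribution, are assumptions made only for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_MEAN_YEARS = 10.0    # assumed true mean lifetime (illustrative)
STUDY_LENGTH_YEARS = 5.0  # observation stops here

# True lifetimes, which a real study would never fully observe.
true_lifetimes = [random.expovariate(1.0 / TRUE_MEAN_YEARS) for _ in range(10_000)]

# What the study records: (observed duration, was the event observed?).
observed = [(min(t, STUDY_LENGTH_YEARS), t <= STUDY_LENGTH_YEARS) for t in true_lifetimes]


def mean(values):
    return sum(values) / len(values)


mean_true = mean(true_lifetimes)
# Shortcut 1: pretend every censored subject died at the censoring time.
mean_censoring_as_event = mean([duration for duration, _ in observed])
# Shortcut 2: throw the censored subjects away.
mean_dropping_censored = mean([duration for duration, seen in observed if seen])

print(f"true mean lifetime:          {mean_true:.2f}")
print(f"censoring treated as event:  {mean_censoring_as_event:.2f}")
print(f"censored observations dropped: {mean_dropping_censored:.2f}")
```

Both shortcuts can never report a mean above 5 years here, because no recorded duration exceeds the study length, so they underestimate the true mean lifetime.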
For information regarding different kinds of censoring, please go here.
The Survival Function
The survival function is a function of time (t) and can be represented as

$$S(t) = \Pr(T > t)$$
where Pr() stands for the probability and T for the time of the event of interest for a random observation from the sample. We can interpret the survival function as the probability of the event of interest (for example, the death event) not occurring by the time t.
The survival function takes values in the range between 0 and 1 (inclusive) and is a non-increasing function of t.
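As a quick illustration of these properties, the sketch below computes an empirical survival function as the fraction of subjects whose lifetime exceeds t. It assumes every lifetime is fully observed (no censoring), which is rarely true in practice; the `empirical_survival` helper and the data are made up for this example.

```python
def empirical_survival(lifetimes, t):
    """Fraction of subjects whose lifetime is strictly greater than t."""
    return sum(1 for lifetime in lifetimes if lifetime > t) / len(lifetimes)


# Hypothetical, fully observed lifetimes (in years).
lifetimes = [1.0, 2.0, 2.0, 3.5, 6.0]

time_points = [0.0, 1.0, 2.0, 3.5, 6.0]
curve = [empirical_survival(lifetimes, t) for t in time_points]
print(list(zip(time_points, curve)))
```

The resulting values start at 1, never increase as t grows, and reach 0 once the last subject has experienced the event.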
The Hazard Function
We can think of the hazard function (or hazard rate) as the probability of the subject experiencing the event of interest within a small (or to be more precise, infinitesimal) interval of time, assuming that the subject has survived up until the beginning of the said interval. The hazard function can be represented as:

$$h(t) = \lim_{dt \to 0} \frac{\Pr(t \le T < t + dt \mid T \ge t)}{dt}$$
where the expression in the numerator is the conditional probability of the event of interest occurring in the given time interval, provided it has not happened before. dt in the denominator is the width of the considered interval of time. When we divide the former by the latter, we effectively obtain the rate of the event’s occurrence per unit of time. Lastly, by taking the limit as the width of the interval goes to zero, we end up with the instantaneous rate of occurrence, so the risk of an event happening at a particular point in time.
You might wonder why the hazard rate is defined using this small interval of time. The reason for that lies in the fact that the probability of a continuous random variable being equal to a particular value is zero. That is why we need to consider the probability of the event happening in a very small interval of time.
Technical note: to be theoretically correct, it is important to mention that the hazard function is not actually a probability, and the name hazard rate is the more fitting one. That is because even though the expression in the numerator is a probability, dividing it by dt can result in a hazard rate greater than 1 (the hazard rate is still bounded below by 0).
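A numerical sketch may help here. It assumes an exponential lifetime with a constant rate of 2 events per year (an arbitrary choice) and approximates the hazard with a shrinking but finite dt. The conditional probability in the numerator stays between 0 and 1, while the ratio approaches 2.

```python
import math

RATE_PER_YEAR = 2.0  # assumed constant hazard, deliberately greater than 1


def survival(t):
    """Survival function of an exponential lifetime with the rate above."""
    return math.exp(-RATE_PER_YEAR * t)


def conditional_event_probability(t, dt):
    """Pr(t <= T < t + dt | T >= t), computed from the survival function."""
    return (survival(t) - survival(t + dt)) / survival(t)


def approximate_hazard(t, dt):
    """Finite-dt approximation of the hazard rate at time t."""
    return conditional_event_probability(t, dt) / dt


t = 0.5
for dt in (0.5, 0.1, 0.001):
    print(dt, conditional_event_probability(t, dt), approximate_hazard(t, dt))

hazard_estimate = approximate_hazard(t, 1e-6)
```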
Lastly, the survival and hazard functions are related to each other as specified by the following formula:

$$S(t) = \exp\left(-\int_0^t h(u)\,du\right)$$
To give the equation a bit of context, the integral in the brackets is called the cumulative hazard and can be interpreted as the sum of the risks the subject faces going from time-point 0 to t.
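The relationship can be checked numerically. The sketch below assumes a Weibull lifetime with shape 2 and scale 2 (arbitrary values), integrates its hazard with a simple midpoint rule to get the cumulative hazard, and compares exp(-cumulative hazard) with the closed-form survival function.

```python
import math

SHAPE = 2.0  # assumed Weibull shape parameter (illustrative)
SCALE = 2.0  # assumed Weibull scale parameter (illustrative)


def hazard(t):
    """Weibull hazard rate."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1)


def survival_closed_form(t):
    """Weibull survival function."""
    return math.exp(-((t / SCALE) ** SHAPE))


def cumulative_hazard(t, steps=1000):
    """Midpoint-rule approximation of the integral of the hazard from 0 to t."""
    width = t / steps
    return sum(hazard((i + 0.5) * width) for i in range(steps)) * width


t = 3.0
survival_from_hazard = math.exp(-cumulative_hazard(t))
print(survival_from_hazard, survival_closed_form(t))
```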
Different approaches to Survival Analysis
As survival analysis is an entire domain of different statistical methods for working with time-to-event series, there are naturally many different approaches we could follow. On a high level, we could split them into three main groups:
Non-parametric — with these approaches, we make no assumptions about the underlying distribution of data. Perhaps the most popular example from this group is the Kaplan-Meier curve, which — in short — is a method of estimating and plotting the survival probability as a function of time.
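To give a feel for the non-parametric approach, here is a minimal pure-Python sketch of the Kaplan-Meier (product-limit) estimate: at each distinct event time, the running survival probability is multiplied by one minus the share of at-risk subjects who experienced the event. The `kaplan_meier` function and the data are hypothetical; in practice one would reach for a dedicated library rather than this hand-rolled version.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(event_time, survival_estimate)] for each distinct event time.

    durations: observed time for each subject
    events:    True if the event was observed, False if right-censored
    """
    deaths_at = Counter(t for t, seen in zip(durations, events) if seen)
    curve = []
    survival = 1.0
    for t in sorted(deaths_at):
        # Subjects still under observation just before time t,
        # including those censored exactly at t.
        at_risk = sum(1 for d in durations if d >= t)
        survival *= 1.0 - deaths_at[t] / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical observations: three events and three censored subjects.
durations = [2.0, 3.0, 3.0, 5.0, 5.0, 5.0]
events = [True, True, False, True, False, False]

km_curve = kaplan_meier(durations, events)
for time, estimate in km_curve:
    print(f"S({time}) = {estimate:.4f}")
```

Censored subjects contribute to the at-risk counts up to their censoring time, which is how the estimator uses their partial information instead of discarding it.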
Semi-parametric — as you could have guessed, this group is in between the two extremes and makes very few assumptions. Most importantly, there are no assumptions about the shape of the hazard function/rate. The most popular method from this group is the Cox regression, which we can use to identify the relationship between the hazard function and a set of explanatory variables (predictors).
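The core idea of the Cox model can be sketched in a few lines: the hazard is an unspecified baseline hazard multiplied by the exponential of a linear combination of the predictors. The coefficients below are made up, and the sketch only evaluates the model rather than fitting it. It shows that the hazard ratio between two subjects does not depend on the baseline hazard, which is why its shape can be left unspecified.

```python
import math


def cox_hazard(baseline_hazard, coefficients, covariates):
    """Hazard under a proportional hazards model: h0(t) * exp(beta . x)."""
    linear_predictor = sum(b * x for b, x in zip(coefficients, covariates))
    return baseline_hazard * math.exp(linear_predictor)


# Hypothetical coefficients for (age in years, uses_discount flag).
coefficients = [0.03, 0.7]

subject_a = [40.0, 1.0]  # 40 years old, uses a discount
subject_b = [40.0, 0.0]  # 40 years old, no discount

# Two arbitrary baseline hazard values, standing in for two different time points.
ratios = []
for baseline_hazard in (0.01, 0.25):
    ratio = (cox_hazard(baseline_hazard, coefficients, subject_a)
             / cox_hazard(baseline_hazard, coefficients, subject_b))
    ratios.append(ratio)

print(ratios, math.exp(coefficients[1]))
```

Whatever the baseline hazard is, the ratio equals the exponential of the coefficient on the covariate in which the two subjects differ.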
Parametric — you might have encountered this approach while doing your studies. The idea is to use some statistical distributions (some of the popular ones include exponential, log, Weibull, or Lomax) to estimate how long a subject will survive. Often, we use maximum likelihood estimation (MLE) to fit the distribution (or actually the distribution’s parameters) to the data for the best performance.
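For the simplest parametric case, an exponential lifetime with right-censoring, the maximum likelihood estimate has a closed form: the number of observed events divided by the total observed time. The sketch below applies it to a small made-up dataset; the function name and the data are assumptions for illustration only.

```python
def fit_exponential_rate(durations, events):
    """MLE of the exponential rate under right-censoring.

    Each observed event contributes the density to the likelihood, each censored
    subject contributes the survival function; maximizing the product gives
    (number of events) / (total observed time).
    """
    number_of_events = sum(1 for seen in events if seen)
    total_time_observed = sum(durations)
    return number_of_events / total_time_observed


# Hypothetical observations: three events and three censored subjects.
durations = [2.0, 3.0, 3.0, 5.0, 5.0, 5.0]
events = [True, True, False, True, False, False]

rate_estimate = fit_exponential_rate(durations, events)
mean_lifetime_estimate = 1.0 / rate_estimate
print(f"estimated rate: {rate_estimate:.4f} per year")
print(f"estimated mean lifetime: {mean_lifetime_estimate:.2f} years")
```

Note that the estimated mean lifetime is longer than any observed duration, because the censored subjects are known to have survived beyond their recorded times.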
The methods mentioned in this short list are by no means exhaustive and there are many more interesting approaches to analyzing time-to-event data using machine- or deep-learning-based techniques. I will try to cover the most interesting ones in the following posts, so stay tuned :)
Conclusions
In this article, I tried to provide a brief yet thorough introduction to the domain of survival analysis. I believe that this area is often overlooked when talking about different data science solutions. However, by using some simple (or not so simple at all!) solutions, we can provide valuable insights for the company or stakeholders and generate real added value.
This article is only the beginning of a short series, and I will keep on adding the following parts below. In case you have questions or suggestions, please let me know in the comments or reach out on Twitter.
Translated from: https://towardsdatascience.com/introduction-to-survival-analysis-6f7e19c31d96