

Causal inference: a survey framework

Published: 2023/12/14

A survey on causal inference

A summary and reorganization of the paper "A Survey on Causal Inference".

Interpreting causal inference theory

The three assumptions of the Rubin causal model

Fundamentals

Theoretical framework

Terminology

  • individual treatment effect: ITE $= Y_{1i} - Y_{0i}$
  • average treatment effect: ATE $= E(Y_{1i} - Y_{0i})$
  • conditional average treatment effect: CATE $= E(Y_{1i} - Y_{0i} \mid X)$
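On synthetic data where both potential outcomes are (artificially) visible, the three quantities above can be computed directly. A minimal numpy sketch; the data-generating coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.integers(0, 2, size=n)        # a binary covariate
y0 = rng.normal(0.0, 1.0, size=n)     # potential outcome without treatment
y1 = y0 + 1.0 + 0.5 * x               # treatment adds 1.0, plus 0.5 more when x == 1

ite = y1 - y0                         # individual treatment effect (here 1 + 0.5x exactly)
ate = ite.mean()                      # average treatment effect
cate_x1 = ite[x == 1].mean()          # conditional effect given X = 1
print(ate, cate_x1)
```

In real observational data only one of `y0`/`y1` is ever observed per unit, which is exactly the counterfactual problem.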

Two challenges

  • counterfactuals: the counterfactual outcomes can never be observed
  • confounding bias: treatment is not randomly assigned

1 Rubin Causal Model (RCM)

The potential outcome model, also known as the Rubin Causal Model (RCM), aims to estimate the potential outcomes of each unit (or their population average) and, from them, the treatment effect (e.g. ITE/ATE).

Accurately estimating the potential outcomes is therefore the key to this framework. Because of confounders, the observed data cannot be used directly to approximate the potential outcomes; further processing is needed.

Core idea: estimate the potential outcomes accurately by constructing a comparable control group

  • matching: use e.g. propensity scores to find the best-matched control units
  • weighting/pairing: re-weight the samples
  • subclassification/stratification: split into strata and estimate the CATE within each

2 Pearl Causal Graph (SCM)

The structural causal model derives causal relations between variables by computing conditional distributions over a causal graph. The directed graph tells us which conditional distributions to use to remove estimation bias; the core is again estimating the target distribution while eliminating the bias introduced by other variables.

  • chain (A → B → C): common on front-door paths; any effect of A on C must pass through B
  • fork (A ← B → C): the middle node B is a common cause, i.e. a confounder, of A and C. A confounder makes A and C statistically associated even when they have no direct relation. Classic example: "shoe size ← child's age → reading ability". Children wearing larger shoes tend to be older and therefore tend to read better, but once age is fixed, A and C become conditionally independent.
  • collider (A → B ← C): A and B are associated, B and C are associated, but A and C are not; conditioning on B makes A and C associated.
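The fork example above (shoe size ← age → reading ability) can be checked numerically. A small simulation with invented coefficients: marginally the two effects are strongly correlated, but within a narrow age band the association vanishes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(6, 12, size=n)                  # the common cause B
shoe = 0.8 * age + rng.normal(0, 0.3, size=n)     # A <- B
reading = 1.5 * age + rng.normal(0, 0.5, size=n)  # C <- B

# Marginally, shoe size and reading ability are strongly correlated...
r_all = np.corrcoef(shoe, reading)[0, 1]

# ...but conditioning on (a narrow band of) age removes the association.
band = (age > 8.9) & (age < 9.1)
r_band = np.corrcoef(shoe[band], reading[band])[0, 1]
print(r_all, r_band)
```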

三個假設

1. Unconfoundedness

Also called the "conditional independence assumption" (CIA); it removes the X → T path.

Given the background variable X, treatment assignment W is independent of the potential outcomes Y:

$(Y_1, Y_0) \perp W \mid X$

Under this assumption, treatment among units with the same X is effectively randomly assigned.

2. Positivity

For any value of X, treatment assignment is not deterministic:

$P(W = w \mid X = x) > 0$

Every treatment must have samples actually receiving it; the more treatments and confounders there are, the more samples are needed.

3. 一致性(Consistency)

Also called the "Stable Unit Treatment Value Assumption" (SUTVA).

The potential outcomes for any unit do not vary with the treatment assigned to other units, and, for each unit, there are no different forms or versions of each treatment level, which lead to different potential outcomes.

That is, a unit's potential outcomes do not change with the treatments assigned to other units, and for each unit there are no different forms or versions of each treatment level that would lead to different potential outcomes.

Confounders

Confounders are the variables that affect both the treatment assignment and the outcome.

Confounders mostly cause spurious effects and selection bias.

  • For the spurious effect, take a weighted sum over the distribution of X (the adjustment formula):
    $\text{ATE} = \sum_x p(x)\, \mathbb{E}\left[Y^F \mid X = x, W = 1\right] - \sum_x p(x)\, \mathbb{E}\left[Y^F \mid X = x, W = 0\right]$

  • For selection bias, construct a pseudo group corresponding to each group, e.g. sample re-weighting, matching, tree-based methods, confounder balancing, balanced representation learning methods, multi-task-based methods.
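The adjustment formula can be sketched on a synthetic example with one binary confounder; the per-stratum means play the role of $\mathbb{E}[Y^F \mid X = x, W]$. All data-generating numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.integers(0, 2, size=n)                    # binary confounder
w = rng.binomial(1, np.where(x == 1, 0.8, 0.2))   # X affects treatment assignment
y = 2.0 * x + 1.0 * w + rng.normal(0, 1, size=n)  # true effect of W is 1.0

# Naive difference in means is biased upward by the confounder.
naive = y[w == 1].mean() - y[w == 0].mean()

# Adjustment formula: sum_x p(x) * (E[Y|X=x,W=1] - E[Y|X=x,W=0]).
ate = sum(
    (x == v).mean()
    * (y[(x == v) & (w == 1)].mean() - y[(x == v) & (w == 0)].mean())
    for v in (0, 1)
)
print(naive, ate)
```

The adjusted estimate recovers the true effect of 1.0, while the naive contrast is inflated by the spurious effect of X.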

Modeling methods

1. re-weighting methods*

By assigning appropriate weight to each unit in the observational data, a pseudo-population can be created on which the distributions of the treated group and control group are similar.

By assigning a weight to each observed unit, the distributions of the treated and control groups are adjusted to be close to each other. The key is how to choose the balancing score; the propensity score is a special case:

$e(x) = \Pr(W = 1 \mid X = x)$

The propensity score can be used to balance the covariates in the treatment and control groups and therefore reduce the bias through matching, stratification (subclassification), regression adjustment, or some combination of all three.


1. Propensity Score Based Sample Re-weighting

IPW: each sample receives the weight $r = \frac{W}{e(x)} + \frac{1 - W}{1 - e(x)}$, giving the estimator

$\mathrm{ATE}_{IPW} = \frac{1}{n}\sum_{i=1}^n \frac{W_i Y_i^F}{\hat{e}(x_i)} - \frac{1}{n}\sum_{i=1}^n \frac{(1 - W_i)\, Y_i^F}{1 - \hat{e}(x_i)}$

After normalization,

$\mathrm{ATE}_{IPW} = \sum_{i=1}^n \frac{W_i Y_i^F}{\hat{e}(x_i)} \Big/ \sum_{i=1}^n \frac{W_i}{\hat{e}(x_i)} - \sum_{i=1}^n \frac{(1 - W_i)\, Y_i^F}{1 - \hat{e}(x_i)} \Big/ \sum_{i=1}^n \frac{1 - W_i}{1 - \hat{e}(x_i)}$

Drawback: the estimator depends heavily on the accuracy of the estimate of $e(x)$.
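Both IPW forms can be sketched with numpy on synthetic data, estimating $e(x)$ empirically within each stratum of a binary X (the data-generating process is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.integers(0, 2, size=n)                    # binary confounder
w = rng.binomial(1, np.where(x == 1, 0.8, 0.2))   # confounded assignment
y = 2.0 * x + 1.0 * w + rng.normal(0, 1, size=n)  # true effect of W is 1.0

# Estimate e(x) = P(W=1 | X=x) as the empirical treated fraction per stratum.
e_hat = np.array([w[x == v].mean() for v in (0, 1)])[x]

# Horvitz-Thompson form of the IPW estimator.
ate_ipw = np.mean(w * y / e_hat) - np.mean((1 - w) * y / (1 - e_hat))

# Normalized (Hajek) form: weights re-scaled to sum to one per group.
ate_hajek = (np.sum(w * y / e_hat) / np.sum(w / e_hat)
             - np.sum((1 - w) * y / (1 - e_hat)) / np.sum((1 - w) / (1 - e_hat)))
print(ate_ipw, ate_hajek)
```

Both recover the true effect of 1.0 here; the normalized form is generally more stable when some estimated propensities are close to 0 or 1.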

DR (doubly robust): addresses inaccurate propensity-score estimates.

$$\begin{aligned} \mathrm{ATE}_{DR} &= \frac{1}{n}\sum_{i=1}^n \left\{\left[\frac{W_i Y_i^F}{\hat{e}(x_i)} - \frac{W_i - \hat{e}(x_i)}{\hat{e}(x_i)}\,\hat{m}(1, x_i)\right] - \left[\frac{(1 - W_i)\, Y_i^F}{1 - \hat{e}(x_i)} + \frac{W_i - \hat{e}(x_i)}{1 - \hat{e}(x_i)}\,\hat{m}(0, x_i)\right]\right\} \\ &= \frac{1}{n}\sum_{i=1}^n \left\{\hat{m}(1, x_i) + \frac{W_i \left(Y_i^F - \hat{m}(1, x_i)\right)}{\hat{e}(x_i)} - \hat{m}(0, x_i) - \frac{(1 - W_i)\left(Y_i^F - \hat{m}(0, x_i)\right)}{1 - \hat{e}(x_i)}\right\} \end{aligned}$$

where $\hat{m}(1, x_i)$ and $\hat{m}(0, x_i)$ are the outcome regression models for the treated and control groups, respectively.

The estimator is robust even when one of the propensity score or outcome regression is incorrect (but not both).
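The robustness can be illustrated by deliberately mis-specifying the propensity model while keeping the outcome model correct. A hedged numpy sketch on synthetic data (all numbers invented; per-stratum means stand in for the outcome regressions $\hat{m}$):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.integers(0, 2, size=n)
w = rng.binomial(1, np.where(x == 1, 0.8, 0.2))   # true e(x) is 0.8 or 0.2
y = 2.0 * x + 1.0 * w + rng.normal(0, 1, size=n)  # true effect of W is 1.0

# Deliberately wrong propensity model: a constant 0.5 ...
e_hat = np.full(n, 0.5)
# ... but a correct outcome model: per-stratum means of Y under each arm.
m1 = np.array([y[(x == v) & (w == 1)].mean() for v in (0, 1)])[x]
m0 = np.array([y[(x == v) & (w == 0)].mean() for v in (0, 1)])[x]

# Doubly robust estimator (second form of the equation above).
ate_dr = np.mean(m1 + w * (y - m1) / e_hat
                 - m0 - (1 - w) * (y - m0) / (1 - e_hat))
print(ate_dr)
```

Despite the broken propensity model, the estimate stays near the true 1.0 because the outcome model is correct; the symmetric case (correct $\hat{e}$, wrong $\hat{m}$) also works, but not both wrong at once.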


2. Confounder Balancing

D²VD: Data-Driven Variable Decomposition

Under the separation assumption, the variables are decomposed into confounders, adjustment variables, and irrelevant variables.

$\mathrm{ATE}_{\mathrm{D^2VD}} = \mathbb{E}\left[\left(Y^F - \phi(z)\right) \frac{W - p(x)}{p(x)\,(1 - p(x))}\right]$

where z denotes the adjustment variables.

Suppose $\alpha$ and $\beta$ separate out the adjustment variables and the confounders respectively, i.e. $Y^*_{\mathrm{D^2VD}} = \left(Y^F - X\alpha\right) \odot R(\beta)$, and $\gamma$ corresponds to the ATE over all variables. The problem can then be formulated as

$$\begin{aligned} \operatorname{minimize}\quad & \left\|\left(Y^F - X\alpha\right) \odot R(\beta) - X\gamma\right\|_2^2 \\ \text{s.t.}\quad & \sum_{i=1}^N \log\left(1 + \exp\left((1 - 2W_i) \cdot X_i\beta\right)\right) < \tau \\ & \|\alpha\|_1 \le \lambda,\quad \|\beta\|_1 \le \delta,\quad \|\gamma\|_1 \le \eta,\quad \|\alpha \odot \beta\|_2^2 = 0 \end{aligned}$$

The first constraint is a regularizer (the logistic loss of the propensity model); the last constraint guarantees the separation of adjustment variables and confounders.


2. stratification methods

$\mathrm{ATE}_{\text{strat}} = \hat{\tau}^{\text{strat}} = \sum_{j=1}^J q(j)\left[\bar{Y}_t(j) - \bar{Y}_c(j)\right]$

where the samples are divided into $J$ blocks and $q(j)$ is the proportion of units in the $j$-th block.

The key is how to form the blocks. A typical method is equal-frequency splitting, grouping similar samples by a score such as the propensity score. However, in the blocks at either extreme the treated and control groups overlap little, which leads to high variance.

However, this approach suffers from high variance due to the insufficient overlap between treated and control groups in the blocks whose propensity score is very high or low.
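Equal-frequency propensity-score stratification can be sketched on synthetic data (the data-generating process and the choice J = 5 are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(0, 1, size=n)                      # continuous confounder
e = 1 / (1 + np.exp(-x))                          # propensity score (known here)
w = rng.binomial(1, e)
y = 2.0 * x + 1.0 * w + rng.normal(0, 1, size=n)  # true effect of W is 1.0

naive = y[w == 1].mean() - y[w == 0].mean()       # badly confounded

# Equal-frequency stratification on the propensity score into J blocks.
J = 5
edges = np.quantile(e, np.linspace(0, 1, J + 1))
block = np.clip(np.searchsorted(edges, e, side="right") - 1, 0, J - 1)

# ATE_strat = sum_j q(j) * (mean treated outcome - mean control outcome in block j)
ate_strat = sum(
    (block == j).mean()
    * (y[(block == j) & (w == 1)].mean() - y[(block == j) & (w == 0)].mean())
    for j in range(J)
)
print(naive, ate_strat)
```

The stratified estimate lands far closer to the true 1.0 than the naive contrast, but the extreme blocks retain some residual confounding and poor overlap, echoing the caveat above; more blocks trade that bias for higher variance.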

3. matching methods*

4. tree-based methods*

This approach is different from conventional CART in two aspects. First, it focuses on estimating conditional average treatment effects instead of directly predicting outcomes as in the conventional CART. Second, different samples are used for constructing the partition and estimating the effects of each subpopulation, which is referred to as an honest estimation. However, in the conventional CART, the same samples are used for these two tasks.

5. representation based methods


1. Domain Adaptation Based on Representation Learning

Unlike randomized controlled trials, the mechanism of treatment assignment is not explicit in observational data. The counterfactual distribution will generally be different from the factual distribution.

The key is to shrink the gap between the counterfactual and factual distributions, i.e. between the source domain and the target domain.


6. multi-task methods

7. meta-learning methods*


1. S-learner

The S-learner treats the treatment as an ordinary feature and trains a single model on all of the data.

  • step 1: $\mu(T, X) = E[Y \mid T, X]$
  • step 2: $\hat{\tau} = \frac{1}{n}\sum_i \left(\hat{\mu}(1, X_i) - \hat{\mu}(0, X_i)\right)$

This method does not model the uplift directly, and when X is high-dimensional the treatment's contribution can be drowned out.
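The two steps can be sketched with ordinary least squares standing in for $\mu$ (the linear model and the synthetic data are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(0, 1, size=n)
t = rng.binomial(1, 0.5, size=n)                  # randomized for simplicity
y = 2.0 * x + 1.0 * t + rng.normal(0, 1, size=n)  # true effect is 1.0

# Step 1: one model mu(T, X) with the treatment as just another feature.
design = np.column_stack([np.ones(n), x, t])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Step 2: tau_hat = mean of mu(1, X_i) - mu(0, X_i); for a linear model this
# collapses to the coefficient on T.
mu1 = np.column_stack([np.ones(n), x, np.ones(n)]) @ coef
mu0 = np.column_stack([np.ones(n), x, np.zeros(n)]) @ coef
tau_hat = (mu1 - mu0).mean()
print(tau_hat)
```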


2. T-learner

The T-learner fits separate models for the control and treatment groups.

  • step 1: $\mu_1(X) = E[Y \mid T = 1, X]$, $\quad \mu_0(X) = E[Y \mid T = 0, X]$
  • step 2: $\hat{\tau} = \frac{1}{n}\sum_i \left(\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right)$

Each estimator uses only part of the data, so when samples are scarce or the treated and control groups differ greatly in size, the variance is high (the data are used inefficiently). The two models' biases can also point in different directions, so errors accumulate, and in practice their score distributions need some calibration; large differences between the two datasets (sample size, sampling bias, etc.) further hurt accuracy.
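A minimal sketch of the two steps, again with OLS standing in for the outcome models (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(0, 1, size=n)
t = rng.binomial(1, 0.5, size=n)
y = 2.0 * x + 1.0 * t + rng.normal(0, 1, size=n)  # true effect is 1.0

def ols(xs, ys):
    """Ordinary least squares of y on [1, x]."""
    A = np.column_stack([np.ones(len(xs)), xs])
    return np.linalg.lstsq(A, ys, rcond=None)[0]

# Step 1: separate outcome models for treated and control groups.
c1 = ols(x[t == 1], y[t == 1])                    # mu_1(X)
c0 = ols(x[t == 0], y[t == 0])                    # mu_0(X)

# Step 2: average the difference of predictions over all units.
A_all = np.column_stack([np.ones(n), x])
tau_hat = (A_all @ c1 - A_all @ c0).mean()
print(tau_hat)
```

Note that each of `c1` and `c0` sees only about half the data, which is the source of the variance problem described above.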


3. X-learner

Building on the T-learner, the X-learner uses all of the data for prediction, mainly to handle large imbalance between the treatment groups.

  • step 1: fit separate models $\hat{\mu}_1$ and $\hat{\mu}_0$ on the treated and control groups, then impute individual effects:
    $D_0 = \hat{\mu}_1(X_0) - Y_0, \qquad D_1 = Y_1 - \hat{\mu}_0(X_1)$
  • step 2: fit two effect models $\hat{\tau}_1$ and $\hat{\tau}_0$ on the imputed effects $D_1$ and $D_0$:
    $\hat{\tau}_0 = f(X_0, D_0), \qquad \hat{\tau}_1 = f(X_1, D_1)$
  • step 3: introduce a propensity-score model $e(x)$ and combine the two estimates by weighting:
    $e(x) = P(W = 1 \mid X = x), \qquad \hat{\tau}(x) = e(x)\,\hat{\tau}_0(x) + (1 - e(x))\,\hat{\tau}_1(x)$
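The three steps can be sketched with OLS models standing in for $\hat{\mu}$, $\hat{\tau}$, and a known constant propensity; the deliberately imbalanced groups mimic the situation the X-learner targets (all modeling choices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6_000
x = rng.normal(0, 1, size=n)
t = rng.binomial(1, 0.2, size=n)                  # imbalanced: ~20% treated
y = 2.0 * x + 1.0 * t + rng.normal(0, 1, size=n)  # true effect is 1.0

def ols(xs, ys):
    A = np.column_stack([np.ones(len(xs)), xs])
    return np.linalg.lstsq(A, ys, rcond=None)[0]

def predict(c, xs):
    return np.column_stack([np.ones(len(xs)), xs]) @ c

# Step 1: outcome models per group.
c1, c0 = ols(x[t == 1], y[t == 1]), ols(x[t == 0], y[t == 0])

# Step 2: imputed individual effects, then effect models per group.
d1 = y[t == 1] - predict(c0, x[t == 1])           # D_1 = Y_1 - mu_0(X_1)
d0 = predict(c1, x[t == 0]) - y[t == 0]           # D_0 = mu_1(X_0) - Y_0
g1, g0 = ols(x[t == 1], d1), ols(x[t == 0], d0)

# Step 3: blend with the propensity score (known to be 0.2 by construction).
e = 0.2
tau_hat = (e * predict(g0, x) + (1 - e) * predict(g1, x)).mean()
print(tau_hat)
```

The weighting gives more influence to $\hat{\tau}_1$, the model fitted on the larger imputed set, which is exactly the point of the design.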

4. R-learner

The R-learner estimates the CATE via Robinson's decomposition: fit nuisance models $\hat{m}(x) \approx E[Y \mid X = x]$ and $\hat{e}(x)$, then regress the outcome residuals $Y_i - \hat{m}(X_i)$ on the treatment residuals $W_i - \hat{e}(X_i)$ to recover $\tau(x)$.

Summary
