Sparse Group Lasso in Python
Yesterday I released a new package for python: asgl (the name comes from Adaptive Sparse Group Lasso). It adds features that were already available in R packages but not in python, like solving sparse group lasso models, and goes beyond that, adding extras that improve the results sparse group lasso can provide.
This is going to be the first article in a series about state-of-the-art regularization techniques in regression, and I would like to start with the sparse group lasso: what it is and how to use it. Specifically, here we will see:
What is sparse group lasso
How to use sparse group lasso in python
How to perform k-fold cross-validation
How to use grid search in order to find the optimal solution.
What is sparse group lasso
To understand what is sparse group lasso we need to talk (briefly) about two techniques: lasso and group lasso. Given a risk function, for example the linear regression risk,
R(β) = ‖y − Xβ‖₂²

Lasso: defined by adding a penalization on the absolute value of the β coefficients,
R(β) + λ‖β‖₁

This definition provides sparse solutions, because it sends some of the β coefficients (those least related to the response variable) to zero. The effect of this penalization can be controlled through the λ parameter: a larger λ gives the penalization greater importance, and thus there are more zeros among the β coefficients. This is mainly useful in high-dimensional datasets, where there are more variables than observations but we expect only a small fraction of the variables to be truly meaningful.
However, there are situations in which the predictor variables in X have a natural grouped structure. For example, in biostatistics it is common to deal with genetic datasets in which predictors are grouped into genetic pathways. In stock market analysis one can group companies from the same business segment. In climate data one can group different regions… And lasso provides individually sparse solutions, not group-sparse ones.
Group lasso: so here comes group lasso to the rescue. The group lasso penalty is built from the ℓ₂ norms (the square roots of the sums of squares) of the coefficients belonging to each group.
R(β) + λ Σₗ √pₗ ‖β⁽ˡ⁾‖₂

where pₗ is the size of group l and β⁽ˡ⁾ is the vector of coefficients in that group. This way it takes into account the possible grouped structure of the predictors, and it sends whole groups of variables to zero. If all the groups are of size 1 (only one predictor per group) we are back to solving a lasso model. Let's look at lasso and group lasso graphically,
[Figure: comparison of the lasso, group lasso and ridge penalizations]

In the image above we have a simple problem with three coefficients, β₁, β₂₁ and β₂₂. The last two coefficients form a group, and as we can see, lasso (left image) does not take this grouping information into account, but group lasso does. So group lasso can be seen as a lasso between groups and a ridge within groups. If a group is meaningful, we select the whole group. If it is not, we send it to zero.
Sparse group lasso: and finally here it is,
R(β) + αλ‖β‖₁ + (1 − α)λ Σₗ √pₗ ‖β⁽ˡ⁾‖₂

Sparse group lasso is a linear combination of lasso and group lasso, so it provides solutions that are sparse both between and within groups.
This technique selects the most meaningful predictors from the most meaningful groups, and is one of the best variable selection alternatives of recent years. However, there was no implementation of sparse group lasso for python… until now.
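As a quick numeric illustration (this is not part of the asgl package), the three penalties can be computed directly with numpy; beta and group_index below are made-up toy values, and the λ factor is left out since it only scales the whole expression:

```python
import numpy as np

def lasso_penalty(beta):
    # l1 norm: sum of the absolute values of the coefficients
    return np.sum(np.abs(beta))

def group_lasso_penalty(beta, group_index):
    # Weighted sum of the l2 norms of each group of coefficients,
    # weighted by the square root of each group's size
    return sum(np.sqrt(np.sum(group_index == g)) *
               np.linalg.norm(beta[group_index == g])
               for g in np.unique(group_index))

def sgl_penalty(beta, group_index, alpha):
    # Linear combination of the two penalties above
    return (alpha * lasso_penalty(beta) +
            (1 - alpha) * group_lasso_penalty(beta, group_index))

beta = np.array([3.0, 0.0, -4.0])   # toy coefficients
group_index = np.array([1, 2, 2])   # first coefficient alone, last two grouped
print(sgl_penalty(beta, group_index, alpha=0.5))
```

With alpha=1 this reduces to the lasso penalty, and with alpha=0 to the group lasso penalty.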
Moving to python: install asgl
Let's start by installing asgl. This can easily be done using pip:
```
pip install asgl
```

Or alternatively, one can clone the GitHub repository and run setup.py:

```
git clone https://github.com/alvaromc317/asgl.git
cd asgl
python setup.py install
```
Import libraries
Once we have the package installed, we can start using it. First, let's import the libraries and the BostonHousing dataset that we will use in this example.
Here, in addition to importing the dataset, we have created a variable called group_index. This variable describes the group structure of the data: if we have 13 predictors, group_index should be a variable of length 13, and if the first three predictors form a group, they should share the same group_index value. However, the BostonHousing dataset does not have a natural grouped structure, so here we define a fake one just for the sake of this article.
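The original code snippet is not embedded in this copy of the article, so here is a sketch of what this setup could look like. The random matrix stands in for the 13-predictor BostonHousing data, and the group assignment is invented (as the article itself does):

```python
import numpy as np

# Stand-in for the BostonHousing data: 506 observations, 13 predictors
rng = np.random.default_rng(42)
x = rng.normal(size=(506, 13))
y = rng.normal(size=506)

# group_index: one entry per predictor; predictors sharing the same value
# belong to the same group. Here the 13 predictors form 5 invented groups.
group_index = np.array([1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5])
```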
Parameters for the sgl model
If we have a look at the sparse group lasso equation above, we can see that there are two parameters, α and λ, that can be optimized. λ controls how much weight we give to the penalization, so larger λ values produce sparser solutions. And α controls the tradeoff between lasso and group lasso: α equal to 1 gives a lasso, and α equal to 0 gives a group lasso. Usually, we define a grid of possible values for both parameters and try to find the combination that minimizes the error.
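The grids used later in this article (23 λ values and 20 α values) could be built like this; the exact values are an assumption, chosen on a log scale for λ and evenly spaced for α:

```python
import numpy as np

# Hypothetical grids: 23 lambda values on a log scale, 20 alpha values in [0, 1]
lambda1 = 10.0 ** np.linspace(-3, 1.5, 23)
alpha = np.linspace(0, 1, 20)

print(len(lambda1) * len(alpha))  # number of parameter combinations
```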
Additionally, we specify the type of model to solve (lm, because we are solving a linear model), the penalization (sgl, because we want the sparse group lasso), the error type to use (MSE, the mean squared error) and, finally, whether to run the code sequentially or in parallel. Parallel execution makes use of Python's multiprocessing library and can solve the problem much faster than the sequential version.
Cross validation
We have defined a grid of 23 possible λ values and 20 possible α values, a total of 460 parameter combinations, and we are going to find the optimal combination using cross-validation (run in parallel).
So first, we define a CV object and plug in all the parameter information defined above. Then we run the cross_validation() function, which returns the error that each of the 460 models achieves on each of the k folds.
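The original gist is missing from this copy of the article, so here is an untested sketch of the calls it describes; the class and argument names are assumptions based on the article's wording and may differ across asgl versions:

```python
import asgl

# lambda1 and alpha are the parameter grids described above;
# x, y and group_index come from the data-loading step.
cv_model = asgl.CV(model='lm', penalization='sgl',
                   lambda1=lambda1, alpha=alpha,
                   error_type='MSE', nfolds=5, parallel=True)

# One error value per model (460) and per fold (k)
error = cv_model.cross_validation(x=x, y=y, group_index=group_index)
```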
With that, we search for the parameter values that minimize the mean squared error, and store them in optimal_lambda and optimal_alpha.
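The search itself is just an argmin over the fold-averaged errors. A self-contained sketch, with a random error matrix standing in for the cross-validation output:

```python
import numpy as np

n_lambda, n_alpha, k = 23, 20, 5
lambda_grid = 10.0 ** np.linspace(-3, 1.5, n_lambda)
alpha_grid = np.linspace(0, 1, n_alpha)

# Stand-in for the CV output: one row per (lambda, alpha) pair, one column per fold
rng = np.random.default_rng(0)
error = rng.uniform(size=(n_lambda * n_alpha, k))

mean_error = error.mean(axis=1)                 # average across the k folds
best = int(np.argmin(mean_error))               # index of the best combination
optimal_lambda = lambda_grid[best // n_alpha]   # rows assumed lambda-major
optimal_alpha = alpha_grid[best % n_alpha]
```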
As simple as that, we have found our optimal model.
Obtain the final error
However, one should remember that cross-validation errors are slightly biased, so in order to obtain the final prediction error we are going to run a final train/test split using the optimal parameters.
First, we define an ASGL object, which we will use for fitting a simple sparse group lasso model (no cross-validation). We then split the data into 70% train / 30% test using the function train_test_split(), and we fit the model to obtain the final estimates of the β coefficients (stored in final_beta_solution).
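Again, the original snippet is not shown in this copy, so the following is an untested sketch; every name here (including the train_test_split signature and the coefficient attribute) is an assumption based on the article's description:

```python
import asgl

# Sparse group lasso model with the optimal parameters found above
sgl_model = asgl.ASGL(model='lm', penalization='sgl',
                      lambda1=optimal_lambda, alpha=optimal_alpha)

# 70% train / 30% test split, then fit on the training data
train_idx, test_idx = asgl.train_test_split(nrows=x.shape[0], train_pct=0.7)
sgl_model.fit(x=x[train_idx], y=y[train_idx], group_index=group_index)

final_beta_solution = sgl_model.coef_
```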
Given a new dataset, we can predict the value of the response variable using the function ASGL.predict(), and compute the final prediction error on the test set with the function error_calculator.
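The error computation itself is just a mean squared error; a minimal stand-in for what error_calculator does with the MSE error type:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared residuals between true and predicted responses
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # (0 + 0 + 4) / 3
```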
And that’s it on how to implement sparse group lasso in python. I hope you enjoyed this post and found it useful. Stay tuned for future posts in this series, and please do not hesitate to contact me if you have any questions or suggestions.
For a deeper review of what the asgl package has to offer, I recommend reading the Jupyter notebook provided in the GitHub repository.
Have a good day!
Translated from: https://towardsdatascience.com/sparse-group-lasso-in-python-255e379ab892