Sparse Group Lasso in Python
Yesterday I released a new package for python: asgl (the name comes from Adaptive Sparse Group Lasso). It adds features that were already available in R packages but not in python, like solving sparse group lasso models, and goes beyond that, adding extras that improve the results sparse group lasso can provide.
This is going to be the first article in a series about state-of-the-art regularization techniques in regression, and I would like to start with the sparse group lasso: what it is and how to use it. Specifically, here we will see:
What is sparse group lasso
How to use sparse group lasso in python
How to perform k-fold cross-validation
How to use grid search in order to find the optimal solution.
What is sparse group lasso
To understand what is sparse group lasso we need to talk (briefly) about two techniques: lasso and group lasso. Given a risk function, for example the linear regression risk,
R(β) = ‖y − Xβ‖₂²

Lasso: defined by adding a penalization on the absolute value of the β coefficients,
R(β) + λ‖β‖₁

This definition provides sparse solutions, because it sends some of the β coefficients (those least related to the response variable) to zero. The effect of this penalization can be controlled through the λ parameter: a larger λ gives the penalization greater importance, and thus there are more zeros among the β coefficients. This is mainly useful in high-dimensional datasets, where there are more variables than observations but we expect only a small fraction of the variables to be truly meaningful.
However, there are situations in which the predictor variables in X have a natural grouped structure. For example, in biostatistics it is common to deal with genetic datasets in which predictors are grouped into genetic pathways. In stock market analysis one can group companies from the same business segment. In climate data one can group different regions… And lasso provides individually sparse solutions, not group-sparse ones.
Group lasso: so here comes group lasso to the rescue. The group lasso penalty is built from the ℓ₂ norms (the square roots of the sums of squares) of the coefficients belonging to each group.
R(β) + λ Σₗ √pₗ ‖β⁽ˡ⁾‖₂

where pₗ is the size of group l and β⁽ˡ⁾ is the vector of coefficients in that group. This way it takes into account the possible grouped structure of the predictors, and it sends whole groups of variables to zero. If all the groups are of size 1 (only one predictor per group) we are back to solving a lasso model. Let's look at lasso and group lasso graphically,
[Figure: comparison of the lasso, group lasso and ridge penalizations]

In the image above we have a simple problem with three coefficients, β₁, β₂₁ and β₂₂. The last two coefficients form a group, and as we can see, lasso (left image) does not take this grouping information into account, but group lasso does. So group lasso can be seen as a lasso between groups and a ridge within groups. If a group is meaningful, we select the whole group. If it is not, we send it to zero.
Sparse group lasso: and finally here it is,
R(β) + αλ‖β‖₁ + (1 − α)λ Σₗ √pₗ ‖β⁽ˡ⁾‖₂

Sparse group lasso is a linear combination of lasso and group lasso, so it provides solutions that are sparse both between and within groups.
This technique selects the most meaningful predictors from the most meaningful groups, and is one of the best variable selection alternatives of recent years. However, there was no implementation of sparse group lasso for python… until now.
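As a quick numeric illustration (this is not part of the asgl package), the three penalties can be computed directly with numpy; beta and group_index below are made-up toy values, and the λ factor is left out since it only scales the whole expression:

```python
import numpy as np

def lasso_penalty(beta):
    # l1 norm: sum of the absolute values of the coefficients
    return np.sum(np.abs(beta))

def group_lasso_penalty(beta, group_index):
    # Weighted sum of the l2 norms of each group of coefficients,
    # weighted by the square root of each group's size
    return sum(np.sqrt(np.sum(group_index == g)) *
               np.linalg.norm(beta[group_index == g])
               for g in np.unique(group_index))

def sgl_penalty(beta, group_index, alpha):
    # Linear combination of the two penalties above
    return (alpha * lasso_penalty(beta) +
            (1 - alpha) * group_lasso_penalty(beta, group_index))

beta = np.array([3.0, 0.0, -4.0])   # toy coefficients
group_index = np.array([1, 2, 2])   # first coefficient alone, last two grouped
print(sgl_penalty(beta, group_index, alpha=0.5))
```

With alpha=1 this reduces to the lasso penalty, and with alpha=0 to the group lasso penalty.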
Moving to python: install asgl
Let's start by installing asgl. This can easily be done using pip:
```
pip install asgl
```

Or alternatively, one can clone the GitHub repository and run setup.py:

```
git clone https://github.com/alvaromc317/asgl.git
cd asgl
python setup.py install
```
Import libraries
Once we have the package installed, we can start using it. First, let's import the libraries and the BostonHousing dataset that we will use in this example.
Here, in addition to importing the dataset, we have created a variable called group_index. This variable describes the group structure of the data: if we have 13 predictors, group_index should be a variable of length 13, and if the first three predictors form a group, they should share the same group_index value. However, the BostonHousing dataset does not have a natural grouped structure, so here we define a fake one just for the sake of this article.
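The original code snippet is not embedded in this copy of the article, so here is a sketch of what this setup could look like. The random matrix stands in for the 13-predictor BostonHousing data, and the group assignment is invented (as the article itself does):

```python
import numpy as np

# Stand-in for the BostonHousing data: 506 observations, 13 predictors
rng = np.random.default_rng(42)
x = rng.normal(size=(506, 13))
y = rng.normal(size=506)

# group_index: one entry per predictor; predictors sharing the same value
# belong to the same group. Here the 13 predictors form 5 invented groups.
group_index = np.array([1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5])
```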
Parameters for the sgl model
If we have a look at the sparse group lasso equation above, we can see that there are two parameters, α and λ, that can be optimized. λ controls how much weight we give to the penalization, so larger λ values produce sparser solutions. And α controls the tradeoff between lasso and group lasso: α equal to 1 gives a lasso, and α equal to 0 gives a group lasso. Usually, we define a grid of possible values for both parameters and try to find the combination that minimizes the error.
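The grids used later in this article (23 λ values and 20 α values) could be built like this; the exact values are an assumption, chosen on a log scale for λ and evenly spaced for α:

```python
import numpy as np

# Hypothetical grids: 23 lambda values on a log scale, 20 alpha values in [0, 1]
lambda1 = 10.0 ** np.linspace(-3, 1.5, 23)
alpha = np.linspace(0, 1, 20)

print(len(lambda1) * len(alpha))  # number of parameter combinations
```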
Additionally, we specify the type of model to solve (lm, because we are solving a linear model), the penalization (sgl, because we want the sparse group lasso), the error type to use (MSE, the mean squared error) and, finally, whether to run the code sequentially or in parallel. Parallel execution makes use of Python's multiprocessing library and can solve the problem much faster than the sequential version.
Cross validation
We have defined a grid of 23 possible λ values and 20 possible α values, a total of 460 parameter combinations, and we are going to find the optimal combination using cross-validation (run in parallel).
So first, we define a CV object and plug in all the parameter information defined above. Then we run the cross_validation() function, which returns the error that each of the 460 models achieves on each of the k folds.
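The original gist is missing from this copy of the article, so here is an untested sketch of the calls it describes; the class and argument names are assumptions based on the article's wording and may differ across asgl versions:

```python
import asgl

# lambda1 and alpha are the parameter grids described above;
# x, y and group_index come from the data-loading step.
cv_model = asgl.CV(model='lm', penalization='sgl',
                   lambda1=lambda1, alpha=alpha,
                   error_type='MSE', nfolds=5, parallel=True)

# One error value per model (460) and per fold (k)
error = cv_model.cross_validation(x=x, y=y, group_index=group_index)
```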
With that, we search for the parameter values that minimize the mean squared error, and store them in optimal_lambda and optimal_alpha.
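The search itself is just an argmin over the fold-averaged errors. A self-contained sketch, with a random error matrix standing in for the cross-validation output:

```python
import numpy as np

n_lambda, n_alpha, k = 23, 20, 5
lambda_grid = 10.0 ** np.linspace(-3, 1.5, n_lambda)
alpha_grid = np.linspace(0, 1, n_alpha)

# Stand-in for the CV output: one row per (lambda, alpha) pair, one column per fold
rng = np.random.default_rng(0)
error = rng.uniform(size=(n_lambda * n_alpha, k))

mean_error = error.mean(axis=1)                 # average across the k folds
best = int(np.argmin(mean_error))               # index of the best combination
optimal_lambda = lambda_grid[best // n_alpha]   # rows assumed lambda-major
optimal_alpha = alpha_grid[best % n_alpha]
```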
As simple as that, we have found our optimal model.
Obtain the final error
However, one should remember that cross-validation errors are slightly biased, so in order to obtain the final prediction error we are going to run a final train/test split using the optimal parameters.
First, we define an ASGL object, which we will use for fitting a simple sparse group lasso model (no cross-validation). We then split the data into 70% train / 30% test using the function train_test_split(), and we fit the model to obtain the final estimates of the β coefficients (stored in final_beta_solution).
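Again, the original snippet is not shown in this copy, so the following is an untested sketch; every name here (including the train_test_split signature and the coefficient attribute) is an assumption based on the article's description:

```python
import asgl

# Sparse group lasso model with the optimal parameters found above
sgl_model = asgl.ASGL(model='lm', penalization='sgl',
                      lambda1=optimal_lambda, alpha=optimal_alpha)

# 70% train / 30% test split, then fit on the training data
train_idx, test_idx = asgl.train_test_split(nrows=x.shape[0], train_pct=0.7)
sgl_model.fit(x=x[train_idx], y=y[train_idx], group_index=group_index)

final_beta_solution = sgl_model.coef_
```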
Given a new dataset, we can predict the value of the response variable using the function ASGL.predict(), and compute the final prediction error on the test set with the function error_calculator.
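The error computation itself is just a mean squared error; a minimal stand-in for what error_calculator does with the MSE error type:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared residuals between true and predicted responses
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # (0 + 0 + 4) / 3
```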
And that’s it on how to implement sparse group lasso in python. I hope you enjoyed this post and found it useful. Stay tuned for future posts in this series, and please do not hesitate to contact me if you have any questions or suggestions.
For a deeper review of what the asgl package has to offer, I recommend reading the Jupyter notebook provided in the GitHub repository.
Have a good day!
Translated from: https://towardsdatascience.com/sparse-group-lasso-in-python-255e379ab892