
Exercise 13.11 - Projected gradient descent for l1 regularized least squares

Answers

Generally, we take the gradient with respect to $\mathbf{w}$ and optimize. Whenever a gradient descent step would violate a constraint on $\mathbf{w}$, the increment is moderated (projected) so that the constraint remains satisfied. For the following loss function:

$$\min_{\mathbf{w}} \left\{ \mathrm{NLL}(\mathbf{w}) + \lambda \|\mathbf{w}\|_1 \right\},$$

consider it in the linear regression context, where:

$$\mathrm{NLL}(\mathbf{w}) = \frac{1}{2} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2.$$
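As a minimal sketch of this objective (function name `lasso_objective` and variable `lam` are illustrative, not from the book), it can be evaluated with NumPy as:

```python
import numpy as np

def lasso_objective(w, X, y, lam):
    """NLL(w) + lambda * ||w||_1 for the linear regression case."""
    residual = y - X @ w
    return 0.5 * residual @ residual + lam * np.abs(w).sum()
```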

Since $\lambda \|\mathbf{w}\|_1$ is not differentiable, it is suggested to split $\mathbf{w}$ into its positive and negative parts:

$$\mathbf{w} = \mathbf{u} - \mathbf{v},$$

where

$$u_i = (w_i)_+ = \max\{0, w_i\},$$

$$v_i = (-w_i)_+ = \max\{0, -w_i\}.$$

With $\mathbf{u} \geq 0$ and $\mathbf{v} \geq 0$, we have:

$$\|\mathbf{w}\|_1 = \mathbf{1}_n^T \mathbf{u} + \mathbf{1}_n^T \mathbf{v}.$$
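A tiny numerical check of this split (the values are made up for illustration):

```python
import numpy as np

w = np.array([1.5, -0.3, 0.0, 2.0])

# Positive and negative parts: w = u - v, with u >= 0 and v >= 0.
u = np.maximum(0.0, w)
v = np.maximum(0.0, -w)

assert np.allclose(w, u - v)
assert np.isclose(np.abs(w).sum(), u.sum() + v.sum())  # ||w||_1 = 1^T u + 1^T v
```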

Hence the original problem is translated into:

$$\min_{\mathbf{u}, \mathbf{v}} \left\{ \frac{1}{2} \|\mathbf{y} - \mathbf{X}(\mathbf{u} - \mathbf{v})\|_2^2 + \lambda \mathbf{1}_n^T \mathbf{u} + \lambda \mathbf{1}_n^T \mathbf{v} \right\}$$

$$\text{s.t. } \mathbf{u} \geq 0, \ \mathbf{v} \geq 0.$$
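The two objectives agree whenever $\mathbf{u}$ and $\mathbf{v}$ are exactly the positive and negative parts of $\mathbf{w}$; a small synthetic check (data generated only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
lam = 0.1
w = rng.normal(size=4)

u, v = np.maximum(0.0, w), np.maximum(0.0, -w)

original = 0.5 * np.sum((y - X @ w) ** 2) + lam * np.abs(w).sum()
reformulated = 0.5 * np.sum((y - X @ (u - v)) ** 2) + lam * u.sum() + lam * v.sum()
assert np.isclose(original, reformulated)
```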

Denote:

$$\mathbf{z} = \begin{pmatrix} \mathbf{u} \\ \mathbf{v} \end{pmatrix},$$

then we can rewrite the objective as:

$$\min_{\mathbf{z}} \left\{ f(\mathbf{z}) = \mathbf{c}^T \mathbf{z} + \frac{1}{2} \mathbf{z}^T \mathbf{A} \mathbf{z} \right\}$$

$$\text{s.t. } \mathbf{z} \geq 0,$$

where:

$$\mathbf{c} = \begin{pmatrix} \lambda \mathbf{1}_n - \mathbf{X}^T \mathbf{y} \\ \lambda \mathbf{1}_n + \mathbf{X}^T \mathbf{y} \end{pmatrix},$$

$$\mathbf{A} = \begin{pmatrix} \mathbf{X}^T \mathbf{X} & -\mathbf{X}^T \mathbf{X} \\ -\mathbf{X}^T \mathbf{X} & \mathbf{X}^T \mathbf{X} \end{pmatrix}.$$
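A hedged sketch of how $\mathbf{c}$ and $\mathbf{A}$ could be assembled (the helper name `build_qp` is an assumption, not code from the book or the GPSR paper):

```python
import numpy as np

def build_qp(X, y, lam):
    """Build c and A of the bound-constrained QP (objective up to the constant 0.5 * ||y||^2)."""
    n = X.shape[1]
    Xty = X.T @ y
    c = np.concatenate([lam * np.ones(n) - Xty, lam * np.ones(n) + Xty])
    XtX = X.T @ X
    A = np.block([[XtX, -XtX], [-XtX, XtX]])
    return c, A
```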

Now we have turned the problem into a quadratic program with a simple bound constraint. The gradient is given by:

$$\nabla f(\mathbf{z}) = \mathbf{c} + \mathbf{A}\mathbf{z}.$$
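For larger problems one would typically not materialize the $2n \times 2n$ matrix $\mathbf{A}$; a sketch of the gradient that exploits the block structure (function name `grad_f` is illustrative):

```python
import numpy as np

def grad_f(z, X, y, lam):
    """Gradient c + A z, computed without forming the 2n x 2n matrix A."""
    n = X.shape[1]
    u, v = z[:n], z[n:]
    Xtr = X.T @ (X @ (u - v)) - X.T @ y      # X^T X (u - v) - X^T y
    return np.concatenate([lam * np.ones(n) + Xtr, lam * np.ones(n) - Xtr])
```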

An ordinary gradient descent step is:

$$\mathbf{z}^{k+1} = \mathbf{z}^k - \alpha \nabla f(\mathbf{z}^k).$$

For the projected case, define $\mathbf{g}^k$ elementwise as:

$$g_i^k = \min\left\{ z_i^k, \ \alpha \nabla f(\mathbf{z}^k)_i \right\}.$$

And:

$$\mathbf{z}^{k+1} = \mathbf{z}^k - \mathbf{g}^k,$$

so that $\mathbf{z}^{k+1} \geq 0$ always holds and every iterate remains a feasible weight candidate. This is exactly the projection of the plain gradient step onto the nonnegative orthant: $\mathbf{z}^{k+1} = \max\{0, \mathbf{z}^k - \alpha \nabla f(\mathbf{z}^k)\}$.
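Putting the pieces together, here is a hedged end-to-end sketch of the method with a fixed step size (the function name, the step-size choice, and the synthetic data are all assumptions, not the book's or the paper's code):

```python
import numpy as np

def lasso_pgd(X, y, lam, alpha=None, n_iters=500):
    """Projected gradient descent on the nonnegative QP; returns w = u - v."""
    n = X.shape[1]
    if alpha is None:
        # Fixed step: 1 / Lipschitz constant of grad f, using lambda_max(A) = 2 * lambda_max(X^T X).
        alpha = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X).max())
    z = np.zeros(2 * n)
    lam_ones = lam * np.ones(n)
    for _ in range(n_iters):
        u, v = z[:n], z[n:]
        Xtr = X.T @ (X @ (u - v)) - X.T @ y
        grad = np.concatenate([lam_ones + Xtr, lam_ones - Xtr])
        # Projected step: equivalent to z <- max(0, z - alpha * grad).
        g = np.minimum(z, alpha * grad)
        z = z - g
    return z[:n] - z[n:]

# Example usage with synthetic data (illustrative only):
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[[0, 3]] = [2.0, -1.5]
y = X @ w_true + 0.1 * rng.normal(size=50)
w_hat = lasso_pgd(X, y, lam=1.0)
```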

The original paper suggests more refined rules for choosing the step size; refer to "Gradient Projection for Sparse Reconstruction: Application to Compressed Sensing and Other Inverse Problems" by Mário A. T. Figueiredo, Robert D. Nowak, and Stephen J. Wright.
