
Exercise 8.6 - Elementary properties of l2 regularized logistic regression

Answers

For question (a), the Hessian of the l2-regularized negative log likelihood is

$$\mathbf{H} + 2\lambda\mathbf{I},$$

where $\mathbf{H}$, following the derivation in Exercise 8.3, is positive semidefinite, and the penalty $\lambda\,\mathbf{w}^T\mathbf{w}$ contributes $2\lambda\mathbf{I}$. So for any non-trivial $\lambda > 0$ the Hessian of this model is strictly positive definite, and there is a unique optimal solution; there cannot be multiple locally optimal solutions. The answer is False.
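As a quick numerical sanity check, here is a minimal sketch (random data, my own setup rather than anything from the book) that builds the regularized Hessian $\mathbf{X}^T\mathbf{S}\mathbf{X} + 2\lambda\mathbf{I}$, where $\mathbf{S} = \mathrm{diag}(\mu_i(1-\mu_i))$, and confirms that its smallest eigenvalue is strictly positive:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w = rng.normal(size=d)

mu = 1.0 / (1.0 + np.exp(-X @ w))       # sigmoid(Xw), shape (n,)
S = np.diag(mu * (1.0 - mu))            # S_ii = mu_i * (1 - mu_i)
lam = 0.1

H_nll = X.T @ S @ X                     # Hessian of the NLL: positive semidefinite
H_reg = H_nll + 2.0 * lam * np.eye(d)   # plus the penalty's Hessian, 2*lam*I

print(np.linalg.eigvalsh(H_nll).min())  # >= 0 (up to round-off)
print(np.linalg.eigvalsh(H_reg).min())  # >= 2*lam > 0: strictly positive definite
```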

For question (b), the statement is not necessarily true, so the answer is False: the l2 penalty shrinks the weights toward zero but does not typically drive any of them exactly to zero. For a sparse optimum, one should resort to the lasso, i.e. l1 regularization, which corresponds to placing a Laplace prior on the weights.
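The contrast is easy to see empirically. The sketch below is an assumed toy setup using scikit-learn (the dataset and hyperparameters are illustrative, not from the book); it fits both penalties on the same data and counts exact zeros among the learned coefficients:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=3, random_state=0)

# In scikit-learn, C is the inverse regularization strength (roughly 1/lambda).
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print("exact zeros, l2:", np.sum(l2.coef_ == 0))  # typically none: small but dense
print("exact zeros, l1:", np.sum(l1.coef_ == 0))  # typically many
```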

For question (c), if $\lambda = 0$ then the model reduces to ordinary logistic regression. If the dataset is linearly separable, then there exists $\mathbf{w}$ such that for all $i$ (with labels $y_i \in \{-1, +1\}$):

$$y_i\,\mathbf{w}^T\mathbf{x}_i > 0.$$

Now for any $\alpha > 1$, the weights $\alpha\mathbf{w}$ also meet the separation condition and achieve a strictly higher likelihood, since each term $\sigma(\alpha\, y_i\,\mathbf{w}^T\mathbf{x}_i)$ moves closer to 1. Letting $\alpha \to \infty$ shows that the weights can grow without bound, which justifies that the statement in (c) is True.
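A minimal sketch of this effect on assumed toy data (one feature, labels encoded in $\{0, 1\}$ so the NLL gradient takes the form $\mathbf{X}^T(\boldsymbol{\mu} - \mathbf{y})$): gradient descent on the unregularized NLL never settles, and $\|\mathbf{w}\|$ keeps growing:

```python
import numpy as np

# Linearly separable toy data: negatives left of 0, positives right of 0.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

w = np.zeros(1)
for step in range(1, 50001):
    mu = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (mu - y)        # gradient of the logistic NLL
    w -= 0.1 * grad
    if step % 10000 == 0:
        print(step, np.linalg.norm(w))  # the norm keeps increasing
```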

For question (d), the statement is False, since the model now has to trade off fit to the data against the prior. Concretely, we can prove the opposite statement: as we increase $\lambda$, the likelihood of the training dataset monotonically decreases (that is, it never increases).

Assume that $\hat{\mathbf{w}}_1$ minimizes the objective (8.131) with $\lambda = \lambda_1$; denote this objective by $J_1$. Now increase $\lambda_1$ to $\lambda_2 > \lambda_1$; we are then optimizing the loss

$$J_2(\mathbf{w}) = -\ell(\mathbf{w}, \mathcal{D}_{\text{train}}) + \lambda_2\,\mathbf{w}^T\mathbf{w},$$

where $\ell(\mathbf{w}, \mathcal{D}_{\text{train}})$ denotes the log likelihood of the training data, and whose optimal solution is denoted by $\hat{\mathbf{w}}_2$.

If $\hat{\mathbf{w}}_1 = \hat{\mathbf{w}}_2$ then we immediately have:

$$\ell(\hat{\mathbf{w}}_1, \mathcal{D}_{\text{train}}) = \ell(\hat{\mathbf{w}}_2, \mathcal{D}_{\text{train}}).$$

Otherwise, since $\lambda_2 > 0$ makes $J_2$ strictly convex, $\hat{\mathbf{w}}_2$ is its unique minimizer, so we would have:

$$-\ell(\hat{\mathbf{w}}_1, \mathcal{D}_{\text{train}}) + \lambda_2\,\hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1 > -\ell(\hat{\mathbf{w}}_2, \mathcal{D}_{\text{train}}) + \lambda_2\,\hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2.$$

Now suppose, for the sake of contradiction, that $\ell(\hat{\mathbf{w}}_2, \mathcal{D}_{\text{train}}) > \ell(\hat{\mathbf{w}}_1, \mathcal{D}_{\text{train}})$, i.e. the training likelihood increased. Rearranging the inequality above, we would have:

$$\Delta = \ell(\hat{\mathbf{w}}_2, \mathcal{D}_{\text{train}}) - \ell(\hat{\mathbf{w}}_1, \mathcal{D}_{\text{train}}) > \lambda_2\left(\hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2 - \hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1\right),$$

with $\Delta > 0$ by assumption.

Finally, consider:

$$J_1(\hat{\mathbf{w}}_1) - J_1(\hat{\mathbf{w}}_2) = \Delta + \lambda_1\left(\hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1 - \hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2\right).$$

If $\hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1 > \hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2$, both terms on the right-hand side are positive, so $J_1(\hat{\mathbf{w}}_1) > J_1(\hat{\mathbf{w}}_2)$: then $\hat{\mathbf{w}}_1$ is not the optimum of $J_1$ and we arrive at a contradiction. Otherwise $\hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1 \le \hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2$, and since $\lambda_1 < \lambda_2$:

$$\lambda_1\left(\hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2 - \hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1\right) \le \lambda_2\left(\hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2 - \hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1\right) < \Delta.$$

Hence $J_1(\hat{\mathbf{w}}_1) - J_1(\hat{\mathbf{w}}_2) = \Delta - \lambda_1(\hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2 - \hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1) > 0$, so the optimality of $\hat{\mathbf{w}}_1$ still fails. Either way we reach a contradiction, so the training likelihood cannot increase when $\lambda$ does. This finishes the proof.
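The monotonicity is easy to observe numerically. Here is a companion sketch to the proof under an assumed scikit-learn setup (C is the inverse regularization strength, roughly $1/\lambda$ up to how the loss is scaled): as $\lambda$ grows, the training NLL never decreases.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = LogisticRegression(C=1.0 / lam, max_iter=5000).fit(X, y)
    nll = log_loss(y, clf.predict_proba(X))        # training NLL
    print(f"lambda = {lam:6.2f}   train NLL = {nll:.4f}")  # non-decreasing
```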

For question (e), the statement is False. This can easily be seen by imagining $\lambda \to \infty$: then $\hat{\mathbf{w}} \to \mathbf{0}$ and the model predicts $p(y = 1 \mid \mathbf{x}) = 0.5$ for every input, so the test likelihood cannot keep increasing.
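A quick empirical illustration (assumed setup; the exact shape depends on the data): sweeping $\lambda$ typically shows the test NLL falling while regularization curbs overfitting and then rising again as the weights are shrunk toward zero, so the test likelihood is not monotone in $\lambda$.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for lam in [1e-4, 1e-2, 1.0, 1e2, 1e4]:
    clf = LogisticRegression(C=1.0 / lam, max_iter=5000).fit(X_tr, y_tr)
    nll = log_loss(y_te, clf.predict_proba(X_te))  # held-out NLL
    print(f"lambda = {lam:10.4f}   test NLL = {nll:.4f}")  # typically U-shaped
```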
