
Exercise 8.6 - Elementary properties of l2 regularized logistic regression

Answers

For question (a), the Hessian of the l2-regularized negative log likelihood is

$$\mathbf{H} + 2\lambda\mathbf{I},$$

where $\mathbf{H}$, following the derivation in Exercise 8.3, is positive semidefinite, and the penalty $\lambda\,\mathbf{w}^T\mathbf{w}$ contributes $2\lambda\mathbf{I}$. So for any non-trivial $\lambda > 0$ the Hessian of this model is strictly positive definite, and there is a unique optimal solution; there cannot be multiple locally optimal solutions. The answer is False.
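As a quick numerical sanity check, here is a minimal sketch (random data, my own setup rather than anything from the book) that builds the regularized Hessian $\mathbf{X}^T\mathbf{S}\mathbf{X} + 2\lambda\mathbf{I}$, where $\mathbf{S} = \mathrm{diag}(\mu_i(1-\mu_i))$, and confirms that its smallest eigenvalue is strictly positive:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w = rng.normal(size=d)

mu = 1.0 / (1.0 + np.exp(-X @ w))       # sigmoid(Xw), shape (n,)
S = np.diag(mu * (1.0 - mu))            # S_ii = mu_i * (1 - mu_i)
lam = 0.1

H_nll = X.T @ S @ X                     # Hessian of the NLL: positive semidefinite
H_reg = H_nll + 2.0 * lam * np.eye(d)   # plus the penalty's Hessian, 2*lam*I

print(np.linalg.eigvalsh(H_nll).min())  # >= 0 (up to round-off)
print(np.linalg.eigvalsh(H_reg).min())  # >= 2*lam > 0: strictly positive definite
```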

For question (b), the statement is not necessarily true, so the answer is False: the l2 penalty shrinks the weights toward zero but does not typically drive any of them exactly to zero. For a sparse optimum, one should resort to the lasso, i.e. l1 regularization, which corresponds to placing a Laplace prior on the weights.
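The contrast is easy to see empirically. The sketch below is an assumed toy setup using scikit-learn (the dataset and hyperparameters are illustrative, not from the book); it fits both penalties on the same data and counts exact zeros among the learned coefficients:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=3, random_state=0)

# In scikit-learn, C is the inverse regularization strength (roughly 1/lambda).
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print("exact zeros, l2:", np.sum(l2.coef_ == 0))  # typically none: small but dense
print("exact zeros, l1:", np.sum(l1.coef_ == 0))  # typically many
```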

For question (c), if $\lambda = 0$ then the model reduces to ordinary logistic regression. If the dataset is linearly separable, then there exists $\mathbf{w}$ such that for all $i$ (with labels $y_i \in \{-1, +1\}$):

$$y_i\,\mathbf{w}^T\mathbf{x}_i > 0.$$

Now for any $\alpha > 1$, the weights $\alpha\mathbf{w}$ also meet the separation condition and achieve a strictly higher likelihood, since each term $\sigma(\alpha\, y_i\,\mathbf{w}^T\mathbf{x}_i)$ moves closer to 1. Letting $\alpha \to \infty$ shows that the weights can grow without bound, which justifies that the statement in (c) is True.
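A minimal sketch of this effect on assumed toy data (one feature, labels encoded in $\{0, 1\}$ so the NLL gradient takes the form $\mathbf{X}^T(\boldsymbol{\mu} - \mathbf{y})$): gradient descent on the unregularized NLL never settles, and $\|\mathbf{w}\|$ keeps growing:

```python
import numpy as np

# Linearly separable toy data: negatives left of 0, positives right of 0.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

w = np.zeros(1)
for step in range(1, 50001):
    mu = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (mu - y)        # gradient of the logistic NLL
    w -= 0.1 * grad
    if step % 10000 == 0:
        print(step, np.linalg.norm(w))  # the norm keeps increasing
```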

For question (d), the statement is False, since the model now has to trade off fit to the data against the prior. Concretely, we can prove the opposite statement: as we increase $\lambda$, the likelihood of the training dataset monotonically decreases (that is, it never increases).

Assume that $\hat{\mathbf{w}}_1$ minimizes the objective (8.131) with $\lambda = \lambda_1$; denote this objective by $J_1$. Now increase $\lambda_1$ to $\lambda_2 > \lambda_1$; we are then optimizing the loss

$$J_2(\mathbf{w}) = -\ell(\mathbf{w}, \mathcal{D}_{\text{train}}) + \lambda_2\,\mathbf{w}^T\mathbf{w},$$

where $\ell(\mathbf{w}, \mathcal{D}_{\text{train}})$ denotes the log likelihood of the training data, and whose optimal solution is denoted by $\hat{\mathbf{w}}_2$.

If $\hat{\mathbf{w}}_1 = \hat{\mathbf{w}}_2$ then we immediately have:

$$\ell(\hat{\mathbf{w}}_1, \mathcal{D}_{\text{train}}) = \ell(\hat{\mathbf{w}}_2, \mathcal{D}_{\text{train}}).$$

Otherwise, since $\lambda_2 > 0$ makes $J_2$ strictly convex, $\hat{\mathbf{w}}_2$ is its unique minimizer, so we would have:

$$-\ell(\hat{\mathbf{w}}_1, \mathcal{D}_{\text{train}}) + \lambda_2\,\hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1 > -\ell(\hat{\mathbf{w}}_2, \mathcal{D}_{\text{train}}) + \lambda_2\,\hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2.$$

Now suppose, for the sake of contradiction, that $\ell(\hat{\mathbf{w}}_2, \mathcal{D}_{\text{train}}) > \ell(\hat{\mathbf{w}}_1, \mathcal{D}_{\text{train}})$, i.e. the training likelihood increased. Rearranging the inequality above, we would have:

$$\Delta = \ell(\hat{\mathbf{w}}_2, \mathcal{D}_{\text{train}}) - \ell(\hat{\mathbf{w}}_1, \mathcal{D}_{\text{train}}) > \lambda_2\left(\hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2 - \hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1\right),$$

with $\Delta > 0$ by assumption.

Finally, consider:

$$J_1(\hat{\mathbf{w}}_1) - J_1(\hat{\mathbf{w}}_2) = \Delta + \lambda_1\left(\hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1 - \hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2\right).$$

If $\hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1 > \hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2$, both terms on the right-hand side are positive, so $J_1(\hat{\mathbf{w}}_1) > J_1(\hat{\mathbf{w}}_2)$: then $\hat{\mathbf{w}}_1$ is not the optimum of $J_1$ and we arrive at a contradiction. Otherwise $\hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1 \le \hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2$, and since $\lambda_1 < \lambda_2$:

$$\lambda_1\left(\hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2 - \hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1\right) \le \lambda_2\left(\hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2 - \hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1\right) < \Delta.$$

Hence $J_1(\hat{\mathbf{w}}_1) - J_1(\hat{\mathbf{w}}_2) = \Delta - \lambda_1(\hat{\mathbf{w}}_2^T\hat{\mathbf{w}}_2 - \hat{\mathbf{w}}_1^T\hat{\mathbf{w}}_1) > 0$, so the optimality of $\hat{\mathbf{w}}_1$ still fails. Either way we reach a contradiction, so the training likelihood cannot increase when $\lambda$ does. This finishes the proof.
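The monotonicity is easy to observe numerically. Here is a companion sketch to the proof under an assumed scikit-learn setup (C is the inverse regularization strength, roughly $1/\lambda$ up to how the loss is scaled): as $\lambda$ grows, the training NLL never decreases.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

for lam in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = LogisticRegression(C=1.0 / lam, max_iter=5000).fit(X, y)
    nll = log_loss(y, clf.predict_proba(X))        # training NLL
    print(f"lambda = {lam:6.2f}   train NLL = {nll:.4f}")  # non-decreasing
```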

For question (e), the statement is False. This can easily be seen by imagining $\lambda \to \infty$: then $\hat{\mathbf{w}} \to \mathbf{0}$ and the model predicts $p(y = 1 \mid \mathbf{x}) = 0.5$ for every input, so the test likelihood cannot keep increasing.
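A quick empirical illustration (assumed setup; the exact shape depends on the data): sweeping $\lambda$ typically shows the test NLL falling while regularization curbs overfitting and then rising again as the weights are shrunk toward zero, so the test likelihood is not monotone in $\lambda$.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for lam in [1e-4, 1e-2, 1.0, 1e2, 1e4]:
    clf = LogisticRegression(C=1.0 / lam, max_iter=5000).fit(X_tr, y_tr)
    nll = log_loss(y_te, clf.predict_proba(X_te))  # held-out NLL
    print(f"lambda = {lam:10.4f}   test NLL = {nll:.4f}")  # typically U-shaped
```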
