
Exercise 8.4 - Gradient and Hessian of log-likelihood for multinomial logistic regression

Answers

For question (a), given a sample indexed by $i$, we have

$$\mu_{ik} = \frac{\exp(\mathbf{w}_k^T \mathbf{x}_i)}{\sum_c \exp(\mathbf{w}_c^T \mathbf{x}_i)},$$

$$\eta_{ij} = \mathbf{w}_j^T \mathbf{x}_i.$$

Now we have:

$$\frac{\partial \mu_{ik}}{\partial \eta_{ij}} = \frac{\frac{\partial \exp(\eta_{ik})}{\partial \eta_{ij}} \sum_c \exp(\eta_{ic}) - \exp(\eta_{ik}) \frac{\partial}{\partial \eta_{ij}} \sum_c \exp(\eta_{ic})}{\left( \sum_c \exp(\eta_{ic}) \right)^2} = \frac{\exp(\eta_{ik}) \, \delta_{kj} \sum_c \exp(\eta_{ic}) - \exp(\eta_{ij}) \exp(\eta_{ik})}{\left( \sum_c \exp(\eta_{ic}) \right)^2} = \mu_{ik} \delta_{kj} - \mu_{ij} \mu_{ik},$$

which is nothing more than the quotient rule from elementary calculus.
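This Jacobian identity is easy to sanity-check numerically. Below is a minimal sketch, assuming NumPy; the helper names `softmax` and `softmax_jacobian` are illustrative choices, not library functions, and the analytic Jacobian is compared against central finite differences.

```python
import numpy as np

def softmax(eta):
    # Numerically stable softmax: shift by the max before exponentiating.
    e = np.exp(eta - eta.max())
    return e / e.sum()

def softmax_jacobian(eta):
    # J[k, j] = d mu_k / d eta_j = mu_k * (delta_kj - mu_j), the result of (a).
    mu = softmax(eta)
    return np.diag(mu) - np.outer(mu, mu)

rng = np.random.default_rng(0)
eta = rng.normal(size=5)
eps = 1e-6
J_num = np.empty((5, 5))
for j in range(5):
    d = np.zeros(5)
    d[j] = eps
    # Central difference in the j-th coordinate of eta.
    J_num[:, j] = (softmax(eta + d) - softmax(eta - d)) / (2 * eps)

print(np.allclose(softmax_jacobian(eta), J_num, atol=1e-8))  # expect True
```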

For question (b), recall that:

$$\ell(\mathbf{W}) = \sum_{i=1}^{N} \sum_c y_{ic} \log \mu_{ic}.$$

Let $\ell_i(\mathbf{W}) = \sum_c y_{ic} \log \mu_{ic}$; we can now compute the gradient term by term:

$$\begin{aligned}
\frac{\partial \ell_i}{\partial \mathbf{w}_j} &= \frac{\partial}{\partial \mathbf{w}_j} \sum_c y_{ic} \log \mu_{ic} = \sum_c \frac{y_{ic}}{\mu_{ic}} \frac{\partial \mu_{ic}}{\partial \eta_{ij}} \frac{\partial \eta_{ij}}{\partial \mathbf{w}_j} = \sum_c \frac{y_{ic}}{\mu_{ic}} \mu_{ic} (\delta_{cj} - \mu_{ij}) \mathbf{x}_i \\
&= \sum_c y_{ic} (\delta_{cj} - \mu_{ij}) \mathbf{x}_i = y_{ij} (1 - \mu_{ij}) \mathbf{x}_i - \sum_{c \neq j} y_{ic} \mu_{ij} \mathbf{x}_i \\
&= y_{ij} (1 - \mu_{ij}) \mathbf{x}_i + (y_{ij} - 1) \mu_{ij} \mathbf{x}_i = (y_{ij} - \mu_{ij}) \mathbf{x}_i,
\end{aligned}$$

where the penultimate step uses $\sum_{c \neq j} y_{ic} = 1 - y_{ij}$, which holds because the labels are one-hot encoded, so $\sum_c y_{ic} = 1$.

Summing over $i$ yields (8.126).
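The same finite-difference trick checks the gradient formula. A minimal sketch, again assuming NumPy; `log_likelihood`, `grad_wj`, and the problem sizes are illustrative choices, with the labels `Y` one-hot encoded as in the derivation.

```python
import numpy as np

def log_likelihood(W, X, Y):
    # l(W) = sum_i sum_c y_ic log mu_ic; W is (C, D), X is (N, D), Y is one-hot (N, C).
    logits = X @ W.T                             # eta_ij = w_j^T x_i
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    log_mu = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return (Y * log_mu).sum()

def grad_wj(W, X, Y, j):
    # dl/dw_j = sum_i (y_ij - mu_ij) x_i, the result of (b).
    logits = X @ W.T
    mu = np.exp(logits - logits.max(axis=1, keepdims=True))
    mu /= mu.sum(axis=1, keepdims=True)
    return ((Y[:, j] - mu[:, j])[:, None] * X).sum(axis=0)

rng = np.random.default_rng(1)
N, D, C = 20, 4, 3
X = rng.normal(size=(N, D))
Y = np.eye(C)[rng.integers(C, size=N)]           # one-hot labels
W = rng.normal(size=(C, D))

# Central-difference check of dl/dw_j, one coordinate at a time.
eps, j = 1e-6, 0
g_num = np.empty(D)
for d in range(D):
    Wp, Wm = W.copy(), W.copy()
    Wp[j, d] += eps
    Wm[j, d] -= eps
    g_num[d] = (log_likelihood(Wp, X, Y) - log_likelihood(Wm, X, Y)) / (2 * eps)

print(np.allclose(grad_wj(W, X, Y, j), g_num, atol=1e-6))  # expect True
```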

For question (c), we have by definition:

$$\mathbf{H}_{c,c'} = \frac{\partial}{\partial \mathbf{w}_{c'}} \frac{\partial}{\partial \mathbf{w}_c} \ell(\mathbf{W}).$$

Hence we begin with the result from question (b):

$$\frac{\partial}{\partial \mathbf{w}_{c'}} \frac{\partial \ell_i}{\partial \mathbf{w}_c} = \frac{\partial}{\partial \mathbf{w}_{c'}} (y_{ic} - \mu_{ic}) \mathbf{x}_i = -\mathbf{x}_i \left( \frac{\partial \mu_{ic}}{\partial \eta_{ic'}} \frac{\partial \eta_{ic'}}{\partial \mathbf{w}_{c'}} \right)^T = -\mu_{ic} (\delta_{c,c'} - \mu_{ic'}) \, \mathbf{x}_i \mathbf{x}_i^T,$$

where in the last step differentiating the vector $(y_{ic} - \mu_{ic}) \mathbf{x}_i$ with respect to the vector $\mathbf{w}_{c'}$ produces the outer product $\mathbf{x}_i \mathbf{x}_i^T$, which spans the Hessian block. Summing over $i$ yields the desired result (8.127).
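The Hessian block can likewise be checked by numerically differentiating the analytic gradient from question (b). A minimal sketch under the same assumptions (NumPy; `hessian_block` and `grad_wc` are illustrative names); note it implements the block for the log-likelihood $\ell$, matching the sign convention of this derivation.

```python
import numpy as np

def softmax_rows(logits):
    # Row-wise stable softmax: mu_ic for each sample i.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def grad_wc(W, X, Y, c):
    # dl/dw_c = sum_i (y_ic - mu_ic) x_i, from question (b).
    mu = softmax_rows(X @ W.T)
    return ((Y[:, c] - mu[:, c])[:, None] * X).sum(axis=0)

def hessian_block(W, X, c, cp):
    # H_{c,c'} = -sum_i mu_ic (delta_{c,c'} - mu_ic') x_i x_i^T, the result of (c).
    mu = softmax_rows(X @ W.T)
    coef = -mu[:, c] * ((c == cp) - mu[:, cp])
    return np.einsum('i,ia,ib->ab', coef, X, X)  # sum_i coef_i x_i x_i^T

rng = np.random.default_rng(2)
N, D, C = 20, 4, 3
X = rng.normal(size=(N, D))
Y = np.eye(C)[rng.integers(C, size=N)]
W = rng.normal(size=(C, D))

# Differentiate the gradient w.r.t. w_c numerically along each coordinate of w_c'.
eps, c, cp = 1e-6, 0, 1
H_num = np.empty((D, D))
for d in range(D):
    Wp, Wm = W.copy(), W.copy()
    Wp[cp, d] += eps
    Wm[cp, d] -= eps
    H_num[:, d] = (grad_wc(Wp, X, Y, c) - grad_wc(Wm, X, Y, c)) / (2 * eps)

print(np.allclose(hessian_block(W, X, c, cp), H_num, atol=1e-6))  # expect True
```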
