Exercise 3.19 - Irrelevant features with naive Bayes

Answers

The log-likelihood is defined by:

$$\log p(\mathbf{x}_i \mid c, \boldsymbol{\theta}) = \sum_{w=1}^{W} x_{iw} \log \frac{\theta_{cw}}{1 - \theta_{cw}} + \sum_{w=1}^{W} \log(1 - \theta_{cw}).$$

More succinctly:

$$\log p(\mathbf{x}_i \mid c, \boldsymbol{\theta}) = \phi(\mathbf{x}_i)^T \beta_c,$$

where:

$$\phi(\mathbf{x}_i) = (\mathbf{x}_i, 1)^T,$$

$$\beta_c = \left( \log \frac{\theta_{c1}}{1 - \theta_{c1}}, \,\dots,\, \log \frac{\theta_{cW}}{1 - \theta_{cW}}, \,\sum_{w=1}^{W} \log(1 - \theta_{cw}) \right)^T.$$
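
As a quick sanity check, the linear form $\phi(\mathbf{x}_i)^T \beta_c$ should reproduce the direct sum. A minimal NumPy sketch (all variable names are illustrative):

```python
import numpy as np

# Minimal check that phi(x_i)^T beta_c equals the direct log-likelihood.
rng = np.random.default_rng(0)
W = 5                                # vocabulary size
theta_c = rng.uniform(0.1, 0.9, W)   # class-conditional Bernoulli parameters
x_i = rng.integers(0, 2, W)          # bag-of-words bit vector for document i

# Direct evaluation of the log-likelihood.
direct = (x_i * np.log(theta_c / (1 - theta_c))).sum() + np.log(1 - theta_c).sum()

# Linear form: phi(x_i) = (x_i, 1), beta_c = (per-word log odds, constant).
phi = np.append(x_i, 1.0)
beta_c = np.append(np.log(theta_c / (1 - theta_c)), np.log(1 - theta_c).sum())

assert np.isclose(direct, phi @ beta_c)
```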

For question (a), assuming a uniform class prior $p(c=1) = p(c=2)$:

$$\log \frac{p(c=1 \mid \mathbf{x}_i)}{p(c=2 \mid \mathbf{x}_i)} = \log \frac{p(c=1)\, p(\mathbf{x}_i \mid c=1)}{p(c=2)\, p(\mathbf{x}_i \mid c=2)} = \log \frac{p(\mathbf{x}_i \mid c=1)}{p(\mathbf{x}_i \mid c=2)} = \phi(\mathbf{x}_i)^T (\beta_1 - \beta_2).$$

For question (b), with:

$$\log \frac{p(c=1 \mid \mathbf{x}_i)}{p(c=2 \mid \mathbf{x}_i)} = \log \frac{p(c=1)}{p(c=2)} + \phi(\mathbf{x}_i)^T (\beta_1 - \beta_2),$$

a word $w$ does not affect the log posterior odds as long as its term vanishes for every document:

$$x_{iw} (\beta_{1,w} - \beta_{2,w}) = 0,$$

i.e., $\beta_{1,w} = \beta_{2,w}$.

Hence, because the log-odds map $\theta \mapsto \log \frac{\theta}{1 - \theta}$ is strictly increasing, this holds exactly when:

$$\theta_{1,w} = \theta_{2,w},$$

in which case word $w$ cannot affect the classification decision: its coefficient in $\beta_1 - \beta_2$ vanishes, and its contribution to the constant term $\sum_{w} \log(1 - \theta_{cw})$ cancels between the two classes. That is to say, $w$ appears in classes 1 and 2 with the same frequency.
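
To see this concretely, here is a small sketch (illustrative names, uniform prior assumed) showing that when $\theta_{1,w} = \theta_{2,w}$ for some word, toggling that word in a document leaves the log posterior odds unchanged:

```python
import numpy as np

# If theta_{1,w} == theta_{2,w} for word w, flipping x_{iw} leaves the
# log posterior odds unchanged (uniform class prior assumed).
rng = np.random.default_rng(1)
W = 5
theta1 = rng.uniform(0.1, 0.9, W)
theta2 = rng.uniform(0.1, 0.9, W)
theta2[0] = theta1[0]              # make word 0 "irrelevant"

def log_odds(x, t1, t2):
    # phi(x)^T (beta_1 - beta_2) for a Bernoulli naive Bayes model.
    beta = lambda t: np.append(np.log(t / (1 - t)), np.log(1 - t).sum())
    return np.append(x, 1.0) @ (beta(t1) - beta(t2))

x = rng.integers(0, 2, W)
x_flip = x.copy()
x_flip[0] ^= 1                     # toggle the irrelevant word
assert np.isclose(log_odds(x, theta1, theta2), log_odds(x_flip, theta1, theta2))
```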

For question (c), for a word that appears in documents of neither class, the posterior mean estimates under a Beta(1,1) prior are:

$$\hat{\theta}_{1,w} = \frac{1}{2 + N_1}, \qquad \hat{\theta}_{2,w} = \frac{1}{2 + N_2}.$$

These differ whenever $N_1 \neq N_2$, so even an unseen word still shifts the log posterior odds through the constant terms $\log(1 - \hat{\theta}_{c,w})$. This bias shrinks as $N_1$ and $N_2$ grow large.
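
A short numeric illustration of this shrinkage (assuming, as above, a Beta(1,1) prior and a word absent from both classes):

```python
import numpy as np

# Offset the unseen word adds to the log odds:
# log(1 - 1/(2+N1)) - log(1 - 1/(2+N2)); it vanishes as N1, N2 grow.
for N1, N2 in [(10, 100), (1_000, 10_000), (100_000, 1_000_000)]:
    t1, t2 = 1 / (2 + N1), 1 / (2 + N2)
    offset = np.log(1 - t1) - np.log(1 - t2)
    print(f"N1={N1:>9,}, N2={N2:>9,}, offset={offset:+.6f}")
```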

For question (d), an information-theoretic approach would be a solid option: rank each word by its mutual information with the class label and discard the low-scoring, irrelevant words.
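
For instance, one could estimate the mutual information $I(x_w; c)$ between each binary word indicator and the class label from counts, then keep only the top-ranked words. The sketch below is one illustrative way to do this; the function name and data are made up for demonstration:

```python
import numpy as np

def mutual_information(X, y, eps=1e-12):
    """X: (N, W) binary word indicators; y: (N,) class labels in {0, 1}."""
    mi = np.zeros(X.shape[1])
    for xv in (0, 1):                  # word absent / present
        for yv in (0, 1):              # class label
            p_xy = ((X == xv) & (y[:, None] == yv)).mean(axis=0)
            p_x = (X == xv).mean(axis=0)
            p_y = (y == yv).mean()
            mi += p_xy * np.log((p_xy + eps) / (p_x * p_y + eps))
    return mi

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 200)
X = rng.integers(0, 2, (200, 4))       # three noise words ...
X[:, 0] = y                            # ... and one perfectly informative word
print(np.argsort(mutual_information(X, y))[::-1])  # word 0 ranks first
```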
