Exercise 3.19 - Irrelevant features with naive Bayes

Answers

The log-likelihood is defined by:

$$\log p(\mathbf{x}_i \mid c, \boldsymbol{\theta}) = \sum_{w=1}^{W} x_{iw} \log \frac{\theta_{cw}}{1 - \theta_{cw}} + \sum_{w=1}^{W} \log(1 - \theta_{cw}).$$

More succinctly:

$$\log p(\mathbf{x}_i \mid c, \boldsymbol{\theta}) = \phi(\mathbf{x}_i)^T \beta_c,$$

where:

$$\phi(\mathbf{x}_i) = (\mathbf{x}_i, 1)^T,$$

$$\beta_c = \left( \log \frac{\theta_{c1}}{1 - \theta_{c1}}, \,\dots,\, \log \frac{\theta_{cW}}{1 - \theta_{cW}}, \,\sum_{w=1}^{W} \log(1 - \theta_{cw}) \right)^T.$$
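
As a quick sanity check, the linear form $\phi(\mathbf{x}_i)^T \beta_c$ should reproduce the direct sum. A minimal NumPy sketch (all variable names are illustrative):

```python
import numpy as np

# Minimal check that phi(x_i)^T beta_c equals the direct log-likelihood.
rng = np.random.default_rng(0)
W = 5                                # vocabulary size
theta_c = rng.uniform(0.1, 0.9, W)   # class-conditional Bernoulli parameters
x_i = rng.integers(0, 2, W)          # bag-of-words bit vector for document i

# Direct evaluation of the log-likelihood.
direct = (x_i * np.log(theta_c / (1 - theta_c))).sum() + np.log(1 - theta_c).sum()

# Linear form: phi(x_i) = (x_i, 1), beta_c = (per-word log odds, constant).
phi = np.append(x_i, 1.0)
beta_c = np.append(np.log(theta_c / (1 - theta_c)), np.log(1 - theta_c).sum())

assert np.isclose(direct, phi @ beta_c)
```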

For question (a), assuming a uniform class prior $p(c=1) = p(c=2)$:

$$\log \frac{p(c=1 \mid \mathbf{x}_i)}{p(c=2 \mid \mathbf{x}_i)} = \log \frac{p(c=1)\, p(\mathbf{x}_i \mid c=1)}{p(c=2)\, p(\mathbf{x}_i \mid c=2)} = \log \frac{p(\mathbf{x}_i \mid c=1)}{p(\mathbf{x}_i \mid c=2)} = \phi(\mathbf{x}_i)^T (\beta_1 - \beta_2).$$

For question (b), with:

$$\log \frac{p(c=1 \mid \mathbf{x}_i)}{p(c=2 \mid \mathbf{x}_i)} = \log \frac{p(c=1)}{p(c=2)} + \phi(\mathbf{x}_i)^T (\beta_1 - \beta_2),$$

a word $w$ does not affect the log posterior odds as long as its term vanishes for every document:

$$x_{iw} (\beta_{1,w} - \beta_{2,w}) = 0,$$

i.e., $\beta_{1,w} = \beta_{2,w}$.

Hence, because the log-odds map $\theta \mapsto \log \frac{\theta}{1 - \theta}$ is strictly increasing, this holds exactly when:

$$\theta_{1,w} = \theta_{2,w},$$

in which case word $w$ cannot affect the classification decision: its coefficient in $\beta_1 - \beta_2$ vanishes, and its contribution to the constant term $\sum_{w} \log(1 - \theta_{cw})$ cancels between the two classes. That is to say, $w$ appears in classes 1 and 2 with the same frequency.
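
To see this concretely, here is a small sketch (illustrative names, uniform prior assumed) showing that when $\theta_{1,w} = \theta_{2,w}$ for some word, toggling that word in a document leaves the log posterior odds unchanged:

```python
import numpy as np

# If theta_{1,w} == theta_{2,w} for word w, flipping x_{iw} leaves the
# log posterior odds unchanged (uniform class prior assumed).
rng = np.random.default_rng(1)
W = 5
theta1 = rng.uniform(0.1, 0.9, W)
theta2 = rng.uniform(0.1, 0.9, W)
theta2[0] = theta1[0]              # make word 0 "irrelevant"

def log_odds(x, t1, t2):
    # phi(x)^T (beta_1 - beta_2) for a Bernoulli naive Bayes model.
    beta = lambda t: np.append(np.log(t / (1 - t)), np.log(1 - t).sum())
    return np.append(x, 1.0) @ (beta(t1) - beta(t2))

x = rng.integers(0, 2, W)
x_flip = x.copy()
x_flip[0] ^= 1                     # toggle the irrelevant word
assert np.isclose(log_odds(x, theta1, theta2), log_odds(x_flip, theta1, theta2))
```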

For question (c), for a word that appears in documents of neither class, the posterior mean estimates under a Beta(1,1) prior are:

$$\hat{\theta}_{1,w} = \frac{1}{2 + N_1}, \qquad \hat{\theta}_{2,w} = \frac{1}{2 + N_2}.$$

These differ whenever $N_1 \neq N_2$, so even an unseen word still shifts the log posterior odds through the constant terms $\log(1 - \hat{\theta}_{c,w})$. This bias shrinks as $N_1$ and $N_2$ grow large.
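
A short numeric illustration of this shrinkage (assuming, as above, a Beta(1,1) prior and a word absent from both classes):

```python
import numpy as np

# Offset the unseen word adds to the log odds:
# log(1 - 1/(2+N1)) - log(1 - 1/(2+N2)); it vanishes as N1, N2 grow.
for N1, N2 in [(10, 100), (1_000, 10_000), (100_000, 1_000_000)]:
    t1, t2 = 1 / (2 + N1), 1 / (2 + N2)
    offset = np.log(1 - t1) - np.log(1 - t2)
    print(f"N1={N1:>9,}, N2={N2:>9,}, offset={offset:+.6f}")
```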

For question (d), an information-theoretic approach would be a solid option: rank each word by its mutual information with the class label and discard the low-scoring, irrelevant words.
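
For instance, one could estimate the mutual information $I(x_w; c)$ between each binary word indicator and the class label from counts, then keep only the top-ranked words. The sketch below is one illustrative way to do this; the function name and data are made up for demonstration:

```python
import numpy as np

def mutual_information(X, y, eps=1e-12):
    """X: (N, W) binary word indicators; y: (N,) class labels in {0, 1}."""
    mi = np.zeros(X.shape[1])
    for xv in (0, 1):                  # word absent / present
        for yv in (0, 1):              # class label
            p_xy = ((X == xv) & (y[:, None] == yv)).mean(axis=0)
            p_x = (X == xv).mean(axis=0)
            p_y = (y == yv).mean()
            mi += p_xy * np.log((p_xy + eps) / (p_x * p_y + eps))
    return mi

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 200)
X = rng.integers(0, 2, (200, 4))       # three noise words ...
X[:, 0] = y                            # ... and one perfectly informative word
print(np.argsort(mutual_information(X, y))[::-1])  # word 0 ranks first
```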
