Exercise 11.3 - EM for mixtures of Bernoullis

Answers

For the mixture of Bernoullis model, consider $K$ base models, each of which is a Bernoulli distribution:

$$\mathrm{Ber}(x \mid \theta_k) = \theta_k^{\mathbb{I}(x=1)} (1 - \theta_k)^{\mathbb{I}(x=0)}.$$

The auxiliary function, which we are to optimize w.r.t. $\theta$, is:

$$
\begin{aligned}
Q(\theta, \theta^{\text{old}})
&= \mathbb{E}_{p(\mathbf{z} \mid \mathcal{D}, \theta^{\text{old}})}\!\left[ \sum_{n=1}^{N} \log p(x_n, \mathbf{z}_n \mid \theta) \right] \\
&= \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{E}[z_{nk}] \big( \log \pi_k + \mathbb{I}(x_n = 1) \log \theta_k + \mathbb{I}(x_n = 0) \log (1 - \theta_k) \big) \\
&= \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \big( \log \pi_k + \mathbb{I}(x_n = 1) \log \theta_k + \mathbb{I}(x_n = 0) \log (1 - \theta_k) \big).
\end{aligned}
$$
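Here $r_{nk} \equiv \mathbb{E}[z_{nk}]$ is the responsibility computed in the E-step. For completeness (this step is standard EM and not part of the question), it is

$$r_{nk} = p(z_{nk} = 1 \mid x_n, \theta^{\text{old}}) = \frac{\pi_k^{\text{old}} \, \mathrm{Ber}(x_n \mid \theta_k^{\text{old}})}{\sum_{j=1}^{K} \pi_j^{\text{old}} \, \mathrm{Ber}(x_n \mid \theta_j^{\text{old}})}.$$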

Taking the derivative w.r.t. $\theta_k$,

$$\frac{\partial Q}{\partial \theta_k} = \sum_{n=1}^{N} r_{nk} \left( \mathbb{I}(x_n = 1) \frac{1}{\theta_k} - \mathbb{I}(x_n = 0) \frac{1}{1 - \theta_k} \right),$$

and setting it to zero (using $\mathbb{I}(x_n = 1) + \mathbb{I}(x_n = 0) = 1$ when solving for $\theta_k$) gives:

$$\theta_k = \frac{\sum_{n=1}^{N} r_{nk} \, \mathbb{I}(x_n = 1)}{\sum_{n=1}^{N} r_{nk}}.$$
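To make the updates concrete, here is a minimal NumPy sketch of one EM iteration for this model, assuming scalar observations $x_n \in \{0, 1\}$; the function name `em_step` and the `eps` guard against empty components are my own additions, not part of the exercise:

```python
import numpy as np

def em_step(x, pi, theta, eps=1e-12):
    """One EM iteration for a K-component mixture of (scalar) Bernoullis.

    x:     (N,) array of 0/1 observations
    pi:    (K,) mixing weights
    theta: (K,) Bernoulli parameters
    """
    # E-step: responsibilities r[n, k] proportional to pi_k * Ber(x_n | theta_k)
    lik = np.where(x[:, None] == 1, theta[None, :], 1.0 - theta[None, :])  # (N, K)
    r = pi[None, :] * lik
    r /= r.sum(axis=1, keepdims=True)

    # M-step: the weighted-MLE updates derived above
    Nk = r.sum(axis=0)                                  # effective count per component
    pi_new = Nk / len(x)                                # standard mixing-weight update
    theta_new = (r * x[:, None]).sum(axis=0) / (Nk + eps)
    return pi_new, theta_new
```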

This is exactly (11.116), modulo α-conversion (a renaming of symbols).

If a $\mathrm{Beta}(\alpha_k, \beta_k)$ prior is introduced for each base model, then we in effect add $\alpha_k - 1$ positive and $\beta_k - 1$ negative pseudo-samples to the computation. This is tantamount to setting $r_{nk} = 1$ for $n = N + 1, \dots, N + \alpha_k + \beta_k - 2$, so:

$$\theta_k = \frac{\sum_{n=1}^{N} r_{nk} \, \mathbb{I}(x_n = 1) + \alpha_k - 1}{\sum_{n=1}^{N} r_{nk} + \alpha_k + \beta_k - 2}.$$
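The MAP update is a small modification of the M-step in the sketch above (again hypothetical helper code, under the same scalar-data assumptions); note that $\alpha_k = \beta_k = 1$ recovers the MLE update:

```python
import numpy as np

def m_step_map(r, x, alpha, beta):
    """MAP M-step under Beta(alpha_k, beta_k) priors.

    r:           (N, K) responsibilities from the E-step
    x:           (N,) array of 0/1 observations
    alpha, beta: (K,) prior pseudo-count parameters
    """
    # With alpha_k = beta_k = 1 this reduces to the MLE update above.
    num = (r * x[:, None]).sum(axis=0) + alpha - 1.0
    den = r.sum(axis=0) + alpha + beta - 2.0
    return num / den
```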

At this point one might wonder about the necessity of introducing a mixture of Bernoullis at all. Unlike the mixture of Gaussians, the Bernoulli mixture seems less compelling. Let $\theta$ denote the weighted average of the base parameters:

$$\theta = \sum_k \pi_k \theta_k,$$

then the marginal of the mixture is exactly $\mathrm{Ber}(\theta)$, so its variance remains $\theta - \theta^2$ (since $x^2 = x$ for binary $x$, $\mathrm{Var}[x] = \mathbb{E}[x] - \mathbb{E}[x]^2$); a numerical check follows below. Hence there is no need to use a mixture of Bernoullis (as far as prediction goes) unless we must explicitly model a scenario with genuine mixture structure. For example, suppose we are told that a binary string is generated by a set of biased coins, each with its own dynamics, and we are asked to infer which coin generated a specific toss. But even this scenario can be problematic: consider one coin that always yields heads and another that always yields tails.
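A quick simulation illustrates that the mixture's marginal is indistinguishable from a single Bernoulli with the averaged parameter (the weights, parameters, and seed below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])
theta_k = np.array([0.9, 0.2])
theta = pi @ theta_k                      # weighted average, as above: 0.41

# Sample from the mixture: pick a component, then flip its coin.
z = rng.choice(2, size=100_000, p=pi)
x = rng.random(100_000) < theta_k[z]

print(x.mean(), theta)                    # both ~0.41
print(x.var(), theta - theta**2)          # both ~0.2419
```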
