Exercise 6.8

Answers

(a)
Pseudo code for the recursive branch and bound search for the nearest neighbor.
  • Given a test point x, we search for its nearest neighbor in the data set D.
  • We look for algorithm called F(S,x) to find the nearest neighbor of x in a given cluster S. Let’s initialize with the whole data set as our first cluster D. From clustering algorithm, we should know its two children clusters, their centers, μ1,μ2, and radii r1,r2.
  • Here is how we implement F(S,x):

    (a)
    Set the nearest neighbor be xnn = None as in python.
    (b)
    If the input cluster has no children cluster, i.e. it has at most 2 data points
    • Find the nearest neighbor in the cluster by computing the distance from x to all the points in the cluster. There are at most 2 computations needed since we assume every cluster with more than 2 points is split into 2 sub-clusters.
    • Return this point as the nearest neighbor xnn in the input cluster to x
    • Else, suppose the two children clusters of input cluster are: S1,S2
    • If x μ1x μ2: let xnn = F(S1,x)
    • Else: let xnn = F(S2,x)
    (c)
    Now we have xnn, let’s call the sibling cluster of S be S, let’s compute the bound condition
    (d)
    If x xnn > x μ r: (Need check cluster S for the nearest neighbor)
    • xnn = F(S,x)
    • If x xnn < x xnn: xnn = xnn
    (e)
    Return xnn
  • We call F(S,x) with S = D to get the nearest neighbor in the data set to x
(b)
If the sub-clusters of a cluster are balanced, and assume the number of data points is power of 2, i.e. N = 2k, the maximum depth of the recursive search for a nearest neighbor can be log 2N = k
(c)
If we have balanced sub-clusters and if the bound conditions always hold, then the if-condition in step 5 will be always False, so we just need to go through log 2N steps, each with a computation of d for distance in the Rd space. The total cost is thus dlog 2N
(d)
To apply the branch and bound algorithm to k-nearest neighbors, assume that every cluster with more than k points is split into 2 sub-clusters. Instead of search for the nearest neighbor in a cluster, we search for k-nearest neighbors in a clusters, then apply the bound condition on the k-th neighbor with another cluster.
User profile picture
2021-12-08 09:44
Comments