Unfolded Insert

  • May 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Unfolded Insert as PDF for free.

More details

  • Words: 1,340
  • Pages: 5
Data layout of a cover tree and how to get it: the insertion algorithm Alcides Viamontes Esquivel July 12, 2009

This post continues the insights on Cover - Trees, the data structure presented in [1] for nearest neighbour searches in arbitrary-dimensional metric spaces. trees.

I will focus in getting an explicit, storable representation for cover However, it is important to readily obtain the implict representation

back, because many applications can be better stated with it. Let's remember a few points:



In the implicit representation of a cover tree, each point has an associated chain of nodes, starting at a certain level and going down to minus innite.



Each chain has a head, i.e. a node which is child of a node in another chain.

Each chain has an innite number of nodes but the ammount of information it stores is nite. First, each chain should store the level of its head. And then, it have to store a reference to each children chain. With that information it is possible to get back the innite structure of the implicit representation. Now let's examine how a node is inserted. Algorithm 1 is the point insertion algorithm published in [1]. Its recursive nature implies the use of a stack holding, in this case, the sets of nodes

Qi

on each level.

1

Algorithm 1: Generic insertion step in a cover tree Input: point p, cover set Qi , level i Q = {Children(q) : q ∈ Qi } // Q is at level i − 1 if d(p, Q) > 2i then // This means that p can't be at level i − 1. // Q also contains all nodes in Qi (covering); for // them this is a valid separation condition.

return true else 

Qi−1 = q ∈ Q : d(p, q) 6 2i // Qi−1 contains the nodes in level i − 1 for whom // we have to test separation. found = Insert(p , Qi−1 , i − 1) if found and d(p, Qi ) 6 2i then // Level i − 1 is a valid level for p, so proceed // to find a parent in level i. i pick a single q ∈ Qi such that d(p, q) 6 2 insert p into Children(q)

return false else return found end end

Algorithm 1 uses children of nodes in their implicit sense, thus it is important to be able to retrieve, of all children chains of a given one, those that begin at a certain level.

A chain may have children chains in dierent levels, which

underlines the need of

classifying

children references by level.

Depending on

the selected programming language and libraries, that would ammount to using a

multiset

or a sorted array for storing references. Because insertion of random

children references may be necessary, a tree or hash based multiset is better suited than a sorted array. If we were to code it in C++, a possible class for chains is presented in gure 1. Standard multisets are tree-based, supporting asymptotic

O(log n) O(n).

time for

insertions and searches, while keeping space usage bounded by

Let's go back to algorithm 1. See a ow diagram at gure 2 for this algorithm. It might be an easy way of understing the insertion process, or the result of a historic derivation.

At coding time however, the use of an implicit stack

is expensive, because registers have to be saved or restored between context switches. Note the existence of two implicit loops, both of them related to the recursive invocation: one that runs when entering, the other when returning. An explicit stack and the removal of recursivity improves the situation a little, but still memory is required for storing entire sets, and more memory means more cache misses. Algorithm 2 shows recursivity unfolded. There both loops and the stack are explicit.

2

template



template



c h a i n

c h a i n

) {

return

c1

−>

struct

chain ;

bool c h a i n _ s o r t ( const ∗ c1 , const ∗ c 2

f i r s t _ l e v e l < c2

−>

first_level

;

}

template

struct c h a i n { int f i r s t _ l e v e l ; std : : multiset < chain ∗ > children_chains ; Point

p;

};

Figure 1: Candidate structure for chains in a cover tree

Figure 2: This gure shows the recursive insertion algorithm. Note the existence of two implicit loops, colored in red and green.

Thin lines represent explicit

paths in the code, and gross lines denote implicit control ow across the recursive invocation.

3

Algorithm 2 portraits the use of an optimization feature: the use of a 'break' in the second loop. Its use is possible because once a parent is found for a chain, there is nothing else to do. In the other side, there is no explicit way of breaking the green loop of gure 2, except perhaps throwing an exception. And there are little chances that even a sophisticated functional compiler can optimize with tail invocation superous returns in the green loop.

Algorithm 2: Non recursive insert algoritm Input: p (point to insert) stack= empty stack (stack) i = higher level of root chain with direct Qi = {ri } where ri is node of root chain

while true do

children (integer) at level

i

(set of nodes)

// Meaning of i: the level currently considered. // Meaning of Qi : a set of nodes such that d(p, q) 6 2i+1 . Q = {Children(q) : q ∈ Qi } // Q is at level i-1 if d(p, Q) > 2i then

else

break

stack.push(Qi ) // Update i and Qi for the level below; // note that i is updated after Qi .  Qi = q ∈ Q : d(p, q) 6 2i i=i-1

end end while true do i += 1

// Go up one level

stack.pop() if d(p, Qi ) ≤ 2i then i Pick q ∈ Qi such that d(p, q) 6 2 Make p a child of q

Qi

=

break end end

It is possible to further exploit the just-one-parent issue and store a suitable candidate in each run of the rst loop, eliminating the stack and the second loop. Such idea is depicted in algorithm 3.

1 over 1.

If implemented, algorithm 3 might yield a considerable speedup

Note that these results are ultimately dependent on the implemention, the target dataset and the the costs of computing distances between points.

1 In

my implementation, using a cheap

l1

metric in a one-dimensional dataset, algorithm 3

was approximately two times faster than 1

4

Algorithm 3: Optimized, non recursive insert algoritm Input: p (point to insert) Input: r (root chain) Input: imax (level high enough such that d(p, r) < 2imax i = imax Q∗ = {r} candidateparent

while true do

=

)(integer)

(r, imax + 1)

// i: the level currently considered. // Q∗ : every node q at level i such that d(p, q) 6 2i+1 , Q = {Children(q) : q ∈ Q∗ } // Q is at level i-1 // Q∗ is Qi till here Q∗ = Ø // Q∗ is Qi−1 from here foreach q ∈ Q do dq = d(q, p) if dq 6 2i then Add q to Q∗ if rst level of q ≥ i then // Previous condition implies that q ∈ Qi candidateparent = (q, i)

end end end if Q∗ = Ø then break else i=i−1

end end

(q, i) = candidateparent Insert p as child of q , level

of

q

is

i

and level of

p

is

i−1

References [1] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. 2006.

5


Related Documents

Unfolded Insert
May 2020 3
Insert
August 2019 38
Insert
July 2020 18
Fluverin Insert
June 2020 14
Insert 2
April 2020 15