Analysis of the Clustering Properties of the Hilbert Space-Filling Curve

Bongki Moon H.V. Jagadish Christos Faloutsos Joel H. Saltz

Abstract

Several schemes for linear mapping of a multidimensional space have been proposed for various applications such as access methods for spatio-temporal databases and image compression. In these applications, one of the most desired properties of such linear mappings is clustering, which means the locality between objects in the multidimensional space is preserved in the linear space. It is widely believed that the Hilbert space-filling curve achieves the best clustering [1, 14]. In this paper, we analyze the clustering property of the Hilbert space-filling curve by deriving closed-form formulas for the number of clusters in a given query region of an arbitrary shape (e.g., polygons and polyhedra). Both the asymptotic solution for the general case and the exact solution for a special case generalize previous work [14]. They agree with the empirical results that the number of clusters depends on the hyper-surface area of the query region and not on its hyper-volume. We also show that the Hilbert curve achieves better clustering than the z curve. From a practical point of view, the formulas given in this paper provide a simple measure that can be used to predict the required disk access behaviors and hence the total access time.

Index Terms: locality-preserving linear mapping, range queries, multi-attribute access methods,

data clustering, Hilbert curve, space-filling curves, fractals.

1 Introduction

The design of multidimensional access methods is difficult compared to one-dimensional cases because there is no total ordering that preserves spatial locality. Once a total ordering is found for a given spatial

This work was sponsored in part by National Science Foundation CAREER Award IIS-9876037, and grants IRI9625428 and DMS-9873442, by the Advanced Research Projects Agency under contract No. DABT63-94-C-0049. It was also supported by NSF Cooperative Agreement No. IRI-9411299, and by DARPA/ITO through Order F463, issued by ESC/ENS under contract N66001-97-C-851. Additional funding was provided by donations from NEC and Intel. The authors assume all responsibility for the contents of the paper.


or multidimensional database, one can use any one-dimensional access method (e.g., the B+-tree), which may yield good performance for multidimensional queries. An interesting application of the ordering arises in a multidimensional indexing technique proposed by Orenstein [19]. The idea is to develop a single numeric index on a one-dimensional space for each point in a multidimensional space, such that for any given object, the range of indices, from the smallest index to the largest, includes few points not in the object itself. Consider a linear traversal or a typical range query for a database where record signatures are mapped with multi-attribute hashing [24] to buckets stored on disk. The linear traversal specifies the order in which the objects are fetched from disk as well as the number of blocks fetched. The number of nonconsecutive disk accesses will be determined by the order of blocks fetched. Although the order of blocks fetched is not explicitly specified in the range query, it is reasonable to assume that the set of blocks fetched can be rearranged into a number of groups of consecutive blocks by a database server or disk controller mechanism [25]. Since it is more efficient to fetch a set of consecutive disk blocks rather than a randomly scattered set in order to reduce additional seek time, it is desirable that objects close together in a multidimensional attribute space also be close together in the one-dimensional disk space. A good clustering of multidimensional data points on the one-dimensional sequence of disk blocks may also reduce the number of disk accesses that are required for a range query. In addition to the applications described above, several other applications also benefit from a linear mapping that preserves locality:

1. In traditional databases, a multi-attribute data space must be mapped into a one-dimensional disk space to allow efficient handling of partial-match queries [22]; in numerical analysis, large multidimensional arrays [6] have to be stored on disk, which is a linear structure.

2. In image compression, a family of methods use a linear mapping to transform an image into a bit string; subsequently, any standard compression method can be applied [18]. A good clustering of pixels will result in fewer long runs of similar pixel values, thereby improving the compression ratio.

3. In geographic information systems (GIS), run-encoded forms of image representations are ordering-sensitive, as they are based on representations of the image as sets of runs [1].

4. Heuristics for computational geometry problems use a linear mapping. For example, for the traveling salesman problem, the cities are linearly ordered and visited accordingly [2].

5. Locality-preserving mappings are used for bandwidth reduction of digitally sampled signals [4] and for graphics display generation [20].

6. In scientific parallel processing, locality-preserving linearization techniques are widely used for dynamic unstructured mesh partitioning [17].

Sophisticated mapping functions have been proposed in the literature. One based on interleaving bits from the coordinates, which is called z-ordering, was proposed in [19]. An improvement was suggested by Faloutsos [8], using Gray coding on the interleaved bits. A third method, based on the Hilbert curve [13], was proposed for secondary key retrieval [11]. In the mathematical context, these three mapping functions are based on different space-filling curves: the z curve, the Gray-coded curve and the Hilbert curve, respectively. Figure 1 illustrates the linear orderings yielded by the space-filling curves for a 4×4 grid.
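Of the three mappings, z-ordering is the simplest to state: the linear index of a cell is obtained by interleaving the bits of its coordinates. A minimal sketch (the function name and the bit convention are our own illustration, not taken from the paper):

```python
def z_index(x, y, bits):
    """z-ordering: interleave the bits of x and y.

    Bit i of x goes to position 2*i and bit i of y to position 2*i + 1;
    other conventions differ only by swapping the roles of x and y.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Linear order of the cells of a 4x4 grid under the z curve
order = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda p: z_index(p[0], p[1], 2))
```

The Gray-coded curve applies a Gray code to the interleaved bits, and the Hilbert curve additionally rotates and reflects subquadrants, which is what makes it harder to compute but, as analyzed below, better at clustering.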

Figure 1: Illustration of space-filling curves: the z curve, the Gray-coded curve and the Hilbert curve
It was shown that under most circumstances, the linear mapping based on the Hilbert space-filling curve outperforms the others in preserving locality [14]. In this paper, we provide analytic results on the clustering effects of the Hilbert space-filling curve, focusing on arbitrarily shaped range queries, which require the retrieval of all objects inside a given hyper-rectangle or polyhedron in multidimensional space. For purposes of analysis, we assume a multidimensional space with finite granularity, where each point corresponds to a grid cell. The Hilbert space-filling curve imposes a linear ordering on the grid cells, assigning a single integer value to each cell. Ideally, it is desirable to have mappings that result in fewer disk accesses. The number of disk accesses, however, depends on several factors such as the capacity of the disk pages, the splitting algorithm, the insertion order and so on. Here we use the average number of clusters, or continuous runs, of grid points within a subspace representing a query region, as the measure of the clustering performance of the Hilbert curve. If each grid point is mapped to one disk block, this measure exactly corresponds to the number of non-consecutive disk accesses, which involve additional seek time. This measure is also highly correlated with the number of disk blocks accessed, since (with many grid points in a disk block) consecutive points are likely to be in the same block, while points across a discontinuity are likely to be in different blocks. This measure is used only to render the analysis tractable, and some weaknesses of this measure were discussed in [14].


Figure 2: Illustration of: (a) two clusters for the z curve, (b) one cluster for the Hilbert curve

Definition 1.1 Given a d-dimensional query, a cluster is defined to be a group of grid points inside the query that are consecutively connected by a mapping (or a curve).

For example, there are two clusters in the z curve (Figure 2(a)) but only one in the Hilbert curve (Figure 2(b)) for the same 2-dimensional rectangular query S_x × S_y. Now, the problem we will investigate is formulated as follows:

Given a d-dimensional rectilinear polyhedron representing a query, find the average number of clusters inside the polyhedron for the Hilbert curve.

The definition of the d-dimensional rectilinear polyhedron is given in Section 3. Note that in the d-dimensional space with finite granularity, for any d-dimensional object such as spheres, ellipsoids, quadric cones and so on, there exists a corresponding (rectilinear) polyhedron that contains exactly the same set of grid points inside the given object. Thus, the solution to the problem above will cover more general cases concerning any simple connected object of arbitrary shape.

The rest of the paper is organized as follows. Section 2 surveys historical work on space-filling curves and other related analytic studies. Section 3 presents an asymptotic formula for the average number of clusters for d-dimensional range queries of arbitrary shape. Section 4 derives a closed-form exact formula for the average number of clusters in a 2-dimensional space. In Section 5, we provide empirical evidence to demonstrate the correctness of the analytic results for various query shapes. Finally, in Section 6, we discuss the contributions of this paper and suggest future work.
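Under Definition 1.1, counting clusters is mechanical once the linear mapping is fixed: sort the linear indices of the grid points inside the query and count the maximal runs of consecutive values. A small sketch (the helper is our own illustration, assuming the mapping is given as a collection of integer indices):

```python
def count_clusters(indices):
    """Number of clusters (Definition 1.1): maximal runs of consecutive
    linear indices among the grid points of the query region."""
    idx = sorted(indices)
    # A new cluster starts at the first index, and wherever the
    # current index is not the successor of the previous one.
    return sum(1 for i, v in enumerate(idx) if i == 0 or v != idx[i - 1] + 1)

# e.g. indices {3, 4, 5, 9, 10} form two clusters: {3,4,5} and {9,10}
```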

2 Historical Survey and Related Work

G. Peano, in 1890, discovered the existence of a continuous curve which passes through every point of a closed square [21]. According to Jordan's precise notion (in 1887) of continuous curves, Peano's curve is a continuous mapping of the closed unit interval I = [0, 1] into the closed unit square S = [0, 1]^2. Curves of this type have come to be called Peano curves or space-filling curves [28]. Formally,

Definition 2.1 If a mapping f : I → E^n (n ≥ 2) is continuous, and f(I), the image of I under f, has positive Jordan content (area for n = 2 and volume for n = 3), then f(I) is called a space-filling curve. E^n denotes an n-dimensional Euclidean space.

Although G. Peano discovered the first space-filling curve, it was D. Hilbert in 1891 who was the first to recognize a general geometric procedure that allows the construction of an entire class of space-filling curves [13]. If the interval I can be mapped continuously onto the square S, then after partitioning I into four congruent subintervals and S into four congruent subsquares, each subinterval can be mapped continuously onto one of the subsquares. If this is carried on ad infinitum, I and S are partitioned into 2^{2n} congruent replicas for n = 1, 2, 3, …, ∞. Hilbert demonstrated that the subsquares can be arranged so that the inclusion relationships are preserved, that is, if a square corresponds to an interval, then its subsquares correspond to the subintervals of that interval. Figure 3 describes how this process is to be carried out for the first three steps. It has been shown that the Hilbert curve is a continuous, surjective and nowhere differentiable mapping [26]. However, Hilbert gave the space-filling curve, in a geometric form only, for mapping I into S (i.e., 2-dimensional Euclidean space). The generation of a 3-dimensional Hilbert curve was described in [14, 26]. A generalization of the Hilbert curve, in an analytic form, for higher dimensional spaces was given in [5]. In this paper, a d-dimensional Euclidean space with finite granularity is assumed. Thus, we use the k-th order approximation of a d-dimensional Hilbert space-filling curve (k ≥ 1 and d ≥ 2), which maps an integer set [0, 2^{kd} − 1] into a d-dimensional integer space [0, 2^k − 1]^d.

Figure 3: The first three steps of the Hilbert space-filling curve: (a) first step, (b) second step, (c) third step

Notation 2.1 For k ≥ 1 and d ≥ 2, let H_k^d denote the k-th order approximation of a d-dimensional Hilbert space-filling curve, which maps [0, 2^{kd} − 1] into [0, 2^k − 1]^d.

The drawings of the first, second and third steps of the Hilbert curve in Figure 3 correspond to H_1^2, H_2^2 and H_3^2, respectively. Jagadish [14] compared the clustering properties of several space-filling curves by considering only 2×2 range queries. Among the z curve (2.625), the Gray-coded curve (2.5) and the Hilbert curve (2), the Hilbert curve was the best in minimizing the number of clusters. The numbers within the parentheses are the average numbers of clusters for 2×2 range queries. Rong and Faloutsos [23] derived a closed-form expression of the average number of clusters for the z curve, which gives 2.625 for 2×2 range queries (exactly the same as the result given in [14]) and in general approaches one third of the perimeter of the query rectangle plus two thirds of the side length of the rectangle in the unfavored direction. Jagadish [16] derived closed-form, exact expressions of the average number of clusters for the Hilbert curve in a 2-dimensional grid, but only for 2×2 and 3×3 square regions. This is a special case of the more general formulae derived in this paper. Abel and Mark [1] reported empirical studies to explore the relative properties of such mapping functions using various metrics. They reached the conclusion that the Hilbert ordering deserves closer attention as an alternative to the z curve ordering. Bugnion et al. estimated the average number of clusters and the distribution of inter-cluster intervals for 2-dimensional rectangular queries. They derived the estimations based on the fraction of vertical and horizontal edges of any particular space-filling curve. However, those fractions were provided only for a 2-dimensional space and without any calculation or formal verification. In this paper, we formally prove that, in a d-dimensional space, the d different edge directions approach the uniform distribution, as the order of the Hilbert curve approximation grows into infinity.

Several closely related analyses for the average number of 2-dimensional quadtree nodes have been presented in the literature. Dyer [7] presented an analysis for the best, worst and average case of a square of size 2^n × 2^n, giving an approximate formula for the average case. Shaffer [27] gave a closed formula for the exact number of blocks that such a square requires when anchored at a given position (x, y); he also gave a formula for the average number of blocks for such squares (averaged over all possible positions). Some of these formulae were generalized for arbitrary 2-dimensional and d-dimensional rectangles [9, 10].
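Jagadish's 2×2 figure for the Hilbert curve can be reproduced numerically. The sketch below uses the standard iterative index-to-coordinate computation for the 2-dimensional Hilbert curve (the `d2xy` formulation commonly given in the literature; the grid size and the window loop are our own test setup) and averages the number of clusters over all 2×2 window positions:

```python
def d2xy(n, d):
    """Map Hilbert index d to (x, y) on an n x n grid, n a power of two."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/reflect the subsquare
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

n = 64
index = {d2xy(n, t): t for t in range(n * n)}   # (x, y) -> Hilbert index

def clusters(cells):
    idx = sorted(index[c] for c in cells)
    return sum(1 for i, v in enumerate(idx) if i == 0 or v != idx[i - 1] + 1)

total = sum(clusters([(x + i, y + j) for i in (0, 1) for j in (0, 1)])
            for x in range(n - 1) for y in range(n - 1))
avg = total / ((n - 1) ** 2)   # approaches 2 as the grid grows
```

On a finite grid the average sits a little below 2, because windows near the grid boundary are cut less often; this gap is exactly what the exact analysis of Section 4 quantifies.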

3 Asymptotic Analysis

In this section, we give an asymptotic formula for the clustering property of the Hilbert space-filling curve for general polyhedra in a d-dimensional space. The symbols used in this section are summarized in Table 1. The polyhedra we consider here are not necessarily convex, but are rectilinear in the sense that any (d−1)-dimensional polygonal surface is perpendicular to one of the d coordinate axes.

Definition 3.1 A rectilinear polyhedron is bounded by a set V of polygonal surfaces each of which is perpendicular to one of the d coordinate axes, where V is a subset of R^d and homeomorphic¹ to a (d−1)-dimensional sphere S^{d−1}.

For d = 2 the set V is, by definition, a Jordan curve [3], which is essentially a simple closed curve in R^2. The set of surfaces of a polyhedron divides the d-dimensional space R^d into two connected components, which may be called the interior and the exterior. The basic intuition is that each cluster within a given polyhedron corresponds to a segment of the Hilbert curve, connecting a group of grid points in the cluster, which has two endpoints adjacent to the surface of the polyhedron. The number of clusters is then equal to half the number of endpoints of the segments bounded by the surface of the polyhedron. In other words,

Remark 3.1 The number of clusters within a given d-dimensional polyhedron is equal to the number of entries (or exits) of the Hilbert curve into (or from) the polyhedron.

¹Two subsets X and Y of Euclidean space are called homeomorphic if there exists a continuous bijective mapping f : X → Y with a continuous inverse f^{−1} [12].


Thus, we expect that the number of clusters is approximately proportional to the perimeter or hyper-surface area of the d-dimensional polyhedron (d ≥ 2). With this observation, the task is reduced to finding a constant factor of a linear function. Our approach to deriving the asymptotic solution largely depends on the self-similar nature of the Hilbert curve, which stems from the recursive process of the curve expansion. Specifically, we shall show in the following lemmas that the edges of d different orientations are uniformly distributed in a d-dimensional Euclidean space. That is, approximately one d-th of the edges are aligned to the i-th dimensional axis for each i (1 ≤ i ≤ d). Here we mean by edges the line segments of the Hilbert curve connecting two neighboring points. The uniform distribution of the edges provides key leverage for deriving the asymptotic solution. To show the uniform distribution, it is important to understand how the k-th order approximation of the Hilbert curve is derived from lower order approximations, and how the d-dimensional Hilbert curve is extended from the 2-dimensional Hilbert curve, which was described only in a geometric form in [13]. (Analytic forms for the d-dimensional Hilbert curves were presented in [5].) In a d-dimensional space, the k-th order approximation of the d-dimensional Hilbert curve H_k^d is derived from the first order approximation H_1^d by replacing each vertex in the H_1^d by H_{k−1}^d, which may be rotated about a coordinate axis and/or reflected about a hyperplane perpendicular to a coordinate axis. Since there are 2^d vertices in the H_1^d, the H_k^d is considered to be composed of 2^d H_{k−1}^d vertices and (2^d − 1) edges, each connecting two of them. Before describing the extension for the d-dimensional Hilbert curve, we define the orientations of H_k^d. Consider H_1^d, which consists of 2^d vertices and (2^d − 1) edges. No matter where the Hilbert curve starts its traversal, the coordinates of the start and end vertices of the H_1^d differ only in one dimension, meaning that both vertices lie on a line parallel to one of the d coordinate axes. We say that H_1^d is i-oriented if its start and end vertices lie on a line parallel to the i-th coordinate axis. For any k (k > 1), the orientation of H_k^d is equal to that of the H_1^d from which it is derived. Figure 4 and Figure 5 illustrate the processes that generate the 3-dimensional Hilbert curve from the 2-dimensional one, and the 4-dimensional from the 3-dimensional, respectively. In general, when the d-th dimension is added to the (d−1)-dimensional Hilbert curve, each vertex of H_1^{d−1} (i.e., H_{k−1}^{d−1}) is replaced by an H_{k−1}^d of the same orientation except the 2^{d−1}-th one (i.e., the end vertex of H_1^{d−1}), whose orientation is changed from 1-oriented to d-oriented, parallel to the d-th dimensional axis. For example, in Figure 5, the orientations of the two vertices connected by a dotted line have been

Figure 4: the 3-dimensional Hilbert curve (H_1^3 with vertices representing H_{k−1}^3 approximations, annotated by their orientations)

Figure 5: the 4-dimensional Hilbert curve (H_1^4 with vertices representing H_{k−1}^4 approximations, annotated by their orientations)

changed from 1 to 4. Since the orientations of all the other (2^{d−1} − 1) H_{k−1}^d vertices remain unchanged, they are all j-oriented for some j (1 ≤ j < d). The whole 2^{d−1} H_{k−1}^d vertices are then replicated by reflection, and finally the two replicas are connected by an edge parallel to the d-th coordinate axis (called a d-oriented edge) to form a d-oriented H_k^d. In short, whenever a dimension (say, the d-th dimension) is added, two d-oriented H_{k−1}^d vertices are introduced, the number of 1-oriented H_{k−1}^d vertices remains unchanged as two, and the number of H_{k−1}^d vertices of each other orientation is doubled.

Notation 3.1 Let φ_i^d be the number of i-oriented H_{k−1}^d vertices in a given d-oriented H_k^d.

Lemma 1 For a d-oriented H_k^d (d ≥ 2),

    φ_i^d = 2              if i = 1,
    φ_i^d = 2^{d−i+1}      if 1 < i ≤ d.     (1)

Proof. By induction on d.


Symbol          Definition
d               Number of dimensions
(x_1, …, x_d)   Coordinates of a grid point in the d-dimensional grid space
H_k^d           k-th order approximation of the d-dimensional Hilbert curve
φ_i^d           Number of i-oriented H_{k−1}^d vertices in an H_k^d
ε_{i,k}         Number of i-oriented edges in a d-oriented H_k^d
S_i^+           Number of interior grid points which face an i+-surface
S_i^−           Number of interior grid points which face an i−-surface
p_i^+           Probability that the predecessor of a grid point is its i+-neighbor
p_i^−           Probability that the predecessor of a grid point is its i−-neighbor
S_q             Total surface area of a given d-dimensional rectilinear polyhedral query q
N_d             Average number of clusters within a given d-dimensional rectilinear polyhedron

Table 1: Definition of Symbols

The following lemma shows that the edges of d different orientations approach the uniform distribution as the order of the Hilbert curve approximation grows into infinity.

Notation 3.2 Let ε_{i,k} denote the number of i-oriented edges in a (d-oriented) H_k^d.

Lemma 2 In a d-dimensional space, for any i and j (1 ≤ i, j ≤ d), ε_{i,k}/ε_{j,k} approaches unity as k grows to infinity.

Proof. We begin by deriving recurrence relations among the terms ε_{i,k} and φ_i^d. As we mentioned previously, the fundamental operations involved in expanding the Hilbert curve (i.e., from H_{k−1}^d to H_k^d) are rotation and reflection. During the expansion of H_k^d, the orientation of an H_{k−1}^d vertex in a quantized subregion is changed only by rotation; a set of subregions of an orientation are replicated from one of the same orientation, which leaves the directions of their edges unchanged. Consequently, any two distinct H_{k−1}^d vertices of the same orientation contain the same number of edges ε_{i,k−1} for each direction i (1 ≤ i ≤ d). Therefore, the set of the 1-oriented edges in the H_k^d consists of 2^{d−1} connection edges (in H_1^d), the d-oriented edges of the 1-oriented H_{k−1}^d vertices, the (d−1)-oriented edges of the 2-oriented H_{k−1}^d vertices, the (d−2)-oriented edges of the 3-oriented H_{k−1}^d vertices and so on. By applying the same procedure to the other directions, we obtain

"1

;k

= '1" ?1 + '2 " ?1 ?1 +

d;k d ;k

+ ' "1 ?1 + 2 ?1

d ;k d

10

"2 "3 "

;k

;k

d;k

= '2" ?1 + '3 " ?1 ?1 + = '3" ?1 + '4 " ?1 ?1 + . . . = ' " ?1 + '1" ?1 ?1 +

d;k d ;k d;k d ;k d d;k d ;k i; d i

+ '1"1 ?1 + 2 ?2 + '2"1 ?1 + 2 ?3

d ;k ;k d

(2)

+ ' ?1 "1 ?1 + 1

d ;k i

The initial values are given by " 1 = 2 ? , and the values of ' are in Lemma 1. The constants in the last terms being ignored, the recurrence relations are completely symmetric. From the symmetry, it can be shown that for any i and j (1 i; j d),

k

lim " = 1: !1

"

i;k j;k

The proof is complete.

Now we consider a d-dimensional grid space, which is equivalent to a d-dimensional Euclidean integer space. In the d-dimensional grid space, each grid point y = (x_1, …, x_d) has 2d neighbors. The coordinates of the neighbors differ from those of y by unity in only one dimension. In other words, the coordinates of the neighbors that lie in a line parallel to the i-th axis must be either (x_1, …, x_i + 1, …, x_d) or (x_1, …, x_i − 1, …, x_d). We call them the i+-neighbor and the i−-neighbor of y, respectively. Butz showed that any unit increment in the Hilbert order produces a unit increment in one of the d coordinates and leaves the other d − 1 coordinates unchanged [5]. The implication is that, for any grid point y, both the neighbors of y in the linear order imposed by the Hilbert curve are chosen from the 2d neighbors of y in the d-dimensional grid space. Of the two neighbors of y in the Hilbert order, the one closer to the start of the Hilbert traversal is called the predecessor of y.
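Recurrence (2), together with the initial values ε_{i,1} = 2^{d−i} and the φ_i^d of Lemma 1, is easy to iterate numerically. The sketch below (our own illustration) checks two sanity properties: the edge counts sum to 2^{kd} − 1, the total number of edges in H_k^d, and the ratios ε_{i,k}/ε_{j,k} approach unity as Lemma 2 asserts:

```python
def phi(i, d):
    # Lemma 1: phi_i^d = 2 if i = 1, else 2^(d-i+1)
    return 2 if i == 1 else 2 ** (d - i + 1)

def step(eps, d):
    """One application of recurrence (2); eps[i-1] holds eps_{i,k-1}."""
    out = []
    for i in range(1, d + 1):
        total = 2 ** (d - i)                    # connection-edge constant
        for m in range(d):
            coef = phi((i - 1 + m) % d + 1, d)  # phi_i, phi_{i+1}, ... wrapping
            total += coef * eps[d - 1 - m]      # paired with eps_{d-m, k-1}
        out.append(total)
    return out

d = 3
eps = [2 ** (d - i) for i in range(1, d + 1)]   # eps_{i,1} = 2^(d-i)
k = 1
while k < 11:                                   # iterate up to k = 11
    eps = step(eps, d)
    k += 1
```

After ten expansion steps the three edge counts agree to well within one percent of each other, illustrating how quickly the uniform distribution of Lemma 2 is approached.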

Notation 3.3 For a grid point y in a d-dimensional grid space, let p_i^+ be the probability that the predecessor of y is the i+-neighbor of y, and let p_i^− be the probability that the predecessor of y is the i−-neighbor of y.

Lemma 3 In a sufficiently large d-dimensional grid space, for any i (1 ≤ i ≤ d),

    p_i^+ + p_i^− = 1/d.

Proof. Assume y is a grid point in d-dimensional space and z is its predecessor. Then the edge yz adjacent to y and z is parallel to one of the d dimensional axes. From Lemma 2 and the recursive definition of the Hilbert curve, the probability that yz is parallel to the i-th dimensional axis is 1/d for any i (1 ≤ i ≤ d). This implies that the probability that z is either the i+-neighbor or the i−-neighbor of y is 1/d.

For a d-dimensional rectilinear polyhedron representing a query region, the number, sizes and shapes of the surfaces can be arbitrary. Due to the constraint of surface alignment, however, it is feasible to classify the surfaces of a d-dimensional rectilinear polyhedron into 2d different kinds: for any i (1 ≤ i ≤ d),

- If a point y is inside the polyhedron and its i+-neighbor is outside, then the point y faces an i+-surface.
- If a point y is inside the polyhedron and its i−-neighbor is outside, then the point y faces an i−-surface.

For example, Figure 6 illustrates grid points which face surfaces in a 2-dimensional grid space. The shaded region represents the inside of the polyhedron. Assuming that the first dimension is vertical and the second dimension is horizontal, grid points A and D face a 1+-surface, and grid point B (on the convex) faces both a 1+-surface and a 2+-surface. Although grid point C (on the concave) is close to the boundary, it does not face any surface because all of its neighbors are inside the polyhedron. Consequently, the chance that the Hilbert curve enters the polyhedron through grid point B is approximately twice that of entering through grid point A (or D). The Hilbert curve cannot enter through grid point C.

Figure 6: Illustration of grid points facing surfaces

For any d-dimensional rectilinear polyhedron, it is interesting to see that the aggregate area of the i+-surfaces is exactly as large as that of the i−-surfaces. In a d-dimensional grid space, we mean by surface area the number of interior grid points that face a given surface of any kind.

Notation 3.4 For a d-dimensional rectilinear polyhedron, let S_i^+ and S_i^− denote the aggregate number of interior grid points that face an i+-surface and an i−-surface, respectively.

Before proving the following theorem, we state without proof an elementary remark.

Remark 3.2 Given a d-dimensional rectilinear polyhedron, S_i^+ = S_i^− for any i (1 ≤ i ≤ d).
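The intuition behind Remark 3.2 extends to any finite cell set: along each grid line the region restricted to that line is a union of intervals, and every interval contributes exactly one i+ and one i− face point. A quick numerical check (the random region generator and the helper are our own illustration):

```python
import random

def surface_counts(cells, d):
    """Return ([S_1^+, ..., S_d^+], [S_1^-, ..., S_d^-]) for a cell set."""
    cells = set(cells)
    plus, minus = [0] * d, [0] * d
    for c in cells:
        for i in range(d):
            up = tuple(v + (j == i) for j, v in enumerate(c))  # i+-neighbor
            dn = tuple(v - (j == i) for j, v in enumerate(c))  # i--neighbor
            if up not in cells:
                plus[i] += 1    # c faces an i+-surface
            if dn not in cells:
                minus[i] += 1   # c faces an i--surface
    return plus, minus

random.seed(7)
region = {(random.randrange(8), random.randrange(8), random.randrange(8))
          for _ in range(150)}
plus, minus = surface_counts(region, 3)   # plus == minus, axis by axis
```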

Notation 3.5 Let N_d be the average number of clusters within a given d-dimensional rectilinear polyhedron.

Theorem 1 In a sufficiently large d-dimensional grid space mapped by H_k^d, let S_q be the total surface area of a given rectilinear polyhedral query q. Then,

    lim_{k→∞} N_d = S_q / (2d).     (3)

Proof. Assume a grid point y faces an i+-surface (or an i−-surface). Then, the probability that the Hilbert curve enters the polyhedron through y is equivalent to the probability that the predecessor of y is an i+-neighbor (or an i−-neighbor) of y. Thus, the expected number of entries through an i+-surface (or an i−-surface) is S_i^+ p_i^+ (or S_i^− p_i^−). Since the number of clusters is equal to the total number of entries into the polyhedron through any of the 2d kinds of surfaces (Remark 3.1), it follows that

    lim_{k→∞} N_d = Σ_{i=1}^d (S_i^+ p_i^+ + S_i^− p_i^−)
                  = Σ_{i=1}^d S_i^+ (p_i^+ + p_i^−)      (by Remark 3.2)
                  = (1/d) Σ_{i=1}^d S_i^+                (by Lemma 3)
                  = S_q / (2d).

The proof is complete.

Theorem 1 confirms our early conjecture that the number of clusters is approximately proportional to the hyper-surface area of a d-dimensional polyhedron, and provides (2d)^{−1} as the constant factor of the linear function. In a 2-dimensional space, the average number of clusters for the z curve approaches one third of the perimeter of a query rectangle plus two thirds of the side length of the rectangle in the unfavored direction [23]. It follows that the Hilbert curve achieves better clustering than the z curve, because the average number of clusters for the Hilbert curve is approximately equal to one fourth of the perimeter of a 2-dimensional query rectangle.

Corollary 1 In a sufficiently large d-dimensional grid space mapped by H_k^d, the following properties are satisfied:

(i) Given an s_1 × s_2 × ⋯ × s_d hyper-rectangle,

    lim_{k→∞} N_d = (1/d) Σ_{i=1}^d (1/s_i) ∏_{j=1}^d s_j.

(ii) Given a hypercube of side length s, lim_{k→∞} N_d = s^{d−1}.

For a square of side length 2, Corollary 1(ii) provides 2 as the average number of clusters, which is exactly the same as the result given in [14].
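Corollary 1(ii) with d = 2 predicts an average of s clusters for an s × s query window. The sketch below (the grid size and window size are our own test choices) estimates this empirically with the standard Hilbert index computation; the finite-grid average falls slightly below the asymptotic value s because windows near the grid boundary are cut less often:

```python
def d2xy(n, d):
    """Map Hilbert index d to (x, y) on an n x n grid, n a power of two."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/reflect the subsquare
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

n, s = 128, 4
index = {d2xy(n, t): t for t in range(n * n)}   # (x, y) -> Hilbert index

def clusters(cells):
    idx = sorted(index[c] for c in cells)
    return sum(1 for i, v in enumerate(idx) if i == 0 or v != idx[i - 1] + 1)

total = sum(clusters([(x + i, y + j) for i in range(s) for j in range(s)])
            for x in range(n - s + 1) for y in range(n - s + 1))
avg = total / ((n - s + 1) ** 2)   # close to s = 4 for a large grid
```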

4 Exact Analysis: A Special Case

Theorem 1 states that as the size of a grid space grows infinitely, the average number of clusters approaches half the surface area of a given query region divided by the dimensionality. It does not provide an intuition as to how rapidly the number of clusters converges to the asymptotic solution. To address this issue, in this section, we derive a closed-form, exact formula for a 2-dimensional finite space. We can then measure how closely the asymptotic solution reflects the reality in a finite space, by comparing it with the exact formula. Specifically, we assume that a finite 2^{k+n} × 2^{k+n} grid space is mapped by H_{k+n}^2 and a query region is a square of size 2^k × 2^k. We first describe our approach and then present the formal derivation of the solution in several lemmas and a theorem. Table 2 summarizes the symbols used in this section.

4.1 Basic concepts

Remark 3.1 states that the number of clusters within a given query region is equal to the number of entries into the region made by the Hilbert curve traversal. Since each entry is eventually followed by an exit from the region, each entry-exit pair corresponds to two cuts of the Hilbert curve by the boundary of the query region. We restate Remark 3.1 as follows:

Remark 4.1 The number of clusters within a given query region is equal to half the number of edges cut by the boundary of the region.
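Remark 4.1 can be checked directly: an edge of the curve is cut when exactly one of its two endpoints lies inside the window. For a window that contains neither the start nor the end of the curve, every cluster contributes exactly two cut edges. A sketch using the standard Hilbert index computation (the grid and window sizes are our own test choices):

```python
def d2xy(n, d):
    """Map Hilbert index d to (x, y) on an n x n grid, n a power of two."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

n = 32
curve = [d2xy(n, t) for t in range(n * n)]      # t-th point of the traversal
window = {(x, y) for x in range(5, 13) for y in range(9, 20)}

inside = [p in window for p in curve]
clusters = sum(1 for t in range(n * n)
               if inside[t] and (t == 0 or not inside[t - 1]))
cuts = sum(1 for t in range(n * n - 1) if inside[t] != inside[t + 1])
# cuts == 2 * clusters, since the curve's endpoints lie outside the window
```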

Here we mean by edges the line segments of the Hilbert curve connecting two neighboring grid points. Now we know from Remark 4.1 that deriving the exact formula is reduced to counting the number of edge cuts by the boundary of a 2^k × 2^k query window at all possible positions within a 2^{k+n} × 2^{k+n} grid region. Then the average number of clusters is simply obtained by dividing this number by twice the number of possible positions of the query window.


Figure 7: H_{k+n}^2 divided into nine subregions (F, B, H across the top; D, A, E across the middle; G, C, I across the bottom; each boundary subregion has width 2^k)

Notation 4.1 Let N_2(k, k + n) be the average number of clusters inside a 2^k × 2^k square window in a 2^{k+n} × 2^{k+n} grid region.

The difficulty of counting the edge cuts lies in the fact that, for each edge within the grid region, the number of cuts varies depending on the location of the edge. Intuitively, the edges near the boundary of the grid region are cut less often than those near the center. This is because a smaller number of square windows can cut the edges near the boundary. Thus, to make it easier to count the edge cuts, the grid region H_{k+n}^2 is divided into nine subregions, as shown in Figure 7. The width of the subregions on the boundary is 2^k. Then the 2^{k+n} × 2^{k+n} grid region (H_{k+n}^2) can be considered as a collection of 2^{2n} H_k^2 approximations, each of which is connected to one or two neighbors by connection edges. From now on, we mean by an internal edge one of the 2^{2k} − 1 edges in an H_k^2, and by a connection edge one that connects two H_k^2 subregions. For example, subregion F includes only one H_k^2 and is connected to subregions B and D by a horizontal and a vertical connection edge, respectively. Subregion B includes (2^n − 2) H_k^2 approximations, each of which is connected to its two neighbors by connection edges. Consider an edge (internal or connection) near the center of subregion A, and a horizontal edge in subregion B. An edge in subregion A can be cut by 2^{k+1} square windows whose positions within the region are mutually distinct. On the other hand, a horizontal edge in subregion B can be cut by a different number of distinct windows, depending on the position of the edge. Specifically, if the edge in subregion B is on the i-th row from the topmost, then it is cut 2i times. The observations we have

15

made are summarized as follows:

A1. Every edge (either horizontal or vertical) at least one of whose end points resides in subregion A is cut $2^{k+1}$ times.

A2. Every vertical edge in subregions B and C is cut $2^k$ times by the top or bottom side of a window.

A3. Every horizontal edge in subregions D and E is cut $2^k$ times by the left or right side of a window.

A4. Every connection edge in subregions {B,F,H} is horizontal, resides in the $2^k$-th row from the topmost, and is cut $2^{k+1}$ times by the left and right sides of a window. Similarly, every connection edge in subregions {C,G,I} is horizontal, resides in the $2^k$-th row from the topmost, and is cut twice by the left and right sides of a window.

A5. Every connection edge in subregions {D,F,G} is vertical, resides in the first column from the leftmost, and is cut twice by the top and bottom sides of a window. Every connection edge in subregions {E,H,I} is vertical, resides in the first column from the rightmost, and is cut twice by the top and bottom sides of a window.

A6. Every horizontal edge in the $i$-th row from the topmost of subregion B is cut $2i$ times by both the left and right sides of a window, and every horizontal edge in the $i$-th row from the topmost of subregion C is cut $2^{k+1} - 2i + 2$ times by both the left and right sides of a window.

A7. Every vertical edge in the $i$-th column from the leftmost of subregion D is cut $2i$ times by both the top and bottom sides of a window, and every vertical edge in the $i$-th column from the leftmost of subregion E is cut $2^{k+1} - 2i + 2$ times by both the top and bottom sides of a window.

A8. Every horizontal edge in the $i$-th row from the topmost of subregions {F,H} is cut $i$ times by either the left or right side of a window.

A9. Every horizontal edge in the $i$-th row from the topmost of subregions {G,I} is cut $2^k - i + 1$ times by either the left or right side of a window.

A10. Every vertical edge in the $i$-th column from the leftmost of subregions {F,G} is cut $i$ times by either the top or bottom side of a window.

A11. Every vertical edge in the $i$-th column from the leftmost of subregions {H,I} is cut $2^k - i + 1$ times by either the top or bottom side of a window.

A12. The two connection edges through which the Hilbert curve enters and leaves the grid region are cut once each.

From these observations, we can categorize the edges in the $H_2^{k+n}$ grid region into the following five groups:

(i) $E_1$: a group of edges as described in observation A1. Each edge is cut $2^{k+1}$ times.

(ii) $E_2$: a group of edges as described in observations A2 and A3. Each edge is cut $2^k$ times.

(iii) $E_3$: a group of edges as described in observations A4 and A5. Each connection edge on the top boundary (i.e., subregions {B,F,H}) is cut $2^{k+1}$ times, and any other connection edge is cut twice.

(iv) $E_4$: a group of edges as described in observations A6 and A7. Each edge is cut $2i$ or $2\,(2^k - i + 1)$ times if it is in the $i$-th row (or column) from the topmost (or leftmost).

(v) $E_5$: a group of edges as described in observations A8 to A11. Each edge is cut $i$ or $2^k - i + 1$ times if it is in the $i$-th row (or column) from the topmost (or leftmost).

Notation 4.2 $N_i$ denotes the number of edge cuts from an edge group $E_i$.

In an $H_2^{k+n}$ grid region, the number of all possible positions of a $2^k \times 2^k$ window is $(2^{k+n} - 2^k + 1)^2$. Since there are two more cuts from observation A12, in addition to $N_1, \ldots, N_5$, the average number of clusters $N_2(k, k+n)$ is given by

$$N_2(k, k+n) = \frac{N_1 + N_2 + N_3 + N_4 + N_5 + 2}{2\,(2^{k+n} - 2^k + 1)^2}. \qquad (4)$$

In the next subsection, we derive a closed-form expression for each of the edge groups $N_1, \ldots, N_5$.

4.2 Formal derivation

Figure 8: Four different orientations of $H_2^2$ ((a) $1^+$-oriented, (b) $1^-$-oriented, (c) $2^+$-oriented, (d) $2^-$-oriented)

We adopt the notion of orientations of $H_d^k$ given in Section 3 and extend it so that it can be used to derive inductions.

Notation 4.3 An $i$-oriented $H_d^k$ is called $i^+$-oriented (or $i^-$-oriented) if the $i$-th coordinate of its start point is not greater (or less) than that of any grid point in the $H_d^k$.

Symbol: Definition
$t_n$: Number of connection edges in the top boundary of a $2^+$-oriented $H_2^{k+n}$
$b_n$: Number of connection edges in the bottom boundary of a $2^+$-oriented $H_2^{k+n}$
$s_n$: Number of connection edges in the side boundary of a $2^+$-oriented $H_2^{k+n}$
$E_i$: A group of edges between grid points
$N_i$: Number of edge cuts from an edge group $E_i$
$f^{\{R\}}_{i^+,n}$: Number of $i^+$-oriented $H_2^k$ approximations in the subregion $R$ of a $2^+$-oriented $H_2^{k+n}$
$f^{\{R\}}_{i^-,n}$: Number of $i^-$-oriented $H_2^k$ approximations in the subregion $R$ of a $2^+$-oriented $H_2^{k+n}$
$H_k$: Number of horizontal edges in a 2-oriented $H_2^k$
$V_k$: Number of vertical edges in a 2-oriented $H_2^k$
$h_k(i)$: Number of horizontal edges in the $i$-th row from the topmost of a $2^+$-oriented $H_2^k$
$v_k(i)$: Number of vertical edges in the $i$-th column from the leftmost of a $2^+$-oriented $H_2^k$
$N_2(k, k+n)$: Average number of clusters inside a $2^k \times 2^k$ square window in a $2^{k+n} \times 2^{k+n}$ grid region

Table 2: Definition of Symbols

Figure 8 illustrates $1^+$-oriented, $1^-$-oriented, $2^+$-oriented and $2^-$-oriented $H_2^2$ approximations. Note that either of the two end points can be a start point for each curve. We begin by deriving $N_1$ and $N_3$. It appears at first glance that the derivation of $N_1$ is simple because each edge in $E_1$ is cut $2^{k+1}$ times. However, the derivation of $N_1$ involves counting the number of connection edges crossing the boundary between subregion A and the other subregions, as well as the number of edges inclusive to subregion A. We accomplish this by counting the number of edges in the complementary set $\overline{E_1}$ (that is, {edges in $H_2^{k+n}$} $- E_1$). Since $\overline{E_1}$ consists of the edges in the $4\,(2^n - 1)$ $H_2^k$ approximations in boundary subregions B through I and the connection edges in $E_3$, $|\overline{E_1}|$ is equal to $4\,(2^n - 1)(2^{2k} - 1) + |E_3|$. To find the number of connection edges in $E_3$, we define the number of connection edges in different parts of the boundary subregions. In the following, without loss of generality, we assume that the grid region is a $2^+$-oriented $H_2^{k+n}$.

Notation 4.4 Let $t_n$, $b_n$ and $s_n$ denote the number of connection edges in the top boundary (i.e., subregions {B,F,H}), in the bottom boundary (i.e., subregions {C,G,I}), and in the left or right boundary (i.e., subregions {D,F,G} or {E,H,I}) of a $2^+$-oriented $H_2^{k+n}$, respectively.

Note that the number of connection edges in subregions {D,F,G} and the number of connection edges in subregions {E,H,I} are identical, because the $2^+$-oriented $H_2^{k+n}$ is vertically self-symmetric.

Lemma 4 For any positive integer $n$,

$$t_n = 2^{n-1} \qquad \text{and} \qquad b_n + 2\,s_n = 2\,(2^n - 1). \qquad (5)$$

Proof. Given in Appendix A.
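Lemma 4 can be checked numerically from the recurrences established in its proof (Appendix A): $t_n = 2\,t_{n-1}$, $b_n = 2\,s_{n-1}$, $s_n = s_{n-1} + b_{n-1} + 1$, with $t_1 = 1$, $b_1 = 0$, $s_1 = 1$. A minimal sketch:

```python
# Check t_n = 2^(n-1) and b_n + 2 s_n = 2 (2^n - 1) (Lemma 4), and the
# consequence t_n + b_n + 2 s_n = 5 * 2^(n-1) - 2 for the size of E3.
t, b, s = 1, 0, 1                       # base case n = 1
for n in range(1, 21):
    assert t == 2 ** (n - 1)
    assert b + 2 * s == 2 * (2 ** n - 1)
    assert t + b + 2 * s == 5 * 2 ** (n - 1) - 2
    t, b, s = 2 * t, 2 * s, s + b + 1   # recurrences advance to n + 1
```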

From Lemma 4, the number of connection edges inclusive to the boundary subregions (i.e., $E_3$) is given by $t_n + b_n + 2\,s_n = 5 \cdot 2^{n-1} - 2$. From this, we can obtain the number of edges in $E_1$ as well as $E_3$, and hence the number of cuts from $E_1$ and $E_3$. The results are presented in the following lemma.

Lemma 5 The numbers of edge cuts from $E_1$ and $E_3$ are

$$N_1 = 2\,(2^n - 2)^2\,2^{3k} + 3\,(2^n - 2)\,2^k \qquad (6)$$
$$N_3 = 2^{n+k} + 4\,(2^n - 1) \qquad (7)$$

Proof. Given in Appendix A.

All that we need to derive $N_2$ is then to count the number of vertical edges in subregions {B,C} and the number of horizontal edges in subregions {D,E}. No connection edges in these subregions are involved. Since the number of horizontal (or vertical) edges in an $H_2^k$ is determined by its orientation, it is necessary to find the number of $H_2^k$ approximations of different orientations in subregions {B,C,D,E}. In the following, we give notations for the number of horizontal and vertical edges in an $H_2^k$, and for the number of $H_2^k$ approximations of different orientations in the boundary subregions in Figure 7.

Notation 4.5 Let $H_k$ and $V_k$ denote the number of horizontal and vertical edges in a 2-oriented $H_2^k$, respectively.

By definition, the numbers of horizontal and vertical edges in a 1-oriented $H_2^k$ are $V_k$ and $H_k$, respectively.

Notation 4.6 For a set of subregions $\{R_1, R_2, \ldots, R_j\}$ in Figure 7, let $f^{\{R_1,R_2,\ldots,R_j\}}_{i^+,n}$ and $f^{\{R_1,R_2,\ldots,R_j\}}_{i^-,n}$ denote the number of $i^+$-oriented and $i^-$-oriented $H_2^k$ approximations in those subregions, respectively.

Lemma 6 Given a $2^+$-oriented $H_2^{k+n}$ as depicted in Figure 7,

$$f^{\{B\}}_{2^+,n} = 2^n - 2 \qquad (8)$$
$$f^{\{C\}}_{2^+,n} + f^{\{D\}}_{1^+,n} + f^{\{E\}}_{1^-,n} = 2^n - 2 \qquad (9)$$
$$f^{\{D,E\}}_{2^+,n} + f^{\{D,E\}}_{2^-,n} + f^{\{C\}}_{1^-,n} + f^{\{C\}}_{1^+,n} = 2\,(2^n - 2). \qquad (10)$$

Proof. Given in Appendix A.

From Lemma 6, a closed-form expression of $N_2$ is derived in the following lemma.

Lemma 7 The number of edge cuts from $E_2$ is

$$N_2 = 2\,(2^n - 2)\,2^{3k} - 2\,(2^n - 2)\,2^k. \qquad (11)$$

Proof. Given in Appendix A.

Now we consider the number of cuts from $E_4$ and $E_5$. The edges in these groups are cut different numbers of times depending on their relative locations within the $H_2^k$ to which they belong. Consequently, the expressions for $N_4$ and $N_5$ include such terms as $\sum_i i\,v_k(i)$ and $\sum_i i\,h_k(i)$. The definitions of $v_k(i)$ and $h_k(i)$ are given below. We call $H_2^k$ approximations having such terms gradients.

Notation 4.7 Let $h_k(i)$ be the number of horizontal edges in the $i$-th row from the topmost, and $v_k(i)$ be the number of vertical edges in the $i$-th column from the leftmost, of a $2^+$-oriented $H_2^k$.

(a) u-gradient$_2$ (b) d-gradient$_2$ (c) s-gradient$_2$

Figure 9: Three different gradients and cutting windows

To derive the closed-form expressions for $N_4$ and $N_5$, we first define the different types of gradients. Consider the $2^+$-oriented $H_2^k$ approximations in subregions {B,C,D,E}. From observations A6 and A7, the number of cuts from the horizontal edges in a $2^+$-oriented $H_2^k$ in subregion B is $\sum_{i=1}^{2^k} 2i\,h_k(i)$. Likewise, the number of cuts from the horizontal edges in a $2^+$-oriented $H_2^k$ in subregion C is $\sum_{i=1}^{2^k} 2\,(2^k - i + 1)\,h_k(i)$, and the number of cuts from the vertical edges in a $2^+$-oriented $H_2^k$ in subregion D or E is $\sum_{i=1}^{2^k} 2i\,v_k(i)$. The number of cuts from vertical edges is the same in both subregions D and E, because a $2^+$-oriented $H_2^k$ is vertically self-symmetric. Based on this, we define three types of gradients for a $2^+$-oriented $H_2^k$:

Definition 4.1 (i) A $2^+$-oriented $H_2^k$ is called u-gradient$_k$ if each of its horizontal edges in the $i$-th row from the topmost is cut $i$ or $2i$ times.

(ii) A $2^+$-oriented $H_2^k$ is called d-gradient$_k$ if each of its horizontal edges in the $i$-th row from the topmost is cut $2^k - i + 1$ or $2\,(2^k - i + 1)$ times.

(iii) A $2^+$-oriented $H_2^k$ is called s-gradient$_k$ if each of its vertical edges in the $i$-th column from either the leftmost or the rightmost is cut $i$ or $2i$ times.

Figure 9 illustrates the three different gradients (u-gradient$_2$, d-gradient$_2$ and s-gradient$_2$) and the cutting boundaries of a sliding window. These definitions can be applied to the $H_2^k$ approximations of different orientations as well, by simply rotating the directions. For example, a $1^+$-oriented $H_2^k$ in subregion D is d-gradient$_k$, and a $2^-$-oriented $H_2^k$ in subregion D is s-gradient$_k$.

Lemma 8 Let $\alpha_k = \sum_{i=1}^{2^k} i\,h_k(i)$, $\beta_k = \sum_{i=1}^{2^k} (2^k - i + 1)\,h_k(i)$ and $\gamma_k = \sum_{i=1}^{2^k} i\,v_k(i)$. Then,

$$\alpha_k + \beta_k = (2^k + 1)\,H_k \qquad \text{and} \qquad \gamma_k = \frac{1}{2}\,(2^k + 1)\,V_k. \qquad (12)$$

Proof. Given in Appendix A.

Next, we need to know the number of gradients of each type in the boundary subregions B through I so that we can derive $N_4$ and $N_5$. For $H_2^k$ approximations in subregions {B,C,D,E},

- Every $2^+$-oriented $H_2^k$ in B is u-gradient$_k$.

- Every $2^+$-oriented $H_2^k$ in C, $1^+$-oriented $H_2^k$ in D, and $1^-$-oriented $H_2^k$ in E is d-gradient$_k$.

- Every $1^+$-oriented or $1^-$-oriented $H_2^k$ in C, and every $2^+$-oriented or $2^-$-oriented $H_2^k$ in {D,E}, is s-gradient$_k$.

The $H_2^k$ approximations in subregions {F,G,H,I} are dual-type gradients. In other words,

- Each of the $2^+$-oriented $H_2^k$ approximations in {F,H} is both u-gradient$_k$ and s-gradient$_k$.

- The $H_2^k$ in G is both d-gradient$_k$ and s-gradient$_k$, because the subgrid is either $2^+$-oriented or $1^+$-oriented.

- The $H_2^k$ in I is both d-gradient$_k$ and s-gradient$_k$, because the subgrid is either $2^+$-oriented or $1^-$-oriented.

Thus, in subregions {B,C,D,E}, the number of u-gradient$_k$ approximations is $f^{\{B\}}_{2^+,n}$, the number of d-gradient$_k$ approximations is $f^{\{C\}}_{2^+,n} + f^{\{D\}}_{1^+,n} + f^{\{E\}}_{1^-,n}$, and the number of s-gradient$_k$ approximations is $f^{\{D,E\}}_{2^+,n} + f^{\{D,E\}}_{2^-,n} + f^{\{C\}}_{1^-,n} + f^{\{C\}}_{1^+,n}$. In subregions {F,G,H,I}, the number of u-gradient$_k$ approximations is two, the number of d-gradient$_k$ approximations is two, and the number of s-gradient$_k$ approximations is four. From this observation, and Lemma 6 and Lemma 8, it follows that

Lemma 9 The numbers of edge cuts from $E_4$ and $E_5$ are

$$N_4 = 2\,(2^n - 2)\,(2^k + 1)\,(2^{2k} - 1) \qquad (13)$$
$$N_5 = 2\,(2^k + 1)\,(2^{2k} - 1) \qquad (14)$$

Proof. Given in Appendix A.

Finally, in the following theorem, we present a closed-form expression of the average number of clusters.

Theorem 2 Given a $2^{k+n} \times 2^{k+n}$ grid region, the average number of clusters within a $2^k \times 2^k$ query window is

$$N_2(k, k+n) = \frac{(2^n - 1)^2\,2^{3k} + (2^n - 1)\,2^{2k} + 2^n}{(2^{k+n} - 2^k + 1)^2}. \qquad (15)$$

Proof. From Equation (4),

$$N_2(k, k+n) = \frac{N_1 + N_2 + N_3 + N_4 + N_5 + 2}{2\,(2^{k+n} - 2^k + 1)^2} = \frac{(2^n - 1)^2\,2^{3k} + (2^n - 1)\,2^{2k} + 2^n}{(2^{k+n} - 2^k + 1)^2}.$$

For increasing $n$, $N_2(k, k+n)$ asymptotically approaches a limit of $2^k$, which is the side length of the square query region. This matches the asymptotic solution given in Corollary 1(ii) for $d = 2$.
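Theorem 2 lends itself to direct verification on small grids. The sketch below (ours; it assumes the standard Hilbert index-to-coordinate conversion, and relies on the average over all window positions being the same for every orientation of the curve) enumerates every position of a $2^k \times 2^k$ window, counts clusters as maximal runs of consecutive Hilbert indices, and compares the exhaustive average against Equation (15):

```python
from fractions import Fraction

def d2xy(order, d):
    """Standard Hilbert-curve index-to-(x, y) conversion on a 2**order grid."""
    x = y = 0
    s, t = 1, d
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def avg_clusters(k, n):
    """Exhaustive average cluster count of a 2^k x 2^k window over all
    positions in a 2^(k+n) x 2^(k+n) grid."""
    order, N, w = k + n, 1 << (k + n), 1 << k
    idx = [[0] * N for _ in range(N)]
    for d in range(N * N):
        x, y = d2xy(order, d)
        idx[x][y] = d
    total = positions = 0
    for x0 in range(N - w + 1):
        for y0 in range(N - w + 1):
            cells = sorted(idx[x][y]
                           for x in range(x0, x0 + w)
                           for y in range(y0, y0 + w))
            total += 1 + sum(b != a + 1 for a, b in zip(cells, cells[1:]))
            positions += 1
    return Fraction(total, positions)

def theorem2(k, n):
    """Closed form of Equation (15)."""
    num = (2**n - 1)**2 * 2**(3*k) + (2**n - 1) * 2**(2*k) + 2**n
    return Fraction(num, (2**(k + n) - 2**k + 1)**2)

assert avg_clusters(1, 1) == theorem2(1, 1)   # both give 14/9
assert avg_clusters(1, 2) == theorem2(1, 2)
assert avg_clusters(2, 2) == theorem2(2, 2)
```

The closed form also reproduces the exact values reported later in Figure 11(a); for example, with $k = 1$ and $n = 9$ (a $1024 \times 1024$ grid) it gives 2091524/1046529.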

5 Experimental Results

To demonstrate the correctness of the asymptotic and exact analyses presented in the previous sections, we carried out simulation experiments for range queries of various sizes and shapes. The objective of our experiments was to evaluate the accuracy of the formulas given in Theorem 1 and Theorem 2. Specifically, we intended to show that the asymptotic solution is an excellent approximation for general $d$-dimensional range queries of arbitrary sizes and shapes. We also intended to validate the correctness of the exact solution for a 2-dimensional $2^k \times 2^k$ square query.

5.1 Arrangements of experiments

To obtain exact measurements of the average number of clusters, we had to average the number of clusters within a query region over all possible positions in a given grid space. Such exhaustive simulation runs allowed us to validate empirically the correctness of the exact formula given in Theorem 2 for a $2^k \times 2^k$ square query.

(a) square (b) polygon (c) circle (d) cube (e) polyhedron

Figure 10: Illustration of sample query shapes

However, the number of all possible queries is exponential in the dimensionality. In a $d$-dimensional $N \times N \times \cdots \times N$ grid space, the total number of distinct positions of a $d$-dimensional $k \times k \times \cdots \times k$ hypercubic query is $(N - k + 1)^d$. Consequently, for a large grid space and a high dimensionality, each simulation run may require processing an excessively large number of queries, which in turn makes the simulation take too long. Thus, we carried out exhaustive simulations only for relatively small 2-dimensional and 3-dimensional grid spaces. For relatively large or high dimensional grid spaces, we instead performed statistical simulations by random sampling of queries. For query shapes, we chose squares, circles and concave polygons for 2-dimensional cases, and cubes, concave polyhedra and spheres for 3-dimensional cases. Figure 10 illustrates some of the query shapes used in our experiments. In higher dimensional spaces, we used hypercubic and hyperspherical query shapes because it was relatively easy to identify the query regions by simple mathematical formulas.
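The statistical simulation can be sketched as follows for 2-dimensional circular queries (a minimal illustration, not the authors' code; the grid size, sample count and seed are arbitrary, and `d2xy` is the standard Hilbert index-to-coordinate conversion):

```python
import random

def d2xy(order, d):
    """Standard Hilbert-curve index-to-(x, y) conversion."""
    x = y = 0
    s, t = 1, d
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def sample_circle_clusters(order, s, samples, seed=0):
    """Estimate the average number of clusters of a digitized circle of
    diameter s placed uniformly at random in a 2^order x 2^order grid."""
    N = 1 << order
    hindex = {d2xy(order, d): d for d in range(N * N)}
    r2 = (s / 2) ** 2
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        x0 = rng.randrange(N - s + 1)
        y0 = rng.randrange(N - s + 1)
        cx, cy = x0 + (s - 1) / 2, y0 + (s - 1) / 2
        ids = sorted(hindex[(x, y)]
                     for x in range(x0, x0 + s)
                     for y in range(y0, y0 + s)
                     if (x - cx) ** 2 + (y - cy) ** 2 <= r2)
        # clusters = maximal runs of consecutive Hilbert indices
        total += 1 + sum(b != a + 1 for a, b in zip(ids, ids[1:]))
    return total / samples
```

By the asymptotic solution, the estimate should settle near one quarter of the query perimeter, that is, near the side length $s$, which matches the behavior reported in Figure 11(b).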

5.2 Empirical validation

The first set of experiments was carried out in 2-dimensional grid spaces of two different sizes. The table in Figure 11(a) compares the empirical measurements with the exact and asymptotic formulas for a $2^k \times 2^k$ square query. The second column of the table contains the average numbers of clusters obtained by an exhaustive simulation performed on a 1024 x 1024 grid space. The numbers in the third and fourth columns were computed by the formulas in Theorem 1 and Theorem 2, respectively.

Average Number of Clusters (Grid: 1024 x 1024)

k    query      empirical    asymptotic    exact
1    2 x 2      1.998534     2             2091524/1046529
2    4 x 4      3.996328     4             4165936/1042441
3    8 x 8      7.992257     8             8266304/1034289
4    16 x 16    15.984206    16            16273216/1018081
5    32 x 32    31.967807    32            31521824/986049

(a) Exhaustive simulation (grid: 1024 x 1024)

Average Number of Clusters (Grid: 32768 x 32768): plot of the average number of clusters against the side length of query ($s$) for the Square, Circle and Concave2d query shapes.

(b) Statistical simulation (grid: 32K x 32K)

Figure 11: Average number of clusters for 2-dimensional queries

The numbers from the simulation are identical to those from the exact formula, ignoring round-off errors. Moreover, by comparing the second and third columns, we can measure how closely the asymptotic formula reflects the reality in a finite grid space. Figure 11(b) compares three different 2-dimensional query shapes: squares, circles and concave polygons. The average numbers of clusters were obtained by a statistical simulation performed on a 32K x 32K grid space. For the statistical simulation, a total of 200 queries were generated and placed randomly within the grid space for each combination of query shape and size. With a few exceptional cases, the numbers of clusters form a linear curve for each query shape; the linear correlation coefficients are 0.999253 for squares, 0.999936 for circles, and 0.999267 for concave polygons. The numbers are almost identical for the three different query shapes despite their covering different areas: a square covers $s^2$ grid points, a concave polygon $3s^2/4$ grid points, and a circle approximately $\pi s^2/4$ grid points. However, this should not be surprising, as the three query shapes have the same length of perimeter for a given side length $s$. For a circular query of diameter $s$, we can always find a rectilinear polygon that contains the same set of grid points as the circular query region, and it is always the case that the perimeter of this rectilinear polygon (as shown in Figure 10(c)) is equal to that of a square of side length $s$. In general, in a 2-dimensional grid space, the perimeter of a rectilinear polygon is greater than or equal to that of the minimum bounding rectangle (MBR) of the polygon. This justifies the general approach of using a minimum bounding rectangle to represent a 2-dimensional range query, because the use of an MBR does not increase the actual number of clusters (i.e., the number of non-consecutive disk accesses). A similar set of experiments was carried out in higher dimensional grid spaces.
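The perimeter claims above are easy to confirm numerically: for any region whose rows and columns are contiguous runs of cells (as is a digitized circle, or the concave polygon of Figure 10(b), modeled here as a square with one quadrant removed), the number of unit boundary edges equals the perimeter of the minimum bounding rectangle. A sketch, with helper names of our own:

```python
def rectilinear_perimeter(cells):
    """Number of unit cell edges on the boundary of a set of grid cells."""
    cells = set(cells)
    return sum((x + dx, y + dy) not in cells
               for x, y in cells
               for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))

s = 16
square = {(x, y) for x in range(s) for y in range(s)}
# concave polygon: square minus one quadrant (covers 3*s^2/4 grid points)
concave = {(x, y) for (x, y) in square if not (x >= s // 2 and y >= s // 2)}
# digitized circle of diameter s
c = (s - 1) / 2
circle = {(x, y) for (x, y) in square
          if (x - c) ** 2 + (y - c) ** 2 <= (s / 2) ** 2}

# All three have the same perimeter as the s x s bounding square: 4s.
assert rectilinear_perimeter(square) == 4 * s
assert rectilinear_perimeter(concave) == 4 * s
assert rectilinear_perimeter(circle) == 4 * s
```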

(a) 3-dimensional queries: average number of clusters (grid: 32768 x 32768 x 32768) against the side length of query ($s$) for the Square, Concave3d and Sphere query shapes. (b) $d$-dimensional hypercubic queries: average number of clusters ($N$ = 32768, $s$ = 3) against the dimensionality ($d$) of a data space, Hypercube_empirical versus Hypercube_formula.

Figure 12: Average number of clusters for higher dimensional queries

The results in Figure 12(a) were obtained by a statistical simulation performed on a 32K x 32K x 32K grid space. For the statistical simulation, a total of 200 queries were generated and placed randomly within the grid space for each combination of query shape and size. Those in Figure 12(b) were obtained by a statistical simulation with 200 random $d$-dimensional $3 \times 3 \times \cdots \times 3$ hypercubic queries in a $d$-dimensional $32K \times 32K \times \cdots \times 32K$ grid space ($2 \le d \le 10$). In Figure 12(a), the numbers of clusters form quadratic curves for all three query shapes, but with slightly different coefficients for the quadratic term. To determine the quadratic functions, we applied the least-squares curve fitting method for each query shape. The approximate quadratic functions were obtained as follows:

$$f_{cube}(s) = 1.02307\,s^2 + 1.60267\,s + 1.93663$$
$$f_{poly}(s) = 0.947168\,s^2 + 1.26931\,s + 1.95395$$
$$f_{sphere}(s) = 0.816674\,s^2 + 1.27339\,s + 2.6408$$

The approximate function $f_{cube}(s)$ for a cubic query confirms the asymptotic solution given in Corollary 1(ii), as it is quite close to $s^2$. Furthermore, Figure 12(b) illustrates that the empirical results from the hypercubic queries coincide with the formula ($s^{d-1}$) even in higher dimensional spaces.² The numbers from the experiments were less than 2 percent off from the formula. In contrast, the functions $f_{poly}(s)$ and $f_{sphere}(s)$ for concave polyhedral and spherical queries are lower than $s^2$. The reason is that, unlike in the 2-dimensional case, the surface area of a concave polyhedron or a sphere is smaller than that of its minimum bounding cube. For example, the surface area of the polyhedron illustrated in Figure 10(e) is $\frac{11}{2}s^2$, while that of the corresponding cube is $6s^2$. For a sphere of diameter $s = 16$, the surface area (i.e., the number of grid points on the surface of the sphere) is 1248. This is far smaller than the surface area of the corresponding cube, which is $6 \times 16^2$. Note that the coefficients of the quadratic terms in $f_{poly}(s)$ and $f_{sphere}(s)$ are fairly close to $\frac{11}{12} = 0.9166$ and $\frac{1248}{6 \times 16^2} = 0.8125$, respectively. This indicates that, in a $d$-dimensional space ($d \ge 3$), accessing the minimum bounding hyper-rectangle of a given query region may incur additional non-consecutive disk accesses, and hence supports the argument made in [15] that the minimum bounding rectangle may not be a good approximation of a non-rectangular object.

² The exponential growth gives rise to the question of whether using the Hilbert curve is a practical technique for clustering high dimensional data objects. For instance, in a 10-dimensional space, the expected number of clusters was 19,683.

5.3 Comparison with the Gray-coded and z curves

It may be argued that it is not convincing to conclude definitively that the Hilbert curve is better or worse than others solely on the basis of average behaviors, because the clustering achieved by the Hilbert curve might have a wider deviation from the average than that of other curves. Therefore, it is desirable to perform a worst-case analysis to determine the bounds on the deviation. A full-fledged worst-case analysis, however, is beyond the scope of this paper. Instead, we measured the worst-case numbers of clusters for the Hilbert curve, and compared them with those for the Gray-coded and z curves in the same simulation experiments. Figure 13 and Figure 14 show the worst-case and average numbers of clusters, respectively. Each figure presents the results from an exhaustive simulation performed on a 1K x 1K 2-dimensional space and a statistical simulation performed on a 32K x 32K x 32K 3-dimensional space. The Hilbert curve achieves much better clustering than the other curves in both the worst and average cases. For example, for a 2-dimensional square query, the Hilbert curve significantly reduced the number of clusters, yielding an improvement of up to 43 percent for the worst-case behaviors and 48 percent for the average cases. For a 3-dimensional spherical query, the Hilbert curve achieved an improvement of up to 28 percent over the z curve and 18 percent over the Gray-coded curve for the worst cases, and up to 31 percent over the z curve and 22 percent over the Gray-coded curve for the average cases. Although it is not the focus of this paper, it is worth noting that the Gray-coded curve was not always better than the z curve, which is in contrast to a previous study [14] reporting that the Gray-coded curve achieves better clustering than the z curve for a 2-dimensional 2 x 2 square query. In particular, for 2-dimensional circular queries (Figure 13(b) and Figure 14(b)), the Gray-coded curve was worse than

(a) 2-dimensional square queries and (b) 2-dimensional circular queries: max number of clusters (grid: 1024 x 1024) against the side length of query ($s$). (c) 3-dimensional cubic queries and (d) 3-dimensional spherical queries: max number of clusters (grid: 32768 x 32768 x 32768) against the side length of query ($s$). Each panel plots Z_Worst, Gray_Worst and Hilbert_Worst.

Figure 13: Worst-case number of clusters for three different space-filling curves

the z curve in both the worst and average cases. On the other hand, for 2-dimensional square queries, the Gray-coded curve was better than the z curve for the average clustering only by negligible amounts (the two measurements were almost identical, as shown in Figure 14(a)). Furthermore, it was surprising that both the Gray-coded and z curves performed exactly the same for the worst-case clustering (the two measurements were completely identical, as shown in Figure 13(a)). In a 3-dimensional space, however, the Gray-coded curve was clearly better than the z curve for both types of queries in both the worst and average cases.
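The gap between the Hilbert and z curves is easy to reproduce in miniature. The sketch below (ours; Hilbert via the standard index-to-coordinate conversion, z-order via bit interleaving) computes the exhaustive average number of clusters for 2 x 2 windows over a 16 x 16 grid under both orderings; the Hilbert average comes out clearly lower, consistent with Figure 14(a):

```python
def d2xy(order, d):
    """Standard Hilbert index -> (x, y) conversion."""
    x = y = 0
    s, t = 1, d
    while s < (1 << order):
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x, y = x + s * rx, y + s * ry
        t //= 4
        s *= 2
    return x, y

def morton(x, y, order):
    """z-curve index: interleave the bits of x and y."""
    z = 0
    for i in range(order):
        z |= ((x >> i) & 1) << (2 * i + 1) | ((y >> i) & 1) << (2 * i)
    return z

def avg_clusters(index_of, N, w):
    """Exhaustive average cluster count of w x w windows under an ordering."""
    total = count = 0
    for x0 in range(N - w + 1):
        for y0 in range(N - w + 1):
            ids = sorted(index_of[x][y]
                         for x in range(x0, x0 + w)
                         for y in range(y0, y0 + w))
            total += 1 + sum(b != a + 1 for a, b in zip(ids, ids[1:]))
            count += 1
    return total / count

order, N, w = 4, 16, 2
hilbert = [[0] * N for _ in range(N)]
for d in range(N * N):
    x, y = d2xy(order, d)
    hilbert[x][y] = d
zcurve = [[morton(x, y, order) for y in range(N)] for x in range(N)]

h_avg = avg_clusters(hilbert, N, w)
z_avg = avg_clusters(zcurve, N, w)
assert h_avg < z_avg   # the Hilbert curve clusters better on average
```

The same harness extends to other orderings by substituting a different index function.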

5.4 Summary

The main conclusions from our experiments are:

- The exact solution given in Theorem 2 exactly matches the experimental results from exhaustive simulations for square queries of size $2^k \times 2^k$. (See Figure 11(a).)

(a) 2-dimensional square queries and (b) 2-dimensional circular queries: average number of clusters (grid: 1024 x 1024) against the side length of query ($s$). (c) 3-dimensional cubic queries and (d) 3-dimensional spherical queries: average number of clusters (grid: 32768 x 32768 x 32768) against the side length of query ($s$). Each panel plots the z, Gray-coded and Hilbert curves.

Figure 14: Average number of clusters for three different space-filling curves

- The asymptotic solutions given in Theorem 1 and Corollary 1 provide excellent approximations for $d$-dimensional queries of arbitrary shapes and sizes. (See Figure 11(b) and Figure 12.) For example, the relative errors did not exceed 2 percent for $d$-dimensional ($2 \le d \le 10$) hypercubic queries.

- Assuming that blocks are arranged on disk by the Hilbert ordering, accessing the minimum bounding rectangle of a $d$-dimensional ($d \ge 3$) query region may increase the number of non-consecutive accesses, whereas this is not the case for a 2-dimensional query.

- The Hilbert curve outperforms the z and Gray-coded curves by a wide margin for both the worst-case and average clustering. (See Figure 13 and Figure 14.)

- For 3-dimensional cubic and spherical queries, the Gray-coded curve outperformed the z curve for both the worst-case and average clustering. However, the clustering by the Gray-coded curve was almost identical to that by the z curve for 2-dimensional square queries (Figure 13(a) and Figure 14(a)), and clearly worse for 2-dimensional circular queries (Figure 13(b) and Figure 14(b)).

6 Conclusions

We have studied the clustering property of the Hilbert space-filling curve as a linear mapping of a multidimensional space. Through algebraic analysis, we have provided simple formulas that state the expected number of clusters for a given query region, and we have validated their correctness through simulation experiments. The main contributions of this paper are:

- Theorem 2 generalizes the previous work done only for a 2 x 2 query region [14], by providing an exact closed-form formula for $2^k \times 2^k$ square queries for any $k$ ($k \ge 1$). The asymptotic solution given in Theorem 1 further generalizes it for $d$-dimensional polyhedral query regions ($d \ge 2$).

- We have proved that the Hilbert curve achieves better clustering than the z curve in a 2-dimensional space; the average number of clusters for the Hilbert curve is one fourth of the perimeter of a query rectangle, while that of the z curve is one third of the perimeter plus two thirds of the side length of the rectangle in the unfavored direction [23]. Furthermore, by simulation experiments, we have shown that the Hilbert curve outperforms both the z and Gray-coded curves in 2-dimensional and 3-dimensional spaces. We conjecture that this trend will hold even in higher dimensional spaces.

- We have shown that it may incur extra overhead to access the minimum bounding hyper-rectangle for a $d$-dimensional non-rectangular query ($d \ge 3$), because it may increase the number of clusters (i.e., non-consecutive disk accesses).

The approaches used in this paper can be applied to other space-filling curves. In particular, the basic intuitions summarized in Remark 3.1 and Remark 4.1 hold for any space-filling curve. From a practical point of view, it is important to predict and minimize the number of clusters because it determines the number of non-consecutive disk accesses, which in turn incurs additional seek time. Assuming that blocks are arranged on disk by the Hilbert ordering, we can provide a simple measure that depends only on the perimeter or surface area of a given query region and its dimensionality. This measure can then be used to predict the required disk access behavior and thereby the total access time.

The full-fledged analysis of the worst-case behaviors of the Hilbert curve is left for future research. Future work also includes the extension of the exact analysis to $d$-dimensional spaces ($d \ge 3$), and the investigation of the distribution of distances between clusters.

A Appendix: Proofs

Proof of Lemma 4: A $2^+$-oriented $H_2^{k+n}$ approximation is composed of four $H_2^{k+n-1}$ approximations (two on the top and two on the bottom) and three connection edges. The two $H_2^{k+n-1}$ approximations on the top half are $2^+$-oriented, and the two $H_2^{k+n-1}$ approximations on the bottom half are $1^+$-oriented on the left and $1^-$-oriented on the right. Among the three edges connecting the four $H_2^{k+n-1}$ approximations, the horizontal edge is not included in the boundary subregion of the $H_2^{k+n}$, because this edge resides on the $2^{k+n-1}$-th row from the topmost of the $H_2^{k+n}$. The other two vertical connection edges are on the leftmost and rightmost columns and are included in the boundary subregion of the $H_2^{k+n}$. Thus, the main observations are:

(i) The number of connection edges in the top boundary subregion of the $2^+$-oriented $H_2^{k+n}$ is the sum of those in the top boundary subregions of the two $2^+$-oriented $H_2^{k+n-1}$ approximations.

(ii) The number of connection edges in the bottom boundary subregion of the $2^+$-oriented $H_2^{k+n}$ is the sum of those in the bottom boundary subregions of the $1^+$-oriented $H_2^{k+n-1}$ and the $1^-$-oriented $H_2^{k+n-1}$ approximations.

(iii) The number of connection edges in the left (or right) boundary subregion of the $2^+$-oriented $H_2^{k+n}$ is the sum of those in the left (or right) boundary subregions of the $2^+$-oriented $H_2^{k+n-1}$ and the $1^+$-oriented (or $1^-$-oriented) $H_2^{k+n-1}$ approximations, plus one for a connection edge.

Since the bottom boundary subregion of a 1 +-oriented H2+ ?1 is equivalent to the right boundary subregion of a 2 +-oriented H2+ ?1 and so on, it follows that

k n k n

t

n

b s

n

n

= 2 t ?1 = 2 s ?1 = s ?1 + b ?1 + 1:

n n n n n n n n n n

Since t1 = 1; b1 = 0 and s1 = 1, we obtain t = 2 ?1 and b + 2s = 2(b ?1 + 2s ?1 ) + 2, which yields b + 2s = 2(2 ? 1).

n n n
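The recurrences and closed forms above can be checked mechanically. A small sketch (helper name ours) iterates the recurrences from the stated initial values and compares against the closed forms:

```python
def boundary_edge_counts(n_max):
    """Iterate the Lemma 4 recurrences t_n = 2 t_{n-1}, b_n = 2 s_{n-1},
    s_n = s_{n-1} + b_{n-1} + 1 from t_1 = 1, b_1 = 0, s_1 = 1."""
    t, b, s = 1, 0, 1
    out = [(t, b, s)]
    for _ in range(n_max - 1):
        t, b, s = 2 * t, 2 * s, s + b + 1   # simultaneous update
        out.append((t, b, s))
    return out

for n, (t, b, s) in enumerate(boundary_edge_counts(20), start=1):
    assert t == 2 ** (n - 1)                 # closed form for t_n
    assert b + 2 * s == 2 * (2 ** n - 1)     # closed form for b_n + 2 s_n
```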


Proof of Lemma 5: The $H_{2^{k+n}}$ and $H_{2^k}$ approximations contain $2^{2(k+n)} - 1$ and $2^{2k} - 1$ edges, respectively. Since there are a total of $4(2^n - 1)$ $H_{2^k}$ approximations in the boundary subregions, the total number of edges in $E_1$ is given by

$(2^{2(k+n)} - 1) - 4(2^n - 1)(2^{2k} - 1) - (5 \cdot 2^{n-1} - 2) = 2^{2k}(2^n - 2)^2 + 3(2^{n-1} - 1).$

Because each edge in $E_1$ is cut $2^{k+1}$ times, it follows that

$N_1 = 2^{k+1}\left(2^{2k}(2^n - 2)^2 + 3(2^{n-1} - 1)\right) = 2(2^n - 2)^2\,2^{3k} + 3(2^n - 2)\,2^k.$

Among the $5 \cdot 2^{n-1} - 2$ edges in $E_3$, $t_n$ edges are cut $2^{k+1}$ times, and the other $b_n + 2s_n$ edges are cut twice. Therefore,

$N_3 = 2^{k+1}\,t_n + 2(b_n + 2s_n) = 2^{n+k} + 4(2^n - 1).$
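The closing algebra is easy to verify numerically; the following sketch (function name ours) checks the $E_1$ edge-count identity and the resulting $N_1$ and $N_3$ formulas over a range of $k$ and $n$:

```python
def check_lemma5(k, n):
    t = 2 ** (n - 1)                       # t_n from Lemma 4
    b2s = 2 * (2 ** n - 1)                 # b_n + 2 s_n from Lemma 4
    e1 = (2 ** (2 * (k + n)) - 1) \
         - 4 * (2 ** n - 1) * (2 ** (2 * k) - 1) \
         - (5 * 2 ** (n - 1) - 2)
    # identity for the number of edges in E1
    assert e1 == 2 ** (2 * k) * (2 ** n - 2) ** 2 + 3 * (2 ** (n - 1) - 1)
    # N1: each E1 edge is cut 2^{k+1} times
    assert 2 ** (k + 1) * e1 == 2 * (2 ** n - 2) ** 2 * 2 ** (3 * k) \
                                + 3 * (2 ** n - 2) * 2 ** k
    # N3: t_n edges cut 2^{k+1} times, the remaining b_n + 2 s_n cut twice
    assert 2 ** (k + 1) * t + 2 * b2s == 2 ** (n + k) + 4 * (2 ** n - 1)

for k in range(1, 6):
    for n in range(1, 8):
        check_lemma5(k, n)
```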

Proof of Lemma 6: Consider a 2+-oriented $H_{2^{k+n}}$, which is composed of four $H_{2^{k+n-1}}$ approximations and three connection edges. The number of 2+-oriented $H_{2^k}$ approximations in the top subregions (i.e., {B,F,H}) of the 2+-oriented $H_{2^{k+n}}$ is twice the number of 2+-oriented $H_{2^k}$ approximations in the top subregions of the 2+-oriented $H_{2^{k+n-1}}$, because the top half of the 2+-oriented $H_{2^{k+n}}$ consists of two 2+-oriented $H_{2^{k+n-1}}$ approximations. Thus the recurrence relation is $\sigma_{2^+,n}^{\{B,F,H\}} = 2\,\sigma_{2^+,n-1}^{\{B,F,H\}}$. Since $\sigma_{2^+,1}^{\{B,F,H\}} = 2$, we obtain

$\sigma_{2^+,n}^{\{B,F,H\}} = 2^n.$

The bottom half of the 2+-oriented $H_{2^{k+n}}$ consists of a 1+-oriented $H_{2^{k+n-1}}$ and a 1−-oriented $H_{2^{k+n-1}}$. In the bottom boundary subregions {C,G,I}, each 1−-oriented $H_{2^k}$ in the 1+-oriented $H_{2^{k+n-1}}$ approximation becomes a 2+-oriented $H_{2^k}$ in the 2+-oriented $H_{2^{k+n}}$ approximation; each 1+-oriented $H_{2^k}$ in the 1−-oriented $H_{2^{k+n-1}}$ approximation becomes a 2+-oriented $H_{2^k}$ in the 2+-oriented $H_{2^{k+n}}$ approximation. No $H_{2^k}$ approximations other than the 1−-oriented and 1+-oriented ones in the $H_{2^{k+n-1}}$ approximations become 2+-oriented $H_{2^k}$ approximations in the $H_{2^{k+n}}$. Thus, it follows that

$\sigma_{2^+,n}^{\{C,G,I\}} = \sigma_{1^-,n-1}^{\{C,G,I\}} + \sigma_{1^+,n-1}^{\{C,G,I\}}.$

Since there exist no 2−-oriented $H_{2^k}$ approximations in the bottom boundary subregions, $\sigma_{2^-,n}^{\{C,G,I\}} = 0$. Thus,

$\sigma_{2^+,n}^{\{C,G,I\}} + \sigma_{1^-,n}^{\{C,G,I\}} + \sigma_{1^+,n}^{\{C,G,I\}} = 2^n.$

Similarly, on the left boundary subregion, we obtain the following recurrence relations:

$\sigma_{1^+,n}^{\{D,F,G\}} = \sigma_{2^+,n-1}^{\{D,F,G\}} + \sigma_{2^-,n-1}^{\{D,F,G\}},$
$\sigma_{1^+,n}^{\{D,F,G\}} + \sigma_{2^+,n}^{\{D,F,G\}} + \sigma_{2^-,n}^{\{D,F,G\}} = 2^n.$

Then, from the above four recurrence relations,

$\sigma_{2^+,n}^{\{C,G,I\}} + 2\,\sigma_{1^+,n}^{\{D,F,G\}} = (2^{n-1} - \sigma_{2^+,n-1}^{\{C,G,I\}}) + 2(2^{n-1} - \sigma_{1^+,n-1}^{\{D,F,G\}})$
$= (2^{n-2} + \sigma_{2^+,n-2}^{\{C,G,I\}}) + 2(2^{n-2} + \sigma_{1^+,n-2}^{\{D,F,G\}})$
$= 3 \cdot 2^{n-2} + (\sigma_{2^+,n-2}^{\{C,G,I\}} + 2\,\sigma_{1^+,n-2}^{\{D,F,G\}}).$

Since $\sigma_{2^+,1}^{\{C,G,I\}} + 2\,\sigma_{1^+,1}^{\{D,F,G\}} = 2$ and $\sigma_{2^+,2}^{\{C,G,I\}} + 2\,\sigma_{1^+,2}^{\{D,F,G\}} = 4$, we obtain

$\sigma_{2^+,n}^{\{C,G,I\}} + 2\,\sigma_{1^+,n}^{\{D,F,G\}} = 2^n.$

Since $\sigma_{1^-,n}^{\{E,H,I\}} = \sigma_{1^+,n}^{\{D,F,G\}}$ due to the self-symmetry of the 2+-oriented $H_{2^{k+n}}$, it follows that

$\sigma_{2^+,n}^{\{C,G,I\}} + \sigma_{1^+,n}^{\{D,F,G\}} + \sigma_{1^-,n}^{\{E,H,I\}} = \sigma_{2^+,n}^{\{C,G,I\}} + 2\,\sigma_{1^+,n}^{\{D,F,G\}} = 2^n.$

Now consider subregions {F,G,H,I}. The $H_{2^k}$ approximations in F and H are always 2+-oriented, the $H_{2^k}$ in G is either 2+-oriented or 1+-oriented, and the $H_{2^k}$ in I is either 2+-oriented or 1−-oriented. Thus, $\sigma_{2^+,n}^{\{F,H\}} = 2$ and $\sigma_{2^+,n}^{\{G,I\}} + \sigma_{1^+,n}^{\{G,I\}} + \sigma_{1^-,n}^{\{G,I\}} = 2$. Therefore,

$\sigma_{2^+,n}^{\{B\}} = \sigma_{2^+,n}^{\{B,F,H\}} - \sigma_{2^+,n}^{\{F,H\}} = 2^n - 2,$

$\sigma_{2^+,n}^{\{C\}} + \sigma_{1^+,n}^{\{D\}} + \sigma_{1^-,n}^{\{E\}} = (\sigma_{2^+,n}^{\{C,G,I\}} + \sigma_{1^+,n}^{\{D,F,G\}} + \sigma_{1^-,n}^{\{E,H,I\}}) - (\sigma_{2^+,n}^{\{G,I\}} + \sigma_{1^+,n}^{\{G,I\}} + \sigma_{1^-,n}^{\{G,I\}}) = 2^n - 2.$

So far we have derived the first two equations given in this lemma. Finally, to derive the third equation, consider subregions {B,C,D,E}. Since the total number of $H_{2^k}$ approximations in those subregions is $4(2^n - 2)$,

$\sigma_{2^+,n}^{\{B,C,D,E\}} + \sigma_{2^-,n}^{\{B,C,D,E\}} + \sigma_{1^-,n}^{\{B,C,D,E\}} + \sigma_{1^+,n}^{\{B,C,D,E\}} = 4(2^n - 2).$

There exist no 2−-oriented $H_{2^k}$ in {B,C}, no 1−-oriented $H_{2^k}$ in {B,D}, and no 1+-oriented $H_{2^k}$ in {B,E}. That is, $\sigma_{2^-,n}^{\{B,C\}} = \sigma_{1^-,n}^{\{B,D\}} = \sigma_{1^+,n}^{\{B,E\}} = 0$. Therefore,

$\sigma_{2^+,n}^{\{D,E\}} + \sigma_{2^-,n}^{\{D,E\}} + \sigma_{1^+,n}^{\{C\}} + \sigma_{1^-,n}^{\{C\}} = 4(2^n - 2) - (\sigma_{2^+,n}^{\{B,C\}} + \sigma_{1^-,n}^{\{B,D,E\}} + \sigma_{1^+,n}^{\{B,D,E\}})$
$= 4(2^n - 2) - (\sigma_{2^+,n}^{\{B\}} + \sigma_{2^+,n}^{\{C\}} + \sigma_{1^-,n}^{\{E\}} + \sigma_{1^+,n}^{\{D\}})$
$= 2(2^n - 2).$


Proof of Lemma 7: Every $H_{2^k}$ approximation in subregion B is 2+-oriented, and there exists no 2−-oriented $H_{2^k}$ approximation in subregion C. Thus, the number of vertical edges in subregions {B,C} is the sum of $\sigma_{2^+,n}^{\{B,C\}} V_k$ and $(\sigma_{1^+,n}^{\{C\}} + \sigma_{1^-,n}^{\{C\}}) H_k$. Likewise, the number of horizontal edges in subregions {D,E} is the sum of $(\sigma_{2^+,n}^{\{D,E\}} + \sigma_{2^-,n}^{\{D,E\}}) H_k$ and $(\sigma_{1^+,n}^{\{D\}} + \sigma_{1^-,n}^{\{E\}}) V_k$, because there exist no 1−-oriented $H_{2^k}$ in subregion D and no 1+-oriented $H_{2^k}$ in subregion E. Thus, the total number of edges in $E_2$ is given by

$(\sigma_{2^+,n}^{\{B,C\}} + \sigma_{1^+,n}^{\{D\}} + \sigma_{1^-,n}^{\{E\}})V_k + (\sigma_{1^+,n}^{\{C\}} + \sigma_{1^-,n}^{\{C\}} + \sigma_{2^+,n}^{\{D,E\}} + \sigma_{2^-,n}^{\{D,E\}})H_k = 2(2^n - 2)(H_k + V_k)$

by Lemma 6. Each edge in $E_2$ is cut $2^k$ times, and $H_k + V_k = 2^{2k} - 1$. Therefore,

$N_2 = 2(2^n - 2)(2^{2k} - 1)\,2^k = 2(2^n - 2)\,2^{3k} - 2(2^n - 2)\,2^k.$

Proof of Lemma 8: First, $H_k^+ + H_k^- = \sum_{i=1}^{2^k} i\,h_k(i) + \sum_{i=1}^{2^k} (2^k - i + 1)\,h_k(i) = \sum_{i=1}^{2^k} (2^k + 1)\,h_k(i)$. From the definition of $H_k$, $H_k = \sum_{i=1}^{2^k} h_k(i)$. Therefore,

$H_k^+ + H_k^- = (2^k + 1)H_k.$

Second, $V_k^+ = \sum_{i=1}^{2^{k-1}} i\,v_k(i) + \sum_{i=2^{k-1}+1}^{2^k} i\,v_k(i) = \sum_{i=1}^{2^{k-1}} i\,v_k(i) + \sum_{i=1}^{2^{k-1}} (2^{k-1} + i)\,v_k(2^{k-1} + i)$. Since 2+-oriented $H_{2^k}$ approximations are vertically self-symmetric, $v_k(2^k - i + 1) = v_k(i)$ holds for any $i$ ($1 \le i \le 2^{k-1}$). Thus,

$V_k^+ = \sum_{i=1}^{2^{k-1}} i\,v_k(i) + \sum_{i=1}^{2^{k-1}} (2^{k-1} + i)\,v_k(2^{k-1} - i + 1) = \sum_{i=1}^{2^{k-1}} i\,v_k(i) + \sum_{i=1}^{2^{k-1}} (2^k - i + 1)\,v_k(i).$

From the definition of $V_k$ and the self-symmetry, $V_k = 2\sum_{i=1}^{2^{k-1}} v_k(i)$. Therefore,

$V_k^+ = \sum_{i=1}^{2^{k-1}} (2^k + 1)\,v_k(i) = \frac{1}{2}(2^k + 1)V_k.$
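Both identities of Lemma 8 can be confirmed against an actual curve. The sketch below (function names ours) builds the order-$k$ curve with the standard bit-manipulation inverse mapping `d2xy`, whose orientation convention is vertically self-symmetric in the same sense as the 2+-oriented approximations (which is what the second identity relies on), and then checks both equations from the per-row and per-column edge counts:

```python
def d2xy(n, d):
    """Inverse Hilbert mapping: index d -> cell (x, y) on an n x n grid."""
    x = y = 0
    s = 1
    while s < n:
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

def edge_profiles(k):
    """h[j]: horizontal edges in row j; v[i]: vertical edges in column i
    (1-indexed), for the order-k curve on a 2^k x 2^k grid."""
    n = 2 ** k
    pts = [d2xy(n, d) for d in range(n * n)]
    h = [0] * (n + 1)
    v = [0] * (n + 1)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if y0 == y1:                  # horizontal edge, lies in row y0
            h[y0 + 1] += 1
        else:                         # vertical edge, lies in column x0
            v[x0 + 1] += 1
    return h, v

for k in (1, 2, 3, 4):
    n = 2 ** k
    h, v = edge_profiles(k)
    H, V = sum(h), sum(v)
    Hp = sum(i * h[i] for i in range(1, n + 1))
    Hm = sum((n - i + 1) * h[i] for i in range(1, n + 1))
    Vp = sum(i * v[i] for i in range(1, n + 1))
    assert H + V == n * n - 1         # total edge count of the approximation
    assert Hp + Hm == (n + 1) * H     # first equation of Lemma 8
    assert 2 * Vp == (n + 1) * V      # second equation of Lemma 8
```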

Proof of Lemma 9: In $E_4$, the number of horizontal cuts from a single u-gradient is $2H_k^+$, the number of horizontal cuts from a single d-gradient is $2H_k^-$, and the number of vertical cuts from a single s-gradient is $2V_k^+$. Thus,

$N_4 = 2H_k^+\,\sigma_{2^+,n}^{\{B\}} + 2H_k^-(\sigma_{2^+,n}^{\{C\}} + \sigma_{1^+,n}^{\{D\}} + \sigma_{1^-,n}^{\{E\}}) + 2V_k^+(\sigma_{2^+,n}^{\{D,E\}} + \sigma_{2^-,n}^{\{D,E\}} + \sigma_{1^-,n}^{\{C\}} + \sigma_{1^+,n}^{\{C\}})$
$= 2H_k^+(2^n - 2) + 2H_k^-(2^n - 2) + 4V_k^+(2^n - 2)$ (by Lemma 6)
$= 2(2^n - 2)(H_k^+ + H_k^- + 2V_k^+)$
$= 2(2^n - 2)(2^k + 1)(H_k + V_k)$ (by Lemma 8)
$= 2(2^n - 2)(2^k + 1)(2^{2k} - 1).$

In $E_5$, the number of horizontal cuts from a single u-gradient is $H_k^+$, the number of horizontal cuts from a single d-gradient is $H_k^-$, and the number of vertical cuts from a single s-gradient is $V_k^+$. Thus,

$N_5 = 2H_k^+ + 2H_k^- + 4V_k^+ = 2(2^k + 1)(2^{2k} - 1).$

References

[1] David J. Abel and David M. Mark. A comparative analysis of some two-dimensional orderings. Int. J. Geographical Information Systems, 4(1):21–31, January 1990.

[2] J. J. Bartholdi and L. K. Platzman. An O(n log n) travelling salesman heuristic based on space-filling curves. Operations Research Letters, 1(4):121–125, September 1982.

[3] Marcel Berger. Geometry II. Springer-Verlag, New York, NY, 1987.

[4] Theodore Bially. Space-filling curves: Their generation and their application to bandwidth reduction. IEEE Trans. on Information Theory, IT-15(6):658–664, November 1969.

[5] Arthur R. Butz. Convergence with Hilbert's space filling curve. Journal of Computer and System Sciences, 3:128–146, 1969.

[6] I. S. Duff. Design features of a frontal code for solving sparse unsymmetric linear systems out-of-core. SIAM J. Sci. Stat. Computing, 5(2):270–280, June 1984.

[7] C. R. Dyer. The space efficiency of quadtrees. Computer Graphics and Image Processing, 19(4):335–348, August 1982.

[8] Christos Faloutsos. Multiattribute hashing using gray codes. Proceedings of the 1986 ACM SIGMOD Conference, pages 227–238, May 1986.

[9] Christos Faloutsos. Analytical results on the quadtree decomposition of arbitrary rectangles. Pattern Recognition Letters, 13(1):31–40, January 1992.

[10] Christos Faloutsos, H. V. Jagadish, and Yannis Manolopoulos. Analysis of the n-dimensional quadtree decomposition for arbitrary hyperrectangles. IEEE Transactions on Knowledge and Data Engineering, 9(3):373–383, May/June 1997.

[11] Christos Faloutsos and Shari Roseman. Fractals for secondary key retrieval. Proceedings of the 1989 ACM PODS Conference, pages 247–252, March 1989.

[12] P. J. Giblin. Graphs, Surfaces and Homology. Chapman and Hall, New York, NY, second edition, 1981.

[13] D. Hilbert. Über die stetige Abbildung einer Linie auf ein Flächenstück. Math. Annln., 38:459–460, 1891.

[14] H. V. Jagadish. Linear clustering of objects with multiple attributes. Proceedings of the 1990 ACM SIGMOD Conference, pages 332–342, May 1990.

[15] H. V. Jagadish. Spatial search with polyhedra. In Proceedings of the 6th Inter. Conference on Data Engineering, pages 311–319, Los Angeles, CA, February 1990.

[16] H. V. Jagadish. Analysis of the Hilbert curve for representing two-dimensional space. Information Processing Letters, 62(1):17–22, April 1997.

[17] Maher Kaddoura, Chao-Wei Ou, and Sanjay Ranka. Partitioning unstructured computational graphs for nonuniform and adaptive environments. IEEE Parallel and Distributed Technology, 3(3):63–69, Fall 1995.

[18] A. Lempel and J. Ziv. Compression of two-dimensional images. NATO ASI Series, F12:141–154, June 1984.

[19] J. Orenstein. Spatial query processing in an object-oriented database system. Proceedings of the 1986 ACM SIGMOD Conference, pages 326–336, May 1986.

[20] Edward A. Patrick, Douglas R. Anderson, and F. K. Bechtel. Mapping multidimensional space to one dimension for computer output display. IEEE Transactions on Computers, C-17(10):949–953, October 1968.

[21] G. Peano. Sur une courbe qui remplit toute une aire plane. Math. Annln., 36:157–160, 1890.

[22] R. L. Rivest. Partial match retrieval algorithms. SIAM J. Comput., 5(1):19–50, March 1976.

[23] Yi Rong and Christos Faloutsos. Analysis of the clustering property of Peano curves. Techn. Report CS-TR-2792, UMIACS-TR-91-151, Univ. of Maryland, December 1991.

[24] J. B. Rothnie and T. Lozano. Attribute based file organization in a paged memory environment. CACM, 17(2):63–69, February 1974.

[25] Chris Ruemmler and John Wilkes. An introduction to disk drive modeling. IEEE Computer, 27(3):17–28, March 1994.

[26] Hans Sagan. A three-dimensional Hilbert curve. Inter. J. Math. Ed. Sc. Tech., 24:541–545, 1993.

[27] C. A. Shaffer. A formula for computing the number of quadtree node fragments created by a shift. Pattern Recognition Letters, 7(1):45–49, January 1988.

[28] George F. Simmons. Introduction to Topology and Modern Analysis. McGraw-Hill Book Company, Inc., New York, 1963.
