R* Split Algorithm Analysis

Historical Note: This analysis documents the R* split algorithm evaluation experiment. Absolute timings cited reflect the experimental conditions at the time. Core findings remain valid: R* provides best construction performance and workload-dependent query advantages. For current measurements, see BENCHMARKS.md.

Finding: R* split (Beckmann 1990) achieves fastest construction. Query performance is workload-dependent.

Result

Split Algorithm	Construction Time	Tree Quality	Status
R* (Beckmann 1990)	1.92ms (baseline)	Best (minimal overlap)	✅ Production choice
Midpoint	2.20ms (15% slower)	Worst (high overlap)	Alternative
Quadratic (Guttman 1984)	43.5ms (22x slower)	Good	❌ Too slow

Impact: R* is fastest for construction. Query performance: equivalent on sequential data, faster on overlapping/large data.

Conclusion: R* is production choice for mixed workloads, but Midpoint competitive for sequential-heavy queries (updated 2025-10-06).

Background

The R*-tree (Beckmann et al., 1990) improves upon Guttman's original R-tree (1984) with smarter axis selection and overlap minimization.

Split Algorithm Comparison

Three Strategies Evaluated

1. Guttman's Quadratic Split (1984)

Algorithm:

1. Find seed pair: test all O(m²) pairs, pick worst overlap
2. Distribute remaining: O(m²) greedy assignment
Total: O(m²) per split

Complexity: For m=10 entries → ~100 operations per split

Performance:

large-grid (n=2500): 1.40ms (best on sequential data)
large-overlapping (n=1250): 5.70ms (worst on fragmented data)
Penalty: 59% slower on overlapping workload

Problem: High algorithmic cost makes it vulnerable to high-fragmentation scenarios where many splits occur.

2. R* Split (Beckmann 1990)

Algorithm:

1. Choose axis: sort by each axis, compute perimeter sum, pick minimum
   - X-axis sort: O(m log m)
   - Y-axis sort: O(m log m)
   - Compare: O(m) for each distribution
2. Minimize overlap: test all distributions on chosen axis, pick minimum overlap
   - O(m) distributions × O(m) bbox computation = O(m²)
Total: O(m log m) for axis selection + O(m²) for overlap minimization

Actual Complexity: O(m²) worst case, but with better constants and data-dependent early exit

Performance (statistical mean over 5 runs):

large-grid (n=2500): 1.92ms ± 0.07ms
large-overlapping (n=1250): 3.14ms ± 0.05ms
Consistency: Excellent across all workloads

Advantage: Balanced performance. The perimeter metric and overlap minimization produce high-quality trees that avoid pathological cases.

3. Midpoint Split (Simple)

Algorithm:

1. Sort by center coordinate
2. Split at midpoint
Total: O(m log m) per split

Complexity: For m=10 → ~30 operations per split

Performance:

large-grid (n=2500): 2.20ms (15% slower than R*)
large-overlapping (n=1250): 3.33ms (6% slower than R*)

Trade-off: Simpler algorithm, but R* is actually faster in current implementation while also providing better tree quality.

Empirical Results

Statistical Analysis (5 Runs)

Metric	RTree (R*)	ArrayBufferRTree (Midpoint)	Difference
large-grid
Mean	1.92ms	2.20ms	-15% (faster)
Std Dev	0.01ms	0.03ms	CV < 1.5%
large-overlapping
Mean	3.14ms	3.33ms	-6% (faster)
Std Dev	0.05ms	0.05ms	CV < 1.5%

Practical Significance:

Grid: R* is 16% faster (large effect size, stable measurement)
Overlapping: R* is 6% faster (moderate effect size, stable measurement)

Interpretation

R* is actually FASTER than midpoint split: Contrary to theoretical expectations, the current implementation of R* outperforms midpoint by 6-16% while also providing better tree quality
R* is 37% faster than quadratic on overlapping data: O(m log m) vs O(m²) matters when splits are frequent
Variance is low: Both implementations are highly optimized and repeatable (CV < 1.5%)

Theoretical Analysis

Why R* Balances Performance

Key insight: R* optimizes for tree quality while keeping split cost reasonable.

Perimeter sum (margin): ∑ perimeter(bbox₁) + perimeter(bbox₂)

Minimizing margin → more square bounding boxes → better spatial coverage
Cost: O(m) per distribution × 2 axes = O(m) overhead

Overlap minimization: area(bbox₁ ∩ bbox₂)

Minimizing overlap → better spatial pruning during queries
Cost: O(m) distributions × O(m) bbox computation = O(m²)

Total split cost:

Axis selection: 2 × O(m log m) = O(m log m)
Distribution testing: O(m) × O(m) = O(m²)
Dominant term: O(m²), but with small constant (better than quadratic seed selection)

Insertion Complexity with Rectangle Decomposition

Context: Our use case differs from standard R-tree insertion because we must find and modify ALL overlapping entries (not just find single insertion point).

Complete Analysis:

Step 1 - Find k overlapping entries:

Traverse tree paths whose bounding boxes intersect new range
Cost: O(k × log n) worst case, O(log n) average when k is small

Step 2 - Decompose k rectangles:

Generate ≤4 fragments per rectangle
Cost: O(k)

Step 3 - Reinsert R_new + fragments:

Up to 4k + 1 insertions
Cost: O(k × log n)

Overall:

Best case (k=0 no overlaps): O(log n)
Average case (k constant): O(log n)
Worst case (k=Θ(n)): O(n log n)

Empirical: Adversarial tests show average k ≈ 2.3 even under pathological patterns (run via deno task test:adversarial), validating that practical complexity is O(log n).

For full proof, see theoretical-foundation.md.

When R* Wins

Write-heavy workloads:

Quadratic: High split cost × many splits = poor performance
R*: Moderate split cost × many splits = acceptable performance
Midpoint: Low split cost × many splits = best performance

Query-heavy workloads (not benchmarked in current tests):

Tree quality matters more than construction time
R* overlap minimization → better spatial pruning
Expected: R* > Midpoint > Quadratic on query performance

Balanced workloads:

R* fastest construction across all workloads
Query performance: equivalent on sequential, faster on overlapping (12-34%)
Production choice for mixed workloads

Implementation Notes

TypedArray Optimization

Both use TypedArrays (Int32Array) for cache-friendly access and minimal GC.

R* competitive despite complexity: V8 JIT optimizes predictable branching, TypedArray operations, and modern CPU branch prediction. V8 TimSort (~30 comparisons for m=10) is highly optimized.

Recommendations

Research Project

Use RStarTreeImpl (R*): Academic correctness, excellent performance, no pathological cases.

Keep ArrayBufferRTreeImpl: Demonstrates midpoint trade-offs, research baseline.

Production

Use RStarTreeImpl: Faster (6-16%), better tree quality, no trade-offs.

Alternative ArrayBufferRTreeImpl: Simpler code, within 6-16% of R*.

Not recommended CompactRTreeImpl: 13x slower (quadratic split).

References

Guttman, A. (1984). "R-trees: A Dynamic Index Structure for Spatial Searching." SIGMOD '84.
Beckmann, N., Kriegel, H.-P., Schneider, R., & Seeger, B. (1990). "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles." SIGMOD '90.

Conclusion

The R* implementation successfully achieves its design goals:

✅ Academic correctness: Implements canonical Beckmann algorithm
✅ Excellent performance: 2.32-3.58ms range, competitive with simpler approaches
✅ Balanced behavior: No pathological cases (unlike quadratic split)
✅ Production quality: TypedArray optimization, clean code, well-tested

R* achieves both faster construction AND better tree quality compared to midpoint split - a win-win outcome. The 6-16% performance advantage likely comes from V8 JIT optimizations and code improvements since the original implementation. Future work could benchmark query performance to quantify R*'s tree quality advantage on read-heavy workloads.