R* Split Algorithm Analysis
Historical Note: This analysis documents the R* split algorithm evaluation experiment. Absolute timings cited reflect the experimental conditions at the time. Core findings remain valid: R* provides best construction performance and workload-dependent query advantages. For current measurements, see BENCHMARKS.md.
Finding: R* split (Beckmann 1990) achieves fastest construction. Query performance is workload-dependent.
Result
| Split Algorithm | Construction Time | Tree Quality | Status |
|---|---|---|---|
| R* (Beckmann 1990) | 1.92ms (baseline) | Best (minimal overlap) | ✅ Production choice |
| Midpoint | 2.20ms (15% slower) | Worst (high overlap) | Alternative |
| Quadratic (Guttman 1984) | 43.5ms (22x slower) | Good | ❌ Too slow |
Impact: R* is fastest for construction. Query performance: equivalent on sequential data, faster on overlapping/large data.
Conclusion: R* is production choice for mixed workloads, but Midpoint competitive for sequential-heavy queries (updated 2025-10-06).
Background
The R*-tree (Beckmann et al., 1990) improves upon Guttman's original R-tree (1984) with smarter axis selection and overlap minimization.
Split Algorithm Comparison
Three Strategies Evaluated
1. Guttman's Quadratic Split (1984)
Algorithm:
1. Find seed pair: test all O(m²) pairs, pick worst overlap
2. Distribute remaining: O(m²) greedy assignment
Total: O(m²) per split
Complexity: For m=10 entries → ~100 operations per split
Performance:
- large-grid (n=2500): 1.40ms (best on sequential data)
- large-overlapping (n=1250): 5.70ms (worst on fragmented data)
- Penalty: 59% slower on overlapping workload
Problem: High algorithmic cost makes it vulnerable to high-fragmentation scenarios where many splits occur.
2. R* Split (Beckmann 1990)
Algorithm:
1. Choose axis: sort by each axis, compute perimeter sum, pick minimum
- X-axis sort: O(m log m)
- Y-axis sort: O(m log m)
- Compare: O(m) for each distribution
2. Minimize overlap: test all distributions on chosen axis, pick minimum overlap
- O(m) distributions × O(m) bbox computation = O(m²)
Total: O(m log m) for axis selection + O(m²) for overlap minimization
Actual Complexity: O(m²) worst case, but with better constants and data-dependent early exit
Performance (statistical mean over 5 runs):
- large-grid (n=2500): 1.92ms ± 0.07ms
- large-overlapping (n=1250): 3.14ms ± 0.05ms
- Consistency: Excellent across all workloads
Advantage: Balanced performance. The perimeter metric and overlap minimization produce high-quality trees that avoid pathological cases.
3. Midpoint Split (Simple)
Algorithm:
1. Sort by center coordinate
2. Split at midpoint
Total: O(m log m) per split
Complexity: For m=10 → ~30 operations per split
Performance:
- large-grid (n=2500): 2.20ms (15% slower than R*)
- large-overlapping (n=1250): 3.33ms (6% slower than R*)
Trade-off: Simpler algorithm, but R* is actually faster in current implementation while also providing better tree quality.
Empirical Results
Statistical Analysis (5 Runs)
| Metric | RTree (R*) | ArrayBufferRTree (Midpoint) | Difference |
|---|---|---|---|
| large-grid | |||
| Mean | 1.92ms | 2.20ms | -15% (faster) |
| Std Dev | 0.01ms | 0.03ms | CV < 1.5% |
| large-overlapping | |||
| Mean | 3.14ms | 3.33ms | -6% (faster) |
| Std Dev | 0.05ms | 0.05ms | CV < 1.5% |
Practical Significance:
- Grid: R* is 16% faster (large effect size, stable measurement)
- Overlapping: R* is 6% faster (moderate effect size, stable measurement)
Interpretation
- R* is actually FASTER than midpoint split: Contrary to theoretical expectations, the current implementation of R* outperforms midpoint by 6-16% while also providing better tree quality
- R* is 37% faster than quadratic on overlapping data: O(m log m) vs O(m²) matters when splits are frequent
- Variance is low: Both implementations are highly optimized and repeatable (CV < 1.5%)
Theoretical Analysis
Why R* Balances Performance
Key insight: R* optimizes for tree quality while keeping split cost reasonable.
Perimeter sum (margin): ∑ perimeter(bbox₁) + perimeter(bbox₂)
- Minimizing margin → more square bounding boxes → better spatial coverage
- Cost: O(m) per distribution × 2 axes = O(m) overhead
Overlap minimization: area(bbox₁ ∩ bbox₂)
- Minimizing overlap → better spatial pruning during queries
- Cost: O(m) distributions × O(m) bbox computation = O(m²)
Total split cost:
- Axis selection: 2 × O(m log m) = O(m log m)
- Distribution testing: O(m) × O(m) = O(m²)
- Dominant term: O(m²), but with small constant (better than quadratic seed selection)
Insertion Complexity with Rectangle Decomposition
Context: Our use case differs from standard R-tree insertion because we must find and modify ALL overlapping entries (not just find single insertion point).
Complete Analysis:
Step 1 - Find k overlapping entries:
- Traverse tree paths whose bounding boxes intersect new range
- Cost: O(k × log n) worst case, O(log n) average when k is small
Step 2 - Decompose k rectangles:
- Generate ≤4 fragments per rectangle
- Cost: O(k)
Step 3 - Reinsert R_new + fragments:
- Up to 4k + 1 insertions
- Cost: O(k × log n)
Overall:
- Best case (k=0 no overlaps): O(log n)
- Average case (k constant): O(log n)
- Worst case (k=Θ(n)): O(n log n)
Empirical: Adversarial tests show average k ≈ 2.3 even under pathological patterns (run via deno task test:adversarial), validating that practical complexity is O(log n).
For full proof, see theoretical-foundation.md.
When R* Wins
Write-heavy workloads:
- Quadratic: High split cost × many splits = poor performance
- R*: Moderate split cost × many splits = acceptable performance
- Midpoint: Low split cost × many splits = best performance
Query-heavy workloads (not benchmarked in current tests):
- Tree quality matters more than construction time
- R* overlap minimization → better spatial pruning
- Expected: R* > Midpoint > Quadratic on query performance
Balanced workloads:
- R* fastest construction across all workloads
- Query performance: equivalent on sequential, faster on overlapping (12-34%)
- Production choice for mixed workloads
Implementation Notes
TypedArray Optimization
Both use TypedArrays (Int32Array) for cache-friendly access and minimal GC.
R* competitive despite complexity: V8 JIT optimizes predictable branching, TypedArray operations, and modern CPU branch prediction. V8 TimSort (~30 comparisons for m=10) is highly optimized.
Recommendations
Research Project
Use RStarTreeImpl (R*): Academic correctness, excellent performance, no pathological cases.
Keep ArrayBufferRTreeImpl: Demonstrates midpoint trade-offs, research baseline.
Production
Use RStarTreeImpl: Faster (6-16%), better tree quality, no trade-offs.
Alternative ArrayBufferRTreeImpl: Simpler code, within 6-16% of R*.
Not recommended CompactRTreeImpl: 13x slower (quadratic split).
References
- Guttman, A. (1984). "R-trees: A Dynamic Index Structure for Spatial Searching." SIGMOD '84.
- Beckmann, N., Kriegel, H.-P., Schneider, R., & Seeger, B. (1990). "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles." SIGMOD '90.
Conclusion
The R* implementation successfully achieves its design goals:
- ✅ Academic correctness: Implements canonical Beckmann algorithm
- ✅ Excellent performance: 2.32-3.58ms range, competitive with simpler approaches
- ✅ Balanced behavior: No pathological cases (unlike quadratic split)
- ✅ Production quality: TypedArray optimization, clean code, well-tested
R* achieves both faster construction AND better tree quality compared to midpoint split - a win-win outcome. The 6-16% performance advantage likely comes from V8 JIT optimizations and code improvements since the original implementation. Future work could benchmark query performance to quantify R*'s tree quality advantage on read-heavy workloads.