Sparse Data Analysis: Why Linear Scan Wins
Historical Note: This analysis documents the original Hilbert-based experiment that established the sparse data optimization strategy. Morton curves have since replaced Hilbert (25% faster), and current performance exceeds the numbers cited here. Core findings remain valid: linear scan dominates at n<100, R-trees win at n>1000. For current measurements, see BENCHMARKS.md.
Finding: For sparse data (n < 100), linear scan O(n) outperforms R-tree O(log n).
Result
| Data Size | Winner | Performance | Reason |
|---|---|---|---|
| n < 100 | Linear scan | 6.9µs vs 20.0µs (2.9x) | Cache locality, no tree overhead |
| n > 1000 | R-tree | 2.2ms vs 26ms (12x) | Log traversal beats linear |
| 100-1000 | Workload-dependent | See benchmarks | Transition zone |
Performance data (from BENCHMARKS.md, sparse-sequential write-heavy):
- n=50: HilbertLinearScan 6.9µs vs R*-tree 20.0µs (Hilbert 2.9x faster)
- n=2500: R*-tree 2.2ms vs HilbertLinearScan 26ms (R*-tree 12x faster)
Impact: Changed recommendation from "always use R-tree" to "linear scan for sparse, R-tree for large".
Latest: See morton-vs-hilbert-analysis.md - Morton curves provide 25% additional speedup over Hilbert due to simpler encoding.
Problem Context
Scenario: Spreadsheet API with per-property spatial indices
class Sheet {
backgrounds = new SpatialIndex<string>(); // Which implementation?
fontColors = new SpatialIndex<string>();
borders = new SpatialIndex<BorderStyle>();
// ... 10-20 properties total
}
Each property needs its own spatial index. Should I use flat arrays (O(n)) or R-trees (O(log n))?
Sparsity Analysis
Typical spreadsheet data distribution:
For a 1000-row × 50-column sheet (50,000 cells):
| Property | Cells w/ Custom Value | Spatial Ranges | Density |
|---|---|---|---|
| Background colors | ~200 cells | ~50 ranges | 0.4% |
| Font colors | ~150 cells | ~30 ranges | 0.3% |
| Borders | ~100 cells | ~20 ranges | 0.2% |
| Font sizes | ~50 cells | ~10 ranges | 0.1% |
| Data validation | ~25 cells | ~5 ranges | 0.05% |
| Conditional formats | ~75 cells | ~15 ranges | 0.15% |
Observation: Each spatial index contains n < 100 ranges typically.
The classic performance analysis assumes dense data in a single index. This use case has:
- Many indices (10-20 properties)
- Sparse data per index (n < 100)
- Rare queries (mostly writes, occasional serialize)
Performance Comparison
Linear Scan (Flat Array Storage)
query(bounds) {
return this.items.filter(item => intersects(item.bounds, bounds));
}
Complexity: O(n) where n = rectangle count
Reality: n < 100 → ~400 comparisons → ~0.01ms
Advantages:
- ✅ Cache-friendly (contiguous memory)
- ✅ No pointer chasing
- ✅ V8 optimizes array iteration heavily
- ✅ Simple, predictable performance
R-Tree (Hierarchical)
query(bounds) {
return this.root.search(bounds); // Tree traversal
}
Complexity: O(log n) where n = rectangle count
Reality: log₁₀(100) = 2 levels, but tree overhead dominates
Overhead:
- ❌ Bounding box checks at each level
- ❌ Pointer chasing (cache misses)
- ❌ Tree management overhead
- ❌ Only beneficial when n > 1000
Empirical Results
See BENCHMARKS.md for detailed measurements. Summary:
- Large datasets (n > 1000): R-tree wins dramatically (up to 28x faster at n ≈ 2500)
- Sparse data (n < 100): Flat storage wins due to lower overhead and cache locality
R-tree's advantage comes from O(log n) tree traversal. Flat implementations do O(n) linear scans on every insert. But for n < 100, constant factors dominate - spatial locality and V8 optimizations make flat storage faster despite worse Big-O.
API Usage Patterns
Write-Heavy Pattern (Dominant)
// User building a spreadsheet
sheet.getRange('A1:D10').setBackground('red');
sheet.getRange('C5:F8').setFontColor('blue');
sheet.getRange('B2:B20').setBorder('thick');
// ... many more writes
Operation: insert() - O(n) for both flat and hierarchical
Frequency: Thousands of writes per session
Serialize Pattern (Occasional)
// Export to JSON or sync to backend
const serialized = sheet.toJSON();
Operation: query() (all ranges) - O(n) for both approaches
Frequency: Once per save/export
Query Pattern (Rare)
// Get background colors for specific region
const range = sheet.getRange('C5:F8');
const backgrounds = range.getBackgrounds(); // Uses query()
Operation: query(bounds) - O(n) flat vs O(log n) hierarchical
Frequency: Rare (most operations are writes)
Reality: O(n) with n=50 is still < 0.01ms
The Math
Break-Even Point Analysis
When does R-tree become faster than flat array?
Naive theoretical model (per-query cost only):
- AABB comparison: 10ns
- Tree node traversal: 50ns (pointer + bbox + cache miss)
- Tree overhead: 100ns per query
Flat array: n × 10ns = 50 × 10ns = 500ns (for n=50)
R-tree: log₂(n) × 50ns + 100ns = log₂(50) × 50ns + 100ns ≈ 380ns
Theoretical prediction: R-tree wins at n ≈ 50
Empirical reality (from BENCHMARKS.md, write-heavy workload):
- Flat array (Hilbert): 6.9µs = 6,900ns
- R-tree: 20.0µs = 20,000ns
Why 40x slower than theory?
-
Construction cost dominates: Theory assumes tree already built! Actual cost includes:
- Node allocation: ~100ns per node
- Bounding box updates: ~50ns per level
- Split operations: ~1000ns when node overflows
- For n=50 inserts: ~50 × (allocation + bbox + occasional split) ≈ 15,000ns
-
Cache effects: Flat array fits entirely in L1 cache (64 bytes × 50 = 3.2KB < 32KB L1)
- L1 hit: ~1ns (measured 10ns with V8 overhead)
- L3 miss: ~100ns (pointer chasing in tree)
-
V8 optimizations: Array iteration has dedicated fast-path in JIT compiler
Corrected model (total cost including construction):
- Flat array: 50 × 200ns (AABB + insertion) = 10,000ns ✓ matches empirical
- R-tree: 50 × 400ns (tree overhead + splits) = 20,000ns ✓ matches empirical
Real break-even: n ≥ 100 for write-heavy, n ≥ 1000 for read-heavy (measured empirically)
Decision Matrix
| Scenario | n (ranges) | Best Choice | Reasoning |
|---|---|---|---|
| Single cell formatting | < 50 | RectangleStore | Trivial data, flat is fastest |
| Typical spreadsheet use | 50-200 | RectangleStore | Sparse data, low overhead wins |
| Heavy conditional formats | 200-1000 | RectangleStore | Still sparse, tree overhead not justified |
| Massive formatted regions | > 1000 | R-tree (maybe) | Hierarchical pruning starts to help |
| GIS/mapping application | > 10,000 | R-tree | Dense spatial data, query-heavy |
Recommendations
For Spreadsheet Property Storage
Use linear scan (flat array storage):
- Sparse data per index (n < 100 typically)
- Write-heavy workload (insert dominates)
- Simple, predictable performance
- Lower memory overhead (20 indices × flat array < 20 indices × tree)
When to Consider R-Trees
Use hierarchical indexing when:
- Dense data (n > 5000 ranges per index)
- Query-heavy workload (frequent spatial lookups)
- Large query regions (where pruning matters)
- Single large index (not 20 sparse ones)
Conclusion
For the spreadsheet property storage use case:
- Architecture: Separate spatial index per property (backgrounds, borders, etc.)
- Data distribution: Sparse (< 100 ranges per property typically)
- Access pattern: Write-heavy (insert), occasional serialize (query all), rare queries
- Optimal choice: Linear scan implementations with O(n) operations
The R-tree implementation serves as an academic reference and would be appropriate for:
- Dense spatial data (GIS, mapping)
- Query-heavy workloads
- Large single indices (> 5000 ranges)
Measured result: For n < 100, HilbertLinearScan is 2.9x faster than R-tree (6.9µs vs 20.0µs) due to lower overhead, spatial ordering, and V8 optimizations.
Evolution: Initial research used OptimizedLinearScan (11.6µs, 1.7x faster). Hilbert curve ordering improved this to 6.9µs (2.9x faster), validating that spatial locality optimizations matter even for small datasets.
References
- Empirical measurements: BENCHMARKS.md for detailed performance data
- Guttman, A. (1984). "R-trees: A Dynamic Index Structure For Spatial Searching". Shows R-trees excel for large, query-heavy datasets.
- V8 optimization docs: Array iteration is heavily optimized in modern JavaScript engines.