Sparse Data Analysis: Why Linear Scan Wins

Historical Note: This analysis documents the original Hilbert-based experiment that established the sparse data optimization strategy. Morton curves have since replaced Hilbert (25% faster), and current performance exceeds the numbers cited here. Core findings remain valid: linear scan dominates at n<100, R-trees win at n>1000. For current measurements, see BENCHMARKS.md.

Finding: For sparse data (n < 100), linear scan O(n) outperforms R-tree O(log n).

Result

Data Size Winner Performance Reason
n < 100 Linear scan 6.9µs vs 20.0µs (2.9x) Cache locality, no tree overhead
n > 1000 R-tree 2.2ms vs 26ms (12x) Log traversal beats linear
100-1000 Workload-dependent See benchmarks Transition zone

Performance data (from BENCHMARKS.md, sparse-sequential write-heavy):

Impact: Changed recommendation from "always use R-tree" to "linear scan for sparse, R-tree for large".

Latest: See morton-vs-hilbert-analysis.md - Morton curves provide 25% additional speedup over Hilbert due to simpler encoding.


Problem Context

Scenario: Spreadsheet API with per-property spatial indices

class Sheet {
	backgrounds = new SpatialIndex<string>(); // Which implementation?
	fontColors = new SpatialIndex<string>();
	borders = new SpatialIndex<BorderStyle>();
	// ... 10-20 properties total
}

Each property needs its own spatial index. Should I use flat arrays (O(n)) or R-trees (O(log n))?

Sparsity Analysis

Typical spreadsheet data distribution:

For a 1000-row × 50-column sheet (50,000 cells):

Property Cells w/ Custom Value Spatial Ranges Density
Background colors ~200 cells ~50 ranges 0.4%
Font colors ~150 cells ~30 ranges 0.3%
Borders ~100 cells ~20 ranges 0.2%
Font sizes ~50 cells ~10 ranges 0.1%
Data validation ~25 cells ~5 ranges 0.05%
Conditional formats ~75 cells ~15 ranges 0.15%

Observation: Each spatial index contains n < 100 ranges typically.

The classic performance analysis assumes dense data in a single index. This use case has:

  1. Many indices (10-20 properties)
  2. Sparse data per index (n < 100)
  3. Rare queries (mostly writes, occasional serialize)

Performance Comparison

Linear Scan (Flat Array Storage)

query(bounds) {
    return this.items.filter(item => intersects(item.bounds, bounds));
}

Complexity: O(n) where n = rectangle count
Reality: n < 100 → ~400 comparisons → ~0.01ms

Advantages:

R-Tree (Hierarchical)

query(bounds) {
    return this.root.search(bounds); // Tree traversal
}

Complexity: O(log n) where n = rectangle count
Reality: log₁₀(100) = 2 levels, but tree overhead dominates

Overhead:

Empirical Results

See BENCHMARKS.md for detailed measurements. Summary:

R-tree's advantage comes from O(log n) tree traversal. Flat implementations do O(n) linear scans on every insert. But for n < 100, constant factors dominate - spatial locality and V8 optimizations make flat storage faster despite worse Big-O.

API Usage Patterns

Write-Heavy Pattern (Dominant)

// User building a spreadsheet
sheet.getRange('A1:D10').setBackground('red');
sheet.getRange('C5:F8').setFontColor('blue');
sheet.getRange('B2:B20').setBorder('thick');
// ... many more writes

Operation: insert() - O(n) for both flat and hierarchical
Frequency: Thousands of writes per session

Serialize Pattern (Occasional)

// Export to JSON or sync to backend
const serialized = sheet.toJSON();

Operation: query() (all ranges) - O(n) for both approaches
Frequency: Once per save/export

Query Pattern (Rare)

// Get background colors for specific region
const range = sheet.getRange('C5:F8');
const backgrounds = range.getBackgrounds(); // Uses query()

Operation: query(bounds) - O(n) flat vs O(log n) hierarchical
Frequency: Rare (most operations are writes)
Reality: O(n) with n=50 is still < 0.01ms

The Math

Break-Even Point Analysis

When does R-tree become faster than flat array?

Naive theoretical model (per-query cost only):

Flat array: n × 10ns = 50 × 10ns = 500ns (for n=50)
R-tree: log₂(n) × 50ns + 100ns = log₂(50) × 50ns + 100ns ≈ 380ns

Theoretical prediction: R-tree wins at n ≈ 50

Empirical reality (from BENCHMARKS.md, write-heavy workload):

Why 40x slower than theory?

  1. Construction cost dominates: Theory assumes tree already built! Actual cost includes:

    • Node allocation: ~100ns per node
    • Bounding box updates: ~50ns per level
    • Split operations: ~1000ns when node overflows
    • For n=50 inserts: ~50 × (allocation + bbox + occasional split) ≈ 15,000ns
  2. Cache effects: Flat array fits entirely in L1 cache (64 bytes × 50 = 3.2KB < 32KB L1)

    • L1 hit: ~1ns (measured 10ns with V8 overhead)
    • L3 miss: ~100ns (pointer chasing in tree)
  3. V8 optimizations: Array iteration has dedicated fast-path in JIT compiler

Corrected model (total cost including construction):

Real break-even: n ≥ 100 for write-heavy, n ≥ 1000 for read-heavy (measured empirically)

Decision Matrix

Scenario n (ranges) Best Choice Reasoning
Single cell formatting < 50 RectangleStore Trivial data, flat is fastest
Typical spreadsheet use 50-200 RectangleStore Sparse data, low overhead wins
Heavy conditional formats 200-1000 RectangleStore Still sparse, tree overhead not justified
Massive formatted regions > 1000 R-tree (maybe) Hierarchical pruning starts to help
GIS/mapping application > 10,000 R-tree Dense spatial data, query-heavy

Recommendations

For Spreadsheet Property Storage

Use linear scan (flat array storage):

  1. Sparse data per index (n < 100 typically)
  2. Write-heavy workload (insert dominates)
  3. Simple, predictable performance
  4. Lower memory overhead (20 indices × flat array < 20 indices × tree)

When to Consider R-Trees

Use hierarchical indexing when:

  1. Dense data (n > 5000 ranges per index)
  2. Query-heavy workload (frequent spatial lookups)
  3. Large query regions (where pruning matters)
  4. Single large index (not 20 sparse ones)

Conclusion

For the spreadsheet property storage use case:

The R-tree implementation serves as an academic reference and would be appropriate for:

Measured result: For n < 100, HilbertLinearScan is 2.9x faster than R-tree (6.9µs vs 20.0µs) due to lower overhead, spatial ordering, and V8 optimizations.

Evolution: Initial research used OptimizedLinearScan (11.6µs, 1.7x faster). Hilbert curve ordering improved this to 6.9µs (2.9x faster), validating that spatial locality optimizations matter even for small datasets.


References

  1. Empirical measurements: BENCHMARKS.md for detailed performance data
  2. Guttman, A. (1984). "R-trees: A Dynamic Index Structure For Spatial Searching". Shows R-trees excel for large, query-heavy datasets.
  3. V8 optimization docs: Array iteration is heavily optimized in modern JavaScript engines.