Sparse Data Analysis: Why Linear Scan Wins

Historical Note: This analysis documents the original Hilbert-based experiment that established the sparse data optimization strategy. Morton curves have since replaced Hilbert (25% faster), and current performance exceeds the numbers cited here. Core findings remain valid: linear scan dominates at n<100, R-trees win at n>1000. For current measurements, see BENCHMARKS.md.

Finding: For sparse data (n < 100), linear scan O(n) outperforms R-tree O(log n).

Result

Data Size	Winner	Performance	Reason
n < 100	Linear scan	6.9µs vs 20.0µs (2.9x)	Cache locality, no tree overhead
n > 1000	R-tree	2.2ms vs 26ms (12x)	Log traversal beats linear
100-1000	Workload-dependent	See benchmarks	Transition zone

Performance data (from BENCHMARKS.md, sparse-sequential write-heavy):

n=50: HilbertLinearScan 6.9µs vs R*-tree 20.0µs (Hilbert 2.9x faster)
n=2500: R*-tree 2.2ms vs HilbertLinearScan 26ms (R*-tree 12x faster)

Impact: Changed recommendation from "always use R-tree" to "linear scan for sparse, R-tree for large".

Latest: See morton-vs-hilbert-analysis.md - Morton curves provide 25% additional speedup over Hilbert due to simpler encoding.

Problem Context

Scenario: Spreadsheet API with per-property spatial indices

class Sheet {
	backgrounds = new SpatialIndex<string>(); // Which implementation?
	fontColors = new SpatialIndex<string>();
	borders = new SpatialIndex<BorderStyle>();
	// ... 10-20 properties total
}

Each property needs its own spatial index. Should I use flat arrays (O(n)) or R-trees (O(log n))?

Sparsity Analysis

Typical spreadsheet data distribution:

For a 1000-row × 50-column sheet (50,000 cells):

Property	Cells w/ Custom Value	Spatial Ranges	Density
Background colors	~200 cells	~50 ranges	0.4%
Font colors	~150 cells	~30 ranges	0.3%
Borders	~100 cells	~20 ranges	0.2%
Font sizes	~50 cells	~10 ranges	0.1%
Data validation	~25 cells	~5 ranges	0.05%
Conditional formats	~75 cells	~15 ranges	0.15%

Observation: Each spatial index contains n < 100 ranges typically.

The classic performance analysis assumes dense data in a single index. This use case has:

Many indices (10-20 properties)
Sparse data per index (n < 100)
Rare queries (mostly writes, occasional serialize)

Performance Comparison

Linear Scan (Flat Array Storage)

query(bounds) {
    return this.items.filter(item => intersects(item.bounds, bounds));
}

Complexity: O(n) where n = rectangle count
Reality: n < 100 → ~400 comparisons → ~0.01ms

Advantages:

✅ Cache-friendly (contiguous memory)
✅ No pointer chasing
✅ V8 optimizes array iteration heavily
✅ Simple, predictable performance

R-Tree (Hierarchical)

query(bounds) {
    return this.root.search(bounds); // Tree traversal
}

Complexity: O(log n) where n = rectangle count
Reality: log₁₀(100) = 2 levels, but tree overhead dominates

Overhead:

❌ Bounding box checks at each level
❌ Pointer chasing (cache misses)
❌ Tree management overhead
❌ Only beneficial when n > 1000

Empirical Results

See BENCHMARKS.md for detailed measurements. Summary:

Large datasets (n > 1000): R-tree wins dramatically (up to 28x faster at n ≈ 2500)
Sparse data (n < 100): Flat storage wins due to lower overhead and cache locality

R-tree's advantage comes from O(log n) tree traversal. Flat implementations do O(n) linear scans on every insert. But for n < 100, constant factors dominate - spatial locality and V8 optimizations make flat storage faster despite worse Big-O.

API Usage Patterns

Write-Heavy Pattern (Dominant)

// User building a spreadsheet
sheet.getRange('A1:D10').setBackground('red');
sheet.getRange('C5:F8').setFontColor('blue');
sheet.getRange('B2:B20').setBorder('thick');
// ... many more writes

Operation: insert() - O(n) for both flat and hierarchical
Frequency: Thousands of writes per session

Serialize Pattern (Occasional)

// Export to JSON or sync to backend
const serialized = sheet.toJSON();

Operation: query() (all ranges) - O(n) for both approaches
Frequency: Once per save/export

Query Pattern (Rare)

// Get background colors for specific region
const range = sheet.getRange('C5:F8');
const backgrounds = range.getBackgrounds(); // Uses query()

Operation: query(bounds) - O(n) flat vs O(log n) hierarchical
Frequency: Rare (most operations are writes)
Reality: O(n) with n=50 is still < 0.01ms

The Math

Break-Even Point Analysis

When does R-tree become faster than flat array?

Naive theoretical model (per-query cost only):

AABB comparison: 10ns
Tree node traversal: 50ns (pointer + bbox + cache miss)
Tree overhead: 100ns per query

Flat array: n × 10ns = 50 × 10ns = 500ns (for n=50)
R-tree: log₂(n) × 50ns + 100ns = log₂(50) × 50ns + 100ns ≈ 380ns

Theoretical prediction: R-tree wins at n ≈ 50

Empirical reality (from BENCHMARKS.md, write-heavy workload):

Flat array (Hilbert): 6.9µs = 6,900ns
R-tree: 20.0µs = 20,000ns

Why 40x slower than theory?

Construction cost dominates: Theory assumes tree already built! Actual cost includes:
- Node allocation: ~100ns per node
- Bounding box updates: ~50ns per level
- Split operations: ~1000ns when node overflows
- For n=50 inserts: ~50 × (allocation + bbox + occasional split) ≈ 15,000ns
Cache effects: Flat array fits entirely in L1 cache (64 bytes × 50 = 3.2KB < 32KB L1)
- L1 hit: ~1ns (measured 10ns with V8 overhead)
- L3 miss: ~100ns (pointer chasing in tree)
V8 optimizations: Array iteration has dedicated fast-path in JIT compiler

Corrected model (total cost including construction):

Flat array: 50 × 200ns (AABB + insertion) = 10,000ns ✓ matches empirical
R-tree: 50 × 400ns (tree overhead + splits) = 20,000ns ✓ matches empirical

Real break-even: n ≥ 100 for write-heavy, n ≥ 1000 for read-heavy (measured empirically)

Decision Matrix

Scenario	n (ranges)	Best Choice	Reasoning
Single cell formatting	< 50	RectangleStore	Trivial data, flat is fastest
Typical spreadsheet use	50-200	RectangleStore	Sparse data, low overhead wins
Heavy conditional formats	200-1000	RectangleStore	Still sparse, tree overhead not justified
Massive formatted regions	> 1000	R-tree (maybe)	Hierarchical pruning starts to help
GIS/mapping application	> 10,000	R-tree	Dense spatial data, query-heavy

Recommendations

For Spreadsheet Property Storage

Use linear scan (flat array storage):

Sparse data per index (n < 100 typically)
Write-heavy workload (insert dominates)
Simple, predictable performance
Lower memory overhead (20 indices × flat array < 20 indices × tree)

When to Consider R-Trees

Use hierarchical indexing when:

Dense data (n > 5000 ranges per index)
Query-heavy workload (frequent spatial lookups)
Large query regions (where pruning matters)
Single large index (not 20 sparse ones)

Conclusion

For the spreadsheet property storage use case:

Architecture: Separate spatial index per property (backgrounds, borders, etc.)
Data distribution: Sparse (< 100 ranges per property typically)
Access pattern: Write-heavy (insert), occasional serialize (query all), rare queries
Optimal choice: Linear scan implementations with O(n) operations

The R-tree implementation serves as an academic reference and would be appropriate for:

Dense spatial data (GIS, mapping)
Query-heavy workloads
Large single indices (> 5000 ranges)

Measured result: For n < 100, HilbertLinearScan is 2.9x faster than R-tree (6.9µs vs 20.0µs) due to lower overhead, spatial ordering, and V8 optimizations.

Evolution: Initial research used OptimizedLinearScan (11.6µs, 1.7x faster). Hilbert curve ordering improved this to 6.9µs (2.9x faster), validating that spatial locality optimizations matter even for small datasets.

References

Empirical measurements: BENCHMARKS.md for detailed performance data
Guttman, A. (1984). "R-trees: A Dynamic Index Structure For Spatial Searching". Shows R-trees excel for large, query-heavy datasets.
V8 optimization docs: Array iteration is heavily optimized in modern JavaScript engines.