Add cross-filtering to Explorer facet counts #94

rdhyee wants to merge 4 commits into isamplesorg:main
Conversation
**Pre-cached cross-filter strategy (from Eric Kansa via Slack)**

*Note: analysis below generated by Claude Code based on Eric's suggestion and the current Explorer architecture.*

Eric suggested pre-caching facet counts for filtered subsets, similar to how Open Context uses Django caching with a cache-warming script. Here's the full analysis for our dataset:

**Our facet dimensions**
**Combinatorics**

Single-value-per-facet (Eric's model): 5 × 11 × 9 × 9 = 4,455 combinations. Each combination stores counts for all ~30 facet values across all dimensions. That's ~130K rows — trivially small as a parquet file, probably under 1 MB.

Multi-value-per-facet (checkboxes allow this): each facet has 2^n subsets, so 2^4 × 2^10 × 2^8 × 2^8 = 2^30 ≈ 1 billion combinations. Obviously not pre-cacheable.

Practical middle ground: pre-cache the single-value combinations (covers the most common interaction pattern — user clicks one checkbox at a time), and fall back to on-the-fly queries for multi-value selections. This is exactly Eric's "not in cache → calculate on the fly" pattern.

**Comparison with current PR approach**
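The arithmetic can be checked with a short script. The per-facet cardinalities 4/10/8/8 are inferred from the 5 × 11 × 9 × 9 figure (each dimension's option count includes one extra "no filter" choice):

```python
from math import prod

# Inferred facet cardinalities: 4 sources, 10 materials, 8 contexts, 8 specimen types.
facet_sizes = [4, 10, 8, 8]

# Single-value-per-facet (Eric's model): each dimension is either
# unfiltered or set to exactly one value.
single_value_combos = prod(n + 1 for n in facet_sizes)

# Multi-value-per-facet: every subset of each facet's values.
multi_value_combos = prod(2 ** n for n in facet_sizes)

print(single_value_combos)  # 4455
print(multi_value_combos)   # 1073741824 (2^30, ~1 billion)
```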
**How we'd build the pre-cache**

We already have a pipeline that generates the pre-computed summaries; the cache is one more offline query over the same data:

```sql
-- For each combination of single-value filters, compute cross-filtered counts
-- Example: "given source=SESAR, what are the material counts?"
SELECT
  'SESAR' as filter_source,
  NULL as filter_material,
  NULL as filter_context,
  NULL as filter_specimen,
  'material' as facet_type,
  has_material_category as facet_value,
  COUNT(*) as count
FROM samples
WHERE n = 'SESAR'
  AND otype = 'MaterialSampleRecord'
  AND latitude IS NOT NULL
GROUP BY has_material_category
```

Multiply that pattern across all 4,455 combinations. DuckDB can generate the entire file in under a minute locally. The resulting file schema matches the SELECT list above: four filter columns plus facet_type, facet_value, and count. In the browser, lookup is a simple filtered read — DuckDB-WASM with HTTP range requests would resolve it in milliseconds since the file is tiny.

**Recommendation**

The hybrid approach is the clear winner:
This mirrors how Open Context does it with Django caching, just with parquet files instead of a database cache layer. The "cache warming script" is a DuckDB query that runs offline.
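The "not in cache → calculate on the fly" pattern can be sketched in a few lines. This is an illustrative in-memory model, not the Explorer code: the cache is a list of row dicts shaped like the parquet schema, `samples` is a list of record dicts, and `lookup_counts` is a hypothetical helper name:

```python
from collections import Counter

def lookup_counts(cache_rows, samples, filters, facet):
    """filters: dict facet -> list of selected values; facet: dimension to count."""
    single_valued = all(len(v) <= 1 for v in filters.values())
    if single_valued:
        # Single-value-per-facet case: try the pre-computed cache first.
        key = {f: (v[0] if v else None) for f, v in filters.items()}
        hits = [r for r in cache_rows
                if r["facet_type"] == facet
                and all(r[f"filter_{f}"] == key[f] for f in key)]
        if hits:  # cache hit
            return {r["facet_value"]: r["count"] for r in hits}
    # Cache miss (e.g. multi-value selection): aggregate on the fly.
    matching = [s for s in samples
                if all(not vals or s[f] in vals for f, vals in filters.items())]
    return dict(Counter(s[facet] for s in matching))
```

A multi-value selection like SESAR+GEOME skips the cache entirely and falls through to the on-the-fly branch, exactly as the hybrid recommendation describes.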
When any filter is active, facet counts now reflect the intersection of all OTHER active filters. For example, selecting SESAR as source updates material/context/specimen counts to show only what exists in SESAR data.

- Uses parallel GROUP BY queries via DuckDB-WASM.
- Counts update via DOM manipulation to avoid resetting checkbox selections.
- Zero-count facet values are dimmed for visual clarity.
- When no filters are active, pre-computed summaries are used (instant).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
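The "intersection of all OTHER active filters" rule means each facet's counts ignore that facet's own selections. A minimal sketch of the semantics (the real implementation runs parallel GROUP BY queries in DuckDB-WASM; this in-memory version and its names are illustrative):

```python
from collections import Counter

def cross_filtered_counts(samples, active_filters):
    """For each facet dimension, count values under every OTHER active filter."""
    results = {}
    for facet in active_filters:
        # Exclude this facet's own selections from the filter set.
        others = {f: v for f, v in active_filters.items() if f != facet and v}
        matching = [s for s in samples
                    if all(s[f] in vals for f, vals in others.items())]
        results[facet] = dict(Counter(s[facet] for s in matching))
    return results
```

Excluding the facet's own dimension is what keeps a selected checkbox's siblings visible: selecting SESAR still shows counts for GEOME in the source facet, while material/context/specimen counts narrow to SESAR data.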
**Status check + pre-computed cache**

I just generated and uploaded a pre-computed cross-filter cache to complement this PR.

File:
Schema:

This covers all single-value filter combinations (e.g., "given source=SESAR, what are the material counts?"). Browser-side lookup is instant — just a filtered read on a 6KB file.

**Bug found: column name mismatch**

The on-the-fly cross-filter queries in this PR reference columns that are integer foreign keys in arrays, not URI strings, while the facet summaries use URIs. The same mismatch exists in the base WHERE clause builder on main — material/context/specimen filtering silently fails there.

**Recommendation**
The cache file is already on R2. Next step is updating the Explorer JS to use it.
- Add 6KB pre-computed cross-filter cache for instant single-filter lookups
- Add 21MB sample_facets view with URI-string columns for on-the-fly fallback
- Fix column name mismatch: wide parquet has p__* BIGINT[] columns, but facet values are URI strings — cross-filter now queries sample_facets
- Main whereClause uses pid subquery against sample_facets for facet filters
- Source filter still queries wide parquet directly (n column is correct)

Supplementary files on data.isamples.org:

- isamples_202601_facet_cross_filter.parquet (6 KB, 526 rows)
- isamples_202601_sample_facets_v2.parquet (21 MB, 6M rows)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
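The pid-subquery pattern from the commit message can be sketched as a small clause builder. Table and column names (`sample_facets`, `pid`, `n`, `has_material_category`) come from this thread; the builder itself and its quoting helper are a simplified illustration, not the Explorer's actual code:

```python
def build_where(source=None, material=None):
    """Build a WHERE clause: the source filter hits the wide parquet's n column
    directly, while facet filters go through a pid subquery against sample_facets."""
    def q(v):
        # Naive single-quote escaping, as the quote-escaping fix added.
        return "'" + v.replace("'", "''") + "'"

    clauses = ["latitude IS NOT NULL"]
    if source:
        clauses.append(f"n = {q(source)}")  # n column is correct on the wide parquet
    if material:
        clauses.append(
            "pid IN (SELECT pid FROM sample_facets "
            f"WHERE has_material_category = {q(material)})")
    return " AND ".join(clauses)
```

The subquery indirection is what resolves the mismatch: the wide parquet's p__* columns hold integer-array foreign keys, so URI-valued facet filters must be evaluated against sample_facets and joined back by pid.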
Force-pushed from f3e3fd3 to 68ec7ee
**Fix pushed**

The column name mismatch is now fixed. Here's the architecture:

**Three tiers of facet data**

**What changed**

All supplementary files are live on
1. **Multi-value within single facet:** the fast path now requires exactly one value in the active facet, not just one active dimension. Multiple selections (e.g., SESAR+GEOME) correctly fall through to on-the-fly queries.
2. **Text search participates in cross-filtering:** buildCrossFilterWhere now includes ILIKE conditions. sample_facets_v2 regenerated with label, description, and place_name columns (63 MB on R2).
3. **Clearing filters restores baseline counts:** the update cell now resets all facet-count labels to baseline values and removes zero-count dimming when crossFilteredFacets is null.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
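The corrected fast-path condition from point 1 can be stated as a predicate. This is a sketch of the rule, with a hypothetical helper name, where filters map each facet to its list of selected values:

```python
def can_use_cache(filters):
    """Fast path only when at most one dimension is active,
    and that dimension has exactly one selected value."""
    active = {f: v for f, v in filters.items() if v}
    return len(active) <= 1 and all(len(v) == 1 for v in active.values())
```

Checking "exactly one value in the active facet" rather than "one active dimension" is the fix: a SESAR+GEOME selection is still a single active dimension, but with two values it cannot be answered from the single-value cache.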
Codex review found two bugs:

1. facet_summaries counted all 6.68M records but sample_facets only had the 5.98M with coordinates — counts jumped when toggling filters. Regenerated all three parquet files from the same base universe (lat IS NOT NULL). SESAR is now consistently 4,389,231 across all files.
2. Baseline summaries included blank-string facet values, but on-the-fly queries excluded them with != ''. Regenerated summaries now exclude blanks, matching the on-the-fly behavior.

Also: removed dead getDisplayCounts(), fixed a stale 0.3MB comment, and added missing quote escaping on the source cache lookup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
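Both bugs are "different base universe" bugs: the precomputed and on-the-fly paths must count over the same rows and exclude the same blank values. A regeneration-time sketch of the shared rule (hypothetical helper over plain row dicts, mirroring lat IS NOT NULL and != ''):

```python
from collections import Counter

def summary_counts(rows, facet):
    """Baseline counts for one facet, restricted to the shared base universe:
    coordinates present, blank facet values excluded (matches the on-the-fly != '')."""
    return dict(Counter(r[facet] for r in rows
                        if r["latitude"] is not None and r[facet] != ""))
```

Generating every file through one predicate like this is what makes a value such as the SESAR total identical across the cache, the summaries, and on-the-fly queries.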
Summary
Test plan
🤖 Generated with Claude Code