
Add cross-filtering to Explorer facet counts#94

Open
rdhyee wants to merge 4 commits into isamplesorg:main from rdhyee:feature/cross-filtering

Conversation


@rdhyee rdhyee commented Apr 9, 2026

Summary

  • When any filter is active, facet counts update to reflect the intersection of all other active filters (standard faceted search behavior)
  • Selecting SESAR as source → material/context/specimen counts show only what exists in SESAR
  • 4 parallel GROUP BY queries via DuckDB-WASM, each excluding its own dimension
  • DOM manipulation updates count labels without re-rendering checkboxes (preserves selections)
  • Zero-count facet values dimmed for visual clarity
  • When no filters are active, the pre-computed 2 KB summaries are used (instant, unchanged behavior)

Test plan

  • Load Explorer with no filters — counts should match pre-computed summaries
  • Check SESAR source → material counts should drop (no archaeology materials)
  • Check SESAR + Rock material → context/specimen counts narrow further
  • Clear all filters → counts restore to pre-computed values
  • Verify checkbox selections persist when counts update
  • Zero-count items should appear dimmed

🤖 Generated with Claude Code


rdhyee commented Apr 9, 2026

Pre-cached cross-filter strategy (from Eric Kansa via Slack)

Note: analysis below generated by Claude Code based on Eric's suggestion and the current Explorer architecture.

Eric suggested pre-caching facet counts for filtered subsets, similar to how Open Context uses Django caching with a cache-warming script. Here's the full analysis for our dataset:

Our facet dimensions

| Facet | Values | States (any + each value) |
|---|---|---|
| Source | 4 (SESAR, OpenContext, GEOME, Smithsonian) | 5 |
| Material | ~10 | 11 |
| Context (Sampled Feature) | ~8 | 9 |
| Specimen Type | ~8 | 9 |

Combinatorics

Single-value-per-facet (Eric's model): 5 × 11 × 9 × 9 = 4,455 combinations. Each combination stores counts for all ~30 facet values across all dimensions. That's ~130K rows — trivially small as a parquet file, probably under 1 MB.

Multi-value-per-facet (checkboxes allow this): each facet has 2^n subsets. That's 2^4 × 2^10 × 2^8 × 2^8 = 2^30 ≈ 1 billion combinations. Obviously not pre-cacheable.

Practical middle ground: pre-cache the single-value combinations (covers the most common interaction pattern — user clicks one checkbox at a time), and fall back to on-the-fly for multi-value selections. This is exactly Eric's "not in cache → calculate on the fly" pattern.
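The combinatorics above can be checked with a few lines of Python (the value counts come from the table; the material/context/specimen counts are approximate):

```python
from itertools import product

# Values per facet, from the table above (~10 and ~8 treated as exact here).
facets = {"source": 4, "material": 10, "context": 8, "specimen": 8}

# Single-value-per-facet model: each facet is either "any" (None) or one value.
states = [[None] + list(range(n)) for n in facets.values()]
combos = list(product(*states))
print(len(combos))        # 5 * 11 * 9 * 9 = 4455

# Each combination stores counts for all ~30 facet values:
print(len(combos) * 30)   # 133,650 cache rows

# Multi-value model: every subset of every facet's values.
print(2 ** sum(facets.values()))  # 2^30 = 1,073,741,824 -- not pre-cacheable
```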

Comparison with current PR approach

| Approach | Latency | Extra download | Maintenance | Multi-value support |
|---|---|---|---|---|
| Pre-cached file (Eric's pattern) | Instant (~0ms lookup) | ~1 MB parquet | Rebuild when data changes | Falls back to on-the-fly |
| On-the-fly GROUP BY (this PR) | 1-3s per change | None | Zero | Works for any combination |
| Hybrid (pre-cache + on-the-fly fallback) | Instant for common, 1-3s for complex | ~1 MB parquet | Rebuild when data changes | Full coverage |

How we'd build the pre-cache

We already have a pipeline that generates isamples_202601_facet_summaries.parquet (2KB, the unfiltered counts). The pre-cache would be a natural extension:

```sql
-- For each combination of single-value filters, compute cross-filtered counts
-- Example: "given source=SESAR, what are the material counts?"
SELECT
  'SESAR' as filter_source,
  NULL as filter_material,
  NULL as filter_context,
  NULL as filter_specimen,
  'material' as facet_type,
  has_material_category as facet_value,
  COUNT(*) as count
FROM samples
WHERE n = 'SESAR'
  AND otype = 'MaterialSampleRecord'
  AND latitude IS NOT NULL
GROUP BY has_material_category
```

Multiply that pattern across all 4,455 combinations. DuckDB can generate the entire file in under a minute locally.
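A sketch of that generation loop, assuming the column names from the example query and a few made-up facet values (the real pipeline would read the value lists from the data and escape quotes):

```python
from itertools import product

# Hypothetical facet values for illustration; real values come from the data.
FACETS = {
    "source": ("n", ["SESAR", "OpenContext", "GEOME", "Smithsonian"]),
    "material": ("has_material_category", ["mat:rock", "mat:soil"]),
    "context": ("has_context_category", ["ctx:marine"]),
    "specimen": ("has_specimen_category", ["spec:core"]),
}
BASE = "otype = 'MaterialSampleRecord' AND latitude IS NOT NULL"

def build_queries():
    """Yield one GROUP BY query per (filter combination, counted facet)."""
    # Each facet is either unfiltered (None) or pinned to a single value.
    states = [[None] + vals for _, vals in FACETS.values()]
    for combo in product(*states):
        active = dict(zip(FACETS, combo))
        for facet, (col, _) in FACETS.items():
            # Exclude the counted dimension from its own filter.
            preds = [BASE] + [
                f"{FACETS[f][0]} = '{v}'"   # no quote escaping; sketch only
                for f, v in active.items()
                if v is not None and f != facet
            ]
            yield (
                f"SELECT {col} AS facet_value, COUNT(*) AS count "
                f"FROM samples WHERE {' AND '.join(preds)} GROUP BY {col}"
            )

queries = list(build_queries())
print(len(queries))  # 5 * 3 * 2 * 2 combinations x 4 facets = 240
```

With the real value counts (5 × 11 × 9 × 9 states), the same loop yields the 4,455 combinations; each emitted query is one row-group of the cache file.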

The resulting file schema:

```
filter_source       | VARCHAR (nullable — NULL means "any")
filter_material     | VARCHAR (nullable)
filter_context      | VARCHAR (nullable)
filter_specimen     | VARCHAR (nullable)
facet_type          | VARCHAR (source/material/context/object_type)
facet_value         | VARCHAR
count               | BIGINT
```

In the browser, lookup is a simple filtered read — DuckDB-WASM with HTTP range requests would resolve it in milliseconds since the file is tiny.
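That lookup reduces to an exact match on the four filter columns, with NULL meaning "any". A pure-Python sketch over a few fake cache rows (the real lookup would be a DuckDB-WASM filtered read on the parquet file; row values here are invented):

```python
# Each row mirrors the cache schema above:
# (filter_source, filter_material, filter_context, filter_specimen,
#  facet_type, facet_value, count) -- None stands in for NULL.
CACHE = [
    ("SESAR", None, None, None, "material", "mat:rock", 4000000),
    ("SESAR", None, None, None, "material", "mat:soil", 389231),
    (None, None, None, None, "material", "mat:rock", 4500000),
]

def lookup(cache, source=None, material=None, context=None, specimen=None):
    """Return {(facet_type, facet_value): count} for one filter combination."""
    key = (source, material, context, specimen)
    return {
        (ftype, fval): n
        for *filters, ftype, fval, n in cache
        if tuple(filters) == key
    }

print(lookup(CACHE, source="SESAR"))
# {('material', 'mat:rock'): 4000000, ('material', 'mat:soil'): 389231}
```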

Recommendation

The hybrid approach is the clear winner:

  1. Ship the pre-cache file alongside the existing facet summaries on data.isamples.org
  2. Use it for instant single-value lookups (covers 90%+ of user interactions)
  3. Fall back to on-the-fly GROUP BY for multi-value or text search combinations
  4. Regenerate the pre-cache whenever we update the main parquet files

This mirrors how Open Context does it with Django caching, just with parquet files instead of a database cache layer. The "cache warming script" is a DuckDB query that runs offline.

When any filter is active, facet counts now reflect the intersection
of all OTHER active filters. For example, selecting SESAR as source
updates material/context/specimen counts to show only what exists
in SESAR data. Uses parallel GROUP BY queries via DuckDB-WASM.

Counts update via DOM manipulation to avoid resetting checkbox
selections. Zero-count facet values are dimmed for visual clarity.
When no filters are active, pre-computed summaries are used (instant).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rdhyee
Contributor Author

rdhyee commented Apr 10, 2026

Status check + pre-computed cache

I just generated and uploaded a pre-computed cross-filter cache to complement this PR:

File: https://data.isamples.org/isamples_202601_facet_cross_filter.parquet (6 KB, 526 rows)

Schema:

```
filter_source       VARCHAR (NULL = any)
filter_material     VARCHAR (NULL = any)
filter_context      VARCHAR (NULL = any)
filter_object_type  VARCHAR (NULL = any)
facet_type          VARCHAR (source/material/context/object_type)
facet_value         VARCHAR (URI)
count               BIGINT
```

This covers all single-value filter combinations (e.g., "given source=SESAR, what are the material counts?"). Browser-side lookup is instant — just a filtered read on a 6KB file.

Bug found: column name mismatch

The on-the-fly cross-filter queries in this PR reference has_material_category, has_context_category, and has_specimen_category, but the wide parquet has:

  • p__has_material_category (BIGINT[], not VARCHAR)
  • p__has_context_category (BIGINT[], not VARCHAR)
  • p__has_sample_object_type (BIGINT[], not VARCHAR)

These are integer foreign keys in arrays, not URI strings. The facet summaries use URIs. The same mismatch exists in the base WHERE clause builder on main — material/context/specimen filtering silently fails.
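The mapping-table fix could look like the following sketch; the IDs and URIs are placeholders, not real vocabulary entries:

```python
# Hypothetical mapping table resolving BIGINT foreign keys to facet URIs
# (the real IDs and URIs would come from the vocabulary data).
ID_TO_URI = {
    1: "vocab:material/rock",
    2: "vocab:material/soil",
}

def resolve(ids):
    """Flatten a BIGINT[] cell such as p__has_material_category into the
    URI strings that the facet summaries use; unknown IDs are dropped."""
    return [ID_TO_URI[i] for i in ids if i in ID_TO_URI]

print(resolve([1, 2]))  # ['vocab:material/rock', 'vocab:material/soil']
```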

Recommendation

  1. Ship the pre-computed cache for instant single-filter lookups (covers 90%+ of interactions, per Eric K's suggestion)
  2. Fix the column mapping as a follow-up — either alias columns in the CREATE VIEW, or use a mapping table to resolve BIGINT IDs → URIs
  3. For multi-filter combinations not in the cache, fall back to on-the-fly queries (once column mapping is fixed)

The cache file is already on R2. Next step is updating the Explorer JS to use it.

- Add 6KB pre-computed cross-filter cache for instant single-filter lookups
- Add 21MB sample_facets view with URI-string columns for on-the-fly fallback
- Fix column name mismatch: wide parquet has p__* BIGINT[] columns, but
  facet values are URI strings — cross-filter now queries sample_facets
- Main whereClause uses pid subquery against sample_facets for facet filters
- Source filter still queries wide parquet directly (n column is correct)

Supplementary files on data.isamples.org:
- isamples_202601_facet_cross_filter.parquet (6 KB, 526 rows)
- isamples_202601_sample_facets_v2.parquet (21 MB, 6M rows)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rdhyee rdhyee force-pushed the feature/cross-filtering branch from f3e3fd3 to 68ec7ee Compare April 10, 2026 15:23
@rdhyee
Contributor Author

rdhyee commented Apr 10, 2026

Fix pushed

The column name mismatch is now fixed. Here's the architecture:

Three tiers of facet data

| Tier | File | Size | When used |
|---|---|---|---|
| 1. Baseline | facet_summaries.parquet | 2 KB | Page load — unfiltered counts |
| 2. Pre-computed cache | facet_cross_filter.parquet | 6 KB | Single-filter lookups — instant |
| 3. On-the-fly fallback | sample_facets_v2.parquet | 21 MB | Multi-filter or text search combos |
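The tier choice can be sketched as a small decision function (names and signature are illustrative, not the actual Explorer JS):

```python
def pick_tier(filters, text_search):
    """Choose which facet-count source to query, per the three tiers above.

    `filters` maps facet name -> set of selected values; `text_search` is the
    free-text query string (empty when unused).
    """
    active = {f: vals for f, vals in filters.items() if vals}
    if not active and not text_search:
        return "baseline"       # 2 KB facet_summaries
    if not text_search and all(len(v) == 1 for v in active.values()):
        return "precomputed"    # 6 KB cross-filter cache
    return "on_the_fly"         # GROUP BY against 21 MB sample_facets

print(pick_tier({}, ""))                              # baseline
print(pick_tier({"source": {"SESAR"}}, ""))           # precomputed
print(pick_tier({"source": {"SESAR", "GEOME"}}, ""))  # on_the_fly
```

Note the single-value check is per facet, so a multi-value selection within one facet (e.g. SESAR+GEOME) falls through to the on-the-fly tier even though only one dimension is active.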

What changed

  • Cross-filter queries now use sample_facets view (URI strings in source, material, context, object_type columns) instead of the wide parquet (which has p__has_material_category as BIGINT arrays)
  • Single-filter interactions (90%+ of use) hit the 6 KB pre-computed cache — instant response, no scanning
  • Multi-filter or text search falls back to GROUP BY queries against the 21 MB sample_facets — much faster than scanning the 280 MB wide parquet
  • Record retrieval (whereClause) uses a pid subquery against sample_facets for material/context/object_type filters; the source filter still hits the wide parquet's n column directly
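A sketch of that whereClause construction (table and column names are from this PR; the helper name and the lack of quote escaping are illustrative only):

```python
def build_where(source=None, material=None):
    """Build the record-retrieval WHERE clause: the source filter hits the
    wide parquet's `n` column directly, while facet filters go through a
    pid subquery against the sample_facets view."""
    preds = ["otype = 'MaterialSampleRecord'", "latitude IS NOT NULL"]
    if source:
        preds.append(f"n = '{source}'")
    if material:
        preds.append(
            "pid IN (SELECT pid FROM sample_facets "
            f"WHERE material = '{material}')"
        )
    return " AND ".join(preds)

print(build_where(source="SESAR", material="mat:rock"))
```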

All supplementary files are live on data.isamples.org.

rdhyee and others added 2 commits April 10, 2026 08:31
1. Multi-value within single facet: fast path now requires exactly
   one value in the active facet, not just one active dimension.
   Multiple selections (e.g., SESAR+GEOME) correctly fall through
   to on-the-fly queries.

2. Text search participates in cross-filtering: buildCrossFilterWhere
   now includes ILIKE conditions. sample_facets_v2 regenerated with
   label, description, place_name columns (63 MB on R2).

3. Clearing filters restores baseline counts: the update cell now
   resets all facet-count labels to baseline values and removes
   zero-count dimming when crossFilteredFacets is null.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Codex review found two bugs:

1. facet_summaries counted all 6.68M records but sample_facets only
   had the 5.98M with coordinates — counts jumped when toggling filters.
   Regenerated all three parquet files from the same base universe
   (lat IS NOT NULL). SESAR now consistently 4,389,231 across all files.

2. Baseline summaries included blank-string facet values, but on-the-fly
   queries excluded them with != ''. Regenerated summaries now exclude
   blanks, matching the on-the-fly behavior.

Also: removed dead getDisplayCounts(), fixed stale 0.3MB comment,
added missing quote escaping on source cache lookup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>