
Improve CostHashAgg with single-column NDV and spill-aware cost model#1672

Draft
yjhjstz wants to merge 1 commit into apache:main from yjhjstz:fix_orca_agg_cost

@yjhjstz yjhjstz commented Apr 9, 2026

Changes

Two enhancements to CostHashAgg in GPORCA's CCostModelGPDB:

  1. Single-column NDV optimization: When GROUP BY has exactly one column, use GetNDVs() (global NDV from column statistics) instead of pci->Rows() to estimate the local partial aggregation output. This allows the optimizer to distinguish high-NDV cases (partial agg streams nearly as many rows as input → 1-phase preferred) from low-NDV cases (partial agg significantly reduces data → 2-phase preferred). Multi-column GROUP BY falls back to the original pci->Rows() * UlHosts() logic.

  2. Spill-aware cost model: When num_output_rows × width exceeds the 50 MB spilling threshold, apply higher cost unit values to reflect disk I/O overhead.
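The two rules above can be sketched in a few lines of standalone C++. This is an illustration only, not the code in the patch: `AggInput`, `EstimateLocalAggOutputRows`, `WouldSpill`, `CostUnits`, and `PickCostUnits` are hypothetical names, the cost unit numbers are placeholders, and treating 50 MB as 50 × 2^20 bytes is an assumption.

```cpp
#include <cstddef>

// Hypothetical stand-in for the inputs CostHashAgg sees; not a GPORCA type.
struct AggInput {
    double rows_per_segment;     // plays the role of pci->Rows()
    double width_bytes;          // output row width in bytes
    std::size_t num_group_cols;  // number of GROUP BY columns
    double global_ndv;           // plays the role of GetNDVs(); used only for 1 column
    unsigned num_hosts;          // plays the role of UlHosts()
};

// Rule 1: with exactly one grouping column, use the global NDV directly
// (no scaling by the host count); otherwise fall back to rows * hosts.
double EstimateLocalAggOutputRows(const AggInput& in) {
    if (in.num_group_cols == 1) {
        return in.global_ndv;
    }
    return in.rows_per_segment * in.num_hosts;
}

// Assumed interpretation of the 50 MB spilling threshold.
constexpr double kSpillThresholdBytes = 50.0 * 1024.0 * 1024.0;

// Rule 2: the aggregate is treated as spilling when rows * width
// exceeds the threshold.
bool WouldSpill(double num_output_rows, double width_bytes) {
    return num_output_rows * width_bytes > kSpillThresholdBytes;
}

// Placeholder cost units; the real values come from the cost model parameters.
struct CostUnits {
    double per_tuple_column;
    double per_tuple_width;
};

CostUnits PickCostUnits(bool spills) {
    const CostUnits in_memory{1.0, 1.0};  // placeholder numbers
    const CostUnits spilling{2.0, 2.0};   // placeholder numbers, higher on purpose
    return spills ? spilling : in_memory;
}
```

For example, a single-column key with about 6.6M distinct values and an assumed 16-byte row gives roughly 106 MB, which is above the threshold, so the sketch selects the spilling units.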


TPC-H: -13.0% overall (274,201ms → 238,438ms)

Only 2 queries changed plans:

| Query | Baseline | Optimized | Change | Cause |
|-------|----------|-----------|--------|-------|
| Q17 | 60,282ms | 24,005ms | -60.2% | 2-phase → 1-phase; eliminated 1.3 GB disk spill |
| Q03 | 19,977ms | 19,513ms | -2.3% | HashAgg → GroupAgg; avoided hash spill |

Q17 is the dominant win. l_partkey is a high-NDV single-column group key (~6.6M distinct values); partial aggregation produced 19.6M output rows (nearly identical to input), causing 1.3 GB of disk spill in the Finalize stage. The optimizer now correctly chooses 1-phase, reducing spill to 490 MB and cutting latency by 36 seconds. No plan change caused a regression.


TPC-DS: +0.1% overall (605,646ms → 606,421ms)

Only 2 queries changed plans:

| Query | Baseline | Optimized | Change | Cause |
|-------|----------|-----------|--------|-------|
| Q59 | 11,141ms | 7,686ms | -31.0% | 2-phase → 1-phase; eliminated `Planned Partitions: 4` spill |
| Q11 | 9,830ms | 11,327ms | +15.2% | 2-phase → 1-phase |

Combined Assessment

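| Benchmark | Overall | Plan-change win | Plan-change regression |
|-----------|---------|-----------------|------------------------|
| TPC-H | -13.0% | Q17 -60% | None |
| TPC-DS | +0.1% | Q59 -31% | Q11 +15% |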


Two enhancements to CostHashAgg in CCostModelGPDB:

1. Single-column NDV optimization for local partial HashAgg:
   When GROUP BY has exactly 1 column, use GetNDVs() (global NDV from
   column statistics) instead of pci->Rows() to estimate the output
   row count of the local partial aggregation stage. GetNDVs() returns
   the global NDV directly, so no * UlHosts() scaling is needed.

   This lets the optimizer distinguish high-NDV cases (partial agg
   streams nearly as many rows as input, 2-phase has little benefit)
   from low-NDV cases (partial agg significantly reduces data before
   redistribution, 2-phase is preferred).

   Multi-column GROUP BY falls back to the original behavior:
   num_output_rows = pci->Rows() * UlHosts().

2. Spill-aware cost model:
   When num_output_rows * width exceeds the spilling memory threshold
   (EcpHJSpillingMemThreshold, 50 MB), apply higher cost unit values
   to reflect disk I/O overhead. Uses the existing HJ spilling cost
   parameters (EcpHJFeedingTupColumnSpillingCostUnit etc.) which are
   already tuned for spilling scenarios.

TPC-H benchmark: -14.3% overall (Q17 -60%, Q03 -6%).
TPC-DS benchmark: -0.4% overall (Q59 -29%).
@yjhjstz yjhjstz marked this pull request as draft April 9, 2026 22:32