Skip to content

Add histogram bucket aggregation support#5

Open
varun-st wants to merge 1 commit into
datafusion-dsl-metric-aggfrom
pr/histogram-bucket
Open

Add histogram bucket aggregation support#5
varun-st wants to merge 1 commit into
datafusion-dsl-metric-aggfrom
pr/histogram-bucket

Conversation

@varun-st

Copy link
Copy Markdown
Owner

Description

Adds support for histogram bucket aggregations in the DSL Calcite plugin.

Changes

Architecture:
• Refactors GroupingInfo into two interfaces:
• FieldGrouping: For field-based aggregations (terms, multi_terms)
• ExpressionGrouping: For computed aggregations (histogram)
• Histogram expressions are computed via LogicalProject before LogicalAggregate

Implementation:
• HistogramGrouping: Builds expression FLOOR((field - offset) / interval) * interval + offset
• HistogramBucketShape: Converts to InternalHistogram results
• SimpleFieldGrouping: Replaces direct GroupingInfo usage for field-based aggregations
• CollationResolver: Maps _key to projected column names for sorting

Testing:
• 4 unit tests for HistogramGrouping
• 13 unit tests for HistogramBucketShape
• 3 integration tests in DslLogicalPlanIntegrationIT

Implements histogram aggregation translation using expression-based grouping.
Splits GroupingInfo into FieldGrouping and ExpressionGrouping interfaces.
Histogram expressions are computed via LogicalProject before aggregation.
Comment on lines +62 to +69
RexNode fieldRef = builder.makeInputRef(field.getType(), field.getIndex());
RexNode intervalLiteral = builder.makeExactLiteral(BigDecimal.valueOf(interval));
RexNode offsetLiteral = builder.makeExactLiteral(BigDecimal.valueOf(offset));

RexNode adjusted = builder.makeCall(SqlStdOperatorTable.MINUS, fieldRef, offsetLiteral);
RexNode divided = builder.makeCall(SqlStdOperatorTable.DIVIDE, adjusted, intervalLiteral);
RexNode floored = builder.makeCall(SqlStdOperatorTable.FLOOR, divided);
RexNode multiplied = builder.makeCall(SqlStdOperatorTable.MULTIPLY, floored, intervalLiteral);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please add a comment on what this formula is for easy understanding

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to update AggregationTreeWalker as well where it uses getFieldNames to generate granularity key [Link].

When we have a query where same field is used for histogram but different interval, then this clash could happen right?

{
  "aggs": {
    "coarse": { "histogram": { "field": "price", "interval": 1000 } },
    "fine":   { "histogram": { "field": "price", "interval": 100 } }
  }
}


@Override
public String getProjectedColumnName() {
return fieldName + PROJECTED_COLUMN_SUFFIX;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same clash issue can happen here since the projected column suffix remains same if same field name is used multiple times

Comment on lines +58 to +60
if (field == null) {
throw ConversionException.invalidField(fieldName);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we handle null as well, since it is expected that some fields might be null? Not sure if throwing Exception here would be correct behavior

Comment on lines +69 to +72
// InternalHistogram requires EmptyBucketInfo when minDocCount is 0, but DSL doesn't support
// extended bounds. Use minDocCount of 1 to avoid this requirement. Actual filtering is done
// in AggregationResponseBuilder before calling this method.
long minDocCount = agg.minDocCount() == 0 ? 1 : agg.minDocCount();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this approach, can't we identify the min and max buckets from the response and fill in the gaps with empty buckets before generating the response?

// - Response contains InternalValueCount with correct value (count=3)
}

public void testHistogramAggregation() throws Exception {

@nssuresh2007 nssuresh2007 Mar 29, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add the following nested aggregation scenarios as well:

histogram -> metric (histogram with sub-agg)
terms -> histogram (histogram nested under terms)
histogram -> terms (terms nested under histogram)
histogram -> histogram (nested histograms on different fields)
sibling histograms on the same field with different intervals

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants