Add histogram bucket aggregation support#5
Conversation
Implements histogram aggregation translation using expression-based grouping. Splits GroupingInfo into FieldGrouping and ExpressionGrouping interfaces. Histogram expressions are computed via LogicalProject before aggregation.
| RexNode fieldRef = builder.makeInputRef(field.getType(), field.getIndex()); | ||
| RexNode intervalLiteral = builder.makeExactLiteral(BigDecimal.valueOf(interval)); | ||
| RexNode offsetLiteral = builder.makeExactLiteral(BigDecimal.valueOf(offset)); | ||
|
|
||
| RexNode adjusted = builder.makeCall(SqlStdOperatorTable.MINUS, fieldRef, offsetLiteral); | ||
| RexNode divided = builder.makeCall(SqlStdOperatorTable.DIVIDE, adjusted, intervalLiteral); | ||
| RexNode floored = builder.makeCall(SqlStdOperatorTable.FLOOR, divided); | ||
| RexNode multiplied = builder.makeCall(SqlStdOperatorTable.MULTIPLY, floored, intervalLiteral); |
There was a problem hiding this comment.
nit: Please add a comment on what this formula is for easy understanding
There was a problem hiding this comment.
We may need to update AggregationTreeWalker as well where it uses getFieldNames to generate granularity key [Link].
When we have a query where same field is used for histogram but different interval, then this clash could happen right?
{
"aggs": {
"coarse": { "histogram": { "field": "price", "interval": 1000 } },
"fine": { "histogram": { "field": "price", "interval": 100 } }
}
}
|
|
||
| @Override | ||
| public String getProjectedColumnName() { | ||
| return fieldName + PROJECTED_COLUMN_SUFFIX; |
There was a problem hiding this comment.
Same clash issue can happen here since the projected column suffix remains same if same field name is used multiple times
| if (field == null) { | ||
| throw ConversionException.invalidField(fieldName); | ||
| } |
There was a problem hiding this comment.
Shouldn't we handle null as well, since it is expected that some fields might be null? Not sure if throwing Exception here would be correct behavior
| // InternalHistogram requires EmptyBucketInfo when minDocCount is 0, but DSL doesn't support | ||
| // extended bounds. Use minDocCount of 1 to avoid this requirement. Actual filtering is done | ||
| // in AggregationResponseBuilder before calling this method. | ||
| long minDocCount = agg.minDocCount() == 0 ? 1 : agg.minDocCount(); |
There was a problem hiding this comment.
Instead of this approach, can't we identify the min and max buckets from the response and fill in the gaps with empty buckets before generating the response?
| // - Response contains InternalValueCount with correct value (count=3) | ||
| } | ||
|
|
||
| public void testHistogramAggregation() throws Exception { |
There was a problem hiding this comment.
Lets add the following nested aggregation scenarios as well:
histogram -> metric (histogram with sub-agg)
terms -> histogram (histogram nested under terms)
histogram -> terms (terms nested under histogram)
histogram -> histogram (nested histograms on different fields)
sibling histograms on the same field with different intervals
Description
Adds support for histogram bucket aggregations in the DSL Calcite plugin.
Changes
Architecture:
• Refactors GroupingInfo into two interfaces:
• FieldGrouping: For field-based aggregations (terms, multi_terms)
• ExpressionGrouping: For computed aggregations (histogram)
• Histogram expressions are computed via LogicalProject before LogicalAggregate
Implementation:
• HistogramGrouping: Builds expression FLOOR((field - offset) / interval) * interval + offset
• HistogramBucketShape: Converts to InternalHistogram results
• SimpleFieldGrouping: Replaces direct GroupingInfo usage for field-based aggregations
• CollationResolver: Maps _key to projected column names for sorting
Testing:
• 4 unit tests for HistogramGrouping
• 13 unit tests for HistogramBucketShape
• 3 integration tests in DslLogicalPlanIntegrationIT