Commit ab39dc1

Update Doris documentation and engine adapter
1 parent 070f0fc commit ab39dc1

2 files changed

Lines changed: 395 additions & 11 deletions

docs/integrations/engines/doris.md

Lines changed: 385 additions & 0 deletions
@@ -0,0 +1,385 @@
# Apache Doris

## Overview

[Apache Doris](https://doris.apache.org/) is a real-time analytical database built on an MPP (massively parallel processing) architecture. It supports both high-concurrency point queries and high-throughput complex analysis.

SQLMesh connects to Doris through its MySQL-compatible protocol and adds Doris-specific handling for table models, indexing, partitioning, and other features. The adapter is designed for analytical workloads, with sensible defaults and support for advanced configuration.

## Connection Configuration

```yaml
doris:
  connection:
    type: doris
    host: fe.doris.cluster # Frontend (FE) node address
    port: 9030 # Query port (default: 9030)
    user: doris_user
    password: your_password
    database: your_database
    # Optional MySQL-compatible settings
    charset: utf8mb4
    connect_timeout: 60
  state_connection:
    # Use postgres as state connection
    type: postgres
    host: 127.0.0.1
    port: 5432
    user: your_user
    password: your_password
    database: your_database
```
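
Before pointing SQLMesh at the cluster, it can help to confirm that the FE node accepts MySQL-protocol connections with the same settings. The following is a minimal sketch, assuming `pymysql` is installed; the helper names `doris_connect_kwargs` and `check_connection` are hypothetical and are not part of SQLMesh.

```python
def doris_connect_kwargs(host, user, password, database, port=9030,
                         charset="utf8mb4", connect_timeout=60):
    """Collect the connection settings from the YAML above into a dict
    suitable for pymysql.connect(). Hypothetical helper for illustration."""
    return {
        "host": host,
        "port": port,
        "user": user,
        "password": password,
        "database": database,
        "charset": charset,
        "connect_timeout": connect_timeout,
    }


def check_connection(**kwargs):
    """Open a connection to the Doris FE and run a trivial query.

    Returns True if `SELECT 1` comes back as 1. The driver is imported
    lazily so that building the kwargs does not require pymysql.
    """
    import pymysql

    conn = pymysql.connect(**kwargs)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            return cur.fetchone()[0] == 1
    finally:
        conn.close()
```

Calling `check_connection(**doris_connect_kwargs("fe.doris.cluster", "doris_user", "your_password", "your_database"))` against a reachable cluster should return `True`; a wrong port or unreachable FE node surfaces as a `pymysql` error instead.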

## Table Models

Doris supports three table models: DUPLICATE, UNIQUE, and AGGREGATE. SQLMesh supports the **DUPLICATE** and **UNIQUE** models through the `physical_properties` configuration.

### DUPLICATE Model (Default)

The DUPLICATE model keeps every row as written, including duplicates, and is optimized for high-throughput scenarios such as log data and streaming ingestion.

**Features:**
- **High Write Performance**: Optimized for append-only workloads
- **No Deduplication**: Allows duplicate records
- **Streaming Friendly**: Ideal for real-time data ingestion

**Example Configuration:**
```sql
MODEL (
    name user_events,
    kind FULL,
    physical_properties (
        duplicate_key = ('user_id', 'event_time'),
        distributed_by = (
            kind = 'HASH',
            expressions = 'user_id',
            buckets = 10
        )
    )
);
```

### UNIQUE Model

The UNIQUE model is suited to dimension tables and other scenarios that require data updates. It enforces key uniqueness and supports efficient UPSERT operations.

**Features:**
- **Primary Key Updates**: New data overwrites existing records with matching keys
- **Merge-on-Write**: Can be enabled for better query performance
- **Automatic Deduplication**: Ensures data uniqueness based on specified key columns

**Example Configuration:**
```sql
MODEL (
    name dim_users,
    kind FULL,
    physical_properties (
        unique_key = 'user_id',
        distributed_by = (
            kind = 'HASH',
            expressions = 'user_id',
            buckets = 16
        ),
        enable_unique_key_merge_on_write = 'true'
    )
);
```
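
The difference between the two models can be illustrated without a cluster. The sketch below is a plain-Python emulation of the semantics described above, not Doris or adapter code; `apply_duplicate` and `apply_unique` are hypothetical names.

```python
def apply_duplicate(rows):
    """DUPLICATE model: every written row is kept, including repeats."""
    return list(rows)


def apply_unique(rows, key):
    """UNIQUE model: a later row replaces an earlier row with the same key."""
    latest = {}
    for row in rows:
        # The key columns identify the record; the newest write wins.
        latest[tuple(row[col] for col in key)] = row
    return list(latest.values())


rows = [
    {"user_id": 1, "city": "Oslo"},
    {"user_id": 2, "city": "Lima"},
    {"user_id": 1, "city": "Rome"},  # same key as the first row
]

print(len(apply_duplicate(rows)))       # 3 rows survive
print(apply_unique(rows, ["user_id"]))  # user 1 now has city "Rome"
```

With `user_id` as the key, the third write replaces the first, which is the behavior to expect from a `unique_key = 'user_id'` table; the same rows written to a DUPLICATE table would all be retained.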

## Table Properties

The Doris adapter supports a set of table properties that can be configured in the `physical_properties` section of your model.

### Core Table Properties

| Property         | Type                  | Description                                 | Example                                                    |
| ---------------- | --------------------- | ------------------------------------------- | ---------------------------------------------------------- |
| `unique_key`     | `Tuple[str]` or `str` | Defines unique key columns for UNIQUE model | `('user_id')` or `'user_id'`                               |
| `duplicate_key`  | `Tuple[str]` or `str` | Defines key columns for DUPLICATE model     | `('user_id', 'event_time')`                                |
| `distributed_by` | `Dict`                | Distribution configuration                  | See Distribution Configuration                             |
| `partitions`     | `Tuple[str]` or `str` | Custom partition expression                 | `'FROM ("2000-11-14") TO ("2099-11-14") INTERVAL 1 MONTH'` |

### Distribution Configuration

The `distributed_by` property supports multiple formats:

**Dictionary Format:**
```sql
MODEL (
    name my_table,
    kind FULL,
    physical_properties (
        distributed_by = (
            kind = 'HASH',
            expressions = 'user_id',
            buckets = 10
        )
    )
);
```

```sql
MODEL (
    name my_table,
    kind FULL,
    physical_properties (
        distributed_by = (
            kind = 'RANDOM'
        )
    )
);
```

**Supported Distribution Types:**
- `HASH`: Hash-based distribution (most common)
- `RANDOM`: Random distribution

**Bucket Configuration:**
- Integer value: Fixed number of buckets (e.g., `10`)
- `'AUTO'`: Automatic bucket calculation
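
To make the mapping from these settings to Doris DDL concrete, the sketch below renders a `DISTRIBUTED BY` clause from the three keys shown above. It is an illustration under the assumption that the keys map directly onto the Doris clause; `render_distributed_by` is a hypothetical helper, not the adapter's implementation.

```python
def render_distributed_by(kind, expressions=None, buckets=None):
    """Render a Doris DISTRIBUTED BY clause from dictionary-style settings."""
    kind = kind.upper()
    if kind == "HASH":
        if not expressions:
            raise ValueError("HASH distribution needs at least one column")
        # Accept a single column name or a sequence of names.
        cols = [expressions] if isinstance(expressions, str) else list(expressions)
        quoted = ", ".join(f"`{col}`" for col in cols)
        clause = f"DISTRIBUTED BY HASH({quoted})"
    elif kind == "RANDOM":
        clause = "DISTRIBUTED BY RANDOM"
    else:
        raise ValueError(f"unsupported distribution kind: {kind}")
    if buckets is not None:
        # An integer gives a fixed bucket count; 'AUTO' lets Doris choose.
        clause += f" BUCKETS {str(buckets).upper()}"
    return clause


print(render_distributed_by("HASH", "user_id", 10))
print(render_distributed_by("RANDOM", buckets="auto"))
```

The first call prints ``DISTRIBUTED BY HASH(`user_id`) BUCKETS 10`` and the second prints `DISTRIBUTED BY RANDOM BUCKETS AUTO`, which are the clause shapes Doris accepts in `CREATE TABLE`.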

### Partitioning

Doris supports range partitioning and list partitioning to improve query performance.

**Custom Partition Expression:**
```sql
MODEL (
    name my_partitioned_model,
    kind INCREMENTAL_BY_TIME_RANGE(time_column (event_date, '%Y-%m-%d')),
    partitioned_by RANGE(event_date),
    physical_properties (
        partitions = 'FROM ("2000-11-14") TO ("2099-11-14") INTERVAL 2 YEAR'
    )
);
```

```sql
MODEL (
    name my_custom_partitioned_model,
    kind FULL,
    partitioned_by RANGE(event_date),
    physical_properties (
        partitioned_by_expr = (
            'PARTITION `p2023` VALUES [("2023-01-01"), ("2024-01-01"))',
            'PARTITION `p2024` VALUES [("2024-01-01"), ("2025-01-01"))',
            'PARTITION `p2025` VALUES [("2025-01-01"), ("2026-01-01"))',
            'PARTITION `other` VALUES LESS THAN MAXVALUE'
        )
    )
);
```

### Generic Properties

Any additional properties in `physical_properties` are passed through as Doris table properties:

```sql
MODEL (
    name advanced_table,
    kind FULL,
    physical_properties (
        unique_key = 'id',
        distributed_by = (
            kind = 'HASH',
            expressions = 'id',
            buckets = 8
        ),
        replication_allocation = 'tag.location.default: 3',
        in_memory = 'false',
        storage_format = 'V2',
        disable_auto_compaction = 'false'
    )
);
```

## Materialized Views

SQLMesh supports creating materialized views in Doris, including build, refresh, key, and distribution settings.

### Basic Materialized View

```sql
MODEL (
    name user_summary_mv,
    kind VIEW (
        materialized true
    )
);

SELECT
    user_id,
    COUNT(*) as event_count,
    MAX(event_time) as last_event
FROM user_events
GROUP BY user_id;
```

### Advanced Materialized View Configuration

```sql
MODEL (
    name sqlmesh_test.view_materialized1,
    kind VIEW (
        materialized true
    ),
    partitioned_by ds,
    physical_properties (
        build = 'IMMEDIATE',
        refresh = 'AUTO',
        refresh_trigger = 'ON SCHEDULE EVERY 12 hour',
        unique_key = id,
        distributed_by = (kind='HASH', expressions=id, buckets=10),
        replication_allocation = 'tag.location.default: 3',
        in_memory = 'false',
        storage_format = 'V2',
        disable_auto_compaction = 'false'
    ),
    description "customer zip",
    columns (
        id int,
        ds datetime,
        zip int
    ),
    column_descriptions (
        id = "order id",
        zip = "zip code"
    )
);
```

### Materialized View Properties

| Property            | Description                                   | Values                                                  |
| ------------------- | --------------------------------------------- | ------------------------------------------------------- |
| `build`             | Build strategy                                | `'IMMEDIATE'`, `'DEFERRED'`                             |
| `refresh`           | Refresh strategy                              | `'COMPLETE'`, `'AUTO'`                                  |
| `refresh_trigger`   | When a refresh is triggered                   | `'MANUAL'`, `'ON SCHEDULE EVERY 1 HOUR'`, `'ON COMMIT'` |
| `unique_key`        | Unique key columns                            | `'user_id'` or `['user_id', 'date']`                    |
| `duplicate_key`     | Duplicate key columns                         | `'user_id'` or `['user_id', 'date']`                    |
| `materialized_type` | Materialized type                             | `SYNC`, `ASYNC`                                         |
| `source_table`      | Source table of synchronous materialized view | `schema_name.table_name`                                |
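
The `build`, `refresh`, and `refresh_trigger` properties correspond to the `BUILD ... REFRESH ... ON ...` portion of a Doris asynchronous materialized view definition. The sketch below shows one way that portion could be assembled. It is an assumption about the mapping rather than the adapter's code: `render_mv_options` is a hypothetical helper, its default values are placeholders, and the `ON` prefix handling for bare values such as `'MANUAL'` is a guess.

```python
def render_mv_options(build="IMMEDIATE", refresh="AUTO", refresh_trigger="MANUAL"):
    """Assemble the BUILD/REFRESH portion of CREATE MATERIALIZED VIEW."""
    trigger = refresh_trigger.strip()
    # Assumption: a bare value like 'MANUAL' needs the ON keyword added,
    # while values that already start with ON are used verbatim.
    if not trigger.upper().startswith("ON "):
        trigger = f"ON {trigger}"
    return f"BUILD {build.upper()} REFRESH {refresh.upper()} {trigger}"


print(render_mv_options("IMMEDIATE", "AUTO", "ON SCHEDULE EVERY 12 hour"))
print(render_mv_options(refresh="COMPLETE"))
```

For the advanced example above, this yields `BUILD IMMEDIATE REFRESH AUTO ON SCHEDULE EVERY 12 hour`; the second call yields `BUILD IMMEDIATE REFRESH COMPLETE ON MANUAL`.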

## Indexing

SQLMesh supports creating indexes in Doris to accelerate queries. Define an index as a statement after the model query, wrapped in `@IF(@runtime_stage = 'creating', ...)` so that it runs only when the table is being created.

**Example:**
```sql
MODEL (
    name my_indexed_table,
    kind FULL
);

SELECT
    user_id,
    username,
    city
FROM
    users;

@IF(
    @runtime_stage = 'creating',
    CREATE INDEX idx_username ON my_indexed_table (username) USING INVERTED COMMENT 'Inverted index on username'
);
```

## Comments

SQLMesh supports adding comments to tables and columns. Comments that exceed Doris's length limits are truncated automatically.

- **Table Comments**: Use the `description` property in the `MODEL` definition
- **Column Comments**: Use the `column_descriptions` property in the `MODEL` definition

```sql
MODEL (
    name my_commented_table,
    kind FULL,
    description 'This is a comprehensive table comment that describes the purpose and usage of this table in detail.',
    column_descriptions (
        id = "Unique identifier for each record",
        user_id = "Foreign key reference to users table",
        event_type = "Type of event that occurred"
    )
);
```

**Limits:**
- Table comments: 2048 characters (automatically truncated)
- Column comments: 255 characters (automatically truncated)
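
The effect of truncation can be previewed with a few lines of Python. This is a minimal sketch assuming a plain cut at the limits listed above; the constant and function names are hypothetical, and the adapter may treat edge cases differently.

```python
MAX_TABLE_COMMENT_LENGTH = 2048   # limit stated above for table comments
MAX_COLUMN_COMMENT_LENGTH = 255   # limit stated above for column comments


def truncate_comment(comment, max_length):
    """Return the comment unchanged if it fits, else cut it at max_length."""
    if len(comment) <= max_length:
        return comment
    return comment[:max_length]


long_description = "x" * 3000
print(len(truncate_comment(long_description, MAX_TABLE_COMMENT_LENGTH)))
print(len(truncate_comment(long_description, MAX_COLUMN_COMMENT_LENGTH)))
```

Because anything past the limit is dropped, it is worth putting the most important information at the start of long descriptions.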

## Views

SQLMesh supports both regular and materialized views in Doris. Materialized views are described in the Materialized Views section above.

### Regular Views

```sql
MODEL (
    name user_summary_view,
    kind VIEW
);

SELECT
    user_id,
    COUNT(*) as event_count,
    MAX(event_time) as last_event
FROM user_events
GROUP BY user_id;
```

## Schema Management

### Creating Schemas

```sql
-- Schemas in Doris are databases
CREATE DATABASE IF NOT EXISTS my_schema;
```

### Dropping Schemas

```sql
-- Doris does not support the CASCADE clause
DROP DATABASE my_schema;
```

## Data Operations

### Table Operations

**Create Table Like:**
```sql
-- Create a new table with the same structure as an existing table
CREATE TABLE new_table LIKE existing_table;
```

**Rename Table:**
```sql
-- Rename a table
ALTER TABLE old_table_name RENAME new_table_name;
```

**Delete Operations:**
```sql
-- Delete specific records
DELETE FROM table_name WHERE condition;

-- Full table deletion uses TRUNCATE for better performance
DELETE FROM table_name WHERE TRUE; -- Executes as TRUNCATE TABLE
```
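
The full-table case above can be pictured as a simple rewrite rule. The sketch below is only an illustration of that rule, assuming the check is a literal match on `TRUE`; `render_delete` is a hypothetical helper and does not reproduce the adapter's actual logic.

```python
def render_delete(table_name, where=None):
    """Return the statement to run for a delete request.

    A missing condition or a literal TRUE means "delete everything",
    which is rendered as TRUNCATE TABLE instead of DELETE.
    """
    condition = (where or "TRUE").strip()
    if condition.upper() == "TRUE":
        return f"TRUNCATE TABLE {table_name}"
    return f"DELETE FROM {table_name} WHERE {condition}"


print(render_delete("table_name", "user_id = 1"))
print(render_delete("table_name", "TRUE"))
```

The first call keeps the `DELETE` with its filter, while the second is rewritten to `TRUNCATE TABLE table_name`.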

## Dependencies

To use Doris with SQLMesh, install the required MySQL driver:

```bash
pip install "sqlmesh[doris]"
# or
pip install pymysql
```

## Resources

- [Doris Documentation](https://doris.apache.org/docs/)
- [Doris Data Models Guide](https://doris.apache.org/docs/table-design/data-model/)
- [Doris SQL Reference](https://doris.apache.org/docs/sql-manual/)