Skip to content

Improve INCREMENTAL_BY_UNIQUE_KEY merge for MSSQL #4568

@codykonior

Description

@codykonior

For MSSQL/T-SQL the MERGE statement generated for a INCREMENTAL_BY_UNIQUE_KEY model could use improvement.

It typically does:

MERGE INTO ... [__MERGE_TARGET__] 
USING (...)  AS [__MERGE_SOURCE__] 
ON Target Unique Key Columns = Source Unique Key Columns
WHEN MATCHED 
THEN UPDATE SET Each Target Column = Each Source Column

This is bad for MSSQL, updating a row to all of its existing values isn't cancelled out as a no-operation. It overwrites every single column and is all written to the transaction log. So we'd like to prevent row updates where it's not necessary, such as when the row has not changed.

So for MSSQL it's much better to do:

MERGE INTO ... [__MERGE_TARGET__] 
USING (...)  AS [__MERGE_SOURCE__] 
ON Target Unique Key Columns = Source Unique Key Columns
WHEN MATCHED AND EXISTS (SELECT Each Target Column EXCEPT SELECT Each Source Column) (except for the unique columns)
THEN UPDATE SET Each Target Column = Each Source Column (except for the unique columns)

The addition of EXISTS prevents an update from happening on a row where all of the other columns are the same. The syntax for this with SELECT/EXCEPT/SELECT is valid for MSSQL because it honours NULL comparisons, and in this way it's much better than trying to do WHEN MATCHED AND (ISNULL(a1, '') <> ISNULL(a2, '') OR b1 <> b2 OR c1 <> c2) and all of that kind of stuff. Note that this may not work the same way with other database engines, I don't know.

The second part of that change is being a bit more careful with the UPDATE SET columns to not update the unique columns, after all we already checked them as part of the match. But there's a specific reason for this too, because in MSSQL on a normal clustered table, trying to set a key column to itself will cause a deferred update where the engine internally deletes and inserts the entire row. By not setting them, it can update the row in place instead (assuming all the data types fit, etc). I'm not 100% sure this occurs on the heap tables sqlmesh is using, but it should be avoided anyway.

Together these will result in much reduced CPU and IO.

A third thing but slightly less important and that could be done separately 😓 is another keyword so that when you materialize the table it creates the table with a clustered primary key using the unique_key field. I know that we can create indexes using a post-statement, I don't think we can create them on the model table as part of the pre-statement), but for MSSQL converting the fully populated heap at the end to a clustered table is a huge amount of CPU/IO versus creating it that way up front. MSSQL doesn't perform well with heaps, or MERGE on heaps, and so for me every single model will end up as a clustered table if that was an option.

For the user-presented overall implementation I'd like to suggest allowing a keyword on a model like:

MODEL (
    name [xxx].[yyy],
    kind INCREMENTAL_BY_UNIQUE_KEY (
        unique_key [My Id],
        use_merge_exists True,
        use_merge_exists_except [Optionally, Some Column I Want To Exclude],
        type 'clustered'
    ),
    cron '*/5 * * * *'
);

Having use_merge_exists would generate the syntax as above and following the above rules, and use_merge_exists_except would allow you to exclude columns from the EXISTS comparison section.

This is to allow a pattern where you include a [Load Date] = SYSDATETIMEOFFSET() or similar column in your model. This is something that is useful because it would get inserted into the table when a row suddenly appears for the first time, or when the row was actually changed, but shouldn't trigger a change just in itself because it would otherwise force an update every single time, which we don't want.

The type could be used to specify clustered instead of heap tables. When sqlglot materializes the table, it can chuck on a create table ... ( ... constraint PK_modelname primary key clustered (unique_key)) with (data_compression = page). That would be a huge help.

I'm not great with Python but did jump in to try to give this all a shot, but got nowhere fast, as it requires a lot of internal and SQLglot usage knowledge that I just don't have.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions