microsoft · tomz · May 29, 2026
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -7,12 +7,33 @@ on:
     branches: [main]
 
 jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v5
+        with:
+          python-version: "3.11"
+          enable-cache: true
+
+      - name: Install dev dependencies
+        run: uv sync --group dev
+
+      - name: Ruff check
+        run: uv run ruff check src/ tests/
+
+      - name: Ruff format check
+        run: uv run ruff format --check src/ tests/
+
   unit-tests:
     runs-on: ubuntu-latest
+    needs: lint
     strategy:
       fail-fast: false
       matrix:
-        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12", "3.13"]
+        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
 
     steps:
       - uses: actions/checkout@v4
@@ -21,6 +42,7 @@ jobs:
         uses: astral-sh/setup-uv@v5
         with:
           python-version: ${{ matrix.python-version }}
+          enable-cache: true
 
       - name: Install dependencies
         run: uv sync --group dev
@@ -66,6 +88,7 @@ jobs:
         uses: astral-sh/setup-uv@v5
         with:
           python-version: "3.11"
+          enable-cache: true
 
       - name: Install dependencies (${{ matrix.engine }})
         run: uv sync --group dev ${{ matrix.extras_flags }}

diff --git a/.gitignore b/.gitignore
@@ -79,3 +79,6 @@ __lakebench_cli_cache__/
 # Optional: Docs builds
 site/
 docs/_build/
+
+# Personal scratch / scratchpads (workspace-specific drivers, demo captures)
+scratch/
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,18 @@
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.6.9
+    hooks:
+      - id: ruff
+        args: [--fix]
+      - id: ruff-format
+
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v5.0.0
+    hooks:
+      - id: trailing-whitespace
+      - id: end-of-file-fixer
+      - id: check-yaml
+      - id: check-toml
+      - id: check-merge-conflict
+      - id: check-added-large-files
+        args: [--maxkb=500]
diff --git a/README.md b/README.md
@@ -35,7 +35,7 @@ LakeBench exists to bring clarity, trust, accessibility, and relevance to engine
 
 
 ## ✅ Why LakeBench?
-- **Multi-Engine**: Benchmark Spark, DuckDB, Polars, Daft, Sail and others, side-by-side
+- **Multi-Engine**: Benchmark Spark, DuckDB, Polars, Daft, Sail, Spark Connect, Databricks, Livy and others, side-by-side
 - **Lifecycle Coverage**: Ingest, transform, maintain, and query—just like real workloads
 - **Diverse Workloads**: Test performance across varied data shapes and operations
 - **Consistent Execution**: One framework, many engines
@@ -47,7 +47,7 @@ LakeBench empowers data teams to make informed engine decisions based on real wo
 
 ## 💪 Benchmarks
 
-LakeBench currently supports four benchmarks with more to come:
+LakeBench currently supports five benchmarks with more to come:
 
 - **ELTBench**: A benchmark that simulates typical ELT workloads:
   - Raw data load (Parquet → Delta)
@@ -58,24 +58,25 @@ LakeBench currently supports four benchmarks with more to come:
 - **[TPC-DS](https://www.tpc.org/tpcds/)**: An industry-standard benchmark for complex analytical queries, featuring 24 source tables and 99 queries. Designed to simulate decision support systems and analytics workloads.
 - **[TPC-H](https://www.tpc.org/tpch/)**: Focuses on ad-hoc decision support with 8 tables and 22 queries, evaluating performance on business-oriented analytical workloads.
 - **[ClickBench](https://github.com/ClickHouse/ClickBench)**: A benchmark that simulates ad-hoc analytical and real-time queries on clickstream, traffic analysis, web analytics, machine-generated data, structured logs, and events data. The load phase (single flat table) is followed by 43 queries.
-
-_Planned_
-- **[TPC-DI](https://www.tpc.org/tpcdi/)**: An industry-standard benchmark for data integration workloads, evaluating end-to-end ETL/ELT performance across heterogeneous sources—including data ingestion, transformation, and loading processes.
+- **[TPC-DI](https://www.tpc.org/tpcdi/)**: An industry-standard benchmark for data integration workloads, evaluating end-to-end ETL/ELT performance across heterogeneous sources—including data ingestion from CSV, XML, and fixed-width files, dimensional model construction (SCD Type 2), incremental batch processing with CDC/merge logic, and audit validation.
 
 ## ⚙️ Engine Support Matrix
 
 LakeBench supports multiple lakehouse compute engines. Each benchmark scenario declares which engines it supports via `<BenchmarkClassName>.BENCHMARK_IMPL_REGISTRY`.
 
-| Engine          | ELTBench | TPC-DS | TPC-H   | ClickBench |
-|-----------------|:--------:|:------:|:-------:|:----------:|
-| Spark (Generic) |    ✅    |   ✅   |   ✅  |    ✅    |
-| Fabric Spark    |    ✅    |   ✅   |   ✅  |    ✅    |
-| Synapse Spark   |    ✅    |   ✅   |   ✅  |    ✅    |
-| HDInsight Spark |    ✅    |   ✅   |   ✅  |    ✅    |
-| DuckDB          |    ✅    |   ✅   |   ✅  |    ✅    |
-| Polars          |    ✅    |   ⚠️   |   ⚠️  |    ⚠️    |
-| Daft            |    ✅    |   ⚠️   |   ⚠️  |    ⚠️    |
-| Sail            |    ✅    |   ✅   |   ✅  |    ✅    |
+| Engine          | ELTBench | TPC-DS | TPC-H   | ClickBench | TPC-DI |
+|-----------------|:--------:|:------:|:-------:|:----------:|:------:|
+| Spark (Generic) |    ✅    |   ✅   |   ✅  |    ✅    |   ✅   |
+| Fabric Spark    |    ✅    |   ✅   |   ✅  |    ✅    |   ✅   |
+| Synapse Spark   |    ✅    |   ✅   |   ✅  |    ✅    |   ✅   |
+| HDInsight Spark |    ✅    |   ✅   |   ✅  |    ✅    |   ✅   |
+| DuckDB          |    ✅    |   ✅   |   ✅  |    ✅    |   ✅   |
+| Polars          |    ✅    |   ⚠️   |   ⚠️  |    ⚠️    |   ✅   |
+| Daft            |    ✅    |   ⚠️   |   ⚠️  |    ⚠️    |   ✅   |
+| Sail            |    ✅    |   ✅   |   ✅  |    ✅    |   ✅   |
+| Spark Connect   |    ✅    |   ✅   |   ✅  |    ✅    |   ✅   |
+| Databricks      |    ✅    |   ✅   |   ✅  |    ✅    |   ✅   |
+| Livy            |          |   ✅   |   ✅  |    ✅    |        |
 
 > **Legend:**  
 > ✅ = Supported  
@@ -275,6 +276,126 @@ benchmark.run()
 ```
 ---
 
+## Command Line Interface (CLI)
+
+LakeBench includes a CLI for running benchmarks and generating data without writing Python code.
+
+### Installation
+```bash
+pip install "lakebench[duckdb]"
+```
+
+After installation, the `lakebench` command is available.
+
+### Running Benchmarks
+```bash
+# Run TPC-H with a profile
+lakebench run --profile local-duckdb --benchmark tpch --scenario sf1 --scale-factor 1 --input-uri /tmp/tpch_sf1
+
+# Run with default profile
+lakebench run --benchmark tpch --scenario sf1 --scale-factor 1 --input-uri /tmp/tpch_sf1 --save-results --result-uri /tmp/results
+```
+
+### Generating Data
+```bash
+lakebench datagen --benchmark tpch --scale-factor 1 --output /tmp/tpch_sf1
+lakebench datagen --benchmark tpcds --scale-factor 1 --output /tmp/tpcds_sf1
+lakebench datagen --benchmark tpcdi --scale-factor 5 --output /tmp/tpcdi --digen-jar ./TPC-DI/DIGen.jar
+```
+
+### Managing Profiles
+```bash
+lakebench profiles list
+lakebench profiles show local-duckdb
+```
+
+## Profile Configuration (`.lakebench.json`)
+
+LakeBench uses a two-tier profile system:
+- **`~/.lakebench.json`** — Global user defaults (shared across all projects)
+- **`./lakebench.json`** — Project-level overrides (takes precedence over global)
+
+### Example Configuration
+```json
+{
+  "defaults": {
+    "profile": "local-duckdb",
+    "save_results": true,
+    "result_table_uri": "/tmp/lakebench/results"
+  },
+  "profiles": {
+    "local-duckdb": {
+      "engine": "duckdb",
+      "engine_options": {
+        "schema_or_working_directory_uri": "/tmp/lakebench"
+      }
+    },
+    "fabric-prod": {
+      "engine": "fabric_spark",
+      "engine_options": {
+        "lakehouse_name": "my_lakehouse",
+        "lakehouse_schema_name": "benchmarks"
+      }
+    },
+    "databricks-prod": {
+      "engine": "databricks",
+      "engine_options": {
+        "host": "https://xxx.cloud.databricks.com",
+        "cluster_id": "0123-456789-abcdef",
+        "schema_name": "benchmarks",
+        "token_env": "DATABRICKS_TOKEN"
+      }
+    },
+    "spark-connect": {
+      "engine": "spark_connect",
+      "engine_options": {
+        "remote": "sc://localhost:15002",
+        "schema_name": "benchmarks"
+      }
+    }
+  }
+}
+```
+
+> **Security Note:** Profiles reference environment variable names for tokens (`token_env`), never the tokens themselves.
+
+## Remote Execution Engines
+
+In addition to local engines, LakeBench supports remote execution backends:
+
+### Spark Connect
+Connect to a remote Spark cluster via the Spark Connect protocol:
+```python
+from lakebench.engines import SparkConnect
+engine = SparkConnect(remote="sc://host:15002", schema_name="benchmarks")
+```
+Install: `pip install lakebench[spark_connect]`
+
+### Databricks
+Connect to a Databricks cluster via Databricks Connect:
+```python
+from lakebench.engines import Databricks
+engine = Databricks(
+    host="https://xxx.cloud.databricks.com",
+    cluster_id="0123-456789-abcdef",
+    schema_name="benchmarks"
+)
+```
+Install: `pip install lakebench[databricks]`
+
+### Livy
+Execute benchmarks via the Apache Livy REST API:
+```python
+from lakebench.engines import Livy
+engine = Livy(
+    url="https://livy.example.com",
+    schema_or_working_directory_uri="/tmp/lakebench"
+)
+```
+Install: `pip install lakebench[livy]`
+
+---
+
 ## Managing Queries Over Various Dialects
 
 LakeBench supports multiple engines that each leverage different SQL dialects and capabilities. To handle this diversity while maintaining consistency, LakeBench employs a **hierarchical query resolution strategy** that balances automated transpilation with engine-specific customization.
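
The two-tier profile lookup and the `token_env` indirection described in the new "Profile Configuration" README section could be sketched roughly as follows. This is a minimal illustration and not LakeBench's actual loader: the function names (`load_json`, `merge_configs`, `load_config`, `resolve_profile`), the per-key merge semantics, and the resolved `token` option key are all assumptions. Only the two file paths and the JSON keys are taken from the README text in this diff.

```python
import json
import os
from pathlib import Path


def load_json(path):
    """Return the parsed JSON file, or an empty dict if it does not exist."""
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def merge_configs(global_cfg, project_cfg):
    """Merge two configs so that project-level entries win (assumed semantics)."""
    merged = {"defaults": {}, "profiles": {}}
    for cfg in (global_cfg, project_cfg):  # later config overrides earlier
        merged["defaults"].update(cfg.get("defaults", {}))
        merged["profiles"].update(cfg.get("profiles", {}))
    return merged


def load_config():
    """Load the global file, then apply project-level overrides on top."""
    return merge_configs(
        load_json(Path.home() / ".lakebench.json"),  # global user defaults
        load_json(Path.cwd() / "lakebench.json"),    # project-level overrides
    )


def resolve_profile(config, name=None, env=None):
    """Pick a profile and swap `token_env` for the token read from the environment."""
    env = os.environ if env is None else env
    name = name or config.get("defaults", {}).get("profile")
    if name is None:
        raise KeyError("no profile given and no default profile configured")
    if name not in config.get("profiles", {}):
        raise KeyError(f"unknown profile: {name!r}")
    profile = config["profiles"][name]

    options = dict(profile.get("engine_options", {}))
    token_var = options.pop("token_env", None)
    if token_var is not None:
        if token_var not in env:
            raise RuntimeError(f"environment variable {token_var} is not set")
        # Hypothetical key: the secret only ever exists in memory, never in the file.
        options["token"] = env[token_var]
    return {"engine": profile["engine"], "engine_options": options}
```

With the example configuration from the README, `resolve_profile(load_config(), "databricks-prod")` would read the token from `DATABRICKS_TOKEN` at call time, which keeps the secret out of both JSON files.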