Commit 6c3a7aa

Merge branch 'main' into main
2 parents 5acee3e + 0357cb9 commit 6c3a7aa

10 files changed

Lines changed: 706 additions & 555 deletions

.claude/skills/ptq/SKILL.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -124,9 +124,9 @@ Report the path and size to the user.
 
 ## Common Pitfalls
 
-- **Transformers version**: Newer models (e.g., Devstral/ministral3) may require a transformers version not yet in the container. Check `config.json` for `transformers_version` and upgrade if needed. Install ModelOpt first, then upgrade transformers **with** deps (not `--no-deps`) to pull compatible `huggingface_hub`
+- **Transformers version**: New models may need a newer version of transformers than what's installed. Check `config.json` for `transformers_version`. In containers, beware of `PIP_CONSTRAINT` blocking upgrades — see `references/slurm-setup-ptq.md` for workarounds
 - **Gated datasets**: Some calibration datasets require HF authentication. Ensure `HF_TOKEN` is set in the job environment, or use `--dataset cnn_dailymail` as a non-gated alternative
-- **NFS root_squash + Docker**: Docker runs as root, but NFS squashes root to `nobody`. Use `docker run --user $(id -u):$(id -g)`, or `chmod -R a+rwX` on needed directories as a fallback. See `skills/common/slurm-setup.md` section 5
+- **NFS root_squash + Docker**: See `skills/common/slurm-setup.md` section 5
 
 ## References
 
```
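The gated-dataset pitfall can be sketched as a small fallback helper. This is an illustrative sketch, not part of the skill: `calib_dataset_args` is our own name, and the flag values simply mirror the `--dataset cnn_dailymail` fallback described above.

```python
import os


def calib_dataset_args(gated_dataset):
    """Hypothetical helper: pick calibration dataset flags, falling back to
    the non-gated cnn_dailymail when no HF token is in the environment."""
    if os.environ.get("HF_TOKEN"):
        return ["--dataset", gated_dataset]
    return ["--dataset", "cnn_dailymail"]
```

A job launcher could append the returned pair to its command line before submission.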
.claude/skills/ptq/references/slurm-setup-ptq.md

Lines changed: 36 additions & 18 deletions

````diff
@@ -7,29 +7,54 @@ monitoring), see `skills/common/slurm-setup.md`.
 
 ## 1. Container
 
-Get the recommended image version from `examples/llm_ptq/README.md`, then look for a `.sqsh` file in the workspace and common sibling directories:
+Get the recommended image version from `examples/llm_ptq/README.md`, then look for an existing `.sqsh` file:
 
 ```bash
 ls *.sqsh ../*.sqsh ~/containers/*.sqsh 2>/dev/null
 ```
 
-If you find a `.sqsh` but aren't sure of its version, check it:
+**If a `.sqsh` exists**, use it directly with `--container-image=<path>`. Skip import.
+
+**If no `.sqsh` exists**, import with enroot (caches for subsequent smoke tests and reruns):
 
 ```bash
-srun --container-image=<path/to/container.sqsh> --ntasks=1 bash -c \
-  "pip show tensorrt-llm 2>/dev/null | grep Version || cat /VERSION 2>/dev/null || echo unknown"
+export ENROOT_CACHE_PATH=/path/to/writable/enroot-cache
+export ENROOT_DATA_PATH=/path/to/writable/enroot-data
+mkdir -p "$ENROOT_CACHE_PATH" "$ENROOT_DATA_PATH"
+enroot import --output /path/to/container.sqsh docker://nvcr.io#nvidia/tensorrt-llm/release:<version>
 ```
 
-If no `.sqsh` exists, import it with enroot. Set writable cache paths first — the default `/raid/containers` is often not writable:
+If enroot import fails (e.g., permission errors on lustre), use pyxis inline pull as fallback — pass the NGC URI directly to `--container-image="nvcr.io/nvidia/tensorrt-llm/release:<version>"`. Note this re-pulls on every job.
+
+### Container dependency pitfalls
+
+**New models may need newer transformers** than what's in the container:
 
 ```bash
-export ENROOT_CACHE_PATH=/path/to/writable/enroot-cache
-export ENROOT_DATA_PATH=/path/to/writable/enroot-data
-export TMPDIR=/path/to/writable/tmp
-mkdir -p "$ENROOT_CACHE_PATH" "$ENROOT_DATA_PATH" "$TMPDIR"
+pip install -U transformers
+```
+
+For unlisted models that need unreleased transformers (e.g., from git), see `references/unsupported-models.md` Step A.
+
+**Prefer `PYTHONPATH`** to use the synced ModelOpt source instead of installing inside the container — this avoids risking dependency conflicts (e.g., `pip install -U nvidia-modelopt[hf]` can upgrade PyTorch and break other packages):
+
+```bash
+export PYTHONPATH=/path/to/Model-Optimizer:$PYTHONPATH
+```
+
+If `PYTHONPATH` doesn't work due to missing compiled extensions, fall back to `pip install -e ".[hf]" --no-build-isolation` (run from the Model-Optimizer repo root).
+
+**Watch for pip dependency conflicts** — NGC containers set `PIP_CONSTRAINT` to pin versions, causing `ResolutionImpossible` errors. Unset it first so pip can resolve freely:
+
+```bash
+unset PIP_CONSTRAINT
+pip install -U transformers  # now upgrades and resolves with new deps included
+```
 
-enroot import --output /path/to/container.sqsh \
-  docker://nvcr.io#nvidia/tensorrt-llm/release:<version>
+If that still conflicts, fall back to `--no-deps` (skips new deps — may need to add missing ones manually):
+
+```bash
+pip install -U transformers --no-deps
 ```
 
 ---
@@ -68,10 +93,3 @@ This catches script errors cheaply before using GPU quota on a real run.
 See `skills/common/slurm-setup.md` section 2 for the smoke test partition pattern.
 
 Only submit the full calibration job after the smoke test exits cleanly.
-
----
-
-## 4. PTQ-Specific Notes
-
-- **Gated datasets**: Some calibration datasets (e.g., `nvidia/Nemotron-Post-Training-Dataset-v2`) require HF authentication. Set `HF_TOKEN` in the job environment, or use `--dataset cnn_dailymail` to use a non-gated alternative.
-- **NFS permissions**: Docker + NFS root_squash causes `PermissionError` on output/cache dirs. See `skills/common/slurm-setup.md` section 5 for fixes.
````
.claude/skills/ptq/references/unsupported-models.md

Lines changed: 13 additions & 8 deletions

````diff
@@ -15,7 +15,11 @@ After download, inspect the model files on the target machine (use `remote_run`
 
 Write custom scripts locally (in `./workspaces/<model>/scripts/`), then sync to remote before running.
 
-**Then check `config.json`** (on the target machine):
+**Check transformers compatibility** (on the target machine):
+
+First, if README or `config.json` specifies a required transformers version, check if installed version satisfies it. If not, upgrade: `pip install -U "transformers>=<required_version>"`.
+
+Then try loading:
 
 ```bash
 python -c "
@@ -40,16 +44,14 @@ print(type(cfg).__name__)
 
 Read the modeling file and proceed to Step B.
 
-- **Raises `ValueError` / `OSError` (unknown architecture)** → not in the installed transformers. Determine why:
-
-  1. **Check the transformers `main` branch** (not yet released):
+- **Raises `ValueError` / `OSError` (unknown architecture)** → not in the installed transformers. Try `pip install -U transformers` first. If still not found, check the `main` branch:
 
 ```bash
 git clone --depth 1 https://github.com/huggingface/transformers.git /tmp/transformers-main --quiet
 grep -r "class <ArchName>" /tmp/transformers-main/src/transformers/models/
 ```
 
-- **Found** → install with deps: `pip install /tmp/transformers-main`, then re-run `AutoConfig.from_pretrained()`. **Important**: if ModelOpt is already installed, its `[hf]` extras may have pinned an older transformers. Install ModelOpt first, then upgrade transformers **after** (with deps, not `--no-deps`) so compatible `huggingface_hub` and other transitive deps are pulled in.
+- **Found** → `pip install /tmp/transformers-main`, then re-run `AutoConfig`.
 - **Not found** → ask the user: *"The checkpoint uses `<ArchName>` which isn't in released or main-branch transformers. Do you have a private fork or custom modeling code?"*
 
 - **No `config.json`** → not a standard HF checkpoint. List the directory for README or `.py` files. If nothing useful, ask the user for the modeling code.
@@ -131,13 +133,15 @@ class QuantCustomModule(OriginalModule):
 
 ## Pattern 2: MoE Models
 
-**Standard MoE** (per-expert `nn.Linear` in a `ModuleList` with `gate` + `experts`): Auto-detected by `register_sparse_moe_on_the_fly`. No custom code needed — amax sync and calibration coverage are handled automatically.
+**Most MoE models are auto-detected** — ModelOpt handles two common patterns automatically:
+
+- **transformers >= 5.0**: Unified fused experts (`gate_up_proj` + `down_proj` 3D tensors) → auto-detected by `register_fused_experts_on_the_fly`, handled by `_QuantFusedExperts`. Covers Mixtral, Qwen, DeepSeek, Jamba, OlMoE, etc.
+- **transformers < 5.0**: Sequential per-expert `nn.Linear` with `gate` + `experts` → auto-detected by `register_sparse_moe_on_the_fly`.
 
-**Custom MoE** requires patching. Read the model source to understand how expert weights are stored and computed, then find the closest pattern in the plugin (`modelopt/torch/quantization/plugins/huggingface.py`):
+**Custom MoE** (non-standard layout not matching auto-detection) requires patching. Find the closest pattern in the plugin (`modelopt/torch/quantization/plugins/huggingface.py`):
 
 | MoE design | Strategy | Plugin example |
 | --- | --- | --- |
-| Fused weights + per-expert dispatch loop | Expand to per-expert `nn.Linear` | `_QuantQwen35MoeExperts` |
 | Fused weights + `torch.bmm` | Add `TensorQuantizer` around bmm | `_QuantLlama4TextExperts` |
 | Fused weights + functional interception | Intercept matmul ops | `_QuantGptOssExperts` |
 | Fused 2D weights (experts stacked in rows) | Two-level expansion | `_QuantDbrxExpertGLU` |
@@ -343,3 +347,4 @@ tokenizer.save_pretrained(output_path)
 - **Check quantizer summary**: `mtq.print_quant_summary(model)` shows which quantizers are enabled/disabled
 - **Inspect dtypes**: After loading, iterate `model.named_parameters()` and check for unexpected FP8 tensors
 - **Watch for silent disabling**: A misconfigured wildcard pattern can silently disable quantizers — always verify the summary
+- **Read pip errors carefully**: `ResolutionImpossible` means dependency conflict (try `--no-deps`), NOT network failure. Check for `Connection refused`/`Name resolution failed` before concluding network is down
````
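The architecture probe in the reference's Step A essentially reads `architectures` from `config.json` and checks whether the installed library knows the class. A dependency-free sketch under stated assumptions (the helper name and the known-architectures set are our own; real `AutoConfig.from_pretrained` does more than this):

```python
import json
from pathlib import Path


def probe_architecture(checkpoint_dir, known_architectures):
    """Return (arch_name, supported) from a checkpoint's config.json,
    approximating the AutoConfig.from_pretrained() check."""
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return None, False  # not a standard HF checkpoint
    cfg = json.loads(config_path.read_text())
    arch = (cfg.get("architectures") or [None])[0]
    return arch, arch in known_architectures
```

An unknown architecture here corresponds to the `ValueError` / `OSError` branch above; a missing `config.json` to the "not a standard HF checkpoint" branch.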

examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -142,7 +142,8 @@ def keep_conversation(entry):
     tokenizer = AutoTokenizer.from_pretrained(args.model, trust_remote_code=args.trust_remote_code)
     if tokenizer.pad_token is None:
         tokenizer.pad_token = tokenizer.eos_token
-    tokenizer.chat_template = tokenizer.chat_template.replace(REMOVE_THINK_CHAT_TEMPLATE, "")
+    if tokenizer.chat_template is not None:
+        tokenizer.chat_template = tokenizer.chat_template.replace(REMOVE_THINK_CHAT_TEMPLATE, "")
 
     output_dir = args.output_dir
     output_dir.mkdir(parents=True, exist_ok=True)
```
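The guard matters because tokenizers without a configured chat template expose `chat_template = None`, and calling `.replace()` on `None` raises. A minimal sketch with a stand-in class (the marker string is a placeholder, not the real `REMOVE_THINK_CHAT_TEMPLATE` value):

```python
class FakeTokenizer:
    """Minimal stand-in for a tokenizer's chat_template attribute."""

    def __init__(self, chat_template):
        self.chat_template = chat_template


REMOVE_THINK_MARKER = "<think-block>"  # placeholder marker for illustration


def strip_think_template(tok):
    # Without the None check, tok.chat_template.replace(...) raises
    # AttributeError: 'NoneType' object has no attribute 'replace'.
    if tok.chat_template is not None:
        tok.chat_template = tok.chat_template.replace(REMOVE_THINK_MARKER, "")
    return tok
```

With the guard, template-less tokenizers pass through unchanged and templated ones have the marker stripped.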

modelopt/torch/__init__.py

Lines changed: 14 additions & 1 deletion

```diff
@@ -15,12 +15,25 @@
 
 """Model optimization and deployment subpackage for torch."""
 
+import importlib
 import warnings as _warnings
 
 from packaging.version import Version as _Version
 from torch import __version__ as _torch_version
 
-from . import distill, nas, opt, peft, prune, quantization, sparsity, speculative, utils
+# Pre-initialize torch._dynamo to prevent double-registration with peft's torch.compile() call
+importlib.import_module("torch._dynamo")
+from . import (  # noqa: E402
+    distill,
+    nas,
+    opt,
+    peft,
+    prune,
+    quantization,
+    sparsity,
+    speculative,
+    utils,
+)
 
 if _Version(_torch_version) < _Version("2.9"):
     _warnings.warn(
```
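The pre-import works because `importlib.import_module` populates `sys.modules` on first use, and every later import of the same name reuses that cached module object, so initialization side effects run exactly once. Demonstrated here with a stdlib module in place of `torch._dynamo`:

```python
import importlib
import sys

# First import initializes the module and caches it in sys.modules.
mod = importlib.import_module("json")

# A subsequent plain import returns the same cached object; nothing re-runs.
import json  # noqa: E402

assert mod is json
assert sys.modules["json"] is json
```

This is why pre-initializing `torch._dynamo` before the subpackage imports avoids the double-registration the comment describes.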

modelopt/torch/quantization/config.py

Lines changed: 28 additions & 1 deletion

```diff
@@ -1560,6 +1560,10 @@ def normalize_quant_cfg_list(v: dict | list) -> list[QuantizerCfgEntry]:
     - An empty entry ``{}``.
     - An entry with only ``quantizer_name`` and no other keys — the only effect would be an
       implicit ``enable=True``, which must be stated explicitly.
+    - An entry with ``enable=True`` (explicit or implicit) whose ``cfg`` is not a non-empty
+      ``dict`` or ``list`` — e.g. ``{"quantizer_name": "*", "cfg": {}}`` or
+      ``{"quantizer_name": "*", "cfg": 42}``. An enabled quantizer must have a valid
+      configuration.
 
     **Normalization** — after conversion and validation every entry is put into canonical form:
 
@@ -1577,7 +1581,8 @@ def normalize_quant_cfg_list(v: dict | list) -> list[QuantizerCfgEntry]:
 
     Raises:
         ValueError: If any entry has only ``quantizer_name`` with neither ``cfg`` nor ``enable``,
-            or if the entry format is not recognized.
+            if ``enable=True`` with an empty or non-dict/list ``cfg``, or if the entry format
+            is not recognized.
     """
 
     def _warn_legacy():
@@ -1662,6 +1667,28 @@ def _dict_to_entry(key: str, value) -> list[QuantizerCfgEntry]:
                 "enable=True is not allowed; set it explicitly)."
             )
 
+        # Validate: when cfg is present and enable=True, cfg must be a non-empty
+        # dict or list. An empty cfg would attempt to create a
+        # QuantizerAttributeConfig with no actual configuration.
+        cfg = entry.get("cfg")
+        enable = entry.get("enable", True)
+        if enable and cfg is not None:
+            if isinstance(cfg, dict):
+                is_invalid = len(cfg) == 0
+            elif isinstance(cfg, list):
+                is_invalid = len(cfg) == 0 or any(
+                    not isinstance(item, dict) or len(item) == 0 for item in cfg
+                )
+            else:
+                is_invalid = True
+            if is_invalid:
+                raise ValueError(
+                    f"Invalid quant_cfg entry: {raw!r} — 'cfg' must be a non-empty dict "
+                    f"or a non-empty list of non-empty dicts when enabling a quantizer "
+                    f"(got {type(cfg).__name__}: {cfg!r}). Either provide quantizer "
+                    "attributes in 'cfg' or remove 'cfg' and set 'enable' explicitly."
+                )
+
         # Normalize: make enable and cfg always explicit.
         entry.setdefault("enable", True)
         entry.setdefault("cfg", None)
```

tests/examples/speculative_decoding/conftest.py

Lines changed: 27 additions & 0 deletions

```diff
@@ -13,11 +13,38 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import json
+
 import pytest
 import yaml
 from _test_utils.examples.run_command import run_example_command
 
 
+@pytest.fixture(scope="session")
+def tiny_conversations_path(tmp_path_factory):
+    """Tiny JSONL with short synthetic conversations for compute_hidden_states_hf tests.
+
+    Uses minimal single-turn conversations so that tokenized lengths stay well
+    within the tiny test model's max_position_embeddings (32) even after chat
+    template formatting.
+    """
+    tmp_dir = tmp_path_factory.mktemp("tiny_convs")
+    output_file = tmp_dir / "train.jsonl"
+    conversations = [
+        {
+            "conversation_id": f"test-{i}",
+            "conversations": [
+                {"role": "user", "content": "What is 2 plus 2?"},
+                {"role": "assistant", "content": "4"},
+            ],
+        }
+        for i in range(5)
+    ]
+    with open(output_file, "w") as f:
+        f.writelines(json.dumps(conv) + "\n" for conv in conversations)
+    return output_file
+
+
 @pytest.fixture(scope="session", autouse=True)
 def tiny_daring_anteater_path(tmp_path_factory):
     tmp_dir = tmp_path_factory.mktemp("daring_anteater")
```
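The fixture's JSONL layout (one JSON object per line) round-trips cleanly; a quick self-contained check of the same format, using a temporary path of our own rather than the fixture's:

```python
import json
import tempfile
from pathlib import Path

# Same shape the fixture writes: five tiny single-turn conversations.
conversations = [
    {
        "conversation_id": f"test-{i}",
        "conversations": [
            {"role": "user", "content": "What is 2 plus 2?"},
            {"role": "assistant", "content": "4"},
        ],
    }
    for i in range(5)
]

path = Path(tempfile.mkdtemp()) / "train.jsonl"
with open(path, "w") as f:
    f.writelines(json.dumps(conv) + "\n" for conv in conversations)

# json.loads per line recovers each conversation intact.
loaded = [json.loads(line) for line in path.read_text().splitlines()]
```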

tests/examples/speculative_decoding/test_eagle_offline_ptq.py

Lines changed: 4 additions & 2 deletions

```diff
@@ -55,7 +55,7 @@ def offline_ptq_dirs(tmp_path_factory):
     }
 
 
-def test_collect_hidden_states(tiny_llama_path, tiny_daring_anteater_path, offline_ptq_dirs):
+def test_collect_hidden_states(tiny_llama_path, tiny_conversations_path, offline_ptq_dirs):
     """Stage 1: generate .pt hidden state files from the base model."""
     run_example_command(
         [
@@ -64,11 +64,13 @@ def test_collect_hidden_states(tiny_llama_path, tiny_daring_anteater_path, offli
             "--model",
             tiny_llama_path,
             "--input-data",
-            str(tiny_daring_anteater_path),
+            str(tiny_conversations_path),
             "--output-dir",
             str(offline_ptq_dirs["hidden_states"]),
            "--debug-max-num-conversations",
             "2",
+            "--max-seq-len",
+            "32",
         ],
        "speculative_decoding",
     )
```

tests/unit/torch/quantization/test_config_validation.py

Lines changed: 54 additions & 0 deletions

```diff
@@ -163,6 +163,60 @@ def test_error_on_multi_key_legacy_dict(self):
         with pytest.raises(ValueError):
             normalize_quant_cfg_list([{"*weight_quantizer": {}, "*input_quantizer": {}}])
 
+    def test_error_on_empty_cfg_dict_implicit_enable(self):
+        """Entry with cfg={} and implicit enable=True is rejected."""
+        with pytest.raises(ValueError, match="non-empty dict"):
+            normalize_quant_cfg_list([{"quantizer_name": "*weight_quantizer", "cfg": {}}])
+
+    def test_error_on_empty_cfg_dict_explicit_enable_true(self):
+        """Entry with cfg={} and explicit enable=True is rejected."""
+        with pytest.raises(ValueError, match="non-empty dict"):
+            normalize_quant_cfg_list(
+                [{"quantizer_name": "*weight_quantizer", "cfg": {}, "enable": True}]
+            )
+
+    def test_error_on_empty_cfg_list_enable_true(self):
+        """Entry with cfg=[] and enable=True is rejected."""
+        with pytest.raises(ValueError, match="non-empty dict"):
+            normalize_quant_cfg_list(
+                [{"quantizer_name": "*weight_quantizer", "cfg": [], "enable": True}]
+            )
+
+    def test_error_on_non_dict_non_list_cfg_enable_true(self):
+        """Entry with cfg of invalid type (e.g. int) and enable=True is rejected."""
+        with pytest.raises(ValueError, match="non-empty dict"):
+            normalize_quant_cfg_list(
+                [{"quantizer_name": "*weight_quantizer", "cfg": 42, "enable": True}]
+            )
+
+    def test_error_on_cfg_list_with_empty_dict_enable_true(self):
+        """Entry with cfg=[{}] and enable=True is rejected (empty dict element)."""
+        with pytest.raises(ValueError, match="non-empty dict"):
+            normalize_quant_cfg_list(
+                [{"quantizer_name": "*weight_quantizer", "cfg": [{}], "enable": True}]
+            )
+
+    def test_error_on_cfg_list_with_non_dict_element_enable_true(self):
+        """Entry with cfg=[42] and enable=True is rejected (non-dict element)."""
+        with pytest.raises(ValueError, match="non-empty dict"):
+            normalize_quant_cfg_list(
+                [{"quantizer_name": "*weight_quantizer", "cfg": [42], "enable": True}]
+            )
+
+    def test_empty_cfg_dict_enable_false_accepted(self):
+        """Entry with cfg={} and enable=False is allowed (disable-only entry)."""
+        result = normalize_quant_cfg_list(
+            [{"quantizer_name": "*input_quantizer", "cfg": {}, "enable": False}]
+        )
+        assert result[0]["enable"] is False
+
+    def test_empty_cfg_list_enable_false_accepted(self):
+        """Entry with cfg=[] and enable=False is allowed (disable-only entry)."""
+        result = normalize_quant_cfg_list(
+            [{"quantizer_name": "*input_quantizer", "cfg": [], "enable": False}]
+        )
+        assert result[0]["enable"] is False
+
     def test_new_format_with_list_cfg(self):
         """cfg can be a list of dicts for SequentialQuantizer."""
         raw = [
```
