Fix tensor device mismatch in _eval_polarisable_electrode when training with CUDA #5

Closed
Copilot wants to merge 2 commits into master from copilot/fix-runtimeerror-tensor-device-mismatch

Conversation

Copilot AI commented Mar 23, 2026

dp train --init-frz-model crashes with a RuntimeError when CUDA is available. setup_from_lammps creates its tensors on torch_admp.env.DEVICE (cuda:0), while t_positions, t_box, t_charges, and the neighbor-list tensors were all created on the CPU. The two sets of tensors meet inside charge_optimization, where calc_coulomb_potential calls torch.where(electrode_mask == 0, charges, 0.0) and fails with the device mismatch.

Changes

  • _eval_polarisable_electrode in dipole_charge_electrode.py:
    • Call setup_from_lammps first and derive device from input_data[0].device
    • Move self.calculator and self.er to device (no-op if already co-located)
    • Create t_positions, t_box, t_charges on device via torch.as_tensor(..., device=device) so vesin_nblist returns neighbor-list tensors on the same device automatically
# Before: tensors defaulted to CPU regardless of env.DEVICE
t_positions = torch.tensor(positions.reshape(-1, 3), requires_grad=True)  # CPU
input_data = setup_from_lammps(...)  # tensors on env.DEVICE (cuda:0)
# → RuntimeError inside charge_optimization

# After: device aligned before any tensor construction
input_data = setup_from_lammps(...)
device = input_data[0].device
self.calculator.to(device)
self.er.to(device)
t_positions = torch.as_tensor(positions.reshape(-1, 3), device=device).requires_grad_(True)
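The pattern in the "After" snippet can be tried in isolation. The sketch below is not the code from this PR: as_leaf_on is a hypothetical helper, and reference is a stand-in for input_data[0], whose real shape and dtype are not shown here. It only illustrates building every tensor on the device of one reference tensor, and it falls back to the CPU when CUDA is unavailable.

```python
import numpy as np
import torch


def as_leaf_on(reference, array, requires_grad=False):
    """Hypothetical helper: build a tensor on the device and dtype of `reference`."""
    t = torch.as_tensor(array, dtype=reference.dtype, device=reference.device)
    # requires_grad_ is applied after placement, so the result stays a leaf tensor.
    return t.requires_grad_(True) if requires_grad else t


# Stand-in for input_data[0]; falls back to CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
reference = torch.zeros(1, dtype=torch.float64, device=device)

positions = np.linspace(0.0, 1.0, 12)          # flat coordinates for 4 atoms
charges = np.array([0.5, -0.5, 0.2, -0.2])
electrode_mask = torch.tensor([1, 1, 0, 0], device=reference.device)

t_positions = as_leaf_on(reference, positions.reshape(-1, 3), requires_grad=True)
t_charges = as_leaf_on(reference, charges)

# Same kind of call as in the traceback; all operands now share one device.
modified = torch.where(electrode_mask == 0, t_charges, torch.zeros_like(t_charges))
```

Calling requires_grad_(True) after the tensor is already on the target device keeps t_positions a leaf, so gradients accumulate on it directly. Creating a CPU tensor with requires_grad=True and moving it to the GPU afterwards would instead produce a non-leaf copy.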

Original prompt

This section details the original issue you should resolve

<issue_title>[Bug] RuntimeError: Expected all tensors to be on the same device (cuda:0 and cpu) during dp train --init-frz-model</issue_title>
<issue_description>Bug summary

A RuntimeError regarding tensor device mismatch occurs when executing dp train --init-frz-model. The error is triggered inside torch_admp/electrode.py during the modify_data phase called by the ec-MLP modifier. It seems electrode_mask and charges are located on different devices (one on cuda:0 and one on cpu) when computing modified_charges.

DeePMD-kit Version
3.1.4.dev2+ga60e90be6

ec-MLP version:
1.1.0.dev5+g2c7383b5b

torch-admp version:
1.1.5

How did you download the software?
Built from source

Input Files, Running Commands, Error Log, etc.
issue.tar.gz

Steps to Reproduce
dp train ec-mlp.json --init-frz-model ec-mlp.pb

</issue_description>

<agent_instructions>Any suggestion about this issue?</agent_instructions>

Comments on the Issue (you are @copilot in this section)


@ChiahsinChu
Owner

Here is the error log:

Traceback (most recent call last):
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/bin/dp", line 6, in <module>
    sys.exit(main())
             ^^^^^^
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/deepmd/main.py", line 1052, in main
    deepmd_main(args)
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/deepmd/tf/entrypoints/main.py", line 68, in main
    train_dp(**dict_args)
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/deepmd/tf/entrypoints/train.py", line 189, in train
    _do_work(jdata, run_opt, is_compress)
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/deepmd/tf/entrypoints/train.py", line 297, in _do_work
    model.train(train_data, valid_data)
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/deepmd/tf/train/trainer.py", line 533, in train
    datasetloader = DatasetLoader(train_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/deepmd/tf/train/trainer.py", line 1003, in __init__
    batch_data = self.train_data.get_batch()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/deepmd/utils/data_system.py", line 460, in get_batch
    b_data = self.get_batch_standard(sys_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/deepmd/utils/data_system.py", line 487, in get_batch_standard
    b_data = self.data_systems[self.pick_idx].get_batch(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/deepmd/utils/data.py", line 302, in get_batch
    self._load_batch_set(self.dirs[self.set_count % self.get_numb_set()])
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/deepmd/utils/data.py", line 594, in _load_batch_set
    self.modifier.modify_data(self.batch_set, self)
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/ec_mlp/tf/modifier/dipole_charge_electrode.py", line 284, in modify_data
    tot_e, tot_f, tot_v = self.eval(coord, box, atype)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/ec_mlp/tf/modifier/dipole_charge_electrode.py", line 185, in eval
    e, f, v = self._eval_polarisable_electrode(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/ec_mlp/tf/modifier/dipole_charge_electrode.py", line 96, in _eval_polarisable_electrode
    _q_opt, _efield = charge_optimization(
                      ^^^^^^^^^^^^^^^^^^^^
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/torch_admp/electrode.py", line 489, in charge_optimization
    chi_elec, _energy = calculator.calc_coulomb_potential(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/public/groups/chenggroup/jpqiu/conda/ec-MLP/lib/python3.11/site-packages/torch_admp/electrode.py", line 66, in calc_coulomb_potential
    modified_charges = torch.where(electrode_mask == 0, charges, 0.0)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
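The error message reports that two devices are involved but not which tensor lives where. One way to narrow that down is a small check placed just before the failing call. The sketch below is an assumption for illustration, not part of torch_admp or ec-MLP: check_same_device is a hypothetical helper that only reads each argument's .device attribute, so the stand-in objects here let it run without a GPU, and real torch tensors can be passed the same way.

```python
from types import SimpleNamespace


def check_same_device(**tensors):
    """Return the shared device name, or raise listing each tensor's device."""
    devices = {name: str(t.device) for name, t in tensors.items()}
    if len(set(devices.values())) > 1:
        listing = ", ".join(f"{name} on {dev}" for name, dev in devices.items())
        raise RuntimeError(f"device mismatch: {listing}")
    return next(iter(devices.values()), None)


# Stand-ins with a `.device` attribute, mirroring the situation in the log.
electrode_mask = SimpleNamespace(device="cuda:0")
charges = SimpleNamespace(device="cpu")

try:
    check_same_device(electrode_mask=electrode_mask, charges=charges)
except RuntimeError as err:
    message = str(err)  # names the offending tensors explicitly
```

Devices are compared as strings so that the check does not depend on how the device objects themselves implement equality.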

@codecov

codecov Bot commented Mar 23, 2026

⚠️ JUnit XML file not found

The CLI was unable to find any JUnit XML files to upload.
For more help, visit our troubleshooting guide.

…training

Co-authored-by: ChiahsinChu <53895049+ChiahsinChu@users.noreply.github.com>
Agent-Logs-Url: https://github.com/ChiahsinChu/ec-MLP/sessions/92e40153-2342-4caf-a27b-626c06c5aea4
Copilot AI changed the title from "[WIP] Fix RuntimeError on tensor device mismatch during dp train" to "Fix tensor device mismatch in _eval_polarisable_electrode when training with CUDA" on Mar 23, 2026
Copilot AI requested a review from ChiahsinChu March 23, 2026 07:50
@ChiahsinChu ChiahsinChu closed this Apr 9, 2026
Development

Successfully merging this pull request may close these issues.

[Bug] RuntimeError: Expected all tensors to be on the same device (cuda:0 and cpu) during dp train --init-frz-model
