Releases: DigiField/StreamDiffusionV2
0.1.1+f2
This release focuses on reducing the VRAM used by the T5 text encoder, a ~9.8 GB model that encodes prompts. With this release, it is finally possible to run StreamDiffusionV2 on a GPU with 8 GB of VRAM, albeit at a low resolution and frame rate.
Samples of StreamDiffusionV2 running on an RTX 4070 laptop
T5 offloading
The T5 text encoder is now automatically offloaded to the CPU if the GPU does not have enough memory. (343ae3f)
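The decision can be pictured with a minimal sketch. The exact criterion used in 343ae3f is not stated here, so the function name, the headroom parameter, and the simple size comparison below are all assumptions made for illustration.

```python
GIB = 1024 ** 3


def pick_text_encoder_device(free_vram_bytes, encoder_bytes, headroom_bytes=0):
    """Return "cuda" if the encoder fits in free VRAM, otherwise "cpu".

    Hypothetical helper: the real check in the pipeline may differ.
    """
    if free_vram_bytes >= encoder_bytes + headroom_bytes:
        return "cuda"
    return "cpu"


# With PyTorch installed, the free figure could come from
# torch.cuda.mem_get_info(), which returns (free_bytes, total_bytes).
encoder_bytes = int(9.8e9)  # ~9.8 GB, the encoder size quoted in these notes

print(pick_text_encoder_device(8 * GIB, encoder_bytes))   # prints "cpu"
print(pick_text_encoder_device(24 * GIB, encoder_bytes))  # prints "cuda"
```

An 8 GiB card holds about 8.6 GB, which is less than the encoder alone, so the sketch falls back to the CPU in that case.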
Prompt caching
Prompt encodings usually don't change if the prompt itself doesn't, yet encoding a prompt can take a while, especially if T5 is offloaded to the CPU. This is where prompt caching comes in.
Prompt caching saves encoded prompts into a new cache folder. This folder can be set using the prompt_cache_dir argument of StreamDiffusionV2Pipeline. When prepare() is called with a new prompt, it generates the encoding and saves it to this folder. During subsequent calls to prepare() with the same prompt, the encoding is loaded from the folder instead of being regenerated. This can significantly cut down the time to the first generated frame if the prompts that will be used are known in advance.
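The load-or-encode flow described above can be sketched roughly as follows. This is not the pipeline's actual code: the helper name cached_encode, the SHA-256 file naming, and the pickle file format are assumptions, and fake_encode stands in for the T5 encoder so the sketch runs without a model.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached_encode(prompt, cache_dir, encode_fn):
    """Return the encoding for `prompt`, reusing a cached copy when present."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: one file per prompt, named after a hash of the prompt text.
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.pkl"

    if cache_file.exists():
        # Cache hit: skip the expensive encoder call entirely.
        with cache_file.open("rb") as f:
            return pickle.load(f)

    # Cache miss: encode once and store the result for later calls.
    encoding = encode_fn(prompt)
    with cache_file.open("wb") as f:
        pickle.dump(encoding, f)
    return encoding


calls = []


def fake_encode(prompt):
    """Stand-in for the text encoder; records how often it is invoked."""
    calls.append(prompt)
    return [float(len(prompt))]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_encode("a cat", tmp, fake_encode)
    second = cached_encode("a cat", tmp, fake_encode)

print(len(calls))  # the encoder ran only once for the repeated prompt
```

Keying the cache on the prompt text alone only works while the encoder and its settings stay the same. A cache shared across model versions would need those in the key as well.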
Other changes
- Debug logs were added for pipeline loading. They can be seen if logging is configured to show logging.DEBUG or higher levels. (1b75080)
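One way to surface these messages is to configure the root logger before the pipeline is loaded. The logger names used by the project are not listed here, so this sketch enables DEBUG globally rather than for a specific logger. The format string is only an example.

```python
import logging

# force=True (Python 3.8+) replaces any handlers that were installed earlier,
# so the DEBUG level takes effect even if logging was already configured.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
    force=True,
)

# Any logger without its own level now inherits DEBUG from the root logger.
logging.getLogger("example").debug("debug output is visible")
```

Enabling DEBUG on the root logger also shows debug output from other libraries. Setting the level on a narrower logger is quieter if the project's logger name is known.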
0.1.1+f1
Initial DigiField release.
Changes from upstream 0.1.1
- Applied VRAM optimization (f87a511):
  - streamv2v/inference_common.py load_generator_state_dict: replaced torch.load(..., map_location="cpu") with a version that tries mmap=True, weights_only=True first (falls back to the original if PyTorch is too old)
  - Same function: replaced the dict comprehension that added "model." prefixes with an in-place .pop() rename to avoid a second full copy of all tensors
  - Same function: added checkpoint.pop(key) + del checkpoint to drop optimizer state and other top-level keys before processing generator weights
  - streamv2v/inference.py load_model: added del state_dict; gc.collect() after load_state_dict completes, so the CPU weight copy is freed before the GPU move happens
  - models/wan/wan_wrapper.py WanTextEncoderWrapper.__init__: changed dtype=torch.float32 to dtype=torch.bfloat16 on the UMT5-XXL text encoder. This is what was causing an OOM on low VRAM cards, as float32 puts the encoder at ~19 GB
  - models/wan/causal_stream_inference.py prepare(): added self.text_encoder.to('cpu') + torch.cuda.empty_cache() immediately after the text encoder runs, so it gets offloaded back to CPU once the prompt embeddings are produced and doesn't sit in VRAM for the rest of the session
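The checkpoint-loading changes listed above can be illustrated with a PyTorch-free sketch. It is not the code from f87a511. Here load_fn stands in for torch.load, the top-level key "generator" is a hypothetical name, and detecting an old PyTorch by catching TypeError is an assumption.

```python
def load_checkpoint(load_fn, path):
    """Try the memory-friendly load first, then fall back to the plain call."""
    try:
        return load_fn(path, map_location="cpu", mmap=True, weights_only=True)
    except TypeError:
        # Assumption: an older loader rejects the unknown keyword arguments.
        return load_fn(path, map_location="cpu")


def add_prefix_in_place(state_dict, prefix="model."):
    """Rename every key in place instead of building a second dict.

    Assumes no existing key already equals prefix + another key.
    """
    for key in list(state_dict.keys()):  # snapshot, since the dict is mutated
        state_dict[prefix + key] = state_dict.pop(key)
    return state_dict


def load_generator_weights(load_fn, path, key="generator", prefix="model."):
    """Load a checkpoint and keep only the prefixed generator weights."""
    checkpoint = load_checkpoint(load_fn, path)
    weights = checkpoint.pop(key)
    # Drop the reference to optimizer state and other top-level entries.
    del checkpoint
    return add_prefix_in_place(weights, prefix)


def old_loader(path, map_location=None):
    """Fake loader that mimics an old PyTorch without mmap/weights_only."""
    return {"generator": {"a": 1, "b": 2}, "optimizer": {"step": 10}}


print(load_generator_weights(old_loader, "checkpoint.pt"))
```

The rename keeps the same value objects and only moves them to new keys, so at no point do two complete dicts exist side by side.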