Releases: DigiField/StreamDiffusionV2

0.1.1+f2

15 Jun 22:35

This release focuses on reducing the VRAM used by the T5 text encoder, a ~9.8 GB model that encodes prompts. With it, StreamDiffusionV2 can finally run on a GPU with 8 GB of VRAM, albeit at a low resolution and frame rate.

[Images: a blurry image of a cat on grass; an image of a cat facing the camera on grass]

Samples of StreamDiffusionV2 running on an RTX 4070 laptop

T5 offloading

The T5 text encoder is now automatically offloaded to the CPU if the GPU does not have enough memory. (343ae3f)
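
The decision behind this kind of offloading can be sketched as a simple fit check. This is an illustration only, not the code from 343ae3f: the function name `pick_encoder_device`, the 1 GiB headroom, and the example sizes are assumptions. With PyTorch installed, the free-memory figure could be read from `torch.cuda.mem_get_info()`.

```python
GIB = 1024 ** 3


def pick_encoder_device(free_vram_bytes: int, encoder_bytes: int,
                        headroom_bytes: int = 1 * GIB) -> str:
    """Return "cuda" if the encoder fits in free VRAM with headroom, else "cpu".

    Hypothetical helper: the real pipeline may use a different rule.
    """
    if free_vram_bytes >= encoder_bytes + headroom_bytes:
        return "cuda"
    return "cpu"


# With PyTorch available, free VRAM could be obtained like this (assumption):
#   free_vram_bytes, _total_bytes = torch.cuda.mem_get_info()
encoder_size = int(9.8 * GIB)  # approximate T5 size quoted in these notes

print(pick_encoder_device(8 * GIB, encoder_size))   # cpu
print(pick_encoder_device(24 * GIB, encoder_size))  # cuda
```

The headroom term reserves space for the rest of the pipeline; without it, an encoder that barely fits would still crowd out everything else.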

Prompt caching

A prompt's encoding usually doesn't change unless the prompt itself does, yet encoding can take a while, especially when T5 is offloaded to the CPU. Prompt caching avoids repeating that work.

Prompt caching saves encoded prompts into a new cache folder, which can be set using the prompt_cache_dir argument of StreamDiffusionV2Pipeline. When prepare() is called with a new prompt, it generates the encoding and saves it to this folder. On subsequent calls to prepare() with the same prompt, the encoding is loaded instead of being regenerated. This can significantly cut the time to the first generated frame when the prompts are known in advance.
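
The load-or-encode behaviour described above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: `cached_encode`, the SHA-256 file naming, and the pickle format are all assumptions, and `fake_encode` stands in for the real text encoder.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def cached_encode(prompt: str, cache_dir: Path,
                  encode: Callable[[str], Any]) -> Any:
    """Return the encoding for `prompt`, reusing a file in `cache_dir` if present."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache file on a hash of the prompt text (hypothetical scheme).
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.pkl"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    encoding = encode(prompt)
    with path.open("wb") as f:
        pickle.dump(encoding, f)
    return encoding


calls = []


def fake_encode(prompt: str) -> list:
    """Stand-in encoder that records how often it is invoked."""
    calls.append(prompt)
    return [float(len(prompt))]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "prompt_cache"
    first = cached_encode("a cat on grass", cache, fake_encode)
    second = cached_encode("a cat on grass", cache, fake_encode)

print(first == second, len(calls))  # True 1
```

The second call never reaches the encoder, which is where the saving comes from when the encoder is slow.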

Other changes

  • Debug logs were added for pipeline loading. They appear when the logging level is set to logging.DEBUG. (1b75080)
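
One way to surface these logs is to lower the root logger's level with Python's standard logging module before loading the pipeline. The logger name `example.pipeline` below is hypothetical and only demonstrates that DEBUG messages get through; the loggers the project actually uses are not named in these notes.

```python
import logging

# Show DEBUG and above on the root logger. force=True (Python 3.8+) replaces
# any handlers that were configured earlier, so this call always takes effect.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(levelname)s %(name)s: %(message)s",
    force=True,
)

logger = logging.getLogger("example.pipeline")  # hypothetical logger name
logger.debug("loading pipeline...")
```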

0.1.1+f1

15 Jun 17:58
809e1b6

Initial DigiField release.

Changes from upstream 0.1.1

  • Applied VRAM optimization (f87a511):
    • streamv2v/inference_common.py
      • load_generator_state_dict: replaced torch.load(..., map_location="cpu")
        with a version that tries mmap=True, weights_only=True first
        (falls back to the original if PyTorch is too old)
      • Same function: replaced the dict comprehension that added "model."
        prefixes with an in-place .pop() rename to avoid a second full copy of all
        tensors
      • Same function: added checkpoint.pop(key) + del checkpoint to drop
        optimizer state and other top-level keys before processing generator weights
    • streamv2v/inference.py
      • load_model: added del state_dict; gc.collect() after load_state_dict
        completes, so the CPU weight copy is freed before the GPU move happens
    • models/wan/wan_wrapper.py
      • WanTextEncoderWrapper.__init__: changed dtype=torch.float32 to
        dtype=torch.bfloat16 on the UMT5-XXL text encoder. This is what was
        causing an OOM on low-VRAM cards, as float32 puts the encoder at ~19GB
    • models/wan/causal_stream_inference.py
      • prepare(): added self.text_encoder.to('cpu') +
        torch.cuda.empty_cache() immediately after the text encoder runs, so it
        gets offloaded back to CPU once the prompt embeddings are produced and
        doesn't sit in VRAM for the rest of the session
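
The in-place rename mentioned for load_generator_state_dict can be illustrated like this. It is a sketch rather than the code from f87a511: `add_prefix_in_place` is a hypothetical name, byte arrays stand in for tensors, and it assumes no key already carries the prefix.

```python
def add_prefix_in_place(state_dict: dict, prefix: str = "model.") -> dict:
    """Rename every key to prefix + key without building a second dict.

    Assumes no existing key already starts with `prefix`.
    """
    # Snapshot the keys first, because the dict is mutated inside the loop.
    for key in list(state_dict):
        # pop() removes the old entry and hands back the same value object,
        # which is then stored under the new name.
        state_dict[prefix + key] = state_dict.pop(key)
    return state_dict


# Byte arrays stand in for weight tensors in this illustration.
weights = {"blocks.0.weight": bytearray(4), "blocks.0.bias": bytearray(2)}
original = weights["blocks.0.weight"]

add_prefix_in_place(weights)

print(sorted(weights))  # ['model.blocks.0.bias', 'model.blocks.0.weight']
print(weights["model.blocks.0.weight"] is original)  # True
```

Each value is moved to its new key as the same object, so only one mapping exists at any point.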