WebGPU-Video-Diffusion

Topics: WebGPU · Deep Learning · Computer Vision · ONNX

Overview

A zero-shot text-to-video generation pipeline that runs entirely in the browser using WebGPU. This project implements the Text-to-Video Zero algorithm, enabling users to generate temporally consistent 8-frame video animations from text prompts without any server-side computation or video training data.

Description

A novel implementation of the Text-to-Video Zero algorithm that brings AI video generation directly to the browser. By leveraging WebGPU and ONNX Runtime, this project runs GPU-accelerated inference entirely on consumer hardware, with no backend servers or cloud computing resources.
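
A rough sketch of the browser-side setup: load the UNet through onnxruntime-web's WebGPU execution provider and run one denoising call. The model path and the input/output names (sample, timestep, encoder_hidden_states, out_sample) are assumptions based on typical Stable Diffusion ONNX exports, not this repo's actual layout.

```typescript
import * as ort from 'onnxruntime-web';

// Create an inference session backed by the WebGPU execution provider.
// './models/unet.onnx' is a placeholder path.
async function loadUNet(): Promise<ort.InferenceSession> {
  return ort.InferenceSession.create('./models/unet.onnx', {
    executionProviders: ['webgpu'],
  });
}

// One denoising call: feed latents, timestep, and text embeddings.
// Input/output names are illustrative; the exported ONNX graph defines them.
async function predictNoise(
  session: ort.InferenceSession,
  latents: ort.Tensor,
  timestep: ort.Tensor,
  textEmbeddings: ort.Tensor,
): Promise<ort.Tensor> {
  const results = await session.run({
    sample: latents,
    timestep,
    encoder_hidden_states: textEmbeddings,
  });
  return results.out_sample as ort.Tensor;
}
```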

The pipeline extends Stable Diffusion 1.5 with cross-frame attention mechanisms to ensure temporal consistency across generated video frames. Users simply input a text prompt, and the system generates an 8-frame video sequence with coherent motion and visual continuity.
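
To make the mechanism concrete, here is a minimal, framework-free sketch of cross-frame attention: each frame's queries attend to the keys and values of the first (anchor) frame instead of its own. The [tokens, dim] shapes and the single-head simplification are assumptions made for readability.

```typescript
// Plain scaled dot-product attention over flattened spatial tokens.
function attention(q: number[][], k: number[][], v: number[][]): number[][] {
  const dim = q[0].length;
  const scale = 1 / Math.sqrt(dim);
  return q.map((qRow) => {
    // Scaled dot-product scores against every key token.
    const scores = k.map(
      (kRow) => scale * kRow.reduce((s, x, i) => s + x * qRow[i], 0),
    );
    // Numerically stable softmax.
    const max = Math.max(...scores);
    const exps = scores.map((s) => Math.exp(s - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    const weights = exps.map((e) => e / sum);
    // Weighted sum of value tokens.
    return v[0].map((_, j) =>
      weights.reduce((acc, w, t) => acc + w * v[t][j], 0),
    );
  });
}

// Cross-frame variant: queries come from frame i, keys/values from frame 0,
// which is what ties the frames together visually.
function crossFrameAttention(
  frames: { q: number[][]; k: number[][]; v: number[][] }[],
): number[][][] {
  const anchor = frames[0];
  return frames.map((f) => attention(f.q, anchor.k, anchor.v));
}
```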

This work demonstrates the potential of modern web APIs for computationally intensive AI tasks. Key features:

  • Pure browser execution - no server required, all inference runs on client GPU via WebGPU
  • Implements the Text-to-Video Zero algorithm with motion field warping and cross-frame attention (a warping sketch follows this list)
  • Generates 8-frame temporally consistent video sequences from text prompts
  • Zero-shot approach using pre-trained Stable Diffusion 1.5 - no video training data needed
  • GPU-accelerated inference via ONNX Runtime WebGPU backend
  • PNDM scheduler with classifier-free guidance (scale 7.5) for high-quality outputs; see the guidance sketch after this list
  • Configurable motion parameters for controlling video dynamics
  • Tech stack: WebGPU, ONNX Runtime, CLIP ViT-L/14, Stable Diffusion 1.5 UNet/VAE
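
The warping step referenced above can be pictured as a global translation of the anchor latent that grows linearly with the frame index, following the Text-to-Video Zero motion model. The sketch below is a simplified nearest-neighbor version; the [channels][height][width] layout and the lambda motion-strength parameter are illustrative assumptions, not this repo's exact interface.

```typescript
// Shift one latent feature map by integer offsets (dx, dy), clamping at
// the borders instead of interpolating.
function warpLatent(
  latent: number[][][],
  dx: number,
  dy: number,
): number[][][] {
  const h = latent[0].length;
  const w = latent[0][0].length;
  const clamp = (v: number, lo: number, hi: number) =>
    Math.min(hi, Math.max(lo, v));
  return latent.map((channel) =>
    Array.from({ length: h }, (_, y) =>
      Array.from({ length: w }, (_, x) => {
        // Sample the source pixel shifted by (-dx, -dy).
        const sy = clamp(y - dy, 0, h - 1);
        const sx = clamp(x - dx, 0, w - 1);
        return channel[sy][sx];
      }),
    ),
  );
}

// Build an 8-frame sequence from one anchor latent: frame k gets an offset
// proportional to k, so motion accumulates across the video.
function warpSequence(
  anchor: number[][][],
  numFrames: number,
  lambda: number,
): number[][][][] {
  return Array.from({ length: numFrames }, (_, k) =>
    warpLatent(anchor, Math.round(lambda * k), Math.round(lambda * k)),
  );
}
```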

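Classifier-free guidance itself is a one-line combination of two noise predictions per denoising step. A minimal sketch, assuming flat Float32Array predictions and using the guidance scale of 7.5 quoted above:

```typescript
// eps = eps_uncond + scale * (eps_cond - eps_uncond)
function applyGuidance(
  noiseUncond: Float32Array,
  noiseCond: Float32Array,
  guidanceScale = 7.5,
): Float32Array {
  const out = new Float32Array(noiseUncond.length);
  for (let i = 0; i < out.length; i++) {
    out[i] = noiseUncond[i] + guidanceScale * (noiseCond[i] - noiseUncond[i]);
  }
  return out;
}
```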