
Overview
Novel View Synthesis: a reproduction and implementation of the Stable Virtual Camera (SEVA) training pipeline for multi-view diffusion-based novel view synthesis. Since SEVA did not release its training code, this project provides a complete training framework and a successfully trained model for generating photorealistic novel views with strong 3D consistency.
Description
This project implements a training pipeline for Stable Virtual Camera, a diffusion-based novel view synthesis model. While SEVA demonstrated impressive results for generating novel views from arbitrary input images, its training pipeline was not publicly released.
I reproduced the SEVA training framework from scratch, implementing key components including CLIP ViT-H-14 image conditioning, SNR-shift noise scheduling for multi-frame generation, and a normalized camera coordinate system. The pipeline supports flexible multi-view conditioning and synthesis across large viewpoint changes.
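As an illustration of the scheduling component, here is a minimal sketch of one way to derive an SNR-shifted schedule from a base DDPM beta schedule; the function name and the shift factor are assumptions for illustration, not the exact values used in training:

```python
import torch

def shift_snr_schedule(alphas_cumprod: torch.Tensor, shift: float = 3.0) -> torch.Tensor:
    """Shift a DDPM noise schedule toward lower SNR.

    SNR(t) = abar_t / (1 - abar_t). Redundancy across frames in
    multi-frame generation effectively raises the SNR the model sees,
    so dividing the per-step SNR by `shift` and mapping back to a
    cumulative-alpha schedule keeps the denoising task suitably hard.
    """
    snr = alphas_cumprod / (1.0 - alphas_cumprod)
    shifted_snr = snr / shift  # shift factor is an assumed example value
    return shifted_snr / (1.0 + shifted_snr)

# Usage: shift a standard linear-beta schedule.
betas = torch.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
shifted_alphas_cumprod = shift_snr_schedule(alphas_cumprod)
```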
The trained model was integrated with DUSt3R for multi-view camera pose estimation and Depth-Pro for monocular depth prediction, enabling end-to-end novel view synthesis from arbitrary input images with automatic camera trajectory generation.
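For the single-view branch, depth comes from Depth-Pro; the snippet below mirrors the published inference example from apple/ml-depth-pro (exact signatures may drift between versions, and the image path is a placeholder):

```python
import depth_pro

# Load the Depth-Pro model and its preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an RGB image; f_px is the EXIF-derived focal length, when available.
image, _, f_px = depth_pro.load_rgb("example.jpg")  # placeholder path
image = transform(image)

# Predict metric depth (meters) and the focal length in pixels.
prediction = model.infer(image, f_px=f_px)
depth = prediction["depth"]
focallength_px = prediction["focallength_px"]
```

The metric depth anchors the scene scale so a camera trajectory can be laid out around a single input view; with multiple inputs, DUSt3R supplies the relative camera poses instead.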
- Complete reproduction of the SEVA training pipeline
- CLIP ViT-H-14 image encoder for visual conditioning from reference views
- SNR-shift beta schedule optimized for multi-frame diffusion generation
- Integrated DUSt3R for multi-view camera pose estimation during inference
- Integrated Depth-Pro for monocular depth prediction in single-view scenarios
- Supports multiple camera trajectory modes: free, bi-directional, swing, and custom paths (a swing path is sketched after this list)
- Distributed training with DeepSpeed ZeRO-2 and EMA model averaging (a minimal EMA update is also sketched below)
- Trained on ACID, DL3DV, Co3D, RealEstate10K, and other multi-view datasets
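To make the trajectory modes concrete, here is a minimal sketch of a swing path: the camera sweeps back and forth along an arc about the scene center while staying aimed at the origin. The look-at convention (+z as the viewing direction, camera up aligned with world up) and all parameter values are illustrative assumptions:

```python
import numpy as np

def look_at(eye: np.ndarray, target: np.ndarray,
            up: np.ndarray = np.array([0.0, 1.0, 0.0])) -> np.ndarray:
    """4x4 camera-to-world pose looking from `eye` toward `target`
    (camera +z is the viewing direction; an assumed convention)."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(up, forward)
    right = right / np.linalg.norm(right)
    cam_up = np.cross(forward, right)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, cam_up, forward, eye
    return c2w

def swing_trajectory(radius: float = 1.0, n: int = 21,
                     max_angle: float = np.pi / 6) -> np.ndarray:
    """Swing mode sketch: oscillate the azimuth angle sinusoidally,
    returning (n, 4, 4) camera-to-world poses."""
    azimuths = max_angle * np.sin(np.linspace(0.0, 2.0 * np.pi, n))
    return np.stack([
        look_at(np.array([radius * np.sin(a), 0.0, radius * np.cos(a)]),
                np.zeros(3))
        for a in azimuths
    ])
```

And a minimal sketch of the EMA weight averaging applied after each optimizer step (the decay value is an assumption):

```python
import torch

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module,
               decay: float = 0.9999) -> None:
    """Classic EMA of weights: ema <- decay * ema + (1 - decay) * online."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)
```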