
Overview
Novel View Synthesis: a reproduction and implementation of the Stable Virtual Camera (SEVA) training pipeline for multi-view diffusion-based novel view synthesis. Since SEVA did not release its training code, this project provides a complete training framework and a successfully trained model for generating photorealistic novel views with strong 3D consistency.
Description
This project implements a training pipeline for Stable Virtual Camera, a diffusion-based novel view synthesis model. While SEVA demonstrated impressive results for generating novel views from arbitrary input images, its training pipeline was not publicly released.
I reproduced the SEVA training framework from scratch, implementing key components including CLIP ViT-H-14 image conditioning, SNR-shift noise scheduling for multi-frame generation, and a normalized camera coordinate system. The pipeline supports flexible multi-view conditioning and synthesis across large viewpoint changes.
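As an illustration of the scheduling component, here is a minimal sketch of one way to derive an SNR-shifted schedule from a base DDPM beta schedule; the function name and the shift factor are assumptions for illustration, not the exact values used in training:

```python
import torch

def shift_snr_schedule(alphas_cumprod: torch.Tensor, shift: float = 3.0) -> torch.Tensor:
    """Shift a DDPM noise schedule toward lower SNR.

    SNR(t) = abar_t / (1 - abar_t). Redundancy across frames in
    multi-frame generation effectively raises the SNR the model sees,
    so dividing the per-step SNR by `shift` and mapping back to a
    cumulative-alpha schedule keeps the denoising task suitably hard.
    """
    snr = alphas_cumprod / (1.0 - alphas_cumprod)
    shifted_snr = snr / shift  # shift factor is an assumed example value
    return shifted_snr / (1.0 + shifted_snr)

# Usage: shift a standard linear-beta schedule.
betas = torch.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
shifted_alphas_cumprod = shift_snr_schedule(alphas_cumprod)
```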
The trained model was integrated with DUSt3R for multi-view camera pose estimation and Depth-Pro for monocular depth prediction, enabling end-to-end novel view synthesis from arbitrary input images with automatic camera trajectory generation.
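For the single-view branch, depth comes from Depth-Pro; the snippet below mirrors the published inference example from apple/ml-depth-pro (exact signatures may drift between versions, and the image path is a placeholder):

```python
import depth_pro

# Load the Depth-Pro model and its preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an RGB image; f_px is the EXIF-derived focal length, when available.
image, _, f_px = depth_pro.load_rgb("example.jpg")  # placeholder path
image = transform(image)

# Predict metric depth (meters) and the focal length in pixels.
prediction = model.infer(image, f_px=f_px)
depth = prediction["depth"]
focallength_px = prediction["focallength_px"]
```

The metric depth anchors the scene scale so a camera trajectory can be laid out around a single input view; with multiple inputs, DUSt3R supplies the relative camera poses instead.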
- Complete reproduction of the SEVA training pipeline
- CLIP ViT-H-14 image encoder for visual conditioning from reference views
- SNR-shift beta schedule optimized for multi-frame diffusion generation
- Integrated DUSt3R for multi-view camera pose estimation during inference
- Integrated Depth-Pro for monocular depth prediction in single-view scenarios
- Supports multiple camera trajectory modes: free, bi-directional, swing, and custom paths (a swing path is sketched after this list)
- Distributed training with DeepSpeed ZeRO-2 and EMA model averaging (a minimal EMA update is also sketched below)
- Trained on ACID, DL3DV, Co3D, RealEstate10K, and other multi-view datasets
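To make the trajectory modes concrete, here is a minimal sketch of a swing path: the camera sweeps back and forth along an arc about the scene center while staying aimed at the origin. The look-at convention (+z as the viewing direction, camera up aligned with world up) and all parameter values are illustrative assumptions:

```python
import numpy as np

def look_at(eye: np.ndarray, target: np.ndarray,
            up: np.ndarray = np.array([0.0, 1.0, 0.0])) -> np.ndarray:
    """4x4 camera-to-world pose looking from `eye` toward `target`
    (camera +z is the viewing direction; an assumed convention)."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(up, forward)
    right = right / np.linalg.norm(right)
    cam_up = np.cross(forward, right)
    c2w = np.eye(4)
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, cam_up, forward, eye
    return c2w

def swing_trajectory(radius: float = 1.0, n: int = 21,
                     max_angle: float = np.pi / 6) -> np.ndarray:
    """Swing mode sketch: oscillate the azimuth angle sinusoidally,
    returning (n, 4, 4) camera-to-world poses."""
    azimuths = max_angle * np.sin(np.linspace(0.0, 2.0 * np.pi, n))
    return np.stack([
        look_at(np.array([radius * np.sin(a), 0.0, radius * np.cos(a)]),
                np.zeros(3))
        for a in azimuths
    ])
```

And a minimal sketch of the EMA weight averaging applied after each optimizer step (the decay value is an assumption):

```python
import torch

@torch.no_grad()
def ema_update(ema_model: torch.nn.Module, model: torch.nn.Module,
               decay: float = 0.9999) -> None:
    """Classic EMA of weights: ema <- decay * ema + (1 - decay) * online."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)
```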