StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

Yupeng Zhou¹* Daquan Zhou²† Mingming Cheng¹ Jiashi Feng² Qibin Hou¹†

¹VCIP, CS, Nankai University ²ByteDance Inc

* Intern at ByteDance Inc † Corresponding authors

Comics Generation

StoryDiffusion creates comics using our Consistent Self-Attention, which maintains character consistency across panels for cohesive storytelling.

Video Generation Results

StoryDiffusion generates high-quality videos with our Image Semantic Motion Predictor, conditioned on either our generated consistent images or user-input images.

Video Gallery

Using images generated by our consistent self-attention

Using condition images from SORA

Using user-input condition images

Cartoon Character Generation

StoryDiffusion can also create consistent images of cartoon characters.

Multiple Characters Generation

StoryDiffusion can also maintain the identities of multiple characters at the same time, generating consistent images of each.

More Comic Generation Examples

StoryDiffusion can create impressive comics. We will add more comics here.

"Girl and Squirrel"

Methods

The structure of the Consistent Self-Attention.

The structure of the Motion Predictor.