StoryDiffusion: Long-range image and video generation (storydiffusion.github.io)
177 points by doodlesdev 10 hours ago | 42 comments





This is unbelievably good. It seems even better than Sora in terms of natural look and motion in the videos.

The video of the two girls talking seems so natural. There are some artifacts, but the movement is natural, and the clothes and other things around them aren't continuously changing.

I hope it becomes open source, though I suspect it won't, since it's coming from ByteDance.


I don't know if that's true; there's a massive flicker in the guy's hair (the one with the mostly black background and black shirt). Halfway through, it completely loses tracking on his hair and it snap-changes.

I looked very closely at the videos for a while and managed to find some minor continuity errors (like different numbers of buttons on people's button-down shirts at different times, or different sizes or styles of earrings, or arguably different interpretations of which finger is which in an intermittently-obscured hand). I also think that the cycling woman's shorts appear to cover more of her left leg than her right leg, although that's not physically impossible, and the bear seemingly has a differently-sized canine tooth at different times.

But I guess it took me multiple minutes to find these problems, watching each video clip many times, rather than having any of them jump out at me. So it doesn't literally achieve fully consistent object persistence, but at a casual viewing it was very persuasive.

Maybe people who shoot or edit video frequently would notice some of these problems more quickly, because they're more attuned to looking for continuity problems?


> But I guess it took me multiple minutes to find these problems

I’m no video editor, but I noticed straight away that the characters’ eyes and hair tend to change, sometimes dramatically, as they turn their heads. Also, the head movement tends to be jerky or abrupt, especially in the middle of the turn.


Eyes and teeth seem like they still need further work. Still, looks like things are improving. :)

I mean, at the end of the day, neither is standard video editing perfect. How many times have we all found inconsistencies in TV shows, or random water bottles showing up and disappearing in scenes... I imagine diffusion video creation will eventually be similar: funny anecdotes about what we saw that one time in LOTR 10.

Did you miss the fish?[0] You should see the error on first viewing.

What about the woman with glasses? Her face literally "jumps"[1]. Same with this guy's hands[2].

Interestingly, [1] has "sora" in the filename, though I think that's a reference to the main image on Sora's page[3].

Not sure if the gallery is weird for anyone else, but it doesn't always advance to new images, and the position indicator is wonky.

The thing that makes me most suspicious is the numbering on these demos: 1, 2, 4 (terrifying to me), 5, 65, 66, 68, 72, 73, 83, 85, 86 (is this Simone Giertz? Vic Michaelis?). The tough part about evaluating generative models is the cherry-picking for demonstrations. You have to do it or people tear your work apart, but in doing so you give a false impression of what your work can actually do.

IMO it has gotten out of hand and is not benefiting anyone. It makes these papers more akin to advertising than to communication of research. We talk about the integrity of the research community and why we argue over borderline works, but come on: if you can get a better review by generating more samples, you can get better reviews by paying more, not by doing better work. A pay-to-play system is far worse for the integrity of ML (or any science) than arguing over borderline works.

Edit: I think it is also a bit problematic that this is posted BEFORE the arXiv link or GitHub repo is live. I'd appeal to the HN community not to upvote these kinds of works until at least the paper is live.

[0] https://storydiffusion.github.io/MagicStory_files/longvideo/...

[1] https://storydiffusion.github.io/MagicStory_files/longvideo/...

[2] https://storydiffusion.github.io/MagicStory_files/longvideo/...

[3] https://openai.com/sora


I'm immediately noticing significant issues with mouths (specifically, when they are open).

It's also telling that most of the shots do their best to hide hands - whenever they are visible, they are obviously broken.


Normally I don't mind spelling errors - and there are plenty in the examples - but my question is, did the system really produce "lunch" when the prompt was "they have launch at restraunt" (verbatim from the sample)? I would imagine it got restaurant right, but I would have expected it to produce something like a rocket launch image instead of figuring out the author meant lunch.

Transformers/attention are very robust against typos, since they take the entire context into account, just like we do. Launch any free LLM and ask it questions with typos that you would notice and auto-correct, and you'll see that the models just don't care and understand them. They're actually so resilient that they understand very garbled text without breaking a sweat.
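To make that concrete, here's a tiny sketch of that experiment using the HuggingFace transformers library. The model name is just a placeholder choice of mine; any small instruction-tuned model you can run locally should behave similarly.

    # Sketch: feed a deliberately garbled question to a small local
    # chat model and see that it answers as if the typos weren't there.
    # Model choice is an arbitrary example, not an endorsement.
    from transformers import pipeline

    chat = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

    # A human auto-corrects this on sight; so does the model.
    messages = [{"role": "user", "content": "wat is teh captial of Fran ce?"}]
    out = chat(messages, max_new_tokens=40)

    # The pipeline returns the whole conversation; the last message
    # is the assistant's reply.
    print(out[0]["generated_text"][-1]["content"])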

I often use ChatGPT when learning Spanish; I find it's great for explaining distinctions between words with similar meanings, where a dictionary isn't always a lot of help.

I am constantly surprised by how well it copes with my typos, grammatical errors and generally poor spelling.


There's honestly something uncanny about how well they do.

In the "early days" of GPT-4 I tried testing it as a way to get around poor transcription for an in-car voice assistant. It managed: "I'm how dew yew say... Freud?" => "Turn up the temperature", which was nonsense most people would stare at for a long time before making any sense of.


And if the model is supposed to be so attentive to context, why did it show a desert instead of "dessert"? After all, they just ate "launch".

The model can only attend to context that is part of the input. Most likely they created the image grid by independently feeding the model each prompt together with the reference image. (And the point is to show off that the model output remains consistent despite this independent generation process.)
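For illustration, here's roughly what that independent-generation loop could look like, sketched with the stock diffusers img2img pipeline. This is purely a stand-in: StoryDiffusion's actual code is unreleased, and the reference image path, prompts, and strength value here are made up.

    # Hypothetical sketch: generate each panel independently, with the
    # same reference image as the only link between them. Any
    # cross-panel consistency has to come from the model itself.
    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    from PIL import Image

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    reference = Image.open("reference_character.png").convert("RGB")
    prompts = [
        "The man has breakfast",
        "They have lunch at a restaurant",
        "They play in the amusement park",
    ]

    # One separate, independent call per prompt.
    panels = [
        pipe(prompt=p, image=reference, strength=0.75).images[0]
        for p in prompts
    ]
    for i, img in enumerate(panels):
        img.save(f"panel_{i}.png")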

"He felt very frightened and run", "There is a huge amount of treasure in the house!"

I suspect some of the grammar and spelling issues come from the authors themselves. For example, "A Asian Man": "a" instead of "an" is a common mistake for speakers of many Asian languages, since their languages don't have similar forms. So given the consistent article errors, I expect this comes from the authors. Not sure about the "M" capitalization. Similar things with "The man have breakfast", "They have launch at restaurant", "They play in (the) amusement part."

Considering the comics have similar types of errors (the squirrel one is clearer), I'd chalk it up to a language barrier instead of the process. Though LeCun is not wearing gloves on the moon, and well...


Curious what it would produce with: "they have launch a rockestaurant".



Seems so. I was about to report it, too.

Is there a video of Will Smith eating spaghetti with this model?

The rate of progress of generative AI is honestly quite scary.

Really? Feels like nothing much is happening lately.

Progress comes in spurts. Due to the negative reactions to AI by some (artists), the system wants it to appear that nothing is happening so that the next wave of AI can be created in relative peace, at which time it will be too late to stop it.

We have been conditioned to only react to hype and "news", rather than analyze reality and see the danger.


Which “system”?

What are you talking about? GPT-3 came out less than four years ago, and Stable Diffusion's first version around then too. In less than four years we went from nothing to making janky but believable video clips. That's not fast enough for you?

It'll be good if the girl and the giant squirrel are ever seen in the same park at the same time.

The Moon in the sky seen from the surface of the Moon is wrong? Poetic? Funny? Recursive? A demonstration that these models don't understand anything? Add to the list.

Love how under "Multiple Characters Generation" the white guy is "A Man," whereas someone else is "An Asian Man." Reminds me of Daryl Gates and the "normal people" quote, hence patrol cars being called "black and normals."

A probabilistic regression model's behavior will just reflect its training data. Don't hate the player, hate the game.

No hate for any part of this: it's just amusing.

It's always disappointing when people publish things to GitHub without the intention of collaborating or sharing.

The GitHub link is broken, and I honestly find it frustrating that the only link to code is the page theme's source and credits. Is it really that important to give the static page theme that much real estate instead of an actual code release for the project?

Sorry, I can't access the repo, and the PDF link doesn't have an href attribute. Is that by design?

Time for Microsoft Chat 2.0 it seems.

There is a video of two girls. One girl seems to be sticking out her tongue and then blowing a kiss, but the tongue appears again mid-kiss. Very arousing stuff, I'll say. Keep up the good work, Microsoft or Google or whoever made it.

Worse: ByteDance.

So is Amazon flooded with hyper niche e-books yet?

I went to buy an air fryer. There were several recipe books available for that specific air fryer model, but they were all garbage auto-generated stuff.

I complained to Amazon, and they said that since I hadn't purchased the book, they couldn't do anything. So I bought the book, complained, and returned it. The chapters devoted to the details of the specific air fryer model were either very general (almost quoting the product description on Amazon) or just plain wrong.

What I thought I would get was something like the Magic Lantern books about specific camera models. Instead, it was auto-generated pages of nonsense.


Your real-life example is a good case against using AI-generated legal or medical advice.

I’m working on a platform for reading hyper niche e-books: https://tryspellbound.com

I don’t think this form of generative AI needs to become a source of spam; carefully designed platforms can let people enjoy their niche content without making them feel isolated.


Too late, it has become a source of spam.

Not really useful to give up the fight in the infancy of something with as much surface area as generative AI.

"Is being used to create spam" is not the same as "needs to be spam," and we mostly just need platforms that leverage generative AI natively to bridge the gap.



