(This is part 3 of a series on an AI generated TV show. Click here to check out part 2.)
We’ve covered a lot of ground – from high level concept to a finished script. But a script isn’t very compelling – we need to film it! The video production process mimics Hollywood productions, because it operates under similar constraints:
The AI generates the dialog and high-level instructions for music, camera angles, and so on. This is handed to a “production crew” made up of a visualizer program (written by me) and a human (me), who creates additional sets, characters, sounds, etc. as needed. The visualizer “films” the episode and produces a final video.
Fundamentally, the production system is responsible for generating the 150 or so billion pixels that make up a 15-minute video. It makes the moment-to-moment decisions about lighting, camera angles, set design, costumes, voice “casting”, sound, music, and special effects to execute the director’s vision, and it combines all of this into a coherent whole.
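To make that concrete, here’s roughly the kind of structured hand-off involved – a simplified sketch, with field names and values invented purely for illustration, not the exact format:

```python
# A simplified sketch of a scene hand-off; not the exact format.
# Field names and values here are purely illustrative.
scene = {
    "set": "bridge",
    "music": "tense, low strings",
    "lines": [
        {"actor": "Captain", "dialog": "Report.",
         "camera": "close-up on Captain"},
        {"actor": "Ensign", "dialog": "Something just dropped out of warp, sir.",
         "camera": "over-the-shoulder, favoring the Ensign"},
    ],
}
```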
Let’s go through the key pieces of the production system.
The Backlot
Below are a few major sets for the show, all in a neat row. It’s easy to imagine them made out of plywood on a soundstage, isn’t it? During filming, they feel like real places. But if you step back, you see just how little “reality” there is:
Of course, I am standing on the shoulders of giants. Check out the filming set for the Millennium Falcon cockpit below. Only a tiny portion of the ship was built for filming. If it was good enough for Lucas, it’s good enough for me!
In some versions of On Screen, I set up a large open world for filming. But getting good results became more difficult, not less. The AI was confused by the wide range of choices in these large areas, and it created nonsensical, confusing scenes. A big world also carries far more information, so it required additional systems just to keep the LLM from being overwhelmed.
Smaller sets, focused on specific story purposes, are much easier for the AI to use effectively.
Actors are defined by their names, costumes, and voices. I made a tool (below) to modify names, base geometry, and accessories, and to change voice settings. The tool made building out the cast go quickly. And a single “Actor” that can configure itself to “wear” any costume greatly simplifies the implementation.
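Boiled down, an actor definition is just a small bundle of data. Here’s a rough sketch of the idea – the field names and values are simplified stand-ins, not the tool’s actual data model:

```python
from dataclasses import dataclass, field

# Simplified stand-in for an actor definition; names and fields are illustrative.
@dataclass
class ActorConfig:
    name: str
    base_geometry: str                                    # which body mesh to start from
    accessories: list[str] = field(default_factory=list)  # costume pieces to "wear"
    voice: dict = field(default_factory=dict)             # text-to-speech settings

captain = ActorConfig(
    name="Captain",
    base_geometry="humanoid_male",
    accessories=["uniform_jacket", "rank_pins"],
    voice={"pitch": 0.9, "rate": 1.0},
)
```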
Actors use a mix of canned and procedural animations. An animation state machine sequences individual animations for smooth motion when an actor goes from sitting to standing, typing to not typing, and so on.
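The sequencing logic is conceptually simple: if there’s a dedicated transition clip between two poses, play it first. A toy version, with made-up clip names:

```python
# Toy animation state machine; clip names are made up for illustration.
TRANSITIONS = {
    ("sitting", "standing"): "sit_to_stand",
    ("standing", "sitting"): "stand_to_sit",
    ("idle", "typing"): "start_typing",
    ("typing", "idle"): "stop_typing",
}

def clips_to_play(current: str, target: str) -> list[str]:
    """Return the clips to play, inserting a transition clip when one exists."""
    if current == target:
        return []
    transition = TRANSITIONS.get((current, target))
    return [transition, target] if transition else [target]

# clips_to_play("sitting", "standing") -> ["sit_to_stand", "standing"]
```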
A gaze system tracks interesting things in the vicinity of each actor, and generates the head motion to look at them. When actors are speaking, a wagging motion is applied to their heads. The wagging is stupid simple – the louder their audio is at a specific moment, the more their head moves. But it’s effective!
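In code, the whole trick amounts to one line – map the current audio amplitude to a head angle. The scale factor below is a made-up example, not the actual tuning:

```python
# Head wag: the louder the audio right now, the more the head moves.
# max_degrees is a made-up example value, not the show's actual tuning.
def head_wag_degrees(amplitude: float, max_degrees: float = 8.0) -> float:
    """Map the current audio amplitude (0.0 to 1.0) to a head nod angle."""
    return max_degrees * min(max(amplitude, 0.0), 1.0)
```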
Early versions of the gaze system didn’t know whether a point of interest was in the same set as the actor. An actor speaking in a faraway set caused other actors to look in seemingly random directions as they turned their heads towards it. Once the system understood which region went with which set (the green box below), it became possible to ignore distant actors and get sensible behavior.
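The fix boils down to a bounds test: only points of interest inside the actor’s own set are eligible gaze targets. A rough sketch, using plain tuples in place of the engine’s actual vector and bounds types:

```python
# Filter gaze targets to those inside the set's axis-aligned bounds (the green box).
# Plain tuples stand in for the engine's real vector/bounds types.
def gaze_targets_in_set(points, set_bounds):
    """Keep only points of interest inside this set's bounding box."""
    min_corner, max_corner = set_bounds
    def inside(p):
        return all(lo <= c <= hi for c, lo, hi in zip(p, min_corner, max_corner))
    return [p for p in points if inside(p)]

# gaze_targets_in_set([(1, 1, 1), (40, 0, 2)], ((-5, 0, -5), (5, 3, 5)))
# -> [(1, 1, 1)]   (the distant speaker at x=40 is ignored)
```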
All of these behaviors, as well as pathfinding, are coordinated by an overall behavioral state machine.
Sets are built by kitbashing (i.e., smashing together lots of small pieces from other models). The current sets came together pretty quickly, in a few hours of focused work.
Other ingredients are necessary, though (a rough sketch of how they fit together follows the list):
- Markers for where actors can sit/stand. These appear as yellow balls with names above them.
- The holoscreen. Although invisible most of the time, one screen per set is always present. These greatly increase dramatic possibilities without adding a lot of complexity for the AI.
- The bounds of the set. This is the green wireframe box. Camera selection, holoscreen operation, and gaze sometimes need to be limited to a single set.
- Markers for camera placement.
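Put together, a set definition doesn’t amount to much data. Here’s a rough sketch of the idea – field names and positions are invented for illustration, not the actual format:

```python
from dataclasses import dataclass, field

# Illustrative bundle of the ingredients listed above; not the actual format.
@dataclass
class SetDefinition:
    name: str
    bounds: tuple                                        # ((min_x, min_y, min_z), (max_x, max_y, max_z))
    actor_markers: dict = field(default_factory=dict)    # marker name -> position
    camera_markers: dict = field(default_factory=dict)   # marker name -> position
    holoscreen_position: tuple = (0.0, 0.0, 0.0)

bridge = SetDefinition(
    name="bridge",
    bounds=((-5, 0, -5), (5, 3, 5)),
    actor_markers={"captains_chair": (0, 0, 1), "helm": (0, 0, 3)},
    camera_markers={"wide": (0, 2, -4), "over_helm": (1, 1.5, 2)},
    holoscreen_position=(0, 1.5, 4.5),
)
```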
So far, the AI only uses some of these features. Making all of them work reliably is a challenge, both in terms of raw functionality and in teaching the AI to use them. The biggest of these work-in-progress features is letting actors walk around the set. Unfortunately, the AI loves moving actors around, and often has them sprint from place to place, or even leave the set, between lines!
Lighting makes a huge difference in the quality, depth, and legibility of a scene. Above left is a scene with a half dozen small, hand-tuned lights, while above right is the same scene with default lighting.
The lighting is very simple: the background is darker, with cool, blue colors, and the foreground is brighter, with warm, orange colors. Lights are placed to illuminate the character from the side, so that his face has clear, defined features. Notice that the crease in his mask is plainly visible in the left-hand version, giving a sense of depth. In the right-hand, unlit version, there is no definition and his face looks flat.
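The recipe translates directly into a handful of light definitions – dim cool lights behind, brighter warm lights in front, and a key light off to the side. The colors, intensities, and positions below are made-up example values:

```python
# Dim/cool background, bright/warm foreground, side key light for facial definition.
# All values are made-up examples, not the scene's actual settings.
lights = [
    {"name": "background_fill", "color": (0.4, 0.5, 0.9),  # cool blue
     "intensity": 0.3, "position": (0, 3, -6)},
    {"name": "foreground_fill", "color": (1.0, 0.7, 0.4),  # warm orange
     "intensity": 0.8, "position": (0, 2, 3)},
    {"name": "key_side", "color": (1.0, 0.8, 0.5),
     "intensity": 1.2, "position": (3, 2, 1)},             # off to the side
]
```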
Exterior shots are done just like they did them in the old days of film: a miniature version of the ship is used, as shown below. There are actually many different invisible ships in the same location, and we only show one at a time.
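The trick is nothing more than toggling visibility. A minimal sketch, assuming each miniature exposes a simple visible flag (the class and names here are just for illustration):

```python
# Keep all the miniatures parked in one spot; only the ship being "filmed" is visible.
# Assumes each model object exposes a simple .visible flag; names are illustrative.
class MiniatureRack:
    def __init__(self, ships):
        self.ships = ships                      # dict: ship name -> model object

    def show_only(self, name):
        """Hide every miniature except the one currently on camera."""
        for ship_name, model in self.ships.items():
            model.visible = (ship_name == name)
```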