Building Dialogue from Operations

How dialogue audio is built from small, reusable operations you can chain into recipes — synthesise, record, split, voice-change, and edit timing.

Status: this page describes the direction for the dialogue and performance pipeline. Parts marked (not yet implemented) are the planned design, written down ahead of the build so the shape is agreed first. The one-press flow in Audio & Voice Pipeline is what exists today.

Instead of one all-in-one "make the dialogue" button, dialogue audio is built from a handful of small operations — each does one thing — that you chain together into a recipe. You can run the whole recipe with one press, or open any step to see and adjust it. The platform can build a sensible recipe for you from the screenplay; you only step in when you want to.

This is the same building-block idea used everywhere else in the media gallery, applied to voices.

Media carries its character

Every piece of dialogue audio is labelled with the character it belongs to. You see the character's name on the media tile, so you can tell whose line a clip is without playing it. That label travels with the audio as it moves through a recipe — when an operation produces a separate file for each character, each file keeps its character's name. (not yet implemented)

This is what lets the platform wire steps together for you: a later step can simply ask for "the audio for VENN from the previous step," and the right file connects automatically.
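As a rough illustration of how a label can travel with a file and let a later step find its input, here is a minimal sketch. The `MediaItem` record, the `apply_step` and `audio_for` helpers, the file names, and the second character ORA are all hypothetical, invented for this example; they are not the platform's actual data model.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class MediaItem:
    """Hypothetical media record: a file path plus the character label it carries."""
    path: str
    character: str


def apply_step(item: MediaItem, new_path: str) -> MediaItem:
    # An operation writes a new file but copies the character label across.
    return replace(item, path=new_path)


def audio_for(character: str, outputs: List[MediaItem]) -> MediaItem:
    # Resolve "the audio for X from the previous step" purely by label.
    matches = [m for m in outputs if m.character == character]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one output for {character}, found {len(matches)}"
        )
    return matches[0]


# Outputs of an earlier step that produced one file per character.
previous_step = [
    MediaItem("shot3_venn.wav", "VENN"),
    MediaItem("shot3_ora.wav", "ORA"),
]

venn = audio_for("VENN", previous_step)
converted = apply_step(venn, "shot3_venn_converted.wav")
```

Because the lookup is by label rather than by position in the output list, reordering the previous step's outputs does not change which file connects.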

The operations

Line to audio (text-to-speech)

Turns one written line — VENN (angry): What do you mean! — into an audio file in that character's voice. The emotion is optional. You pick which of the character's configured text-to-speech voices to use; it defaults to the character's preferred voice.

Generated audio is cached for the whole scene, keyed by the line itself rather than by where the line sits in the screenplay. If you rewrite the screenplay and a line moves to a different shot, or shots get reordered, inserted, or deleted, any line that didn't change is reused from the cache instead of being synthesised again. That keeps voice-generation cost down as your edit evolves. (partially exists today; scene-level reuse is being formalised)
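One way to picture a cache keyed by the line itself is sketched below. Which fields go into the key (character, text, emotion, voice) is an assumption made for this example, as are the `line_cache_key` and `SceneCache` names and the `venn-default` voice id. The point is only that the shot number is not part of the key.

```python
import hashlib
import json


def line_cache_key(character, text, voice, emotion=None):
    # Assumed key fields: everything that changes the sound of the line,
    # and nothing about which shot the line currently sits in.
    payload = json.dumps(
        {"character": character, "emotion": emotion, "text": text, "voice": voice},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class SceneCache:
    """Hypothetical scene-level cache wrapping a synthesis function."""

    def __init__(self, synthesise):
        self._synthesise = synthesise
        self._store = {}
        self.calls = 0  # how many times synthesis actually ran

    def get(self, character, text, voice, emotion=None):
        key = line_cache_key(character, text, voice, emotion)
        if key not in self._store:
            self.calls += 1
            self._store[key] = self._synthesise(character, text, voice, emotion)
        return self._store[key]


def fake_tts(character, text, voice, emotion):
    # Stand-in for a real text-to-speech call.
    return f"{character}:{text}".encode("utf-8")


cache = SceneCache(fake_tts)
# The line is first used in one shot...
cache.get("VENN", "What do you mean!", voice="venn-default", emotion="angry")
# ...then the screenplay is edited and the same line lands in another shot.
cache.get("VENN", "What do you mean!", voice="venn-default", emotion="angry")
```

After both calls `cache.calls` is 1: the second request is served from the cache. Changing the text, emotion, or voice produces a different key and a fresh synthesis.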

Assemble a shot's timing

A shot often has a single line, but some shots have a back-and-forth between characters. The assemble operation takes the lines for a shot and lays them out with the right gaps — pause, line, pause, line, pause — and produces one audio file per character, with each character silent while the others speak. One file per character is exactly what a video model needs to lip-sync.
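The pause, line, pause layout can be sketched as simple arithmetic over durations. The `TimedLine` fields, the `assemble` function, the millisecond values, and the character ORA are hypothetical, chosen only to show the shape of the result: one track per character, all the same length, each silent outside its own spans.

```python
from dataclasses import dataclass


@dataclass
class TimedLine:
    character: str
    duration_ms: int    # length of the line's audio
    gap_before_ms: int  # pause before this line starts


def assemble(lines, tail_gap_ms=0):
    """Return (total_ms, {character: [(start_ms, end_ms), ...]})."""
    cursor = 0
    tracks = {}
    for line in lines:
        cursor += line.gap_before_ms          # the pause
        start, end = cursor, cursor + line.duration_ms
        tracks.setdefault(line.character, []).append((start, end))
        cursor = end                          # the next pause starts here
    return cursor + tail_gap_ms, tracks


shot = [
    TimedLine("VENN", duration_ms=1200, gap_before_ms=300),
    TimedLine("ORA", duration_ms=800, gap_before_ms=400),
    TimedLine("VENN", duration_ms=500, gap_before_ms=200),
]
total_ms, tracks = assemble(shot, tail_gap_ms=300)
# total_ms == 3700
# tracks["VENN"] == [(300, 1500), (2900, 3400)]
# tracks["ORA"]  == [(1900, 2700)]
```

Rendering each character's file then amounts to writing `total_ms` of silence and placing that character's lines at their spans.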

The timing comes ready-made from the screenplay, so assemble usually just runs. If you want to nudge it, press Edit to open a timeline and drag the lines around; if you don't, the default timing is used. (not yet implemented)

Record a performance

Prefer a real voice? Record captures audio (or video with audio) directly. You choose the character as you record, so the take is labelled straight away — no tagging by hand afterwards. (recording exists today; recording as a recipe step, with character labelling at capture, is planned)

Split a recording into characters

Record a whole conversation in one take, then split it: on a timeline, mark which span belongs to which character, and the operation gives you a separate, labelled file per character, with every other voice muted in that file. This lets you perform a scene naturally and divide it up afterwards.
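The split can be thought of as copying each marked span into that character's track and leaving everything else as silence. The sketch below works on a plain list of sample values; the `split_by_character` name, the span format (start index, end index exclusive, character), and the character ORA are assumptions for illustration.

```python
def split_by_character(samples, spans):
    """Return {character: samples} where each track is silent outside its spans.

    spans is a list of (start_index, end_index, character), end exclusive.
    """
    tracks = {}
    for start, end, character in spans:
        # Every track starts as full-length silence, so all outputs stay in sync.
        track = tracks.setdefault(character, [0] * len(samples))
        track[start:end] = samples[start:end]
    return tracks


recording = [1, 2, 3, 4, 5, 6]
marked = [(0, 2, "VENN"), (2, 5, "ORA"), (5, 6, "VENN")]
per_character = split_by_character(recording, marked)
# per_character["VENN"] == [1, 2, 0, 0, 0, 6]
# per_character["ORA"]  == [0, 0, 3, 4, 5, 0]
```

Keeping every output the same length as the original take means the per-character files stay aligned with each other and with any video recorded alongside.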

Later, the platform will be able to line a recording up against the screenplay automatically — using speech recognition to find each line, chop a full-scene take into its shots and characters, and flag duplicate takes or flubbed lines. (not yet implemented)

Voice changer

Run a recorded performance through a character's voice changer to convert it into that character's voice, keeping the performance and timing of the original. You pick which configured voice changer to use. The output stays labelled with the same character.

Trim and edit timing

A timeline editor lets you trim a clip, change its speed, and adjust its position — working on a single audio or video file and producing an edited version with the same character label. You can also reuse an edit: load a previous file's edit as a recipe, drop a new file into it, and run — so the same trim or timing applies to fresh material. (not yet implemented)
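A reusable edit can be pictured as a small value that is stored once and applied to any clip. This is only a sketch: the `Clip` and `TimingEdit` types, their fields, and the file names are hypothetical, and it tracks durations and positions rather than touching real audio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    path: str
    character: str
    duration_ms: int
    start_ms: int = 0  # position on the timeline


@dataclass(frozen=True)
class TimingEdit:
    trim_start_ms: int = 0
    trim_end_ms: int = 0
    speed: float = 1.0   # above 1.0 plays faster, so the clip gets shorter
    shift_ms: int = 0

    def apply(self, clip: Clip, new_path: str) -> Clip:
        kept = clip.duration_ms - self.trim_start_ms - self.trim_end_ms
        if kept <= 0:
            raise ValueError("trim removes the whole clip")
        return Clip(
            path=new_path,
            character=clip.character,  # the label is carried over unchanged
            duration_ms=round(kept / self.speed),
            start_ms=clip.start_ms + self.shift_ms,
        )


edit = TimingEdit(trim_start_ms=200, trim_end_ms=300, speed=1.25, shift_ms=100)

first = edit.apply(Clip("take1.wav", "VENN", duration_ms=3000), "take1_edit.wav")
# Reuse the same edit on fresh material.
second = edit.apply(Clip("take2.wav", "VENN", duration_ms=1500), "take2_edit.wav")
# first.duration_ms == 2000, second.duration_ms == 800
```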

Recipes the platform builds for you

From the screenplay, the platform can assemble a good default recipe for a shot:

Text-to-speech: one line-to-audio step per spoken line, all feeding a single assemble step, wired up by character.
Recorded performance: record → split (when one take covers several characters) → voice changer per character.

These are starting points. Open any recipe to see the steps, swap a voice, adjust timing, or add a recording where it matters. (not yet implemented)
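The two default recipes can be sketched as plain lists of steps. The step dictionaries, the operation names (`line_to_audio`, `assemble`, `record`, `split`, `voice_changer`), the builder functions, and the character ORA with her line are all hypothetical; the real recipe format is not specified here.

```python
def build_tts_recipe(shot_lines):
    """shot_lines: list of (character, text) for one shot, in spoken order."""
    steps = []
    for index, (character, text) in enumerate(shot_lines):
        steps.append({
            "id": f"tts{index}",
            "op": "line_to_audio",
            "character": character,
            "text": text,
        })
    # One assemble step takes every line's audio as input.
    tts_ids = [step["id"] for step in steps]
    steps.append({"id": "assemble", "op": "assemble", "inputs": tts_ids})
    return steps


def build_recorded_recipe(characters):
    steps = [{"id": "record", "op": "record"}]
    source = "record"
    if len(characters) > 1:
        # Split only when one take covers several characters.
        steps.append({"id": "split", "op": "split", "inputs": ["record"]})
        source = "split"
    for character in characters:
        steps.append({
            "id": f"vc_{character}",
            "op": "voice_changer",
            "character": character,
            "inputs": [source],
        })
    return steps


tts = build_tts_recipe([("VENN", "What do you mean!"), ("ORA", "You heard me.")])
recorded = build_recorded_recipe(["VENN", "ORA"])
solo = build_recorded_recipe(["VENN"])
```

Here `tts` holds two `line_to_audio` steps followed by one `assemble` step, `recorded` runs record, split, then a voice changer per character, and `solo` skips the split.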

Choosing the keeper

An operation can leave you with more than one option — a quick first pass and a higher-quality alternate, or several takes of a recording. Star the one you want to keep; archive the ones you don't. The starred version is the one downstream steps use.
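Picking the keeper can be sketched as a filter over an operation's versions. The `Version` record and `keeper` function are hypothetical, and returning `None` when nothing is starred (or when more than one version is) is an assumption of this sketch, since that case is not defined above.

```python
from dataclasses import dataclass


@dataclass
class Version:
    path: str
    starred: bool = False
    archived: bool = False


def keeper(versions):
    # The version downstream steps should use: starred and not archived.
    starred = [v for v in versions if v.starred and not v.archived]
    if len(starred) == 1:
        return starred[0]
    return None  # assumption: no unambiguous keeper, so the caller decides


takes = [
    Version("take1.wav", archived=True),
    Version("take2.wav", starred=True),
    Version("take3.wav"),
]
chosen = keeper(takes)
# chosen.path == "take2.wav"
```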