Audio & Voice Pipeline

How dialog lines become audio — TTS, voice cloning, recording, and merged character tracks.

Ordinary Animator has a full dialog audio pipeline: from dialog lines written in the screenplay, through voice synthesis or recording, to per-character merged audio tracks that play back on the timeline.

Dialog lines

Dialog is written in the screenplay (Fountain format) and parsed into the scene's dialog structure. Each line is associated with a character and a shot. The Dialog tab in a scene or shot shows all lines in order.
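
To make the parsing step concrete, here is a minimal sketch of pulling character and line pairs out of Fountain text. It handles only a simplified subset of Fountain (uppercase character cues, parentheticals, extensions such as "(V.O.)") and is not Ordinary Animator's actual parser. The names `DialogLine` and `parse_dialog` are hypothetical, and the association of each line with a shot is not modeled here.

```python
import re
from dataclasses import dataclass


@dataclass
class DialogLine:
    character: str
    text: str


def parse_dialog(fountain_text: str) -> list[DialogLine]:
    """Extract dialog from a simplified subset of Fountain (assumption)."""
    lines = fountain_text.splitlines()
    result: list[DialogLine] = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        prev_blank = i == 0 or not lines[i - 1].strip()
        next_nonblank = i + 1 < len(lines) and bool(lines[i + 1].strip())
        # A character cue is an uppercase line with a blank line before it
        # and a non-blank line after it; scene headings are excluded.
        is_cue = (
            bool(line)
            and prev_blank
            and next_nonblank
            and line == line.upper()
            and any(c.isalpha() for c in line)
            and not re.match(r"^(INT|EXT|EST|I/E)[./ ]", line)
        )
        if not is_cue:
            i += 1
            continue
        # Drop a trailing extension such as "(V.O.)" from the cue.
        character = re.sub(r"\s*\(.*\)\s*$", "", line)
        i += 1
        spoken = []
        while i < len(lines) and lines[i].strip():
            part = lines[i].strip()
            # Parentheticals are direction, not spoken text.
            if not (part.startswith("(") and part.endswith(")")):
                spoken.append(part)
            i += 1
        if spoken:
            result.append(DialogLine(character, " ".join(spoken)))
    return result
```

Calling `parse_dialog` on a scene returns the lines in screenplay order, which matches the order the Dialog tab is described as using.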

Every line can independently have:

Generated TTS audio
A recorded audio file (uploaded or captured)
A voice-cloned variant
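
One way to picture these independent audio sources is as three optional slots on each line. The sketch below is illustrative only: the class name, field names, and file paths are hypothetical and do not reflect the platform's real data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DialogLineAudio:
    """Hypothetical per-line record; each source is optional and independent."""
    line_id: str
    tts_path: Optional[str] = None
    recorded_path: Optional[str] = None
    cloned_path: Optional[str] = None

    def available_sources(self) -> list[str]:
        # Report which of the three sources currently hold audio.
        slots = {
            "tts": self.tts_path,
            "recorded": self.recorded_path,
            "cloned": self.cloned_path,
        }
        return [name for name, path in slots.items() if path is not None]
```

A line with only TTS audio reports `["tts"]`; adding a recording later does not disturb the TTS clip.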

Voice configuration

Each character has a Voice configuration. You select which TTS engine to use and which voice within that engine:

Engine	Notes
ElevenLabs	High-quality synthesis with many voice options; requires API key
Typecast	Alternative TTS engine

When TTS is triggered for a dialog line, that character's voice configuration determines which engine and voice are used.
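
The lookup described above can be sketched as follows. This is an assumption about the shape of the data, not the platform's API: `VoiceConfig`, `build_tts_request`, the engine strings, and the voice identifiers are all hypothetical, and no real engine is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceConfig:
    engine: str    # hypothetical engine key, e.g. "elevenlabs" or "typecast"
    voice_id: str  # hypothetical engine-specific voice identifier


def build_tts_request(
    character: str, text: str, voice_configs: dict[str, VoiceConfig]
) -> dict[str, str]:
    """Resolve the character's voice configuration into a request payload."""
    config = voice_configs.get(character)
    if config is None:
        raise KeyError(f"No voice configuration for character {character!r}")
    return {"engine": config.engine, "voice_id": config.voice_id, "text": text}
```

Keeping the engine and voice on the character rather than on each line means that changing a character's voice affects every line generated afterwards without editing the lines themselves.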

Generating TTS

From the Dialog tab (scene or shot), each dialog line has a Generate Speech button. Clicking it submits the line text to the configured TTS engine and deposits the result in the line's audio slot.

You can preview each generated clip inline, regenerate if the result isn't right, and pick from alternatives if you've generated multiple takes.
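
The take workflow can be modeled as a list of clips plus a pointer to the chosen one. The sketch below assumes that a newly generated take becomes the selected one until you pick another; that behavior, along with the `TtsSlot` name and its methods, is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TtsSlot:
    """Hypothetical audio slot holding every generated take for one line."""
    takes: list[str] = field(default_factory=list)
    selected: Optional[int] = None

    def add_take(self, clip_ref: str) -> None:
        # Assumption: the newest take is selected by default.
        self.takes.append(clip_ref)
        self.selected = len(self.takes) - 1

    def pick(self, index: int) -> None:
        # Choose one of the earlier alternatives instead.
        if not 0 <= index < len(self.takes):
            raise IndexError(f"No take at index {index}")
        self.selected = index

    def current(self) -> Optional[str]:
        return None if self.selected is None else self.takes[self.selected]
```

Regenerating then amounts to another `add_take` call, and earlier takes stay available to `pick`.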

Recording and uploads

If you prefer a human voice or want to capture reference audio, there are two options:

Record — capture audio directly in the browser
Upload — attach an existing audio file

Uploaded and recorded files go through the same curation flow as TTS clips: you preview, approve, or replace them.

Voice cloning

Voice cloning creates a synthesised voice that matches a reference recording. Once a character has a reference audio file in their voice configuration, the platform can generate lines that sound like the reference speaker, using the TTS engine's cloning capability.

Merged audio tracks

When all dialog lines for a scene are generated or recorded, you can Merge audio per character. The merge produces a single audio track per character, with silence between spoken lines, timed to align with shot durations.
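
As a rough illustration of how a per-character track could be laid out, the sketch below plans speech and silence segments over the scene. It makes several assumptions that the documentation does not confirm: lines in a shot play back to back from the start of the shot, a character is silent while another character speaks, any remaining shot time is filled with silence, and a line that overruns its shot is cut at the shot boundary. `Shot`, `Clip`, and `plan_character_track` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    shot_id: str
    duration: float  # seconds


@dataclass
class Clip:
    shot_id: str
    character: str
    duration: float  # seconds


def plan_character_track(
    shots: list[Shot], clips: list[Clip], character: str
) -> list[tuple[str, float, float]]:
    """Return (kind, start, end) segments covering the whole scene."""
    segments: list[tuple[str, float, float]] = []

    def push(kind: str, start: float, end: float) -> None:
        if end <= start:
            return
        # Merge adjacent silences into one segment.
        if segments and segments[-1][0] == kind == "silence":
            segments[-1] = ("silence", segments[-1][1], end)
        else:
            segments.append((kind, start, end))

    shot_start = 0.0
    for shot in shots:
        shot_end = shot_start + shot.duration
        cursor = shot_start
        for clip in clips:
            if clip.shot_id != shot.shot_id:
                continue
            end = min(cursor + clip.duration, shot_end)
            kind = "speech" if clip.character == character else "silence"
            push(kind, cursor, end)
            cursor = end
        push("silence", cursor, shot_end)  # pad to the end of the shot
        shot_start = shot_end
    return segments
```

Because every character's plan spans the same total duration, the resulting tracks line up with each other and with the shot boundaries.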

The timeline renderer uses these merged tracks when assembling the final video, mixing together the tracks of all characters in the scene.
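
How the renderer combines tracks is not documented here. A common baseline, shown below as an assumption rather than the renderer's actual method, is to sum 16-bit PCM samples position by position and clamp the result to the valid range. The function name `mix_tracks` is hypothetical.

```python
def mix_tracks(tracks: list[list[int]]) -> list[int]:
    """Mix 16-bit PCM sample lists by summing and clamping (assumed method)."""
    length = max((len(track) for track in tracks), default=0)
    mixed: list[int] = []
    for i in range(length):
        # Tracks shorter than the longest one contribute silence past their end.
        total = sum(track[i] for track in tracks if i < len(track))
        mixed.append(max(-32768, min(32767, total)))
    return mixed
```

Since merged character tracks are mostly silence wherever other characters speak, plain summation rarely clips in practice; clamping only guards against overlapping speech.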

The render pipeline

See Shots for how rendered video clips and audio tracks are combined in the shot and scene timeline render.