Segments

A Segment describes one stretch of motion in time. Sequential segments compose into a clip — the request's total length is sum(segments[].duration_frames). A single pose segment is the one exception: it produces exactly one frame and can't be composed with others.

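Three segment types are defined in v1: text, unconditioned, and pose. The two examples below show a multi-segment clip and a single-pose request: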

// Multi-segment clip
[
{ "type": "text", "prompt": "a person walks forward", "duration_frames": 90 },
{ "type": "unconditioned", "duration_frames": 30 },
{ "type": "text", "prompt": "then sits down", "duration_frames": 60 }
]

// Single-pose request (mutually exclusive with multi-segment composition)
[
{ "type": "pose", "prompt": "person waves hello with their right hand" }
]
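As a minimal sketch of the length rule described above, the helper below computes a request's total length from plain segment dictionaries. The function name `total_frames` is illustrative and not part of any SDK; the only behavior taken from this page is that durations are summed and that a pose segment must stand alone and yields one frame.

```python
def total_frames(segments):
    """Return the clip length in frames for a list of segment dicts."""
    has_pose = any(seg["type"] == "pose" for seg in segments)
    if has_pose:
        # A pose segment cannot be composed with other segments.
        if len(segments) != 1:
            raise ValueError("a pose segment must be the only segment in the request")
        return 1  # pose is intrinsically one frame
    return sum(seg["duration_frames"] for seg in segments)


clip = [
    {"type": "text", "prompt": "a person walks forward", "duration_frames": 90},
    {"type": "unconditioned", "duration_frames": 30},
    {"type": "text", "prompt": "then sits down", "duration_frames": 60},
]
pose = [{"type": "pose", "prompt": "person waves hello with their right hand"}]

print(total_frames(clip))  # 180
print(total_frames(pose))  # 1
```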

Segment is a discriminated union with type as the discriminator. Servers advertise which types they accept via ModelSpec.supported_segments in /capabilities — clients SHOULD check before sending a non-default type. Future minor versions MAY add types (e.g. "audio", "motion_reference") under the same key.
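The pre-flight check recommended above might look like the following sketch. It assumes the /capabilities response has already been fetched and parsed, and that the relevant ModelSpec is available as a dictionary with a `supported_segments` list; that dictionary layout and the helper name `unsupported_types` are assumptions made for illustration.

```python
def unsupported_types(segments, supported_segments):
    """Return the segment types used in the request that the model does not advertise."""
    supported = set(supported_segments)
    missing = []
    for seg in segments:
        seg_type = seg["type"]
        if seg_type not in supported and seg_type not in missing:
            missing.append(seg_type)
    return missing


# Hypothetical ModelSpec excerpt; a real one comes from /capabilities.
model_spec = {"supported_segments": ["text", "unconditioned"]}

request_segments = [
    {"type": "pose", "prompt": "person waves hello with their right hand"},
]

missing = unsupported_types(request_segments, model_spec["supported_segments"])
if missing:
    print("model does not advertise:", missing)
```

Returning a list of missing types rather than a boolean lets the caller report exactly which segment needs to change.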

TextSegment — type: "text"

Generates motion conditioned on a natural-language prompt.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| type | "text" | yes | Discriminator |
| prompt | string (UTF-8 NFC, 1–max_prompt_length codepoints) | yes | Natural-language description |
| duration_frames | int (> 0) | yes | Length of this segment, in frames at the request's effective fps |
| language | string (BCP-47) | no, default "en" | Hint for backbones with multilingual support |
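A client can check the TextSegment field rules locally before sending. The sketch below is one possible shape for such a check, not the SDK's validator: `validate_text_segment` is a hypothetical name, and `max_prompt_length` is passed in as a parameter because this page does not say where a client obtains it. Python's `len()` on a `str` counts code points, which matches the unit used in the table.

```python
import unicodedata


def validate_text_segment(seg, max_prompt_length):
    """Return a list of human-readable problems; an empty list means the segment looks valid."""
    errors = []

    prompt = seg.get("prompt")
    if not isinstance(prompt, str):
        errors.append("prompt must be a string")
    else:
        if not unicodedata.is_normalized("NFC", prompt):
            errors.append("prompt must be NFC-normalized")
        if not 1 <= len(prompt) <= max_prompt_length:
            errors.append(f"prompt must be 1 to {max_prompt_length} code points")

    frames = seg.get("duration_frames")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(frames, bool) or not isinstance(frames, int) or frames <= 0:
        errors.append("duration_frames must be an integer greater than 0")

    return errors


ok = {"type": "text", "prompt": "a person walks forward", "duration_frames": 90}
bad = {"type": "text", "prompt": "cafe\u0301", "duration_frames": 0}  # decomposed accent, zero frames

print(validate_text_segment(ok, max_prompt_length=256))   # []
print(validate_text_segment(bad, max_prompt_length=256))
```

The limit of 256 in the usage lines is a placeholder value chosen for the example.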

UnconditionedSegment — type: "unconditioned"

Generates motion without text conditioning: the backbone runs its unconditional path (the null-prompt branch in a CFG model, or whatever default trajectory the model produces in the absence of a prompt). Use this to:

  • bridge two text segments with a natural transition the model picks,
  • fill a constraint-driven span where you have spatial pins but no prompt (e.g. "no text, but pin the right hand here at frame 50"),
  • request idle / neutral motion at the start or end of a clip.

{ "type": "unconditioned", "duration_frames": 30 }

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| type | "unconditioned" | yes | Discriminator |
| duration_frames | int (> 0) | yes | Length of the span, in frames at the request's effective fps |

Constraints overlapping an unconditioned span are honored on a best-effort basis just like anywhere else. The unconditioned span is still motion — post-processing applies as usual, and Options.transition_frames is honored at boundaries with adjacent segments.

PoseSegment — type: "pose"

Generates a single pose — exactly one frame — from a text prompt. Distinct from a 1-frame TextSegment: the server routes a pose segment to a specialized model trained for instantaneous poses, not "the first frame of a motion clip".

{ "type": "pose", "prompt": "person waves hello with their right hand" }

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| type | "pose" | yes | Discriminator |
| prompt | string (1–max_prompt_length codepoints) | yes | Natural-language description of the pose |
| language | string (BCP-47) | no, default "en" | |

No duration_frames field — pose is intrinsically 1 frame.

PoseSegment is an optional capability. Servers advertise support by including "pose" in ModelSpec.supported_segments. The SDK rejects pose segments sent to a model that doesn't claim support, with the unsupported_segment error code. See the Capabilities reference.

Composing

{
"segments": [
{ "type": "text", "prompt": "person walks forward", "duration_frames": 90 },
{ "type": "unconditioned", "duration_frames": 30 },
{ "type": "text", "prompt": "then sits down", "duration_frames": 60 }
],
"options": { "transition_frames": 8 }
}

transition_frames is the blend window at boundaries between segments. Rotations blend via slerp; root translation blends linearly.
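To illustrate the two interpolation modes named above, here is a small pure-Python sketch of slerp for a joint rotation and linear interpolation for root translation. It assumes rotations are unit quaternions in (w, x, y, z) order, which this page does not specify. How the blend weight `t` is scheduled across the transition window is also not stated here, so the example evaluates a single weight only.

```python
import math


def lerp(a, b, t):
    """Component-wise linear interpolation between two equal-length tuples."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip one to take the shorter arc.
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:
        # Nearly identical rotations: normalized lerp avoids dividing by a tiny sine.
        q = lerp(q0, q1, t)
        norm = math.sqrt(sum(c * c for c in q))
        return tuple(c / norm for c in q)
    theta = math.acos(dot)
    sin_theta = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))


# Blend halfway between the last pose of one segment and the first pose of the next.
identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))  # 90 degrees about z

mid_rotation = slerp(identity, quarter_turn_z, 0.5)        # 45 degrees about z
mid_root = lerp((0.0, 0.0, 0.0), (2.0, 0.0, 4.0), 0.25)    # (0.5, 0.0, 1.0)
print(mid_rotation, mid_root)
```

Slerp keeps the interpolated rotation on the unit sphere and moves at constant angular speed, which is why it suits rotations; positions have no such constraint, so a straight-line blend is sufficient for the root.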

Segments vs constraints-only

If segments is empty, the request becomes constraints-only — the model generates motion to satisfy the constraints with no text guidance. In that case you MUST set duration_frames at the request level:

{
"protocol_version": "1.0",
"model": "kimodo-soma-rp",
"skeleton": { "joints": [ /* … */ ] },
"segments": [],
"duration_frames": 90,
"constraints": [ /* … */ ]
}

When segments is non-empty, the request-level duration_frames MUST be omitted — the duration is sum(segments[].duration_frames).

A request with both segments: [] and constraints: [] is rejected.
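The three rules in this section can be expressed as one client-side check. The sketch below is an illustration under stated assumptions: `validate_request_shape` is a hypothetical helper, a missing `segments` or `constraints` key is treated the same as an empty list, and the constraint object in the usage lines is a placeholder because the constraint schema is not described on this page.

```python
def validate_request_shape(request):
    """Check the segments / constraints / duration_frames rules; return a list of problems."""
    errors = []
    segments = request.get("segments") or []
    constraints = request.get("constraints") or []
    has_duration = "duration_frames" in request

    if segments:
        # Duration is derived from the segments, so a request-level value is not allowed.
        if has_duration:
            errors.append("duration_frames must be omitted when segments is non-empty")
    elif not constraints:
        errors.append("segments and constraints must not both be empty")
    elif not has_duration:
        errors.append("constraints-only requests must set duration_frames")

    return errors


placeholder_constraint = {"placeholder": True}  # stand-in; real constraint fields are not shown here

constraints_only = {"segments": [], "duration_frames": 90, "constraints": [placeholder_constraint]}
conflicting = {
    "segments": [{"type": "unconditioned", "duration_frames": 30}],
    "duration_frames": 30,
}

print(validate_request_shape(constraints_only))  # []
print(validate_request_shape(conflicting))
```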