Segments

A Segment describes one stretch of motion in time. Sequential segments compose into a clip — the request's total length is sum(segments[].duration_frames). A single pose segment is the one exception: it produces exactly one frame and can't be composed with others.

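Three segment types are defined in v1: `"text"`, `"unconditioned"`, and `"pose"`. The two examples below show a multi-segment clip and a single-pose request: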

// Multi-segment clip
[
  { "type": "text",          "prompt": "a person walks forward", "duration_frames": 90 },
  { "type": "unconditioned", "duration_frames": 30 },
  { "type": "text",          "prompt": "then sits down",         "duration_frames": 60 }
]

// Single-pose request (mutually exclusive with multi-segment composition)
[
  { "type": "pose", "prompt": "person waves hello with their right hand" }
]
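
The length rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the protocol or any SDK: `total_frames` is a hypothetical helper name, and segments are assumed to be plain dicts shaped like the JSON examples.

```python
def total_frames(segments):
    """Return the clip length in frames for a list of segment dicts."""
    # A lone pose segment is the one exception: exactly one frame.
    if len(segments) == 1 and segments[0]["type"] == "pose":
        return 1
    # A pose segment cannot be composed with other segments.
    if any(seg["type"] == "pose" for seg in segments):
        raise ValueError("a pose segment cannot be composed with other segments")
    # Otherwise the clip length is the sum of the per-segment durations.
    return sum(seg["duration_frames"] for seg in segments)


clip = [
    {"type": "text", "prompt": "a person walks forward", "duration_frames": 90},
    {"type": "unconditioned", "duration_frames": 30},
    {"type": "text", "prompt": "then sits down", "duration_frames": 60},
]
print(total_frames(clip))  # 180
```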

Segment is a discriminated union with type as the discriminator. Servers advertise which types they accept via ModelSpec.supported_segments in /capabilities — clients SHOULD check before sending a non-default type. Future minor versions MAY add types (e.g. "audio", "motion_reference") under the same key.
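
A client-side capability check might look like the sketch below. The only name taken from this section is `supported_segments`; the surrounding shape of the `/capabilities` response and the helper `unsupported_types` are assumptions made for illustration.

```python
def unsupported_types(segments, supported_segments):
    """Return the sorted segment types used in `segments` that are not advertised."""
    used = {seg["type"] for seg in segments}
    return sorted(used - set(supported_segments))


# Hypothetical excerpt of one ModelSpec from /capabilities; only the
# `supported_segments` key is named in this section.
model_spec = {"supported_segments": ["text", "unconditioned"]}

segments = [
    {"type": "pose", "prompt": "person waves hello with their right hand"},
]
missing = unsupported_types(segments, model_spec["supported_segments"])
print(missing)  # ['pose']
```

Because future minor versions may add types under the same key, the check compares against whatever the server advertises and does not hard-code the three v1 types.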

TextSegment — `type: "text"`

Generates motion conditioned on a natural-language prompt.

Field	Type	Required	Notes
`type`	`"text"`	yes	Discriminator
`prompt`	string (UTF-8 NFC, 1–`max_prompt_length` codepoints)	yes	Natural-language description
`duration_frames`	int (> 0)	yes	Length of this segment, in frames at the request's effective fps
`language`	string (BCP-47)	no, default `"en"`	Hint for backbones with multilingual support
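
The field rules in the table can be checked before a request is sent. The sketch below is one possible pre-flight validation, not the server's actual logic: `validate_text_segment` is a hypothetical helper, and `max_prompt_length` is passed in as a parameter because this section does not say where the limit is published.

```python
import unicodedata


def validate_text_segment(seg, max_prompt_length):
    """Return a list of problems found in a TextSegment dict (empty if none)."""
    problems = []
    if seg.get("type") != "text":
        problems.append("type must be 'text'")

    prompt = seg.get("prompt")
    if not isinstance(prompt, str):
        problems.append("prompt is required and must be a string")
    else:
        if not unicodedata.is_normalized("NFC", prompt):
            problems.append("prompt is not NFC-normalized")
        # len() on a Python str counts Unicode code points, not bytes.
        if not 1 <= len(prompt) <= max_prompt_length:
            problems.append("prompt length out of range")

    frames = seg.get("duration_frames")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(frames, int) or isinstance(frames, bool) or frames <= 0:
        problems.append("duration_frames must be an int > 0")
    return problems


segment = {"type": "text", "prompt": "a person walks forward", "duration_frames": 90}
print(validate_text_segment(segment, max_prompt_length=512))  # []
```

The value 512 is a placeholder. A prompt that fails the NFC check can be repaired with `unicodedata.normalize("NFC", prompt)` before sending.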

UnconditionedSegment — `type: "unconditioned"`

Generates motion without text conditioning: the backbone runs its unconditional path (the null-prompt branch in a CFG model, or whatever default trajectory the model produces in the absence of a prompt). Use this to:

bridge two text segments with a natural transition the model picks,
fill a constraint-driven span where you have spatial pins but no prompt (e.g. "no text, but pin the right hand here at frame 50"),
request idle / neutral motion at the start or end of a clip.

{ "type": "unconditioned", "duration_frames": 30 }

Field	Type	Required	Notes
`type`	`"unconditioned"`	yes	Discriminator
`duration_frames`	int (> 0)	yes	Length of the span, in frames at the request's effective fps

Constraints overlapping an unconditioned span are honored on a best-effort basis just like anywhere else. The unconditioned span is still motion — post-processing applies as usual, and Options.transition_frames is honored at boundaries with adjacent segments.

PoseSegment — `type: "pose"`

Generates a single pose (exactly one frame) from a text prompt. This is distinct from a 1-frame TextSegment: the server routes a pose segment to a specialized model trained for instantaneous poses, rather than returning "the first frame of a motion clip".

{ "type": "pose", "prompt": "person waves hello with their right hand" }

Field	Type	Required	Notes
`type`	`"pose"`	yes	Discriminator
`prompt`	string (1–`max_prompt_length` codepoints)	yes	Natural-language description of the pose
`language`	string (BCP-47)	no, default `"en"`	Hint for backbones with multilingual support

No duration_frames field — pose is intrinsically 1 frame.

PoseSegment is an optional capability. Servers advertise support by including "pose" in ModelSpec.supported_segments. The SDK rejects pose segments sent to a model that doesn't claim support, with the unsupported_segment error code. See the Capabilities reference.
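
The pose rules can be combined into one client-side guard. This is a sketch under stated assumptions, not the SDK's implementation: `SegmentError` and `check_pose_request` are hypothetical names, only the `unsupported_segment` code comes from this section, and rejecting a stray `duration_frames` on a pose segment is an assumption about how strictly the "no duration_frames field" rule is enforced.

```python
class SegmentError(Exception):
    """Hypothetical client-side error that carries a protocol error code."""

    def __init__(self, code, message):
        super().__init__(message)
        self.code = code


def check_pose_request(segments, supported_segments):
    """Raise if a request containing a pose segment is unsupported or malformed."""
    if not any(seg["type"] == "pose" for seg in segments):
        return  # nothing pose-specific to check
    if "pose" not in supported_segments:
        raise SegmentError("unsupported_segment", "model does not advertise 'pose'")
    if len(segments) != 1:
        raise ValueError("a pose segment cannot be composed with other segments")
    if "duration_frames" in segments[0]:
        # Assumption: a stray duration_frames is treated as a client error.
        raise ValueError("pose segments have no duration_frames field")


pose_request = [{"type": "pose", "prompt": "person waves hello with their right hand"}]
check_pose_request(pose_request, ["text", "unconditioned", "pose"])  # passes silently

try:
    check_pose_request(pose_request, ["text", "unconditioned"])
except SegmentError as err:
    print(err.code)  # unsupported_segment
```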

Composing

{
  "segments": [
    { "type": "text",          "prompt": "person walks forward", "duration_frames": 90 },
    { "type": "unconditioned", "duration_frames": 30 },
    { "type": "text",          "prompt": "then sits down",       "duration_frames": 60 }
  ],
  "options": { "transition_frames": 8 }
}

transition_frames is the blend window at boundaries between segments. Rotations blend via slerp; root translation blends linearly.
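
To make the two blend modes concrete, here is a minimal pure-Python sketch of slerp for unit quaternions and linear interpolation for root translation. It is illustrative only: the `(w, x, y, z)` component order, the evenly spaced weights in `blend_weights`, and all function names are assumptions, since this section does not specify the rotation representation or the shape of the blend curve.

```python
import math


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip one to take the shorter arc.
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:
        # Nearly parallel: fall back to normalized linear interpolation.
        out = tuple(a + t * (b - a) for a, b in zip(q0, q1))
        norm = math.sqrt(sum(c * c for c in out))
        return tuple(c / norm for c in out)
    theta = math.acos(dot)
    s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return tuple(s0 * a + s1 * b for a, b in zip(q0, q1))


def lerp(p0, p1, t):
    """Linear interpolation between two translation vectors."""
    return tuple(a + t * (b - a) for a, b in zip(p0, p1))


def blend_weights(transition_frames):
    """Assumed scheme: evenly spaced weights strictly between 0 and 1."""
    return [(i + 1) / (transition_frames + 1) for i in range(transition_frames)]


identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))  # 90 degrees about z
halfway = slerp(identity, quarter_turn_z, 0.5)
print([round(c, 4) for c in halfway])  # [0.9239, 0.0, 0.0, 0.3827]
```

With `transition_frames: 8`, `blend_weights(8)` yields eight weights from 1/9 to 8/9; each blended frame would apply `slerp` per joint rotation and `lerp` to the root translation at that weight.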

Segments vs constraints-only

If segments is empty, the request becomes constraints-only — the model generates motion to satisfy the constraints with no text guidance. In that case you MUST set duration_frames at the request level:

{
  "protocol_version": "1.0",
  "model": "kimodo-soma-rp",
  "skeleton": { "joints": [ /* … */ ] },
  "segments": [],
  "duration_frames": 90,
  "constraints": [ /* … */ ]
}

When segments is non-empty, the request-level duration_frames MUST be omitted — the duration is sum(segments[].duration_frames).

A request with both segments: [] and constraints: [] is rejected.
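
The three rules in this section reduce to a small decision procedure. The sketch below is one way to express it client-side; `resolve_duration` is a hypothetical helper, treating a missing `segments` or `constraints` key as empty is an assumption, and the constraint entry in the example is a placeholder because the constraint schema is not covered here.

```python
def resolve_duration(request):
    """Return the effective clip length in frames, or raise on an invalid combination."""
    segments = request.get("segments", [])
    constraints = request.get("constraints", [])
    has_duration = "duration_frames" in request

    if not segments:
        if not constraints:
            raise ValueError("segments and constraints cannot both be empty")
        if not has_duration:
            raise ValueError("constraints-only requests must set duration_frames")
        return request["duration_frames"]

    if has_duration:
        raise ValueError("duration_frames must be omitted when segments is non-empty")
    if len(segments) == 1 and segments[0]["type"] == "pose":
        return 1  # a pose segment is intrinsically one frame
    return sum(seg["duration_frames"] for seg in segments)


constraints_only = {
    "segments": [],
    "duration_frames": 90,
    "constraints": [{"placeholder": "constraint"}],  # hypothetical entry
}
print(resolve_duration(constraints_only))  # 90
```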
