Segments
A Segment describes one stretch of motion in time. Sequential segments
compose into a clip — the request's total length is
sum(segments[].duration_frames). A single pose
segment is the one exception: it produces exactly one frame and can't be
composed with others.
Three segment types are defined in v1: "text", "unconditioned", and "pose". For example:
// Multi-segment clip
[
{ "type": "text", "prompt": "a person walks forward", "duration_frames": 90 },
{ "type": "unconditioned", "duration_frames": 30 },
{ "type": "text", "prompt": "then sits down", "duration_frames": 60 }
]
// Single-pose request (mutually exclusive with multi-segment composition)
[
{ "type": "pose", "prompt": "person waves hello with their right hand" }
]
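The length rule above can be sketched in a few lines of Python. This is only an illustration operating on plain dicts shaped like the JSON examples; the helper name `total_frames` is hypothetical and not part of any SDK.

```python
from typing import Any


def total_frames(segments: list[dict[str, Any]]) -> int:
    """Return the clip length in frames for a list of segment dicts."""
    has_pose = any(seg["type"] == "pose" for seg in segments)
    if has_pose:
        # A pose segment cannot be composed with other segments.
        if len(segments) != 1:
            raise ValueError("a pose segment must be the only segment in the request")
        return 1  # a pose is exactly one frame
    # Otherwise the clip length is the sum of the per-segment durations.
    return sum(seg["duration_frames"] for seg in segments)


clip = [
    {"type": "text", "prompt": "a person walks forward", "duration_frames": 90},
    {"type": "unconditioned", "duration_frames": 30},
    {"type": "text", "prompt": "then sits down", "duration_frames": 60},
]
print(total_frames(clip))  # 90 + 30 + 60 = 180
```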
Segment is a discriminated union with type as the discriminator.
Servers advertise which types they accept via
ModelSpec.supported_segments in /capabilities — clients SHOULD
check before sending a non-default type. Future minor versions MAY add
types (e.g. "audio", "motion_reference") under the same key.
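As a rough illustration of the discriminated union, the sketch below dispatches on the type field to pick a class. The Python classes mirror the segment names used in this page, but the dataclasses and the `parse_segment` helper are hypothetical, not an official SDK API.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class TextSegment:
    prompt: str
    duration_frames: int
    language: str = "en"


@dataclass
class UnconditionedSegment:
    duration_frames: int


@dataclass
class PoseSegment:
    prompt: str
    language: str = "en"


Segment = Union[TextSegment, UnconditionedSegment, PoseSegment]

# Maps the discriminator value to the class that models it.
_SEGMENT_CLASSES = {
    "text": TextSegment,
    "unconditioned": UnconditionedSegment,
    "pose": PoseSegment,
}


def parse_segment(raw: dict[str, Any]) -> Segment:
    """Build a typed segment from a JSON-like dict, keyed on "type"."""
    fields = dict(raw)  # copy so the caller's dict is not mutated
    kind = fields.pop("type", None)
    cls = _SEGMENT_CLASSES.get(kind)
    if cls is None:
        # Types added by a future minor version would land here.
        raise ValueError(f"unknown segment type: {kind!r}")
    return cls(**fields)
```

Keeping the mapping in one table means a type added by a later minor version needs only a new class and one new entry.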
TextSegment — type: "text"
Generates motion conditioned on a natural-language prompt.
| Field | Type | Required | Notes |
|---|---|---|---|
| type | "text" | yes | Discriminator |
| prompt | string (UTF-8 NFC, 1–max_prompt_length codepoints) | yes | Natural-language description |
| duration_frames | int (> 0) | yes | Length of this segment, in frames at the request's effective fps |
| language | string (BCP-47) | no, default "en" | Hint for backbones with multilingual support |
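A client might pre-check these field rules before sending. The sketch below is one way to do it, assuming `max_prompt_length` has already been obtained from the server and is passed in as a parameter; the function name is hypothetical.

```python
import unicodedata


def check_text_segment(prompt: str, duration_frames: int, max_prompt_length: int) -> str:
    """Validate TextSegment fields and return the NFC-normalized prompt."""
    normalized = unicodedata.normalize("NFC", prompt)
    # len() on a Python str counts Unicode codepoints, which matches the table's unit.
    length = len(normalized)
    if not 1 <= length <= max_prompt_length:
        raise ValueError(
            f"prompt must be 1..{max_prompt_length} codepoints, got {length}"
        )
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(duration_frames, bool) or not isinstance(duration_frames, int):
        raise ValueError("duration_frames must be an integer")
    if duration_frames <= 0:
        raise ValueError("duration_frames must be greater than 0")
    return normalized
```

Normalizing before measuring matters because a decomposed sequence such as "e" followed by a combining acute accent collapses to one codepoint under NFC.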
UnconditionedSegment — type: "unconditioned"
Generates motion without text conditioning: the backbone runs its unconditional path (the null-prompt branch in a classifier-free guidance (CFG) model, or whatever default trajectory the model produces in the absence of a prompt). Use this to:
- bridge two text segments with a natural transition the model picks,
- fill a constraint-driven span where you have spatial pins but no prompt (e.g. "no text, but pin the right hand here at frame 50"),
- request idle / neutral motion at the start or end of a clip.
{ "type": "unconditioned", "duration_frames": 30 }
| Field | Type | Required | Notes |
|---|---|---|---|
| type | "unconditioned" | yes | Discriminator |
| duration_frames | int (> 0) | yes | Length of the span, in frames at the request's effective fps |
Constraints overlapping an unconditioned span are honored on a best-effort
basis just like anywhere else. The unconditioned span is still motion —
post-processing applies as usual, and Options.transition_frames is
honored at boundaries with adjacent segments.
PoseSegment — type: "pose"
Generates a single pose — exactly one frame — from a text prompt.
Distinct from a 1-frame TextSegment: the server routes a pose
segment to a specialized model trained for instantaneous poses, not
"the first frame of a motion clip".
{ "type": "pose", "prompt": "person waves hello with their right hand" }
| Field | Type | Required | Notes |
|---|---|---|---|
| type | "pose" | yes | Discriminator |
| prompt | string (1–max_prompt_length codepoints) | yes | Natural-language description of the pose |
| language | string (BCP-47) | no, default "en" | |
No duration_frames field — pose is intrinsically 1 frame.
PoseSegment is an optional capability. Servers advertise support
by including "pose" in ModelSpec.supported_segments. The SDK
rejects pose segments sent to a model that doesn't claim support, with
the unsupported_segment error code. See the
Capabilities reference.
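The client-side check described above could look roughly like the following. The exact shape of the /capabilities response is not shown on this page, so the sketch takes the `supported_segments` list directly; the exception class is hypothetical and only borrows the unsupported_segment code named above.

```python
from typing import Any, Iterable


class UnsupportedSegmentError(Exception):
    """Hypothetical client-side error carrying the unsupported_segment code."""

    code = "unsupported_segment"


def check_supported(
    segments: list[dict[str, Any]], supported_segments: Iterable[str]
) -> None:
    """Raise if any segment type is missing from the model's advertised list."""
    supported = set(supported_segments)
    for index, seg in enumerate(segments):
        if seg["type"] not in supported:
            raise UnsupportedSegmentError(
                f"segment {index}: type {seg['type']!r} is not in supported_segments"
            )


# Example: a model that advertises only text and unconditioned segments.
advertised = ["text", "unconditioned"]
check_supported([{"type": "text", "prompt": "walk", "duration_frames": 30}], advertised)
```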
Composing
{
"segments": [
{ "type": "text", "prompt": "person walks forward", "duration_frames": 90 },
{ "type": "unconditioned", "duration_frames": 30 },
{ "type": "text", "prompt": "then sits down", "duration_frames": 60 }
],
"options": { "transition_frames": 8 }
}
transition_frames is the blend window at boundaries between segments.
Rotations blend via slerp; root translation blends linearly.
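For readers unfamiliar with the two blend modes, here is a minimal sketch of each. It assumes rotations are unit quaternions in (w, x, y, z) order and that a blend weight `t` in [0, 1] is supplied by the caller; how the server represents rotations and how it maps frames in the window to weights are not specified on this page.

```python
import math

Quat = tuple[float, float, float, float]  # assumed (w, x, y, z), unit length
Vec3 = tuple[float, float, float]


def slerp(q0: Quat, q1: Quat, t: float) -> Quat:
    """Spherical linear interpolation between two unit quaternions."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip one to take the shorter arc.
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:
        # Nearly identical rotations: fall back to normalized lerp to avoid
        # dividing by a sine that is close to zero.
        mixed = tuple(a + t * (b - a) for a, b in zip(q0, q1))
        norm = math.sqrt(sum(c * c for c in mixed))
        return tuple(c / norm for c in mixed)
    theta = math.acos(dot)
    sin_theta = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))


def lerp(p0: Vec3, p1: Vec3, t: float) -> Vec3:
    """Linear interpolation, as used for root translation."""
    return tuple(a + t * (b - a) for a, b in zip(p0, p1))
```

Slerp keeps the interpolated quaternion on the unit sphere and moves at constant angular speed, which is why it is the usual choice for joint rotations, while positions have no such constraint and can blend linearly.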
Segments vs constraints-only
If segments is empty, the request becomes constraints-only — the model
generates motion to satisfy the constraints with no text guidance. In that
case you MUST set duration_frames at the request level:
{
"protocol_version": "1.0",
"model": "kimodo-soma-rp",
"skeleton": { "joints": [ /* … */ ] },
"segments": [],
"duration_frames": 90,
"constraints": [ /* … */ ]
}
When segments is non-empty, the request-level duration_frames MUST be
omitted — the duration is sum(segments[].duration_frames).
A request with both segments: [] and constraints: [] is rejected.
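The three rules in this section can be checked together before a request is sent. The sketch below is an illustrative client-side check on a plain dict, not the server's validator; it treats a missing segments or constraints key as an empty list, which is an assumption, and the function name is hypothetical.

```python
from typing import Any


def check_request_shape(request: dict[str, Any]) -> int:
    """Apply the segments / duration_frames rules and return the clip length."""
    segments = request.get("segments", [])
    constraints = request.get("constraints", [])
    has_duration = "duration_frames" in request

    if segments:
        # Non-empty segments: the request-level duration must be omitted.
        if has_duration:
            raise ValueError("duration_frames must be omitted when segments is non-empty")
        # A pose segment has no duration_frames field and counts as one frame.
        return sum(
            1 if seg["type"] == "pose" else seg["duration_frames"] for seg in segments
        )

    # Empty segments: constraints-only request.
    if not constraints:
        raise ValueError("segments and constraints cannot both be empty")
    if not has_duration:
        raise ValueError("duration_frames is required for a constraints-only request")
    return request["duration_frames"]
```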