Serving

mstar has two ways to start a server:

  • mstar serve <model> — a one-command wrapper with sensible per-model defaults (most single-GPU; a few multi-GPU). Best for getting started and single-node runs.

  • mstar-serve --config <yaml> — the low-level entry point that takes an explicit config. Use it for custom layouts, disaggregation, and tensor parallelism.

mstar serve resolves a default config for the model, fills in the plumbing that mstar-serve needs (socket/upload dirs, a single-node-safe tensor protocol, the HF cache), and then delegates to it.

mstar serve

mstar serve <model> [options]

<model> is one of bagel, bagel_cfg_parallel, qwen3_omni, orpheus, pi05, vjepa2, vjepa2_ac (or pass --config for any other deployment).

| Option | Default | Description |
| --- | --- | --- |
| --host | 0.0.0.0 | Bind address. |
| --port | 8000 | HTTP port. |
| --gpus | all visible | Sets CUDA_VISIBLE_DEVICES (e.g. 0 or 0,1,2). |
| --config | model default | Override the resolved default config with a path to your own YAML. |
| --cache-dir | HF default | HuggingFace weight cache directory. |
| --tensor-comm-protocol | SHM | Tensor transport: SHM (safe single-node default), TCP, or RDMA. |
| --socket-path-prefix | /tmp/mstar_<user>/ | ZMQ IPC socket prefix. |
| --upload-dir | /tmp/mstar_uploads_<user>/ | Temp directory for uploaded media. |
| --log-level | INFO | DEBUG / INFO / WARNING / ERROR. |

Each model maps to a default config (override with --config):

| Model | Default config | GPUs |
| --- | --- | --- |
| bagel | configs/bagel_single_gpu.yaml | 1 |
| bagel_cfg_parallel | configs/bagel_cfg_parallel.yaml | 3 |
| orpheus | configs/orpheus_colocated.yaml | 1 |
| qwen3_omni | configs/qwen3omni_2gpu.yaml | 2 |
| pi05 | configs/pi05.yaml | 1 |
| vjepa2 | configs/vjepa2.yaml | 1 |
| vjepa2_ac | configs/vjepa2_ac.yaml | 1 |
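
To make the delegation concrete, here is a minimal Python sketch of the command line that mstar serve plausibly hands to mstar-serve, built only from the defaults documented above. The helper name build_delegated_command is hypothetical and not part of mstar, and the real wrapper may pass additional flags (for example --host, --port, or --cache-dir) or resolve paths differently.

```python
import getpass
import shlex

# Model -> default config, copied from the table above.
DEFAULT_CONFIGS = {
    "bagel": "configs/bagel_single_gpu.yaml",
    "bagel_cfg_parallel": "configs/bagel_cfg_parallel.yaml",
    "orpheus": "configs/orpheus_colocated.yaml",
    "qwen3_omni": "configs/qwen3omni_2gpu.yaml",
    "pi05": "configs/pi05.yaml",
    "vjepa2": "configs/vjepa2.yaml",
    "vjepa2_ac": "configs/vjepa2_ac.yaml",
}


def build_delegated_command(model, config=None, gpus=None, user=None):
    """Hypothetical helper: approximate what `mstar serve` passes to `mstar-serve`."""
    if config is None:
        if model not in DEFAULT_CONFIGS:
            raise ValueError(f"no default config for {model!r}; pass config explicitly")
        config = DEFAULT_CONFIGS[model]
    user = user or getpass.getuser()
    argv = [
        "mstar-serve",
        "--config", config,
        "--tensor-comm-protocol", "SHM",  # single-node-safe default of `mstar serve`
        "--socket-path-prefix", f"/tmp/mstar_{user}/",
        "--upload-dir", f"/tmp/mstar_uploads_{user}/",
    ]
    # --gpus is documented as setting CUDA_VISIBLE_DEVICES rather than a flag.
    env = {"CUDA_VISIBLE_DEVICES": gpus} if gpus is not None else {}
    return argv, env


argv, env = build_delegated_command("qwen3_omni", gpus="0,1", user="alice")
print(env, shlex.join(argv))
```

Returning the argument list and the environment separately mirrors the documented split: GPU selection travels through CUDA_VISIBLE_DEVICES, everything else through flags.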

mstar-serve

mstar-serve --config configs/<model>.yaml [options]

| Option | Default | Description |
| --- | --- | --- |
| --config (required) | | Path to the YAML config. |
| --host / --port | 0.0.0.0 / 8000 | Bind address / HTTP port. |
| --tensor-comm-protocol | RDMA | RDMA, TCP, or SHM. |
| --cache-dir | HF default | HuggingFace weight cache directory. |
| --socket-path-prefix | /tmp/mstar | ZMQ IPC socket prefix (shared with conductor/workers). |
| --upload-dir | /tmp/mstar_uploads | Temp directory for uploaded media. |
| --timeout | 600 | Per-request timeout (seconds). |
| --mooncake-port | 8080 | Port for the Mooncake RDMA transfer engine. |
| --tcp-transfer-device | (auto) | Network device for TCP tensor transport. |
| --enable-nvtx | off | Emit NVTX markers for profiling. |
| --log-level | INFO | DEBUG / INFO / WARNING / ERROR. |

Note: mstar serve defaults the tensor protocol to SHM (safe on a single node), whereas mstar-serve defaults to RDMA. On a single node without RDMA, pass --tensor-comm-protocol SHM to mstar-serve.

Config files

A config maps the model’s computation-graph nodes to physical GPU ranks. The keys:

| Key | Meaning |
| --- | --- |
| model | Registry key of the model (see Supported Models). |
| max_seq_len | Maximum sequence length (sizes the KV cache). |
| node_groups | List of placements. Each entry assigns node_names to ranks, optionally scoped to specific graph_walks and/or sharded with tp_size. |
| model_kwargs | (optional) Server-init model parameters (see below). |

Node names are model-specific — they are the keys of the model’s get_node_engine_types (e.g. BAGEL’s vit_encoder / vae_encoder / LLM, Orpheus’s LLM / snac_decoder).

Single GPU. Everything on rank 0:

model: "bagel"
max_seq_len: 32768
node_groups:
  - {node_names: [vit_encoder], ranks: [0]}
  - {node_names: [vae_encoder, vae_decoder], ranks: [0]}
  - {node_names: [LLM], ranks: [0]}

Disaggregation. The same node can live on different GPUs per graph walk — e.g. prefill, decode, and image generation on three GPUs:

node_groups:
  - {node_names: [LLM], ranks: [0], graph_walks: [prefill_text, prefill_vit, prefill_vae]}
  - {node_names: [LLM], ranks: [1], graph_walks: [decode]}
  - {node_names: [LLM], ranks: [2], graph_walks: [image_gen]}

Because placement is config-only, the same model code runs single-GPU or fully disaggregated. configs/ ships several layouts per model (*_single_gpu, *_colocated, *_pd_disaggregated, *_cfg_parallel, …).
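
As an illustration of how such a config can be read, the Python sketch below resolves which ranks serve a given node for a given graph walk, using the disaggregated layout above written as the parsed YAML would look. The helpers ranks_for and gpus_needed are hypothetical and not part of mstar. Two assumptions are made: an entry without graph_walks applies to every walk, and each distinct rank corresponds to one GPU.

```python
# The disaggregated layout from above, as a parsed-YAML structure.
NODE_GROUPS = [
    {"node_names": ["LLM"], "ranks": [0],
     "graph_walks": ["prefill_text", "prefill_vit", "prefill_vae"]},
    {"node_names": ["LLM"], "ranks": [1], "graph_walks": ["decode"]},
    {"node_names": ["LLM"], "ranks": [2], "graph_walks": ["image_gen"]},
]


def ranks_for(node_groups, node, graph_walk):
    """Return the ranks of the first entry that places `node` for `graph_walk`."""
    for group in node_groups:
        if node not in group["node_names"]:
            continue
        walks = group.get("graph_walks")
        # Assumption: a missing graph_walks key means "all walks".
        if walks is None or graph_walk in walks:
            return list(group["ranks"])
    raise LookupError(f"no placement for node {node!r} in walk {graph_walk!r}")


def gpus_needed(node_groups):
    """Count distinct ranks, assuming one GPU per rank."""
    return len({rank for group in node_groups for rank in group["ranks"]})


for walk in ("prefill_text", "decode", "image_gen"):
    print(walk, ranks_for(NODE_GROUPS, "LLM", walk))
print("GPUs:", gpus_needed(NODE_GROUPS))
```

The same lookup applied to the single-GPU layout returns rank 0 for every node and walk, which is the sense in which placement is config-only.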

Tensor parallelism. To shard a node across GPUs, set tp_size and list that many ranks:

model: "orpheus"
max_seq_len: 131072
node_groups:
  - {node_names: [LLM], ranks: [0, 1], tp_size: 2, graph_walks: [prefill, decode]}
  - {node_names: [snac_decoder], ranks: [0], graph_walks: [snac_chunk]}

A node must be declared TP-enabled by the model to be eligible for tp_size > 1; the weight loaders then shard parameters automatically, with no model-code changes. See Tensor parallelism in the model guide for the model-side details.
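
The two constraints just described (a tp_size entry needs that many ranks, and the node must be TP-enabled) can be checked before launch. The Python sketch below is one way to do so. The function check_tp is hypothetical, and the set of TP-enabled node names is supplied by the caller because this section does not list which nodes each model declares as TP-enabled. Whether mstar performs an equivalent check itself is not stated here.

```python
def check_tp(node_groups, tp_enabled_nodes):
    """Return a list of human-readable problems with tp_size usage (empty if none)."""
    problems = []
    for index, group in enumerate(node_groups):
        tp_size = group.get("tp_size", 1)  # assumption: omitted tp_size means no sharding
        if tp_size <= 1:
            continue
        if len(group["ranks"]) != tp_size:
            problems.append(
                f"entry {index}: tp_size={tp_size} but {len(group['ranks'])} rank(s) listed"
            )
        for name in group["node_names"]:
            if name not in tp_enabled_nodes:
                problems.append(f"entry {index}: node {name!r} is not TP-enabled")
    return problems


# The Orpheus layout from above; treating LLM as TP-enabled is an assumption
# made for this example.
orpheus_groups = [
    {"node_names": ["LLM"], "ranks": [0, 1], "tp_size": 2,
     "graph_walks": ["prefill", "decode"]},
    {"node_names": ["snac_decoder"], "ranks": [0], "graph_walks": ["snac_chunk"]},
]
print(check_tp(orpheus_groups, tp_enabled_nodes={"LLM"}))
```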

model_kwargs

model_kwargs are model parameters fixed at server start (not per request) — they are baked into the model’s config dataclass and into CUDA-graph captures. For example, the Pi0.5 DROID variant fixes the action horizon:

model: "pi05"
max_seq_len: 2048
model_kwargs:
  action_horizon: 15
node_groups:
  - {node_names: [vit_encoder], ranks: [0]}
  - {node_names: [LLM], ranks: [0]}

Per-request knobs (temperature, voice, max_output_tokens, …) are sent by the client instead — see Using a Server.
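
To illustrate what "fixed at server start" means in practice, the Python sketch below loads model_kwargs into a frozen dataclass, so the value cannot change once the server is up. Pi05ConfigSketch and config_from_kwargs are stand-ins invented for this example, not mstar's actual classes, and rejecting unknown keys is an assumption about reasonable behavior rather than documented behavior.

```python
from dataclasses import FrozenInstanceError, dataclass, fields


@dataclass(frozen=True)
class Pi05ConfigSketch:
    """Stand-in for a model config dataclass; not mstar's real class."""
    action_horizon: int


def config_from_kwargs(config_cls, model_kwargs):
    """Build the config once, at server start, from the YAML model_kwargs mapping."""
    known = {field.name for field in fields(config_cls)}
    unknown = set(model_kwargs) - known
    if unknown:
        # Assumption: unknown keys are treated as a configuration error.
        raise ValueError(f"unknown model_kwargs: {sorted(unknown)}")
    return config_cls(**model_kwargs)


cfg = config_from_kwargs(Pi05ConfigSketch, {"action_horizon": 15})
try:
    cfg.action_horizon = 10  # a per-request override is not possible
except FrozenInstanceError:
    print("action_horizon stays at", cfg.action_horizon)
```

Changing such a value therefore means editing the config and restarting the server, whereas per-request knobs travel with each client request.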

Tensor transport

Workers route tensors directly to one another using one of three transports, selected with --tensor-comm-protocol:

  • SHM — shared memory; the safe default for single-node deployments.

  • TCP — works anywhere; used for some multi-node setups.

  • RDMA — lowest latency for multi-GPU / multi-node, via the Mooncake transfer engine (requires RDMA-capable networking).
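
The choice between the three transports can be summarized as a rule of thumb. The Python sketch below encodes one reading of the list above: SHM on a single node, RDMA across nodes when RDMA-capable networking is present, TCP otherwise. The function names are hypothetical, and the rule is a simplification; for instance, it does not cover running RDMA between GPUs on a single node.

```python
def pick_tensor_protocol(single_node, rdma_available):
    """Rule-of-thumb transport choice based on the descriptions above."""
    if single_node:
        return "SHM"  # safe single-node default
    return "RDMA" if rdma_available else "TCP"


def protocol_args(single_node, rdma_available):
    """Return the flag pair to append to an mstar-serve command line."""
    return ["--tensor-comm-protocol", pick_tensor_protocol(single_node, rdma_available)]


# Single node without RDMA: mstar-serve would otherwise default to RDMA.
print(protocol_args(single_node=True, rdma_available=False))
```

Passing the flag explicitly matters most for mstar-serve, whose default is RDMA regardless of the hardware it runs on.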