Serving#

mstar has two ways to start a server:

mstar serve <model> — a one-command wrapper with sensible per-model defaults (most single-GPU; a few multi-GPU). Best for getting started and single-node runs.
mstar-serve --config <yaml> — the low-level entry point that takes an explicit config. Use it for custom layouts, disaggregation, and tensor parallelism.

mstar serve resolves a default config for the model, fills in the plumbing that mstar-serve needs (socket/upload dirs, a single-node-safe tensor protocol, the HF cache), and then delegates to it.

mstar serve#

mstar serve <model> [options]

<model> is one of bagel, bagel_cfg_parallel, qwen3_omni, orpheus, pi05, vjepa2, vjepa2_ac (or pass --config for any other deployment).

Option	Default	Description
`--host`	`0.0.0.0`	Bind address.
`--port`	`8000`	HTTP port.
`--gpus`	all visible	Sets `CUDA_VISIBLE_DEVICES` (e.g. `0` or `0,1,2`).
`--config`	model default	Override the resolved default config with a path to your own YAML.
`--cache-dir`	HF default	HuggingFace weight cache directory.
`--tensor-comm-protocol`	`SHM`	Tensor transport: `SHM` (safe single-node default), `TCP`, or `RDMA`.
`--socket-path-prefix`	`/tmp/mstar_<user>/`	ZMQ IPC socket prefix.
`--upload-dir`	`/tmp/mstar_uploads_<user>/`	Temp directory for uploaded media.
`--log-level`	`INFO`	`DEBUG` / `INFO` / `WARNING` / `ERROR`.

Each model maps to a default config (override with --config):

Model	Default config	GPUs
`bagel`	`configs/bagel_single_gpu.yaml`	1
`bagel_cfg_parallel`	`configs/bagel_cfg_parallel.yaml`	3
`orpheus`	`configs/orpheus_colocated.yaml`	1
`qwen3_omni`	`configs/qwen3omni_2gpu.yaml`	2
`pi05`	`configs/pi05.yaml`	1
`vjepa2`	`configs/vjepa2.yaml`	1
`vjepa2_ac`	`configs/vjepa2_ac.yaml`	1

mstar-serve#

mstar-serve --config configs/<model>.yaml [options]

Option	Default	Description
`--config` (required)	—	Path to the YAML config.
`--host` / `--port`	`0.0.0.0` / `8000`	Bind address / HTTP port.
`--tensor-comm-protocol`	`RDMA`	`RDMA`, `TCP`, or `SHM`.
`--cache-dir`	HF default	HuggingFace weight cache directory.
`--socket-path-prefix`	`/tmp/mstar`	ZMQ IPC socket prefix (shared with conductor/workers).
`--upload-dir`	`/tmp/mstar_uploads`	Temp directory for uploaded media.
`--timeout`	`600`	Per-request timeout (seconds).
`--mooncake-port`	`8080`	Port for the Mooncake RDMA transfer engine.
`--tcp-transfer-device`	(auto)	Network device for TCP tensor transport.
`--enable-nvtx`	off	Emit NVTX markers for profiling.
`--log-level`	`INFO`	`DEBUG` / `INFO` / `WARNING` / `ERROR`.

Note

mstar serve defaults the tensor protocol to SHM (safe on a single node), whereas mstar-serve defaults to RDMA. On a single node without RDMA, pass --tensor-comm-protocol SHM to mstar-serve.

Config files#

A config maps the model’s computation-graph nodes to physical GPU ranks. The keys:

Key	Meaning
`model`	Registry key of the model (see Supported Models).
`max_seq_len`	Maximum sequence length (sizes the KV cache).
`node_groups`	List of placements. Each entry assigns `node_names` to `ranks`, optionally scoped to specific `graph_walks` and/or sharded with `tp_size`.
`model_kwargs`	(optional) Server-init model parameters (see below).

Node names are model-specific — they are the keys of the model’s get_node_engine_types (e.g. BAGEL’s vit_encoder / vae_encoder / LLM, Orpheus’s LLM / snac_decoder).

Single GPU. Everything on rank 0:

model: "bagel"
max_seq_len: 32768
node_groups:
  - {node_names: [vit_encoder], ranks: [0]}
  - {node_names: [vae_encoder, vae_decoder], ranks: [0]}
  - {node_names: [LLM], ranks: [0]}

Disaggregation. The same node can live on different GPUs per graph walk — e.g. prefill, decode, and image generation on three GPUs:

node_groups:
  - {node_names: [LLM], ranks: [0], graph_walks: [prefill_text, prefill_vit, prefill_vae]}
  - {node_names: [LLM], ranks: [1], graph_walks: [decode]}
  - {node_names: [LLM], ranks: [2], graph_walks: [image_gen]}

Because placement is config-only, the same model code runs single-GPU or fully disaggregated. configs/ ships several layouts per model (*_single_gpu, *_colocated, *_pd_disaggregated, *_cfg_parallel, …).

Tensor parallelism. Shard a node across GPUs with tp_size and that many ranks:

model: "orpheus"
max_seq_len: 131072
node_groups:
  - {node_names: [LLM], ranks: [0, 1], tp_size: 2, graph_walks: [prefill, decode]}
  - {node_names: [snac_decoder], ranks: [0], graph_walks: [snac_chunk]}

A node must be declared TP-enabled by the model to be eligible for tp_size > 1; the weight loaders then shard parameters automatically, with no model-code changes. See Tensor parallelism in the model guide for the model-side details.

model_kwargs#

model_kwargs are model parameters fixed at server start (not per request) — they are baked into the model’s config dataclass and into CUDA-graph captures. For example, the Pi0.5 DROID variant fixes the action horizon:

model: "pi05"
max_seq_len: 2048
model_kwargs:
  action_horizon: 15
node_groups:
  - {node_names: [vit_encoder], ranks: [0]}
  - {node_names: [LLM], ranks: [0]}

Per-request knobs (temperature, voice, max_output_tokens, …) are sent by the client instead — see Using a Server.

Tensor transport#

Workers route tensors directly to one another using one of three transports, selected with --tensor-comm-protocol:

SHM — shared memory; the safe default for single-node deployments.
TCP — works anywhere; used for some multi-node setups.
RDMA — lowest latency for multi-GPU / multi-node, via the Mooncake transfer engine (requires RDMA-capable networking).

Serving

Contents

Serving#

mstar serve#

mstar-serve#

Config files#

model_kwargs#

Tensor transport#