Serving
mstar has two ways to start a server:
- mstar serve <model>: a one-command wrapper with sensible per-model defaults (most single-GPU; a few multi-GPU). Best for getting started and single-node runs.
- mstar-serve --config <yaml>: the low-level entry point that takes an explicit config. Use it for custom layouts, disaggregation, and tensor parallelism.
mstar serve resolves a default config for the model, fills in the plumbing that
mstar-serve needs (socket/upload dirs, a single-node-safe tensor protocol, the HF
cache), and then delegates to it.
mstar serve
mstar serve <model> [options]
<model> is one of bagel, bagel_cfg_parallel, qwen3_omni, orpheus,
pi05, vjepa2, vjepa2_ac (or pass --config for any other deployment).
| Option | Default | Description |
|---|---|---|
| | | Bind address. |
| | | HTTP port. |
| | all visible | Sets |
| --config | model default | Override the resolved default config with a path to your own YAML. |
| | HF default | HuggingFace weight cache directory. |
| | SHM | Tensor transport: SHM, TCP, or RDMA. |
| | | ZMQ IPC socket prefix. |
| | | Temp directory for uploaded media. |
| | | |
Each model maps to a default config (override with --config):
| Model | Default config | GPUs |
|---|---|---|
| bagel | | 1 |
| bagel_cfg_parallel | | 3 |
| qwen3_omni | | 1 |
| orpheus | | 2 |
| pi05 | | 1 |
| vjepa2 | | 1 |
| vjepa2_ac | | 1 |
mstar-serve
mstar-serve --config configs/<model>.yaml [options]
| Option | Default | Description |
|---|---|---|
| --config | — | Path to the YAML config. |
| | | Bind address / HTTP port. |
| | | |
| | HF default | HuggingFace weight cache directory. |
| | | ZMQ IPC socket prefix (shared with conductor/workers). |
| | | Temp directory for uploaded media. |
| | | Per-request timeout (seconds). |
| | | Port for the Mooncake RDMA transfer engine. |
| | (auto) | Network device for TCP tensor transport. |
| | off | Emit NVTX markers for profiling. |
| | | |
Note
mstar serve defaults the tensor protocol to SHM (safe on a single node),
whereas mstar-serve defaults to RDMA. On a single node without RDMA, pass
--tensor-comm-protocol SHM to mstar-serve.
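The note above can be sketched as a small helper that assembles the mstar-serve command line. This is a minimal sketch, not part of mstar: the function name build_mstar_serve_argv and the path configs/my_layout.yaml are hypothetical, and only the two flags named on this page (--config and --tensor-comm-protocol) are used.

```python
from typing import List, Optional

# Protocol names as listed in the "Tensor transport" section of this page.
ALLOWED_PROTOCOLS = {"SHM", "TCP", "RDMA"}


def build_mstar_serve_argv(config_path: str,
                           tensor_protocol: Optional[str] = None) -> List[str]:
    """Assemble an argv list for the low-level mstar-serve entry point.

    Hypothetical helper: it only strings together flags named on this page.
    """
    argv = ["mstar-serve", "--config", config_path]
    if tensor_protocol is not None:
        if tensor_protocol not in ALLOWED_PROTOCOLS:
            raise ValueError(f"unknown tensor protocol: {tensor_protocol!r}")
        argv += ["--tensor-comm-protocol", tensor_protocol]
    return argv


# Single node without RDMA: override mstar-serve's RDMA default explicitly.
# The config path is a placeholder, not a file known to ship with mstar.
argv = build_mstar_serve_argv("configs/my_layout.yaml", tensor_protocol="SHM")
print(" ".join(argv))
# mstar-serve --config configs/my_layout.yaml --tensor-comm-protocol SHM
```

The resulting list can be handed to subprocess.run; omitting tensor_protocol leaves mstar-serve on its own default.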
Config files
A config maps the model’s computation-graph nodes to physical GPU ranks. The keys:
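| Key | Meaning |
|---|---|
| model | Registry key of the model (see Supported Models). |
| max_seq_len | Maximum sequence length (sizes the KV cache). |
| node_groups | List of placements. Each entry assigns node_names to ranks, optionally with graph_walks and tp_size. |
| model_kwargs | (optional) Server-init model parameters (see below). |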
Node names are model-specific — they are the keys of the model’s
get_node_engine_types (e.g. BAGEL’s vit_encoder / vae_encoder / LLM,
Orpheus’s LLM / snac_decoder).
Single GPU. Everything on rank 0:
model: "bagel"
max_seq_len: 32768
node_groups:
- {node_names: [vit_encoder], ranks: [0]}
- {node_names: [vae_encoder, vae_decoder], ranks: [0]}
- {node_names: [LLM], ranks: [0]}
Disaggregation. The same node can live on different GPUs per graph walk — e.g. prefill, decode, and image generation on three GPUs:
node_groups:
- {node_names: [LLM], ranks: [0], graph_walks: [prefill_text, prefill_vit, prefill_vae]}
- {node_names: [LLM], ranks: [1], graph_walks: [decode]}
- {node_names: [LLM], ranks: [2], graph_walks: [image_gen]}
Because placement is config-only, the same model code runs single-GPU or fully
disaggregated. configs/ ships several layouts per model (*_single_gpu,
*_colocated, *_pd_disaggregated, *_cfg_parallel, …).
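To make the placement idea concrete, the sketch below treats the disaggregated layout above as a plain Python dict and answers two questions: which ranks the layout touches, and which ranks host a node for a given graph walk. The helper names ranks_used and ranks_for are hypothetical, and the rule that a group without graph_walks applies to every walk is an assumption, not documented mstar behavior.

```python
from typing import Dict, List, Optional, Set

# The disaggregated node_groups shown above, written as a Python dict.
config = {
    "node_groups": [
        {"node_names": ["LLM"], "ranks": [0],
         "graph_walks": ["prefill_text", "prefill_vit", "prefill_vae"]},
        {"node_names": ["LLM"], "ranks": [1], "graph_walks": ["decode"]},
        {"node_names": ["LLM"], "ranks": [2], "graph_walks": ["image_gen"]},
    ],
}


def ranks_used(cfg: Dict) -> Set[int]:
    """Collect every GPU rank referenced by any placement entry."""
    return {rank for group in cfg["node_groups"] for rank in group["ranks"]}


def ranks_for(cfg: Dict, node: str, walk: str) -> Optional[List[int]]:
    """Return the ranks hosting `node` during `walk`, or None if unplaced.

    Assumption: an entry without `graph_walks` applies to every walk.
    """
    for group in cfg["node_groups"]:
        if node not in group["node_names"]:
            continue
        walks = group.get("graph_walks")
        if walks is None or walk in walks:
            return group["ranks"]
    return None


print(sorted(ranks_used(config)))             # [0, 1, 2]
print(ranks_for(config, "LLM", "decode"))     # [1]
print(ranks_for(config, "LLM", "image_gen"))  # [2]
```

Swapping in the single-GPU layout changes only the dict; the lookup code stays the same, which mirrors the point that placement lives in the config rather than in model code.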
Tensor parallelism. Shard a node across GPUs with tp_size and that many ranks:
model: "orpheus"
max_seq_len: 131072
node_groups:
- {node_names: [LLM], ranks: [0, 1], tp_size: 2, graph_walks: [prefill, decode]}
- {node_names: [snac_decoder], ranks: [0], graph_walks: [snac_chunk]}
A node must be declared TP-enabled by the model to be eligible for tp_size > 1; the
weight loaders then shard parameters automatically, with no model-code changes. See
Tensor parallelism in the model guide for the model-side
details.
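A quick pre-flight check for the "tp_size and that many ranks" rule might look like the sketch below. It is a hypothetical helper (check_tp_groups is not an mstar function) that only inspects entries declaring tp_size; whether mstar itself validates this, and how it treats entries without tp_size, is not covered here.

```python
from typing import Dict, List

# The Orpheus tensor-parallel layout from above, as a Python dict.
orpheus = {
    "model": "orpheus",
    "max_seq_len": 131072,
    "node_groups": [
        {"node_names": ["LLM"], "ranks": [0, 1], "tp_size": 2,
         "graph_walks": ["prefill", "decode"]},
        {"node_names": ["snac_decoder"], "ranks": [0],
         "graph_walks": ["snac_chunk"]},
    ],
}


def check_tp_groups(cfg: Dict) -> List[str]:
    """Return one message per entry whose tp_size disagrees with its ranks.

    Only entries that declare tp_size are checked; others are skipped.
    """
    problems = []
    for index, group in enumerate(cfg["node_groups"]):
        if "tp_size" not in group:
            continue
        tp_size = group["tp_size"]
        ranks = group["ranks"]
        if len(set(ranks)) != len(ranks):
            problems.append(f"node_groups[{index}]: duplicate ranks {ranks}")
        elif len(ranks) != tp_size:
            problems.append(
                f"node_groups[{index}]: tp_size={tp_size} "
                f"but {len(ranks)} rank(s) listed"
            )
    return problems


print(check_tp_groups(orpheus))  # []
```

A layout that sets tp_size: 2 but lists a single rank would produce one message naming the offending entry.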
model_kwargs
model_kwargs are model parameters fixed at server start (not per request) — they are
baked into the model’s config dataclass and into CUDA-graph captures. For example, the
Pi0.5 DROID variant fixes the action horizon:
model: "pi05"
max_seq_len: 2048
model_kwargs:
action_horizon: 15
node_groups:
- {node_names: [vit_encoder], ranks: [0]}
- {node_names: [LLM], ranks: [0]}
Per-request knobs (temperature, voice, max_output_tokens, …) are sent by the
client instead — see Using a Server.
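The split between server-start parameters and per-request knobs can be illustrated with a frozen dataclass. Everything here is a hypothetical stand-in: Pi05ConfigSketch, its placeholder default, and build_model_config are not mstar's actual classes, and the real config dataclass will have more fields.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Pi05ConfigSketch:
    """Hypothetical stand-in for a model's config dataclass."""
    action_horizon: int = 10  # placeholder default, not the real value


def build_model_config(server_cfg: dict) -> Pi05ConfigSketch:
    """Bake model_kwargs into an immutable config once, at server start."""
    kwargs = server_cfg.get("model_kwargs", {})
    known = {f.name for f in fields(Pi05ConfigSketch)}
    unknown = set(kwargs) - known
    if unknown:
        raise ValueError(f"unknown model_kwargs: {sorted(unknown)}")
    return Pi05ConfigSketch(**kwargs)


# The pi05 config from above, as a Python dict (node_groups omitted).
server_cfg = {
    "model": "pi05",
    "max_seq_len": 2048,
    "model_kwargs": {"action_horizon": 15},
}

model_cfg = build_model_config(server_cfg)
print(model_cfg.action_horizon)  # 15
```

Because the dataclass is frozen, assigning to model_cfg.action_horizon later raises dataclasses.FrozenInstanceError, which mirrors the idea that values baked into CUDA-graph captures cannot change per request.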
Tensor transport
Workers route tensors directly to one another using one of three transports, selected with
--tensor-comm-protocol:
- SHM: shared memory; the safe default for single-node deployments.
- TCP: works anywhere; used for some multi-node setups.
- RDMA: lowest latency for multi-GPU / multi-node, via the Mooncake transfer engine (requires RDMA-capable networking).
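One way to read the three transports as a decision rule is sketched below. The function pick_transport and its inputs are hypothetical, and the rule is only one reasonable interpretation of the descriptions above (SHM on a single node, RDMA across nodes when the network supports it, TCP otherwise), not a policy mstar applies for you.

```python
from enum import Enum


class TensorTransport(Enum):
    """The three values accepted by --tensor-comm-protocol."""
    SHM = "SHM"
    TCP = "TCP"
    RDMA = "RDMA"


def pick_transport(num_nodes: int, has_rdma: bool) -> TensorTransport:
    """Suggest a transport from the node count and RDMA availability.

    Hypothetical heuristic based on the descriptions on this page.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_nodes == 1:
        return TensorTransport.SHM   # safe single-node default
    if has_rdma:
        return TensorTransport.RDMA  # lowest latency across nodes
    return TensorTransport.TCP       # works anywhere


print(pick_transport(1, has_rdma=False).value)  # SHM
print(pick_transport(2, has_rdma=True).value)   # RDMA
print(pick_transport(2, has_rdma=False).value)  # TCP
```

The returned value can be passed straight to --tensor-comm-protocol.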