Installation#
Requirements#
Python 3.12+.
Linux with an NVIDIA GPU and a recent CUDA toolkit for the GPU model families. A CPU-only machine can exercise the graph/worker plumbing in dummy mode (submodules return None) for development and the modular tests, but not real model inference.
Enough GPU memory for the model you intend to serve. Several families (e.g. BAGEL-7B, Qwen3-Omni-30B) are multi-GPU-class models.
Install from source#
mstar is installed from source in editable mode. We recommend uv to create the Python 3.12 environment:
git clone https://github.com/mstar-project/mstar.git
cd mstar
# Create and activate a Python 3.12 virtualenv (--seed adds pip to it)
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install --torch-backend=auto -e .
This pulls in the core runtime (PyTorch, FastAPI/Uvicorn, ZMQ, …) and installs the two
console scripts, mstar and mstar-serve.
Important
Always pass --torch-backend=auto. mstar pins PyTorch 2.9
(torch==2.9.1 / torchvision==0.24.1 / torchaudio==2.9.1); sgl-kernel
(Qwen3-Omni) is built against torch 2.9, so newer torch won’t work. The flag tells
uv to detect your driver’s CUDA version and fetch the matching torch build —
cu128 on a CUDA 12.x box, cu130 on a CUDA 13.x box. This matters because the
source-compiled extensions (flash-attn, sgl-kernel) build against your system
CUDA toolkit, whose major version must match torch’s. Without the flag uv installs
PyPI’s default (cu128) build, which then fails to compile flash-attn on a CUDA 13
machine with a “detected CUDA version mismatches … PyTorch” error. You can set it once
with export UV_TORCH_BACKEND=auto instead of repeating the flag. See Matching your
CUDA toolkit for details and the manual fallback.
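The major-version rule in the note above can be checked mechanically. The following is a minimal sketch, not part of mstar: the function names are hypothetical, and it assumes the toolkit version is read from the `release X.Y` text that `nvcc --version` prints and the torch side from `torch.version.cuda`.

```python
import re

# Illustrative sample of the relevant line printed by `nvcc --version`.
SAMPLE_NVCC_OUTPUT = "Cuda compilation tools, release 12.8, V12.8.93"


def cuda_major(version: str) -> int:
    """Return the major component of a CUDA version string such as '12.8'."""
    return int(version.split(".")[0])


def toolkit_version_from_nvcc(nvcc_output: str) -> str:
    """Extract 'X.Y' from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return match.group(1)


def majors_match(torch_cuda: str, nvcc_output: str) -> bool:
    """True when torch's CUDA major equals the system toolkit's CUDA major."""
    toolkit = toolkit_version_from_nvcc(nvcc_output)
    return cuda_major(torch_cuda) == cuda_major(toolkit)


# torch_cuda would normally come from torch.version.cuda after the install.
print(majors_match("12.8", SAMPLE_NVCC_OUTPUT))
print(majors_match("13.0", SAMPLE_NVCC_OUTPUT))
```

On a real machine you would pass the captured output of `nvcc --version` and the value of `torch.version.cuda` instead of the sample strings; a `False` result is the situation in which source-compiled extensions are expected to fail to build.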
Optional dependencies#
Model families and some output formats need extra packages, exposed as pip extras:
| Extra | Installs / use for |
|---|---|
| bagel | BAGEL runtime: … |
| qwen3_omni | Qwen3-Omni runtime: the BAGEL set plus … |
| … | Orpheus TTS runtime: … |
| … | Pi0.5 runtime: … |
| … | V-JEPA 2 runtime: … |
| all | The union of every model extra above: installs the full runtime for all model families in one shot. Convenient for a machine that serves multiple models; heavier and slower to install than a single family's extra. (Still excludes …) |
Combine extras as needed (keep --torch-backend=auto on every install):
uv pip install --torch-backend=auto -e ".[bagel,audio,dev]"
Tip
If you’re just getting started or have the disk/time to spare, .[all] is the
recommended install — it pulls every model family’s runtime so any model works out of
the box, with no need to track which extra goes with which model:
uv pip install --torch-backend=auto -e ".[all,dev]"
Note
torch, torchvision, and torchaudio are already in the base install; each
model extra adds that family’s remaining runtime — FlashInfer for the autoregressive
backbones, Transformers, safetensors, any codec/media libraries, and the Mooncake RDMA
transport for disaggregated deployments.
GPU libraries#
The GPU model families depend on:
FlashInfer (flashinfer-python): paged attention and continuous batching for the autoregressive backbones (every model with a KV_CACHE node runs attention through it).
flash-attn: used by Qwen3-Omni. Not installed by any extra; install it separately (see flash-attn (Qwen3-Omni)).
mooncake-transfer-engine: RDMA tensor transport for multi-GPU, disaggregated deployments. Single-node deployments can use shared-memory (SHM) or TCP transport instead (see Serving).
Apart from flash-attn, these are installed by the extras above. Your installed torch
must match your system CUDA toolkit; --torch-backend=auto handles that for you (see
Matching your CUDA toolkit).
flash-attn (Qwen3-Omni)#
flash-attn is only needed for Qwen3-Omni, and it is not pulled in by
.[qwen3_omni] or .[all] — you install it as a separate step. The reason: flash-attn
publishes no wheels on PyPI, so pip/uv fall back to compiling it from source, which is
slow and fails outright on CUDA 13 (its bundled CUTLASS predates CUDA 13’s vector-type
ABI change). Skip the build by installing the prebuilt wheel that matches your stack.
The wheels live on flash-attn’s GitHub releases, named by CUDA major, torch
version, Python tag, and C++ ABI. With the pinned torch 2.9 and Python 3.12 you want a
torch2.9 / cp312 / cxx11abiTRUE wheel — the only choice left is the CUDA major, which must
match your installed torch’s CUDA, not your system toolkit. Check it first:
python -c "import torch; print(torch.version.cuda)" # 12.8 -> cu12, 13.0 -> cu13
Then install the matching wheel by direct URL (don’t use --find-links — uv sorts the
+cu13… local version above +cu12… and will grab cu13 even on a CUDA 12 box):
# torch built for CUDA 12.x (cu12)
uv pip install \
"https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"
# torch built for CUDA 13.x (cu13)
uv pip install \
"https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu13torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"
Because it’s a binary wheel nothing compiles, so your system CUDA version is irrelevant — only the torch build matters. Verify with:
python -c "import flash_attn; print(flash_attn.__version__)"
(An undefined symbol error on import means the wheel’s cu1x / torch2.9 tag doesn’t
match your installed torch — recheck python -c "import torch; print(torch.__version__,
torch.version.cuda)" and pick the matching wheel.)
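The wheel selection described in this section comes down to one decision, the CUDA major of the installed torch. A minimal sketch of that choice follows; the helper name is hypothetical, and the URL pattern is copied from the two commands above rather than from any flash-attn API, so it only covers the v2.8.3 / torch 2.9 / cp312 combination shown there.

```python
FLASH_ATTN_VERSION = "2.8.3"
RELEASE_BASE = "https://github.com/Dao-AILab/flash-attention/releases/download"


def flash_attn_wheel_url(torch_cuda: str) -> str:
    """Build the wheel URL for a torch CUDA version string such as '12.8'.

    Only the CUDA major varies; torch 2.9, cp312 and cxx11abiTRUE are fixed
    to match the pinned stack described in this guide.
    """
    major = torch_cuda.split(".")[0]
    if major not in {"12", "13"}:
        raise ValueError(f"no wheel listed in this guide for CUDA {torch_cuda}")
    wheel = (
        f"flash_attn-{FLASH_ATTN_VERSION}+cu{major}torch2.9cxx11abiTRUE"
        "-cp312-cp312-linux_x86_64.whl"
    )
    return f"{RELEASE_BASE}/v{FLASH_ATTN_VERSION}/{wheel}"


# Pass the value printed by: python -c "import torch; print(torch.version.cuda)"
print(flash_attn_wheel_url("12.8"))
```

The printed URL can be handed to uv pip install as in the commands above. Raising on an unknown major is deliberate: silently falling back to another CUDA build is what produces the undefined symbol error at import time.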
Matching your CUDA toolkit#
PyPI’s default torch wheel targets one specific CUDA release (cu128), which may not match
your machine. flash-attn and sgl-kernel compile from source against your system
CUDA, so a mismatch with torch’s CUDA breaks the build. The simplest fix is to let uv
choose the right build automatically:
uv pip install --torch-backend=auto -e ".[all]"
--torch-backend=auto detects your driver (via nvidia-smi) and selects the matching
PyTorch index — cu128 on CUDA 12.x, cu130 on CUDA 13.x — for the runtime and for the
isolated environments that build flash-attn / sgl-kernel. The same command therefore
works unchanged across machines. (Needs a recent uv — run uv pip install --help and
look for --torch-backend if unsure; export UV_TORCH_BACKEND=auto is equivalent.)
Manual fallback. If you can’t use the flag, install the pinned torch trio from the
matching CUDA index first, then install mstar (the resolver then treats the
requirement as already satisfied):
# pick the index for your CUDA toolkit: cu128 (CUDA 12.8), cu130 (CUDA 13.x), …
uv pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 \
--index-url https://download.pytorch.org/whl/cu128
uv pip install -e ".[all]"
Check your driver’s CUDA version with nvidia-smi and pick the closest build at
https://pytorch.org/get-started/locally/. Keep all three packages on the same 2.9 /
0.24 line — torchvision and torchaudio are versioned in lockstep with torch.
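As a sketch of the manual fallback, the snippet below maps a CUDA version to the PyTorch index named in this section and prints the two install commands. The function names are hypothetical, and the mapping assumes only the two cases this guide mentions (cu128 for CUDA 12.x, cu130 for CUDA 13.x); other CUDA releases are rejected rather than guessed.

```python
# Pinned versions and index tags as stated in this guide.
PINNED = ("torch==2.9.1", "torchvision==0.24.1", "torchaudio==2.9.1")
INDEX_TAG_BY_MAJOR = {12: "cu128", 13: "cu130"}


def torch_index_url(cuda_version: str) -> str:
    """Return the PyTorch wheel index for a CUDA version string like '12.8'."""
    major = int(cuda_version.split(".")[0])
    if major not in INDEX_TAG_BY_MAJOR:
        raise ValueError(f"no index listed in this guide for CUDA {cuda_version}")
    return f"https://download.pytorch.org/whl/{INDEX_TAG_BY_MAJOR[major]}"


def fallback_commands(cuda_version: str) -> list:
    """Return the two manual-fallback commands, torch trio first, then mstar."""
    index = torch_index_url(cuda_version)
    return [
        f"uv pip install {' '.join(PINNED)} --index-url {index}",
        'uv pip install -e ".[all]"',
    ]


for command in fallback_commands("13.0"):
    print(command)
```

The order matters: the torch trio has to be installed from the CUDA-specific index before mstar, so the second command finds the pinned requirement already satisfied.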
Verify the install#
python -c "import mstar; print('mstar import OK')"
mstar --help
mstar-serve --help
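Beyond the import check, it can help to confirm that the pinned torch trio is what actually got installed. The following is a minimal sketch using the standard-library importlib.metadata; the helper names are hypothetical, and the pins are the ones stated earlier in this guide. Local version suffixes such as +cu128 are ignored when comparing.

```python
from importlib import metadata
from typing import Optional

# Pins as stated in this guide.
PINS = {"torch": "2.9.1", "torchvision": "0.24.1", "torchaudio": "2.9.1"}


def public_version(version: str) -> str:
    """Drop a local suffix: '2.9.1+cu128' becomes '2.9.1'."""
    return version.split("+", 1)[0]


def pin_mismatches(installed: dict) -> list:
    """Compare a {package: version-or-None} mapping against PINS."""
    problems = []
    for name, wanted in PINS.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: not installed (expected {wanted})")
        elif public_version(found) != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems


def installed_versions() -> dict:
    """Look up the installed version of each pinned package, None if absent."""
    versions = {}
    for name in PINS:
        found: Optional[str]
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        versions[name] = found
    return versions


# Prints one line per mismatch, or a confirmation if everything lines up.
for line in pin_mismatches(installed_versions()) or ["pinned torch trio OK"]:
    print(line)
```

Run it inside the activated virtualenv; any line other than the confirmation points at a package to reinstall from the matching CUDA index.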
Next: Quickstart.