Supported Models
mstar ships the following model families. The table below lists each registered
family's registry key (the value of `model:` in a config YAML), a representative
Hugging Face model ID, and a short description.
Registry keys live in `mstar/model/registry.py` (`MODEL_REGISTRY` / `HF_MODELS`).
| Registry key | Example Hugging Face model ID | Description |
|---|---|---|
| | | Unified multimodal model (text + image understanding and generation). |
| | | TTS: Llama 3.2 3B LLM emitting audio tokens + SNAC 24 kHz decoder. |
| | | Pi0.5 vision-language-action robotics model (ViT encoder + LLM + flow action expert). |
| | | Omni-modal (text/image/audio/video in, text/audio out): Thinker + Talker + codec. |
| | | V-JEPA 2 video encoder + masked predictor. |
| | | V-JEPA 2-AC encoder + action-conditioned predictor. |
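As a rough illustration of how a registry key taken from a config might be mapped to a family entry, here is a minimal sketch. It does not reproduce `mstar/model/registry.py`: `EXAMPLE_REGISTRY`, `FamilyEntry`, `lookup_family`, the key `example_family`, and the ID `example-org/example-model` are all hypothetical placeholders, and the config is assumed to be already parsed from YAML into a plain dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyEntry:
    """Hypothetical record for one model family (not mstar's actual type)."""
    hf_model_id: str
    description: str


# Hypothetical stand-in for a key -> family mapping. The key and the
# Hugging Face ID below are placeholders, not real mstar registry values.
EXAMPLE_REGISTRY = {
    "example_family": FamilyEntry(
        hf_model_id="example-org/example-model",
        description="Placeholder family used only for this sketch.",
    ),
}


def lookup_family(config: dict) -> FamilyEntry:
    """Return the entry named by the config's `model` field.

    Raises ValueError listing the registered keys if the key is unknown.
    """
    key = config["model"]
    try:
        return EXAMPLE_REGISTRY[key]
    except KeyError:
        known = ", ".join(sorted(EXAMPLE_REGISTRY))
        raise ValueError(
            f"Unknown model family {key!r}; registered keys: {known}"
        ) from None


# Usage: a config equivalent to the YAML line `model: example_family`.
entry = lookup_family({"model": "example_family"})
print(entry.hf_model_id)
```

Failing with a message that lists the registered keys is a design choice of this sketch; it makes a typo in `model:` easy to spot.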
Notes
The IDs above are representative. You may use local paths or compatible variants.
Some families accept multimodal input (image/audio/video); see the model’s
`process_prompt` for the inputs it expects.
To add a new family, see Adding a New Model.
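The note about local paths can be pictured with a small sketch that decides whether a model reference points at a local directory or should be treated as a Hugging Face ID. This is an assumption about one reasonable approach, not a description of how mstar resolves models: `resolve_model_source` and the placeholder ID `example-org/example-model` are hypothetical.

```python
from pathlib import Path


def resolve_model_source(model_ref: str, default_hf_id: str) -> tuple[str, str]:
    """Classify a model reference (hypothetical helper, not an mstar API).

    Returns ("local", absolute_path) if model_ref is an existing directory,
    ("hub", model_ref) if it is any other non-empty string, and
    ("hub", default_hf_id) if model_ref is empty.
    """
    if not model_ref:
        # Nothing specified: fall back to the family's representative ID.
        return ("hub", default_hf_id)
    path = Path(model_ref).expanduser()
    if path.is_dir():
        # An existing directory is treated as a local checkpoint.
        return ("local", str(path.resolve()))
    # Otherwise assume the string names a (compatible) Hugging Face model.
    return ("hub", model_ref)


# Usage: an empty reference falls back to the placeholder default ID.
print(resolve_model_source("", "example-org/example-model"))
```

Checking for an existing directory first means a local checkpoint takes precedence over a Hub ID with the same name, which is a choice of this sketch.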