Architecture

High-level components

mstar is organized as a set of cooperating processes:

API server (mstar/api_server/): FastAPI layer that accepts POST /generate, tokenizes the prompt and loads media, dispatches the request, and streams results back to the client. Entry point: mstar.api_server.entrypoint:main (the mstar-serve console script).
Conductor (mstar/conductor/): central coordinator. It manages the request lifecycle, handles graph-walk transitions, selects workers, routes inputs, and detects completion.
Workers (mstar/worker/): one process per GPU. Each runs an engine manager, a micro-scheduler (continuous batching), and a KV cache manager, and routes tensors directly to downstream workers.
Engines (mstar/engine/): execution backends that actually run submodules on the GPU — KVCacheEngine (nodes with a persistent paged KV cache, e.g. autoregressive LLMs and LLM-as-denoiser flow loops) and StatelessEngine (everything else: ViT/VAE encoders and decoders, codec decoders, projection/combine stages).
Models (mstar/model/): each model declares its computation graph, tokenization, engine types, and submodules. Registered via mstar/model/registry.py.
Graph (mstar/graph/): computation-graph primitives — GraphNode, Sequential, Parallel, Loop, GraphEdge.
Communication (mstar/communication/): ZMQ-based IPC/TCP messaging; tensor transport over RDMA or TCP.
Streaming (mstar/streaming/): streaming output with configurable chunking policies and async partition topology.
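
To make the graph primitives more concrete, here is a minimal, self-contained sketch of how GraphNode, Sequential, Parallel, and Loop could compose into an ordered execution plan. Only the four class names come from the list above. The fields (name, children, body, max_iters), the stages helper, and the node names are assumptions made for illustration, not the actual definitions in mstar/graph/.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class GraphNode:
    name: str  # logical node name (hypothetical field)


@dataclass
class Sequential:
    children: List["Graph"]  # children run one after another


@dataclass
class Parallel:
    children: List["Graph"]  # branches with no ordering between them


@dataclass
class Loop:
    body: "Graph"
    max_iters: int  # hypothetical fixed bound; a real loop may use a stop condition


Graph = Union[GraphNode, Sequential, Parallel, Loop]


def stages(g: Graph) -> List[List[str]]:
    """Flatten a graph into ordered stages; nodes within one stage may run concurrently."""
    if isinstance(g, GraphNode):
        return [[g.name]]
    if isinstance(g, Sequential):
        out: List[List[str]] = []
        for child in g.children:
            out.extend(stages(child))
        return out
    if isinstance(g, Parallel):
        branches = [stages(child) for child in g.children]
        depth = max((len(b) for b in branches), default=0)
        merged: List[List[str]] = []
        for i in range(depth):
            stage: List[str] = []
            for branch in branches:
                if i < len(branch):
                    stage.extend(branch[i])
            merged.append(stage)
        return merged
    if isinstance(g, Loop):
        # Unroll the body max_iters times, copying each stage.
        return [list(s) for _ in range(g.max_iters) for s in stages(g.body)]
    raise TypeError(f"unsupported graph element: {type(g).__name__}")


# Illustrative walk: two encoders in parallel, a combine stage, then a short loop.
walk = Sequential([
    Parallel([GraphNode("vit"), GraphNode("text_embed")]),
    GraphNode("combine"),
    Loop(GraphNode("llm"), max_iters=2),
])
plan = stages(walk)
print(plan)
```

The sketch unrolls loops eagerly to keep the example short. A scheduler that re-evaluates the loop condition after each iteration would avoid committing to an iteration count up front.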

Models define execution plans. Each model provides its own graph walks (e.g. prefill, decode, image_gen) via get_graph_walk_graphs().
Disaggregated. Logical computation nodes map to physical workers via the YAML config’s node_groups (node names → GPU ranks).
Graph-driven scheduling. The conductor schedules graph walks and their transitions to coordinate multi-engine pipelines, including async producer/consumer partitions.
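
The node-to-worker mapping can be pictured with a small sketch. The dictionary below stands in for a parsed node_groups section. Its exact schema, the node names, and the helpers ranks_for and nodes_by_rank are assumptions for illustration only; the real YAML layout may differ.

```python
from typing import Dict, List

# Hypothetical parsed form of a `node_groups` section: node name -> GPU ranks.
node_groups: Dict[str, List[int]] = {
    "vit_encoder": [0],
    "llm": [1, 2],
    "vae_decoder": [3],
}


def ranks_for(node: str, groups: Dict[str, List[int]]) -> List[int]:
    """Return the GPU ranks that host a logical node."""
    if node not in groups:
        raise KeyError(f"node {node!r} has no entry in node_groups")
    return groups[node]


def nodes_by_rank(groups: Dict[str, List[int]]) -> Dict[int, List[str]]:
    """Invert the mapping to see which nodes share a worker."""
    inverse: Dict[int, List[str]] = {}
    for node, ranks in groups.items():
        for rank in ranks:
            inverse.setdefault(rank, []).append(node)
    return inverse


print(ranks_for("llm", node_groups))
print(nodes_by_rank(node_groups))
```

Inverting the mapping is a quick way to check whether two nodes are colocated on one GPU or placed on separate workers.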

The API server receives a request, loads media, and calls the model’s process_prompt to produce the initial tensors.
The conductor seeds the initial graph walk (e.g. prefill) and asks the model for the next forward-pass arguments after each graph walk completes.
Workers execute the ready graph nodes on the GPU through the appropriate engine and route output tensors to downstream nodes and workers.
Outputs marked for the client are post-processed (postprocess) and streamed back.
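
The four steps above can be sketched as a single-process toy loop. The method names process_prompt and postprocess and the walk names prefill and decode appear in the text. Everything else (ToyModel, run_walk, serve, and the max_new_tokens stopping rule) is hypothetical and only stands in for the real API server, conductor, and workers.

```python
from typing import Iterator, List


class ToyModel:
    """Stand-in model; the behavior of each method is invented for this example."""

    def process_prompt(self, prompt: str) -> List[int]:
        # Pretend tokenization: one integer per character.
        return [ord(c) for c in prompt]

    def run_walk(self, walk: str, state: List[int]) -> int:
        # Stands in for workers executing the walk's nodes on GPU.
        return state[-1] + 1

    def postprocess(self, token: int) -> str:
        return chr(token)


def serve(model: ToyModel, prompt: str, max_new_tokens: int) -> Iterator[str]:
    if max_new_tokens <= 0:
        return
    state = model.process_prompt(prompt)  # step 1: API server builds initial inputs
    prompt_len = len(state)
    walk = "prefill"                      # step 2: conductor seeds the first walk
    while walk is not None:
        token = model.run_walk(walk, state)  # step 3: workers execute the walk
        state.append(token)
        yield model.postprocess(token)       # step 4: post-process and stream
        generated = len(state) - prompt_len
        # Conductor transition: keep decoding until the budget is used up.
        walk = "decode" if generated < max_new_tokens else None


chunks = list(serve(ToyModel(), "hi", max_new_tokens=3))
print("".join(chunks))
```

In the real system these steps run in separate processes and exchange messages and tensors; the generator here only mirrors the order of operations.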