Architecture

High-level components

mstar is organized as a set of cooperating processes:

  • API server (mstar/api_server/): FastAPI layer that accepts POST /generate, tokenizes input and loads media, dispatches the request, and streams results back to the client. Entry point: mstar.api_server.entrypoint:main (the mstar-serve console script).

  • Conductor (mstar/conductor/): central coordinator. It manages the request lifecycle, handles graph-walk transitions, selects workers, routes inputs, and detects completion.

  • Workers (mstar/worker/): one process per GPU. Each runs an engine manager, a micro-scheduler (continuous batching), and a KV cache manager, and routes tensors directly to downstream workers.

  • Engines (mstar/engine/): execution backends that actually run submodules on the GPU — KVCacheEngine (nodes with a persistent paged KV cache, e.g. autoregressive LLMs and LLM-as-denoiser flow loops) and StatelessEngine (everything else: ViT/VAE encoders and decoders, codec decoders, projection/combine stages).

  • Models (mstar/model/): each model declares its computation graph, tokenization, engine types, and submodules. Registered via mstar/model/registry.py.

  • Graph (mstar/graph/): computation-graph primitives — GraphNode, Sequential, Parallel, Loop, GraphEdge.

  • Communication (mstar/communication/): ZMQ-based IPC/TCP messaging; tensor transport over RDMA or TCP.

  • Streaming (mstar/streaming/): streaming output with configurable chunking policies and async partition topology.
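
To make the composition idea behind the graph primitives concrete, here is a minimal sketch in plain Python. The classes Node, Seq, Par, and Loop are local stand-ins defined for this example; they are not the mstar.graph classes, whose constructors and fields may differ. The node names, the engine labels, and the fixed iteration count are likewise assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Node:
    """Stand-in for one computation node (hypothetical fields)."""
    name: str
    engine: str  # "kv_cache" or "stateless", mirroring the two engine kinds


@dataclass
class Seq:
    """Children run one after another."""
    children: List["Step"]


@dataclass
class Par:
    """Children are independent and may run concurrently."""
    children: List["Step"]


@dataclass
class Loop:
    """Body repeats a fixed number of times (an assumption for this sketch)."""
    body: "Step"
    iterations: int


Step = Union[Node, Seq, Par, Loop]


def stages(step: Step) -> List[List[str]]:
    """Flatten a composition into stages; names within a stage can run together."""
    if isinstance(step, Node):
        return [[step.name]]
    if isinstance(step, Seq):
        out: List[List[str]] = []
        for child in step.children:
            out.extend(stages(child))
        return out
    if isinstance(step, Par):
        # Align the branches so that stage i of every branch runs together.
        branches = [stages(child) for child in step.children]
        depth = max((len(b) for b in branches), default=0)
        return [[n for b in branches if i < len(b) for n in b[i]]
                for i in range(depth)]
    if isinstance(step, Loop):
        return stages(step.body) * step.iterations
    raise TypeError(f"unsupported step: {step!r}")


# Hypothetical image-generation walk: two encoders in parallel, a short
# denoising loop on a KV-cache node, then a stateless decoder.
image_gen = Seq([
    Par([Node("vit_encoder", "stateless"), Node("text_embed", "stateless")]),
    Loop(Node("llm_denoiser", "kv_cache"), iterations=2),
    Node("vae_decoder", "stateless"),
])

print(stages(image_gen))
```

The flattening is only a reading aid: it shows which nodes could be ready at the same time, which is the kind of information a scheduler would need, without claiming anything about how the conductor actually orders work.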

Core design principles

  • Models define execution plans. Each model provides its own graph walks (e.g. prefill, decode, image_gen) via get_graph_walk_graphs().

  • Disaggregated. Logical computation nodes map to physical workers via the YAML config’s node_groups (node names → GPU ranks).

  • Graph-driven scheduling. The conductor schedules graph walks and their transitions to coordinate multi-engine pipelines, including async producer/consumer partitions.
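
The mapping from logical nodes to physical workers can be pictured with a small sketch. The dictionary below is a hypothetical node_groups value as it might look once the YAML config has been loaded; the node names and ranks are invented, and the exact YAML schema is not shown here. The least-loaded selection rule in pick_rank is also an assumption, used only to illustrate what worker selection could look like.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical node_groups content: node name -> GPU ranks that host it.
node_groups: Dict[str, List[int]] = {
    "vit_encoder": [0],
    "llm": [1, 2],
    "vae_decoder": [0],
}


def nodes_by_rank(groups: Dict[str, List[int]]) -> Dict[int, List[str]]:
    """Invert the mapping to see which nodes each GPU rank must host."""
    placement: Dict[int, List[str]] = defaultdict(list)
    for node, ranks in groups.items():
        for rank in ranks:
            placement[rank].append(node)
    return {rank: sorted(nodes) for rank, nodes in sorted(placement.items())}


def pick_rank(groups: Dict[str, List[int]], node: str, load: Dict[int, int]) -> int:
    """Pick the least-loaded rank hosting `node` (illustrative policy only)."""
    return min(groups[node], key=lambda rank: (load.get(rank, 0), rank))


print(nodes_by_rank(node_groups))
print(pick_rank(node_groups, "llm", {1: 3, 2: 1}))
```

In this toy layout the two stateless stages share rank 0 while the LLM is replicated on ranks 1 and 2, so a request for the llm node can go to whichever replica currently has fewer queued items.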

Execution flow (simplified)

  1. The API server receives a request, loads media, and calls the model’s process_prompt to produce the initial tensors.

  2. The conductor seeds the initial graph walk (e.g. prefill) and asks the model for the next forward-pass arguments after each graph walk completes.

  3. Workers execute the ready graph nodes on GPU through the appropriate engine and route outputs (tensors) to downstream nodes/workers.

  4. Outputs marked for the client are post-processed (postprocess) and streamed back.
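
The four steps above can be sketched as a single-process toy loop. Only the method names process_prompt and postprocess and the walk names prefill and decode come from the text; ToyModel, next_walk, execute_walk, serve, and the shape of the state dictionary are hypothetical, and the arithmetic stands in for real GPU work. In mstar these roles are split across the API server, the conductor, and the workers rather than living in one function.

```python
from typing import Dict, Iterator, List, Optional


class ToyModel:
    """Hypothetical model used only to illustrate the control flow."""

    def process_prompt(self, prompt: str) -> List[int]:
        # Stand-in tokenizer: one fake token id per whitespace-separated word.
        return [len(word) for word in prompt.split()]

    def next_walk(self, finished: str, state: Dict) -> Optional[str]:
        # Keep decoding until enough tokens exist, then signal completion.
        if len(state["generated"]) < state["max_tokens"]:
            return "decode"
        return None

    def postprocess(self, token: int) -> str:
        return f"<{token}>"


def execute_walk(walk: str, state: Dict) -> None:
    """Stand-in for workers running the ready graph nodes through an engine."""
    if walk == "prefill":
        state["context"] = list(state["inputs"])
    else:
        # Fake decode step: derive a token from the context and step count.
        state["generated"].append(sum(state["context"]) + len(state["generated"]))


def serve(model: ToyModel, prompt: str, max_tokens: int) -> Iterator[str]:
    # Step 1: build the initial inputs from the prompt.
    state = {
        "inputs": model.process_prompt(prompt),
        "generated": [],
        "max_tokens": max_tokens,
    }
    # Step 2: seed the first walk, then ask the model what comes next.
    walk: Optional[str] = "prefill"
    while walk is not None:
        execute_walk(walk, state)  # Step 3: run the walk.
        if walk == "decode":
            # Step 4: post-process client-facing output and stream it.
            yield model.postprocess(state["generated"][-1])
        walk = model.next_walk(walk, state)


for chunk in serve(ToyModel(), "hello big world", max_tokens=3):
    print(chunk)
```

Writing serve as a generator mirrors the streaming behavior: each decode walk produces one chunk for the client as soon as it finishes, instead of waiting for the whole request to complete.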