StarServe: Elastic Model Parallelism for Heterogeneous GPU Clusters
Dynamic tensor/pipeline/expert parallelism that adapts to cluster topology in real time. 2.3× throughput improvement over static partitioning on mixed GPU fleets.
arXiv →
High-performance inference serving for large language models. Deploy any open-weight model with sub-millisecond latency, automatic scaling, and zero DevOps overhead.
From Llama to DeepSeek, from a single node to thousands, IStars handles the hard parts so your team can focus on the product.
Continuous batching, speculative decoding, PagedAttention. Sub-5ms P50 latency on 70B+ models at production traffic.
One-line deploys from Hugging Face, private registries, or our curated collection of optimized checkpoints. Quantized, ready to serve.
Token-level tracing, throughput dashboards, cost attribution, and drift detection. Know exactly what your models are doing in production.
Run entirely inside your VPC or on bare metal. No data leaves your network. SOC 2 Type II, HIPAA, and ISO 27001 compliant.
Our team actively contributes to the systems-for-ML community. Recent work from IStars Research.
Dynamic tensor/pipeline/expert parallelism that adapts to cluster topology in real time. 2.3× throughput improvement over static partitioning on mixed GPU fleets.
arXiv →
A preemptive scheduling algorithm that guarantees P99 latency bounds under adversarial traffic patterns. Deployed in production across 12,000+ GPUs.
arXiv →
Intelligent KV-cache sharing across tenants with shared system prompts. 40% memory reduction with no accuracy degradation.
arXiv →
# Install the IStars CLI
curl -fsSL https://istars.space/install.sh | sh
# Deploy DeepSeek-V3 with one command
istars deploy deepseek-ai/DeepSeek-V3 \
  --replicas 2 \
  --quantization fp8 \
  --region us-west
# Send your first inference request
curl https://api.istars.space/v1/chat/completions \
  -H "Authorization: Bearer $ISTARS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-v3","messages":[{"role":"user","content":"Hello, world!"}]}'
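If the prompt comes from a variable rather than a literal, quotes inside it can break the hand-written JSON body above. The following is a minimal sketch of one way to build that body in plain shell. The helper names `json_escape` and `build_chat_payload` are hypothetical and not part of the IStars CLI; the body shape is copied from the request above, and only backslashes and double quotes are escaped.

```shell
#!/bin/sh
# Hypothetical helpers (assumptions, not part of the IStars CLI or API).

json_escape() {
  # Escape backslashes first, then double quotes.
  # Newlines and other control characters are NOT handled here;
  # use a real JSON tool for arbitrary input.
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

build_chat_payload() {
  # $1 = model name, $2 = a single user prompt
  model=$(json_escape "$1")
  prompt=$(json_escape "$2")
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' \
    "$model" "$prompt"
}

# Print the request body for the example prompt.
build_chat_payload "deepseek-v3" "Hello, world!"
echo
```

To use it, pass the result to curl in place of the literal body: -d "$(build_chat_payload deepseek-v3 "$PROMPT")".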
Full API reference, SDK guides, and deployment patterns at docs.istars.space →
IStars Space was founded in 2024 by a team of distributed systems researchers from UC Berkeley, MIT, and Tsinghua. We saw a gap: model capabilities were advancing faster than the infrastructure to serve them.
Today, we serve billions of tokens per day across a globally distributed GPU fleet. Our customers range from AI-native startups to Fortune 500 enterprises deploying LLMs in regulated environments.
We are a remote-first team of 40+, with offices in San Francisco, Beijing, and Singapore.
Whether you’re evaluating inference providers or want to collaborate on research, we read everything.