AI Infrastructure,
Built for Scale.

High-performance inference serving for large language models. Deploy any open-weight model with sub-5 ms P50 latency, automatic scaling, and zero DevOps overhead.

50K+ GPUs Orchestrated

99.97% Uptime SLA

<5 ms P50 Latency

Platform

One Stack, Any Model.

From Llama to DeepSeek, from a single node to thousands — IStars handles the hard parts so your team focuses on the product.

⚡

Inference Engine

Continuous batching, speculative decoding, and PagedAttention. Sub-5 ms P50 latency on 70B+ models under production traffic.

📦

Model Registry

One-line deploys from Hugging Face, private registries, or our curated collection of optimized checkpoints. Quantized, ready to serve.

📊

Observability

Token-level tracing, throughput dashboards, cost attribution, and drift detection. Know exactly what your models are doing in production.

🔒

On-Prem & VPC

Run entirely inside your VPC or on bare metal. No data leaves your network. SOC 2 Type II, HIPAA, and ISO 27001 compliant.

Research

We publish. We open-source.

Our team actively contributes to the systems-for-ML community. Recent work from IStars Research.

ICML 2026

StarServe: Elastic Model Parallelism for Heterogeneous GPU Clusters

Dynamic tensor/pipeline/expert parallelism that adapts to cluster topology in real time. 2.3× throughput improvement over static partitioning on mixed GPU fleets.

arXiv →

SOSP 2025

TokenFlow: Predictable Latency for LLM Serving at Scale

A preemptive scheduling algorithm that guarantees P99 latency bounds under adversarial traffic patterns. Deployed in production across 12,000+ GPUs.

arXiv →

OSDI 2025

CacheGen: Prefix-Aware KV-Cache Management for Multi-Tenant Inference

Intelligent KV-cache sharing across tenants with shared system prompts. 40% memory reduction with no accuracy degradation.

arXiv →

Documentation

Start building in 5 minutes.

# Install the IStars CLI
curl -fsSL https://istars.space/install.sh | sh

# Deploy DeepSeek-V3 with one command
istars deploy deepseek-ai/DeepSeek-V3 \
  --replicas 2 \
  --quantization fp8 \
  --region us-west

# Send your first inference request
curl https://api.istars.space/v1/chat/completions \
  -H "Authorization: Bearer $ISTARS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-v3","messages":[{"role":"user","content":"Hello, world!"}]}'
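The request above hard-codes its JSON body. For scripting, a small wrapper can build the body from arguments and fail early when the API key is missing. This is a minimal sketch, not an official client: `build_chat_payload` and `send_chat` are hypothetical helper names, the `Content-Type: application/json` header is an assumption about what the endpoint expects, and the escaping covers only backslashes and double quotes in a single-line prompt.

```shell
#!/bin/sh
# Hypothetical helper: build the JSON body for a chat completion request.
# Escapes backslashes and double quotes in the prompt only; assumes a
# single-line prompt and a model name that needs no escaping.
build_chat_payload() {
  model=$1
  prompt=$(printf '%s' "$2" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' \
    "$model" "$prompt"
}

# Hypothetical helper: send the request to the endpoint shown in the quickstart.
# Aborts with a message if ISTARS_API_KEY is unset or empty.
send_chat() {
  : "${ISTARS_API_KEY:?set ISTARS_API_KEY before calling send_chat}"
  curl -sS --fail https://api.istars.space/v1/chat/completions \
    -H "Authorization: Bearer $ISTARS_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(build_chat_payload "$1" "$2")"
}

# Usage: send_chat deepseek-v3 'Hello, world!'
```

For prompts with newlines or other control characters, a real JSON encoder is a safer choice than the `sed` escaping used here.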

Full API reference, SDK guides, and deployment patterns at docs.istars.space →
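To sanity-check latency against your own deployment, one option is to time a batch of requests with curl and take the median. The sketch below is illustrative only: `p50` and `measure_latency` are hypothetical helper names, and the `Content-Type` header is an assumption. Note that curl's `time_total` is client-side wall-clock time in seconds for the whole request, including network round trips and full response generation, so it is not directly comparable to the P50 figure quoted on this page, whose measurement method is not stated here.

```shell
#!/bin/sh
# Median (P50) of newline-separated numbers read from stdin.
# Exits non-zero on empty input.
p50() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Hypothetical helper: send N identical requests and print one total time
# (in seconds) per line, using the endpoint from the quickstart.
measure_latency() {
  : "${ISTARS_API_KEY:?set ISTARS_API_KEY before calling measure_latency}"
  n=$1
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -sS -o /dev/null -w '%{time_total}\n' \
      https://api.istars.space/v1/chat/completions \
      -H "Authorization: Bearer $ISTARS_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model":"deepseek-v3","messages":[{"role":"user","content":"Hello, world!"}]}'
    i=$((i + 1))
  done
}

# Usage: measure_latency 20 | p50
```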

About

Founded by systems researchers. Backed by engineers who run production.

IStars Space was founded in 2024 by a team of distributed systems researchers from UC Berkeley, MIT, and Tsinghua. We saw a gap: model capabilities were advancing faster than the infrastructure to serve them.

Today, we serve billions of tokens per day across a globally distributed GPU fleet. Our customers range from AI-native startups to Fortune 500 enterprises deploying LLMs in regulated environments.

We are a remote-first team of 40+, with offices in San Francisco, Beijing, and Singapore.

San Francisco: HQ

Beijing: Engineering

Singapore: APAC

Contact

Let’s talk.

Whether you’re evaluating inference providers or want to collaborate on research — we read everything.

hello@istars.space

@istars_space on X

github.com/istars-space
