Introduction

KServe

Alauda Build of KServe is based on the open source KServe project. KServe provides a standardized, cloud-native interface for serving machine learning models at scale on Kubernetes. It has evolved around two primary scenarios: Predictive AI for traditional ML inference, and Generative AI for LLM-based workloads.

Generative AI

Generative AI support is optimized for Large Language Model (LLM) serving with OpenAI-compatible APIs.

  • llm-d (Distributed LLM Inference): A Kubernetes-native distributed inference framework that runs under the KServe control plane. llm-d orchestrates multi-node LLM inference using a Leader/Worker pattern and makes real-time routing decisions based on KV cache state and GPU load — enabling KV-cache-aware request scheduling, elastic tensor/pipeline parallelism, and cluster-wide inference that behaves like a single machine. This lowers cost per token and maximizes GPU utilization for large models (e.g., Llama 3.1 405B) that exceed single-node memory.
  • LLM Inference & Streaming: Native support for streaming responses (SSE / chunked transfer), enabling real-time token delivery for chat and completion workloads, with OpenAI-compatible /chat/completions and /completions APIs.
  • vLLM Runtime: First-class integration with vLLM as the high-performance LLM serving backend, with support for continuous batching and PagedAttention.
  • Gateway Integration: Native integration with Envoy Gateway and the Gateway API Inference Extension (GIE) for AI-aware traffic routing, load balancing, and per-model rate limiting across inference services.
  • Autoscaling for LLMs: Metrics-driven autoscaling policies tailored to LLM throughput characteristics, including scale-to-zero for cost efficiency.
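To make the streaming behavior above concrete, the helper below parses OpenAI-style Server-Sent Events (SSE) lines from a /chat/completions response into content tokens. This is a minimal client-side sketch: the chunk layout follows the standard OpenAI streaming format, and the sample lines are illustrative rather than captured from a real service.

```python
import json

def iter_stream_tokens(sse_lines):
    """Yield content tokens from OpenAI-style chat-completion SSE lines.

    Each streamed line looks like 'data: {json chunk}'; the stream
    ends with the sentinel line 'data: [DONE]'.
    """
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank lines
        payload = line[len("data: "):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        token = delta.get("content")
        if token:
            yield token

# Example: lines as they might arrive over SSE from a vLLM-backed service.
lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_tokens(lines)))  # Hello, world
```

In a real deployment the lines would come from an HTTP response body on the InferenceService's OpenAI-compatible endpoint; the parsing logic is the same either way.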

Predictive AI

Predictive AI covers traditional machine learning model serving with high throughput and low latency requirements.

  • InferenceService: The core CRD for deploying and managing model serving endpoints. Supports canary rollouts, traffic splitting across model versions, and A/B testing workflows.
  • Model Serving Runtimes: Pre-integrated runtimes for popular ML frameworks — TensorFlow Serving, TorchServe, Triton Inference Server, SKLearn, XGBoost, and more. Custom runtimes are supported via the ClusterServingRuntime and ServingRuntime CRDs.
  • Inference Graph: The InferenceGraph CRD enables composing multiple models into a pipeline, including pre/post-processing nodes, routing logic, and ensemble patterns.
  • Autoscaling: Scale-to-zero and scale-from-zero support via KEDA, plus standard request-driven scaling via the Kubernetes HPA, with policies based on request rate, queue depth, or custom metrics.
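To illustrate the InferenceService CRD and its canary rollout support, the snippet below sketches a minimal manifest as a Python dict that could be serialized and applied with any Kubernetes client. The service name, model format, storage URI, and canary percentage are placeholder values, not taken from a real deployment.

```python
import json

# A minimal InferenceService manifest sketched as a Python dict.
# The name, storageUri, and canary percentage are illustrative only.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris"},
    "spec": {
        "predictor": {
            # canaryTrafficPercent routes 10% of traffic to the latest
            # revision and keeps 90% on the previous stable revision.
            "canaryTrafficPercent": 10,
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": "gs://example-bucket/models/iris",
            },
        }
    },
}

print(json.dumps(inference_service, indent=2))
```

Promoting the canary is then a matter of raising canaryTrafficPercent toward 100 (or removing the field) and re-applying the manifest; KServe handles the underlying revision and traffic management.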

For installation on the platform, see Install KServe.

Documentation

KServe upstream documentation and key dependencies: