Introduction

Envoy AI Gateway

Alauda Build of Envoy AI Gateway is based on the upstream Envoy AI Gateway project: a Kubernetes-native, AI-specific gateway layer built on top of Envoy Gateway that provides intelligent traffic management, routing, and policy enforcement for AI inference workloads.

Main components and capabilities include:

  • AI-Aware Routing: Routes inference requests to the appropriate backend model service based on request content, model name, and backend availability — enabling transparent multi-model serving behind a single endpoint.
  • OpenAI-Compatible API: Exposes a unified, OpenAI-compatible API surface (/v1/chat/completions, /v1/completions, /v1/models) for all downstream inference services, regardless of the underlying runtime.
  • Per-Model Rate Limiting & Policies: Enforces fine-grained rate limiting, token quotas, and traffic policies at the individual model level, preventing resource starvation and ensuring fair usage across tenants.
  • Backend Load Balancing: Distributes inference requests across multiple replicas of the same model using configurable load-balancing strategies, with health checking and automatic failover.
  • Envoy Gateway Integration: Runs as an extension of Envoy Gateway, inheriting its Kubernetes Gateway API-native control plane, TLS termination, and observability features (metrics, access logs, distributed tracing).
  • Gateway API Inference Extension (GIE): Integrates with the Kubernetes SIG Gateway API Inference Extension for advanced, inference-aware scheduling and load balancing decisions based on real-time backend state.

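As a rough illustration of the AI-aware routing described above, requests can be matched on the requested model name and forwarded to a model-specific backend. The sketch below assumes the upstream `AIGatewayRoute` resource from the `aigateway.envoyproxy.io/v1alpha1` API; the resource names, model value, and exact field layout are illustrative and may differ in your installed version:

```yaml
# Illustrative sketch only — names and model values are placeholders,
# and the schema follows the upstream v1alpha1 API, which may vary by version.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: demo-route
spec:
  schema:
    name: OpenAI              # expose the OpenAI-compatible API surface
  targetRefs:                 # attach to an existing Envoy Gateway
    - name: demo-gateway
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model   # model name extracted from the request
              value: llama-3-8b
      backendRefs:
        - name: llama-backend       # backend resource serving this model
```

With a route like this in place, multiple models can be served behind the single gateway endpoint, and per-model rate limiting or traffic policies can then be attached to the same route.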
Envoy AI Gateway is a required dependency of Alauda Build of KServe for exposing inference services.

For installation on the platform, see Install Envoy AI Gateway.
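Once the gateway is installed, clients interact with it through the OpenAI-compatible endpoints listed above. A minimal sketch using only the Python standard library — the gateway URL and model name are placeholders, not values from this platform:

```python
import json
from urllib import request

# Hypothetical gateway address — replace with your Gateway's actual endpoint.
GATEWAY_URL = "http://ai-gateway.example.com/v1/chat/completions"


def build_chat_request(model: str, prompt: str) -> request.Request:
    """Build an OpenAI-compatible chat completion request for the gateway.

    The gateway routes the request to a backend based on the "model" field,
    so the same URL transparently serves every configured model.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        GATEWAY_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("llama-3-8b", "Hello!")
# To actually send it (requires a reachable gateway):
#   request.urlopen(req)
```

Because the API surface is OpenAI-compatible, existing OpenAI SDK clients can typically be pointed at the gateway by overriding their base URL, with no other code changes.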

Documentation

Envoy AI Gateway upstream documentation and related resources: