DigitalOcean
14 min read

How DigitalOcean’s Agentic Inference Cloud powered by NVIDIA GPUs Achieved 67% Lower Inference Costs for Workato

Summary

This article describes how DigitalOcean and Workato's AI Research Lab optimized large language model (LLM) inference on NVIDIA GPUs. The inference stack improves cost and performance through KV-aware routing, which prefers the GPU worker that already holds the relevant key-value (KV) cache so that shared prompt prefixes are not recomputed. The two main architectural components are NVIDIA Dynamo, which orchestrates inference, and DigitalOcean Kubernetes Service (DOKS), which provides the execution environment. The article also covers the difficulty of controlling inference costs at scale and shows how efficient routing and cache management raise throughput and reduce latency on large workloads.
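
The routing idea in the summary can be illustrated with a minimal sketch. This is not NVIDIA Dynamo's API or the article's implementation: the `Worker` class, the `route` function, the block size, and the `load_weight` scoring rule are all hypothetical, chosen only to show how a router might trade cache overlap against worker load.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (hypothetical; chosen small for readability)


def block_hashes(tokens: list[int]) -> list[int]:
    """Hash each full block chained with the blocks before it.

    Two requests share a hash at position i only if their first
    (i + 1) blocks of tokens are identical, i.e. they share a prefix.
    """
    hashes = []
    prev = 0
    full_len = len(tokens) - len(tokens) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        prev = hash((prev, tuple(tokens[start:start + BLOCK_SIZE])))
        hashes.append(prev)
    return hashes


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_blocks: set[int] = field(default_factory=set)

    def prefix_overlap(self, hashes: list[int]) -> int:
        """Count leading blocks of the request already in this worker's cache."""
        count = 0
        for h in hashes:
            if h not in self.cached_blocks:
                break
            count += 1
        return count


def route(workers: list[Worker], tokens: list[int], load_weight: float = 0.5) -> Worker:
    """Pick the worker with the best (cache overlap - weighted load) score."""
    hashes = block_hashes(tokens)
    best = max(
        workers,
        key=lambda w: w.prefix_overlap(hashes) - load_weight * w.active_requests,
    )
    # Assumption: the chosen worker caches every block of the request it serves.
    best.cached_blocks.update(hashes)
    best.active_requests += 1
    return best


workers = [Worker("gpu-0"), Worker("gpu-1")]
shared_prefix = list(range(8))  # e.g. a system prompt shared by many requests

first = route(workers, shared_prefix + [100, 101, 102, 103])
second = route(workers, shared_prefix + [200, 201, 202, 203])
third = route(workers, [900, 901, 902, 903])
print(first.name, second.name, third.name)
```

In this toy run the second request follows the first to the same worker because two of its three blocks are already cached there, while the unrelated third request goes to the idle worker. A plain round-robin balancer would ignore the shared prefix and recompute it on a second GPU.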

Key Learnings

  • KV-aware routing significantly reduces inference costs by minimizing redundant computations across GPUs.
  • The architecture's design allows for independent scaling of routing and worker components, optimizing resource utilization.
  • Understanding the mechanics of LLM inference phases (prefill and decode) is crucial for optimizing performance and cost.
  • NVIDIA Dynamo's orchestration capabilities provide a global view of GPU resources, enabling smarter load balancing and cache management.
  • The choice of GPU (NVIDIA H200 vs. A100) impacts both performance and cost, demonstrating the importance of hardware selection in system design.
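
The last point, that hardware choice affects both performance and cost, comes down to simple arithmetic: a more expensive GPU can still be cheaper per token if its throughput rises faster than its price. The sketch below shows that calculation. The function names are hypothetical, and the hourly prices and throughput figures are placeholders made up for illustration; they are not DigitalOcean prices or results from the article.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens on one GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def percent_reduction(baseline: float, optimized: float) -> float:
    """Relative saving of `optimized` versus `baseline`, in percent."""
    return (baseline - optimized) / baseline * 100


# Placeholder inputs, NOT measured values: a cheaper, slower GPU
# versus a pricier GPU with three times the throughput.
slower_gpu = cost_per_million_tokens(gpu_hourly_usd=2.0, tokens_per_second=1000)
faster_gpu = cost_per_million_tokens(gpu_hourly_usd=3.0, tokens_per_second=3000)

print(f"slower GPU: ${slower_gpu:.3f} per 1M tokens")
print(f"faster GPU: ${faster_gpu:.3f} per 1M tokens")
print(f"reduction:  {percent_reduction(slower_gpu, faster_gpu):.1f}%")
```

The same formula applies to software changes: anything that raises tokens per second on the same GPU, such as avoiding repeated prefill work, lowers the cost per token in proportion.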

Who Should Read This

Senior AI Engineers and Machine Learning Architects focused on optimizing large-scale inference systems for production environments.

Test Your Knowledge

  • What are the trade-offs between using KV-aware routing versus traditional load balancing in LLM inference?
  • How does the architecture of NVIDIA Dynamo enhance the efficiency of GPU resource utilization?
  • What specific challenges arise when scaling LLM inference for high-throughput workloads?
  • In what scenarios might the KV cache become a bottleneck, and how can this be mitigated?
  • How does the choice of GPU type affect the overall cost and performance of the inference stack?

Read Full Article at DigitalOcean
