The Cloudflare Blog: How we built the most efficient inference engine for Cloudflare’s network

Source URL: https://blog.cloudflare.com/cloudflares-most-efficient-ai-inference-engine/
Source: The Cloudflare Blog
Title: How we built the most efficient inference engine for Cloudflare’s network

Feedly Summary: Infire is an LLM inference engine that employs a range of techniques to maximize resource utilization, allowing us to serve AI models more efficiently with better performance for Cloudflare workloads.

AI Summary and Description: Yes

**Summary:** This text discusses Cloudflare’s new LLM inference engine, Infire, designed to optimize AI workloads by addressing inefficiencies in existing inference engines such as vLLM. Infire’s architecture focuses on maximizing GPU utilization and minimizing CPU overhead, ensuring efficient processing in dynamic, distributed environments while maintaining strong security practices. This innovation is relevant for professionals in AI infrastructure, addressing both performance and security concerns as AI technologies evolve.

**Detailed Description:**
The text outlines significant developments in AI inference systems, specifically detailing Cloudflare’s challenges with existing inference engines and its solution, Infire. Key points include:

– **Challenge with Centralized Models**:
  – Many AI products rely on centralized data centers whose infrastructure is a poor match for Cloudflare’s globally distributed network, leading to inefficiencies.
  – Continued reliance on large, expensive GPUs is not a sustainable way to deliver optimal performance at the edge.

– **Introduction of Infire**:
  – Infire is a new LLM inference engine developed to fully utilize GPU resources at the edge and serve requests more efficiently.
  – Implemented in Rust, Infire aims to capitalize on the unique performance capabilities of Cloudflare’s distributed architecture.

– **Performance Enhancements**:
  – Initial benchmarks show that Infire can execute inference tasks up to 7% faster than vLLM under optimal conditions, and by a larger margin under real production load.
  – Infire’s architecture allows multiple models to be scheduled dynamically onto GPUs, streamlining resource use and minimizing idle time (see the sketch after this list).
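
The summary does not reproduce the scheduler’s internals, but the general idea of dynamically packing models onto a GPU can be illustrated with a toy sketch. The Rust below keeps models resident under a fixed memory budget and evicts the least-recently-used one to make room; the struct names, sizes, and LRU policy are assumptions made for illustration, not Infire’s actual implementation.

```rust
// Toy illustration of dynamic model scheduling under a fixed "GPU memory"
// budget. This is NOT Infire's real scheduler; all names and numbers are
// invented for the sketch.
use std::collections::VecDeque;

struct LoadedModel {
    name: String,
    mem_bytes: u64, // approximate weight footprint on the GPU
}

struct ModelScheduler {
    budget_bytes: u64,
    used_bytes: u64,
    // Front = least recently used, back = most recently used.
    resident: VecDeque<LoadedModel>,
}

impl ModelScheduler {
    fn new(budget_bytes: u64) -> Self {
        Self { budget_bytes, used_bytes: 0, resident: VecDeque::new() }
    }

    /// Ensure `name` is resident, evicting idle models if the budget is exceeded.
    fn ensure_loaded(&mut self, name: &str, mem_bytes: u64) {
        // Already resident: just mark it as most recently used.
        if let Some(pos) = self.resident.iter().position(|m| m.name == name) {
            let m = self.resident.remove(pos).unwrap();
            self.resident.push_back(m);
            return;
        }
        // Evict least-recently-used models until the new one fits.
        while self.used_bytes + mem_bytes > self.budget_bytes {
            let evicted = self
                .resident
                .pop_front()
                .expect("budget too small for this model");
            self.used_bytes -= evicted.mem_bytes;
            println!("evicting {}", evicted.name);
        }
        self.used_bytes += mem_bytes;
        self.resident.push_back(LoadedModel { name: name.to_string(), mem_bytes });
        println!("loaded {}", name);
    }
}

fn main() {
    // Hypothetical 24 GB card and model sizes, purely for demonstration.
    let mut sched = ModelScheduler::new(24 * 1_000_000_000);
    sched.ensure_loaded("large-chat-model", 16_000_000_000);
    sched.ensure_loaded("small-embedding-model", 2_000_000_000);
    sched.ensure_loaded("large-chat-model", 16_000_000_000); // already resident, no reload
}
```

The point of the sketch is only that keeping the right set of models resident, and swapping them without tearing anything down, is what lets a GPU stay busy across many workloads.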

– **Architecture Breakdown**:
  – Infire is composed of an OpenAI-compatible HTTP server, a batcher, and the Infire engine itself, with each component focused on maximizing processing efficiency.
  – Batching requests together makes better use of GPU memory bandwidth, since the model weights streamed from memory on each step are reused across every request in the batch (a toy sketch of the idea follows this list).
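
As a rough, CPU-only illustration of why batching improves memory-bandwidth utilization, the sketch below applies one weight matrix to an entire batch of requests in a single pass. Every name and shape is invented for the example; the real engine does this with fused CUDA kernels on the GPU, not nested loops on the CPU.

```rust
// Toy sketch: during a decode step, each weight row is streamed through once
// and reused for every in-flight request, instead of once per request.
// This is illustrative only and is not Infire's engine code.

/// Stand-in for a layer's weight matrix (gigabytes of parameters in practice).
struct Weights {
    w: Vec<f32>, // row-major dim x dim matrix
    dim: usize,
}

/// One decode step for a whole batch of sequences.
fn decode_step(weights: &Weights, batch: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let dim = weights.dim;
    let mut out = vec![vec![0.0f32; dim]; batch.len()];
    // Each weight row is pulled from memory once per step and then applied to
    // every sequence in the batch, so the dominant cost of streaming the
    // weights is amortized across all concurrent requests.
    for (i, row) in weights.w.chunks(dim).enumerate() {
        for (seq, o) in batch.iter().zip(out.iter_mut()) {
            o[i] = row.iter().zip(seq).map(|(a, b)| a * b).sum();
        }
    }
    out
}

fn main() {
    let dim = 4;
    let weights = Weights { w: vec![0.01; dim * dim], dim };
    // Three concurrent requests, each with its own hidden state for this step.
    let batch = vec![vec![1.0f32; dim]; 3];
    let next = decode_step(&weights, &batch);
    println!("{:?}", next);
}
```

The design point is that streaming weights dominates decode cost, so serving many concurrent requests per pass increases the useful work done per byte of memory traffic.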

– **Security Considerations**:
  – Cloudflare emphasizes security, noting that reliance on third-party systems (like vLLM) introduced potential vulnerabilities.
  – Infire operates directly on bare-metal servers, reducing external risk factors and ensuring resources are allocated efficiently without competing with other critical services.

– **Future Developments**:
  – As Infire evolves, there are plans for multi-GPU support, quantization, and true multi-tenancy to further streamline AI processing and adapt to expanding workloads.

In conclusion, Infire stands as a vital advancement in AI inference technology, particularly suited for environments requiring high efficiency and security while addressing the unique demands of modern AI workloads in distributed cloud infrastructures. Security and compliance professionals should note these enhancements in infrastructure as they impact both operational efficiency and risk management within AI deployments.