Can Open Source Level the AI Infrastructure Race?

May 30, 2025

AI Infrastructure Is the New Battleground

The race for artificial intelligence is no longer just about models—it’s about who owns the terrain those models run on. As generative AI matures, power is consolidating not around algorithms, but around compute capacity, data infrastructure, and orchestration layers. The world’s most powerful entities—nations and corporations—are moving swiftly to claim that ground, and the stakes are rising for everyone else.

In just the past month:

  • Saudi Arabia announced a $10 billion AI investment fund and a plan to build 6.6 GW of data center capacity by 2034. Its national AI company, Alat, is sourcing 18,000 Nvidia chips and collaborating with AMD and Qualcomm to localize AI chip design, with the aim of controlling 7% of global AI compute.
  • Salesforce is acquiring Informatica for $8 billion, gaining tight control over enterprise data infrastructure. Informatica’s technology will be integrated into Agentforce, Salesforce’s new AI platform, to power agentic AI systems across enterprise workflows.
  • In the UK, the government is piloting an AI tool called Minute across 25 councils. The tool automates administrative tasks in planning, licensing, and social services, demonstrating the state’s growing appetite for embedding AI into public infrastructure.

But these headlines are just the tip of the iceberg. The centralization of AI infrastructure is accelerating:

  • Nvidia now controls roughly 80% of the global AI accelerator market, with its CUDA software stack creating significant vendor lock-in.
  • The hyperscalers—Amazon, Microsoft, and Google—operate over 60% of global cloud compute capacity, according to Synergy Research.
  • China’s government has pledged over $50 billion in AI infrastructure by 2030, with Baidu, Alibaba, and Tencent building proprietary AI clouds and chip supply chains.
  • Open-source AI models such as Meta’s Llama and Mistral’s releases have seen rapid adoption, yet even they depend heavily on centralized cloud infrastructure for training and deployment.

During the 2023 GPU shortage, academic researchers and startups were priced out of AI development, slowing innovation outside big tech. Over 50% of generative AI companies cited GPU scarcity as a major scaling bottleneck, and even well-funded firms such as Hugging Face and Databricks struggled to secure enough GPUs to train and deploy competitive models. Players large enough to lock in hardware at scale crowded out smaller companies and researchers, inflating costs and stretching wait times for cloud GPU access. This dynamic risks making AI development the exclusive domain of major corporations and governments.

Centralized infrastructure also creates single points of failure and attack. With a handful of hyperscalers (Amazon, Microsoft, Google) supplying most AI compute, a cloud outage, supply chain disruption, or cyberattack can ripple across the global AI ecosystem, a vulnerability the GPU shortage’s bottlenecks made plain.

Countries without local infrastructure depend on foreign providers, putting data sovereignty and national security at risk. The global scramble for GPUs and AI chips has already triggered geopolitical tensions and export restrictions, deepening shortages and inflating prices for nations without domestic chip manufacturing, which are left reliant on external cloud and hardware vendors for critical AI infrastructure.

Bottom line: Compute, not code, is becoming the core advantage. Without intervention, AI innovation risks becoming the exclusive domain of a handful of powerful actors, deepening global digital divides and stifling open innovation.


How Open Source Can Rebalance the AI Infrastructure Equation

While governments and mega-corporations consolidate control, open source remains one of the few forces that can level the playing field. However, it’s not a silver bullet—open source faces real challenges, especially in areas where hardware and capital-intensive infrastructure are involved. Still, here’s where open source is making critical inroads:

Open Infrastructure and Cross-Cloud Abstractions

Open-source tooling enables developers to operate across multiple infrastructure layers—public clouds, sovereign clusters, or local data centers—without being locked in.

Examples:

  • KubeRay: A Kubernetes operator for Ray, KubeRay lets organizations deploy, scale, and manage distributed AI workloads. It brings Ray’s distributed training and inference to Kubernetes, abstracts away cloud-specific APIs, and supports GPU scheduling across clusters, making it practical for research labs or startups to run complex AI pipelines on hybrid or multi-cloud setups.
  • SkyPilot: SkyPilot provides multi-cloud scheduling and GPU provisioning, allowing users to run AI jobs on the cheapest or most available GPUs across AWS, Azure, GCP, and even smaller cloud providers. It automates spot instance management, failover, and data synchronization, significantly reducing operational overhead and cost for teams without dedicated infrastructure.
  • vLLM: Optimized for large language model inference, vLLM enables efficient, high-throughput serving of LLMs across diverse hardware platforms. It combines continuous batching with PagedAttention memory management, helping organizations deploy cutting-edge models on commodity hardware or across multiple clouds (see the sketch after this list).
  • Red Hat’s llm-d: Announced in 2025, llm-d is a Kubernetes-native framework for scalable AI inference that separates the prefill and decode phases of serving, applies cache-aware smart routing, and drives efficient GPU utilization. It is backed by partners including CoreWeave, Google Cloud, IBM Research, and NVIDIA, and aims to standardize inference infrastructure for production workloads.
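
To make the portability claim concrete, here is a minimal vLLM offline-inference sketch. The API calls are vLLM’s standard interface; the checkpoint is an arbitrary small model and the prompts are placeholders chosen for illustration.

```python
# Requires: pip install vllm (and, for most models, a CUDA-capable GPU)
from vllm import LLM, SamplingParams

# Any Hugging Face-format checkpoint works; this tiny model is illustrative.
llm = LLM(model="facebook/opt-125m")

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "Open infrastructure matters because",
    "The biggest barrier to AI access is",
]

# vLLM batches these requests internally (continuous batching) and manages
# KV-cache memory with PagedAttention; no cloud-specific API is involved.
outputs = llm.generate(prompts, sampling)

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```

The same script runs unchanged on a workstation GPU, a sovereign cluster, or any cloud instance, which is exactly the kind of lock-in avoidance these tools are after.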

Impact:
These tools let smaller players, such as startups, academic labs, and developing countries, scale AI efficiently without depending on hyperscalers, and optimize for cost, performance, and regulatory compliance.


Decentralized Compute Access

As demand for GPUs and accelerators surges, access to compute is becoming one of the most significant barriers to AI development. Traditional access is controlled by a handful of hyperscalers or national initiatives, but decentralized compute protocols offer an alternative path.

Examples:

  • Golem: One of the earliest decentralized compute networks, Golem enables users to rent out unused CPU and GPU cycles. Anyone can contribute hardware to the network and earn tokens, while developers can access affordable compute for rendering, simulation, or training lightweight models. Golem abstracts infrastructure provisioning, making compute more accessible for small teams or cost-sensitive use cases.
  • Bittensor: A decentralized machine learning network built specifically for AI model training and inference. In Bittensor, participants run “subnets” (specialized machine learning tasks), contribute compute, and are rewarded based on the value of their output as ranked by a consensus algorithm. The system incentivizes open innovation, making model development and deployment viable without relying on centralized cloud access (a toy sketch of the reward mechanics follows this list).
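
To give a feel for those reward mechanics, below is a deliberately simplified, self-contained sketch of stake-weighted consensus scoring. It uses no real Bittensor code; all names and numbers are invented for illustration.

```python
# Toy model of consensus-weighted rewards in the spirit of Bittensor's
# incentive design. NOT the bittensor SDK; everything here is hypothetical.

def consensus_rewards(scores, stakes, budget=100.0):
    """scores[v][m]: validator v's quality score for miner m (0..1).
    stakes[v]: validator v's stake weight.
    Returns each miner's share of the reward budget."""
    num_miners = len(scores[0])
    total_stake = sum(stakes)

    # Stake-weighted average score per miner: validators with more stake
    # have proportionally more influence on the consensus ranking.
    consensus = [
        sum(stakes[v] * scores[v][m] for v in range(len(stakes))) / total_stake
        for m in range(num_miners)
    ]

    # Rewards are split in proportion to consensus score.
    total = sum(consensus) or 1.0
    return [budget * c / total for c in consensus]

# Three validators score two miners; stakes differ per validator.
scores = [[0.9, 0.4], [0.8, 0.5], [0.7, 0.6]]
stakes = [50.0, 30.0, 20.0]
print(consensus_rewards(scores, stakes))  # miner 0 earns the larger share
```

Real networks layer bonding, slashing, and anti-collusion mechanics on top, but the core idea of rewarding output that high-stake peers agree is valuable is captured here.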

Impact:
These networks enable peer-to-peer access to AI compute, significantly lowering the barrier to participation for researchers, startups, and developers in emerging markets. By commoditizing and decentralizing infrastructure, they represent a foundational shift away from exclusive, permissioned access models and toward a more open, participatory AI economy.

Limitations:
Decentralized compute networks are still in their infancy and have yet to prove they can scale to the largest AI workloads. And open source still requires hardware: if chips and data centers are controlled by a few, software alone cannot solve the access problem.


Open Hardware and Transparent AI Stacks

Open hardware and transparent AI stacks are essential for digital sovereignty and security. Open initiatives are making real progress, but the landscape is complex.

Examples:

  • RISC-V: An open instruction set architecture (ISA) that allows anyone to design custom CPUs without proprietary licensing. RISC-V chips are now being adopted in a wide range of applications, from edge devices to data center servers, with companies such as SiFive and Alibaba producing commercially viable RISC-V processors. This enables countries and organizations to build AI infrastructure without dependence on US or Chinese chip vendors.
  • Open Compute Project (OCP): Provides open, vendor-neutral designs for data center hardware, including servers, storage, and networking. Hyperscalers and enterprises use OCP’s AI hardware blueprints to build scalable, efficient, and auditable infrastructure. For example, Meta’s data centers are primarily built on OCP standards, and OCP’s AI Hardware Project is promoting open accelerator designs.
  • LLM Foundry: An open framework from Databricks (MosaicML) for training, fine-tuning, evaluating, and deploying large language models, LLM Foundry offers transparent pipelines and reproducible training recipes. It helps organizations build auditable AI stacks and run self-hosted, reproducible training, an appealing property for institutions pursuing sovereign AI deployments.
  • Google’s TPU ecosystem: While the TPU silicon itself remains proprietary, Google has open-sourced much of the surrounding stack, most notably through the OpenXLA compiler project, publishing documentation and reference toolchains that let third parties target and build on its AI accelerators.

Impact:
Open hardware and transparent stacks support digital sovereignty, security, and innovation for governments, research institutions, and enterprises that need control over every layer of their AI infrastructure.

Global Examples:

  • Gaia-X: A flagship European initiative that builds a federated, open cloud and data infrastructure, empowering businesses, individuals, and governments with sovereign control over their data. Gaia-X creates a decentralized ecosystem of “Digital Clearing Houses” and a verification framework to ensure trust, interoperability, and compliance with European values and regulations.
  • Sovereign Tech Fund and the European Commission’s GenAI4EU initiative are investing in open digital infrastructure and open AI ecosystems, emphasizing that true sovereignty requires robust, well-funded, open-source foundations.
  • AfriLabs and the AI4D Africa Initiative support local innovation by funding open-source AI projects tailored to African challenges, such as healthcare diagnostics in remote areas, AI-powered agricultural tools, and financial inclusion platforms.

The Environmental Angle

AI infrastructure is profoundly energy-intensive, with data centers and large-scale model training contributing significantly to global electricity demand and carbon emissions. As the world accelerates toward widespread adoption of AI, sustainability must be a core consideration, not just for environmental reasons, but also for grid reliability, cost control, and regulatory compliance.

Open source is emerging as a key enabler for sustainable AI, offering several concrete pathways:

Open Source Projects for Energy-Efficient AI

  • LF Energy (a Linux Foundation initiative) leads multiple open source projects to optimize energy use in power grids and AI operations. Projects like OpenSTEF (for short-term electric load forecasting), GridFM, and the expanded OpenEEMeter suite enable utilities and researchers to forecast demand, optimize asset management, and model energy usage with AI (a toy forecasting sketch follows this list).
  • The Open Power AI Consortium, launched by EPRI, NVIDIA, and partners, is developing open, domain-specific large language models and datasets tailored for the energy sector.
  • AI4EF (Artificial Intelligence for Energy Efficiency in the Building Sector) provides open architectures and toolkits to advance sustainable energy management in buildings.
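
To give a flavor of the forecasting problem OpenSTEF automates, here is a toy seasonal-naive baseline on synthetic data. It uses none of the openstef package’s actual API; the load series and all numbers are invented.

```python
# Toy short-term load forecast: predict each of the next 24 hours as the
# average load at the same hour of day over the previous week.
import math
import random

random.seed(0)

# Synthetic hourly load for 14 days: a daily cycle plus noise (values in MW).
load = [
    100 + 30 * math.sin(2 * math.pi * (h % 24) / 24) + random.gauss(0, 5)
    for h in range(14 * 24)
]

def seasonal_naive(history, horizon=24, season=24):
    """Forecast `horizon` hours ahead by averaging the same hour of day
    across the last 7 seasons (days) of history."""
    forecast = []
    for step in range(horizon):
        same_hour = [history[step - d * season] for d in range(1, 8)]
        forecast.append(sum(same_hour) / len(same_hour))
    return forecast

print([round(x, 1) for x in seasonal_naive(load)[:6]])  # next six hours
```

Production systems like OpenSTEF layer weather features, market data, and trained models on top of baselines like this, but the shape of the task is the same: predict near-term demand so operators can plan.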

Open Standards and Benchmarks

  • AI Energy Score (by Hugging Face) is an open, standardized benchmark to compare the energy efficiency of AI models across tasks and architectures.

Decentralized Compute and Resource Optimization

  • Decentralized compute protocols (like those emerging in the open source ecosystem) can help distribute AI workloads to regions or nodes with surplus renewable energy, or during off-peak hours, smoothing demand and reducing reliance on fossil-fuel-powered data centers (a minimal placement policy is sketched after this list).
  • Grid-interactive smart communities, powered by open-source AI, are enabling decentralized energy management—balancing loads, integrating renewables, and optimizing consumption at the grid edge.
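
Reduced to its essence, the first idea is a small placement policy: send the job wherever the grid is cleanest and capacity exists. The sketch below is purely illustrative; region names, carbon intensities, and GPU counts are invented, and a real scheduler would pull live grid data.

```python
# Toy carbon-aware workload placement: choose the lowest-carbon region
# that still has enough free GPUs for the job. All values are hypothetical.

def pick_region(carbon_intensity, capacity, job_gpus):
    """carbon_intensity: gCO2/kWh per region; capacity: free GPUs per region.
    Returns the cleanest region that fits the job, or None if none fits."""
    candidates = [r for r, free in capacity.items() if free >= job_gpus]
    if not candidates:
        return None
    return min(candidates, key=lambda r: carbon_intensity[r])

carbon_intensity = {"north-eu": 45, "us-east": 390, "asia-1": 520}  # gCO2/kWh
capacity = {"north-eu": 8, "us-east": 64, "asia-1": 32}             # free GPUs

# The cleanest region lacks capacity, so the job lands in "us-east":
print(pick_region(carbon_intensity, capacity, job_gpus=16))
```

Even this toy version surfaces the real tradeoff: the greenest grid is not always where the spare accelerators are, which is why open scheduling layers matter.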

Community-Driven Sustainability

  • Open-source communities are actively optimizing AI software for energy efficiency, sharing best practices, and developing tools to monitor and minimize the carbon footprint of model training and inference (see the CodeCarbon sketch after this list).
  • At the MIT Lincoln Laboratory Supercomputing Center (LLSC), researchers have developed and open-sourced tools and techniques to reduce the energy consumption of AI workloads in data centers, including power-capping hardware, early stopping for training runs, and energy-aware scheduling.
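
As one concrete example of such tooling, the open-source CodeCarbon tracker can wrap a workload and estimate its emissions. The snippet below uses a trivial placeholder computation in place of a training loop, and the project name is arbitrary.

```python
# Requires: pip install codecarbon
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="demo-train")  # logs to emissions.csv
tracker.start()
try:
    total = sum(i * i for i in range(10_000_000))  # placeholder for training
finally:
    emissions_kg = tracker.stop()  # estimated kg of CO2-equivalent

print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```

Making measurement this cheap is what turns carbon footprint from an afterthought into a metric teams can optimize alongside loss and latency.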

Conclusion

The future of AI will be shaped not only by how powerful the models are, but also by who owns the ground on which they run. Without open alternatives, compute access and AI development could become centralized in ways that stifle innovation, exclude emerging players, and limit global equity.

But open source has changed the course of computing before—from Linux to Kubernetes, from Apache to PyTorch. It can do it again. The models may generate the headlines, but the infrastructure wins the war. Let’s ensure it’s a terrain everyone can build on.