ClusterMAX™ 2.0: The Industry Standard GPU Cloud Rating System
Introduction
GPU clouds (also known as “Neoclouds” since October of last year) are at the center of the AI boom. Neoclouds represent some of the most important transactions in AI, the critical juncture where end users rent GPUs to train models, process data, and build inference endpoints.
Our previous research has set the standard for understanding Neoclouds:
Since ClusterMAX 1.0 was released 6 months ago, we have seen significant changes in the industry. H200, B200, MI325X, and MI355X GPUs have arrived at scale. GB200 NVL72 has rolled out to hyperscale customers and GB300 NVL72 systems are being brought up. TPU and Trainium are in the arena. And many buyers are turning to the ClusterMAX rating system as the trusted, independent third party with a comprehensive, technical guide to understanding the market.
An update is needed!
Executive Summary
YouTube summary video available here!
ClusterMAX 2.0 debuts with a comprehensive review of 84 providers, up from 26 in ClusterMAX 1.0. We increase our market view to cover 209 total providers, up from 169 in our previous article and 124 in the original AI Neocloud Playbook and Anatomy. We have interviewed over 140 end users of Neoclouds as part of this research.
We release an itemized list of all criteria we consider during testing, covering 10 primary categories (security, lifecycle, orchestration, storage, networking, reliability, monitoring, pricing, partnerships, availability).
We release five descriptions of our expectations, covering Slurm, Kubernetes, standalone machines, monitoring, and health checks. These lists amalgamate what we heard in our interviews with end users, and we encourage providers to use them when developing their offerings.
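To illustrate the kind of workload the Slurm expectations cover, here is a minimal multi-node GPU batch script of the sort a provider's Slurm offering should run out of the box. The partition name, GPU counts, and training entrypoint are hypothetical, chosen only for the sketch:

```shell
#!/bin/bash
# Hypothetical Slurm batch script: a 2-node, 16-GPU training job.
# Partition and resource names vary by provider; these are illustrative.
#SBATCH --job-name=train-llm
#SBATCH --partition=gpu          # provider-specific partition name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8      # one task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=04:00:00

# Launch one training process per GPU across both nodes.
srun python train.py --config config.yaml
```

A cloud meeting the bar described here should accept such a script via `sbatch` with GPUs, the high-speed fabric (e.g. InfiniBand), and shared storage already wired up, with no extra setup by the end user.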
CoreWeave retains top spot as the only member of the Platinum tier. CoreWeave sets the bar for others to follow, and is the only cloud to consistently command premium pricing in our interviews with end users.
Nebius, Oracle and Azure are the top providers within the Gold tier. Crusoe and new entrant Fluidstack also achieve Gold tier.
Google rises to the top of the Silver tier, alongside AWS, together.ai and Lambda. Many more clouds from all around the world debut at the Bronze or Silver tier, for a total of 37 clouds achieving a medallion rating.
We provide analysis of key trends: Slurm-on-Kubernetes, Virtual Machines or Bare-Metal, Kubernetes for Training, Transition to Blackwell, GB200 NVL72 Reliability and SLAs, Crypto Miners Here To Stay, Custom
...