Import AI 439: AI kernels; decentralized training; and universal representations

Welcome to Import AI, a newsletter about AI research. Import AI runs on arXiv and feedback from readers. If you’d like to support this, please subscribe.

Facebook uses GPT, Claude, and Llama to write its own kernels:
…LLM-driven infrastructure optimization at the hyperscale…
Facebook researchers have published details on KernelEvolve, a software system which uses AI to automate the design of new kernels to optimize AI models for serving ads across the company’s network of web platforms. KernelEvolve is a neat example of how AI systems have become good enough to automate and speed up parts of AI development - here, the design of kernels to optimize inference for hundreds of different models running on multiple chip architectures.

What KernelEvolve is: The software is “designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation model across heterogeneous hardware architectures through multiple programming abstractions, including Triton, CuTe DSL, and low-level hardware diagnostic languages, spanning the full hardware-software optimization stack”.
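To make the “kernel specifications as input” idea concrete, here is a hypothetical sketch of what such a spec might look like. The paper quoted above does not publish a concrete schema, so every field name below is illustrative, not Meta’s actual format:

```python
# Hypothetical kernel specification for a system like KernelEvolve.
# Field names and values are illustrative assumptions, not a real schema.
kernel_spec = {
    "operator": "rmsnorm_2d_backward",   # the op to generate a kernel for
    "abstraction": "Triton",             # Triton, CuTe DSL, or lower-level
    "target_hardware": "MTIA v3",        # NVIDIA GPU, AMD GPU, or Meta MTIA
    "dtype": "bfloat16",
    "baseline": "pytorch_eager",         # reference implementation to beat
    "constraints": {
        "max_shared_memory_kb": 192,     # hardware resource budget
        "numerics_tolerance": 1e-2,      # allowed deviation from baseline
    },
}
```

A spec like this would let one system target the same operator across the heterogeneous hardware and programming abstractions the paper describes.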

How it works: The core of the software is a system that takes in a user request (e.g., “Generate a Triton kernel for MTIA v3”) and routes it through a mixture of internal (Llama, CWM) and external (GPT, Claude) language models. These models produce candidate kernels that get evaluated through a variety of tools and, if they’re good enough, are added to an external knowledge database, which is then used to improve future prompts.
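The loop described above - prompt an ensemble of models, benchmark the candidates, and feed the winners back into future prompts - can be sketched in a few dozen lines. This is a minimal illustration of that pattern, not Meta’s implementation; all names and the evaluation interface are assumptions:

```python
# Illustrative sketch of an LLM-driven kernel-generation loop in the style
# the KernelEvolve paper describes. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Candidate:
    source: str      # generated kernel source code
    speedup: float   # measured speedup vs. the PyTorch baseline

@dataclass
class KnowledgeBase:
    entries: list = field(default_factory=list)

    def add(self, cand):
        self.entries.append(cand)

    def best_examples(self, k=2):
        # Feed the strongest prior kernels back into future prompts.
        return sorted(self.entries, key=lambda c: c.speedup, reverse=True)[:k]

def optimize_kernel(spec, models, evaluate, kb, rounds=3, threshold=1.0):
    """Query each model, keep candidates that beat the baseline, and
    enrich subsequent prompts with the best kernels found so far."""
    best = None
    for _ in range(rounds):
        prompt = spec + "".join(c.source for c in kb.best_examples())
        for model in models:
            cand = Candidate(source=model(prompt), speedup=0.0)
            cand.speedup = evaluate(cand.source)   # compile + benchmark step
            if cand.speedup > threshold:
                kb.add(cand)                       # reuse good kernels later
            if best is None or cand.speedup > best.speedup:
                best = cand
    return best
```

In a real system the `evaluate` step would compile the candidate, check numerics against the baseline, and benchmark it on target hardware; here it is left as a pluggable callback.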

It works well: By using this software, Facebook says it has cut the development time of new kernels “from weeks to hours”, and in production tests it has yielded kernels on par with hand-designed ones, in some cases delivering performance improvements of up to 17 times over existing PyTorch baselines. Kernels built using this software have been deployed across NVIDIA GPUs, AMD GPUs, and Meta’s own custom MTIA chips.
“KernelEvolve achieves substantial speedups spanning LLM inference workloads (Llama-3.1-8B: Vanilla Attention 4.6×, SDPA-MLP 3.3×), convolutional transformers (conv1d: 6.5×, conv2d: 4.7×), memory-bound data preprocessing operators critical for model enablement (MapId: 4.1×, MBDT: 9.3×, Batch Event Truncate: 9.8×), compute-intensive fusion kernels in ranking models (WuKong Optimized FM: 4.0×, InterFormer PFFN: 2.5×), MTIA-specific optimizations (RMSNorm 2D backward: 17×), and retrieval operations (Sparse Inverted Index: 1.25×)”, Facebook writes.

Saturates KernelBench: “We validate KernelEvolve on the publicly-available KernelBench suite, achieving 100% pass rate on all 250 problems across three difficulty levels, and 160 PyTorch ...
