kernel-evo

[Kernel Evo banner]

Evolutionary generation of efficient GPU kernels using GigaEvo.
Define a task, run evolution with an LLM backend, extract and compare optimized programs.

[Plots: speedup vs tokens, KernelBench levels 1 and 2]

Features

  • Custom tasks — Define your own kernel tasks in KernelBench format and evolve them.
  • KernelBench integration — Use existing KernelBench problems.
  • Triton and inline CUDA backends — the two most popular ways to write kernels, suited to different scenarios.
  • Remote or local execution — Run validation locally or via a remote eval server.
  • Cost efficient — works with fast models such as gemini flash 3 and gpt-oss-120b; current experiments cost $0.5-1. Frontier models with high reasoning effort would be beneficial, but at an order of magnitude higher cost.

Requirements

  • Python >= 3.12
  • LLM API — OpenAI-compatible (e.g. OpenRouter, or a local server like SGLang).
  • Redis — Used by GigaEvo for experiment state.
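GigaEvo only needs a reachable Redis instance. One quick way to get one locally is via Docker — a setup sketch rather than the project's documented procedure (container name is arbitrary; image and port are the Redis defaults):

```shell
# Start a disposable local Redis for experiment state
# (standard Redis image, default port 6379; adjust if your setup differs).
docker run -d --name kernel-evo-redis -p 6379:6379 redis:7
```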

Installation

From source

Note: `--ignore-requires-python` relaxes the Python version check (KernelBench may declare 3.10 but works on 3.12). For custom branches of `gigaevo` or `kernelbench`, edit the Git URLs in `pyproject.toml`.
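The exact commands are not reproduced here; assuming a standard pip-installable layout (the repository URL is a placeholder, only the `--ignore-requires-python` flag comes from the note above), a from-source install might look like:

```shell
# Hypothetical: clone and install from source (URL is a placeholder).
git clone <kernel-evo-repo-url>
cd kernel-evo
pip install -e . --ignore-requires-python  # relaxes KernelBench's Python pin
```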

Docker

Pull and run (when a pre-built image is published):

To build the image yourself (e.g. for private dependencies or development), see build/README.md.
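Once an image is published, the pull-and-run step would presumably look like the following; the registry, image name, and tag are all placeholders, not a published artifact:

```shell
# Hypothetical: pull a pre-built image and run it with GPU access.
docker pull <registry>/kernel-evo:latest
docker run --gpus all -it <registry>/kernel-evo:latest
```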


Custom kernel task

To evolve your own kernel, create a task in KernelBench format. Example layout:

    tasks/
    └── armt_associate/
        └── task.py

See `tasks/armt_associate` in this repo for a reference. You can also use any existing task from KernelBench.
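A minimal `task.py` sketch in the KernelBench style: a reference `Model` plus input factories. The toy softmax op below is an illustration of the format, not the `armt_associate` task itself.

```python
# Hypothetical minimal task.py in the KernelBench format.
import torch
import torch.nn as nn

class Model(nn.Module):
    """Reference (unoptimized) implementation that evolution tries to beat."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Toy op for illustration; a real task implements the kernel of interest.
        return torch.softmax(x, dim=-1)

def get_inputs():
    # Tensors passed to Model.forward during correctness/perf checks.
    return [torch.randn(128, 1024)]

def get_init_inputs():
    # Arguments for the Model constructor (none for this toy task).
    return []
```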


Run evolution

Evolution can use a local or remote LLM (e.g. SGLang, OpenRouter). Examples below use OpenRouter and a remote eval server.

1. Start the eval server (optional, for remote validation)

In a separate terminal:

2. Evolve with a custom task

3. Evolve with a KernelBench task
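The concrete commands for the three steps are not reproduced above. Assuming a `kernel-evo` entry point exposing the `eval-server` and `evolve` subcommands from the CLI overview (the entry-point name and every flag below are guesses, not the tool's documented interface), the steps might look like:

```shell
# Hypothetical invocations; subcommand names match the CLI overview,
# but the entry point and all flags are assumptions.
kernel-evo eval-server                                   # 1. remote validation server
kernel-evo evolve --task tasks/armt_associate/task.py    # 2. custom task
kernel-evo evolve --kernelbench-problem <level/name>     # 3. KernelBench task
```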


Monitor progress

Use TensorBoard to find iterations with good performance before extracting programs.
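TensorBoard itself is launched in the usual way; the log directory below is an assumption about where the experiment writes its summaries:

```shell
# Point TensorBoard at the experiment's log directory (path is an assumption).
tensorboard --logdir <experiment-logdir> --port 6006
```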


Extract a program

Export the program from a specific iteration (e.g. after inspecting TensorBoard):
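Assuming the `extract` subcommand takes an iteration number and an output path (both flags are guesses for illustration):

```shell
# Hypothetical: export the program from an iteration chosen via TensorBoard.
kernel-evo extract --iteration <iteration> --output best_program.py
```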


Compare two programs

Custom task

KernelBench task
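Conceptually, `compare` checks an optimized program for correctness against a reference and then measures relative speed. A standalone sketch of that idea (not the tool's actual implementation; the function name and tolerances are mine):

```python
import time
import torch

def compare_programs(ref_fn, opt_fn, inputs, atol=1e-4, iters=50):
    """Return (is_correct, speedup) for an optimized fn vs. a reference fn."""
    # Correctness: outputs must match within tolerance.
    correct = torch.allclose(ref_fn(*inputs), opt_fn(*inputs), atol=atol)

    def bench(fn):
        # Average wall-clock time over several iterations.
        start = time.perf_counter()
        for _ in range(iters):
            fn(*inputs)
        return (time.perf_counter() - start) / iters

    return correct, bench(ref_fn) / bench(opt_fn)

x = torch.randn(256, 256)
ok, speedup = compare_programs(
    lambda a: torch.softmax(a, dim=-1),
    lambda a: torch.nn.functional.softmax(a, dim=-1),
    [x],
)
```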


CLI overview

| Command | Description |
| --- | --- |
| `evolve` | Run evolution (custom or KernelBench) |
| `eval-server` | Start the remote validation server |
| `extract` | Export a program by iteration from Redis |
| `compare` | Compare two programs (correctness + perf) |

Best practices

Model selection

Evolution quality depends heavily on the underlying model. For the best results, use frontier models such as GPT, Claude, or Gemini.

Recommendation for best-value vendor model:

  1. gemini flash 3. Capable yet not very costly; it sometimes creates faulty kernels, but it is able to recover buggy code.

Recommendations for open-source models:

  1. gpt-oss-120b. The best baseline for kernel evolution; its reasoning is good enough to recover faulty kernels.
  2. GLM-5. Of all the very large open LLMs, it is the only one that seems to know Triton and generates decent kernels. Downsides: slower generation, and too large for convenient local inference.

Experiments

The quality of the result depends on the starting seeds and can vary between runs, so it makes sense to restart and try again if the solution is still very bad after the first 200k tokens.

We also noticed that Triton is better for small efficient kernels such as softmax and matmuls, simply because it requires less knowledge from the model. For complex tasks like KernelBench level 2, the difference is smaller.

Remote validation

It is better to run validation via the eval server in a separate terminal; this way you can watch validation results as they come in.

Cheaper start

Use the `--disable-insights-lineage` flag to disable additional LLM calls. This is beneficial for short debug runs or when using expensive models.