kernel-evo
Evolutionary generation of efficient GPU kernels using GigaEvo.
Define a task, run evolution with an LLM backend, extract and compare optimized programs.
Features
- Custom tasks — Define your own kernel tasks in KernelBench format and evolve them.
- KernelBench integration — Use existing KernelBench problems.
- Triton and CUDA inline backends — the two most popular ways to write kernels, suitable for different scenarios.
- Remote or local execution — Run validation locally or via a remote eval server.
- Cost efficient — works with fast models such as Gemini Flash 3 and gpt-oss-120b; a typical experiment costs $0.5–1. Frontier models with high reasoning effort would be beneficial, but cost an order of magnitude more.
Requirements
- Python >= 3.12
- LLM API — OpenAI-compatible (e.g. OpenRouter, or a local server like SGLang).
- Redis — Used by GigaEvo for experiment state.
Installation
From source
Note: `--ignore-requires-python` relaxes the Python version check (KernelBench may declare 3.10 but works on 3.12).
For custom branches of `gigaevo` or `kernelbench`, edit the Git URLs in `pyproject.toml`.
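A minimal sketch of a typical from-source install, assuming a standard `pyproject.toml` layout (the clone URL is a placeholder):

```shell
git clone https://github.com/<org>/kernel-evo.git
cd kernel-evo
pip install --ignore-requires-python -e .
```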
Docker
Pull and run (when a pre-built image is published):
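A sketch of the usual pull-and-run flow; the registry, image name, and tag are placeholders, and `--gpus all` assumes the NVIDIA Container Toolkit is installed:

```shell
docker pull <registry>/kernel-evo:latest
docker run --gpus all --rm -it <registry>/kernel-evo:latest
```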
To build the image yourself (e.g. for private dependencies or development), see build/README.md.
Custom kernel task
To evolve your own kernel, create a task in KernelBench format. Example layout:
```
tasks/
└── armt_associate/
    └── task.py
```
See `tasks/armt_associate/` in this repo for a reference. You can also use any existing task from KernelBench.
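For orientation, a minimal sketch of what a `task.py` in KernelBench format contains: a reference `Model` plus `get_inputs`/`get_init_inputs` helpers. This softmax task is purely illustrative (it is not the `armt_associate` task), but the structure matches the KernelBench convention:

```python
# Hypothetical tasks/my_softmax/task.py in KernelBench format.
import torch
import torch.nn as nn


class Model(nn.Module):
    """Reference implementation the evolved kernel must match."""

    def __init__(self, dim: int = -1):
        super().__init__()
        self.dim = dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.softmax(x, dim=self.dim)


def get_inputs():
    # Random batch used for correctness and timing checks.
    return [torch.randn(16, 1024)]


def get_init_inputs():
    # Arguments passed to Model.__init__.
    return [-1]
```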
Run evolution
Evolution can use a local or remote LLM (e.g. SGLang, OpenRouter). Examples below use OpenRouter and a remote eval server.
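When using OpenRouter, its OpenAI-compatible endpoint is typically configured through environment variables like these (the exact variable names your launcher reads are an assumption):

```shell
export OPENAI_BASE_URL=https://openrouter.ai/api/v1
export OPENAI_API_KEY=<your OpenRouter key>
```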
1. Start the eval server (optional, for remote validation)
In a separate terminal:
2. Evolve with a custom task
3. Evolve with a KernelBench task
Monitor progress
Use TensorBoard to find iterations with good performance before extracting programs.
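Launching TensorBoard is the standard one-liner; the log directory below is a placeholder for wherever your experiment writes its logs:

```shell
tensorboard --logdir <experiment_log_dir>
```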
Extract a program
Export the program from a specific iteration (e.g. after inspecting TensorBoard):
Compare two programs
Custom task
KernelBench task
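Whichever task type you compare, the tool conceptually does the same two things: check the candidate's outputs against the reference, then benchmark both. A pure-Python sketch of that logic (the real tool runs GPU kernels; all names here are illustrative):

```python
import math
import time


def compare(ref_fn, test_fn, inputs, warmup=3, iters=50, tol=1e-6):
    """Check correctness of test_fn against ref_fn, then benchmark both."""
    ref_out, test_out = ref_fn(*inputs), test_fn(*inputs)
    if any(abs(a - b) > tol for a, b in zip(ref_out, test_out)):
        raise ValueError("outputs differ beyond tolerance")

    def bench(fn):
        for _ in range(warmup):  # warm up caches / JIT before timing
            fn(*inputs)
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(*inputs)
        return (time.perf_counter() - t0) / iters  # mean seconds per call

    return {"ref_s": bench(ref_fn), "test_s": bench(test_fn)}


def softmax_naive(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def softmax_stable(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]
```

For example, `compare(softmax_naive, softmax_stable, [[0.0, 1.0, 2.0]])` verifies the two implementations agree and returns their mean runtimes.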
CLI overview
| Command | Description |
|---|---|
|  | Run evolution (custom or KernelBench) |
|  | Start remote validation server |
|  | Export program by iteration from Redis |
|  | Compare two programs (correctness + perf) |
Best practices
Model selection
Evolution depends heavily on the underlying model. For better results, use frontier models such as GPT, Claude, or Gemini.
Recommendation for the best-value vendor model:
- Gemini Flash 3 — capable, yet not very costly. It sometimes creates faulty kernels, but is able to recover from buggy code.
Recommendation for open-source models:
- gpt-oss-120b — the best baseline for kernel evolution; its reasoning is good enough to recover faulty kernels.
- GLM-5 — of all the very large open LLMs, it is the only one that seems to know Triton and generates decent kernels. Downsides: slower generation and very large for local inference.
Experiments
Result quality depends on the starting seeds and can vary between runs, so it makes sense to restart and try again if the solution is still very bad after the first 200k tokens.
We also noticed that Triton does better on small efficient kernels, such as softmax and matmuls, simply because it requires less knowledge from the model. For complex tasks like KernelBench level 2, the difference is smaller.
Remote validation
It is better to run validation via the validator server in a separate terminal; this way you can watch results as they arrive.
Cheaper start
Use the flag to disable additional calls; this is beneficial for short debug runs or when using expensive models.