April 2026
Megakernel: Matching Apple Silicon Efficiency at 2x the Throughput on an RTX 3090
The first megakernel for hybrid DeltaNet/Attention LLMs. We fused all 24 layers into a single CUDA dispatch and it changed everything we assumed about GPU efficiency.
TL;DR
We fused all 24 layers of Qwen 3.5-0.8B, a hybrid DeltaNet + Attention model, into a single CUDA kernel launch. On an RTX 3090 power-limited to 220W:
- 1.8x the throughput of an Apple M5 Max (411 vs 229 tok/s)
- 1.87 tok/J, matching or exceeding M5 Max efficiency at sustained load
- 1.55x faster than llama.cpp on the same GPU
The RTX 3090 launched in 2020. It's rated at 350W. Everyone calls it power-hungry.
The Assumption Everyone Makes
Everyone knows: NVIDIA GPUs are fast but power-hungry. Apple Silicon is slower but efficient. Pick your tradeoff.
Here's what that looks like on Qwen 3.5-0.8B with stock frameworks:
| Setup | tok/s | Power | tok/J |
|---|---|---|---|
| RTX 3090 (llama.cpp) | 267 | 350W | 0.76 |
| M5 Max (LM Studio) | 229 | ~130W | 1.76 |
NVIDIA: faster, but 2.3x worse on efficiency. Case closed, right?
What If the Problem Isn't the GPU?
The RTX 3090 has 936 GB/s of memory bandwidth and 142 TFLOPS of FP16 compute. Decode is bandwidth-bound: streaming the ~1.6 GB of BF16 weights once per token puts the roofline near 585 tok/s, yet llama.cpp extracts only 267 tok/s from this model. That's low.
Why? Because of how frameworks run inference:
Traditional: ~100 kernel launches per token
Megakernel: 1 launch, 24 layers, zero gaps
Every layer boundary means:
- Return control to the CPU
- Dispatch the next kernel
- Re-fetch weights from global memory
- Synchronize threads
For 24 layers, that's roughly 100 kernel launches per token. Each launch wastes microseconds. Those microseconds add up, and each one burns power doing nothing useful.
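To put a number on that waste, here's a back-of-envelope sketch. The ~5 µs per-launch cost is a ballpark assumption for illustration, not a figure measured on this setup:

```python
# Back-of-envelope: how much of each token's time budget goes to launch overhead.
# Assumes ~5 microseconds of CPU+driver overhead per kernel launch (a typical
# ballpark, not a measured value for this hardware).
LAUNCH_OVERHEAD_US = 5.0
LAUNCHES_PER_TOKEN = 100

def launch_overhead_fraction(tok_per_s: float) -> float:
    token_budget_us = 1e6 / tok_per_s            # time available per token
    overhead_us = LAUNCHES_PER_TOKEN * LAUNCH_OVERHEAD_US
    return overhead_us / token_budget_us

# At llama.cpp's 267 tok/s the budget is ~3.7 ms/token;
# 100 launches * 5 us = 0.5 ms of it spent dispatching.
print(f"{launch_overhead_fraction(267):.0%}")  # → 13%
```

Under that assumption, roughly an eighth of every token's time goes to dispatch before any useful work happens, and the fraction grows as the kernels themselves get faster.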
But there's a deeper problem here. Qwen 3.5-0.8B isn't a standard transformer.
A New Architecture
Qwen 3.5-0.8B alternates between two types of layers:
- 18 DeltaNet layers: linear attention with a learned recurrence
- 6 Full Attention layers: standard multi-head attention
Qwen 3.5-0.8B = 18 DeltaNet + 6 Attention layers (3:1 ratio)
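For reference, the delta-rule recurrence behind a DeltaNet layer can be sketched in NumPy. This is one common formulation from the literature, not necessarily the exact math the fused kernel implements:

```python
import numpy as np

def deltanet_layer(q, k, v, beta):
    """Reference (non-fused) DeltaNet recurrence for one head.

    One common delta-rule formulation:
        S_t = S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k_t^T
        o_t = S_t q_t
    q, k: (T, d_k); v: (T, d_v); beta: (T,). Keys assumed normalized.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k), dtype=np.float32)   # recurrent state, kept in FP32
    out = np.empty((T, d_v), dtype=np.float32)
    for t in range(T):
        pred = S @ k[t]                                 # what the state predicts for k_t
        S = S + beta[t] * np.outer(v[t] - pred, k[t])   # delta-rule correction
        out[t] = S @ q[t]
    return out
```

The key property is visible in the loop: the state `S` is a fixed-size matrix updated in place, so cost per token is constant regardless of context length, unlike the growing KV cache of full attention.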
This hybrid architecture is where LLMs are heading: Qwen3-Next, Kimi Linear, and others all use this pattern. DeltaNet scales linearly with context length instead of quadratically. It's more efficient by design.
It's a new architecture and no one has built a fused kernel for it yet. MLX doesn't have DeltaNet kernels. llama.cpp supports it, but generically. This is the first megakernel for it.
That 267 tok/s on the RTX 3090? Not a hardware limitation. A software one.
One Kernel Launch
We wrote a single CUDA kernel that processes all 24 layers in one dispatch. No CPU round-trips between layers. No redundant memory fetches. Data stays in registers and shared memory as it flows through the network.
What this means concretely:
- 82 blocks, 512 threads: all SMs on the RTX 3090 stay occupied
- BF16 weights, BF16 activations, FP32 accumulation
- DeltaNet recurrence runs natively: warp-cooperative state updates in FP32 registers
- Full attention with online softmax: fused QKV, RoPE, causal attention, output projection
- Zero inter-layer overhead: cooperative grid sync replaces kernel launches
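The online-softmax trick in the attention path can be sketched as follows, element-at-a-time for clarity (the actual kernel works on tiles, and this is our sketch of the standard technique, not the kernel's code):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax-weighted sum, the core idea of fused attention.

    Streams over scores keeping a running max `m` and normalizer `l`,
    rescaling the partial output whenever the max grows - so there is no
    second pass over the row and no materialized attention matrix.
    """
    m = -np.inf                                   # running max of scores
    l = 0.0                                       # running sum of exp(score - m)
    acc = np.zeros_like(values[0], dtype=np.float64)
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = np.exp(m - m_new)                 # rescale old partials to new max
        w = np.exp(s - m_new)
        l = l * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / l
```

Because nothing depends on seeing the whole score row up front, the softmax fuses cleanly with QKV, RoPE, and the output projection in a single pass.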
| Setup | Prefill (pp520) | Decode (tg128) |
|---|---|---|
| Megakernel | 37,800 tok/s | 413 tok/s |
| llama.cpp BF16 | 11,247 tok/s | 267 tok/s |
| PyTorch HuggingFace | 7,578 tok/s | 108 tok/s |
3.4x faster prefill. 1.55x faster decode. Same hardware, same model, same weights.
Now Turn Down the Power
Fewer wasted cycles means less heat. So we should be able to cut power without losing much speed.
We used nvidia-smi -pl to sweep power limits:
| Power Limit | Clock | Draw | tok/s | tok/J | vs Stock |
|---|---|---|---|---|---|
| 420W (stock) | 1980 MHz | 314W | 433 | 1.38 | baseline |
| 300W | 1935 MHz | 299W | 432 | 1.44 | 99.8% speed, 5% less power |
| 220W | 1635 MHz | 220W | 411 | 1.87 | 95% speed, 30% less power |
| 150W | 405 MHz | 150W | 194 | 1.29 | too aggressive |
At 220W: 95% of the speed, 30% less power. The curve is nonlinear; there's a sweet spot where tighter execution converts directly into saved watts.
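The tok/J column is just throughput divided by measured draw (1 W = 1 J/s), which makes the sweet spot easy to recompute from the table:

```python
# Recomputing the efficiency column of the power-sweep table.
# tok/J = (tok/s) / (J/s) = tok/s divided by measured draw in watts.
sweep = {  # power limit -> (tok/s, measured draw in W)
    "420W (stock)": (433, 314),
    "300W":         (432, 299),
    "220W":         (411, 220),
    "150W":         (194, 150),
}
for limit, (tok_s, watts) in sweep.items():
    print(f"{limit}: {tok_s / watts:.2f} tok/J")
```

The 220W row wins because throughput falls only 5% while draw falls 30%; at 150W the clock collapse outpaces the power savings and efficiency drops again.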
Power measurement methodology
We follow the same approach as Hazy Research's Intelligence Per Watt study: NVML energy counters for NVIDIA GPUs, powermetrics for Apple Silicon. This measures accelerator power, not total system draw.
Side by side
| | RTX 3090 (llama.cpp) | M5 Max | RTX 3090 (Megakernel @220W) |
|---|---|---|---|
| tok/s | 267 | 229 | 411 |
| Power | 350W | ~130W | 220W |
| tok/J | 0.76 | 1.76 | 1.87 |
| GPU price | $700 | $2,499+ (system) | $700 |
Without the megakernel, the RTX 3090 barely edges out a laptop chip: 267 vs 229.
With the megakernel, the same GPU delivers 1.8x the throughput at equal or better efficiency. On a chip released in 2020. At a fraction of the system cost.
The efficiency gap between NVIDIA and Apple wasn't a hardware gap. It was a software gap.
Why DeltaNet matters
Attention has had years of optimization. FlashAttention, PagedAttention, every framework has decent kernels for it by now.
DeltaNet is different. It's a recurrent layer with learned state updates, and it showed up in production models less than a year ago. Frameworks are adding support: MLX, vLLM via Triton, forks of llama.cpp. But nobody has fused all layers into a single kernel yet.
More models are adopting this hybrid pattern because linear attention scales better with context length. As the architecture matures, so will the kernels. This is an early one.
What We Learned Building This
Some things broke along the way.
grid.sync() inside a loop = instant deadlock
Our first attempt synchronized all blocks inside the per-token DeltaNet recurrence loop. Every block waited for every other block. Nothing moved. No error message, just silence. The fix: synchronize between layers, not within them.
Register pressure is the real enemy
We tried tiling the 128x128 DeltaNet state matrix with S_TILE=16 for better instruction-level parallelism. Silent crash. No CUDA error. The compiler spilled registers to local memory, performance collapsed, and eventually the kernel just stopped. S_TILE=8 was the sweet spot.
Try It
The megakernel is open source.
git clone https://github.com/Luce-Org/luce-megakernel.git
cd luce-megakernel
pip install -e .
python bench_pp_tg.py
Requirements:
- NVIDIA GPU (RTX 3090 tested, should work on Ampere+)
- CUDA 12+
- PyTorch 2.0+
- ~1.5GB VRAM for BF16 weights
The efficiency gap between NVIDIA and Apple isn't inherent to the silicon. It's an artifact of running generic software on capable hardware.
When you write a kernel that actually uses what the GPU offers (tensor cores, shared memory, cooperative grid launches, register-resident state) a five-year-old GPU matches Apple's latest chip on efficiency while delivering nearly twice the throughput.
As models move beyond standard attention, the inference stack matters more than the spec sheet.
The code is open source. The hardware is coming soon.