April 2026

Megakernel: Matching Apple Silicon Efficiency at 2x the Throughput on an RTX 3090

The first megakernel for hybrid DeltaNet/Attention LLMs. We fused all 24 layers into a single CUDA dispatch and it changed everything we assumed about GPU efficiency.

RTX 3090 vs MacBook Pro

TL;DR

We fused all 24 layers of Qwen 3.5-0.8B, a hybrid DeltaNet + Attention model, into a single CUDA kernel launch. On an RTX 3090 power-limited to 220W: 411 tok/s decode at 1.87 tok/J, matching Apple M5 Max efficiency (1.76 tok/J) at 1.8x its throughput.

The RTX 3090 launched in 2020. It's rated at 350W. Everyone calls it power-hungry.


The Assumption Everyone Makes

Everyone knows: NVIDIA GPUs are fast but power-hungry. Apple Silicon is slower but efficient. Pick your tradeoff.

Here's what that looks like on Qwen 3.5-0.8B with stock frameworks:

Setup                   tok/s   Power   tok/J
RTX 3090 (llama.cpp)    267     350W    0.76
M5 Max (LM Studio)      229     ~130W   1.76

NVIDIA: faster, but 2.3x worse on efficiency. Case closed, right?

What If the Problem Isn't the GPU?

The RTX 3090 has 936 GB/s of memory bandwidth and 142 TFLOPS of FP16 compute. But llama.cpp extracts only 267 tok/s from it on this model. That's low for the hardware.
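"Low" can be made concrete with a simple bandwidth bound (assuming BF16 weights, so roughly 0.8B parameters x 2 bytes ≈ 1.6 GB streamed per decoded token):

```latex
t_{\text{token}} \;\ge\; \frac{1.6\ \text{GB}}{936\ \text{GB/s}} \approx 1.7\ \text{ms}
\quad\Longrightarrow\quad \text{ceiling} \approx 585\ \text{tok/s}
```

267 tok/s is under half of that ceiling, so the bottleneck is not the memory system.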

Why? Because of how frameworks run inference:

Traditional: ~100 kernel launches per token

Megakernel: 1 launch, 24 layers, zero gaps


Every layer boundary means:

- a fresh kernel launch, with CPU-side dispatch latency
- a round-trip through the CPU to schedule the next op
- activations flushed to global memory and fetched back

For 24 layers, that's roughly 100 kernel launches per token. Each launch wastes microseconds; those microseconds add up, and each one burns power doing nothing useful.
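The gap between the two launch strategies can be sketched in host code. This is illustrative, not the project's real API; `layer_op` and `megakernel` are stand-ins for the actual kernels:

```cuda
// Illustrative sketch: the traditional path dispatches one small kernel per
// op, the fused path issues a single cooperative launch so inter-layer
// synchronization never leaves the GPU.
#include <cuda_runtime.h>

__global__ void layer_op(float* x) { /* one op of one layer */ }
__global__ void megakernel(float* x) { /* all 24 layers fused */ }

void forward_traditional(float* x) {
    // ~4 ops/layer x 24 layers ~= 100 launches per token, each with
    // CPU-side launch latency and an idle gap on the GPU.
    for (int layer = 0; layer < 24; ++layer)
        for (int op = 0; op < 4; ++op)
            layer_op<<<82, 256>>>(x);
    cudaDeviceSynchronize();
}

void forward_fused(float* x) {
    // One dispatch per token; 82 blocks = one per SM on a 3090, since a
    // cooperative launch requires every block to be resident at once.
    void* args[] = { &x };
    cudaLaunchCooperativeKernel((void*)megakernel, dim3(82), dim3(256), args);
    cudaDeviceSynchronize();
}
```
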

But there's a deeper problem here. Qwen 3.5-0.8B isn't a standard transformer.

A New Architecture

Qwen 3.5-0.8B alternates between two types of layers:

Qwen 3.5-0.8B: 18 DeltaNet + 6 Attention layers (3:1 ratio)


This hybrid architecture is where LLMs are heading. Qwen3-Next and Kimi Linear both use this pattern. DeltaNet scales linearly with context length instead of quadratically. It's more efficient by design.

It's a new architecture and no one has built a fused kernel for it yet. MLX doesn't have DeltaNet kernels. llama.cpp supports it, but generically. This is the first megakernel for it.

That 267 tok/s on the RTX 3090? Not a hardware limitation. A software one.

One Kernel Launch

We wrote a single CUDA kernel that processes all 24 layers in one dispatch. No CPU round-trips between layers. No redundant memory fetches. Data stays in registers and shared memory as it flows through the network.
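The shape of that single dispatch, as a hedged sketch (the layer helpers and the exact interleaving order are hypothetical stand-ins for the real fused code):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical helpers standing in for the fused layer bodies.
__device__ void deltanet_layer(int l, const float* w, float* act, float* s) {}
__device__ void attention_layer(int l, const float* w, float* act) {}

__global__ void megakernel(const float* weights, float* activations) {
    cg::grid_group grid = cg::this_grid();
    float state[8];  // per-thread slice of the register-resident DeltaNet state

    for (int layer = 0; layer < 24; ++layer) {
        if (layer % 4 != 3)  // 3:1 DeltaNet-to-Attention pattern (assumed order)
            deltanet_layer(layer, weights, activations, state);
        else
            attention_layer(layer, weights, activations);
        grid.sync();         // on-GPU barrier between layers; no CPU round-trip
    }
}
```

The cooperative grid is what makes this possible: without it, a layer boundary would require returning to the CPU just to get a grid-wide barrier.
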

What this means concretely:

Setup                 Prefill (pp520)   Decode (tg128)
Megakernel            37,800 tok/s      413 tok/s
llama.cpp BF16        11,247 tok/s      267 tok/s
PyTorch HuggingFace   7,578 tok/s       108 tok/s

3.4x faster prefill. 1.55x faster decode. Same hardware, same model, same weights.

Now Turn Down the Power

Fewer wasted cycles means less heat. So we should be able to cut power without losing much speed.

We used nvidia-smi -pl to sweep power limits:

Power Limit    Clock      Draw   tok/s   tok/J   vs Stock
420W (stock)   1980 MHz   314W   433     1.38    baseline
300W           1935 MHz   299W   432     1.44    99.8% speed, 5% less power
220W           1635 MHz   220W   411     1.87    95% speed, 30% less power
150W           405 MHz    150W   194     1.29    too aggressive

At 220W: 95% of the speed, 30% less power. The curve is nonlinear: there's a sweet spot where tighter execution converts directly into saved watts.

Power measurement methodology

We follow the same approach as Hazy Research's Intelligence Per Watt study: NVML energy counters for NVIDIA GPUs, powermetrics for Apple Silicon. This measures accelerator power, not total system draw.
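For the NVIDIA side, the NVML counter in question can be read like this. A minimal sketch, assuming one GPU and a driver that exposes the total-energy counter; link against `-lnvidia-ml`, and note the counter reports millijoules:

```cuda
#include <nvml.h>
#include <cstdio>

int main() {
    nvmlInit_v2();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex_v2(0, &dev);

    unsigned long long e0 = 0, e1 = 0;  // millijoules since driver load
    nvmlDeviceGetTotalEnergyConsumption(dev, &e0);
    // ... run the benchmark here, counting decoded tokens ...
    nvmlDeviceGetTotalEnergyConsumption(dev, &e1);

    double joules = (double)(e1 - e0) / 1000.0;
    printf("energy: %.1f J (tok/J = tokens / joules)\n", joules);
    nvmlShutdown();
    return 0;
}
```

Bracketing the benchmark with an energy counter, rather than sampling instantaneous power, avoids aliasing against the GPU's own power-management transients.
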

Side by side

          RTX 3090 (llama.cpp)   M5 Max             RTX 3090 (Megakernel @220W)
tok/s     267                    229                411
Power     350W                   ~130W              220W
tok/J     0.76                   1.76               1.87
Price     $700 (GPU)             $2,499+ (system)   $700 (GPU)

Without the megakernel, the RTX 3090 barely edges out a laptop chip: 267 vs 229.

With the megakernel, the same GPU delivers 1.8x the throughput at equal or better efficiency. On a chip released in 2020, at a fraction of the system cost.

The efficiency gap between NVIDIA and Apple wasn't a hardware gap. It was a software gap.

Why DeltaNet matters

Attention has had years of optimization. FlashAttention, PagedAttention, every framework has decent kernels for it by now.

DeltaNet is different. It's a recurrent layer with learned state updates, and it showed up in production models less than a year ago. Frameworks are adding support: MLX, vLLM via Triton, forks of llama.cpp. But nobody has fused all layers into a single kernel yet.
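Concretely, the state update is the delta rule, as described in the DeltaNet literature (S_t is the fixed-size per-head state matrix, beta_t a learned write strength):

```latex
S_t = S_{t-1}\left(I - \beta_t\, k_t k_t^{\top}\right) + \beta_t\, v_t k_t^{\top},
\qquad o_t = S_t\, q_t
```

Each token performs a rank-1 correction to a fixed-size state instead of attending over a growing KV cache, which is why the cost is linear rather than quadratic in sequence length.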

More models are adopting this hybrid pattern because linear attention scales better with context length. As the architecture matures, so will the kernels. This is an early one.

What We Learned Building This

Some things broke along the way.

grid.sync() inside a loop = instant deadlock

Our first attempt synchronized all blocks inside the per-token DeltaNet recurrence loop. Every block waited for every other block. Nothing moved. No error message, just silence. The fix: synchronize between layers, not within them.
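A sketch of the failure mode (illustrative; `grid.sync()` deadlocks whenever any block can reach a barrier that another block skips or reaches a different number of times):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void deltanet_recurrence(float* state, int seq_len) {
    cg::grid_group grid = cg::this_grid();

    for (int t = 0; t < seq_len; ++t) {
        // update_state(state, t);  // per-token recurrence step
        // grid.sync();             // first attempt: every block waits on
                                    // every other block, every token; one
                                    // divergence and the whole grid hangs
    }
    grid.sync();                    // the fix: one barrier, at the layer boundary
}
```

A grid-wide barrier is all-or-nothing: unlike `__syncthreads()`, there is no error path, which is why the symptom was silence rather than a CUDA error.
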

Register pressure is the real enemy

We tried tiling the 128x128 DeltaNet state matrix with S_TILE=16 for better instruction-level parallelism. Silent crash. No CUDA error. The compiler spilled registers to local memory, performance collapsed, and eventually the kernel just stopped. S_TILE=8 was the sweet spot.
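Roughly what the tiling knob controls (a sketch; the real kernel's indexing differs, but `S_TILE` is the parameter named above):

```cuda
// Each thread keeps S_TILE rows of its column slice of the 128x128 state in
// registers. At S_TILE=16 the per-thread array doubles, the compiler spills
// to local memory, and performance collapses; S_TILE=8 stays in registers.
#define S_TILE 8

__global__ void deltanet_update(const float* k, const float* v, float* S) {
    float s_reg[S_TILE];                  // register-resident state slice
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    #pragma unroll
    for (int i = 0; i < S_TILE; ++i)      // fully unrolled so s_reg indexing
        s_reg[i] = S[i * 128 + col];      // is static and stays in registers

    // ... delta-rule update on s_reg ...

    #pragma unroll
    for (int i = 0; i < S_TILE; ++i)
        S[i * 128 + col] = s_reg[i];
}
```

Checking `nvcc --ptxas-options=-v` output for spill counts is the fastest way to catch this class of regression before it shows up as a mystery crash.
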

Try It

The megakernel is open source.

git clone https://github.com/Luce-Org/luce-megakernel.git
cd luce-megakernel
pip install -e .
python bench_pp_tg.py

Requirements:


The efficiency gap between NVIDIA and Apple isn't inherent to the silicon. It's an artifact of running generic software on capable hardware.

When you write a kernel that actually uses what the GPU offers (tensor cores, shared memory, cooperative grid launches, register-resident state) a five-year-old GPU matches Apple's latest chip on efficiency while delivering nearly twice the throughput.

As models move beyond standard attention, the inference stack matters more than the spec sheet.

The code is open source. The hardware is coming soon.

Built from the kernel up.
