April 2026
Megakernel: Matching Apple Silicon Efficiency at 2x the Throughput on an RTX 3090
The first megakernel for hybrid DeltaNet/Attention LLMs. We fused all 24 layers into a single CUDA dispatch and it changed everything we assumed about GPU efficiency.
TL;DR
We fused all 24 layers of Qwen 3.5-0.8B, a hybrid DeltaNet + Attention model, into a single CUDA kernel launch. On an RTX 3090 power-limited to 220W:
- 1.8x the throughput of an Apple M5 Max (411 vs 229 tok/s)
- 1.87 tok/J, matching or exceeding M5 Max efficiency at sustained load
- 1.55x faster than llama.cpp on the same GPU
The RTX 3090 launched in 2020. It's rated at 350W. Everyone calls it power-hungry.
The Assumption Everyone Makes
Everyone knows: NVIDIA GPUs are fast but power-hungry. Apple Silicon is slower but efficient. Pick your tradeoff.
Here's what that looks like on Qwen 3.5-0.8B with stock frameworks:
| Setup | tok/s | Power | tok/J |
|---|---|---|---|
| RTX 3090 (llama.cpp) | 267 | 350W | 0.76 |
| M5 Max (LM Studio) | 229 | ~130W | 1.76 |
NVIDIA: faster, but 2.3x worse on efficiency. Case closed, right?
What If the Problem Isn't the GPU?
The RTX 3090 has 936 GB/s of memory bandwidth and 142 TFLOPS of FP16 compute. Decode is bandwidth-bound: streaming the ~1.6 GB of BF16 weights once per token puts the roofline near 585 tok/s, yet llama.cpp extracts only 267 tok/s from this model. That's low.
Why? Because of how frameworks run inference:
Traditional: ~100 kernel launches per token
Megakernel: 1 launch, 24 layers, zero gaps
Every layer boundary means:
- Return control to the CPU
- Dispatch the next kernel
- Re-fetch weights from global memory
- Synchronize threads
For 24 layers, that's roughly 100 kernel launches per token. Each launch wastes microseconds. Those microseconds add up, and each one burns power doing nothing useful.
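To put a number on that waste, here's a back-of-envelope sketch. The ~5 µs per-launch cost is a ballpark assumption for illustration, not a figure measured on this setup:

```python
# Back-of-envelope: how much of each token's time budget goes to launch overhead.
# Assumes ~5 microseconds of CPU+driver overhead per kernel launch (a typical
# ballpark, not a measured value for this hardware).
LAUNCH_OVERHEAD_US = 5.0
LAUNCHES_PER_TOKEN = 100

def launch_overhead_fraction(tok_per_s: float) -> float:
    token_budget_us = 1e6 / tok_per_s            # time available per token
    overhead_us = LAUNCHES_PER_TOKEN * LAUNCH_OVERHEAD_US
    return overhead_us / token_budget_us

# At llama.cpp's 267 tok/s the budget is ~3.7 ms/token;
# 100 launches * 5 us = 0.5 ms of it spent dispatching.
print(f"{launch_overhead_fraction(267):.0%}")  # → 13%
```

Under that assumption, roughly an eighth of every token's time goes to dispatch before any useful work happens, and the fraction grows as the kernels themselves get faster.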
But there's a deeper problem here. Qwen 3.5-0.8B isn't a standard transformer.
A New Architecture
Qwen 3.5-0.8B alternates between two types of layers:
- 18 DeltaNet layers: linear attention with a learned recurrence
- 6 Full Attention layers: standard multi-head attention
Qwen 3.5-0.8B = 18 DeltaNet + 6 Attention layers (3:1 ratio)
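For reference, the delta-rule recurrence behind a DeltaNet layer can be sketched in NumPy. This is one common formulation from the literature, not necessarily the exact math the fused kernel implements:

```python
import numpy as np

def deltanet_layer(q, k, v, beta):
    """Reference (non-fused) DeltaNet recurrence for one head.

    One common delta-rule formulation:
        S_t = S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k_t^T
        o_t = S_t q_t
    q, k: (T, d_k); v: (T, d_v); beta: (T,). Keys assumed normalized.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k), dtype=np.float32)   # recurrent state, kept in FP32
    out = np.empty((T, d_v), dtype=np.float32)
    for t in range(T):
        pred = S @ k[t]                                 # what the state predicts for k_t
        S = S + beta[t] * np.outer(v[t] - pred, k[t])   # delta-rule correction
        out[t] = S @ q[t]
    return out
```

The key property is visible in the loop: the state `S` is a fixed-size matrix updated in place, so cost per token is constant regardless of context length, unlike the growing KV cache of full attention.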
This hybrid architecture is where LLMs are heading: Qwen3-Next, Kimi Linear, and others all use this pattern. DeltaNet scales linearly with context length instead of quadratically. It's more efficient by design.
It's a new architecture and no one has built a fused kernel for it yet. MLX doesn't have DeltaNet kernels. llama.cpp supports it, but generically. This is the first megakernel for it.
That 267 tok/s on the RTX 3090? Not a hardware limitation. A software one.
One Kernel Launch
We wrote a single CUDA kernel that processes all 24 layers in one dispatch. No CPU round-trips between layers. No redundant memory fetches. Data stays in registers and shared memory as it flows through the network.
What this means concretely:
- 82 blocks, 512 threads: all SMs on the RTX 3090 stay occupied
- BF16 weights, BF16 activations, FP32 accumulation
- DeltaNet recurrence runs natively: warp-cooperative state updates in FP32 registers
- Full attention with online softmax: fused QKV, RoPE, causal attention, output projection
- Zero inter-layer overhead: cooperative grid sync replaces kernel launches
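The online-softmax trick in the attention path can be sketched as follows, element-at-a-time for clarity (the actual kernel works on tiles, and this is our sketch of the standard technique, not the kernel's code):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax-weighted sum, the core idea of fused attention.

    Streams over scores keeping a running max `m` and normalizer `l`,
    rescaling the partial output whenever the max grows - so there is no
    second pass over the row and no materialized attention matrix.
    """
    m = -np.inf                                   # running max of scores
    l = 0.0                                       # running sum of exp(score - m)
    acc = np.zeros_like(values[0], dtype=np.float64)
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = np.exp(m - m_new)                 # rescale old partials to new max
        w = np.exp(s - m_new)
        l = l * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / l
```

Because nothing depends on seeing the whole score row up front, the softmax fuses cleanly with QKV, RoPE, and the output projection in a single pass.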
| Setup | Prefill (pp520) | Decode (tg128) |
|---|---|---|
| Megakernel | 37,800 tok/s | 413 tok/s |
| llama.cpp BF16 | 11,247 tok/s | 267 tok/s |
| PyTorch HuggingFace | 7,578 tok/s | 108 tok/s |
3.4x faster prefill. 1.55x faster decode. Same hardware, same model, same weights.
Now Turn Down the Power
Fewer wasted cycles means less heat. So we should be able to cut power without losing much speed.
We used nvidia-smi -pl to sweep power limits:
| Power Limit | Clock | Draw | tok/s | tok/J | vs Stock |
|---|---|---|---|---|---|
| 420W (stock) | 1980 MHz | 314W | 433 | 1.38 | baseline |
| 300W | 1935 MHz | 299W | 432 | 1.44 | 99.8% speed, 5% less power |
| 220W | 1635 MHz | 220W | 411 | 1.87 | 95% speed, 30% less power |
| 150W | 405 MHz | 150W | 194 | 1.29 | too aggressive |
At 220W: 95% of the speed, 30% less power. The curve is nonlinear; there's a sweet spot where tighter execution converts directly into saved watts.
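The tok/J column is just throughput divided by measured draw (1 W = 1 J/s), which makes the sweet spot easy to recompute from the table:

```python
# Recomputing the efficiency column of the power-sweep table.
# tok/J = (tok/s) / (J/s) = tok/s divided by measured draw in watts.
sweep = {  # power limit -> (tok/s, measured draw in W)
    "420W (stock)": (433, 314),
    "300W":         (432, 299),
    "220W":         (411, 220),
    "150W":         (194, 150),
}
for limit, (tok_s, watts) in sweep.items():
    print(f"{limit}: {tok_s / watts:.2f} tok/J")
```

The 220W row wins because throughput falls only 5% while draw falls 30%; at 150W the clock collapse outpaces the power savings and efficiency drops again.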
Power measurement methodology
We follow the same approach as Hazy Research's Intelligence Per Watt study: NVML energy counters for NVIDIA GPUs, powermetrics for Apple Silicon. This measures accelerator power, not total system draw.
Side by side
| | RTX 3090 (llama.cpp) | M5 Max | RTX 3090 (Megakernel @220W) |
|---|---|---|---|
| tok/s | 267 | 229 | 411 |
| Power | 350W | ~130W | 220W |
| tok/J | 0.76 | 1.76 | 1.87 |
| GPU price | $700 | $2,499+ (system) | $700 |
Without the megakernel, the RTX 3090 barely edges out a laptop chip: 267 vs 229.
With the megakernel, the same GPU delivers 1.8x the throughput at equal or better efficiency. On a chip released in 2020. At a fraction of the system cost.
The efficiency gap between NVIDIA and Apple wasn't a hardware gap. It was a software gap.
Why DeltaNet matters
Attention has had years of optimization. FlashAttention, PagedAttention, every framework has decent kernels for it by now.
DeltaNet is different. It's a recurrent layer with learned state updates, and it showed up in production models less than a year ago. Frameworks are adding support: MLX, vLLM via Triton, forks of llama.cpp. But nobody has fused all layers into a single kernel yet.
More models are adopting this hybrid pattern because linear attention scales better with context length. As the architecture matures, so will the kernels. This is an early one.
What We Learned Building This
Some things broke along the way.
grid.sync() inside a loop = instant deadlock
Our first attempt synchronized all blocks inside the per-token DeltaNet recurrence loop. Every block waited for every other block. Nothing moved. No error message, just silence. The fix: synchronize between layers, not within them.
Register pressure is the real enemy
We tried tiling the 128x128 DeltaNet state matrix with S_TILE=16 for better instruction-level parallelism. Silent crash. No CUDA error. The compiler spilled registers to local memory, performance collapsed, and eventually the kernel just stopped. S_TILE=8 was the sweet spot.
Try It
The megakernel is open source.
git clone https://github.com/Luce-Org/luce-megakernel.git
cd luce-megakernel
pip install -e .
python bench_pp_tg.py
Requirements:
- NVIDIA GPU (RTX 3090 tested, should work on Ampere+)
- CUDA 12+
- PyTorch 2.0+
- ~1.5GB VRAM for BF16 weights
The efficiency gap between NVIDIA and Apple isn't inherent to the silicon. It's an artifact of running generic software on capable hardware.
When you write a kernel that actually uses what the GPU offers (tensor cores, shared memory, cooperative grid launches, register-resident state) a five-year-old GPU matches Apple's latest chip on efficiency while delivering nearly twice the throughput.
As models move beyond standard attention, the inference stack matters more than the spec sheet.
The code is open source. The hardware is coming soon.