How Did One RTX 5090 Run Gemma at ~600 Tok/s?
A new research paper made it possible. A regular developer made it real. The implications are bigger than either piece on its own.
Hope you guys are having a good Monday. Here’s an insane letter for you…
Every breakthrough in AI looks like one moment, but it usually comes from two separate groups of people who never talk to each other.
The first group writes papers and ships open source code. They are usually employed by a research lab, they publish on arXiv, and the work is freely available the day it lands. They do not ship products. They ship capability.
The second group buys hardware and tries to make the new capability actually run. They use Discord and Twitter, not press releases. They post benchmark screenshots and config files. They do not write papers. They write README files.
When the two groups happen to overlap on the same idea in the same month, something gets unlocked that neither side could have produced alone. That just happened with local AI on a desktop GPU, and the result is worth understanding even if you have never read a research paper or installed a model in your life.
WHAT THE PAPER SIDE PRODUCED
In February of 2026, three researchers (Jian Chen, Yesheng Liang, and Zhijian Liu) posted a paper on arXiv called DFlash. It introduced a new way to make large language models produce text faster, with no loss in quality. The technique is called block diffusion speculative decoding, and the core idea is simple enough to explain in one paragraph.
When a language model writes a response, it produces one word at a time. Each word requires the model to run through its full set of weights, which is slow. For years, researchers have been working on a trick called speculative decoding, where a small fast model guesses several words ahead and the big slow model checks all the guesses in one pass. If the guesses are mostly right, you get many words for the price of one. The trick worked, but the guessing model itself still had to write its guesses one at a time, which capped how fast the whole system could go.
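The guess-then-check loop is easier to feel in code. What follows is a minimal sketch, not how any real inference engine is written: the two "models" are toy functions invented for illustration (toy_target and toy_draft are hypothetical names), and the checking loop only mimics what a real engine does in one batched pass. It assumes greedy decoding, where the big model always picks its single most likely next word.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to the next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-and-verify cycle and return the tokens it produces."""
    # 1. The small model guesses k tokens, one after another.
    guesses: List[Token] = []
    for _ in range(k):
        guesses.append(draft(context + guesses))

    # 2. The big model checks the guesses in order. A real engine scores all
    #    of them in one batched forward pass; this loop only mimics the result.
    accepted: List[Token] = []
    for guess in guesses:
        correct = target(context + accepted)
        if guess != correct:
            accepted.append(correct)  # keep the big model's token and stop
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the same pass also yields one extra token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[Token],
             n_tokens: int, k: int) -> Tuple[List[Token], int]:
    """Generate n_tokens and count how many verification passes were needed."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        passes += 1
    return out[:len(prompt) + n_tokens], passes


def greedy(target: NextToken, prompt: List[Token], n_tokens: int) -> List[Token]:
    """Plain decoding: one big-model call per token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out


# Toy stand-ins: the "big model" counts upward mod 10, and the "small model"
# agrees with it except right after a 7, where it guesses wrong.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


tokens, passes = generate(toy_target, toy_draft, [0], n_tokens=20, k=4)
assert tokens == greedy(toy_target, [0], 20)  # same text as plain decoding
print(f"20 tokens in {passes} verification passes instead of 20")
```

Notice that a wrong guess never reaches the output. The big model's own word replaces it, which is why the final text matches plain decoding. Notice also that the guessing loop in step 1 still runs one word at a time, and that is the cap the paragraph above describes.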
[This is a good time to read the letter that explains MTP, a technique that grew out of speculative decoding: “Multi Token Prediction Just Made Local Agents Running in vLLM ~3x Faster [2026]”]
DFlash replaces the guessing model with a different kind of network that can produce sixteen words in a single shot, all at once, by looking at the same internal signals the big model is already computing. The drafting step becomes parallel instead of sequential, and the cap goes away. The paper measures end to end speedups of up to six times, with the output being mathematically identical to what the big model would have produced on its own.
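To see why parallel drafting lifts the cap, a back-of-the-envelope timing model helps. This is a sketch under stated assumptions, not the paper's own analysis: every millisecond figure and the tokens-per-cycle value below are made up for illustration, and the only difference between the two cases is whether drafting sixteen words costs sixteen drafter passes or one.

```python
def cycle_speedup(tokens_per_cycle: float, draft_ms: float, target_ms: float) -> float:
    """Speedup over plain decoding, which costs one target pass per token."""
    spec_ms_per_token = (draft_ms + target_ms) / tokens_per_cycle
    return target_ms / spec_ms_per_token


# All four numbers below are illustrative assumptions, not measurements.
TARGET_PASS_MS = 10.0    # one verification pass of the big model
DRAFT_PASS_MS = 1.0      # one forward pass of the drafter
BLOCK = 16               # words drafted per cycle
TOKENS_PER_CYCLE = 6.0   # average words that survive verification

# Sequential drafter: sixteen drafter passes before the big model can check.
sequential = cycle_speedup(TOKENS_PER_CYCLE, BLOCK * DRAFT_PASS_MS, TARGET_PASS_MS)

# Parallel drafter: the whole block comes out of a single drafter pass.
parallel = cycle_speedup(TOKENS_PER_CYCLE, DRAFT_PASS_MS, TARGET_PASS_MS)

print(f"sequential drafter: {sequential:.2f}x, parallel drafter: {parallel:.2f}x")
```

With the same number of accepted words per cycle, the only thing that changed is the drafting bill, and that alone is enough to move the speedup a long way in this toy model.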
That is the research side. A clever idea, a working implementation, and a posted preprint. The kind of contribution that sits on arXiv getting downloaded by a few thousand specialists. Important, but not yet world changing.
Then Red Hat AI did the second half of the work that almost nobody is talking about. They trained production-ready versions of the DFlash drafter for the models people actually want to run, including Google’s Gemma 4, and they shipped those drafters as open weights on Hugging Face. They also worked with the vLLM project (the most widely used open source inference engine) to land an official integration, so anyone running vLLM can flip DFlash on with a config setting instead of writing custom code.
That is the difference between a research paper and a usable tool. The first is a possibility. The second is a button.
WHAT THE EXECUTION SIDE DID WITH IT
Earlier this month, a developer who goes by TechPractice1 (on Medium, YouTube, and X) decided to find out what the new button was actually capable of on a single consumer GPU. The setup was the kind of thing thousands of people have at home:

One NVIDIA RTX 5090. The model was Gemma 4 26B-A4B, quantized to 4-bit so it fits in the 32 GB of memory on the card. vLLM as the inference server. The new Red Hat DFlash drafter loaded alongside it.
He then did something the paper authors did not bother to do: a complete sweep of every possible value for the “how many words should the drafter guess at once” setting, from zero (DFlash off, plain inference) up to fifteen. Each setting ran the same workload. He recorded the throughput at every step.
The baseline with the feature turned off was 228 tokens per second, which was already a strong number on a consumer card. At the optimal setting (which turned out to be a specific value in the middle of the range, not the maximum), the throughput climbed to about 600 tokens per second. He posted the full results table, the config, and a video of the benchmark running live.
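The shape of that sweep is simple enough to sketch. This is not his script: the measure function is a hypothetical stand-in for "start the server with this draft length, run the workload, report the throughput," and the synthetic curve used here is invented (an assumed 80% chance that each drafted position is accepted, and an assumed 5% extra cost per drafted word). It only shows why the best setting can land in the middle of the range.

```python
from typing import Callable, Dict, Tuple


def sweep_draft_tokens(measure: Callable[[int], float],
                       max_k: int = 15) -> Tuple[int, Dict[int, float]]:
    """Measure throughput for every draft length 0..max_k and pick the best."""
    results = {k: measure(k) for k in range(max_k + 1)}
    best_k = max(results, key=lambda k: results[k])
    return best_k, results


def synthetic_measure(k: int) -> float:
    """Hypothetical stand-in for a real benchmark run (invented curve).

    Returns throughput relative to drafting switched off (k = 0 gives 1.0).
    """
    accept_prob = 0.8            # assumed chance each drafted position is right
    extra_cost_per_token = 0.05  # assumed cost added per drafted word
    tokens_per_cycle = sum(accept_prob ** i for i in range(k + 1))
    relative_cost = 1.0 + extra_cost_per_token * k
    return tokens_per_cycle / relative_cost


best_k, results = sweep_draft_tokens(synthetic_measure)
for k, relative in results.items():
    marker = "  <- best" if k == best_k else ""
    print(f"k={k:2d}  {relative:.2f}x baseline{marker}")
```

Longer drafts keep adding cost, but the later guesses are accepted less and less often, so the curve rises, peaks, and then falls. In a real run you would replace synthetic_measure with a function that actually benchmarks the server.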
That number, six hundred tokens per second on a single desktop GPU, is what made the rest of the local AI community sit up. It is roughly the per stream throughput an H100 was producing on this model class a year ago. The H100 lives in a data center, costs about thirty thousand dollars, draws seven hundred watts in a server chassis you cannot plug into a wall outlet, and is rented by the hour.
The 5090 sits in a tower under a desk, costs between three thousand five hundred and four thousand dollars at current street prices (for the GPU alone), draws five hundred and seventy five watts, and runs off any 120 volt outlet in the United States. See the difference?
WHY THIS IS BIGGER THAN A BENCHMARK
If you have never thought about AI inference economics, the simplest way to feel the magnitude is this.
Right now, most companies that use AI rent it from a cloud provider. They pay per token of output, the same way phone customers used to pay per text message. The pricing of those AI APIs assumes a structural gap between what the provider can run in a data center and what a customer could run themselves. That gap is what justifies the bill.
A single RTX 5090 producing 600 tokens per second can generate nearly fifty two million tokens of output in a day if you keep it busy. At public API rates for a similar speed tier, that volume costs roughly a thousand dollars a day to rent. The same card costs about two dollars a day in electricity to run. The hardware pays for itself in about four days of equivalent API spending. After that, output is essentially free for the working life of the card, which is several years.
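The arithmetic behind that paragraph fits in a few lines. This is a rough sketch, using the article's own figures for throughput, wattage, card price, and daily API spend. The electricity price of fifteen cents per kilowatt hour is an assumption added here, and the calculation counts only the card's own power draw.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tokens_per_day(tokens_per_second: float) -> float:
    """Output volume if the card stays busy around the clock."""
    return tokens_per_second * SECONDS_PER_DAY


def electricity_usd_per_day(watts: float, usd_per_kwh: float) -> float:
    """Daily power cost of a device drawing a constant number of watts."""
    return watts / 1000 * 24 * usd_per_kwh


def payback_days(card_price_usd: float, api_usd_per_day: float,
                 power_usd_per_day: float) -> float:
    """Days until avoided API spend (net of electricity) covers the card."""
    return card_price_usd / (api_usd_per_day - power_usd_per_day)


# From the article: 600 tok/s, 575 W, $3,500 to $4,000 per card, ~$1,000/day API spend.
# Assumption (not from the article): electricity at $0.15 per kWh.
daily_tokens = tokens_per_day(600)
power_cost = electricity_usd_per_day(575, usd_per_kwh=0.15)
implied_api_price = 1000 / (daily_tokens / 1_000_000)  # USD per million tokens

print(f"tokens per day:        {daily_tokens:,.0f}")
print(f"electricity per day:   ${power_cost:.2f}")
print(f"implied API price:     ${implied_api_price:.2f} per million tokens")
print(f"payback, $3,500 card:  {payback_days(3500, 1000, power_cost):.1f} days")
print(f"payback, $4,000 card:  {payback_days(4000, 1000, power_cost):.1f} days")
```

The implied API price line is worth a look before you trust the payback number: the thousand dollars a day only holds if the API you would otherwise rent charges roughly that much per million output tokens. Plug in your own rate and your own utilization and the payback period moves with them.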
For a small business currently spending five or ten thousand dollars a month on inference, the conclusion is now mechanical. One or two desktop GPUs in a closet replace the cloud bill inside the first month of operation, and the marginal cost of inference drops to the electricity meter. There is no version of the math where renting wins for that workload.
The fair caveat: Gemma 4 26B-A4B is not the smartest model in the world. It is not GPT 5 or Claude Opus 4.7. For the hardest reasoning problems, the frontier APIs still win on quality. The point of this story is not that the desktop card replaces the frontier. The point is that it replaces the frontier for the eighty percent of work where the frontier was overkill in the first place. That eighty percent is where the volume lives, and the volume is where the money is.
WHY THE COMBINATION OF PAPER PLUS EXECUTION MATTERS
Either half of this story on its own would be a footnote.
The paper alone would have produced a few hundred downloads, a few conference presentations, and a few months of debate among speculative decoding researchers. The technique would have eventually made it into production at one or two of the big labs and stayed there as an internal optimization. The outside world would not have noticed.
The execution alone, without the paper, would have been impossible. You cannot benchmark a technique that does not exist. The reason TechPractice1 was able to produce that ~600 tok/s number is that the research had been done, the drafter had been trained by Red Hat, and the integration code had been written and merged into vLLM. The execution piece took maybe a few days of work, and that is precisely the point. It took a few days because the prior work had already happened.
The merger of the two is what makes this a real moment.
A capability lab produced the algorithm. A productization team turned the algorithm into open weights and shipped code. A regular developer took the open weights, ran them on a card he bought himself, and proved that the result was as advertised on the kind of hardware anyone can buy. The full chain from research preprint to reproducible consumer benchmark closed in about three months. That is fast in a way the rest of the industry has not caught up to.
WHAT THIS CHANGES FOR YOU
If you write software for a living, the most useful immediate change is that coding agents on local hardware stop feeling slow. A planning step, three or four tool calls, an evaluation, and a retry all finish in well under five seconds on this stack. Workflows that used to require a frontier API at meaningful per-call cost now run on a desktop with no rate limits and no privacy concerns.
If you run a small or mid size business that already pays for AI, the calculation on whether to keep paying just changed. For most workloads (drafting emails, summarizing meetings, classifying support tickets, generating product descriptions, answering customer questions out of a knowledge base), a tuned 26B model is enough, and the hardware to run it pays for itself faster than your quarterly review cycle.
If you work in a regulated field (healthcare, law, defense, finance, government), the previous reason for not running AI locally was that the local options were too slow to be useful. That reason is now gone. A radiology pipeline that reads five hundred cases overnight on hardware that never connects to the public internet is a single workstation problem now. So is a law firm summarizing thousands of contracts in private. So is an intelligence analyst running classified documents through summarization without sending them to a cloud.
If you live in a country that does not want to depend on the foreign clouds that currently host most frontier inference, this is the week the answer to “can we host our own AI” went from “not really” to “yes, on hardware we can already buy.” That is a small sentence with very large geopolitical implications.
And if you are an ordinary person who just buys AI products, the relevant change is downstream and slower. Over the next twelve months, you should expect the price of AI features in the apps you use to fall, because the cost floor of providing those features just fell. Subscriptions that were a hundred dollars a month will be twenty. Free tiers that were stingy will become generous. Companies that built their entire business model on the assumption that AI inference would stay expensive are going to have a hard year.
THE HONEST CAVEATS
Six hundred tokens per second is a single stream number, measured one user at a time. If you put thirty users on the same card, each user gets a smaller share. The paper documents this directly: “the speedup ratio drops to about half its peak at high concurrency.” Multi tenant deployments still benefit, just less than the headline suggests.
The vLLM integration is officially marked as under active development. Some hardware configurations have not been validated yet. Production deployments will run into rough edges, especially around setting the right draft token count for a particular workload. Tuning is required.
The 600 number is the peak of a sweep on a specific prompt distribution. Real workloads land between roughly four hundred and six hundred tokens per second (which is still absolutely mind bending for a single 5090), depending on how predictable the output is.
WHERE THIS IS GOING
The history of every major AI cost shift looks the same. A capability that needed a data center moves to a workstation. A capability that needed a workstation moves to a laptop. Each transition happens not in a smooth curve but in a jump, when advances in models, hardware, and decoding algorithms compound in the same window.
The jump that just happened was three of those advances arriving at once. Gemma’s mixture of experts architecture, which holds a large model in memory while running at small model speed. Blackwell’s hardware-native 4-bit floating point, which doubles throughput on the same silicon. And DFlash’s parallel drafting, which removes the last sequential step in the decoder. Each one alone would have been incremental. Together, on a card you can buy at retail, they produced a result that would have required a data center six months ago.
The local AI conversation is about to stop being a hobbyist conversation and start being a procurement conversation. The procurement conversation will make several very large incumbent business models look fragile inside twelve months.
If you build software, your roadmap got more interesting this week. If you do not, somebody in your market does, and they are going to ship before the press catches up…
What do you think?
God-Willing, see you at the next letter
GRACE & PEACE









