Over the last 18 months I have been in a lot of conversations about AI PCs — with enterprises evaluating fleet upgrades, with device vendors making the case for their hardware, and with IT leaders trying to figure out what their employees actually need.

The consistent signal: everybody agrees AI PCs matter. Purchases are happening — Windows 10 end-of-support has accelerated that — but the buying is cautious and uneven. Two reasons come up every time: the AI landscape is moving fast enough that enterprises are not confident their requirements will look the same in 12 months, and they do not have a reliable way to evaluate what they are being sold. Most are not even sure what the right criteria should be.

That second problem is what this post is about.


“AI PC” Is Doing Too Much Work as a Category

The term means completely different things depending on who you are buying for. Most enterprises are buying for several different people at once.

For knowledge workers, the AI PC is an efficiency question: does the device stay out of the way? Does the NPU keep the CPU free during a Teams call? Does battery life hold when all the ambient AI features are running? They are not running local models. The hardware’s job is to handle that background AI load without degrading the experience.

For developers building AI applications, it is a capability question. The hardware question is whether the machine can actually serve inference at the scale their application demands: low time to first token, enough memory to load the model cleanly, enough capacity to handle concurrent requests without stalling. A multi-agent pipeline doing RAG across several simultaneous queries, or an agentic workflow fanning out to multiple sub-calls, will expose headroom limits that a single sequential request never will.

Then there is a group in between: teams with sensitive data who cannot send it to a cloud API. A legal team that wants document summarization without shipping contracts to an external endpoint. A finance team running local search over internal reports. They need local inference capability but they are not developers. They are knowledge workers with a data privacy constraint.

Three groups. Three different hardware profiles. One procurement decision.

And real users do not fit cleanly into any one of them. The analyst who lives in Teams and occasionally needs to run a heavy workload is a real person in every enterprise. When the evaluation framework is built around clean personas, it misses the people who actually use the hardware.


The Stakes Make This Hard to Get Wrong

Enterprise hardware is not bought annually. A fleet decision today locks the organization into that hardware for at least three years, sometimes four or five.

The AI-enabled premium — the difference between a standard enterprise laptop and a Copilot+ or higher-spec machine — typically runs $300 to $800 per device. That delta often bundles higher RAM, storage, and display specs alongside the NPU, so not all of it is purely AI-driven. Across a 10,000-seat fleet, that is $3M to $8M in incremental spend before support contracts and deployment costs, locked in for three years.

Multiple user profiles with different needs, a commitment of that size, and a multi-year lock-in — that is exactly the kind of decision that makes intelligent people hesitate. Not because they do not understand the technology. Because they do not have the right instrument to evaluate it.


The Vendor Metrics Are Real. They Are Just Answering a Different Question.

I want to be precise about this because it is easy to make it sound like vendor criticism. That is not the point.

When I say vendor here, I am mostly talking about device vendors — Dell, HP, Lenovo. These are the conversations I have actually been in. A device vendor walks into an enterprise with a comparison deck: here is our Copilot+ configuration, here is what the NPU delivers, here is how it stacks up against the previous generation. Those NPU claims — the TOPS ratings — originate with the chip makers: Qualcomm, Intel, AMD. The device OEMs repackage them into their sales narrative. The numbers are accurate at every step. The benchmarks ran. Everyone in the chain is doing exactly what they should do: making the case for their hardware in a competitive market.

The problem is structural. Those benchmarks are designed to differentiate a product. They are not designed to answer whether a specific organization’s specific workload mix will benefit from the premium it is being asked to pay. The vendor knows their device. They do not know your workload.

In practice, the conversation always hit the same wall. The vendor deck had impressive numbers. The IT leader had no way to connect those numbers to the actual question: will this hardware deliver measurably better outcomes for my people, at this price, for the next three years?

That question does not have a vendor-supplied answer. It requires a different kind of measurement entirely.


What I Started Building

I came at this originally from the developer side — personally interested in evaluating devices for running LLMs locally, often experimenting with different quantization levels. I wanted to know whether a given machine could actually handle a local inference workload. Not just load a model, but serve it under the kind of concurrent load an agentic application generates. The tool I built measures that: parallel request throughput, latency distribution, where memory pressure appears, how performance holds under sustained load.

While that tool answers the developer question well, I know enterprises need to answer a different question: whether the AI-enabled premium would pay off for the thousands of knowledge workers who would never run a local model, but whose daily experience would depend on whether the NPU actually delivered on its promise during a Teams call.

Tokens per second does not tell you anything about CPU headroom during a video call. CPU headroom during a video call does not tell you anything about how an agent pipeline behaves when three requests arrive simultaneously. The tools I have been building are designed around that reality — one for each use case, not a one-size-fits-all benchmark.

The developer-side tool is production-grade and running today on Apple Silicon. The knowledge worker fleet evaluation tool is designed and in progress.

The next post goes into the developer-side tool in detail — the design decisions, what the implementation revealed, and where the obvious approach turned out to measure the wrong thing.


I focus on building production-grade AI systems, from agentic pipelines to inference infrastructure. If you are working through an AI PC evaluation as an IT leader or a developer benchmarking hardware for local inference, I am happy to share what I have built and the criteria I have settled on. Reach out on LinkedIn.