How to measure AI coding productivity without fooling yourself

The easiest AI metrics to collect are the least useful. Sessions run, tokens spent, lines generated, and tools installed all measure activity. None of them tells you whether the work was correct, trusted, or worth doing. Productivity has to be measured at the outcome, not the activity.

Avorelo. Topics: Measurement, AI waste, Outcomes. 4 min read

Why activity metrics mislead

Activity metrics are attractive because they are easy to collect and they always go up. More AI sessions, more tokens, more generated code, more installed tools. Each number looks like progress. None of them is evidence that anything useful happened.

The problem is that activity and value can move in opposite directions. An agent that runs constantly on low-value tasks produces a lot of activity and very little outcome. An agent that runs selectively, with the right context, on the right tasks, produces less activity and more outcome. If you reward activity, you optimize for the wrong thing.

Figure: From activity to outcome to confidence. Activity metrics (sessions, tokens, lines), then validated outcomes (tests pass, evidence attached), then confidence labels (confirmed, estimated, unknown).

What is worth measuring instead

Useful AI productivity measurement starts from a different question: what did the work actually produce, and can we trust it? That points to a small set of outcome-oriented signals.

  • Validated outcomes. Runs that ended with evidence: tests passing, a diff that was reviewed, a check that ran. Not runs that merely completed.
  • Rework avoided. Work that did not have to be redone because it was correct and validated the first time.
  • Review load reduced. Smaller, scoped, evidence-backed changes that took less human review time than they would have otherwise.
  • Repeated setup avoided. Sessions that started from preserved context instead of re-explaining the project from zero.
  • Proof coverage. The share of completed work that carries a receipt: what changed, what was validated, what remains uncertain.
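As a rough illustration of the first and last signals in the list, the sketch below counts validated runs and computes proof coverage over a handful of runs. The `Run` record and its field names are hypothetical, chosen only for this example; they are not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One completed AI coding run (hypothetical record shape)."""
    run_id: str
    tests_passed: bool   # evidence: a test suite ran and passed
    diff_reviewed: bool  # evidence: a human reviewed the diff
    has_receipt: bool    # a receipt: what changed, what was validated, what is uncertain


def is_validated(run: Run) -> bool:
    """A run counts as validated only if it ended with some evidence."""
    return run.tests_passed or run.diff_reviewed


def proof_coverage(runs: list[Run]) -> float:
    """Share of completed runs that carry a receipt (0.0 when there are no runs)."""
    if not runs:
        return 0.0
    return sum(r.has_receipt for r in runs) / len(runs)


# Four completed runs: an activity count would report 4, the outcome count is 2.
runs = [
    Run("a", tests_passed=True, diff_reviewed=True, has_receipt=True),
    Run("b", tests_passed=False, diff_reviewed=False, has_receipt=False),
    Run("c", tests_passed=True, diff_reviewed=False, has_receipt=True),
    Run("d", tests_passed=False, diff_reviewed=False, has_receipt=False),
]

validated_count = sum(is_validated(r) for r in runs)
coverage = proof_coverage(runs)
print(f"{validated_count} of {len(runs)} runs validated, proof coverage {coverage:.0%}")
```

Whether a reviewed diff alone should count as validation is a policy choice; the `or` in `is_validated` is an assumption made here for brevity.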

Cost per validated outcome, not cost per token

Token cost is a proxy metric. It tells you what you spent, not what you got. A cheap run that produced wrong output is more expensive than a costly run that produced trusted output, once you count the rework.

The more honest measure is cost per validated outcome: how much did it take, in tokens and human time, to produce a result someone could actually trust and use? Teams that track this tend to discover that some of their most expensive sessions by token spend were the cheapest per trusted result, and that some of their cheapest sessions were quietly expensive once rework was counted.
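One way to make that comparison concrete is sketched below. The `Session` fields, the dollar figures, and the hourly rate are all made-up assumptions for illustration; the only idea taken from the text is that token spend and human time are added together and divided by the number of validated results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Hypothetical session record; field names are illustrative."""
    token_cost: float     # dollars spent on tokens
    human_minutes: float  # human review plus rework time
    validated: bool       # ended with evidence someone could trust


def cost_per_validated_outcome(sessions, hourly_rate):
    """Total cost of all sessions divided by the number of validated ones.

    Sessions that were never validated still add to the cost.
    Returns None when nothing was validated.
    """
    total = sum(s.token_cost + s.human_minutes / 60 * hourly_rate for s in sessions)
    validated = sum(s.validated for s in sessions)
    return total / validated if validated else None


HOURLY_RATE = 100.0  # assumed loaded cost of an engineer hour, in dollars

# Cheap in tokens, but 90 minutes of rework before anyone trusted the result.
cheap_tokens = [Session(token_cost=0.50, human_minutes=90, validated=True)]
# Eight times the token spend, but only a 15 minute review.
costly_tokens = [Session(token_cost=4.00, human_minutes=15, validated=True)]

cheap_result = cost_per_validated_outcome(cheap_tokens, HOURLY_RATE)
costly_result = cost_per_validated_outcome(costly_tokens, HOURLY_RATE)
print(cheap_result, costly_result)
```

Under these assumed numbers the session with the lower token bill ends up several times more expensive per trusted result, because human time dominates the total.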

Confidence labels keep measurement honest

An outcome metric is only as good as the confidence attached to it. A result marked "done" with no evidence is not the same as a result marked "validated" with a passing test and a diff. Productivity numbers that do not carry confidence labels invite self-deception, because everything looks finished.

Attaching confidence to each outcome (confirmed, estimated, or unknown) keeps the measurement grounded. It also makes the number defensible: a leader can see not just how much was produced, but how much of it was actually trusted.

A metric you can game is a metric you will game. Activity counts are trivial to inflate. Validated outcomes with confidence labels are much harder to fake, which is exactly what makes them useful.
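A minimal sketch of reporting outcomes together with their confidence labels follows. The three label names come from the text; the one-line meanings in the comments and the `confidence_breakdown` helper are assumptions made for this example.

```python
from collections import Counter
from enum import Enum


class Confidence(Enum):
    CONFIRMED = "confirmed"  # assumed meaning: backed by evidence such as a passing test and a diff
    ESTIMATED = "estimated"  # assumed meaning: inferred rather than directly verified
    UNKNOWN = "unknown"      # assumed meaning: marked done with no evidence attached


def confidence_breakdown(labels):
    """Return the share of outcomes under each label, including labels with a zero share."""
    counts = Counter(labels)
    total = len(labels)
    return {c.value: (counts[c] / total if total else 0.0) for c in Confidence}


# Four outcomes that would all look "done" on an activity dashboard.
labels = [
    Confidence.CONFIRMED,
    Confidence.CONFIRMED,
    Confidence.ESTIMATED,
    Confidence.UNKNOWN,
]

breakdown = confidence_breakdown(labels)
print(breakdown)
```

Reporting the full breakdown, rather than a single total, is what lets a reader see how much of the produced work was actually trusted.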

Measuring without surveilling people

Outcome measurement does not require tracking individuals. The useful signals aggregate at the level of repos, workflows, tools, model classes, and capability and risk types. None of those requires ranking developers against each other.

This matters for adoption as much as for ethics. The moment a measurement system feels like individual surveillance, people route around it, and the data stops reflecting reality. Aggregate, privacy-safe measurement produces numbers people trust enough to act on.
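One way to enforce this in code is to allow rollups only along the aggregate dimensions listed above and to refuse everything else. The sketch below assumes a hypothetical event shape with no user or author field at all; the dimension names are illustrative spellings of the ones in the text.

```python
from collections import defaultdict

# Aggregate dimensions named in the text. Anything else, including a person, is rejected.
ALLOWED_DIMENSIONS = {"repo", "workflow", "tool", "model_class", "capability", "risk_type"}


def rollup(events, dimension):
    """Count validated and total outcomes per value of one aggregate dimension."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError(f"refusing to aggregate by {dimension!r}")
    totals = defaultdict(lambda: {"validated": 0, "total": 0})
    for event in events:
        bucket = totals[event[dimension]]
        bucket["total"] += 1
        bucket["validated"] += int(event["validated"])
    return dict(totals)


# Hypothetical events. There is deliberately no developer identifier to group by.
events = [
    {"repo": "api", "workflow": "bugfix", "validated": True},
    {"repo": "api", "workflow": "refactor", "validated": False},
    {"repo": "web", "workflow": "bugfix", "validated": True},
]

by_repo = rollup(events, "repo")
print(by_repo)
```

Leaving the identifying field out of the record entirely is a stronger guarantee than filtering it at query time, since there is nothing to rank by even if someone asks.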

How Avorelo helps

Avorelo measures AI coding work at the outcome, not the activity. Every clean run produces a receipt that records what changed, what was validated, and what is still uncertain. Those receipts roll up into outcome-oriented signals: validated outcomes, rework avoided, review load reduced, and proof coverage, each carrying a confidence label.

The rollups aggregate by repo, workflow, tool, and capability, never by individual contributor. The result is a productivity picture you can defend: not how busy the agents were, but how much of their work was actually trusted.

Measure outcomes, not activity.

Avorelo attaches evidence to AI coding work and rolls it up into validated outcomes with confidence labels. Local-first. No individual tracking.
