heygrc
Engineeringthe heygrc team

False positives are the only metric that matters for a compliance bot

A reviewer that cries wolf gets muted, and a muted reviewer catches nothing. For a compliance tool, precision is the whole game.

There is a failure mode that kills automated reviewers, and it is not missing a real issue. It is flagging a fake one. The first time a tool tells an engineer their change is non-compliant and it is wrong, the engineer loses a few minutes. The second time, they start to suspect the tool. By the fifth, they dismiss its comments on sight, including the one that was right.

For a compliance reviewer, this is the entire game. A bot that is muted catches nothing, so the only metric that ultimately matters is how rarely it is wrong.

One wrong flag is expensive out of proportion to its size

A false positive does not cost you one ignored comment. It costs you the credibility of every comment after it. Trust in an automated reviewer is not additive, it is a reputation, and a reputation is destroyed faster than it is built. A reviewer that is wrong even a small fraction of the time is not 'pretty good'; it is a steady supply of comments teaching your team to stop reading.

This is why a compliance tool cannot treat recall as the headline number. Catching more is worthless if the catching is noisy enough that people tune it out.

Compliance is harder to get right than bugs

A bug flag is checkable: the reviewer can often see whether the code is wrong. A compliance flag asks the reviewer to trust a second thing they cannot see directly, the mapping from this change to a control in a framework. 'This looks non-compliant' is unverifiable and therefore worthless; it is the kind of vague output that trains people to ignore the tool.

The only way to make a compliance flag verifiable is to cite the exact control, at the right grain, so the engineer can check the claim. A finding that says 'ISO 27001:2022 A.8.15' can be looked up and confirmed. A finding that says 'improves your compliance posture' cannot, and should never be shipped.

How we think about it

Two commitments fall out of this. First, every finding heygrc raises names the specific control clause, so it is checkable rather than a vibe. Second, we hold the bar at precision, how rarely a finding is wrong, rather than chasing a vanity recall figure, because a compliance reviewer engineers trust is worth more than one that flags everything. We would rather say less and be right than flag more and be tuned out.

A noisy compliance reviewer is worse than none, because none at least does not train your team to ignore the next warning. heygrc is built to be quiet, specific, and right, and it is in early access while we hold it to that bar.

evalsfalse-positivestrustcode-review