Evidence

Why AI security findings need evidence

AI tools can surface potential security issues quickly. But a finding is not the same as a proven problem. When findings lack evidence, developers spend time investigating claims that may or may not hold. The result is noise that competes with real problems for attention.

Avorelo | Topics: Evidence, Security findings, Proof | 4 min read

Plausible is not the same as proven

AI tools are good at pattern recognition. They can look at code and identify patterns that match known vulnerability classes: potential injection points, missing input validation, unsafe deserialization, dependency issues. Many of these findings are worth investigating.

But AI tools are also prone to producing findings that are syntactically plausible but contextually wrong: a flag for SQL injection in code that never reaches a database, a missing authorization check in a function that is only called from already-authenticated paths, or an outdated dependency with no published exploit against the version in use.

The problem is not that AI tools produce false findings. It is that the findings often lack the evidence that would let a developer quickly assess whether the finding applies in context. Without evidence, every finding requires manual investigation. With enough false findings, developers learn to distrust the signal entirely.

What counts as evidence for a security finding

A security finding with evidence gives the reviewer enough information to act without doing a full investigation from scratch. That does not mean every finding needs a complete exploit chain. It means the finding should include enough to answer the first few questions a developer will ask.

File reference and line range. Where exactly in the code does this occur?
Reproduction path. How would an attacker or a bad input reach this code?
Impact scope. What could go wrong if this were exploited?
Confidence indicator. Is this a confirmed issue, a pattern match, or a low-confidence observation?

Without these, a developer reading the finding has to answer all four questions themselves before deciding whether to act. The investigation work has simply moved from the tool to the person.
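As a minimal sketch, the four evidence elements can be modeled as optional fields on a finding record, with a helper that reports which questions the reviewer would still have to answer alone. The `Finding` class, its field names, and `open_questions` are hypothetical names chosen for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    """Hypothetical record for one AI-reported security finding."""
    title: str
    file_ref: Optional[str] = None      # file and line range, e.g. "app/db.py:40-52"
    reproduction: Optional[str] = None  # how an attacker or bad input reaches the code
    impact: Optional[str] = None        # what could go wrong if exploited
    confidence: Optional[str] = None    # e.g. "confirmed", "pattern match", "low"


def open_questions(finding: Finding) -> List[str]:
    """Return the questions left unanswered by the finding's evidence."""
    checks = [
        (finding.file_ref, "Where exactly in the code does this occur?"),
        (finding.reproduction, "How would an attacker or a bad input reach this code?"),
        (finding.impact, "What could go wrong if this were exploited?"),
        (finding.confidence, "Is this confirmed, a pattern match, or low confidence?"),
    ]
    # A missing or empty field means the reviewer must answer that question.
    return [question for value, question in checks if not value]
```

A finding created with only a title, such as `Finding(title="Possible SQL injection")`, leaves all four questions open, which is the transfer of work described above.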

Confidence labels matter

Not all findings deserve the same response. A confirmed issue with reproduction steps should be treated urgently. A low-confidence pattern match should be queued for review when convenient. A high-confidence architectural concern with no current exploit should be logged and tracked.
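The mapping from finding type to response can be written down explicitly so that it is not left to each reviewer's habits. This is a sketch under assumed names: the `Kind` enum and the response strings are illustrative labels for the three cases in the paragraph above.

```python
from enum import Enum


class Kind(Enum):
    """Hypothetical finding types, taken from the three cases above."""
    CONFIRMED = "confirmed"          # confirmed issue with reproduction steps
    ARCHITECTURAL = "architectural"  # high confidence, no current exploit
    PATTERN_MATCH = "pattern match"  # low-confidence pattern match


# Illustrative response labels; a real team would substitute its own process.
RESPONSES = {
    Kind.CONFIRMED: "treat urgently",
    Kind.ARCHITECTURAL: "log and track",
    Kind.PATTERN_MATCH: "queue for review when convenient",
}


def respond_to(kind: Kind) -> str:
    """Look up the response that matches the finding type."""
    return RESPONSES[kind]
```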

When all findings are presented at the same level of urgency, developers calibrate their response to their experience with the tool rather than to the severity of the issue. If the tool has produced many false positives before, the developer discounts new findings. If the developer trusts the tool too much, they over-respond to low-confidence observations.

Finding quality levels

Confirmed: file reference + reproduction + impact scope + confidence: high

Probable: file reference + likely path + confidence: estimated

Pattern match: no path + no scope + confidence: low
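The three levels can be derived from which pieces of evidence are present. The sketch below is one possible reading of the list above: `Evidence`, `path_reproduced`, and `quality_level` are hypothetical names, and treating any finding without a path as a pattern match is an assumption, since the list does not say how to handle other combinations.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Evidence:
    """Hypothetical evidence attached to a finding."""
    file_ref: Optional[str] = None   # file and line range
    path: Optional[str] = None       # how input reaches the code
    path_reproduced: bool = False    # True only if the path was actually reproduced
    impact: Optional[str] = None     # impact scope


def quality_level(ev: Evidence) -> Tuple[str, str]:
    """Return (level, confidence) based on which evidence is present."""
    if ev.file_ref and ev.path and ev.path_reproduced and ev.impact:
        return ("confirmed", "high")
    if ev.file_ref and ev.path:
        # A likely path that has not been reproduced.
        return ("probable", "estimated")
    # Assumption: everything else is treated as a bare pattern match.
    return ("pattern match", "low")
```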

Why teams stop trusting AI security tools

Security tooling suffers from a specific pattern of trust erosion. A tool surfaces many findings, of which a few are important and most are noise. Developers learn, over time, which categories to take seriously and which to ignore. This learned calibration is fragile: when a genuinely new issue appears in a category the developer has learned to discount, it gets missed.

Evidence-backed findings reverse this. When a finding includes enough context to assess quickly, developers can triage from the finding itself rather than through manual investigation. Low-quality findings get closed quickly. High-quality findings get fast attention. The tool builds trust by making the distinction visible rather than hiding it.

Filtering before review

The most effective place to address the evidence problem is before findings reach the review queue. If a finding cannot pass a basic evidence check, it should not be presented as actionable. It should be logged at a lower confidence level and reviewed in batches rather than treated as an individual urgent issue.

This filtering has to be based on actual evidence criteria, not just confidence scores from the model. A model can be confident about wrong things. The evidence criteria should be structural: does the finding include a file reference? Does it include a path to the issue? Does it include a confidence label that was assigned based on actual analysis, not just output formatting?
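A structural check of this kind might look like the following sketch. The field names (`file_ref`, `path`, `confidence_basis`, `model_score`) are assumptions made for illustration; `confidence_basis` stands in for a record of what analysis backs the confidence label. The model's own score is deliberately never consulted.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical field names for the three structural criteria.
REQUIRED_FIELDS = ("file_ref", "path", "confidence_basis")


def passes_evidence_check(finding: Dict[str, Any]) -> bool:
    """True only if every required evidence field is present and non-empty."""
    return all(str(finding.get(name) or "").strip() for name in REQUIRED_FIELDS)


def split_queue(
    findings: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate actionable findings from those held for batch review."""
    actionable, batched = [], []
    for finding in findings:
        # The model_score field, if any, is ignored on purpose.
        target = actionable if passes_evidence_check(finding) else batched
        target.append(finding)
    return actionable, batched
```

With this gate, a finding that carries a model score of 0.99 but no file reference still lands in the batch list, while a lower-scored finding with all three fields reaches the review queue.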

How Avorelo helps

Avorelo applies evidence requirements to AI outputs before they become review items. Findings without file references, reproduction context, or confidence labels are classified as low-confidence and handled separately from confirmed issues. This keeps the review queue focused on findings that have enough evidence to act on.

It also attaches proof receipts to completed work: what changed, what was validated, what evidence exists, and what remains uncertain. This makes future reviews faster because the relevant context is attached to the work rather than reconstructed from logs.
