AI Hallucinations in Accounting and Audit

AI is moving fast inside accounting and audit. Teams are using it for document extraction, transaction reconciliation, substantive testing, and analytical review. The productivity case is real. But alongside the gains, a specific and serious risk has surfaced: AI hallucinations.

Unlike most software bugs, hallucinations don’t announce themselves with an error or a crash. They look like accurate outputs. In a profession where a single figure can be material, and where every conclusion needs to hold up under inspection, that’s a problem teams need to understand before it surfaces in the wrong place.

This article defines AI hallucinations, explains why they carry higher stakes in accounting and audit than in most other contexts, and lays out what auditable AI actually requires.

What Is an AI Hallucination?

An AI hallucination is an output from a large language model (LLM, a type of AI trained on large amounts of text to generate human-like responses) that is factually incorrect, fabricated, or unsupported by the underlying data, presented with the same confidence as an accurate result.

Hallucinations aren’t bugs in the traditional sense. They’re a structural byproduct of how LLMs generate text. These models predict plausible next tokens based on patterns in training data. They don’t look up facts or verify information against a source of truth. When a model generates a citation, a figure, or a document summary, it’s making a prediction about what sounds right, not confirming what is right.
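To make that mechanism concrete, here is a deliberately tiny sketch in Python. It is not how any real LLM is implemented. The prompt, the candidate figures, and their probabilities are all invented for illustration. The point is only that the figure is chosen by likelihood, and nothing in the process consults a source document.

```python
import random

# Hypothetical next-token probabilities a model might have learned for the
# prompt "Total lease liability: $". All values are invented for illustration.
NEXT_TOKEN_PROBS = {
    "1,250,000": 0.40,
    "1,520,000": 0.35,
    "125,000": 0.25,
}


def generate_figure(rng: random.Random) -> str:
    """Pick a continuation by probability alone; no source document is checked."""
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs behave the same way
    for _ in range(3):
        print("Total lease liability: $" + generate_figure(rng))
```

Every candidate is well formatted and plausible, so whichever one comes out will read as a confident answer, whether or not it matches the lease.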

Common forms include:

Fabricated citations or references that don’t exist
Invented figures presented in a well-formatted table
Blended or misattributed facts pulled from multiple sources
Plausible-sounding summaries of content that the source document doesn’t actually contain

The concern is already showing up in practice. According to KPMG’s 2024 global report, “AI in Financial Reporting and Audit: Navigating the New Era,” 21% of companies using AI in financial reporting cite hallucinations as a significant concern, a figure that climbs as organizations move from general AI into generative AI applications.

Why Hallucinations Hit Differently in Accounting and Audit

Most industries have some tolerance for imprecision in AI output. Accounting and audit don’t. The professional and regulatory standards that govern this work create three specific risk categories that don’t apply in most other domains.

Regulatory exposure

Audit workpapers are subject to PCAOB and SEC inspection. AI-generated content without traceable sourcing can’t satisfy evidentiary standards. The PCAOB’s 2025 Inspection Priorities explicitly flag the increased use of technology, including generative AI, as an inspection focus area, and they make clear that inspectors will be alert to how firms document and control AI use in audits. Auditors need to be prepared to show how any AI-generated conclusion was derived and validated.

Evidentiary integrity

In auditing, evidence has a specific legal and professional meaning. Under PCAOB AS 1105 (Audit Evidence), audit evidence needs to be sufficient and appropriate: there has to be enough of it, and it has to be relevant and reliable for the conclusion it supports. A hallucinated document summary doesn’t meet that standard. The CAQ’s April 2024 publication, “Auditing in the Age of Generative AI,” is direct about this: AI use in financial reporting introduces new risks that auditors need to identify, assess, and respond to with the same rigor they apply elsewhere in the audit.

Materiality math

A hallucinated figure that seems small in absolute terms can be material relative to an account balance or disclosure threshold. The accuracy standard in accounting is higher than in nearly any other AI use case, and the consequences of getting it wrong compound through every downstream workflow that relies on the original extraction.
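A short worked example shows how the same error can land on either side of the line. This is a minimal sketch, not a materiality methodology. The function name `is_material`, the 5% threshold, and every amount below are assumptions chosen for illustration. Actual materiality is a matter of professional judgment.

```python
from decimal import Decimal


def is_material(reported: Decimal, actual: Decimal, benchmark: Decimal,
                threshold_pct: Decimal) -> bool:
    """Return True if the misstatement exceeds threshold_pct of the benchmark."""
    misstatement = abs(reported - actual)
    return misstatement > benchmark * threshold_pct / Decimal(100)


# Hypothetical extraction error: 48,000 reported where the document says 39,000.
reported = Decimal("48000")
actual = Decimal("39000")
threshold = Decimal("5")  # assumed 5% threshold, for illustration only

# Against a 150,000 account balance the limit is 7,500, so 9,000 is material.
print(is_material(reported, actual, Decimal("150000"), threshold))   # True

# Against a 4,000,000 balance the limit is 200,000, so the same 9,000 is not.
print(is_material(reported, actual, Decimal("4000000"), threshold))  # False
```

The error is 9,000 in both cases. Only the benchmark changes, which is why a figure can’t be waved through on its absolute size.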

There’s also a fourth factor worth naming: the plausibility trap.

AI outputs in accounting contexts look authoritative. Formatted numbers, structured summaries, clean tables. That polished presentation makes hallucinations significantly harder to catch than an obvious error would be. The output passes an initial review, which is exactly when it becomes dangerous.

Where AI Hallucinations Show Up in Audit Workflows

Hallucination risk doesn’t distribute evenly across audit work. It concentrates where AI adoption tends to be most attractive: high-volume, document-heavy workflows where the sheer amount of source material makes manual review slow and expensive.

Document extraction. AI pulling figures from leases, invoices, and contracts faces a specific challenge. Numerical hallucinations are the hardest to catch because they look plausible and don’t trigger formatting errors. A lease commencement date or payment amount that’s off by a single digit won’t stand out in a well-structured table.
Audit sampling and transaction testing. AI-assisted sample selection and exception identification carry significant downstream risk. A hallucinated population or an incorrect extraction at this stage contaminates everything that follows. Errors introduced early in the testing cycle tend to propagate.
Substantive procedures. AI summarizing or analyzing supporting documents can blend real content with fabricated details in ways that survive an initial review. The summary looks complete. The risk is in what it omitted or invented.
Revenue testing. Variable consideration, contract modifications, and performance obligation analysis all require precise extraction at the source-document level against specific revenue recognition standards. This is a high-hallucination-risk workflow because the underlying contracts are complex and the standards are exacting.
Analytical procedures and narrative. AI-generated trend commentary that references figures not present in the underlying data is particularly hard to catch in review. It fits the surrounding context well enough to read as reasonable, especially when a reviewer is moving quickly under deadline pressure.

The pattern across all of these is consistent: hallucination risk scales with document complexity and volume, exactly the conditions that make AI adoption most attractive for audit teams.
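One of these failure modes, narrative that cites figures missing from the underlying data, lends itself to a simple automated cross-check. The sketch below assumes the source figures are already available as a set of values. The function names, the regular expression, and the sample data are all hypothetical. It is also crude: it treats every numeric token as a figure, so years and percentages would be flagged too.

```python
import re
from decimal import Decimal

# Matches plain or comma-grouped numbers with an optional decimal part.
NUMBER_PATTERN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_figures(text: str) -> set[Decimal]:
    """Collect every numeric token in the text as a Decimal."""
    return {Decimal(m.replace(",", "")) for m in NUMBER_PATTERN.findall(text)}


def unsupported_figures(narrative: str, source_values: set[Decimal]) -> set[Decimal]:
    """Return figures the narrative cites that are absent from the source data."""
    return extract_figures(narrative) - source_values


# Hypothetical source data and AI-generated commentary.
SOURCE_VALUES = {Decimal("1250000"), Decimal("980000")}
NARRATIVE = ("Revenue rose to 1,250,000 from 980,000, "
             "driven by a 310,000 one-time contract.")

print(unsupported_figures(NARRATIVE, SOURCE_VALUES))  # {Decimal('310000')}
```

A check like this doesn’t prove the commentary is right. It only routes unsupported figures to a human reviewer instead of relying on someone to spot them under deadline pressure.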

The Problem with General-Purpose AI in Accounting

General-purpose LLMs are trained on broad web data. Accounting workflows require precision at the source-document level against specific professional standards. That gap creates real risk when general AI tools are deployed in audit environments without additional controls.

One compounding issue is overreliance. When AI output looks polished and well-formatted, human review tends to become less rigorous. The output looks done. That’s when errors get through.

The guidance that exists on managing hallucinations is mostly written for developers building models, not for accounting leaders evaluating and deploying tools.

Advice about adjusting training data or model parameters doesn’t map to the decisions a controller or audit manager actually makes. What accounting teams need is a clear set of criteria for assessing whether any AI tool is safe to use in document-heavy, standards-governed workflows.

Generic advice also misses a core structural point. Hallucination rate is only part of the problem. The deeper issue is the absence of architecture that makes outputs reviewable and defensible. Even an output that happens to be accurate can’t be used as audit evidence if the team can’t trace where it came from.

What Auditable AI Requires and How to Evaluate Any Tool Against It

Four requirements need to be met for AI to be trustworthy in an accounting or audit context. Teams can use these as evaluation criteria when assessing any AI tool or platform.

The question accounting teams actually need to answer isn’t “does this AI tool reduce manual work.” It’s “can we stand behind the output.” These four requirements define what standing behind it actually takes.

Source traceability. Every output needs to link back to a specific document, field, or data point. If a team can’t follow the chain from conclusion to source, the output can’t be used as evidence. Traceability is what makes AI-generated output reviewable and defensible, not just in normal workflow, but during inspection or client review.
Explainability. The model needs to show its reasoning, not just its answer. Black-box AI produces conclusions with no audit trail. That’s not usable in a profession where reviewers, inspectors, and clients need to understand the basis for every conclusion drawn.
Human-in-the-loop design. AI should propose, assist, and flag. Humans confirm, override, and sign off. Any workflow where AI makes final determinations without human review creates compliance exposure. The CAQ’s “Auditing in the Age of Generative AI” is clear that auditors remain responsible for professional judgment. AI doesn’t transfer that responsibility. It needs to support it.
Domain-specific grounding. The model needs to operate within verified accounting data and recognized standards, not general internet knowledge. This is structurally different from a general-purpose chatbot. Domain grounding reduces hallucination risk because the model is constrained to a defined set of inputs and outputs, rather than drawing on the full breadth of training data that may have nothing to do with the task at hand.
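The first three requirements can be pictured as constraints on the data an AI tool hands back. The sketch below is a hypothetical record structure, not a description of any vendor’s implementation. The class names, fields, and sample values are invented for illustration. Domain-specific grounding is a property of how the model is built and is not captured here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRef:
    """Where a value came from: the traceability link."""
    document_id: str
    page: int
    field_name: str


@dataclass
class ExtractedValue:
    label: str
    value: str
    source: Optional[SourceRef] = None  # source traceability
    rationale: str = ""                 # explainability
    approved_by: Optional[str] = None   # human sign-off

    def is_traceable(self) -> bool:
        return self.source is not None

    def approve(self, reviewer: str) -> None:
        """Record human sign-off; refuse if there is nothing to trace back to."""
        if not self.is_traceable():
            raise ValueError("cannot approve a value with no source reference")
        self.approved_by = reviewer

    def usable_as_evidence(self) -> bool:
        """Usable only when the value is both traceable and human-approved."""
        return self.is_traceable() and self.approved_by is not None


if __name__ == "__main__":
    payment = ExtractedValue(
        label="Monthly lease payment",
        value="39,000",
        source=SourceRef("lease-001.pdf", page=4, field_name="base_rent"),
        rationale="Taken from the base rent clause.",
    )
    print(payment.usable_as_evidence())  # False: the AI has only proposed it
    payment.approve("reviewer@example.com")
    print(payment.usable_as_evidence())  # True: traceable and signed off
```

The design choice worth noting is that approval fails outright when the source link is missing, so an untraceable value can never be signed off by accident.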

General-purpose LLMs fall short on all four of these criteria when deployed in accounting workflows without additional architecture. That’s a description of what those models were built for, not a criticism of the technology itself.

Trullion’s Approach to Auditable AI

Auditable AI at Trullion is a design principle built into how the platform works, not a feature layered on top of it.

Data Extract and Data Match connect source documents to validated, traceable outputs, so every extracted figure links back to its origin in the source document. That chain from source to conclusion is built into the workflow from the start, which means teams don’t have to reconstruct it later when a reviewer, inspector, or client asks where a number came from.

Knowledge Room adds the standards layer that makes domain-specific grounding possible, storing accounting standards, regulatory frameworks, and firm methodology in one AI-accessible place so Trulli agents draw on verified guidance rather than general training data.

Human review is central to that design. AI extracts and proposes, and auditors confirm and approve. Professional judgment doesn’t get removed from the process. It gets applied where it matters most, with better information underneath it.

The goal is to give accounting and audit teams AI they can actually stand behind, not AI that looks productive and creates exposure downstream.

The Question Every Team Should Be Asking

Before deploying any AI in an accounting or audit workflow, one question cuts through the rest: can we stand behind this output?

Not “does it save time,” though that matters. Not “does it look right at first glance,” which is the plausibility trap. The question is whether every conclusion the tool produces is traceable, reviewable, and defensible when a regulator, inspector, or client asks where it came from.

Hallucinations aren’t a reason to avoid AI in accounting and audit. They’re a reason to set that bar before you adopt, and to hold every tool to it.

Schedule a demo to see how Trullion’s auditable AI works in practice.

Author

Katie Cavanaugh
