Introduction

Post-election audits can serve a variety of roles, including process monitoring, quality improvement,
fraud deterrence, and bolstering public confidence.
I am most interested in using auditing to check whether election outcomes are
correct—to limit the risk that an incorrect preliminary outcome is certified.

Financial auditors distinguish between compliance audits and materiality audits.
Compliance audits determine how well policies and procedures have been followed.
Materiality audits determine whether errors amount to much money.
In election auditing, compliance audits are important for ensuring the accuracy of elections,
but I know of no jurisdiction that routinely performs them.
The appropriate notion of materiality for election audits is whether
errors—whatever their source—caused the wrong candidate to appear to win.
There is no jurisdiction that performs routine materiality audits, but
election integrity advocates, legislators, elections officials, and scientists are
now expressing interest, and there have been six tests of materiality audits
in California.
In a materiality audit, it does not matter what caused the errors nor whether the errors
can be "explained."
What matters is whether they changed the electoral outcome.

Some audits, such as California's "1%" audit, don't quite fit in either category.
California's 1% audit is closer to manufacturing process monitoring than to a financial audit.
It is not a compliance audit because it does not determine whether correct procedures were followed.
Compliance audits should answer questions such as:
Was chain of custody maintained and documented properly?
Were voted ballots or election equipment ever taken home by a pollworker?
Were all ballots accounted for?
Were all security seals intact?
Was the required minimum number of people present whenever voted ballots were handled?

While California's 1% audit does measure the accuracy of the counts, the precision of the
accuracy measurement varies by contest, and generally is too low to determine whether
the reported outcome of a contest is in doubt—especially if the contest is small or
has a small margin.
Hence, the California 1% audit is not quite a materiality audit.
The rest of this document is about using audits to check electoral outcomes: materiality audits.

Auditing to Check Election Outcomes

The notion of a batch of ballots is crucial for post-election auditing.
In this document, a batch of ballots is a group of ballots or voter-verified paper records
(i) for which
subtotals of the votes are reported (or committed to) by the Vote Tabulation System,
and (ii) that can be retrieved and counted manually.
For instance, a batch might consist of the ballots cast in person in a precinct,
or a subset of the ballots cast in person in a precinct, if the vote subtotals for the subset are
reported and if there is a way to separate those ballots from the rest for hand counting.
Alternatively, a batch might consist of a "deck" of vote-by-mail (VBM) ballots
containing ballots from several precincts run through a scanner as a group.
In some auditing schemes, a single ballot is a batch; such single ballot auditing
methods require special equipment such as printer-scanners that can print a unique identifier on a
ballot while scanning the ballot.
This document focuses on auditing methods that can be "bolted on" to any voting system
that produces an audit trail.

Until 2007, most technical work on post-election auditing focused on the following question:
Suppose we are drawing a random sample of batches of ballots to audit and that the outcome
of the contest is wrong.
How many batches do we need to examine to ensure that there is a high chance that the sample
will reveal one or more counting errors?
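That question has a closed-form answer under simple assumptions. The sketch below assumes a simple random sample drawn without replacement and that a wrong outcome implies at least k of the N batches contain counting errors; both assumptions (and the function name) are mine, for illustration only.

```python
from math import comb

def detection_prob(N, k, n):
    """Chance that a simple random sample of n of N batches contains at
    least one of the k batches with counting errors (hypergeometric)."""
    if k <= 0:
        return 0.0
    if n > N - k:
        return 1.0  # the sample must include at least one erroneous batch
    # P(miss all k bad batches) = C(N-k, n) / C(N, n)
    return 1.0 - comb(N - k, n) / comb(N, n)
```

For example, with 500 batches of which 5 are erroneous, sampling 100 batches gives a substantially better chance of detection than sampling 50.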

Audits of voter-marked paper ballots almost always find errors at the rate of a modest
fraction of a percent, owing largely to the fact that a human tallier can interpret voter intent
better than an optical scanner can—especially because voters do not always
follow the instructions for marking ballots.
(If the margin is small, this background rate of "benign" error could have changed the
apparent outcome.)
Because it is easy to find errors, the auditing problem is not how large a sample to take to find error.
The problem is deciding whether to confirm the outcome in light of the errors the audit finds.

Hence, a more important question for election auditing is this:
Given the way the sample was drawn and the errors found in the sample, how
strong is the evidence that the outcome is correct?
If the evidence is weak, counting should continue, either until the evidence is strong or until
all the ballots have been counted by hand.
That approach makes it possible to limit the risk of certifying an election outcome that disagrees
with the outcome a full hand count would show.
The outcome that a full hand count would show is generally the legal touchstone: By definition, it
is the "correct" outcome.

Audit Prerequisites

Post-election audits require a number of things:

Something to check. There must be vote subtotals for auditable batches.
Those totals need to be published before the audit sample is selected.

Batch-Level Results

Election management systems and vote tabulation systems, which I shall refer to collectively as
EMSs, currently do not support reporting batch-level results in a machine-readable, structured format
useful for auditing.
Generally, they do store that information internally, and can print reports containing the
requisite information.
But editing or transcribing the batch-level results to design and conduct audits is time-consuming
and error prone.
Indeed, it was the largest burden in some of the risk-limiting audits we conducted in California in 2008.

Small batch size is crucial to the efficiency of audits.
EMSs should be able to store and report vote subtotals for arbitrarily small batches.
Moreover, state laws should avoid requiring large audit batches.
For instance, California's 1% audit requires ballots cast in a precinct and vote-by-mail (VBM) ballots
for that precinct to be audited together.
If the law permitted them to be audited separately, it would cut batch size in half, which would
increase audit efficiency markedly.
This issue is discussed further below.

There is a tradeoff between batch size and voter privacy, however.
If a group of voters can be matched to a group of ballots that record similar voting patterns,
one can determine confidently how those voters voted.
That does not depend only on the size of the reporting batches:
For instance, if a batch of 1,000 ballots records 100% for one candidate, we know
exactly how all 1,000 voters in that batch voted.
Indeed, if a contest has a unanimous winner, we know how everyone voted.
Conversely, if a batch of two ballots records 50% for one candidate and 50% for another,
we do not know how either of the two voters in that batch voted.
What matters for privacy is not just the batch size, it is also the variability of the votes
within the batch.

I am not aware of any quantitative rule that governs the size of reporting batches to maintain
voter privacy.
A principled rule might set a threshold on the extra amount of information the batches give about
how individuals voted.
For instance, batches might be kept large enough that the error in predicting an individual's vote
from the batch subtotals is at least 50% of the error in predicting an individual's vote from
the contest totals alone.
That amounts to aggregating batches so that the margin in a combined batch is not too
different from the overall margin.
A simpler rule that requires aggregating batches so that reported subtotals include at least 25
voters is more practical, although it will occasionally compromise voter privacy.
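The simpler 25-voter rule could be implemented by merging consecutive reporting batches. The sketch below is one hypothetical way to do it; the data layout (a list of per-batch ballot counts) and the fold-the-leftover convention are my assumptions, not a prescribed procedure.

```python
def aggregate_batches(ballot_counts, min_ballots=25):
    """Merge consecutive reporting batches until each aggregate holds at
    least min_ballots ballots.  Returns groups of original batch indices."""
    groups, current, total = [], [], 0
    for i, n in enumerate(ballot_counts):
        current.append(i)
        total += n
        if total >= min_ballots:
            groups.append(current)
            current, total = [], 0
    if current:  # fold a too-small leftover into the last group
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups
```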

Another "solution" to the privacy problem is to separate the issues of public reporting
and auditing.
Audit batches can be smaller than reporting batches, provided there is a mechanism to ensure that
the auditable subtotals are committed to indelibly before the audit begins.
This might involve putting the subtotals in escrow, sending them to the Secretary of State,
or signing them digitally and publishing them in encrypted form.

The Audit Trail

An audit can be no better than the audit trail it examines.
At the moment, that means voting systems need to have a voter-verified paper trail.
There is wide agreement that voter-marked paper ballots are the most
secure, reliable, transparent, and auditable technology currently available—and I concur.
As the California TTBR found, a DRE VVPAT is volatile—the process by which it is
printed does not produce as durable a record as ink on paper.
(Entire spools can be spoiled.)
And VVPAT printers tend to jam.
Moreover, it is much harder to determine whether VVPAT spools have been truncated or gone missing,
because the amount of paper used by a voter varies with the number of times the voter changes his or her
mind.

For any voting system, a check that the number of votes reported for a precinct is not larger than the
number of registered voters should be conducted for every contest in every election.
Indeed, given typically low voter turnout, precincts in which turnout substantially exceeds historical
values should get special scrutiny.

Physical security of the audit trail is obviously crucial.
The audit trail needs to be stored in a secure location.
Chain of custody needs to be tracked meticulously.
Seals and signatures should be checked and breaches must be reported to the SoS and the public.
Several people should be present whenever a seal is removed or replaced.
And so on.

With voter-marked paper ballots, there are simple measures that can help ensure that the
audit trail is complete and accurate.
For instance, there can be "ballot accounting" or "ballot reconciliation"
to check that the number of ballots sent to a precinct equals the number returned from the precinct
voted, spoiled, and unvoted.
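The arithmetic of ballot reconciliation is simple; here is a minimal sketch (the function name and return convention are mine).

```python
def ballot_accounting(sent, voted, spoiled, unvoted):
    """Check that the ballots sent to a precinct equal those returned
    voted, spoiled, and unvoted.  Returns (ok, discrepancy)."""
    returned = voted + spoiled + unvoted
    return returned == sent, sent - returned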

Techniques for Hand Counting

Hand counting is subject to error.
Generally, if a hand count of a batch disagrees with the preliminary results for that batch,
the hand count is repeated until it is clear that the discrepancy is real and not the result of
error in the hand count.
I have never seen written instructions that specified how many times to count by hand, when to conclude
that the discrepancy is real, and so on.
I'm sure some jurisdictions have such instructions, but I would guess that these vary considerably from
county to county and state to state.

Hand counts are generally conducted by teams of two, three, or four people.
In four-person teams, the usual approach is for one person to read the
votes aloud, a second person to look over the reader's shoulder to make
sure he or she read correctly, and two talliers to record the vote separately,
so their tallies can be checked against each other.

Some jurisdictions sort the ballots by how the votes were cast, then count stacks
of ballots (sort-and-stack).
Some cut VVPAT rolls apart before counting; some keep the rolls intact.
Some count one contest at a time on each ballot; some count all contests simultaneously.
I am not aware of any scientific studies of the efficiency and accuracy of competing
methods for hand counting.
Anecdotal evidence suggests that sort-and-stack is less accurate than simply calling
out the votes sequentially.

Audit Procedures: What batches to choose, where to start, when to stop

Selecting batches at random is essential to efficient auditing.
As described below, methods for selecting random batches
should involve a mechanical source of randomness, such as rolling dice.
Many sampling schemes have been proposed, including
simple random sampling,
stratified simple random sampling,
NEGEXP sampling,
sampling with probability proportional to an error bound (PPEB),
and combinations of random sampling and "targeted" sampling.

For all those approaches, it is possible to assess the evidence that
the outcome is correct in light of the errors the audit finds.
(See, for instance, this paper.)
If the goal is to ensure that whenever the outcome is wrong, there is a large chance of
a full hand count to set the record straight, it does not matter how large the initial
sample is: A sensible procedure will examine the sample, assess the evidence that the outcome
is correct, and require more counting if that evidence is not sufficiently strong.
The evidence could be weak because the sample is small or because the margin is small or because
the audit found many errors that favored the apparent winner.
What is crucial to controlling the risk of certifying a wrong outcome is when to stop
counting.
The rule must ensure that the chance of stopping before a full hand count is small if the apparent
outcome is wrong.
See the citations to my work below.
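The stopping logic described above can be sketched as a loop. Here `draw_batch` and `p_value` are hypothetical stand-ins for the sampling scheme and the statistical test a real audit would specify; the point is only that counting continues until the evidence is strong or every batch has been hand counted.

```python
def risk_limiting_audit(batches, draw_batch, p_value, risk_limit=0.05):
    """Escalate the audit until the evidence is strong or the count is full.
    draw_batch(remaining) picks the next batch to hand count;
    p_value(sample) measures the evidence that the outcome is correct
    (small P-value = strong evidence)."""
    sample = []
    remaining = list(batches)
    while remaining:
        b = draw_batch(remaining)
        remaining.remove(b)
        sample.append(b)  # the batch is hand counted at this point
        if p_value(sample) <= risk_limit:
            return "confirm"  # strong evidence the outcome is correct
    return "full hand count"  # the full count settles the outcome directly
```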

Currently, legislation and discussions of auditing are bogged down in arguments about how large a sample
to draw initially.
I think this is quite counterproductive.
It makes more sense to focus attention on rules for stopping the count.

Sound statistical rules for stopping short of a full hand count depend on the reported votes in
each auditable batch,
on how the sample is drawn, on the sample size, on the observed discrepancies in the sample,
and on technical details including the choice of statistical tests.
Some sampling schemes, such as PPEB, make it easier to devise tests that take into account the distribution of
error in the sample and allow errors that hurt the apparent winner to strengthen the evidence that the outcome
is correct.
Other sampling schemes make it difficult to use anything but the largest (weighted) error observed in the sample
to assess the evidence.
This is an area of rapid progress.
The most efficient method so far is based on PPEB and the Kaplan-Markov P-value.
See the references below.
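A sketch of the Kaplan-Markov P-value as I understand it from the cited references (the exact formula and conventions should be checked against those papers): each PPEB draw contributes a factor depending on its "taint"—the observed error in the batch divided by that batch's error bound—and on U, the total error bound over all batches measured in margins. Note that errors that hurt the apparent winner (negative taint) shrink the P-value, strengthening the evidence.

```python
def kaplan_markov_p(taints, U):
    """Kaplan-Markov P-value for a PPEB sample (a sketch, not the
    authoritative formula).  taints: observed error / error bound for
    each draw; U: total error bound across all batches, in margins."""
    p = 1.0
    for t in taints:
        if t >= 1.0:
            return 1.0  # a batch attained its bound: no evidence at all
        p *= (1.0 - 1.0 / U) / (1.0 - t)
    return min(p, 1.0)
```

With U = 20 margins and no error found, about 60 draws suffice for a P-value below 0.05.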

What would a good audit law look like?

I think it is a bad idea to legislate particular audit methods, because that can lock in
a method that does not work as claimed, or that is superseded by a more efficient method.
Instead, I think audit laws should be as simple as possible and enunciate principles,
rather than procedures.
For instance, something like the following seems reasonable:

(i) Every statewide contest shall be audited to ensure that, if the preliminary outcome of the
contest is incorrect, there is at least a C% chance that the audit will correct the outcome.

(ii) Contests that are not statewide but for which more than X registered voters are eligible
to vote shall be audited to ensure that, if the preliminary outcome of the contest is incorrect,
there is at least a D% chance that the audit will correct the outcome.

(iii) A random sample of Y% of the contests not audited under (i) or (ii) shall be audited
to ensure that, if the preliminary outcome of the contest is incorrect, there is at least an E%
chance that the audit will correct the outcome.
The contests to be audited under this provision shall not be selected until after the preliminary
results for all contests have been published.

(iv) At least Z% of the ballots cast in contests not audited under (i)–(iii) shall be selected
at random and counted by hand.
Elections officials shall report the strength of the evidence that the outcomes
of those contests are correct.
Strength of evidence shall be defined to be the smallest chance that the ballots would contain
as little error as they were found to contain, if the outcome were in fact incorrect.
The meaning of "little" shall be specified in regulation.

It would be up to lawmakers to choose C, D, E, X, Y,
and Z.
It would be a matter of regulation to specify how to achieve the limits on risk in (i)–(iii) and
the measure of "little" and the calculation of the probability in (iv).
Regulation could list one or more methods that are deemed by the Secretary of State to be acceptable,
and principles and procedures for evaluating and gaining approval for other methods.

California AB 2023 (Saldana) calls for a pilot of risk-limiting audits in 2011.
It is the first bill to get risk-limiting language right: It requires the audit to have a
large chance of resulting in a full hand count if the electoral outcome is wrong, in which
case, the hand count determines the official outcome.

As of this writing, there are methods to limit risk for every sampling scheme that I have seen proposed for
post-election audits: simple random samples, stratified simple random samples, sampling with
probability proportional to a bound on the error in each batch (PPEB),
and sampling with probability related to the negative exponential of a bound on the error in each
batch (NEGEXP).
Moreover, there are techniques to incorporate "targeted," deterministic sampling into the
risk calculations.
See the citations below.

It should be noted that ensuring a high probability of correcting a wrong outcome is not the only
sensible thing to do.
For instance, ensuring that at most S% of certified outcomes are incorrect seems quite reasonable,
and could well require less counting.
However, there is as yet no method to accomplish the latter, which is tied to the idea of the
False Discovery Rate.

Legislation should require disclosure of all algorithms and the source code for all software used in an
audit—even commercial, off-the-shelf (COTS) software—so that
the public can verify that the calculations were correct.
The fact that software is commercial is no guarantee that it does what it's supposed to do: See the
references below on known bugs in Excel.

Legislation should also require prompt publication of batch-level results, prior to the selection
of audit samples.
If stratified sampling is permitted, results should be published for all batches in a stratum before
the sample is selected from that stratum.
Batch-level results should be available in one easy-to-find location, for instance, the
Secretary of State's website.

If batches are extremely small—for instance, single ballots—it might make more sense
to report subtotals for somewhat larger groups of ballots, and have a mechanism that allows
elections officials to commit indelibly to the subtotals for the small batches.
For instance, the jurisdiction could put a digitally signed file in escrow with the Secretary of State;
the auditors would compare hand counts to that reference copy to determine whether the original
count has errors.

And legislation or regulation should require "sanity checks" such as (i) verifying that the
reported vote total for a precinct does not exceed the number of registered voters in that precinct,
(ii) (for paper ballots) verifying that the number of ballots returned voted, spoiled, and
unvoted equals the number
sent to the precinct, and (iii) verifying that the vote subtotals reported for the
batches sum to the totals for each contest.
Discrepancies should be posted in a central location, such as the Secretary of State's website.
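Checks (i) and (iii) are straightforward to automate. Here is a hedged sketch; the data layout is my assumption, not a standard format.

```python
def sanity_checks(precincts, batch_subtotals, contest_totals):
    """Checks (i) and (iii) from the text: votes reported for a precinct
    must not exceed its registered voters, and batch subtotals must sum
    to the reported contest totals.  Returns a list of problems found."""
    problems = []
    for name, info in precincts.items():
        if info["votes"] > info["registered"]:
            problems.append(f"{name}: more votes than registered voters")
    sums = {}
    for batch in batch_subtotals:
        for candidate, v in batch.items():
            sums[candidate] = sums.get(candidate, 0) + v
    if sums != contest_totals:
        problems.append("batch subtotals do not sum to contest totals")
    return problems
```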

Logistical Barriers

There are very real limits to the amount and nature of work that elections officials can add to
the burden of a canvass.
Expecting elections officials to perform statistical computations is perhaps unreasonable, but
expecting officials to report election results and audit results at the batch level would
be reasonable if EMSs supported structured data output.
EMSs currently are not designed to generate the reports needed to design and perform audits.
As discussed below, trivial changes to EMSs could make
auditing substantially easier and
more efficient by exporting structured,
machine-readable data at the level of arbitrarily small batches.

Questions that need to be considered include:

How much computing can elections officials be expected to do?

What computational tools are available to them?

How much counting can officials do during the canvass period?

How many contests can they audit during the canvass period?

How much space does it take to audit?

How many people does it take?

Who bears the expense?

How hard is it to retrieve batches of ballots for hand counting?

How much hand sorting is required before counting can happen?

What is the effect of the variability of audit burden on budgeting and logistics?

How can the public observe the process—including inspecting the computing tools
employed—and ensure it is accurate?

One political barrier that continues to surprise me is the insistence of some
election integrity advocates and politicians on putting detailed methods into
law, for instance, by mandating sample sizes in complicated tables and (mis)using
statistical jargon.
I think that makes bad law and bad audits.

Changes Needed

Election management needs to change in a couple of simple ways for routine
audits to be performed efficiently and economically.
If these changes are brought about—through legislation, regulation, or
market pressure—the rest of the auditing problem could be solved by
"bolt-on" audit procedures.
The latest and best audit procedures could be used as soon as the changes happen.
Hence, I believe that the following structural changes should start now.
The changes regard data export from election management systems, batch sizes,
and data reporting.

Data Plumbing

Any audit requires batch-level data on the number of votes cast and how they were cast.
As mentioned, EMSs generally do not support exporting batch-level data
in a structured, machine-readable format.
That needs to be addressed.

EMSs are generally built on databases that can be queried using SQL.
It would be next to trivial to write SQL queries to export the data needed for
audits—but it requires vendor cooperation.
If states would demand that vendors provide this functionality, routine audits would be much easier
and more efficient.
However, such changes might require recertification of the EMS.
On the other hand, query tools that could be used with a replicated database would suffice;
whether that would be acceptable requires a legal determination of whether the audit is
part of the canvass.
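To illustrate how little is being asked of vendors, here is the kind of query involved, run against a toy schema of my own invention; real EMS schemas are proprietary and will differ.

```python
import sqlite3

# Hypothetical batch-level results table -- a stand-in for the EMS database.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE results (batch_id TEXT, contest TEXT,
                          candidate TEXT, votes INTEGER);
    INSERT INTO results VALUES
        ('P12-A', 'Mayor', 'Alice', 40), ('P12-A', 'Mayor', 'Bob', 55),
        ('P12-B', 'Mayor', 'Alice', 48), ('P12-B', 'Mayor', 'Bob', 47);
""")

# The export an audit needs: per-batch subtotals for one contest.
rows = con.execute(
    "SELECT batch_id, candidate, SUM(votes) FROM results "
    "WHERE contest = 'Mayor' "
    "GROUP BY batch_id, candidate ORDER BY batch_id, candidate").fetchall()
```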

The plumbing should be changed to make it easy to collect batch-level election data and batch-level
audit data in a central location, such as the Secretary of State's website.
This could be facilitated by adopting a standard structured data format for election data,
such as OASIS/EML.

Reducing Batch Sizes

Generally, the smaller the auditable batches, the less counting is needed to confirm the outcome
(at a given risk limit) if the outcome is correct.
See Stark, P.B., 2010.
Why small audit batches are more efficient: two heuristic explanations
(http://statistics.berkeley.edu/~stark/Preprints/smallBatchHeuristics10.htm)
for heuristic explanations of the effect of reducing batch size.
Modest reductions in batch size can cut auditing burden enormously.
For instance, simulations using data from Marin County, CA, show that reducing batch size from precincts
(with VBM and in-person votes combined, as currently required by California law) to a
maximum of 100 ballots could reduce risk by a factor of 10, counting the same number of ballots in all.
California's 1% audit could actually confirm many outcomes if the batch size were reduced from entire
precincts of combined in-person and VBM ballots to smaller batches, such as 25–100 ballots.
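A back-of-the-envelope version of the heuristic in the reference above (all numbers and the specific bound are illustrative simplifications of my own): roughly margin/(2 × batch size) batches must be corrupted to overturn the margin, so the number of batches to sample is nearly independent of batch size, and the number of ballots counted scales with the batch size.

```python
from math import ceil, log

def ballots_to_count(total_ballots, margin_votes, batch_size, alpha=0.1):
    """Rough ballots to hand count so that, if the outcome is wrong, the
    chance of missing every corrupted batch is at most alpha.
    Assumes equal-size batches; flipping one vote moves the margin by <= 2."""
    N = ceil(total_ballots / batch_size)               # number of batches
    k = max(1, ceil(margin_votes / (2 * batch_size)))  # min corrupted batches
    if k >= N:
        return total_ballots  # only a full count could settle it
    n = ceil(log(alpha) / log(1 - k / N))              # batches to sample
    return min(n, N) * batch_size
```

For a 100,000-ballot contest with a 2,000-vote margin, 100-ballot batches require hand counting far fewer ballots than 1,000-ballot batches, even though the number of batches sampled is the same.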

How can we reduce batch sizes?
For DREs with VVPATs, we could audit machine results instead of precinct results; EMSs generally
can track results separately by machine.
If the VVPAT rolls corresponding to a given machine can be identified, those subtotals can be audited.
Regardless of the voting technology, we can keep votes cast in person separate from votes cast by mail
to create more, smaller batches.

For jurisdictions that use central-count optical scan systems (CCOS), ballots can be divided into
small "decks" of no more than 100 ballots before counting.
The jurisdiction would need to keep decks separate so that they can be counted by hand if
the deck is selected for audit, and the EMSs would need to be able to report subtotals by deck.
This approach will be tested in Marin County, CA, in November 2009.

For jurisdictions that use precinct-count optical scan systems (PCOS), here is an idea for
reducing batch sizes without increasing the burden on pollworkers.
EMSs can track votes by ballot style.
Artificially increasing the number of ballot styles by marking groups of no more than 100 ballots with
a barcode to identify them as a batch
would allow existing software to track ballots in smaller batches (and potentially to report subtotals
for those batches).
It would not be necessary for jurisdictions to account for each ballot pseudo-style sent to a precinct
separately; the difference between the styles is solely so that the EMS can tally subtotals for
each batch and so that—if the batch is selected for audit—the ballots that comprise the
batch can be identified.


Here is a sketch of how the idea would work.
Consider a precinct with 800 registered voters, of whom 300 request absentee ballots.
The jurisdiction would print three batches of 100 VBM ballots that are identical except for a barcode
and a letter (A–C).
The jurisdiction would print (up to) five batches of ballots to be used in the precinct, identical
except for a barcode and a letter (D–H).
When the ballots are scanned—whether using CCOS or PCOS—the barcode would make it
possible for the EMS to subtotal batches of no more than 100 ballots.
The data plumbing changes proposed elsewhere in this document
would make it possible for the EMS to export those subtotals in
a useful format.
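The tallying side of the idea is trivial for software. In the sketch below, each scan is reduced to a (barcode, candidate) pair—a drastic simplification of a real cast-vote record, assumed only for illustration.

```python
from collections import Counter

def subtotal_by_batch(scans):
    """Subtotal votes per pseudo-style batch, keyed by barcode.
    Each scan is a (batch_barcode, candidate) pair."""
    tallies = {}
    for barcode, candidate in scans:
        tallies.setdefault(barcode, Counter())[candidate] += 1
    return tallies
```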

If the audit selects one of the batches from a precinct to count by hand, the
ballots in that precinct are then
sorted manually (using the letter code) or with an automated sorter (using the barcode)
to separate the batch that is to be counted by hand.
There would be no need to sort and separate batches in precincts in which no batches are to be audited.

Using small batches raises privacy concerns.
Steps should be taken to ensure that pollworkers do not use the proliferation of ballot pseudo-styles to
determine how particular subgroups of voters voted, for instance, by giving all senior citizens
ballots of one style.
Omitting the human-readable letter would improve voter privacy, since—on the assumption that
people do not routinely read barcodes—neither the voter nor pollworkers nor elections officials would
know which pseudo-ballot-style a given voter received.
To increase voter privacy, the ballot pseudo-styles could be shuffled before handing ballots to voters, so
that there is no way to know which batch contains a given voter's ballot.

Reducing batch sizes can reduce the burden of hand counting by a factor of hundreds.
It could increase costs in other ways, though.
For instance, the physical batches of ballots need to be retrievable, which entails some
organizational costs.
Using ballot pseudo-styles would increase printing costs, modestly, I think—but
I hope to assess this quantitatively.
Choosing the batch size to avoid tiny "remainder" batches would reduce some waste:
For instance, if a precinct has 515 registered voters, one might print
75-ballot batches, so that the remainder batch contains 515−450 = 65 ballots, rather than 515−500 = 15.
Proliferating ballot pseudo-styles could greatly increase the cost of logic and accuracy
testing if every ballot pseudo-style needs to be tested individually.
But perhaps logic and accuracy testing could be done with a mix of
ballot pseudo-styles for each
precinct (say, half a dozen of each pseudo-style, shuffled together),
instead of using separate testing for each ballot pseudo-style.
There is clearly room for clever solutions.
Longer term, voting systems should be designed to facilitate creating and reporting small batches.

Reporting Requirements

Jurisdictions should be required to publish batch-level subtotals for all contests promptly and before
auditing begins—preferably at a central location for the entire state, such as the Secretary of
State's website.
(Alternatively, the batch-level
subtotals need to be committed to indelibly before the audit starts, as discussed above.)
The report should include the number of registered voters in the batch, the number of ballots cast in the
batch, and the number of votes cast for each candidate in each contest.
Jurisdictions should also be required to publish audit procedures and audit results: the number of votes
found for each candidate in each audited batch in each audited contest.

As noted above, jurisdictions should also be required to publish algorithms and source code for
any software used in an audit.

Random Selection

It is crucial that the batches be selected at random, and that the random selection have a real,
observable random input, such as rolls of 10-sided dice.
However, it is impractical to roll dice for each batch to be selected.
Instead, it makes sense to roll 10-sided dice a moderate number of times (e.g., 10–15) and use
the result as a seed in a high-quality, open-source pseudo-random number generator (PRNG).
Using an open-source PRNG enables the public to input the seed that was generated by dice rolls and confirm
that the correct precincts were audited.
Using a PRNG also makes it possible to have one public "drawing" from which any size sample can
be generated by continuing the sequence that the PRNG produces.
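A sketch of the seeding procedure: Python's built-in Mersenne Twister stands in here for whatever high-quality open-source PRNG regulation would actually specify, and the draws are with replacement, as PPEB sampling uses.

```python
import random

def audit_sample(dice_digits, num_batches, sample_size):
    """Seed a PRNG from publicly rolled 10-sided dice and draw batch
    indices.  Anyone can re-run this with the published digits to
    verify that the correct batches were audited."""
    seed = int("".join(str(d) for d in dice_digits))
    rng = random.Random(seed)
    return [rng.randrange(num_batches) for _ in range(sample_size)]
```

Because the seed is public and the generator is open, extending the sample is just a matter of continuing the same sequence.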

Using Excel to select random batches should be prohibited.
The PRNG in Excel is known to be faulty, and
it does not permit the user to specify a seed.

Next Steps

I expect that we will have very efficient audit methods applicable to a wide range of practices
within about two years—probably before the data plumbing and batch-size issues can be
worked out.
The theoretical bottlenecks right now are in applying the most efficient sampling scheme
(PPEB) to situations where logistical concerns mandate stratification (contests that cross
jurisdictional boundaries).
Drawing a stratified PPEB sample is not a problem, but analyzing the data from such a sample to
control risk is an unsolved problem.

Experiments to determine the rate of manual counting errors using different counting
strategies could be very helpful.
Are 4-person teams more accurate than 3-person teams?
Is sorting and stacking more accurate and faster than calling out each ballot?
Is it more accurate to count one contest at a time, or to count every contest on a ballot at once?
Which is faster?
How do these findings depend on ballot design?

It would be helpful to have systematic data collection on discrepancies found by audits,
the apparent causes of those discrepancies, and other variables such as the technology used to
count votes, the ballot design, and voter instructions.
Those data should be reported nationally in a central repository, perhaps on the US EAC's website.
Such data would be useful not only for improving ballot design, voter instructions, and voting technology, but
also for making informed decisions about when to require automatic recounts.
For instance, if it were determined that CCOS misinterprets voter intent about 0.1% of the time
for voter marked ballots that ask the voter to connect the candidate to the office by drawing a line,
then if the reported margin in a contest using that voting system were 0.1% or below,
it might make sense to skip the audit and simply count the entire contest by hand.

Summary

Post-election auditing has made enormous strides in the last few years.
The paradigm has shifted from detecting error to confirming outcomes.
It is now possible to determine the strength of the evidence that an outcome is correct using
a wide variety of sampling schemes.
Legislation and regulation are changing, but so far the only real success (in my opinion) is California
AB 2023, which is still under consideration.
I would consider the new laws that have passed to be failures
because they actually preclude doing a good job.
Legislation that mandates particular sampling schemes, sample sizes, etc., is counterproductive.
I think the place to focus immediate attention is on legislation and regulation to get the plumbing in place
that will enable effective, efficient audits.
The methods to use the data will be available by the time the data are there to be used.

References on bugs in Microsoft Excel

For issues and horror stories on spreadsheets more generally (not bugs), see
The European Spreadsheet Risk Interest Group.
The user interface (UI) of spreadsheets invites errors, then makes the errors hard to find,
in part because the UI conflates input, code, output, and presentation.
Performing unit testing on spreadsheets is difficult.

This bibliography singles out Excel, but Excel is not the only commercial,
off-the-shelf computational software with bugs.
I have run into bugs in both MATLAB and SAS that produced seriously erroneous numerical results.
But Excel is very widely used, and this literature is at hand.

Excerpt:
Excel 2007, like its predecessors, fails a standard set of
intermediate-level accuracy tests in three areas: statistical
distributions, random number generation, and estimation. Additional
errors in specific Excel procedures are discussed. Microsoft's
continuing inability to correctly fix errors is discussed. No
statistical procedure in Excel should be used until Microsoft
documents that the procedure is correct; it is not safe to assume that
Microsoft Excel's statistical procedures give the correct
answer. Persons who wish to conduct statistical analyses should use
some other package.

If users could set the seeds, it would be an easy matter to compute
successive values of the WH RNG and thus ascertain whether Excel is
correctly generating WH RNGs. We pointedly note that Microsoft
programmers obviously have the ability to set the seeds and to verify
the output from the RNG; for some reason they did not do so. Given
Microsoft's previous failure to implement correctly the WH RNG, that
the Microsoft programmers did not take this easy and obvious
opportunity to check their code for the patch is absolutely
astounding.