Selection: Sampling methodology

Koen Derks

last modified: 19-08-2021

Sampling methodology

Auditors are often required to assess balances or processes that involve a large number of items. Since they cannot inspect all of these items individually, they need to select a subset (i.e., a sample) from the total population to make a statement about a certain characteristic of the population. For this purpose, various selection methodologies are available that have become standard in an audit context. However, in practice the distinction between sampling methods, and when to use each of them, is not always easy to make.

This vignette outlines the most commonly used sampling methodologies for auditing and shows how to select a sample using these methods with the jfa package.

Sampling units

Selecting a subset from the population requires knowledge of the sampling units: the physical representations of the population that needs to be audited. Generally, the auditor has to choose between two types of sampling units: individual items in the population or individual monetary units in the population. In order to perform statistical selection, the population must be divided into individual sampling units that can each be assigned a probability of being included in the sample. The total collection of all sampling units that have been assigned a selection probability is called the sampling frame.

Items

A sampling unit for record (i.e., attributes) sampling is generally a characteristic of an item in the population. For example, suppose that you inspect a population of receipts. A possible sampling unit for record sampling can be the date of payment of the receipt. When a sampling unit (e.g., date of payment) is selected by the sampling algorithm, the population item that corresponds to the sampled unit is included in the sample.

Monetary units

A sampling unit for monetary unit sampling differs from a sampling unit for record sampling in that it is an individual monetary unit within an item or transaction, like an individual dollar. For example, a single sampling unit can be the 10\(^{th}\) dollar from a specific receipt in the population. When a sampling unit (e.g., individual dollar) is selected by the sampling algorithm, the population item that includes the sampling unit is included in the sample.
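The mapping from a selected monetary unit to the item that contains it can be sketched in base R. This is only an illustration under assumed data: the book values and the helper `item_of_unit()` are hypothetical and are not part of jfa.

```r
# Hypothetical book values of three receipts (illustration only)
book_values <- c(100, 250, 50)

# Upper bound of each item's range on the cumulative monetary scale
upper <- cumsum(book_values)  # 100, 350, 400

# Hypothetical helper: index of the item that contains a given monetary unit
item_of_unit <- function(unit, upper) {
  # left.open = TRUE makes the intervals (lower, upper], so unit 100 still
  # belongs to item 1 and unit 101 is the first unit of item 2
  findInterval(unit, c(0, upper), left.open = TRUE)
}

# Units 10 and 100 fall in item 1, unit 101 in item 2, unit 400 in item 3
item_of_unit(c(10, 100, 101, 400), upper)
```

Because larger items cover more monetary units, they are hit more often by a selected unit than smaller items.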

Sampling algorithms

This section discusses the four sampling algorithms implemented in jfa. First, for notation, let the population \(N\) be defined as the total set of individual sampling units \(x_i\):

\[N = \{x_1, x_2, \dots, x_N\}.\]

In statistical sampling, every sampling unit \(x_i\) in the population must receive a selection probability \(p(x_i)\). The purpose of the sampling algorithm is to provide a framework to assign selection probabilities to each of the sampling units, and subsequently draw sampling units from the population until a set of size \(n\) has been created.

The next section discusses which sampling algorithms are available in jfa. To illustrate the outcomes for different sampling algorithms we will use the BuildIt data set that can be loaded using the code below.

library(jfa)   # provides selection() and the BuildIt data set
data(BuildIt)

Fixed interval sampling (Systematic sampling)

Fixed interval sampling is an algorithm designed to yield representative samples from monetary populations. The algorithm determines a uniform interval on the (optionally ranked) sampling units. Next, a starting point is handpicked or randomly selected in the first interval, and from that starting point one sampling unit is selected at each subsequent interval throughout the population. For example, if the interval has a width of 10 sampling units and sampling unit number 5 is chosen as the starting point, the sampling units 5, 15, 25, etc. are selected to be included in the sample.

The width of the interval \(I\) can be determined by dividing the number of sampling units in the population by the required sample size:

\[I = \frac{N}{n},\]

in which \(n\) is the required sample size and \(N\) is the total number of sampling units in the population.
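As a worked example, assume (for illustration only) a population of \(N = 3500\) sampling units and a required sample size of \(n = 100\):

\[I = \frac{3500}{100} = 35.\]

With a starting point of 1, the selected sampling units would then be 1, 36, 71, and so on, which is the same spacing of 35 rows that appears in the record sampling output further below.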

If the space between the selected sampling units is equal, the selection probability for each sampling unit is theoretically defined as:

\[p(x) = \frac{1}{I},\]

with the property that the space \(i\) between selected units is the same as the interval \(I\), see Figure 1. However, in practice the selection is deterministic and depends completely on the chosen starting point (set through the start argument).

Figure 1: Illustration of fixed interval sampling

The fixed interval algorithm yields a sample that allows every sampling unit in the population an equal chance of being selected. However, the fixed interval (systematic) sampling algorithm has the property that all items in the population with a monetary value larger than the interval \(I\) have a selection probability of one, because at least one of the sampling units in such an item is always selected. Note that, if the population is arranged randomly with respect to its deviation pattern, fixed interval sampling is equivalent to random selection.
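The property that items larger than the interval are always selected can be checked with a small base R sketch. The book values and the helper `fixed_interval_items()` are hypothetical and only mimic the idea; they are not the jfa implementation.

```r
# Hypothetical book values; item 3 exceeds the interval computed below
book_values <- c(40, 60, 300, 80, 120)
n <- 4                                  # required sample size
interval <- sum(book_values) / n        # 600 / 4 = 150 monetary units

# Hypothetical helper: items hit by fixed interval selection on monetary units
fixed_interval_items <- function(values, interval, start) {
  n <- round(sum(values) / interval)
  # Monetary units selected at a fixed distance from the starting point
  units <- start + (seq_len(n) - 1) * interval
  # Map every selected monetary unit to the item that contains it
  findInterval(units, c(0, cumsum(values)), left.open = TRUE)
}

# Check every possible starting point in the first interval
always_in <- sapply(seq_len(interval), function(s) {
  3 %in% fixed_interval_items(book_values, interval, start = s)
})
all(always_in)  # TRUE: item 3 (300 > 150) is selected for every start
```

With `start = 1` the selected monetary units are 1, 151, 301 and 451, which fall in items 1, 3, 3 and 4. Item 3 is hit twice in this sketch, which is the kind of repeated selection that the times column in the output below appears to count.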

Advantages: The advantage of the fixed interval sampling algorithm is that it is often simple to understand and fast to perform. Another advantage is that, in monetary unit sampling, all items that are greater than the calculated interval will be included in the sample. In record sampling, since units can be ranked on the basis of value, there is also a guarantee that some large items will be in the sample.

Disadvantages: A pattern in the population can coincide with the selected interval, rendering the sample less representative. What is sometimes seen as an added complication for this algorithm is that the sample is hard to extend after drawing the initial sample. This is due to the chance of selecting the same sampling unit. However, by removing the already selected sampling units from the population and redrawing the intervals this problem can be efficiently solved.
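Extending a sample by removing the already selected units and redrawing the intervals can be sketched as follows for record sampling. The population size, sample sizes, and variable names are assumptions made for illustration; this is not how jfa necessarily handles extensions.

```r
N <- 3500                               # hypothetical population size
n <- 100                                # initial sample size
interval <- N / n                       # 35
selected <- seq(from = 1, by = interval, length.out = n)

# Remove the units that are already in the sample
remaining <- setdiff(seq_len(N), selected)

# Redraw the intervals on the remaining units for 50 additional selections
n_extra <- 50
interval_extra <- length(remaining) / n_extra   # 3400 / 50 = 68
position <- seq(from = 1, by = interval_extra, length.out = n_extra)
extra <- remaining[position]

# No unit can be selected twice across the initial and the extended sample
length(intersect(selected, extra))
```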

As an example, the code below shows how to apply the fixed interval sampling algorithm in a record sampling and a monetary unit sampling setting. Note that, by default, the first sampling unit from each interval is selected. However, this can be changed by setting the argument start = 1 to a different value.

# Record sampling
sample <- selection(data = BuildIt, size = 100, units = 'items', method = 'interval', start = 1)
head(sample$sample, n = 6)
##   row times    ID bookValue auditValue
## 1   1     1 82884    242.61     242.61
## 2  36     1 80125    118.58     118.58
## 3  71     1 27566    481.44     481.44
## 4 106     1 88261    266.66     266.66
## 5 141     1 58999    568.60     568.60
## 6 176     1 27801    314.65     314.65
# Monetary unit sampling
sample <- selection(data = BuildIt, size = 100, units = 'values', method = 'interval', values = 'bookValue', start = 1)
head(sample$sample, n = 6)
##   row times    ID bookValue auditValue
## 1   1     1 82884    242.61     242.61
## 2  38     1 57172    329.30     329.30
## 3  73     1 90160    205.69     205.69
## 4 110     1  4756    295.96     295.96
## 5 146     1 90183    333.28     333.28
## 6 183     1 96080    449.07     449.07

Cell sampling

Like fixed interval sampling, cell sampling divides the (optionally ranked) population into a set of intervals of width \(I\), computed through the previously given equation. The difference is that cell sampling selects a sampling unit within each interval by randomly drawing a number within the interval range \(I\). This causes the space \(i\) between the selected sampling units to vary.

Like in the fixed interval sampling algorithm, the selection probability for each sampling unit is defined as:

\[p(x) = \frac{1}{I}.\]

Figure 2: Illustration of cell sampling

The cell sampling algorithm has the property that all items in the population with a monetary value larger than twice the interval \(I\) have a selection probability of one, because such an item always spans at least one complete cell.
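A minimal base R sketch of the cell idea for record sampling is given below. The population size and variable names are assumptions for illustration, and the sketch is not the jfa implementation.

```r
set.seed(1)                             # make the random draws reproducible
N <- 3500                               # hypothetical population size
n <- 100                                # required sample size
interval <- N / n                       # 35 sampling units per cell

# Lower bound of each cell: 0, 35, 70, ...
cell_start <- (seq_len(n) - 1) * interval

# One uniformly drawn position (1 to 35) inside every cell
offset <- sample.int(interval, size = n, replace = TRUE)
selected <- cell_start + offset

head(selected)
range(diff(selected))                   # spacing between selections varies
```

Each cell contributes exactly one sampling unit, but the distance between two consecutive selections can be anywhere between 1 and \(2I - 1\) sampling units.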

Advantages: More sets of samples are possible than in fixed interval sampling, as there is no systematic interval \(i\) to determine the selections. It is argued that the cell sampling algorithm offers a solution to the pattern problem in fixed interval sampling.

Disadvantages: Not all items in the population with a monetary value larger than the interval have a selection probability of one. In addition, a population item can span two adjacent cells, which creates the possibility that the item is included in the sample twice.

As an example, the code below shows how to apply the cell sampling algorithm in a record sampling and a monetary unit sampling setting. It is important to set a seed to make the results reproducible.

# Record sampling
set.seed(1)
sample <- selection(data = BuildIt, size = 100, units = 'items', method = 'cell')
head(sample$sample, n = 6)
##   row times    ID bookValue auditValue
## 1   9     1 14608    216.48     216.48
## 2  48     1 45437    347.94     139.18
## 3  90     1 90333    241.17     241.17
## 4 136     1 45746    440.72     440.72
## 5 147     1 72906    677.62     677.62
## 6 206     1 93529    528.79     528.79
# Monetary unit sampling
set.seed(1)
sample <- selection(data = BuildIt, size = 100, units = 'values', method = 'cell', values = 'bookValue')
head(sample$sample, n = 6)
##   row times    ID bookValue auditValue
## 1   8     1 81460    295.20     295.20
## 2  53     1 80645    677.88     677.88
## 3  92     1 75133    355.16     355.16
## 4 142     1 68676    612.46     612.46
## 5 153     1 63777    552.83     552.83
## 6 214     1 25379   1021.07    1021.07

Random sampling

Random sampling is the simplest and most straightforward selection algorithm. It allows every sampling unit in the population an equal chance of being selected, meaning that every combination of sampling units has the same probability of being selected as every other combination of the same number of sampling units. Simply put, the algorithm draws a random selection of size \(n\) from the sampling units. Therefore, the selection probability for each sampling unit on a single draw is defined as:

\[p(x) = \frac{1}{N},\]

where \(N\) is the number of units in the population. To clarify this procedure, Figure 3 provides an illustration of the random sampling algorithm.

Figure 3: Illustration of random sampling
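The two variants of random sampling can be sketched in base R as follows. The book values are hypothetical, and the sketch only illustrates the principle rather than the jfa implementation.

```r
set.seed(1)                             # make the random draws reproducible
book_values <- c(40, 60, 300, 80, 120)  # hypothetical book values
n <- 2                                  # required sample size

# Record sampling: every item is equally likely to be drawn
items <- sample.int(length(book_values), size = n)

# Monetary unit sampling: draw individual monetary units at random and
# map each selected unit to the item that contains it
units <- sample.int(sum(book_values), size = n)
items_mus <- findInterval(units, c(0, cumsum(book_values)), left.open = TRUE)
```

In the monetary unit variant, item 3 covers 300 of the 600 monetary units, so it is far more likely to be hit than the other items. Two different monetary units can also belong to the same item.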

Advantages: The random sampling algorithm yields an optimal random selection, with the additional advantage that the sample can be easily extended by applying the same algorithm again.

Disadvantages: Because the selection probabilities are equal for all sampling units there is no guarantee that items with a large monetary value in the population will be included in the sample.

As an example, the code below shows how to apply the random sampling algorithm in a record sampling and a monetary unit sampling setting. It is important to set a seed to make results reproducible.

# Record sampling
set.seed(1)
sample <- selection(data = BuildIt, size = 100, units = 'items', method = 'random')
head(sample$sample, n = 6)
##    row times    ID bookValue auditValue
## 1 1017     1 50755    618.24     618.24
## 2  679     1 20237    669.75     669.75
## 3 2177     1  9517    454.02     454.02
## 4  930     1 85674    257.82     257.82
## 5 1533     1 31051    308.53     308.53
## 6  471     1 84375    824.66     824.66
# Monetary unit sampling
set.seed(1)
sample <- selection(data = BuildIt, size = 100, units = 'values', method = 'random', values = 'bookValue')
head(sample$sample, n = 6)
##    row times    ID bookValue auditValue
## 1 2174     1 90260    625.98     625.98
## 2 2928     1 68595    548.21     548.21
## 3 1627     1 98301    429.07     429.07
## 4  700     1 29683    239.26     239.26
## 5  147     1 72906    677.62     677.62
## 6 3056     1 86317    246.22     246.22

Modified sieve sampling

The fourth option for the sampling algorithm is modified sieve sampling (Hoogduin, Hall, & Tsay, 2010). This algorithm starts by selecting a standard uniform random number \(R_i\) between 0 and 1 for each item in the population. Next, the sieve ratio:

\[S_i = \frac{Y_i}{R_i}\]

is computed for each item by dividing the book value \(Y_i\) of that item by its random number \(R_i\). Lastly, the items in the population are sorted by their sieve ratio \(S_i\) (in decreasing order) and the top \(n\) items are selected for inspection. In contrast to the classical sieve sampling algorithm (Rietveld, 1978), the modified sieve sampling algorithm provides precise control over sample sizes.
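The three steps described above (draw random numbers, compute sieve ratios, keep the top \(n\)) can be sketched in base R. The book values and variable names are hypothetical, and this is a minimal illustration rather than the jfa implementation.

```r
set.seed(1)                             # make the random draws reproducible
book_values <- c(40, 60, 300, 80, 120)  # hypothetical book values Y_i
n <- 2                                  # required sample size

# Step 1: one standard uniform random number R_i per item
r <- runif(length(book_values))

# Step 2: sieve ratio S_i = Y_i / R_i
sieve_ratio <- book_values / r

# Step 3: sort by sieve ratio in decreasing order and keep the top n items
selected <- order(sieve_ratio, decreasing = TRUE)[seq_len(n)]
```

Because exactly the top \(n\) ratios are kept, the sample always contains \(n\) distinct items, which illustrates the control over the sample size mentioned above.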

As an example, the code below shows how to apply the modified sieve sampling algorithm in a monetary unit sampling setting. It is important to set a seed to make results reproducible.

# Monetary unit sampling
set.seed(1)
sample <- selection(data = BuildIt, size = 100, units = 'values', method = 'sieve', values = 'bookValue')
head(sample$sample, n = 6)
##    row times    ID bookValue auditValue
## 1 2329     1 29919    681.10     681.10
## 2 2883     1 59402    279.29     279.29
## 3 1949     1 56012    581.22     581.22
## 4 3065     1 47482    621.73     621.73
## 5 1072     1 79901    789.97     789.97
## 6  488     1 50811    651.35     651.35