# (Sample) Size Matters

## Random sampling distributions are really something delightful

Last month I wrote about how the random sampling distributions (RSDs) of various sample statistics are the basis for pretty much everything in statistics. If you understand RSDs, you understand a lot about why we do what we do in hypothesis testing, inferential statistics, and estimation of confidence intervals. Understanding RSDs gives you a huge advantage as you seek to use data in business, so let's take a closer look.

First, a brief refresher from last month. We have the big honkin' distribution (BHD), which is the distribution of the entire population of interest, as shown in figure 1:

Figure 1: Big honkin' distribution (BHD) of the entire population of interest

To be able to make decisions in less than the amount of time necessary to evolve life on Earth, we only want to take a sample of some size n from the population. However, we have to understand how the sample statistics are distributed to make a decision, and that is where the RSD comes in. The RSD is the distribution of a sample statistic across all possible samples of size n. For example, the RSD of the means for n = 10 for the population above looks something like figure 2, drawn on the same scale as the BHD:

Figure 2: Random sampling distribution of the means for n = 10 for the population

If the RSD of the means is somehow related back to the population of the individuals, we can make some statements about the population from a single sample mean. That relationship is in fact known from the central limit theorem, the one piece of real magic in the universe.

The central limit theorem states:

• The average of the RSD of the means is the mean of the individuals
• The standard deviation of the RSD of the means is related to the standard deviation of the individuals by the inverse of the square root of the sample size: $\sigma_{\bar{x}} = \sigma / \sqrt{n}$
• The shape of the RSD of the means of any distribution tends to become more normal as the sample size increases.
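If you would rather see the first two properties than take them on faith, a small simulation does the trick. This is only a sketch: the exponential population, its mean of 100, the seed, and the number of repeated samples are all my own assumptions for illustration, not anything from the spreadsheet.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical, deliberately non-normal population (exponential, mean near 100).
population = [random.expovariate(1 / 100) for _ in range(100_000)]
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

n = 10  # sample size
# Approximate the RSD of the means by drawing many samples of size n
# (with replacement, which is harmless for a population this large).
sample_means = [
    statistics.fmean(random.choices(population, k=n)) for _ in range(20_000)
]

rsd_mean = statistics.fmean(sample_means)
rsd_sd = statistics.pstdev(sample_means)

print(f"population mean = {pop_mean:.2f}, RSD mean = {rsd_mean:.2f}")
print(f"sigma / sqrt(n) = {pop_sd / n ** 0.5:.2f}, RSD sd = {rsd_sd:.2f}")
```

The two numbers on each printed line should agree closely, even though the population itself is badly skewed.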

The theorem's third attribute allows us to use some more powerful statistics if we are investigating averages, because we can count on a normal distribution. I'll discuss more about that in a later article.

So how can we use the idea of an RSD to help us make decisions?

Let's say we are trying to increase the strength of a weld by changing welding parameters. Bobby Jo is fresh from welding school and thinks she has a procedure that will make it stronger.

The first step is to have a chat with management about risks to determine the correct sample size. You need to decide what minimum increase in strength is needed before you are going to change the current process. If the weld strength goes up by 0.0000001 ksi, we probably don't care to make any changes because it will cost more to do the experiment than we would ever make back from such a small increase. This minimum effect size is often called Δ, and it might be an engineering decision or a management calculation based on needed financial benefit and required return on investment. (Hmm, another idea for a future article.) Let's say that we do the calculations, and we need to have at least a 5 ksi improvement over the current weld strength to justify the expense of changing the process.

Ask the managers what probability they can live with of concluding that Bobby Jo's welding procedure increases the weld strength, when in fact it does not. (This is Type I or α error.) When your managers say, "Well, zero percent; I want you to be right!" you get to say, "Well then, we have to have an infinite sample size." And then you get an opportunity to teach them about statistical risks, which they should have learned in business school but because most business schools are data-averse, they learned about outsourcing instead. Once you get through that, they decide on α = 0.05.

Next they should decide what probability they are willing to tolerate of missing an actual improvement in weld strength of the minimum amount (Δ). (This is Type II or β error.) But the managers interrupt your fascinating discussion and say they want to have a sample size of 10. (Presumably because they have 10 fingers and can't count higher without taking off their shoes.)

What does that mean for your experiment? What does it have to do with RSDs?

When we calculate sample size, we are actually using the RSD of the means (or whatever statistic we are testing). But now we have to consider the size of the difference that we want to see. Let's assume normality for simplicity's sake and graph it out (see figure 3):

Figure 3: Considering the size of the difference on weld strength

The top distributions are three possibilities that the BHD might be. The blue one is the distribution if Bobby Jo's procedure has absolutely no effect on weld strength (known as the null hypothesis, or H0). The greenish teal and salmon ones (anyone know why Excel chooses difficult-to-name colors?) are what the weld strength would look like if her procedure increases or decreases the strength by the minimum amount necessary to consider changing our standard operating procedure (or making sure we never do it, if it is a shift down). Looking at the individuals, there is a lot of overlap, so any one measurement is not going to tell you if there is a change—we are going to have to take a sample and get an average to tell.

The middle distribution is the RSD of the means if there is no effect due to Bobby Jo's welding procedure. The red areas sum up to our α of 0.05 on that RSD (0.025 on each side). If we get an average of our 10 welds that is less than 93.80205 ksi or more than 106.1980 ksi, we are going to say that the chance of that average coming from the blue BHD is pretty small: less than 5 percent, in fact. So, as we explained to our managers, we have to tolerate some chance of saying that there is a change when in fact there is not, to be able to make any decision at all.
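For the curious, here is a rough sketch of where those two critical values come from. The article's figures carry the population details, so I am assuming a current mean of 100 ksi and a standard deviation of 10 ksi for individual welds. Those are values I backed out of the critical values themselves, so treat them as assumptions.

```python
from statistics import NormalDist

# Assumed inputs: inferred from the critical values quoted above,
# not stated directly in the text.
mu0 = 100.0    # current mean weld strength, ksi (assumption)
sigma = 10.0   # standard deviation of individual welds, ksi (assumption)
n = 10         # sample size the managers asked for
alpha = 0.05   # two-sided Type I risk

std_error = sigma / n ** 0.5              # standard deviation of the RSD of the means
z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05

lower = mu0 - z * std_error
upper = mu0 + z * std_error
print(f"reject H0 if the sample mean is below {lower:.5f} or above {upper:.5f}")
```

Under those assumptions the limits land on the values quoted above, which is why the decision rule tightens as the standard error shrinks.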

That is how we are going to make our decision, but what if the new weld procedure in fact makes a difference? That is what the bottom graph shows: two RSDs of the means if the weld procedure increases or decreases the weld strength by the minimum amount Δ. We already know how we are making our decision: We conclude there was a change when we see an average in one of the red areas on the middle graph. But is it possible that there was a change of Δ even though we got an average in the blue area between our two critical values? You bet it is, and that is shown by the purple area on each of those bottom distributions. (You can see that the purple areas are lined up with the blue area on the middle graph.) That is the probability that, if the new welding procedure increases or decreases the weld strength by Δ, we miss it and say that there was no difference. This is known as a β or Type II error. In this case, if there is a difference of Δ, we run a 64.8-percent chance of missing it. That seems high if this is something that we want to notice to improve the process.

You can see the importance of each of these inputs into the sample size. Type I error determines where I make my decision as to whether there was a change or not, and choosing a smaller Δ makes it harder to see differences that are there because those two bottom distributions get closer together.

Given where we are, what can we do to increase our ability to see changes that are there? Well, we can change where we make our decision by increasing our α, say to 0.10, as shown in figure 4:

Figure 4: Increasing the ability to see change by increasing α to 0.10

That is still a lot of purple, though: about a 52.5-percent chance of missing a real change of ±Δ. And of course, we bought that reduction with an increased chance of saying that there is a difference when there isn't (the red areas are larger than before).
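Both purple areas quoted so far can be approximated with a few lines of code. This is a sketch under my own assumptions (normality, a current mean of 100 ksi, σ = 10 ksi, Δ = 5 ksi), and the function name `beta_error` is made up for illustration.

```python
from statistics import NormalDist

def beta_error(delta, sigma, n, alpha, mu0=100.0):
    """Chance that the sample mean lands between the two critical values
    when the true mean has actually shifted up by delta."""
    std_error = sigma / n ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = mu0 - z * std_error, mu0 + z * std_error
    # RSD of the means if the procedure really shifts the mean by delta.
    shifted_rsd = NormalDist(mu0 + delta, std_error)
    return shifted_rsd.cdf(upper) - shifted_rsd.cdf(lower)

beta_05 = beta_error(delta=5.0, sigma=10.0, n=10, alpha=0.05)
beta_10 = beta_error(delta=5.0, sigma=10.0, n=10, alpha=0.10)
print(f"alpha = 0.05: beta = {beta_05:.3f}")
print(f"alpha = 0.10: beta = {beta_10:.3f}")
```

The results should land within a fraction of a percentage point of the 64.8 and 52.5 percent quoted above. Any small gap is most likely down to rounding or to how the far tail is handled.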

We could change Δ, say to 20, but we should have had a really good reason to have selected 5 in the first place, so that is probably out. If we did, the BHDs showing the effects of the new procedure would move farther apart, and the RSDs would move with them (see figure 5):

Figure 5: BHDs showing the effects of the new procedure move farther apart and the RSDs move with them

We obviously decreased our β error (the purple is so tiny you can't even see it), but all this is saying is if there is an enormous difference, we can detect it. As the vernacular would have it, "Well, duhh!"

I guess we have to make those managers reexamine their sample size of 10. Otherwise, why even run the experiment if it only has a 50–50 chance of detecting the minimum change you want to detect?

As we increase the sample size, the width of those RSDs decreases, because you are dividing by the square root of the sample size. Because Δ isn't changing, the averages of the distributions stay the same. Let's take a look at sample sizes of 20, 30, and 40, shown in figure 6:

Figure 6: Sample sizes of 20, 30, and 40

As you can see, each time the sample size goes up, the RSDs get skinnier (if only it worked that way for diets). As the RSDs get skinnier, the purple areas of β error get smaller.
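To put rough numbers on "skinnier," the sketch below tabulates the standard error and the β error for each sample size. It assumes normality, σ = 10 ksi, Δ = 5 ksi, and α = 0.05. The σ value is my inference from the critical values quoted earlier, so treat it as an assumption.

```python
from statistics import NormalDist

# Assumed inputs, chosen to be consistent with the numbers quoted in this article.
sigma = 10.0   # standard deviation of individual welds, ksi (assumption)
delta = 5.0    # minimum shift worth detecting, ksi
alpha = 0.05   # two-sided Type I risk

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(1 - alpha / 2)

betas = {}
for n in (10, 20, 30, 40):
    std_error = sigma / n ** 0.5
    # Shift of the true mean, measured in standard errors of the mean.
    shift = delta / std_error
    # Area of the shifted RSD that still falls between the critical values.
    betas[n] = std_normal.cdf(z_crit - shift) - std_normal.cdf(-z_crit - shift)
    print(f"n = {n:2d}: standard error = {std_error:.2f} ksi, beta = {betas[n]:.3f}")
```

Working in standard-error units keeps the loop short: Δ stays fixed, but it counts for more standard errors as n grows, so less of the shifted RSD is left between the critical values.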

Instead of iterating to a sample size, we usually get the computer to do it for us by entering in α, β, Δ, and σ. Now that the managers understand β error, they meekly tell you that they would like no more than a β = 10-percent chance of that, please. If we continue our assumption of normality, the sample size required would be 42 (which, as it turns out, really is the answer to life, the universe, and everything—or at least our experiment).
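What the computer does with those four inputs is, in its simplest form, one line of arithmetic. The sketch below uses the usual normal-approximation formula for a two-sided test on a mean. The function name and the σ = 10 ksi input are my assumptions, and your software may use a slightly different method.

```python
from statistics import NormalDist

def required_sample_size(alpha, beta, delta, sigma):
    """Normal-approximation sample size for a two-sided test on a mean:
    n = ((z_(1 - alpha/2) + z_(1 - beta)) * sigma / delta) ** 2"""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    return ((z_alpha + z_beta) * sigma / delta) ** 2

n_raw = required_sample_size(alpha=0.05, beta=0.10, delta=5.0, sigma=10.0)
print(f"raw sample size: {n_raw:.2f}")
```

The raw value comes out just above 42, which matches the 42 quoted above to the nearest whole weld. A strictly conservative round-up would ask for 43.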

So we have seen how the RSD is the basis of sample size calculations, and I sneaked in how it is also the basis of hypothesis testing (making the decision to reject or fail to reject the null hypothesis). Next month I thought I'd take a look at that bit in the central limit theorem about the RSDs becoming more normal as the sample size increases, regardless of the distribution of the individuals.

RSDs are really something delightful, aren't they?

By the way, if you want to play with the spreadsheet I used to generate the RSD and the effect of sample size, you can download it here:

RSD and sample size work sheet