1.3 Basic Multivariate Parameters & Statistics

In thinking about statistical inference it is useful to distinguish between samples and populations. A population is the universe of objects in the ‘real world’ in which we are interested. These objects may be individuals, households, organizations, countries or practically anything we can define as belonging to a single taxonomic class. Because populations are often extremely large, or even infinite, it is usually impossible, for cost and practical reasons, to take measurements on every element of the population. For this reason, more often than not, we draw a sample and generalize from the properties of the sample to the broader population. In addition to the cost savings this entails, we are usually able to make more, and more detailed, observations of each element. When we do make observations on every element in the population, we are conducting a population census, and the issue of inference does not arise because we know from our data the true value of the population on the variable of interest (measurement error notwithstanding).

Making valid and reliable inferences from a sample to a population is a cornerstone of science, and there are many pitfalls that may crop up along the way. Because of such difficulties, we often hear students and researchers attempting to limit the claims they make for their analyses by stating something along the lines of: their results ‘apply only to the sample at hand and should be generalized to the broader population with caution’. Such claims should, in almost every instance, be viewed with scepticism, for we are hardly ever interested in the idiosyncratic characteristics of a particular sample. Even when this sort of statement is made, generalization to a broader population is almost always implicit in the conclusion being drawn.

In science we are almost always interested in making general statements from particular instances, i.e. drawing inferences from known facts to unknown facts. What statements like the above are really saying is ‘I know this sample is not very representative of the population but, if it were, this might be true’. Such statements, however, are little better than armchair conjecture, as they fail to link the postulated theory adequately with observations representative of the population of interest. Fortunately, though, if a sample is collected properly, it is possible to make valid and reliable generalizations to the broader population within known bounds of error. To do this, it is essential to understand the concept of the sampling distribution, as this is the key that allows us to link our specific sample with the broader population.

Next, we need to distinguish between a parameter and a statistic. A parameter is any characteristic of the population. A statistic, on the other hand, is a characteristic of the sample, and it is used to estimate the value of the parameter. Note that the value of a statistic changes from one sample to the next, which leads to the study of the sampling distribution of the statistic. When we draw a sample from a population, it is just one of many samples that might have been drawn and, therefore, observations made on any one sample are likely to be different from the ‘true value’ in the population (although some will be the same). Imagine we were to draw an infinite (or very large) number of samples of individuals, calculate a statistic, say the arithmetic mean, on each one of these samples, and then plot the mean value obtained from each sample on a histogram (a chart using bars to represent the number of times a particular value occurred). This would represent the sampling distribution of the arithmetic mean. Don’t worry about the practicalities of doing this, as we are only talking about a hypothetical set of possible samples that could, in theory, be drawn.
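The thought experiment above can be approximated on a computer. The following is a minimal sketch, assuming a hypothetical, artificially generated population of 10,000 values; the skewed (exponential) shape, the sample size of 100 and the 2,000 repetitions are illustrative choices and are not taken from the text. It draws repeated simple random samples, computes the mean of each, and prints a crude text histogram of those means.

```python
import random
import statistics

# Hypothetical population (an assumption for illustration): 10,000 skewed values.
population_rng = random.Random(42)
population = [population_rng.expovariate(1 / 50) for _ in range(10_000)]
population_mean = statistics.fmean(population)  # the parameter we want to estimate


def simulate_sample_means(population, sample_size, n_samples, seed=0):
    """Return the mean of each of n_samples simple random samples."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


def bin_counts(values, n_bins=12):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    bin_width = (high - low) / n_bins or 1.0  # guard against identical values
    counts = [0] * n_bins
    for value in values:
        index = min(int((value - low) / bin_width), n_bins - 1)
        counts[index] += 1
    return low, bin_width, counts


sample_means = simulate_sample_means(population, sample_size=100, n_samples=2_000)
low, bin_width, counts = bin_counts(sample_means)

print(f"population mean:      {population_mean:.2f}")
print(f"mean of sample means: {statistics.fmean(sample_means):.2f}")
for i, count in enumerate(counts):
    left_edge = low + i * bin_width
    print(f"{left_edge:7.2f} | {'#' * (count // 10)}")
```

Each sample mean is a statistic and the population mean is the parameter. Although the population here is skewed, the bars should pile up around the population mean, which is the behaviour the sampling distribution is meant to capture.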

Sampling distributions are useful because they allow us to make statements about how close our estimate from one particular sample is likely to be to the true population value. This is because, by knowing how frequently a mean like the one from our particular sample occurs in the sampling distribution, we can make statements about the probability of obtaining such an estimate of the population mean from a sample.
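As a rough illustration of turning frequencies into probability statements, the sketch below approximates a sampling distribution by simulation and then counts how often a sample mean lands within a chosen distance of the population mean. The population (20,000 roughly normal values with mean 170 and standard deviation 10), the sample size of 50 and the tolerance of 3 units are all hypothetical choices made for this example.

```python
import random
import statistics

# Hypothetical population (illustrative assumption): 20,000 roughly normal values.
population_rng = random.Random(1)
population = [population_rng.gauss(170, 10) for _ in range(20_000)]
population_mean = statistics.fmean(population)

# Approximate the sampling distribution of the mean for samples of size 50.
rng = random.Random(2)
sample_means = [
    statistics.fmean(rng.sample(population, 50)) for _ in range(2_000)
]

# Relative frequency of sample means within `tolerance` of the population mean.
tolerance = 3.0  # arbitrary illustrative choice
hits = sum(abs(mean - population_mean) <= tolerance for mean in sample_means)
proportion_close = hits / len(sample_means)

print(f"share of sample means within {tolerance} of the true mean: "
      f"{proportion_close:.3f}")
```

The printed share is an estimate of the probability that a single sample of this size gives a mean within 3 units of the true value. With these settings it should be high, since 3 units is roughly two standard deviations of the sample mean (10/√50 ≈ 1.41).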

But how do we know the sampling distribution of our statistic without drawing a huge (or infinite) number of samples each time we wish to use it? Fortunately, we don’t need to actually draw all the samples that would be necessary to physically plot sampling distributions, which would of course be completely impractical. This is because of known mathematical links between the statistics calculated on a sample and the sampling distribution of those statistics. If we draw a sufficiently large random sample, all the information necessary for drawing inferences to the population from which the sample was drawn is contained within the sample data. To understand why this is so, it is important to understand a number of additional, inter-related ideas. The first, and possibly the most important of these, is the concept and properties of the normal distribution.
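One well-known example of such a link, not spelled out in the text above, is the standard error of the mean: the spread of the sampling distribution of the mean can be estimated from a single sample as the sample standard deviation divided by the square root of the sample size. The sketch below, which again assumes a hypothetical simulated population (50,000 roughly normal values) and an illustrative sample size of 200, compares that single-sample estimate with the spread obtained by brute-force repeated sampling.

```python
import math
import random
import statistics

# Hypothetical population (illustrative assumption): 50,000 roughly normal values.
population_rng = random.Random(7)
population = [population_rng.gauss(100, 15) for _ in range(50_000)]

sample_size = 200
rng = random.Random(8)

# What we have in practice: a single random sample.
single_sample = rng.sample(population, sample_size)
estimated_se = statistics.stdev(single_sample) / math.sqrt(sample_size)

# What we normally cannot afford: many repeated samples.
sample_means = [
    statistics.fmean(rng.sample(population, sample_size)) for _ in range(2_000)
]
simulated_se = statistics.stdev(sample_means)

print(f"standard error estimated from one sample: {estimated_se:.3f}")
print(f"spread of 2,000 simulated sample means:   {simulated_se:.3f}")
```

The two printed figures should be close to one another, which illustrates why repeated sampling is unnecessary in practice: one reasonably large random sample already carries an estimate of how much the statistic would vary across samples.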