1.3 Basic Multivariate Parameters & Statistics
First, let us clarify the difference between a population and a sample, a
distinction that is central to statistical inference.
The population is the universe of objects in the ‘real world’ in which
we are interested. These objects may be individuals, households, organizations,
countries or practically anything we can define as belonging to a single taxonomic
class. Because populations are often extremely large, or even infinite, it is
usually impossible – for cost and practical reasons – to take measurements
on every element of the population. For this reason, more often than not, we
draw a sample and generalize from the properties of the sample to the broader
population. In addition to the cost savings this entails, we are usually able
to make more, and more detailed, observations of each element. When we do make
observations on every element in the population, we are conducting a population
census, and the issue of inference does not arise, as we will know from our
data the population’s true value on the variable of interest (measurement
error notwithstanding).
Making valid and reliable inferences from a sample to a population is a cornerstone
of science and there are many pitfalls that may crop up along the way in our
efforts to do this. Because of such difficulties, we often hear students and
researchers apparently attempting to limit the claims they make for their
analyses by saying something along the lines of: their results ‘apply only
to the sample at hand and should be generalized to the broader population
with caution’. Such claims should, in almost every instance, be viewed
with scepticism, for we are hardly ever interested in the idiosyncratic characteristics
of a particular sample. Even when this sort of statement is made, generalization
to a broader population is almost always implicit in the conclusion being drawn.
In science we are almost always interested in making general statements from
particular instances, i.e. drawing inferences from known facts to unknown facts.
What statements like the above are really saying is ‘I know this sample
is not very representative of the population but, if it were, this might be true’.
Such statements, however, are little better than armchair conjecture, as they
fail to adequately link the postulated theory with observations representative
of the population of interest. Fortunately, though, if a sample is collected
properly, it is possible to make valid and reliable generalizations to the broader
population within known bounds of error. To do this, it is essential to understand
the concept of the sampling distribution as this is the key that allows us to
link our specific sample with the broader population.
Next, we need to be able to distinguish between a parameter and a statistic.
A parameter is any characteristic of the population. A statistic, on the other
hand, is a characteristic calculated from a sample, and it is used to estimate
the value of the corresponding parameter. Note that the value of a statistic
changes from one sample to the next, which is what leads us to study the
sampling distribution of the statistic.
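As a minimal sketch of this distinction, the following Python/NumPy snippet
(using entirely made-up figures, not data from any real study) computes a
parameter on a hypothetical population and the corresponding statistic on two
different samples:

import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical 'population': incomes (in thousands) of 100,000 households.
population = rng.gamma(shape=2.0, scale=15.0, size=100_000)

# Parameter: the population mean, a fixed (and normally unknown) quantity.
print(f"population mean (parameter): {population.mean():.2f}")

# Statistics: the means of two independent random samples of 200 households.
sample_a = rng.choice(population, size=200, replace=False)
sample_b = rng.choice(population, size=200, replace=False)
print(f"sample A mean (statistic):   {sample_a.mean():.2f}")
print(f"sample B mean (statistic):   {sample_b.mean():.2f}")
# The two sample means differ from each other and from the parameter, which is
# what motivates studying the sampling distribution of the statistic.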
When we draw a sample from a population, it is just one of many samples that
might have been drawn and, therefore, observations made on any one sample are
likely to be different from the ‘true value’ in the population (although
some will be the same). Imagine we were to draw an infinite (or very large) number
of samples of individuals and calculate a statistic, say the arithmetic mean,
on each one of these samples and that we then plotted the mean value obtained
from each sample on a histogram (a chart using bars to represent the number of
times a particular value occurred). This would represent the sampling distribution
of the arithmetic mean. Don’t worry about the practicalities of doing this,
as we are only talking about a hypothetical set of possible samples that could,
in theory, be drawn.
Sampling distributions are useful because they allow us
to make statements about how likely our estimate from one particular sample
is to be the true population value. This is because, by knowing the frequency
with which a mean like the one from our particular sample occurs in the sampling
distribution, we can make statements about the probability of obtaining such an
estimate of the population mean.
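The following Python/NumPy sketch (again with made-up figures, and with ten
thousand samples standing in for the hypothetical infinite set) carries out a
small version of this thought experiment: it draws many samples, tabulates the
sample means in a crude histogram, and then reads a probability statement off
the simulated sampling distribution:

import numpy as np

rng = np.random.default_rng(seed=2)
population = rng.gamma(shape=2.0, scale=15.0, size=100_000)
true_mean = population.mean()

# Draw 10,000 samples of 200 elements each and record the mean of every sample.
sample_means = np.array([
    rng.choice(population, size=200, replace=False).mean()
    for _ in range(10_000)
])

# A crude text histogram of the simulated sampling distribution of the mean.
counts, edges = np.histogram(sample_means, bins=15)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:5.1f}-{right:5.1f} | {'#' * int(count // 50)}")

# A probability statement read off the simulated distribution (the threshold of
# 2 is an arbitrary illustrative choice).
within_2 = np.mean(np.abs(sample_means - true_mean) <= 2.0)
print(f"proportion of sample means within 2 of the population mean: {within_2:.3f}")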
But how do we know the sampling distribution of our statistic without drawing
a huge (or infinite) number of samples each time we wish to use it? Fortunately,
we don’t need to actually draw all the samples that would be necessary
to physically plot sampling distributions, which would of course be completely
impractical, because of known mathematical links between the statistics
calculated from a sample and their sampling distributions. If we draw a sufficiently
large random sample, all the information necessary for drawing inferences to
the population from which the sample was drawn is contained within the sample
data. To understand why this is so, it is important to understand a number of
additional, inter-related ideas. The first, and possibly the most important of
these, is the concept and properties of the normal distribution.
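As a pointer to what follows, one such ‘known mathematical link’ for the
arithmetic mean is its standard error: for independent draws, the standard
deviation of the sampling distribution of the mean equals the population
standard deviation divided by the square root of the sample size. The
Python/NumPy sketch below (made-up figures) compares this formula with a
simulation:

import numpy as np

rng = np.random.default_rng(seed=3)

sigma, n = 30.0, 200                  # population SD and sample size (made up)
# 10,000 samples of n independent draws each; one mean per row.
sample_means = rng.normal(loc=100.0, scale=sigma, size=(10_000, n)).mean(axis=1)

print(f"theoretical standard error (sigma / sqrt(n)): {sigma / np.sqrt(n):.3f}")
print(f"simulated SD of the sample means:             {sample_means.std(ddof=1):.3f}")
# The two values agree closely, and a histogram of sample_means looks
# approximately normal - the topic taken up next.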