Introduction and Review of Basic Concepts
Multivariate statistical analysis, like univariate statistical analysis, can be
divided into two main stages: exploratory/descriptive and confirmatory/inferential.
In the first stage, one looks for patterns and explores relationships in order
to recognize and formulate problems. This stage is also called “data mining”.
In the second stage, one tests statistical hypotheses about the observations
using models that aim to create a simplified version of reality. We
will follow the new fashion and call this stage “data crafting”.
The aim of this course is to move students’ understanding beyond simple
descriptive statistics and measures of bivariate association, to more sophisticated
analyses of the simultaneous inter-relationships between several variables.
This enables us to distinguish between multiple potential causes of particular
phenomena or events.
The ability to examine the relationships between multiple variables at the
same time is very important, as it enables us to separate what we might think
of as truly causal from merely correlational relationships. We often hear social
scientists saying that some characteristic (say, coming from a single parent
family) has an effect on some future outcome (say, adolescent delinquency)
controlling for some other characteristic or set of characteristics (say, household
income or parental employment status). Bivariate analysis (in which we obtain
a measure of the relationship between just two variables) does not allow us
to make such statements.
The ‘magic’ of multivariate analysis is that such comparative statements
about the relative importance of effects between multiple possible causal mechanisms
can be made – albeit with a known degree of uncertainty. This course
is about how to perform this type of analysis. The emphasis, however, will
be less on demystifying the ‘magic’ process than on teaching people
how to actually do the tricks.
This is not to downplay the importance of understanding the mathematical and
statistical assumptions upon which these techniques are grounded. True confidence
in one’s analyses and the ability to move beyond simple rote learning
come only with a fuller understanding of these areas. However, as social scientists
often struggle with a more ‘formulae based’ approach, it is probably
better to equip students with some basic ‘technical’ skills upon
which they can later elaborate, than to scare them off for good with too much
matrix algebra and non-linear logarithmic transformations at an early stage!
Wherever possible, verbal descriptions of the assumptions underlying the statistical
techniques covered in the course have been employed.
Having said this, it should be acknowledged that (a) it is not always possible
or helpful to omit formulae entirely and (b) verbal explanations of mathematical
formulae can become very wordy! Students wishing to obtain a more mathematically
grounded understanding of the techniques covered in the course will find plenty
to keep them busy in the accompanying reading.
1.2 Types of Data
Multivariate data consists of a series of measurements or observations on each
of a number of individuals. The variables measured may be distinct characteristics,
or repeated measures of the same characteristic over time. A combination of
the two leads to doubly multivariate data.
The type of model or statistical test we choose to analyze our data with will
depend upon the level at which the data is measured. Data measurement is
traditionally divided into four possible levels.
Nominal data has no order, and the assignment of numbers to
categories is purely arbitrary (e.g. 1=Male, 2=Female). Because of this lack of
order, it is not possible to perform arithmetic operations (+, -, /, *) or
ordering comparisons (>, <) on nominal data; only tests of equality are
meaningful. Nominal data is also commonly referred to as categorical, as
it assigns observations to qualitative categories. If there are only two categories
we will refer to the data as binomial (dichotomous).
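To make this concrete, here is a minimal sketch in Python (the 1/2 coding is the hypothetical example from above, and the data are invented); the only sensible numerical summary of a nominal variable is a frequency count:

```python
import pandas as pd

# Hypothetical nominal variable: 1 = Male, 2 = Female (the codes are arbitrary labels)
sex = pd.Series([1, 2, 2, 1, 2]).astype("category")

# An appropriate summary for nominal data: frequency counts per category
print(sex.value_counts())

# The arithmetic mean of the codes (here 1.6) would be meaningless:
# the numbers are labels, not quantities.
```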
Ordinal data has quantitative order, but the intervals between scale points
are unequal. For example, although we can order all the football teams in the
Premiership in terms of their level of ability (best done by looking at the
League table rather than through rational debate!), the interval distance from
the top team to the second-highest team may be great, but the interval from
the second-ranked team to the third-ranked team may be very small. Because
of this lack of equal distances, arithmetic operations are inappropriate with ordinal
data, which is restricted to logical operations (more than, less than, equal
to). For instance, given a team finishing 10th and a team finishing 5th in
a league of 20, where 1 is top of the table, we cannot divide 10 by 5 to
conclude that the second team has twice the achievement of the first (although
statistically unsophisticated sports fans often attempt such operations!).
However, one can say that the team finishing 5th represents more achievement
than the team finishing 10th.
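A minimal sketch, using two invented teams, of what is and is not allowed with ordinal data:

```python
# Hypothetical final league positions (1 = top of the table)
position = {"Team A": 5, "Team B": 10}

# Logical comparisons are valid for ordinal data:
print(position["Team A"] < position["Team B"])   # True: Team A finished higher

# Ratios are not meaningful: 10 / 5 == 2 does not mean Team A achieved
# "twice as much" as Team B, because the intervals between ranks are unequal.
print(position["Team B"] / position["Team A"])   # 2.0, but uninterpretable
```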
Interval data has quantitative order and equal intervals. Counts, such as years
of education or number of votes, are interval. Ratio data
are interval data which also have a true zero point. The Celsius temperature
scale is not ratio because zero degrees is not "no temperature" but
income is ratio because zero dollars is truly "no income". For most
statistical procedures the distinction between interval and ratio does not
matter and it is common to use the term "interval" to refer to ratio
data as well. Occasionally, however, the distinction between interval and ratio
becomes important. With interval data, one can perform logical operations,
add, and subtract, but one cannot multiply or divide. For instance, if a liquid
is 40 degrees and we add 10 degrees, it will be 50 degrees. However, a liquid
at 40 degrees does not have twice the temperature of a liquid at 20 degrees
because 0 degrees does not represent "no temperature" -- to multiply
or divide in this way we would have to be using the Kelvin temperature scale,
with a true zero point (0 kelvin = -273.15 degrees Celsius).
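The temperature example can be worked through directly; the conversion to kelvin is what licenses the ratio statement:

```python
# Interval scale: differences are meaningful, ratios are not.
c_hot, c_cool = 40.0, 20.0
print(c_hot - c_cool)          # 20.0 -- a valid interval-scale statement

# For a meaningful ratio we must move to a scale with a true zero point:
k_hot, k_cool = c_hot + 273.15, c_cool + 273.15   # Celsius -> kelvin
print(k_hot / k_cool)          # about 1.07, not 2.0
```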
The reason that level of measurement is important in determining the particular
technique to use is that virtually all statistical tests and procedures are
based on a range of assumptions about the distributional properties of the
data. While this can sometimes seem a bit obscure, imagine trying to describe,
in some statistical summary fashion, the names of the people in this class.
It wouldn’t make much sense to take the arithmetic mean (adding all the
names up and dividing by the number of people in the class) as we might if
we were trying to summarize a different characteristic of this population,
say age. While this simple example makes obvious intuitive sense, in statistical
terms it is because the arithmetic mean requires data measured at the interval
or ratio level. Categorical or nominal data is inappropriate for estimating
an arithmetic mean.
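The same point in code, with invented ages and names:

```python
import statistics

# Hypothetical ages: interval/ratio data, so the arithmetic mean is well defined
ages = [19, 21, 20, 23, 22]
print(statistics.mean(ages))   # 21

# Names are nominal data; "adding them up and dividing" is undefined,
# and the call below would raise a TypeError if uncommented.
names = ["Ana", "Ben", "Cem", "Dee"]
# statistics.mean(names)
```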
Which Measures go with which Scales?
For measures of the bivariate relationship between two variables, the following
statistical procedures would be appropriate (though not uniquely); brief sketches
of each appear after the list:
• two ordinal/interval variables – correlation, e.g. Pearson correlation.
• one nominal and one ordinal/interval variable – comparison of means with a t-test.
• two nominal variables – crosstabulation and Pearson chi-square.
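A minimal sketch of all three procedures in Python, using scipy.stats on invented data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)             # invented, illustrative data
income = rng.normal(30_000, 5_000, 100)    # interval
years_ed = rng.normal(13, 2, 100)          # treated as interval
sex = rng.integers(0, 2, 100)              # nominal: two categories
region = rng.integers(0, 3, 100)           # nominal: three categories

# Two interval variables: Pearson correlation
r, p = stats.pearsonr(income, years_ed)

# One nominal, one interval variable: comparison of means with a t-test
t, p = stats.ttest_ind(income[sex == 0], income[sex == 1])

# Two nominal variables: crosstabulation, then Pearson chi-square
table = np.array([[np.sum((sex == i) & (region == j)) for j in range(3)]
                  for i in range(2)])
chi2, p, dof, expected = stats.chi2_contingency(table)
```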
For common multivariate procedures, including those covered in this course,
the following applies (a short sketch of the regression case, with dummy coding,
follows the list):
• Multiple linear regression – ordinal/interval dependent variable; ordinal/interval
or nominal independent variables, although nominal variables need to be recoded
to dummy variables.
• Factor analysis – ordinal/interval variables, although ordinal or dichotomous
variables can be used in some more advanced procedures. Note that there are no
dependent or independent variables in factor analysis.
• Logistic regression – nominal/ordinal (dichotomous) dependent variable;
ordinal/interval or nominal independent variables, although nominal variables
need to be recoded to dummy variables.
• Loglinear modeling – all variables should be measured at the nominal/ordinal
level. This is not to say that interval variables cannot be used, but they tend
to have too many categories for practical purposes. The model does not require
variables to be specified as dependent or independent, although thinking of
the model in these terms often makes sense (unlike in factor analysis, where
it does not).
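Here is a minimal sketch of the multiple linear regression case in Python, showing the dummy-variable recoding of a nominal predictor (the data and variable names are invented):

```python
import numpy as np
import pandas as pd

# Invented data: interval outcome, one interval and one nominal predictor
df = pd.DataFrame({
    "income":   [25, 32, 40, 28, 51, 45],          # dependent, interval
    "years_ed": [12, 14, 16, 12, 18, 16],          # independent, interval
    "region":   ["north", "south", "west",         # independent, nominal
                 "north", "west", "south"],
})

# Nominal predictors must first be recoded to 0/1 dummy variables,
# dropping one category as the reference.
dummies = pd.get_dummies(df["region"], drop_first=True, dtype=float)
X = np.column_stack([np.ones(len(df)), df["years_ed"], dummies])

# Ordinary least squares fit of the multiple linear regression
coef, *_ = np.linalg.lstsq(X, df["income"], rcond=None)
print(coef)   # intercept, years_ed slope, and one effect per dummy
```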
What is a Model?
There are conceptual models, logical models, mathematical models, statistical
models, physical models, biological models and even business models.
In the field of statistics, a model is a simplified version of reality, specified
in mathematical and/or graphical form, in which some unknown characteristics
(or parameters) of the target population are estimated. Models can be either
extremely simple or extremely complex in terms of the number and nature of
the parameters estimated. They are, however, always a simplification of the
real-world processes they seek to represent.
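As a concrete illustration, here is perhaps the simplest possible statistical model, sketched in Python with invented data: a single parameter (the population mean) estimated from a sample:

```python
import numpy as np

# The model: y_i = mu + error_i, with one unknown parameter mu
rng = np.random.default_rng(1)
sample = rng.normal(loc=170, scale=10, size=50)   # invented heights in cm

mu_hat = sample.mean()   # the estimate of the model's single parameter
print(round(mu_hat, 1))  # close to, but not exactly, the "true" 170
```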