Introduction and Review of Basic Concepts

Multivariate statistical analysis, like univariate statistical analysis, can be divided into two main stages: exploratory/descriptive and confirmatory/inferential. In the first stage, one looks for patterns and explores relationships in order to recognize and formulate problems. This stage is also called “data mining”. In the second stage, one tests statistical hypotheses about the observations using models that aim to create a simplified version of reality. We will follow the new fashion and call this stage “data crafting”.

The aim of this course is to move students’ understanding beyond simple descriptive statistics and measures of bivariate association, towards more sophisticated analyses of the simultaneous inter-relationships between several variables. This enables us to distinguish between multiple potential causes of particular phenomena or events.

The ability to examine the relationships between multiple variables at the same time is very important, as it enables us to separate relationships that are truly causal from those that are merely correlational. We often hear social scientists saying that some characteristic (say, coming from a single-parent family) has an effect on some future outcome (say, adolescent delinquency) controlling for some other characteristic or set of characteristics (say, household income or parental employment status). Bivariate analysis (where we obtain a measure of the relationship between just two variables) does not allow us to make such statements.

The ‘magic’ of multivariate analysis is that such comparative statements about the relative importance of effects between multiple possible causal mechanisms can be made – albeit with a known degree of uncertainty. This course is about how to perform this type of analysis. The emphasis, however, will be less on demystifying the ‘magic’ process than on teaching people how to actually do the tricks.

This is not to downplay the importance of understanding the mathematical and statistical assumptions upon which these techniques are grounded. True confidence in one’s analyses and the ability to move beyond simple rote learning come only with a fuller understanding of these areas. However, as social scientists often struggle with a more ‘formula-based’ approach, it is probably better to equip students with some basic ‘technical’ skills upon which they can later elaborate than to scare them off for good with too much matrix algebra and non-linear logarithmic transformations at an early stage! Wherever possible, verbal descriptions of the assumptions underlying the statistical techniques covered in the course have been employed.

Having said this, it should be acknowledged that (a) it is not always possible or helpful to omit formulae entirely and (b) verbal explanations of mathematical formulae can become very wordy! Students wishing to obtain a more mathematically grounded understanding of the techniques covered in the course will find plenty to keep them busy in the accompanying reading.

1.2 Types of Data

Multivariate data consists of a series of measurements or observations on each of a number of individuals. The variables measured may be distinct characteristics, or repeated measures of the same characteristic over time. A combination of the two leads to doubly multivariate data.

The type of model or statistical test we choose to analyze our data will depend upon the level at which the data is measured. Measurement is traditionally characterized as falling into one of four possible levels.

Nominal data has no order, and the assignment of numbers to categories is purely arbitrary (e.g. 1 = Male, 2 = Female). Because of this lack of order, it is not possible to perform arithmetic operations (+, -, /, *) or order comparisons (>, <) on nominal data; only tests of equality (=) are meaningful. Nominal data is also commonly referred to as categorical, as it assigns observations to qualitative categories. If there are only two categories we will refer to the data as binomial.

Ordinal data has quantitative order, but the intervals between scale points are unequal. For example, although we can order all the football teams in the Premiership in terms of their level of ability (best done by looking at the League table rather than through rational debate!), the distance from the top team to the second-ranked team may be great, while the distance from the second-ranked team to the third-ranked team may be very small. Because the distances are unequal, arithmetic operations are inappropriate with ordinal data, which is restricted to logical operations (more than, less than, equal to). For instance, given a team finishing 10th and a team finishing 5th in a league of 20, where 1 is top of the table, we cannot divide 10 by 5 to conclude that the 5th-placed team has twice the achievement of the 10th-placed team (although statistically unsophisticated sports fans often attempt such operations!). We can, however, say that the 5th-placed team represents more achievement than the 10th-placed team.
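To make the point concrete, here is a minimal Python sketch, with made-up league positions, showing that order comparisons on ordinal data are meaningful while ratios are not:

```python
# Ordinal data: league positions, where 1 = top of the table.
team_a_position = 10
team_b_position = 5

# Logical comparison is meaningful: a smaller position number
# indicates higher achievement, so team B outranks team A.
print(team_b_position < team_a_position)  # True

# Division is arithmetically possible but statistically meaningless:
# 10 / 5 == 2 does not mean team B achieved "twice as much", because
# the gaps between adjacent league positions are not equal.
print(team_a_position / team_b_position)  # 2.0 -- not a meaningful ratio
```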

Interval data has quantitative order and equal intervals. Counts are interval: for example, income in dollars, years of education, or number of votes. Ratio data is interval data which also has a true zero point. The Celsius temperature scale is not ratio because zero degrees is not "no temperature", but income is ratio because zero dollars is truly "no income". For most statistical procedures the distinction between interval and ratio does not matter, and it is common to use the term "interval" to refer to ratio data as well. Occasionally, however, the distinction between interval and ratio becomes important. With interval data one can perform logical operations, add, and subtract, but one cannot meaningfully multiply or divide. For instance, if a liquid is 40 degrees and we add 10 degrees, it will be 50 degrees. However, a liquid at 40 degrees does not have twice the temperature of a liquid at 20 degrees, because 0 degrees Celsius does not represent "no temperature" -- to multiply or divide in this way we would have to use the Kelvin temperature scale, which has a true zero point (0 K = -273.15 degrees Celsius).
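The temperature example can be worked through directly. The sketch below (plain Python, no external libraries) contrasts a valid interval operation, adding degrees, with a ratio statement that only becomes meaningful after converting to the Kelvin scale:

```python
# Interval scale (Celsius): differences are meaningful, ratios are not.
liquid_c = 40.0
warmer_c = liquid_c + 10.0   # adding 10 degrees gives 50 degrees -- valid
print(warmer_c)              # 50.0

# 40 / 20 == 2.0, but 40 C is not "twice as hot" as 20 C,
# because 0 C is not "no temperature".
print(40.0 / 20.0)           # 2.0 -- meaningless on the Celsius scale

# Ratios require a true zero point, i.e. the Kelvin scale (0 K = -273.15 C).
def celsius_to_kelvin(c):
    return c + 273.15

print(celsius_to_kelvin(40.0) / celsius_to_kelvin(20.0))  # ~1.07, the meaningful ratio
```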

The reason that level of measurement is important in determining the particular technique to use is that virtually all statistical tests and procedures are based on a range of assumptions about the distributional properties of the data. While this can sometimes seem a bit obscure, imagine trying to describe, in some statistical summary fashion, the names of the people in this class. It wouldn’t make much sense to take the arithmetic mean (adding all the names up and dividing by the number of people in the class), as we might if we were trying to summarize a different characteristic of this population, say age. While this simple example makes obvious intuitive sense, in statistical terms it is because the arithmetic mean requires data measured at the interval or ratio level; categorical or nominal data is inappropriate for estimating an arithmetic mean.
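As a quick illustration, the following sketch (using Python's standard statistics module, with invented values) shows the mean behaving sensibly for interval/ratio data such as age, and producing nonsense for arbitrary nominal codes, for which the mode is the appropriate summary:

```python
import statistics

# Interval/ratio data: the arithmetic mean is a sensible summary.
ages = [19, 21, 22, 20, 23, 21]
print(statistics.mean(ages))       # 21

# Nominal data: arbitrary numeric codes (e.g. 1 = Male, 2 = Female).
# A mean of these codes is computable but meaningless; the mode
# (the most frequent category) is the appropriate summary.
sex_codes = [1, 2, 2, 1, 2, 2]
print(statistics.mean(sex_codes))  # 1.666... -- not an interpretable value
print(statistics.mode(sex_codes))  # 2 -- the modal category
```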

Which Measures go with which Scales?
For measures of the bivariate relationship between two variables, the following statistical procedures would be appropriate (though not uniquely); a short worked illustration follows the list:

• two ordinal/interval variables – correlation, e.g. the Pearson correlation.
• one nominal and one ordinal/interval variable – comparison of means with a t-test.
• two nominal variables – crosstabulation and the Pearson chi-square.
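For readers who want to try these out, here is a minimal sketch using scipy.stats; the data are simulated and all variable names and values are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two interval variables: Pearson correlation.
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)
r, p_r = stats.pearsonr(x, y)

# One nominal (two-group) and one interval variable: independent-samples t-test.
group_a = rng.normal(loc=0.0, size=50)
group_b = rng.normal(loc=0.5, size=50)
t, p_t = stats.ttest_ind(group_a, group_b)

# Two nominal variables: a 2x2 crosstabulation and the Pearson chi-square.
table = np.array([[30, 20],    # e.g. rows = sex, columns = vote choice
                  [15, 35]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

print(f"Pearson r = {r:.2f}, t = {t:.2f}, chi-square = {chi2:.2f}")
```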

For common multivariate procedures, including those covered in this course, the following applies:

• Multiple linear regression – ordinal/interval dependent variable; ordinal/interval or nominal independent variables, although nominal variables need to be recoded as dummy variables (see the sketch after this list).
• Factor analysis – ordinal/interval variables, although some more advanced procedures can accommodate ordinal or dichotomous variables in some cases. Note that there are no dependent or independent variables in factor analysis.
• Logistic regression – dichotomous (nominal/ordinal) dependent variable; ordinal/interval or nominal independent variables, although nominal variables need to be recoded as dummy variables.
• Loglinear modeling – all variables should be measured at the nominal/ordinal level. This is not to say that interval variables cannot be used, but they tend to have too many categories for practical purposes. The model does not require variables to be specified as dependent or independent, although, unlike in factor analysis, thinking of the model in these terms often makes sense.
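Since both multiple linear regression and logistic regression require nominal predictors to be recoded, a brief sketch of dummy coding may help. It uses pandas.get_dummies on an invented three-category variable; the variable and column names are hypothetical:

```python
import pandas as pd

# A nominal variable with three categories (invented data).
df = pd.DataFrame({"region": ["north", "south", "west", "south", "north"]})

# Recode to 0/1 dummy variables, dropping one category to serve as the
# reference level -- standard practice before entering a nominal
# predictor into a regression model.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True, dtype=int)
print(dummies)
#    region_south  region_west
# 0             0            0   <- "north" rows become the reference category
# 1             1            0
# ...
```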

What is a Model?

There are conceptual models, logical models, mathematical models, statistical models, physical models, biological models and even business models.

In the field of statistics, a model is a simplified version of reality, specified in mathematical and/or graphical form, in which some unknown characteristics (or parameters) of the target population are estimated. Models can be either extremely simple or extremely complex in terms of the number and nature of the parameters estimated. They are, however, always a simplification of the real-world processes they seek to represent.
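To ground the definition, here is a minimal sketch of perhaps the simplest statistical model: a straight line with random error. The "true" parameter values are invented for the simulation; the point is that the model is a deliberate simplification whose unknown parameters we estimate from data:

```python
import numpy as np

rng = np.random.default_rng(1)

# A very simple statistical model: y = b0 + b1 * x + error.
# The unknown population characteristics are the parameters b0 and b1;
# everything else about the real-world process is swept into the error term.
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.8 * x + rng.normal(scale=1.0, size=200)  # simulated "reality"

# Estimate the parameters by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
b0_hat, b1_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"estimated intercept = {b0_hat:.2f}, estimated slope = {b1_hat:.2f}")
# The estimates should land close to the "true" values 2.0 and 0.8.
```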