RStudio: Inferential Statistics

Type of Data: Quantitative (Numerical)

  • Z-Procedure: σ Known
  • t-Procedure: σ Unknown

PACKAGE: BSDA

GENERAL FORM OF R COMMAND:

z.test(dataframename$variablename, alternative = " ", mu = munull, sigma.x = knownstandarddeviation, conf.level = 0.95)

alternative could be equal to "greater", "less", or "two.sided".

EXAMPLE:

Dataset:

We are asked to analyze the average height of individuals of Italian nationality. The variance of the Italian population is known to be 5. Here is the data:

175, 168, 168, 190, 156, 181, 182, 175, 174, 179

We will test the hypothesis that the true mean height is different from 170, and construct a confidence interval.

Since the data set is small we will enter the data directly into R without using a data frame.

a = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)

library(BSDA)

z.test(a, alternative = "two.sided", mu = 170, sigma.x = 2.24, conf.level = 0.95)

Note that we have been given the variance, so you need to take the square root of 5 to get the standard deviation (about 2.24).
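
As a cross-check, the same z-procedure can be sketched by hand in base R, without the BSDA package. This is only an illustration of the arithmetic behind z.test; the names sigma, se, z, p_value and ci are ours and are not part of BSDA. Using sqrt(5) instead of the rounded 2.24 avoids a small rounding difference.

```r
# Heights from the example above
a <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)

sigma <- sqrt(5)             # known population standard deviation (variance = 5)
n     <- length(a)
se    <- sigma / sqrt(n)     # standard error of the sample mean

z       <- (mean(a) - 170) / se                     # test statistic for H0: mu = 170
p_value <- 2 * pnorm(-abs(z))                       # two-sided p-value
ci      <- mean(a) + c(-1, 1) * qnorm(0.975) * se   # 95% confidence interval

z
p_value
ci
```

For a different confidence level, only the quantile changes: qnorm(1 - (1 - level) / 2).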

EXERCISE: Construct a 99% confidence interval for this problem.

GENERAL FORM OF R COMMAND:

t.test(dataframename$variablename, alternative = " ", mu = munull, conf.level = 0.95)

alternative could be equal to "greater", "less", or "two.sided".

EXAMPLE:

Dataset:

We are asked to analyze the average height of individuals of Italian nationality. Here is the data:

175, 168, 168, 190, 156, 181, 182, 175, 174, 179

We will test the hypothesis that the true mean height is different from 170, and construct a confidence interval.

Since the data set is small we will enter the data directly into R without using a data frame.

a = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)

t.test(a, alternative = "two.sided", mu = 170, conf.level = 0.95)
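
The result of t.test can also be stored and its pieces extracted one at a time. The sketch below assumes nothing beyond base R; res and t_manual are illustrative names. The last line recomputes the t statistic by hand so that you can see where the number comes from.

```r
a <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)

res <- t.test(a, alternative = "two.sided", mu = 170, conf.level = 0.95)

res$statistic   # t statistic
res$parameter   # degrees of freedom (n - 1)
res$p.value     # p-value
res$conf.int    # confidence interval for the mean

# The same t statistic computed by hand: (sample mean - mu) / (s / sqrt(n))
t_manual <- (mean(a) - 170) / (sd(a) / sqrt(length(a)))
t_manual
```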

EXERCISE: Test the hypothesis that mean height is more than 170.

 



Type of Data: Quantitative (Numerical)

GENERAL FORM OF R COMMAND:

t.test(dataframename$variablename1, dataframename$variablename2, paired = TRUE, alternative = " ", mu = munull, conf.level = 0.95)

alternative could be equal to "greater", "less", or "two.sided".

EXAMPLE:

Dataset:

In this example we will examine the effect of a new training program on 100-meter runners. The following data give the times in seconds before and after training for 10 randomly selected athletes.

Before: 12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3

After: 12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1

We will test the hypothesis that the true mean performances before and after training are different and construct a confidence interval for the difference between mean performances.

Since the data set is small we will enter the data directly into R without using a data frame.

Before = c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)

After = c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1)

t.test(Before, After, paired = TRUE, alternative = "two.sided", mu = 0, conf.level = 0.95)
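
A paired t-test is the same as a one-sample t-test on the differences. The following sketch illustrates this with the running times above; d, paired and one_sample are illustrative names.

```r
Before <- c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)
After  <- c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1)

# Paired test on the two columns
paired <- t.test(Before, After, paired = TRUE, alternative = "two.sided", mu = 0)

# One-sample test on the within-athlete differences
d <- Before - After
one_sample <- t.test(d, alternative = "two.sided", mu = 0)

paired$statistic
one_sample$statistic
paired$conf.int
one_sample$conf.int
```

Both calls report the same t statistic, p-value and confidence interval.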

EXERCISE: Construct a 90% confidence interval for the difference between means.

 

 

Type of Data: Quantitative (Numerical)

  • Z-Procedure: σ1 and σ2 Known
  • t-Procedure: σ1 and σ2 Unknown

PACKAGE: BSDA

GENERAL FORM OF R COMMAND:

z.test(dataframename$variablename1, dataframename$variablename2, alternative = " ", mu = munull, sigma.x = knownstandarddeviation1, sigma.y = knownstandarddeviation2, conf.level = 0.95)

alternative could be equal to "greater", "less", or "two.sided".

EXAMPLE:

Dataset:

We are asked to compare the mean heights of individuals of Italian nationality and German nationality. The population variances for the Italian and German nationalities are known to be 5 and 8.5, respectively. Here is the data:

Italian: 175, 168, 168, 190, 156, 181, 182, 175, 174, 179

German: 185, 169, 173, 173, 188, 186, 175, 174, 179, 180

We will test the hypothesis that the true mean heights for these groups are different and construct a confidence interval for the difference between mean heights.

Since the data set is small we will enter the data directly into R without using a data frame.

Italian = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)

German = c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

z.test(Italian, German, alternative = "two.sided", mu = 0, sigma.x = 2.24, sigma.y = 2.92, conf.level = 0.95)
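
The values 2.24 and 2.92 are the rounded square roots of the known variances 5 and 8.5. As a rough check, the two-sample z statistic can be computed by hand in base R. This is a sketch of the arithmetic only, not of how BSDA implements z.test; the names var_x, var_y, se, z, p_value and ci are ours.

```r
Italian <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
German  <- c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

var_x <- 5      # known population variance, Italian
var_y <- 8.5    # known population variance, German

# Standard error of the difference between the two sample means
se <- sqrt(var_x / length(Italian) + var_y / length(German))

diff_means <- mean(Italian) - mean(German)
z       <- (diff_means - 0) / se                       # H0: difference = 0
p_value <- 2 * pnorm(-abs(z))                          # two-sided p-value
ci      <- diff_means + c(-1, 1) * qnorm(0.975) * se   # 95% confidence interval

z
p_value
ci
```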

EXERCISE: Construct a 90% confidence interval for the difference between means.

GENERAL FORM OF R COMMAND:

t.test(dataframename$variablename1, dataframename$variablename2, alternative = " ", mu = munull, var.equal = TRUE/FALSE, conf.level = 0.95)

alternative could be equal to "greater", "less", or "two.sided".

If you assume that the population variances are equal, use var.equal = TRUE; otherwise use var.equal = FALSE.

EXAMPLE:

Dataset:

We are asked to compare the mean heights of individuals of Italian nationality and German nationality. Here is the data:

Italian: 175, 168, 168, 190, 156, 181, 182, 175, 174, 179

German: 185, 169, 173, 173, 188, 186, 175, 174, 179, 180

We will test the hypothesis that the true mean heights for these groups are different and construct a confidence interval for the difference between mean heights.

Since the data set is small we will enter the data directly into R without using a data frame.

Italian = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)

German = c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

t.test(Italian, German, alternative = "two.sided", mu = 0, var.equal = FALSE, conf.level = 0.95)
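
To see what var.equal changes, the sketch below runs both versions on the same data and compares the degrees of freedom. The object names welch and pooled are illustrative. Because the two samples here have the same size, the two t statistics coincide and only the degrees of freedom (and therefore the p-value and interval) differ.

```r
Italian <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
German  <- c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

# Sample variances, as an informal look at the equal-variance assumption
var(Italian)
var(German)

welch  <- t.test(Italian, German, var.equal = FALSE)  # Welch: variances not assumed equal
pooled <- t.test(Italian, German, var.equal = TRUE)   # pooled: variances assumed equal

welch$parameter    # Welch-Satterthwaite degrees of freedom (not an integer)
pooled$parameter   # n1 + n2 - 2 degrees of freedom

welch$conf.int
pooled$conf.int
```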

EXERCISE: Construct a 90% confidence interval for the difference between means.

 


Type of Data: Quantitative (Numerical)

DATA FORMAT: Two columns; response variable (numerical) and treatments (factors) as follows

Response Treatment
   
   
   

GENERAL FORM OF R COMMAND:

aov(response ~ treatment, data = dataframename)

EXAMPLE:

Dataset: donut.csv

Twenty-four batches of donuts were prepared and six were randomly assigned to each of the four fats. The amount of fat absorbed by each batch (in grams) was measured. Upload and load the data into RStudio. Here is the data:

Fat1  Fat2  Fat3  Fat4
164   178   175   155
172   191   193   166
168   197   178   149
177   182   171   164
156   185   163   170
195   177   176   168

Note that the data is not in the two-column format (response and treatment) that aov needs, so we will stack the data.

donut <- read.csv(file.choose())

sdonut = stack(donut)

This will create a data frame with two variables named values (the response) and ind (the treatment).
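
If you do not have the file at hand, the same stacking step can be tried by typing the donut table in directly. This is a sketch that stands in for read.csv(file.choose()); it assumes the file has the four columns Fat1 to Fat4 shown above.

```r
# The donut table entered column by column (stand-in for reading donut.csv)
donut <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)

# stack() turns the four columns into one response column and one factor column
sdonut <- stack(donut)

str(sdonut)    # 24 rows: values (numeric) and ind (factor with 4 levels)
head(sdonut)
```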

sdonut

summary(donut)

attach(sdonut)

boxplot(values ~ ind)

aov(values ~ ind, data = sdonut)

To see the ANOVA table

donut.aov = aov(values ~ ind, data = sdonut)

summary(donut.aov)
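
The ANOVA table returned by summary can also be stored so that the F statistic and p-value can be pulled out by name. The sketch below rebuilds the stacked donut data directly so that it runs on its own; tab, f_value and p_value are illustrative names.

```r
# Stacked donut data: one response column and one treatment factor
sdonut <- data.frame(
  values = c(164, 172, 168, 177, 156, 195,   # Fat1
             178, 191, 197, 182, 185, 177,   # Fat2
             175, 193, 178, 171, 163, 176,   # Fat3
             155, 166, 149, 164, 170, 168),  # Fat4
  ind = factor(rep(c("Fat1", "Fat2", "Fat3", "Fat4"), each = 6))
)

donut.aov <- aov(values ~ ind, data = sdonut)

# summary() returns a list; its first element is the ANOVA table
tab <- summary(donut.aov)[[1]]
tab

f_value <- tab[["F value"]][1]   # F statistic for the treatment (fat) effect
p_value <- tab[["Pr(>F)"]][1]    # corresponding p-value
f_value
p_value
```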

 

Type of Data: Quantitative (Numerical)

  • Independent Samples
  • Paired Sample

GENERAL FORM OF R COMMAND:

wilcox.test(dataframename$variablename1, dataframename$variablename2, alternative = " ", mu = munull, conf.int = TRUE, conf.level = 0.95)

alternative could be equal to "greater", "less", or "two.sided".

EXAMPLE:

Dataset:

We are asked to compare the median heights of individuals of Italian nationality and German nationality. Here is the data:

Italian: 175, 168, 168, 190, 156, 181, 182, 175, 174, 179

German: 185, 169, 173, 173, 188, 186, 175, 174, 179, 180

We will test the hypothesis that the true median heights for these groups are different and construct a confidence interval for the difference between median heights.

Since the data set is small we will enter the data directly into R without using a data frame.

Italian = c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)

German = c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

wilcox.test(Italian, German, alternative = "two.sided", mu = 0, conf.int = TRUE, conf.level = 0.95)
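
Note that wilcox.test only reports a confidence interval when conf.int = TRUE is given; conf.level alone has no effect. The sketch below stores the result and extracts the interval. Because the heights contain ties, we set exact = FALSE so that R uses the normal approximation without warnings; this is our choice for the illustration. According to the R documentation, the reported estimate is the median of the differences between a value from the first sample and a value from the second, which is not quite the same thing as the difference between the two medians.

```r
Italian <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
German  <- c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)

res <- wilcox.test(Italian, German, alternative = "two.sided", mu = 0,
                   conf.int = TRUE, conf.level = 0.95, exact = FALSE)

res$p.value    # p-value of the rank sum test
res$estimate   # estimated difference in location
res$conf.int   # confidence interval for the difference in location
```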

EXERCISE: Construct a 90% confidence interval for the difference between medians.

GENERAL FORM OF R COMMAND:

wilcox.test(dataframename$variablename1, dataframename$variablename2, paired = TRUE, alternative = " ", mu = munull, conf.int = TRUE, conf.level = 0.95)

alternative could be equal to "greater", "less", or "two.sided".

EXAMPLE:

Dataset:

In this example we will examine the effect of a new training program on 100-meter runners. The following data give the times in seconds before and after training for 10 randomly selected athletes.

Before: 12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3

After: 12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1

We will test the hypothesis that the true median performances before and after training are different and construct a confidence interval for the difference between median performances.

Since the data set is small we will enter the data directly into R without using a data frame.

Before = c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)

After = c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1)

wilcox.test(Before, After, paired = TRUE, alternative = "two.sided", mu = 0, conf.int = TRUE, conf.level = 0.95)
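
With paired = TRUE, wilcox.test carries out the Wilcoxon signed rank test on the differences. The sketch below illustrates this by running the test both ways; paired and signed are illustrative names, and exact = FALSE is our choice so that the normal approximation is used in both calls.

```r
Before <- c(12.9, 13.5, 12.8, 15.6, 17.2, 19.2, 12.6, 15.3, 14.4, 11.3)
After  <- c(12.7, 13.6, 12.0, 15.2, 16.8, 20.0, 12.0, 15.9, 16.0, 11.1)

# Paired test on the two columns
paired <- wilcox.test(Before, After, paired = TRUE,
                      alternative = "two.sided", mu = 0, exact = FALSE)

# One-sample signed rank test on the within-athlete differences
signed <- wilcox.test(Before - After,
                      alternative = "two.sided", mu = 0, exact = FALSE)

paired$statistic
signed$statistic
paired$p.value
signed$p.value
```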

EXERCISE: Construct a 90% confidence interval for the difference between medians.

 

 

Type of Data: Binary (Two possible outcomes)

GENERAL FORM OF R COMMAND:

prop.test(x = numberofsuccesses, n = numberofobservations, p = hypothesizedvalue, alternative = " ", correct = TRUE/FALSE, conf.level = 0.95)

alternative could be equal to "greater", "less", or "two.sided".

Without continuity correction use correct = FALSE.

EXAMPLE:

Dataset:

Suppose that 60% of citizens in Minnesota voted in the last election. In a telephone survey, 85 out of 148 people said that they voted in the current election. Is there evidence that the proportion of voters in the population is less than in the last election? We would also like to construct a 99% confidence interval for the true proportion of voters in the current election.

prop.test(85, 148, p = 0.6, alternative = "less", correct = FALSE, conf.level = 0.99)

Note that this gives us a one-sided confidence interval. For the two-sided confidence interval use

prop.test(85, 148, p = 0.6, alternative = "two.sided", correct = FALSE, conf.level = 0.99)
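
prop.test reports a chi-squared statistic rather than a z statistic. Without the continuity correction, that statistic is the square of the usual one-proportion z statistic, as the following sketch illustrates. The names p_hat, z and p_less are ours. The confidence interval printed by prop.test is a Wilson score interval, so it will not match the textbook interval p_hat ± z* × SE exactly.

```r
x  <- 85     # number of people who said they voted
n  <- 148    # number of people surveyed
p0 <- 0.6    # hypothesized proportion

p_hat  <- x / n
z      <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # one-proportion z statistic
p_less <- pnorm(z)                                 # p-value for alternative "less"

res <- prop.test(x, n, p = p0, alternative = "less",
                 correct = FALSE, conf.level = 0.99)

z^2             # equals the X-squared reported by prop.test
res$statistic
p_less          # equals res$p.value
res$p.value
```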

 

 

 

Type of Data: Binary (Two possible outcomes)

GENERAL FORM OF R COMMAND:

prop.test(c(success1,success2), c(total1, total2), alternative = " ", correct = TRUE/FALSE, conf.level = 0.95)

alternative could be equal to "greater", "less", or "two.sided".

Without continuity correction use correct = FALSE.

EXAMPLE:

Dataset:

Based on research published by Robert Rutledge, MD, and his colleagues in the Annals of Surgery (1993), in 1916 car accident cases the patients did not use a seat belt and 135 of them died. On the other hand, in 1490 cases the patients used seat belts and 47 of them died. Test the hypothesis that the proportion of cases ending in death is the same for the no seat belt and seat belt groups.

prop.test(c(135, 47), c(1916, 1490), alternative = "two.sided", correct = FALSE, conf.level = 0.99)
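
For two groups, the chi-squared statistic from prop.test (without continuity correction) is the square of the two-proportion z statistic that uses the pooled proportion. The sketch below checks this on the seat belt data; deaths, cases, pooled and z are illustrative names.

```r
deaths <- c(135, 47)       # no seat belt, seat belt
cases  <- c(1916, 1490)

p      <- deaths / cases               # sample proportions in each group
pooled <- sum(deaths) / sum(cases)     # pooled proportion under H0: p1 = p2

# Two-proportion z statistic with the pooled standard error
z <- (p[1] - p[2]) / sqrt(pooled * (1 - pooled) * sum(1 / cases))

res <- prop.test(deaths, cases, alternative = "two.sided",
                 correct = FALSE, conf.level = 0.99)

z^2                    # equals the X-squared reported by prop.test
res$statistic
2 * pnorm(-abs(z))     # equals res$p.value
res$p.value
```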

Type of Data: Qualitative (Categorical)

Data Format: Frequency table

  • From a File
  • Within R

DATA FILE: Data file should look like the following.

 
              Column Level 1   ...   Column Level c
Row Level 1   count            ...   count
...
Row Level r   count            ...   count

GENERAL FORM OF R COMMAND:

chisq.test(data.matrix, correct = FALSE)

EXAMPLE:

Dataset: Handedness.csv

The table relates offspring left-handedness to parental handedness. For the parental handedness, the first entry is for the father and the second one is for the mother. Click on the file to download it and move it into RStudio.

STEP 1. Create a data file outside of R by using Excel, laid out as shown above. Note that you give a variable name for the row categorical variable. In this example it is "Father.Mother". Download the file and load it into RStudio. In this case this has already been done.

Handedness <- read.csv(file.choose(), row.names = "Father.Mother")

STEP 2. Carry out the chi-square test

testresult <- chisq.test(Handedness, correct = FALSE)

testresult

STEP 3. If the test is significant, get the observed and expected counts, residuals and standardized residuals.

testresult$observed
testresult$expected
testresult$residuals
testresult$stdres

STEP 4. Produce a graphical display.

barplot(t(prop.table(Handedness)), beside = TRUE, xlab = "Parental Handedness", ylab = "Proportion", legend.text = TRUE)

mosaicplot(Handedness, shade = TRUE, main = "Genetics and Handedness")

NOTE:

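Cell proportions (all cells sum to 1):
prop.table(data.matrix)

Row proportions (each row of data.matrix sums to 1):
prop.table(data.matrix, 1)

Column proportions (each column of data.matrix sums to 1):
prop.table(data.matrix, 2)

Multiply by 100 to get percentages.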
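
The effect of the second argument of prop.table is easiest to see on a small table. The sketch below uses the Pew Research Center survey counts that appear later in this section; tab, cell, by_row and by_col are illustrative names.

```r
tab <- rbind(Men = c(538, 167, 29), Women = c(557, 186, 31))
colnames(tab) <- c("In favor", "Oppose", "Don't Know")

cell   <- prop.table(tab)      # each cell divided by the grand total
by_row <- prop.table(tab, 1)   # each cell divided by its row total
by_col <- prop.table(tab, 2)   # each cell divided by its column total

round(100 * by_row, 1)   # row percentages: opinions within each gender
round(100 * by_col, 1)   # column percentages: genders within each opinion

rowSums(by_row)   # every row sums to 1
colSums(by_col)   # every column sums to 1
```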

EXERCISE: The data were obtained by asking a large number of people in the UK which of 13 characteristics they would associate with the nationals of the UK's partners in the European Community.

Load the following data and analyze it. In this case the row variable's name is "Country".

UKPerceptionsofEuropeans.csv

 

GENERAL FORM OF R COMMAND:

chisq.test(data.matrix, correct = FALSE)

EXAMPLE:

Randomly selected subjects by the Pew Research Center were asked about the use of marijuana for medical purposes and their genders have been noted. Here is the frequency table:

        In favor  Oppose  Don't Know
Men     538       167     29
Women   557       186     31

STEP 1. Create a data matrix. We will create the matrix from the rows and name the columns.

Men = c(538, 167, 29)

Women = c(557, 186, 31)

data.table = rbind(Men, Women)

colnames(data.table) = c("In favor", "Oppose", "Don't Know")

STEP 2. Carry out the chi-square test

testresult <- chisq.test(data.table, correct = FALSE)

testresult

STEP 3. If the test is significant, get the observed and expected counts, residuals and standardized residuals.

testresult$observed
testresult$expected
testresult$residuals
testresult$stdres
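
To see where these numbers come from, the expected counts and residuals can be recomputed by hand. The sketch below uses the standard formulas: expected count = row total × column total / grand total, and Pearson residual = (observed - expected) / sqrt(expected). The names tab, expected_manual and pearson_manual are illustrative.

```r
Men   <- c(538, 167, 29)
Women <- c(557, 186, 31)

tab <- rbind(Men, Women)
colnames(tab) <- c("In favor", "Oppose", "Don't Know")

testresult <- chisq.test(tab, correct = FALSE)

# Expected counts under independence
expected_manual <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson residuals; their squares add up to the chi-square statistic
pearson_manual <- (tab - expected_manual) / sqrt(expected_manual)

expected_manual
testresult$expected
sum(pearson_manual^2)
testresult$statistic
```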

STEP 4. Produce a graphical display.

barplot(t(prop.table(data.table)), beside = TRUE, xlab = "Gender", ylab = "Proportion", legend.text = TRUE)

mosaicplot(data.table, shade = TRUE, main = "The Use of Marijuana")

EXERCISE: The following tables are from a study on the relationships among eye dominance, writing hand, and throwing hand.

Analyze the tables by using the technique that you have learned in this section.

Eye-Dominance    Writing Hand
                 Right   Left
Right            544     56
Left             251     83

Throwing Hand    Writing Hand
                 Right   Left
Right            544     56
Left             251     83

Eye-Dominance    Throwing Hand
                 Right   Left
Right            544     56
Left             251     83

 

 

 

 

Type of Data: Two Quantitative (Numerical)

GENERAL FORM OF R COMMAND:

lm(ResponseVariable ~ ExplanatoryVariable, data=dataframename)

EXAMPLE:

Dataset: faithful

There are two observation variables in the data set. The first one, called eruptions, is the duration of the geyser eruptions. The second one, called waiting, is the length of the waiting period until the next eruption. We will carry out a simple regression analysis of the waiting intervals (response) on the eruption durations (explanatory).

lm(waiting ~ eruptions, data = faithful)

To get the detailed analysis

waiting.lm = lm(waiting ~ eruptions, data = faithful)

summary(waiting.lm)

To see the residuals

waiting.res = resid(waiting.lm)

waiting.res

To get the residual plot

plot(faithful$eruptions, waiting.res)

abline(0,0)

To see the fitted values

waiting.fit = fitted(waiting.lm)

waiting.fit
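
Fitted values and residuals are tied together: each observed waiting time equals its fitted value plus its residual, and each fitted value is intercept + slope × eruptions. The sketch below checks both relations; b and fit_manual are illustrative names.

```r
waiting.lm  <- lm(waiting ~ eruptions, data = faithful)
waiting.fit <- fitted(waiting.lm)
waiting.res <- resid(waiting.lm)

# Observed, fitted and residual side by side for the first few eruptions
head(cbind(observed = faithful$waiting,
           fitted   = waiting.fit,
           residual = waiting.res))

# Fitted values recomputed from the estimated coefficients
b <- coef(waiting.lm)
fit_manual <- b[["(Intercept)"]] + b[["eruptions"]] * faithful$eruptions
head(fit_manual)
```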

To find Pearson's correlation

cor(faithful$eruptions, faithful$waiting)
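
In simple regression the correlation and the fitted line are closely linked: the square of Pearson's correlation equals the R-squared reported by summary, and the slope equals r × sd(y) / sd(x). The sketch below checks both facts on the faithful data; r, r_squared and slope are illustrative names.

```r
waiting.lm <- lm(waiting ~ eruptions, data = faithful)

r         <- cor(faithful$eruptions, faithful$waiting)   # Pearson's correlation
r_squared <- summary(waiting.lm)$r.squared               # R-squared of the regression
slope     <- coef(waiting.lm)[["eruptions"]]             # estimated slope

r^2
r_squared

# Slope recovered from the correlation and the two standard deviations
r * sd(faithful$waiting) / sd(faithful$eruptions)
slope
```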