knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#ws>"
)
library(Intro2MLR)

Introduction

The use of categorical variables is an important part of MLR and regression in general. Therefore we must learn about them and the multinomial.

Example 1 MS pg 448 9.8

Prepare the data as a table

data(irr)
irr
irr$FREQ -> opinion
names(opinion) <- irr$STRATEGY
opinion <- as.table(opinion)
opinion

Now analyze

Suppose we wish to test the NULL hypothesis

$$H_0:p_i = 1/5,\;\;\forall i\in{1,\ldots,5}$$

test <-chisq.test(opinion)
test

Conclusion

We do not have sufficient evidence at the 0.05 level to reject the NULL hypothesis and so we retain it as plausibly true given the data.

Going further

You can investigate the $\chi^2$ test and associated statistics further by looking at the output

names(test)

Notice that there are some very useful components that make up the test.

The expected values $E(n_i)$

test$expected

These are calculated under the assumption that the NULL hypothesis is correct

Two way tables

We will look at the three mile island example as detailed in the book MS page 453

data(mile3)

mile3$NUMBER->freq
mat <- matrix(freq, nrow = 2, byrow =FALSE)
dimnames(mat) = list(c("yes", "no"), c("1-6","7-12", "13+"))
tab <- as.table(mat)
addmargins(tab)

We wish to know whether the two directions of classification are dependent.

Plotting the area

To see the p-value area we can use ggplot. It helps to know that the mean of a chisquare is $\nu$ and the variance is $2\nu$.

The p value is giving us evidence against the NULL hypothesis of factor independence.

$$H_0: \tt{Attitude} \; \perp \; \tt{Distance}$$

library(ggplot2)

out <- chisq.test(tab)
chiargs = list(df = out$parameter)
g <- ggplot(data.frame(x=c(out$statistic, out$parameter+4*sqrt(2*out$parameter))), aes(x)) + 
  stat_function(fun = dchisq, args = chiargs , geom = "area", fill = "green") 


g <- g + stat_function(fun = dchisq, args = chiargs, geom = "area", fill = "black",xlim = c(0,out$statistic)) 

g <- g + xlab("X") + ylab("Density")
g
out
1-pchisq(out$statistic,2)

What happens to the cutoff as $\nu$ increases?

To determine what the cut off values are we can create them using qchisq()

qchisq(1-0.05, df = 1:10)

Can we make sense of this? If the number of cells increase so the degrees of freedom will also, since $\nu = (r-1)(c-1)$.

q = qchisq(1-0.05, 1:20)
diff(q)


MATHSTATSOU/Intro2MLR documentation built on Dec. 11, 2020, 9:01 p.m.