In laderast/DSI_explore1: Exploratory Data Analysis and Statistics for OHSU Data Science Institute

knitr::opts_chunk$set(echo = FALSE)
library(learnr)
library(gridExtra)
library(tidyverse)
library(DSIExplore)
library(broom)

library(tidyverse)
library(DSIExplore)

Learning Objectives for this Session

Please fill out the pre-assessment before you do anything! https://goo.gl/forms/jRa7TRFWXp5rlF6H3

Make sure to buddy up, and use your post-its!

At the end of this session you should be able to

Understand why we need to do Exploratory Data Analysis (EDA)
Understand statistical association between categorical variables
Visual ways to identify association
Basic understanding of Chi-squared statistics
Understand how to assess missing data

What is Exploratory Data Analysis?

Pioneered by Tukey
Detective work on your data
An attitude towards data, not just techniques
'Find patterns, reveal structure, and make tenative model assessments (Behrens)'

Being a good analyst requires you to look at the data and notice the weird things about it, and follow the thread to the end. Be a data detective!

Why EDA and Visualization?

Visualization is good for exploring data because we are really good at evaluating things visually.

We need to become aware of patterns in the data.
Sometimes these patterns are desirable:
- Associations

Broad Street Pump Map http://www.ph.ucla.edu/epi/snow/mapsbroadstreet.html

Undesirable Variation

Sometimes they are not desirable:
- Experimental Artefacts

Systolic Blood Pressure Reporting http://care.diabetesjournals.org/content/30/8/1959

Good EDA visualizations tell us something about the data

A good visualization will make certain qualities of the data apparent

Is there an association between smoking and death?
Is this association due to something related to smoking?
- What are some examples?

What are important associations?

Let's think about this.

If we know about whether someone smokes, do we know whether they are more likely to die?

Compare it to whether someone wears green.

question("If someone wears green are they more likely to die?",
  answer("Yes. Because Lucky the Leprechaun will be angry.", message="Not correct."),
  answer("No. The two variables are not associated", correct=TRUE, message="Yes, that's correct!"),
  allow_retry = TRUE
)

The Data (Whickham)

We're going to look at some data of from a group study of 1314 female smokers. It's called the Whickham dataset. The version of the dataset is actually modeled on the original study, but are a group of synthetic patients.

Let's take a look at a summary of the data.

library(tidyverse)
data("Whickham",package = "DSIExplore")
summary(Whickham)

question("What can we learn from this summary table?",
  answer("There are more alive people than dead people in the data.", correct=TRUE, message="Correct. There are more alive people (945) than dead people (369) in this dataset."),
  answer("There are two variables", message="Not quite. There's `outcome`, `smoker`, and `age`."),
  answer("You are either considered either a smoker or a non-smoker", correct=TRUE, message="Yes, that's correct! This dataset doesn't account for how long you smoked."),
  answer("The age range of the patients includes minors.", message="Incorrect. Look at the min value for age."),
  answer("The average age of the participant was around 47 years of age", correct="Yes, the mean age was 46.92 years of age. Yes, you can see it as the mean value for the age" ),
  allow_retry = TRUE
)

What does the data actually look like?

Here are the first 10 rows of the data table. Each row of the data corresponds to a patient.

Whickham[1:10,]

question("How many patients are Alive (outcome = 'Alive') and didn't smoke (smoker = 'No') from these first 10 rows?",
  answer("There are 4 patients.", correct=TRUE, message="Correct. There are 4 patients who are alive and not smokers."),
  answer("There are 5 patients.", message="Not quite. Go back and count."),
  answer("There are 6 patients.", message="Not quite. Go back and count.."),
  allow_retry = TRUE
)

Let's look at our Outcome

Our outcome in our dataset is death. One tool we often use to explore the data is a table, which counts how much of each category is there. Which group is larger?

table(Whickham$outcome)

Here's another way to look at the data, a bar plot. You can instantly see which of the groups is larger.

Whickham %>% ggplot(aes(x=outcome)) + geom_bar() + ggtitle("Outcomes: Alive versus Dead")

Sometimes it is easier to look at a table, but usually a bar chart will make things more obvious.

Now let's look at Smoking Status

One way of looking at the data is to make a summary table.

table(Whickham$smoker)

Here's the bar plot.

Whickham %>% ggplot(aes(x=smoker)) + geom_bar() + ggtitle("Number of smokers versus Non Smokers")

Two Variables Walk Into A Bar

Now that we've seen the variables by themselves, we can consider whether these two variables are associated.

Try to phrase the association in terms of the variables, such as "If you smoke, are you more likely to die?"

Before we start assessing whether our two variables are associated, let's think about what we're looking for. We're going to talk about two cases first:

1) Perfect Association, and 2) No association.

Our data is probably somewhere in between these two extremes of data.

Perfect Association

One case is that smoking would perfectly predict outcome. That is, if you smoke, you would die. Here we simulate fake data that shows this. The association would look like this:

tab <- data.frame(outcome=factor(c("Alive", "Dead", "Alive", "Dead"), levels=c("Alive", "Dead")), smoker=factor(c("No", "No", "Yes", "Yes"), levels=c("Yes", "No")), value=c(500, 10, 7, 300))

vis1 <- tab %>% ggplot(aes(x=outcome, y=smoker)) + geom_tile(fill="white", colour="black") + ggtitle("2 x 2 Table of Smoking versus Outcome")  + geom_text(aes(label=value))

vis2 <- tab %>% ggplot(aes(x=outcome, fill=smoker, y=value)) + geom_bar(color="black",stat="identity") + ggtitle("Stacked Bar Plot of Smoking versus Outcome")

lay <- t(as.matrix(c(1,2), nrow=1))
grid.arrange(vis1, vis2, layout_matrix=lay)

question("Which part of the stacked bar plot corresponds to the top left cell in the table?",
  answer("Right side, red part", message="Not quite, look at what two conditions the top left cell belongs to."),
  answer("Left side, blue portion", correct=TRUE, message="Yes, correct. This cell corresponds to people who are alive and who didn't smoke."),
  answer("Left side, red part", message="Not quite, look at what two conditions the top left cell belongs to."),
  answer("Right side, blue part.", message="Not quite, look at what two conditions the top left cell belongs to."),
  allow_retry = TRUE
)

What if there is no association between smoking and death?

This is harder to detect with two by two tables, but it is easier to detect with the proportional plot. A proportional plot doesn't emphasize the raw frequency of a condition, but its proportion within a category.

Again we simulate some fake data:

sampSize <- 997

noAssocsmoker <- data.frame(probSmok = runif(n=sampSize), probDeath = runif(n=sampSize))
noAssocsmoker <- noAssocsmoker %>% mutate(smoker=ifelse(probSmok < .3, "Y", "N"), 
                          outcome=ifelse(probDeath < .2, "Dead", "Alive"))

tabNoAssoc <- data.frame(table(noAssocsmoker$smoker, noAssocsmoker$outcome))

colnames(tabNoAssoc) <- c("smoker", "outcome", "value")

vis1 <- tabNoAssoc %>% ggplot(aes(x=outcome, y=smoker)) + geom_tile(fill="white", colour="black") + 
  geom_text(aes(label=value)) 

vis2 <- noAssocsmoker %>% ggplot(aes(x=outcome, fill=smoker)) + geom_bar(color="black",position = "fill") 

lay <- t(as.matrix(c(1,2), nrow=1))
grid.arrange(vis1, vis2, layout_matrix=lay)

The 2x2 table

So let's look at how these two variables really predict each other in the Whickham dataset.

testTab <- table(Whickham$outcome, Whickham$smoker) %>% as.data.frame() 
colnames(testTab) <- c("outcome", "smoker", "Freq")

testTab %>% ggplot(aes(x=outcome, y=smoker)) + geom_tile(fill="white", colour="black") + 
  geom_text(aes(label=Freq))

Whickham %>% ggplot(aes(x=outcome, fill=smoker)) + geom_bar(color="black")

Whickham %>% ggplot(aes(x=outcome, fill=smoker)) + geom_bar(position="fill", color="black")

question("Based on this plot, if you smoke will you be more likely to be alive or dead?",
  answer("Dead, because that's what makes common sense.", message="Not correct."),
  answer("Alive, because that's what the data tells us", correct=TRUE, message="Yes, that's correct!")
)

The effect of Age on the Data

Let's assess what happens with number of deaths as we look at older patients.

Adjust the slider on the left and see what happens to the number of deaths as we increase the average age of the group.

sliderInput("slidy", "Age Cutoff", min=min(Whickham$age)+1, max=max(Whickham$age)-1, value=c(23, 83),step = 5)
plotOutput("distPlot")

  output$distPlot <- renderPlot(
      exMelt() %>% ggplot(aes(x=status, fill=outcome)) + geom_bar(position="fill", color="black") #+
        #facet_wrap(facets = ~AgeCat)
      )

    exMelt <- reactive({
      cutoff = min(input$slidy)
      Whickham %>% filter(age > cutoff) %>% mutate(status="allPatients")
        #mutate(AgeCat=factor(ifelse(age > cutoff, "older", NA))) 
    })

question("What do you notice about the number of deaths as you increase the age of the subgroup?",
  answer("The number of deaths decreases, which means that as you get older, you are less likely to die.", message="Not correct."),
  answer("The number of deaths increases, which means that as you get older, you are more likely to die.", correct=TRUE, message="Yes, that's correct!")
)

Not as easy as we thought!

So as you get older, you're more likely to die. This may be messing up our overall results!

Let's ask the question again, with a younger group: are smokers under 60 more likely to die than non smokers?

sliderInput("slidy2", "Age Cutoff", min=min(Whickham$age)+1, max=max(Whickham$age)-1, value=c(18, 83),step = 2)
plotOutput("distPlot2")

  output$distPlot2 <- renderPlot(
      exMelt2() %>% ggplot(aes(x=smoker, fill=outcome)) + geom_bar(position="fill", color="black")
      )

    exMelt2 <- reactive({
      cutoffMax = max(input$slidy2)
      cutoffMin = min(input$slidy2)
      Whickham %>% filter(age < cutoffMax & age > cutoffMin) %>% mutate(status="allPatients")
    })

question("For patients who are under 60, is smoking associated with death?",
  answer("Yes, the proportion of smokers who die is greater than the proportion of non-smokers who die for those patients younger than 60 years.", correct=TRUE, message="Yes, correct."),
  answer("No, the proportion of smokers who die is smaller than the proportion of non-smokers who die for those patients younger than 60 years.", message="Go back and look at the proportions.")
)

Chi-Squared Test

We have discovered that the appears to be a difference in proportions in smokers who die versus smokers who live in women under 60.

Now let's talk about statistics. How likely is this a random result?

In other words, how likely is the association between smoking and outcome just something that we could observe by random chance?

We will use a statistic to decide this.

In statistics we model randomness

Every statistic has an underlying model of randomness. Our random model is based on the case with no association between our two variables, smoking and death.

In other words, how rare is our outcome compared to a world of random outcomes? This world of outcomes is known as the null distribution.

Our metric of rareness is the statistic. Where it places us on the null distribution determines our p-value.

For our test of proportions, the appropriate null distribution (or random model) is called a Chi-Square Distribution. We won't explore all of the parameters, but it represents a random process for which there is no association between variables.

Observed versus Expected

What are the expected values? They are the values in the cells we expect in the case of no association between our two variables. In other words, they are the values that we expect if there is no association between smoking and outcome.

We calculate these expected values by looking at the column and row totals for each cell.

cohort <- Whickham %>% 
  ##here is our filtering criteria
  filter(age < 60) %>%
  select(smoker, outcome) %>% 
  table
cohort

outcome_count <- colSums(cohort)
outcome_count
smoking_count <- rowSums(cohort)
smoking_count
total <- sum(smoking_count)

expected_value = (outcome_count[1]*smoking_count[1]) / total
expected_value

What does Chi-squared represent?

The chi-squared statistic summarizes the total deviation from the expected values in each cell. We have 4 cells, so we sum up our calculation over each of these.

We're going to do a little calculation using R. Try running the code. How have we filtered the data? Change the filtering criteria to age < 60 and see how the tables change.

cohort <- Whickham %>% 
  ##here is our filtering criteria
  filter(age < 40) %>%
  select(smoker, outcome) %>% 
  table

##these are the observed values 
print("observed values")
cohort

##we can calculate expected values using chisq.test()
cohort <- chisq.test(cohort, correct=FALSE)

## show expected values
print("expected values")
cohort$expected

## show chisq = (observed - expected)^2/expected
print("obs-expected^2/expected")
chi <- (cohort$observed - cohort$expected)^2/cohort$expected
chi

question("How did the top left cell change for the chi-matrix when you changed the filtering criteria?",
  answer("It got larger, meaning there is more of a deviation from the expected value.", correct=TRUE, message="Yes, correct."),
  answer("It got smaller, meaning there is less of a deviation from the expected value. ")
)

Generating our null universe

actionButton("gen_data", "Generate random data")
plotOutput("chi_tab")
plotOutput("chi_dist")

null_data <- function(samp_size){
   smoking = ifelse(rbinom(samp_size, 1, 0.493)==0, "No","Yes")
   outcome = ifelse(rbinom(samp_size, 1, 0.885)==0, "Dead", "Alive")
   out <- table(smoking, outcome)
   out
}

set.seed(111)
samp_size <- 500
initial_tab <- null_data(samp_size)
chistat <- tidy(chisq.test(initial_tab))$statistic
chiveq <- reactiveValues(vec=rep(NA,500), tab=initial_tab)
##really only need last generated table and chi-square statistic

observeEvent(input$gen_data, {
  out_data <- null_data(samp_size)
  chistat <- tidy(chisq.test(out_data, correct=FALSE))$statistic
  iter <- 1
  if(!is.null(input$gen_data)){
      iter = input$gen_data
  }
  isolate({
  #chiveq$vec[iter] <- chistat 
  chiveq$tab <- out_data
  })
  print(out_data)
})

chi_tab <- renderPlot({
  chiveq$tab %>% ggplot(aes(x=outcome, y=smoker)) + geom_tile(fill="white", colour="black") + 
  geom_text(aes(label=Freq)) + ggtitle("New data")
})

chi_dist <- renderPlot({
  data.frame(chiveq$vec) %>% ggplot(aes(x=chiveq)) + geom_histogram(bins =40)
})

Our null hypothesis and alternative hypothesis

What we call our null hypothesis represents what we think of results that are the result of complete random chance. In our case, we are saying there is no association between variables.

Our null hypothesis is that there is no association between variables. To make things concrete:

Null hypothesis (Ho): No association between smoking and outcome.
Alternative hypothesis (Ha): there is an association between smoking and outcome.

For our null hypothesis, we need to calculate the probability of smoking and the probability of death. Under our null hypothesis, we say these probabilities are independent of each other.

So we can look at how our chi-squared statistic is distributed for our null model.

library(tidyverse)
sampSize <- 300
nTrials <- 500
set.seed(111)

 chiveq <- rep(NA, nTrials)

 for(i in 1:nTrials){
   smoking = ifelse(rbinom(sampSize, 1, 0.493)==0, "No","Yes")
   outcome = ifelse(rbinom(sampSize, 1, 0.885)==0, "Dead", "Alive")
   chiveq[i] <- chisq.test(table(smoking, outcome))$statistic
 }

data.frame(chiveq) %>% ggplot(aes(x=chiveq)) + geom_histogram(bins=60)

P-values

There is a straightforward interpretation to the p-value, and it has to do with how unique or rare our case is compared to our distribution of randomly generated cases.

So the p-value is interpreted as the probability that we will see a random case with the same exact statistic or higher.

For example, if we had 10,000 random cases, and our p-value was 0.2, that means that of our 10000 cases, we would expect to see 10000 * 0.2 = 2000 random cases with our statistic or greater.

Try adjusting the value of the chi-square statistic and see how many random cases are expected to have that statistic or higher.

sliderInput("slide_chi", "slide to adjust statistical cutoff", 
            value=0.8, min = 0.1, max=10)
plotOutput("chihist")

library(tidyverse)
sampSize <- 300
nTrials <- 500
set.seed(111)

 chiveq <- rep(NA, nTrials)

 for(i in 1:nTrials){
   smoking = ifelse(rbinom(sampSize, 1, 0.493)==0, "No","Yes")
   outcome = ifelse(rbinom(sampSize, 1, 0.885)==0, "Dead", "Alive")
   chiveq[i] <- chisq.test(table(smoking, outcome))$statistic
 }

output$chihist <- renderPlot({
  chival <- input$slide_chi
  pval <- length(chiveq[chiveq >= chival]) / length(chiveq) 

  val <- paste("Chi-square statistic:", chival, "\n", "% above chi-square:", pval * 100,
               "\n", "p-value: ", pval)

  data.frame(chiveq, GT=ifelse(chiveq >= chival, "> chisq", "< chisq")) %>% 
    ggplot(aes(x=chiveq, fill=GT)) + geom_histogram(bins=60) +
    geom_vline(xintercept = chival) + labs(x="chi-square statistic") + ggtitle(val)

})

Level of significance

Before we even calculate a p-value, we must decide on a level of significance. Setting this value before we test is important because we specify how tolerant we are of false positives.

This value is also known as alpha.

Let's set our alpha at 0.05. If we set it to be 0.05, that means that there is 5% probability that our result has arisen due to our random variables.

What is our p-value?

Now that we have set our alpha, we can now find our p-value.

Our chi-square statistic for our cohort of women less than 60 is:

library(broom)

under60 <- Whickham %>% 
  ##here is our filtering criteria
  filter(age < 60) %>%
  select(smoker, outcome) %>% 
  table

#show our chi-squared statistic
chistat <- tidy(chisq.test(under60))$statistic
chistat

#calculate our p-value
pvalue = length(chiveq[chiveq > chistat]) / length(chiveq)
pvalue

question("So for our alpha of 0.05, do we call our statistic significant?",
  answer("No, we do not reject the null hypothesis that there is no association between smoking and outcome", message="Compare our p-value to our alpha."),
  answer("Yes, we reject the null hypothesis that there is no association and conclude that there is an association", correct=TRUE, message="Go back and look at the proportions.")
)

The chi-square distribution

By doing our simulation of no association between smoking and outcome with our null model, we have found the probability of our statistic to be less than our alpha.

In reality, we don't simulate these processes, and instead rely on a mathematically derived distribution called the chi-squared distribution, which represents these processes.

Our chi-squared distribution can be calculated using the dchisq() function.

There is an additional parameter known as degrees of freedom that is based on the number of categories for your two variables.

bb <- 0.3
data.frame(chiveq) %>% ggplot(aes(x = chiveq)) + geom_histogram(binwidth = 0.3) + geom_density(aes(y = 0.3 * 
    ..count.., color = "red")) + ggtitle("Empirical versus derived chi-square") + 
    guides(color = FALSE)

Write your results up

question("Write up your results by selecting from the following statements:",
  answer("For our cohort of under 60 women", correct=TRUE),
  answer("we found an association between wearing colors and death", message="Go back and look at the proportions."), 
    answer("we found an association between smoking and death", correct=TRUE),
  answer("this association was significant with a p-value of 0.01", correct=TRUE)

)

The stability of our test statistic with small n

One thing our test statistic is dependent on is n, the total number of subjects in the study. Chi-square statistics work best with larger groups.

We need to be cautious when our n is very small (less than 10), or when we have less than 5 subjects in each cell of our table.

Your Turn

We're going to look at another dataset called NHANES (National Health and Nutrition Examination Survey).

We're going to look at whether a high BMI (30+) is associated with Diabetes status. Additionally, we're going to ask whether there is a gender difference between this association.

data("bmi_diabetes")
library(tidyverse)
bmi_diabetes <- bmi_diabetes %>% tidyr::complete(Diabetes, Gender, BMIstatus) 

bmi_diabetes[1:15,]
summary(bmi_diabetes)

categoricalVars <- c("BMIstatus", "Gender")
selectInput(inputId = "condTab", "Select Variable to Calculate Proportions",
                         choices=categoricalVars, selected=categoricalVars[1])
verbatimTextOutput("proportionTab")
plotOutput("proportionBarplot")

dataOut <- reactive({
    out <- bmi_diabetes %>% 
      #tidyr::complete(Diabetes, Gender, BMIstatus) %>%
      droplevels()
    na.omit(out)
  })

  output$summaryTable <- renderPrint({
    summary(dataOut())
  })

  proportionTable <- reactive({
    out <- dataOut()[,c(input$condTab, "Diabetes")]
    #tab <- as.matrix(table(out))
    out
  })

  output$proportionTab <- renderPrint({
    tab <- table(proportionTable())
    return(tab[,"Yes"]/(tab[,"No"] + tab[,"Yes"]))

  })

  output$proportionBarplot <- renderPlot({
     proportionTable() %>% 
      ggplot(aes_string(x=input$condTab, fill="Diabetes")) + 
      geom_bar(position="fill", color="black")
  })

Decide on your alpha

Decide on your alpha now!

Your Turn: Calculate your p-value

Run the following code to find your p-value.

tab <- table(bmi_diabetes$Diabetes,bmi_diabetes$BMIstatus)
tab

chi_calc <- tidy(chisq.test(tab))
##show chi-square statistic
chi_calc$statistic
##show p-value
chi_calc$p.value

So, is there an association between BMIstatus and Diabetes?

Missing Values

One thing we haven't talked about is missing values. For NHANES, there are actually missing values (encoded as NAs for the dataset).

data("bmi_diabetes_raw")
summary(bmi_diabetes_raw)

bmi_diabetes_raw %>%
  filter(is.na(Diabetes)) %>%
  dplyr::select(Diabetes, BMIstatus, Gender) %>%
  summary()

bmi_diabetes_raw %>%
  filter(is.na(BMIstatus)) %>%
  dplyr::select(Diabetes, BMIstatus, Gender) %>%
  summary()

Be proud of yourself!

These concepts are not easy!

If you have a basic understanding of how associations work in categorical data and how we calculate statistics on them, be happy!

Make sure to fill out the post-assessment. We'll use it to make this course better: https://goo.gl/forms/loIbHGxUqOE2Jv263

References

Chi Squared Test
http://www.stat.wmich.edu/s216/book/node115.html

laderast/DSI_explore1 documentation built on May 29, 2019, 3:40 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

laderast/DSI_explore1
Exploratory Data Analysis and Statistics for OHSU Data Science Institute

In laderast/DSI_explore1: Exploratory Data Analysis and Statistics for OHSU Data Science Institute

Learning Objectives for this Session

What is Exploratory Data Analysis?

Why EDA and Visualization?

Undesirable Variation

Good EDA visualizations tell us something about the data

What are important associations?

The Data (Whickham)

What does the data actually look like?

Let's look at our Outcome

Now let's look at Smoking Status

Two Variables Walk Into A Bar

Perfect Association

What if there is no association between smoking and death?

The 2x2 table

The effect of Age on the Data

Not as easy as we thought!

Chi-Squared Test

In statistics we model randomness

Observed versus Expected

What does Chi-squared represent?

Generating our null universe

Our null hypothesis and alternative hypothesis

P-values

Level of significance

What is our p-value?

The chi-square distribution

Write your results up

The stability of our test statistic with small n

Your Turn

Decide on your alpha

Your Turn: Calculate your p-value

Missing Values

Be proud of yourself!

References

R Package Documentation

Browse R Packages

We want your feedback!

laderast/DSI_explore1 Exploratory Data Analysis and Statistics for OHSU Data Science Institute

In laderast/DSI_explore1: Exploratory Data Analysis and Statistics for OHSU Data Science Institute

Learning Objectives for this Session

What is Exploratory Data Analysis?

Why EDA and Visualization?

Undesirable Variation

Good EDA visualizations tell us something about the data

What are important associations?

The Data (Whickham)

What does the data actually look like?

Let's look at our Outcome

Now let's look at Smoking Status

Two Variables Walk Into A Bar

Perfect Association

What if there is no association between smoking and death?

The 2x2 table

The effect of Age on the Data

Not as easy as we thought!

Chi-Squared Test

In statistics we model randomness

Observed versus Expected

What does Chi-squared represent?

Generating our null universe

Our null hypothesis and alternative hypothesis

P-values

Level of significance

What is our p-value?

The chi-square distribution

Write your results up

The stability of our test statistic with small n

Your Turn

Decide on your alpha

Your Turn: Calculate your p-value

Missing Values

Be proud of yourself!

References

R Package Documentation

Browse R Packages

We want your feedback!

laderast/DSI_explore1
Exploratory Data Analysis and Statistics for OHSU Data Science Institute