knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, tidy = TRUE)
library(tidyverse)
library(learnr)
library(gridExtra)
library(naniar)
library(DSIExplore) # devtools::install_github("laderast/DSIExplore")
library(NHANES)
library(knitr)
library(kableExtra)
# FOR EXERCISES
nhanes_filtered = NHANES %>% filter(Age >= 20, !is.na(Diabetes), !is.na(BMI))
library(tidyverse)
library(DSIExplore)
library(NHANES)
# FOR SERVER
nhanes_filtered = NHANES %>% filter(Age >= 20, !is.na(Diabetes), !is.na(BMI))
Don't forget! Please fill out our pre-session survey. (pre-categorical and continuous)
Find this session at https://minnier.shinyapps.io/ODSI_continuousData/, with code at https://github.com/laderast/DSIExplore
At the end of this session you should be able to
We saw bar plots and proportional plots used to visualize binary and categorical variables in the previous section. What are some good ways of visualizing continuous (quantitative) data?
Let's use the NHANES (National Health and Nutrition Examination Survey) data set to visualize the variable BMI
as a continuous variable.
(Note: info about the NHANES data in the NHANES
R package can be found here, with the disclaimer that NHANES is survey data, so proper analyses should use sampling weights. To keep the illustrations straightforward, we will ignore this detail.)
A common visualization of the distribution of a continuous variable is a histogram (or the smoothed version---the density plot):
data(NHANES)
NHANES %>% ggplot(aes(x = BMI)) +
  geom_histogram(binwidth = 2.5) +
  geom_density(aes(y = 2.5 * ..count.., color = "red")) +
  ggtitle("Histogram and Density of NHANES BMI") +
  guides(color = FALSE)
This plot shows us the frequency of certain values of BMI
. We can see the distribution is somewhat "skewed" positively in that there are a few quite large values on the right tail of the distribution.
We can also use a box plot to see similar patterns. The help page for the box-plot function, ?geom_boxplot,
tells us what all the parts of the box plot mean:
"The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). This differs slightly from the method used by the boxplot function, and may be apparent with small samples. See boxplot.stats for for more information on how hinge positions are calculated for boxplot.
The upper whisker extends from the hinge to the largest value no further than 1.5 x IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value at most 1.5 x IQR of the hinge. Data beyond the end of the whiskers are called "outlying" points and are plotted individually."
NHANES %>% ggplot(aes(x = "All Subjects", y = BMI)) +
  geom_boxplot() +
  ggtitle("Box plot of NHANES BMI") +
  xlab("")
All those dots piled up at the bottom are people who fall outside 1.5 x IQR from the hinges, often thought of as "outliers" of the population distribution. This also shows the positive skew of the distribution.
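The 1.5 x IQR rule behind those whiskers is simple to compute by hand. Here is a minimal sketch in Python (the numbers are made up for illustration, not taken from NHANES):

```python
# Illustrative sketch of the 1.5 x IQR "outlier" rule used by box plots.
# The data values are made up for demonstration.
import statistics

data = [18, 21, 22, 23, 24, 25, 26, 27, 28, 30, 45, 60]

# quantiles() with n=4 gives Q1, median, Q3; method="inclusive"
# interpolates on the sorted data, similar to the hinge definition
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# points beyond the fences are the ones a box plot draws individually
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr)
print(outliers)
```

With these values Q1 = 22.75, Q3 = 28.5, so the upper fence is 37.125 and the two large values are flagged as outlying points.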
If you want to practice coding a histogram and density plot or boxplot, try editing ggplot2
code below to show a histogram of Age
:
# edit the ggplot code after x=
NHANES %>% ggplot(aes(x = BMI)) +
  geom_histogram(binwidth = 2.5) +
  geom_density(aes(y = 2.5 * ..count.., color = "red")) +
  ggtitle("Histogram and Density of Age") +
  guides(color = FALSE)

# edit the ggplot code after y=
NHANES %>% ggplot(aes(x = "All Subjects", y = BMI)) +
  geom_boxplot() +
  ggtitle("Box plot of NHANES Age") +
  xlab("")
# edit the ggplot code after x=
NHANES %>% ggplot(aes(x = Age)) +
  geom_histogram(binwidth = 2.5) +
  geom_density(aes(y = 2.5 * ..count.., color = "red")) +
  ggtitle("Histogram and Density of Age") +
  guides(color = FALSE)

# edit the ggplot code after y=
NHANES %>% ggplot(aes(x = "All Subjects", y = Age)) +
  geom_boxplot() +
  ggtitle("Box plot of NHANES Age") +
  xlab("")
We learned that good EDA can help us identify associations.
The first visualization you are likely to make when you have two continuous variables is a scatter plot. Let's look at the scatter plots of Height
and Weight
vs. BMI
.
NHANES %>% ggplot(aes(x=Weight,y=BMI,color=Height)) + geom_point() NHANES %>% ggplot(aes(x=Height,y=BMI,color=Weight)) + geom_point()
A simple statistical quantification of the association of two continuous variables is the Pearson's Correlation Coefficient (often labeled r).
Correlation = a quantity measuring the extent of interdependence of variable quantities
Pearson's correlation coefficient: a measure of the linear correlation between two variables
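Pearson's r is the covariance of the two variables scaled by the product of their standard deviations, so it always falls between -1 and 1. A minimal from-scratch sketch in Python (the x/y values are illustrative, not NHANES data):

```python
# Minimal sketch of Pearson's correlation coefficient:
# r = cov(x, y) / (sd(x) * sd(y)), computed from scratch.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# sample covariance and standard deviations (n - 1 denominators)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
sdx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
sdy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))

r = cov / (sdx * sdy)
print(round(r, 4))  # very close to 1: a near-perfect linear relationship
```

A value near 1 means the points hug an upward-sloping line; near -1, a downward-sloping line; near 0, no linear trend (though a nonlinear association may still exist, which is why the plots below also report Spearman's rank correlation).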
Question: How well does the line "fit" the data?
plot_w_cor = function(df, cont_x, cont_y, color) {
  enquo_x = enquo(cont_x)
  enquo_y = enquo(cont_y)
  enquo_color = enquo(color)
  tmpdat = df %>%
    summarize(x = mean(range(!!enquo_x, na.rm = TRUE)),
              y = max(!!enquo_y, na.rm = TRUE),
              cortext = paste("Pearson's r = ",
                              round(cor(!!enquo_x, !!enquo_y, use = "complete.obs"), digits = 2),
                              "\nSpearman's r = ",
                              round(cor(!!enquo_x, !!enquo_y, use = "complete.obs", method = "spearman"), digits = 2)))
  df %>% ggplot(aes_(x = enquo_x, y = enquo_y, color = enquo_color)) +
    geom_point() +
    stat_smooth(method = "lm", se = FALSE) +
    geom_text(data = tmpdat, aes(x = x, y = y, label = cortext), inherit.aes = FALSE, size = 4)
}
NHANES %>% plot_w_cor(Weight, BMI, Height)
NHANES %>% plot_w_cor(Height, BMI, Weight)
Now you can try to get a feel for what correlation (linear and non-linear) looks like. Try a few pairs:
(For fun sometime, play the "guess the correlation" game at guessthecorrelation.com)
cont_vars = c("Age", "Weight", "Height", "BMI", "BPSysAve", "BPDiaAve", "TotChol", "Pulse", "HHIncomeMid", "Testosterone")
selectInput("select_x", "X-axis", choices = cont_vars, selected = "Weight")
selectInput("select_y", "Y-axis", choices = cont_vars, selected = "BMI")
selectInput("select_color", "Color", choices = cont_vars, selected = "Height")
plotOutput("scatterPlotCont")
library(rlang)
# should wrap this into the package and not repeat like this
# need to figure out a better way to use quoted vs quosures
plot_w_cor_quotes = function(df, cont_x, cont_y, color) {
  enquo_x = quo(!!sym(cont_x))
  enquo_y = quo(!!sym(cont_y))
  enquo_color = quo(!!sym(color))
  tmpdat = df %>%
    summarize(x = mean(range(!!enquo_x, na.rm = TRUE)),
              y = max(!!enquo_y, na.rm = TRUE),
              cortext = paste("Pearson's r = ",
                              round(cor(!!enquo_x, !!enquo_y, use = "complete.obs"), digits = 2),
                              "\nSpearman's r = ",
                              round(cor(!!enquo_x, !!enquo_y, use = "complete.obs", method = "spearman"), digits = 2)))
  df %>% ggplot(aes_(x = enquo_x, y = enquo_y, color = enquo_color)) +
    geom_point() +
    stat_smooth(method = "lm", se = FALSE) +
    geom_text(data = tmpdat, aes(x = x, y = y, label = cortext), inherit.aes = FALSE, size = 4)
}
output$scatterPlotCont <- renderPlot(
  NHANES %>% plot_w_cor_quotes(input$select_x, input$select_y, input$select_color)
)
If you want to practice coding a scatter plot, try editing ggplot2
code below to show a scatter plot of Age
vs Height
, colored by Gender
:
# edit the ggplot code after x= and y= and color= to change the axes and the color
NHANES %>% ggplot(aes(x = Weight, y = BMI, color = Diabetes)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "lm", se = FALSE)
# edit the ggplot code after x= and y= and color= to change the axes and the color
NHANES %>% ggplot(aes(x = Age, y = Height, color = Gender)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "lm", se = FALSE)
Now let's explore the association of BMI with a binary (yes/no) variable. How does BMI differ by Diabetes
status?
NHANES %>% ggplot(aes(x=Diabetes,y=BMI,color=Diabetes)) + geom_boxplot()
question("Is BMI associated with diabetes status?",
  answer("No, BMI is the same for people with or without diabetes.",
         message = "Not true, look at the average BMI for each group."),
  answer("Yes, BMI is different depending on diabetes status.",
         correct = TRUE,
         message = "Yes, BMI is on average higher for subjects with diabetes than without.")
)
We see that BMI is on average higher for subjects with diabetes than without. There is also a third category called NA
. This means the data is missing.
Let's explore the missingness in this data a bit before we move on.
We can summarize the data with missing diabetes status:
NHANES %>% filter(is.na(Diabetes)) %>% select(Diabetes, BMI, Age) %>% summary
Note that 137 out of 142 subjects with missing diabetes data also have missing BMI data. Also note the Age distribution. Does something look interesting?
NHANES %>% filter(is.na(Diabetes)) %>% select(Diabetes, BMI, Age) %>% janitor::tabyl(Age)
What about missingness in BMI? How does this relate to age? We can use the naniar
package in R to visualize this with a scatter plot:
NHANES %>% ggplot(aes(x=Age,y=BMI,color=Diabetes))+naniar::geom_miss_point()
We also might notice something interesting about the age distribution of our population.
question("Why do you think there's a pile-up of points at age 80?",
  answer("Everyone died at age 80.", message = "Highly unlikely."),
  answer("They oversampled from the 80 year olds.", message = "Highly unlikely."),
  answer("The data was truncated.", correct = TRUE,
         message = "Correct! But why? Probably to preserve de-identification. If we were to fit a model to find the association of age and height, how do you think those 80 year olds would influence our results?")
)
For future analyses, let's work with subjects older than 20 years of age, and remove subjects with missing Diabetes
and BMI
values. Lastly, let's check out the box plots on this filtered data.
nhanes_filtered = NHANES %>% filter(Age >= 20, !is.na(Diabetes), !is.na(BMI))
nrow(nhanes_filtered)
nhanes_filtered %>% ggplot(aes(x = Diabetes, y = BMI, color = Diabetes)) +
  geom_boxplot() +
  ggtitle("Box plot of NHANES BMI by Diabetes Status") +
  xlab("Filtered by Age >= 20")
The most common statistical test to compare the distribution of continuous variables across a binary (Yes/No) variable (like diabetes) is the two sample Student's T Test.
two-sample t-statistic = (difference in means) / (standard error of the difference), where the standard error of the difference is sqrt(pooled variance * (1/n1 + 1/n2))
The test is so named because the test statistic, which quantifies how different the means are relative to the variability in the data, follows a t-distribution. The t-distribution is like the normal bell-shaped curve but with "fatter tails"; as the degrees of freedom (which grow with the sample size) increase, the tails become closer to the Normal distribution.
xx = seq(-5, 5, by = 0.01)
plot(xx, dnorm(xx), type = "l", main = "Normal and T distribution functions", xlab = "x", ylab = "p(x)")
points(xx, dt(xx, df = 10), type = "l", col = "red")
points(xx, dt(xx, df = 3), type = "l", col = "blue")
points(xx, dt(xx, df = 100), type = "l", col = "green")
legend("topright", c("Normal(0,1)", "t df=3", "t df=10", "t df=100"), lty = 1, col = c("black", "blue", "red", "green"))
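The pooled two-sample t-statistic itself is straightforward to compute from group summaries. A sketch in Python, using made-up BMI-like values for two hypothetical groups (not NHANES data):

```python
# Sketch of the pooled two-sample t-statistic:
# t = (mean1 - mean2) / sqrt(s_p^2 * (1/n1 + 1/n2)),
# where s_p^2 is the pooled variance. Group values are made up.
import math

group1 = [31.0, 29.5, 33.2, 30.8, 32.5, 28.9]   # e.g. a "Diabetes = Yes" group
group2 = [26.1, 27.4, 25.3, 28.0, 26.8, 24.9]   # e.g. a "Diabetes = No" group

def mean(v):
    return sum(v) / len(v)

def var(v):
    # sample variance, with n - 1 in the denominator
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

n1, n2 = len(group1), len(group2)
# pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * var(group1) + (n2 - 1) * var(group2)) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (mean(group1) - mean(group2)) / se
print(round(t, 2))
```

A large |t| (here well above 2) means the difference in means is big relative to its standard error, which in R's t.test() translates into a small p-value.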
The t-test makes assumptions about the distribution of your continuous variable within the two groups: the observations are independent of one another, and the variable is approximately normally distributed within each group. (The classic pooled-variance test also assumes equal variances in the two groups; R's t.test() uses the Welch version by default, which relaxes that assumption.)
Let's look at the smoothed histograms (density plots) of the two groups' BMIs.
tmp_summarydat = nhanes_filtered %>% group_by(Diabetes) %>% summarize(mean = mean(BMI))
nhanes_filtered %>% ggplot(aes(x = BMI, color = Diabetes)) +
  geom_density() +
  geom_vline(data = tmp_summarydat, aes(xintercept = mean, color = Diabetes), lty = 2)
Question: Do we think the assumptions of the t-test hold?
question("Do you think the assumptions of the t-test hold?",
  answer("No, not completely.", correct = TRUE,
         message = "The independence assumption holds, but the fact that our distribution is somewhat positively skewed actually violates the normality assumption of the t-test."),
  answer("Yes, completely.",
         message = "Not quite: the independence assumption holds, but the fact that our distribution is somewhat positively skewed actually violates the normality assumption of the t-test.")
)
However! The t-test is pretty robust to slight violations of the normality assumption, especially since we have a large sample size.
So, let's run a t-test (yay!) to assess the difference in means of BMI comparing diabetics and non-diabetics:
broom::tidy(t.test(BMI ~ Diabetes, data = NHANES)) %>%
  select(estimate:p.value) %>%
  mutate(p.value = as.character(signif(p.value, 2))) %>%
  knitr::kable(col.names = c("Difference in Means", "Means No", "Means Yes", "T Statistic", "P Value"), digits = 2) %>%
  kableExtra::kable_styling("striped", full_width = F)
Note the p-value is extremely small. This is because we have a very large sample size and the difference in means is pretty large.
What happens if we have a much smaller sample size? We can examine the effect of sample size by randomly sampling a subset of the data. Look at our test statistic and p-value, as well as the difference in means.
sliderInput("samplesize", label = "Total Sample Size", min = 10, max = 500, value = 500)
actionButton("buttonResample", "Take a Sample")
plotOutput("densityPlot1")
tableOutput("ttest1")
nhanes_filtered = NHANES %>% filter(Age >= 20, !is.na(Diabetes), !is.na(BMI))
subsetdata <- reactive({
  input$buttonResample
  nhanes_filtered %>% dplyr::sample_n(size = input$samplesize)
})
subsetdata_summary <- reactive({
  subsetdata() %>% group_by(Diabetes) %>% summarize(mean = mean(BMI))
})
output$densityPlot1 <- renderPlot(
  subsetdata() %>% filter(!is.na(Diabetes)) %>%
    ggplot(aes(x = BMI, color = Diabetes)) +
    geom_density() +
    geom_vline(data = subsetdata_summary(), aes(xintercept = mean, color = Diabetes), lty = 2)
)
output$ttest1 <- function() {
  broom::tidy(t.test(BMI ~ Diabetes, data = subsetdata())) %>%
    select(estimate:p.value) %>%
    mutate(p.value = as.character(signif(p.value, 2))) %>%
    knitr::kable(format = "html",
                 col.names = c("Difference in Means", "Means No", "Means Yes", "T Statistic", "P Value"),
                 digits = 2) %>%
    kableExtra::kable_styling("striped", full_width = F)
}
Change the outcome (the left-hand side of the ~, which is the continuous variable) to Height, and the binary factor (the right-hand side) to Gender, to test height differences in males vs. females.
t.test(BMI~Diabetes,data=nhanes_filtered) %>% broom::tidy() %>% kable(digits=3)
t.test(Height~Gender,data=nhanes_filtered) %>% broom::tidy() %>% kable(digits=3)
Also examine the histogram of Height
by Gender
(or try your own hypothesis!). Also see what happens when you change the data set from nhanes_filtered
to NHANES
(not filtered).
nhanes_filtered %>% ggplot(aes(x=BMI,color=Diabetes)) + geom_density()
nhanes_filtered %>% ggplot(aes(x=Height,color=Gender)) + geom_density()
What other factors might be associated with BMI?
Start with AllSubjects as your Factor to see the overall distribution, then compare across other categorical variables.
cat_vars = c("AllSubjects", "Gender", "Diabetes", "Smoke100n", "Race1", "Race3", "PhysActive", "TVHrsDay", "SleepTrouble", "AgeDecade")
selectInput("select_factor1", "Factor", choices = cat_vars, selected = "AllSubjects")
selectInput("select_cont1", "Continuous X-axis", choices = cont_vars)
plotOutput("pointPlot1")
plotOutput("boxPlot1")
nhanes_filtered$AllSubjects = "All"
output$pointPlot1 <- renderPlot(
  nhanes_filtered %>%
    ggplot(aes(x = get(input$select_cont1), y = BMI, color = get(input$select_factor1))) +
    geom_point(alpha = 0.1) +
    facet_wrap(~get(input$select_factor1)) +
    xlab(input$select_cont1) +
    geom_smooth(method = "lm") +
    scale_color_discrete(name = input$select_factor1)
)
output$boxPlot1 <- renderPlot(
  nhanes_filtered %>%
    ggplot(aes(x = get(input$select_factor1), y = BMI, color = get(input$select_factor1))) +
    geom_boxplot() +
    xlab("") +
    scale_color_discrete(name = input$select_factor1)
)
Often we believe our measure or outcome of interest (i.e. BMI) is associated with several other variables at once. For instance, BMI could be associated with diabetes, blood pressure, physical activity, diet, among many other characteristics.
Assumptions and model: the outcome is a linear function of the predictors, the observations are independent, and the errors have constant variance and are approximately normally distributed.
Mathematically, we are modeling our outcome $Y$ as a linear function of predictors $X_1$ and $X_2$ plus random error $\epsilon$:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$
Suppose we want to determine if pulse rate is associated with BMI after adjusting for diabetes status.
We would fit a model:
$$BMI = \beta_0 + \beta_1 (Pulse) + \beta_2 (Diabetes = Yes) + \epsilon$$ and test whether the coefficient $\beta_1$ is equal to 0 or not.
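The coefficients in such a model are estimated by ordinary least squares. For the single-predictor case the estimates have a simple closed form, sketched here in Python with noiseless made-up data (not NHANES), so the fit recovers the generating coefficients exactly:

```python
# Sketch of ordinary least squares for a single predictor:
# beta1 = cov(x, y) / var(x), beta0 = mean(y) - beta1 * mean(x).
# Data are generated exactly from y = 2 + 0.5 * x (illustrative values),
# so the fit should recover those coefficients.

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0 + 0.5 * xi for xi in x]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# slope: sum of cross-products of deviations over sum of squared x-deviations
beta1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
# intercept: forces the fitted line through the point of means
beta0 = my - beta1 * mx
print(beta0, beta1)  # → 2.0 0.5
```

With several predictors (as in the model above) the same least-squares criterion is solved jointly for all coefficients, which is what R's lm() does.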
We can visualize this:
nhanes_filtered %>% ggplot(aes(x=Pulse,y=BMI,color=Diabetes,alpha=Diabetes)) + geom_point() + geom_smooth(method="lm")
We can run a linear regression in R
and view the output:
summary(fit1 <- lm(BMI~Pulse+Diabetes,data=nhanes_filtered))
We can interpret the coefficient for Pulse
as: for every one-unit (beat-per-minute) increase in pulse, we expect the average BMI to increase by r round(fit1$coef[2],3)
kg/m^2. The summary also reports an R-squared value, which measures how much of the variation in BMI the model explains. This idea of "how well can we predict our outcome" drives much of statistical modeling, including machine learning and predictive analytics.
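R-squared is easy to compute once you have fitted values: it is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch in Python (the observed and predicted values are made up for illustration):

```python
# Sketch of R-squared: the fraction of outcome variance explained by
# the model's fitted values. Numbers are illustrative, not from NHANES.
y     = [25.0, 27.0, 30.0, 32.0, 26.0]   # observed outcome
y_hat = [25.5, 26.5, 29.0, 31.5, 27.5]   # model predictions

my = sum(y) / len(y)
ss_tot = sum((yi - my) ** 2 for yi in y)            # total variation
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # → 0.882
```

An R-squared of 1 means perfect prediction; 0 means the model does no better than predicting the mean for everyone.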
We can include many predictors in our model and still not have a very good R-squared, as BMI is a very complex trait:
summary(fit1 <- lm(BMI~PhysActive+SmokeNow+Gender+Race3+BPSysAve+BPDiaAve+Pulse+Diabetes,data=nhanes_filtered))
What if we tried to predict BMI with weight?
summary(fit1 <- lm(BMI~Weight+Diabetes,data=nhanes_filtered))
nhanes_filtered %>% ggplot(aes(x=Weight,y=BMI,color=Diabetes,alpha=Diabetes)) + geom_point() + geom_smooth(method="lm")
Modify the R code in the lm()
function to build your own linear regression model. Here are the variable names in the NHANES data set:
names(nhanes_filtered)
nhanes_filtered = NHANES %>% filter(Age >= 20, !is.na(Diabetes), !is.na(BMI))
# edit the code in the formula by adding in variable names
summary(fit <- lm(BMI ~ Pulse + Diabetes, data = nhanes_filtered))
nhanes_filtered = NHANES %>% filter(Age >= 20, !is.na(Diabetes), !is.na(BMI))
# edit the ggplot code after x= and y= and color= to change the axes and the color
nhanes_filtered %>% ggplot(aes(x = Pulse, y = BMI, color = Diabetes)) +
  geom_point() +
  geom_smooth(method = "lm")
To learn more R coding and a bit more about EDA and statistical analysis, try out our Data Camp course:
Peng, Roger D. Exploratory Data Analysis with R. Leanpub.
Peng, Roger D. and Matsui, E. The Art of Data Science. Leanpub.
Please fill out our post-session survey.