knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, tidy = TRUE)
library(tidyverse)
library(learnr)
library(gridExtra)
library(naniar)
library(DSIExplore) # devtools::install_github("laderast/DSIExplore")
library(NHANES)
library(knitr)
library(kableExtra)
# FOR EXERCISES
nhanes_filtered = NHANES %>% filter(Age >= 20, !is.na(Diabetes), !is.na(BMI))
library(tidyverse)
library(DSIExplore)
library(NHANES)
# FOR SERVER
nhanes_filtered = NHANES %>% filter(Age >= 20, !is.na(Diabetes), !is.na(BMI))
Don't forget! Please fill out our pre-session survey. (pre-categorical and continuous)
Find this session at https://minnier.shinyapps.io/ODSI_continuousData/, with code at https://github.com/laderast/DSIExplore
At the end of this session you should be able to
We saw bar plots and proportional plots used to visualize binary and categorical variables in the previous section. What are some good ways of visualizing continuous (quantitative) data?
Let's use the NHANES (National Health and Nutrition Examination Survey) data set to visualize the variable BMI
as a continuous variable.
(Note: info about the NHANES data in the NHANES
R package can be found here, with the disclaimer that NHANES is survey data, so proper analyses should use sampling weights. To keep the illustrations straightforward, we will ignore this detail.)
A common visualization of the distribution of a continuous variable is a histogram (or the smoothed version---the density plot):
data(NHANES)
NHANES %>% ggplot(aes(x = BMI)) +
  geom_histogram(binwidth = 2.5) +
  geom_density(aes(y = 2.5 * ..count.., color = "red")) +
  ggtitle("Histogram and Density of NHANES BMI") +
  guides(color = FALSE)
This plot shows us the frequency of certain values of BMI
. We can see the distribution is somewhat "skewed" positively in that there are a few quite large values on the right tail of the distribution.
We can also use a box plot to see similar patterns. The help page for the box-plot function, ?geom_boxplot,
tells us what all the parts of the box plot mean:
"The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). This differs slightly from the method used by the boxplot function, and may be apparent with small samples. See boxplot.stats for for more information on how hinge positions are calculated for boxplot.
The upper whisker extends from the hinge to the largest value no further than 1.5 x IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles). The lower whisker extends from the hinge to the smallest value at most 1.5 x IQR of the hinge. Data beyond the end of the whiskers are called "outlying" points and are plotted individually."
NHANES %>% ggplot(aes(x = "All Subjects", y = BMI)) +
  geom_boxplot() +
  ggtitle("Box plot of NHANES BMI") +
  xlab("")
All those dots piled up at the bottom are people who fall outside 1.5 x IQR from the hinges, often thought of as "outliers" of the population distribution. This also shows the positive skew of the distribution.
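The 1.5 x IQR rule behind those whiskers is simple to compute by hand. Here is a minimal sketch in Python (the numbers are made up for illustration, not taken from NHANES):

```python
# Illustrative sketch of the 1.5 x IQR "outlier" rule used by box plots.
# The data values are made up for demonstration.
import statistics

data = [18, 21, 22, 23, 24, 25, 26, 27, 28, 30, 45, 60]

# quantiles() with n=4 gives Q1, median, Q3; method="inclusive"
# interpolates on the sorted data, similar to the hinge definition
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# points beyond the fences are the ones a box plot draws individually
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr)
print(outliers)
```

With these values Q1 = 22.75, Q3 = 28.5, so the upper fence is 37.125 and the two large values are flagged as outlying points.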
If you want to practice coding a histogram and density plot or boxplot, try editing ggplot2
code below to show a histogram of Age
:
# edit the ggplot code after x=
NHANES %>% ggplot(aes(x = BMI)) +
  geom_histogram(binwidth = 2.5) +
  geom_density(aes(y = 2.5 * ..count.., color = "red")) +
  ggtitle("Histogram and Density of Age") +
  guides(color = FALSE)

# edit the ggplot code after y=
NHANES %>% ggplot(aes(x = "All Subjects", y = BMI)) +
  geom_boxplot() +
  ggtitle("Box plot of NHANES Age") +
  xlab("")
# edit the ggplot code after x=
NHANES %>% ggplot(aes(x = Age)) +
  geom_histogram(binwidth = 2.5) +
  geom_density(aes(y = 2.5 * ..count.., color = "red")) +
  ggtitle("Histogram and Density of Age") +
  guides(color = FALSE)

# edit the ggplot code after y=
NHANES %>% ggplot(aes(x = "All Subjects", y = Age)) +
  geom_boxplot() +
  ggtitle("Box plot of NHANES Age") +
  xlab("")
We learned that good EDA can help us identify associations.
The first visualization you are likely to make when you have two continuous variables is a scatter plot. Let's look at the scatter plots of Height
and Weight
vs. BMI
.
NHANES %>% ggplot(aes(x=Weight,y=BMI,color=Height)) + geom_point() NHANES %>% ggplot(aes(x=Height,y=BMI,color=Weight)) + geom_point()
A simple statistical quantification of the association of two continuous variables is the Pearson's Correlation Coefficient (often labeled r).
Correlation = a quantity measuring the extent of interdependence of variable quantities
Pearson's correlation coefficient: a measure of the linear correlation between two variables
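Pearson's r is the covariance of the two variables scaled by the product of their standard deviations, so it always falls between -1 and 1. A minimal from-scratch sketch in Python (the x/y values are illustrative, not NHANES data):

```python
# Minimal sketch of Pearson's correlation coefficient:
# r = cov(x, y) / (sd(x) * sd(y)), computed from scratch.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# sample covariance and standard deviations (n - 1 denominators)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
sdx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
sdy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))

r = cov / (sdx * sdy)
print(round(r, 4))  # very close to 1: a near-perfect linear relationship
```

A value near 1 means the points hug an upward-sloping line; near -1, a downward-sloping line; near 0, no linear trend (though a nonlinear association may still exist, which is why the plots below also report Spearman's rank correlation).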
Question: How well does the line "fit" the data?
plot_w_cor = function(df, cont_x, cont_y, color) {
  enquo_x = enquo(cont_x)
  enquo_y = enquo(cont_y)
  enquo_color = enquo(color)
  tmpdat = df %>%
    summarize(x = mean(range(!!enquo_x, na.rm = TRUE)),
              y = max(!!enquo_y, na.rm = TRUE),
              cortext = paste("Pearson's r = ",
                              round(cor(!!enquo_x, !!enquo_y, use = "complete.obs"), digits = 2),
                              "\nSpearman's r = ",
                              round(cor(!!enquo_x, !!enquo_y, use = "complete.obs", method = "spearman"), digits = 2)))
  df %>% ggplot(aes_(x = enquo_x, y = enquo_y, color = enquo_color)) +
    geom_point() +
    stat_smooth(method = "lm", se = FALSE) +
    geom_text(data = tmpdat, aes(x = x, y = y, label = cortext), inherit.aes = FALSE, size = 4)
}
NHANES %>% plot_w_cor(Weight, BMI, Height)
NHANES %>% plot_w_cor(Height, BMI, Weight)
Now you can try to get a feel for what correlation (linear and non-linear) looks like. Try a few pairs:
(For fun sometime, play the "guess the correlation" game at guessthecorrelation.com)
cont_vars = c("Age", "Weight", "Height", "BMI", "BPSysAve", "BPDiaAve", "TotChol", "Pulse", "HHIncomeMid", "Testosterone")
selectInput("select_x", "X-axis", choices = cont_vars, selected = "Weight")
selectInput("select_y", "Y-axis", choices = cont_vars, selected = "BMI")
selectInput("select_color", "Color", choices = cont_vars, selected = "Height")
plotOutput("scatterPlotCont")
library(rlang)
# should wrap this into the package and not repeat like this
# need to figure out a better way to use quoted vs quosures
plot_w_cor_quotes = function(df, cont_x, cont_y, color) {
  enquo_x = quo(!!sym(cont_x))
  enquo_y = quo(!!sym(cont_y))
  enquo_color = quo(!!sym(color))
  tmpdat = df %>%
    summarize(x = mean(range(!!enquo_x, na.rm = TRUE)),
              y = max(!!enquo_y, na.rm = TRUE),
              cortext = paste("Pearson's r = ",
                              round(cor(!!enquo_x, !!enquo_y, use = "complete.obs"), digits = 2),
                              "\nSpearman's r = ",
                              round(cor(!!enquo_x, !!enquo_y, use = "complete.obs", method = "spearman"), digits = 2)))
  df %>% ggplot(aes_(x = enquo_x, y = enquo_y, color = enquo_color)) +
    geom_point() +
    stat_smooth(method = "lm", se = FALSE) +
    geom_text(data = tmpdat, aes(x = x, y = y, label = cortext), inherit.aes = FALSE, size = 4)
}
output$scatterPlotCont <- renderPlot(
  NHANES %>% plot_w_cor_quotes(input$select_x, input$select_y, input$select_color)
)
If you want to practice coding a scatter plot, try editing ggplot2
code below to show a scatter plot of Age
vs Height
, colored by Gender
:
# edit the ggplot code after x= and y= and color= to change the axes and the color
NHANES %>% ggplot(aes(x = Weight, y = BMI, color = Diabetes)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "lm", se = FALSE)
# edit the ggplot code after x= and y= and color= to change the axes and the color
NHANES %>% ggplot(aes(x = Age, y = Height, color = Gender)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = "lm", se = FALSE)
Now let's explore the association of BMI with a binary (yes/no) variable. How does BMI differ by Diabetes
status?
NHANES %>% ggplot(aes(x=Diabetes,y=BMI,color=Diabetes)) + geom_boxplot()
question("Is BMI associated with diabetes status?",
  answer("No, BMI is the same for people with or without diabetes.",
         message = "Not true, look at the average BMI for each group."),
  answer("Yes, BMI is different depending on diabetes status.",
         correct = TRUE,
         message = "Yes, BMI is on average higher for subjects with diabetes than without.")
)
We see that BMI is on average higher for subjects with diabetes than without. There is also a third category called NA
. This means the data is missing.
Let's explore the missingness in this data a bit before we move on.
We can summarize the data with missing diabetes status:
NHANES %>% filter(is.na(Diabetes)) %>% select(Diabetes, BMI, Age) %>% summary
Note that 137 out of 142 subjects with missing diabetes data also have missing BMI data. Also note the Age distribution. Does something look interesting?
NHANES %>% filter(is.na(Diabetes)) %>% select(Diabetes, BMI, Age) %>% janitor::tabyl(Age)
What about missingness in BMI? How does this relate to age? We can use the naniar
package in R to visualize this with a scatter plot:
NHANES %>% ggplot(aes(x=Age,y=BMI,color=Diabetes))+naniar::geom_miss_point()
We also might notice something interesting about the age distribution of our population.
question("Why do you think there's a pile-up of points at age 80?",
  answer("Everyone died at age 80.", message = "Highly unlikely."),
  answer("They oversampled from the 80 year olds.", message = "Highly unlikely."),
  answer("The data was truncated.", correct = TRUE,
         message = "Correct! But why? Probably to preserve de-identification. If we were to fit a model to find the association of age and height, how do you think those 80 year olds would influence our results?")
)
For future analyses, let's work with subjects older than 20 years of age, and remove subjects with missing Diabetes
and BMI
values. Lastly, let's check out the box plots on this filtered data.
nhanes_filtered = NHANES %>% filter(Age >= 20, !is.na(Diabetes), !is.na(BMI))
nrow(nhanes_filtered)
nhanes_filtered %>% ggplot(aes(x = Diabetes, y = BMI, color = Diabetes)) +
  geom_boxplot() +
  ggtitle("Box plot of NHANES BMI by Diabetes Status") +
  xlab("Filtered by Age >= 20")
The most common statistical test to compare the distribution of continuous variables across a binary (Yes/No) variable (like diabetes) is the two sample Student's T Test.
two-sample t-statistic = (difference in means) / (standard error of the difference), where the standard error of the difference is sqrt(pooled variance * (1/n1 + 1/n2))
The test is so named because the test statistic, which quantifies how different the means are relative to the variability in the data, follows a t-distribution. The t-distribution is like the normal bell-shaped curve but with "fatter tails"; as the degrees of freedom (which grow with the sample size) increase, the tails become closer to the Normal distribution.
xx = seq(-5, 5, by = 0.01)
plot(xx, dnorm(xx), type = "l", main = "Normal and T distribution functions", xlab = "x", ylab = "p(x)")
points(xx, dt(xx, df = 10), type = "l", col = "red")
points(xx, dt(xx, df = 3), type = "l", col = "blue")
points(xx, dt(xx, df = 100), type = "l", col = "green")
legend("topright", c("Normal(0,1)", "t df=3", "t df=10", "t df=100"), lty = 1, col = c("black", "blue", "red", "green"))
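The pooled two-sample t-statistic itself is straightforward to compute from group summaries. A sketch in Python, using made-up BMI-like values for two hypothetical groups (not NHANES data):

```python
# Sketch of the pooled two-sample t-statistic:
# t = (mean1 - mean2) / sqrt(s_p^2 * (1/n1 + 1/n2)),
# where s_p^2 is the pooled variance. Group values are made up.
import math

group1 = [31.0, 29.5, 33.2, 30.8, 32.5, 28.9]   # e.g. a "Diabetes = Yes" group
group2 = [26.1, 27.4, 25.3, 28.0, 26.8, 24.9]   # e.g. a "Diabetes = No" group

def mean(v):
    return sum(v) / len(v)

def var(v):
    # sample variance, with n - 1 in the denominator
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

n1, n2 = len(group1), len(group2)
# pooled variance: weighted average of the two sample variances
pooled_var = ((n1 - 1) * var(group1) + (n2 - 1) * var(group2)) / (n1 + n2 - 2)
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (mean(group1) - mean(group2)) / se
print(round(t, 2))
```

A large |t| (here well above 2) means the difference in means is big relative to its standard error, which in R's t.test() translates into a small p-value.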
The t-test makes assumptions about the distribution of your continuous variable within the two groups: the observations are independent of one another, and the variable is approximately normally distributed within each group. (The classic pooled-variance test also assumes equal variances in the two groups; R's t.test() uses the Welch version by default, which relaxes that assumption.)
Let's look at the smoothed histograms (density plots) of the two groups' BMIs.
tmp_summarydat = nhanes_filtered %>% group_by(Diabetes) %>% summarize(mean = mean(BMI))
nhanes_filtered %>% ggplot(aes(x = BMI, color = Diabetes)) +
  geom_density() +
  geom_vline(data = tmp_summarydat, aes(xintercept = mean, color = Diabetes), lty = 2)
Question: Do we think the assumptions of the t-test hold?
question("Do you think the assumptions of the t-test hold?",
  answer("No, not completely.", correct = TRUE,
         message = "The independence assumption holds, but the fact that our distribution is somewhat positively skewed actually violates the normality assumption of the t-test."),
  answer("Yes, completely.",
         message = "Not quite: the independence assumption holds, but the fact that our distribution is somewhat positively skewed actually violates the normality assumption of the t-test.")
)
However! The t-test is pretty robust to slight violations of the normality assumption, especially since we have a large sample size.
So, let's run a t-test (yay!) to assess the difference in means of BMI comparing diabetics and non-diabetics:
broom::tidy(t.test(BMI ~ Diabetes, data = NHANES)) %>%
  select(estimate:p.value) %>%
  mutate(p.value = as.character(signif(p.value, 2))) %>%
  knitr::kable(col.names = c("Difference in Means", "Means No", "Means Yes", "T Statistic", "P Value"), digits = 2) %>%
  kableExtra::kable_styling("striped", full_width = F)
Note the p-value is extremely small. This is because we have a very large sample size and the difference in means is pretty large.
What happens if we have a much smaller sample size? We can examine the effect of sample size by randomly sampling a subset of the data. Look at our test statistic and p-value, as well as the difference in means.
sliderInput("samplesize", label = "Total Sample Size", min = 10, max = 500, value = 500)
actionButton("buttonResample", "Take a Sample")
plotOutput("densityPlot1")
tableOutput("ttest1")
nhanes_filtered = NHANES %>% filter(Age >= 20, !is.na(Diabetes), !is.na(BMI))
subsetdata <- reactive({
  input$buttonResample
  nhanes_filtered %>% dplyr::sample_n(size = input$samplesize)
})
subsetdata_summary <- reactive({
  subsetdata() %>% group_by(Diabetes) %>% summarize(mean = mean(BMI))
})
output$densityPlot1 <- renderPlot(
  subsetdata() %>% filter(!is.na(Diabetes)) %>%
    ggplot(aes(x = BMI, color = Diabetes)) +
    geom_density() +
    geom_vline(data = subsetdata_summary(), aes(xintercept = mean, color = Diabetes), lty = 2)
)
output$ttest1 <- function() {
  broom::tidy(t.test(BMI ~ Diabetes, data = subsetdata())) %>%
    select(estimate:p.value) %>%
    mutate(p.value = as.character(signif(p.value, 2))) %>%
    knitr::kable(format = "html",
                 col.names = c("Difference in Means", "Means No", "Means Yes", "T Statistic", "P Value"),
                 digits = 2) %>%
    kableExtra::kable_styling("striped", full_width = F)
}
Change the outcome (the left-hand side of the ~, which is the continuous variable) to Height, and the binary factor (the right-hand side) to Gender, to test height differences in males vs. females.
t.test(BMI~Diabetes,data=nhanes_filtered) %>% broom::tidy() %>% kable(digits=3)
t.test(Height~Gender,data=nhanes_filtered) %>% broom::tidy() %>% kable(digits=3)
Also examine the histogram of Height
by Gender
(or try your own hypothesis!). Also see what happens when you change the data set from nhanes_filtered
to NHANES
(not filtered).
nhanes_filtered %>% ggplot(aes(x=BMI,color=Diabetes)) + geom_density()
nhanes_filtered %>% ggplot(aes(x=Height,color=Gender)) + geom_density()
What other factors might be associated with BMI?
Start with AllSubjects as your Factor to see the overall distribution, then compare across other categorical variables.
cat_vars = c("AllSubjects", "Gender", "Diabetes", "Smoke100n", "Race1", "Race3", "PhysActive", "TVHrsDay", "SleepTrouble", "AgeDecade")
selectInput("select_factor1", "Factor", choices = cat_vars, selected = "AllSubjects")
selectInput("select_cont1", "Continuous X-axis", choices = cont_vars)
plotOutput("pointPlot1")
plotOutput("boxPlot1")
nhanes_filtered$AllSubjects = "All"
output$pointPlot1 <- renderPlot(
  nhanes_filtered %>%
    ggplot(aes(x = get(input$select_cont1), y = BMI, color = get(input$select_factor1))) +
    geom_point(alpha = 0.1) +
    facet_wrap(~get(input$select_factor1)) +
    xlab(input$select_cont1) +
    geom_smooth(method = "lm") +
    scale_color_discrete(name = input$select_factor1)
)
output$boxPlot1 <- renderPlot(
  nhanes_filtered %>%
    ggplot(aes(x = get(input$select_factor1), y = BMI, color = get(input$select_factor1))) +
    geom_boxplot() +
    xlab("") +
    scale_color_discrete(name = input$select_factor1)
)
Often we believe our measure or outcome of interest (i.e. BMI) is associated with several other variables at once. For instance, BMI could be associated with diabetes, blood pressure, physical activity, diet, among many other characteristics.
Assumptions and model: the outcome is a linear function of the predictors, the observations are independent, and the errors have constant variance and are approximately normally distributed.
Mathematically, we are modeling our outcome $Y$ as a linear function of predictors $X_1$ and $X_2$ plus random error $\epsilon$:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$$
Suppose we want to determine if pulse rate is associated with BMI after adjusting for diabetes status.
We would fit a model:
$$BMI = \beta_0 + \beta_1 (Pulse) + \beta_2 (Diabetes = Yes) + \epsilon$$ and test whether the coefficient $\beta_1$ is equal to 0 or not.
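The coefficients in such a model are estimated by ordinary least squares. For the single-predictor case the estimates have a simple closed form, sketched here in Python with noiseless made-up data (not NHANES), so the fit recovers the generating coefficients exactly:

```python
# Sketch of ordinary least squares for a single predictor:
# beta1 = cov(x, y) / var(x), beta0 = mean(y) - beta1 * mean(x).
# Data are generated exactly from y = 2 + 0.5 * x (illustrative values),
# so the fit should recover those coefficients.

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0 + 0.5 * xi for xi in x]

n = len(x)
mx = sum(x) / n
my = sum(y) / n

# slope: sum of cross-products of deviations over sum of squared x-deviations
beta1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
# intercept: forces the fitted line through the point of means
beta0 = my - beta1 * mx
print(beta0, beta1)  # → 2.0 0.5
```

With several predictors (as in the model above) the same least-squares criterion is solved jointly for all coefficients, which is what R's lm() does.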
We can visualize this:
nhanes_filtered %>% ggplot(aes(x=Pulse,y=BMI,color=Diabetes,alpha=Diabetes)) + geom_point() + geom_smooth(method="lm")
We can run a linear regression in R
and view the output:
summary(fit1 <- lm(BMI~Pulse+Diabetes,data=nhanes_filtered))
We can interpret the coefficient for Pulse
as: for every one-unit (beat-per-minute) increase in pulse, we expect the average BMI to increase by r round(fit1$coef[2],3)
kg/m^2. The summary also reports an R-squared value, which measures how much of the variation in BMI the model explains. This idea of "how well can we predict our outcome" drives much of statistical modeling, including machine learning and predictive analytics.
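R-squared is easy to compute once you have fitted values: it is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch in Python (the observed and predicted values are made up for illustration):

```python
# Sketch of R-squared: the fraction of outcome variance explained by
# the model's fitted values. Numbers are illustrative, not from NHANES.
y     = [25.0, 27.0, 30.0, 32.0, 26.0]   # observed outcome
y_hat = [25.5, 26.5, 29.0, 31.5, 27.5]   # model predictions

my = sum(y) / len(y)
ss_tot = sum((yi - my) ** 2 for yi in y)            # total variation
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # → 0.882
```

An R-squared of 1 means perfect prediction; 0 means the model does no better than predicting the mean for everyone.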
We can include many predictors in our model and still not have a very good R-squared, as BMI is a very complex trait:
summary(fit1 <- lm(BMI~PhysActive+SmokeNow+Gender+Race3+BPSysAve+BPDiaAve+Pulse+Diabetes,data=nhanes_filtered))
What if we tried to predict BMI with weight?
summary(fit1 <- lm(BMI~Weight+Diabetes,data=nhanes_filtered))
nhanes_filtered %>% ggplot(aes(x=Weight,y=BMI,color=Diabetes,alpha=Diabetes)) + geom_point() + geom_smooth(method="lm")
Modify the R code in the lm()
function to build your own linear regression model. Here are the variable names in the NHANES data set:
names(nhanes_filtered)
nhanes_filtered = NHANES %>% filter(Age >= 20, !is.na(Diabetes), !is.na(BMI))
# edit the code in the formula by adding in variable names
summary(fit <- lm(BMI ~ Pulse + Diabetes, data = nhanes_filtered))
nhanes_filtered = NHANES %>% filter(Age >= 20, !is.na(Diabetes), !is.na(BMI))
# edit the ggplot code after x= and y= and color= to change the axes and the color
nhanes_filtered %>% ggplot(aes(x = Pulse, y = BMI, color = Diabetes)) +
  geom_point() +
  geom_smooth(method = "lm")
To learn more R coding and a bit more about EDA and statistical analysis, try out our Data Camp course:
Peng, Roger D. Exploratory Data Analysis with R. Leanpub.
Peng, Roger D. and Matsui, E. The Art of Data Science. Leanpub.
Please fill out our post-session survey.