Introduction

The Sha package is used for analysing sleep duration and other atrributes related to life health. The name Sha is the abbreviation of Sleep and Health Analysis. We may have curious about weather sleep duration will be influenced by the weight, tempreture, activities, day of the week, and heart rate. Intuitively, sleep time and the other attributes are not that correlated. But, we could use this package to conduct detail analysis to answer intereting questions, which may be counter intuitive. Additionally, I also wonder wheather having pet or not will affect the other features. Basicly, the activities will increase correspondingly by having a dog as you will have to do dog walking at least once a day. We can also conduct some interesting analysis to see how your pet will influence on your life, such as sleeping duration, body weight.

Description of Original Dataset

In order to discover the attributes that may have some associate with our personal sleep time, I choose to collect of health data of myself via 'googlesheets', which I made it published so you can fetch this dataset via "googlesheets" package. I believe that the attributes set well represent my interets on the research question. The measurement of the attributes are also important, which will be detail mentioned below.

library("googlesheets")
googlekey <- gs_url("https://docs.google.com/spreadsheets/d/1KdOw0FIxY2Kt8UgY16F3UMQSC7KJaBgtmewNZUGuLY8/edit#gid=0",lookup = FALSE,visibility = "public")
mydata <- gs_read(googlekey, skip = 3)
knitr::kable(head(mydata), format = "markdown", align = 'c')
no_of_col <- dim(mydata)[2]
var_names <- colnames(mydata)

The collected dataset contains r no_of_col variables related to one individual's sleep and health data and recoded each observation daily. The following is the variable explanation and how I measure them:

  1. r var_names[1]: in format of month/date/year started from Sep.11 2017
  2. r var_names[2]: the name of the day during a week (Mon,Tue,Wed,Thu,Fri,Sat,Sun)
  3. r var_names[3]: body weight in kg measured at erery morning when wake up using the same weighing machine
  4. r var_names[4]: total sleep time (hour) per night measured by "Sleep Cycle"" app on iphone
  5. r var_names[5]: daily average tempreture in Celsius from website https://www.wunderground.com/history/airport/KROC/2013/12/17/MonthlyHistory.html
  6. r var_names[6]: a measurement of activity in steps via iphone 'health' app
  7. r var_names[7]: heart rate in beats per minute (bpm) measured every morning when wake up via "Sleep Cycle"" app on iphone
  8. r var_names[8]: weather you have dog or not

Data Cleaning

Generate a New Variable from column "DayofWeek"

Intuitively, weekday should have similar pattern on sleep duration compared to that of weekend. So we could generate a new variable name as "DayType" to classify each day as weekday or weekend. The more convinient way is to use library tidyverse as showing below.

library(tidyverse)
mydata <- mydata %>% 
  mutate(DayType = ifelse(DayofWeek %in% c("Sat", "Sun"), "Weekend", "Weekday")) %>% 
  select(-DayofWeek) 
#assign the variable name with units of measurement
sch <- c("Date", "Weight(kg)", "Sleep Duration(hr)", "Tempreture(°C)", "Activity(steps)", 
         "Heart Rate(bpm)","Having Dog", "Day Type")
knitr::kable(head(mydata, 10), col.names = sch, align = "c")

Now, the following analysis are based on this dataset which is named as mydata.

Explotory Data Analysis

Basic Statistics and useful information

Here, we are going to use "tidyverse" library to manupilate the "mydata" dataset and get the useful information.

n <- mydata %>%
  summarize(n = n())
firstday <- mydata$Date[1]
lastday <- mydata$Date[length(mydata$Date)]  
avg_st <- round(mean(mydata$SleepDuration),2)
percent_avg_st <- round(avg_st/24,4)*100

There are r n observations in the mydata dataset recoded from the date of r firstday to r lastday. The average of my sleep time per day is approximately r avg_st hour, which account for r percent_avg_st % time a day.

count2 <- mydata %>% 
  group_by(DayType) %>%
  tally()

During this period, there are r count2$n[1] weekdays, r count2$n[2] weekends. We are going to find out the average sleep duration on weekday and on weekend and arrange them to see which group has longer sleep duration.

sleep_by_day <- mydata %>% 
  group_by(DayType) %>% 
  summarise(avg_sleep = mean (SleepDuration)) %>% 
  mutate(avg_sleep = round (avg_sleep, 2)) %>% 
  arrange(desc(avg_sleep))

The average of sleep duration per day in weekend is r sleep_by_day$avg_sleep[1] hour, which is longer than the weekday sleep time on average, which is r sleep_by_day$avg_sleep[2] hour based on our existing data. It sounds very natually as I had no classes on weekend, so I may get up late.

# Import the package *sha*
library(sha)

Box plot

We could also make box plot to visulise the differences of sleep duration between weekday and weekend by using plot_box function. It will also show you how disperse the data is distributed , how skew the distribution may be, and also tell you the outliers in the dataset.

plot_box(mydata, x = "DayType", y= "SleepDuration")

From the above boxplot, it tell us that the average sleep time by day in weekday is less than that in weekend, and there is one outlier in weekend group. I will not remove this outlier from the dataset, as I think this value represent my sleep pattern in some days.

To make the catergorical variables as factor variables for following analysis.

mydata$DayType <- as.factor(mydata$DayType)
mydata$HavingDog <- as.factor(mydata$HavingDog)

Scatter plot

You can use plot_scatter function to make scatter plot on two continue variables by group to check if they have linear relationship and how does the relationship accross differnty groups. The below graph shows us the relationship between seelp duration and activity level group by pet condition. It seems that there is very strong negative correlation between sleep duration and avtivity when I did not have a dog. Once I have dog, it seems there is no assosiation between my sleep time and activities. It is interesting as it shows that my activity steps has no linear relationship with sleep duration after I have a dog. Maybe because I have to do dog walking in the morning which make my sleep duration less variation.

plot_scatter(data = mydata,x = "Activity",y = "SleepDuration", color = "HavingDog")

Here is another example for looking at the relationship between activity steps and sleep duation. You can use the plot_scatter function to make the plots on any two of the continue variables in the dataset, and grouped by a categorical vatiable.

plot_scatter(data = mydata,x = "Weight",y = "HeartRate", color = "DayType")

From the above scatter plot, it seems that there is a positive linear relationship between weight and heart rate. And the slopes of weekday and weekend is very similar.

Checking statistically significantly mean difference across the groups

Now let's check weather there is statistically significantly difference on average sleep duration between weekday and weekend. We are using oneway-anova test so it is made strong assumption that the dependent variable observation is normal distributed and do not have outliers.

H0: The mean of sleep duration between weekday and weekend are equal

H1: The mean between the two groups are not equal

summary(aov(SleepDuration ~ DayType, data = mydata ))

From the above testing results, the p-value for this test is far less than the 0.05 significant level. So we reject the null hypothesis and conclude that there is significant difference in the average sleep duration between weekday and weekend under 95% confidence.

We again did the oneway-anova test on the sleep duration between having a dog and not having a dog.

H0: The mean of sleep duration between having a dog or not are equal

H1: The mean between the two groups are not equal

summary(aov(SleepDuration ~ HavingDog, data = mydata))

Since the p-value is larger than the 0.05 significant level, we fail to reject the null hypotheis that the mean of sleep duration between having a dog or not having a dog are equal. So there is no significant differences of sleep duration on weather having a dog or not on average under given data and evidence , with 95 % confidence.

I also wonder weather having a dog or not will lead to mean differences on my activity level.

summary(aov(Activity~ HavingDog, data = mydata))

The p-value is far less than 0.05, which means there is siginifiant difference of activity steps on average weather having a dog or not.

Correlation Matrix Plot

In the dataset, the weight , SleepDuration, Tempreture, Activity and Heart Rate are continues variables. So we can do a multiple correlation analysis on those continues variables to check the pair relationship. We can use the function cor_plot in "sha" package to calculate all the pair correlation among all the continues variable as cor_plot will automatically detect the continues variables and visulize the results in a plot. (see below)

cor_plot(mydata)

Additionally, we can use tidyverse library to calculate the correlation between two continues variables by set more condtions. For example, we are interested in the correlation between the sleep duration and the tempreture separately for each day type only for having a dog condition, and return a sorted list of those correlation coeffients rounded to two digits.

ans <- mydata %>% 
  filter(HavingDog == "Yes") %>% 
  group_by(DayType) %>% 
  summarise(r = cor(SleepDuration, Tempreture)) %>% 
  mutate(r = round(r, 2)) %>% 
  arrange(desc(r))

knitr::kable(ans, format = "markdown", align = "l")  

So the results is quite interesting as there is positive correlation between sleep duration and tempreture on weekend and there is negative correlation between them on weekday when I have a dog. However, both of the correlation coefficent are close to zero. So such association may not siginificant exist.

Modelling

Simple Linear Regression Model

According to the above correlation results, we could conduct a simple linear regression models to quantify the association between heart rate and tempreture.

slm_model <- slm_s(mydata, x = "Tempreture", y="HeartRate")
slm_model
modelplot(mydata, x = "Tempreture", y="HeartRate")

The equation of the above linear model can be constructed from the output:

Predicted Heart Rate = r round(coef(slm_model)[1], 2) + (r round(coef(slm_model)[2],2)) * Tempreture

The interpretation for the slope coefficient: With a unit celsius increase in tempreture,the heart rate, on average, decrease by approximately 0.5 bpm.

The interpretation for the intercept coefficient: When the tempreture is 0,the heart rate, on average, is 87 bpm.

There is an approximation of 8.817 the residuas tend to deviate around the regression line.

However, only approximately 7% of the variability in the heart rate variable is explained by the tempreture variable.

The intercept is extremely significant, while the slop p value is larger than 5% significant level, So the overall model is not significant.

We can further check the model assumption by Q-Q plot to see if the data are normal distributed.

qqnorm(slm_model$residuals)
qqline(slm_model$residuals)

It seemes that the data are not follow normal distribution, which violte the the Normality assumption. And the tail of the distribution has a lot of outliers, which may have huge impact on the shape of the distribution.

Multiple Linear Regression Model

Create a scatter matrix plot on all the continues variables.

panel.cor <- function(x, y, digits = 2, cex.cor, ...)
{
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  # correlation coefficient
  r <- cor(x, y)
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste("r= ", txt, sep = "")
  text(0.5, 0.6, txt)

  # p-value calculation
  p <- cor.test(x, y)$p.value
  txt2 <- format(c(p, 0.123456789), digits = digits)[1]
  txt2 <- paste("p= ", txt2, sep = "")
  if(p<0.05) txt2 <- paste("p= ", "<0.05", sep = "")
  text(0.5, 0.4, txt2)
}
par(ps = 12, cex = 1, cex.main = 1)
pairs(mydata %>% select(SleepDuration, Weight, Tempreture, Activity, HeartRate), upper.panel = panel.cor,
      pch=16, col="blue", main="Matrix Scatterplot of SleepDuration, Weight, Tempreture, Activity and HeartRate", labels = c("Sleep Duration(hr)","Weight(kg)","Tempreture(°C)","Activity(steps)","Heart Rate(bpm)"))

From the above graph, the variables seems that corrlated to each other, so it may arise multicollinearity problem. Also, the correlation between sleep duration and acitivity by steps are significant, while the rest of variables with sleep duration correlation is not siginifiicant.

Creating a multiple linear regression model included all variables except the date, becasue the date is a distinct identifier, like index.

ml.saturated = lm(SleepDuration ~  Weight + Tempreture + HeartRate + Activity + DayType + HavingDog, 
                  data= mydata)
mlr_sum <- summary(ml.saturated)
mlr_sum
knitr::kable(mlr_sum $coef, digits = c(4, 4, 3, 4), format = 'markdown')

The regression equation is :

SleepDuration = 18 + (-0.22) * Weight + 0.03 * Tempreture - 0.0001 * Activity - 0.0029 * HeartRate + 1.55 * Weekend + 1.16 * HavingDog(Yes)

Interpretion of the above equation under our problem context :

Although some of p-value of the attributes are less than 0.05, the overall regression is significant as the p-value is far less than 0.05.

The RSE is about 1.455, which is an estimate of the average deviation of the observations around the regression line.

More importanly, the adjusted R square is 0.2446, meaning that approximately 25% of the variation in sleep duration is accounted for by the variables in our model, which indicates that the model is kind of poor in predicating the duration of sleep. Maybe, we should investigate the assumptions of the model.

plot(ml.saturated, which = c(1,2))

According to the Q-Q plots, we will see that the data are follow the normal distributed. From the residual plot, we will also believe the data is normally distributed and it is independent. However, there is three outliers at the tail of the distribution which may have greate impact on the shape of the distributiopn.

Conclusion

From the above study analysis on my collected sleep and health dataset, we would conclude that the sleep time per day has little association with my body weight, tempreture and heart rate. While the average sleep duration has negative linear relationship with activity by steps. There is significant difference on average sleep duration between having a dog or not. So does the day type. The analysis tells me that my sleep duration in weekend is larger than that in weekday; having a dog will increase my sleep duration as well. Additionally, my body weight has negative relationship with tempreture; it has positive relationship with heart rate and acitivity by steps. Tempreture and heart rate also has negative realtionship, which means the tempreture getting lower, my heart rate will increase. Last, my body weight and activity by steps are positivie associated, whcih I found is very counter intuitive.

Limitation

My whole analysis try to answer the question of does other health attribute will affect my sleep. I use the bed time represent my daily sleep as estimator. And I also collect the other data using as health attribute's estimators. In the later part of the research, I added a new feature about the pet condition. But the problem is I only use having dog or not as the estimator to represent the pet condition, which is biased or not good to represent as population interets. Another problem is the sleep time I use iphone app to record, which is not that accurate, as I did not fall asleep immediately. So how to define the sleep time. Or maybe I also should consider to normalise the data by make it as a propotion of 24 hour.To check how the propotion of daily sleep time vary among the other attributes. The validity problem of the data also happend on the activity. I use iphone to record my walking steps. But sometimes, I may forget to bring the iphone with me when I go to walk, so it may lead to outlier values.

If I could do it again, I may consider to record the following attribute: the time I go to bed; the time I wake up; the time I am in deep sleep; generate a variable to measure my sleep quality.

I may consider to record my puppy information as well to see how pet will affect my life, such as dog body weight, the time of dog walking, the brand of food she is taking and how much did she eat per day.

sessionInfo()


poorbaby/r_package_sha documentation built on May 29, 2019, 7:38 a.m.