Introduction

In this vignette, we will demonstrate how to use the package, weightprog. This package will use data from the smartphone "Pillow" and "Health" app that provides sleep cycle and activity level statistics. The objective of the package is to allow easy visualization and analysis for comparisons of any statistical changes in the hours of sleep gotten as the semester progresses. Specifically, the package's illustrations include a multitude of two-way graphs that evaluate the underlying trend in different ways. Each different type of analysis will be described in the following secctions.

Description

When provided with weight changes along with data on sleep cycle and activity levels, the conventional analysis resorts to regression and t-test analysis. Simple ways to analyze such data would be to provide graphics through scatterplots but provide minimal information. Instead, to assess the progressions of weight, we developed weightprog which specifically provides analysis through visualization by introducing a third variable, known as our objective variable, and presenting progressions of our objective variable without needing to refer to a three dimensional graph.

To understand where weight fluctuations stem from, we specifically track any weight changes while simultaneously keeping track of daily activities such as walking and sleep cycles. In addition, to fully comprehend whether activity and sleep levels are associated with daily weight changes, weightprog uses multiple graphical estimators. For starters, the visualization graphics, embeds ggplots that specifically uses color and sizes as an aid to enhance basic two way graphs. Such basic twoway graphs include scatterplots, lineplots, and strip plots that help estimate the changes in weight by color categorization for any relationship between two variables. For more in depth analysis, visualization graphics specifically aim to observe distinctions between groups through pattern recognition. The inclusion of cluster and linear discriminant analysis provide different types of data categorization that allow us to recognize specific trends in the data the basic twoway graphs fails to capture. From manipulation of size and color to concrete identification in data, these types of visualizations will allow the attributes of a third variable to be assessed while contained in a two-dimensional graph.

For a more in-depth explanation of the data set, please refer to the Usage section.

Usage

There will be one primary data set that will be included in the package. The data includes an individual's weight progression with data on sleep cycles (i.e. hours of sleep, percent in deep sleep, etc.) and natural daily exercises (i.e. distance walked, steps in a day). The specific focus is on daily progression of weight, thus, the marginal change in weight from one day to the next will be the focal point of all visualization and analysis. First, we will load the package and the data. The data is imported from a publicized google document sheet.

library('weightprog')
library("googlesheets")
mykey <- gs_url("https://docs.google.com/spreadsheets/d/10KwAd2PL4FXbNrWkhLstzeuKvR9NYhfvegtPdjfELcM/edit?usp=sharing", 
                lookup=FALSE, visibility = "public")
hello <- gs_read(mykey)

Before proceeding with the visualizations, we will first need to find a presentable objective variable that will be the focal point of our statistical analysis. The objective variable chosen for the package is the marginal weight change from one day to the next. To determine the relative size of such changes, we will take the absolute value of the marginal weight change, and also determine whether such changes were a loss, gain, or maintain in weight. There will be further explanation on what we mean by objective variable when we introduce the data in the package.

Add Absolute Weight and Weight Change Groups

hello$absmargweight <- abs(hello$Marginal_Weight)
hello$posnegmarg <- c("Loss","Maintain","Gain")[sign(hello$Marginal_Weight)+2]
colnames(hello)[colnames(hello)=="absmargweight"] <- "Marginal_Change"
colnames(hello)[colnames(hello)=="posnegmarg"] <- "Class"

We will also have to change the format of the date in order to fit the needs of our visualization functions along with organizing the days of the week to follow the sequential order. This is a quick fix:

Changing Format of Date and Organizing Days of Week

dd = as.Date(hello$Date, format = "%m/%d/%Y")
hello$Day <- factor(hello$Day, levels=c("Monday", "Tuesday", 
                                        "Wednesday", "Thursday", 
                                        "Friday", "Saturday", "Sunday"))

Introduction to Package Data

Before introducing the functions in the package, let's take a closer look at the data that is built into the package. Since the statistics consist of activity, measured only by distance walked, functions will always use distance as the primary measurement. As for the sleep cycle, there are numerous categories of categories, and to take look at what each of the categories mean, we will first look at the summarization of the means for our sleep cycle statistics. We look at the time asleep, time in bed, time in deep sleep, time in light sleep, and time till falling asleep. The error bars denote the stadard error for each time statistic for that specific day. Since the mean does not tell us much about which category can be used as a primary measurement, we can proceed to using a linear regression and ANOVA.

library(plyr)
library(tidyverse)
library(tidyr)
base <- hello %>%
  mutate(Time_Deep = Deep_Sleep * Time_Asleep / 100) %>%
  mutate(Time_Light = Light_Sleep * Time_Asleep / 100) %>% 
  select("Day", "Time_Bed", "Time_Asleep", "Time_Sleep", "Time_Deep", "Time_Light") 
base <- gather(base, Time_Asleep, value, -Day) %>% 
        setNames(., c("Day", "Time", "value")) 

summarybase <- ddply(base, c("Day", "Time"), summarise,
               N    = length(value),
               mean = mean(value),
               sd   = sd(value),
               se   = sd / sqrt(N)
)

ggplot(summarybase, aes(Day, mean, fill = Time)) +   
geom_bar(position = position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se), width=.3, size = 0.3, 
              position = position_dodge(0.9)) +
theme(axis.text.x=element_text(angle=45,hjust=1)) +
  labs(title = "Average Time Statistics by Day of Week", x = "Day of Week", 
       y = "Hours", fill = "Time Statistic",  caption = "Error Bars show Std. Error") +
  scale_fill_manual(labels = c("Time Asleep", "Time in Bed", "Time in Deep Sleep", 
                               "Time in Light Sleep", "Time to Fall Asleep"), 
                    values = c("blue", "red", "green", "purple", "yellow"))

Choosing our Primary Variables

With plenty of variables that seem to overlap and express similar characteristics, we will specifically run a multiple regression of marginal weight change on the distance, time asleep, and time in deep sleep and light sleep. When we run a linear model through these variables, the only significant variable is that of the time in deep sleep. Hence, we will specifically use deep sleep as our primary measurement for our sleep cycle (occasionally, we will use time asleep because it is a more general measurement). Likewise, when we run an ANOVA on the change in weight with variables of distance walked (measuring activity level) and time in deep sleep (measuring sleep cycle), we observe that deep sleep is on the borderline of being a statistically significant variable. Therefore, there is convincing evidence of a statistically significant difference in the time of deep sleep amongst changes in weight. Therefore, the two primary measurements that will be utilized will be distance for activity level, and deep sleep (can be in either hours or percentage), along with the occasional use of time asleep as a more general measurement. (Note: For future changes, we will will surpress the output for the linear regression and ANOVA.)

base2 <- hello %>%
  mutate(Time_Deep = Deep_Sleep * Time_Asleep / 100) %>%
  mutate(Time_Light = Light_Sleep * Time_Asleep / 100) %>% 
  select("Distance","Time_Bed", "Time_Asleep", "Time_Deep", "Time_Light", "Marginal_Weight") 

fit <- lm(Marginal_Weight ~ Distance + Time_Asleep + Time_Deep + 
                            Time_Light + Time_Bed, data=base2)
library(pander)
pander(fit)

weightprog1 = aov(Marginal_Weight ~ Distance*Time_Deep, data = base2)
pander(summary(weightprog1))

Understanding the Objective Variable

With the primary variables chosen to represent activity level and sleep cycle, we can now shift our focus to seeing if specific characteristics in distance and time in deep sleep seem to result in some type of pattern for the changes in weight. In this case, the variable of focus will be called the objective variable, consisting of the marginal changes in weight, with subgroups (called Class) of "Gain", "Loss", and "Maintain" in weight. Here is a scatterplot that shows a general overview of the two primary variables by the objective variable's subgroups:

intro <- hello %>%
  mutate(Time_Deep = Deep_Sleep * Time_Asleep / 100) %>%
  select("Distance", "Time_Deep", "Class") %>% ggplot(aes(x=Distance, y=Time_Deep, col = Class)) +
  geom_point() + xlab("Distance (km)") + ylab("Time in Deep Sleep (hrs)") +
  facet_wrap(~Class , ncol=1, scales="fixed", strip.position="right") 
intro

Once we have gotten a better understanding of the built-in dataset in the package, we may proceed to the Experimental Design utilizing these variables.

Experimental Design

As mentioned previously above in the introduction to the data, we specifically look at how varying activity and sleep levels affect any weight changes over time. We will collect data by utilizing the "Health" application on the iPhone, which will automatically track the number of steps and distance walked today, along with the "Pillow" application that will track various sleep variables. To track any weight changes, a standard weighing scale will be used in order to weigh myself every morning after waking up.

First and foremost, with this type of experimental design, inference cannot be applied to a population, given that it is data from one individual, limiting any statistical inference to myself. In addition, the experimental design contains a low level of reliability in the data. Specifically, the measurement from the applications are by no means accurate, because it may not be an well-validated reflection of distance and sleep, but it is the closest we can get given that the phone application is only a tool of estimating our variables. This provides us the closest way to see whether or not distance walked, and sleep patterns play a role in any possible weight changes when limiting the observations to daily activities. However, we would have liked to include even more information that would have helped provide an even more convincing case for weight changes over time. Out of the many confounders, there exists confounders such as food intake (i.e calories) and exercise outside of walking that are not captured in the data. These potential confounders are very likely to play a role in weight changes. Since there is practically no aspect of the experiment that involves any treatment, we cannot blind or randomize anything without the participant (myself) knowing.

Example Features

Two Way Scatter Feature

There are two different graphs within the two way scatter feature, twoway_scatterg(), and twoway_scatterc(). Both of these functions utilize a simple scatter plot between two variables of interest, and adds an additional layer for the objective variable. The two way scatter of objective variable by groups is the first of two functions, specifically aimed at using the objective variable's data points' color and size to provide information about the attributes. In addition, it runs a linear model through the specific subgroups of the objective variable. Meanwhile, the latter of the two functions, two way scatter by color, only displays any statistical changes in the objective variable through a sequential color palette.

library(ggplot2)
#source('/Users/Benjamin Hsu/Desktop/weightprog/R/twoway_scatter.R')
tws_plot <- twoway_scatterg(hello, hello$Distance, hello$Time_Asleep,
                            hello$Marginal_Change, hello$Class,
                            "Distance Walked (km)", "Time Asleep (hrs)", 
                            "Change in Weight", "Class")
tws_plot
tws_plot2 <- twoway_scatterc(hello, hello$Time_Asleep, hello$Deep_Sleep,
                  hello$Marginal_Weight, 16, 3,
                  "Time Asleep (hrs)", "Deep Sleep (%)",
                  "Change in Weight")
tws_plot2

For the data inside the package, the objective variable is the marginal weight, specifically focused on the magnitude (size) of the change in marginal weight, and whether or not such changes were gain, loss or maintain in weight (color). Both scatter plots have the base scatterplot, for twoway_scatterg(), it was Time Asleep vs. Distance Walked, and for twoway_scatterc(), it was %Deep Sleep vs. Time Asleep.

The speculation for the grouped scatter plot was that the more distance walked during the day, the more sleep an individual will need. For the two way scatter of the objective variable by groups, all three subgroups: gain, loss, and maintain in weight showed a poor linear fit. The general pattern consists of losses in weight (green), gain in weight (red), and maintain in weight (blue) randomly dispersed about the plot. Both these scatterplots result in the conclusion that there does not seem to be much pattern amongst Time Asleep vs. Distance, as well as %Deep Sleep vs. Time Asleep.

Two Way Strip Plot

The strip plot is similar to the simplified two way scatter plot, because the points' color denotes the attribute (i.e. day of the week, species, etc.) of y variable chosen. The strip plot provides a better overview of the data's patterns within a subgroup, while also providing a lineplot of the estimate mean of each group.

#source('/Users/Benjamin Hsu/Desktop/weightprog/R/twoway_strip.R')
twoway_strip(hello, hello$Day, hello$Distance, " ", 
             hello$Day, "Day of Week", "Distance Walked (km)")

We originally speculated that for different days of the week, we might travel a farther distance, which might affect the individual's marginal changes in weight. In the data, we specifically looked at the individual's activity level measured by distance walked in kilometers depending on the day of the week. We specifically see that the mean of distance walked seem to be higher on weekdays, while the weekends seem to have significantly lower activity levels.

Two Way Lineplot

There are two different two way lineplots that can are generated, twoway_lineplot() and twoway_lineplotf(). The first is a connected scatterplot of a y variable observed over a time period. This allows us to simply track the progression of a variable over time. Along with the lineplot are single points that are shown to be the mean of the variable by each week. The second lineplot is used for facetting by the specific characteristic of the objective variable. The plot creates a scatterplot and runs a linear regression model through each of the different subgroups (gain, loss, maintain) to identify any possible trends in the data. Both lineplots shows a basic summary of the statistical progression of the variable of interest.

require(reshape2)
base3 <- dcast(hello, Day ~ Week_Number, value.var="Weight")
base3m <- t(data.frame(colMeans(base3[2:14], na.rm=TRUE)))
rownames(base3m)[1] <- "Weight by Week Number"
pander(base3m) #Mean Weight of Each Week (Week_Average)
#source('/Users/Benjamin Hsu/Desktop/weightprog/R/twoway_lineplot.R')
twl <- twoway_lineplot(hello, date = hello$Day_Number, yname = hello$Weight, 
                xlabel = "Day Number", ylabel = "Weight Over Time", ymean = hello$Week_Average, 
                ymlabel = "Average Weight per Week", "Weight Statistic")
twl

twl2 <- twoway_lineplotf(hello, xname = "Distance", yname = "Deep_Sleep",
                 bycolor = "Class", bywrap = "Day", 
                 "Distance (km)", "Deep Sleep (%)")
twl2

The lineplot for the data used simply shows the progression of the individual's weight over time. We see a rather constant fluctuation in the weight, between the range of approximately 79 to 82 kilograms. We cannot say much about the overall trend of the individual's weight, but we do see that there was a small decrease in weight compared to the start of the data-tracking.

The second lineplot is facetted by the day of the week, and produces a linear regression model for each of the subgroups of the objective variable (that is: loss, gain, and maintain). Each day will present a linear fit of the trend between the Distance travelled and %Deep Sleep. By looking at the relationship bewteen distance walked and the percent of deep sleep per day, based on weight changes, we can better understand on which days are we more likely to gain, lose or maintain weight. Not much conclusive interpretation can be done on this, because there seems to be a variety of linear trends within each day's subgroups. This lineplot only allows for a basic observation of each day's progress of weight changes, and how this progression changes with relation to distance and deep sleep percentage.

K-Means Clustering

The k-means clustering allows for us to find groups that have not been explicitly defined in the data, specifically through two variables of interest. The function creates k centroids, allowing for k clusters to be created, while plotting all these different k-means clusters to allow for a decision on how many clusters to use. In addition, this allows for a confirmation of whether or not the objective variables' subgroups really have a cluster. Though this is a less explicit way of answering whether there is a relationship between two variables according to the objective variable, it allows us to observe almost all possibilities, from small to large amounts of clusters to confirm previous findings. (Note: this k-means is modified from the internet)

ggplot(hello, aes(x = hello$Distance, y = hello$Deep_Sleep)) +
  geom_point(aes(color = factor(hello$Class))) + xlab("Distance (km)") +
  ylab("% Deep Sleep") + guides(color = guide_legend(title = "Subgroups", title.position = "top"))

twoway_kmeans(hello$Distance, hello$Deep_Sleep, 5, 
              "Distance (km)", "Deep Sleep (%)", "Cluster(s) by Color")

To further see if an individual's weight changes is affected by the speculation of a relationship between distance walked and percent in deep sleep, the function can confirm how these weight changes are clustered. The scatter plot is put above the K-means clustering, so a comparison between how the clusters fare with the real data's subgroups can easily be made. From the data, we would theoretically say that higher activity levels might induce a more tired state subsequently higher percentage of deep sleep, which can then split into the subgroups of our objective variable. For gains in weight, we would speculate shorter distance (low activity), and losses in weight with longer distance (high activity), which would mean that for each subgroup, there should be a cluster.

Ideally, we would only need three clusters, however in the function, we do not see any clusters with a consistent subgroup, since the class (gain/loss/maintain) in each cluster seem to vary. From the scatterplot, we can see that the data itself has quite a few points for gain in weight from approximately 70 to 80% deep sleep. When we do look at the three clusters created, the cluster at the top does seem to consist of primarily gain, but with plenty of other classes as well. However, the other two clusters from the 3 clusters created, seem to be incorrectly clustered, since the data is in fact randomly dispersed about the plot. When we observe two clusters only, the bottom of the two clusters encounters the same problem, where the data is randomly scattered and cannot seem to be assigned to any cluster.

While there will always be mis-classified data, our data does not seem to have a specific subgroup that has a unified characteristic, thus we cannot generalize much about subgroups any of the classes of weight.

Linear Discriminant Analysis

The linear discriminant analysis (LDA) allows a more definitive pattern recognition method in comparison to cluster analysis. This type of machine learning focuses on finding a linear combination of features that help separate the groups as far as possible. In addition, linear discriminant analysis might be a better way compared to that of clustered analysis because instead of determining a pattern for grouping certain variables, LDA allows us to utilize the property that the membership to each classification is established already. To use the linear discriminant analysis, the underlying assumption here is that the data is normally distributed, and each of the subgroups have equivalent variances.

For the package data, the goal of using linear discriminant analysis was to find a linear combination for the selected variables that will provide the largest distinction from one group to the next. The variables used to provide the best separation for the class subgroups (gain/loss/maintain) included distance, time asleep, and percentage in deep and light sleep. For each of the two discriminant functions, a histogram will be used to provide visualization of each class' separation. Along with the histogram, a scatterplot of the first and second discriminant function's values for the classes will be provided. Both the histograms and scatterplots will provide visualizations to pinpoint the presence of any characteristic difference for each class.

library(broom)
twoway_lda(hello$Class, hello$Distance, hello$Time_Asleep,
           hello$Deep_Sleep, hello$Light_Sleep, hello, hello$Class)

In both the first and second discriminant function, none of the class seem to be distinctly separate from the other classes. From just these two histgrams we could say with confidence that there is not any distinct characteristics in the subgroups that have been picked up by LDA. When we continue and take a look at the scatterplot of the two discriminant functions, we observe that none of the groups seem to be separate from one another. Thus, there does not seem to exist any type of separation between the gain, loss, and maintain group with regards to the characteristics used in distance, time asleep, and percent in deep and light sleep.

Results

Since our main goal is to see if there seems to be an explanation for the marginal gain, loss and maintain in weight with respect to the variables reflecting activity and sleep levels, we will summarize the results obtained from above. We begin with some basic summarization graphics: the strip-plot illustrates that the activity level on weekends tend to be significantly lower than weekdays. Meanwhile, the lineplot also illustrates that there tends to be a more prevalent gain in weight on the weekends. Both of these visualizations only allow simple summarization, but no significant advances in answering our question.

To answer our question of interest, we look at the two-way scatter along with cluster and linear discriminant analysis. From our two way scatter feature, we see a randomized trend for the different subgroups for the objective variable. There does not seem to be any specific values of deep sleep and distance walked that seem produce a prevalent pattern for changes in weight for both the size and color utilized. We proceed to the cluster analysis and see that ideally we would need three clusters for each of the subgroups in our class, but this does not seem to provide much distinction between each group. Instead, it seems like such division in the cluster provides an arbitrary clustering that will not help identify specific traits of deep sleep and distance that chracterize the subgroups. To help with maximizing the separation between each group, we resort to the linear discriminant analysis. This allows for a more definitive pattern recognition method based on finding an optimal linear combination that will separate each subgroup. When we look at the histogram for both linear combinations, there does not seem to be any distinction between the groups as they are randomly dispersed with overlapping areas. By plotting the linear combinations, we also see that such subgroups seem to be randomly dispersed, without any distinct pattern recognized. Thus, we conclude that through our visualizations, we are convinced that there does not seem to be any specific values of deep sleep and distance amongst the different subgroups for the marginal change in weight.

Limitations and Future Analysis

With a relatively small sample size from the data collected over time in the package, and the fact that the data is not necessarily representative of our population of interest, we are limited to any statistical inference to myself only. In addition, in our data collection process we specifically chose two variables that we thought were representative of any potential relationship to weight change. By using these two variables, we failed to explore the many different that could have been paired together. Any of these other variables could have potentially influenced the objective variable. Overall, even with the variables we have already, a variety of variables including food intake and other exercise would have helped given a more concrete explanation of weight changes. In short, these limitations of the experiment seem to have hindered any type of inference we could have gotten from the results. Perhaps if given the chance to re-do the entire experiment, more subjects would have been chosen along with the addition of more variables.

Session Info

sessionInfo()


bhsu4/weightprog documentation built on May 28, 2019, 7:10 p.m.