In alanarnholt/PDS: Data Sets for Passion Driven Statistics

options(useFancyQuotes="UTF-8")
knitr::opts_chunk$set(comment = NA, message = FALSE, warning = FALSE, prompt = TRUE)

Using PDS

Once the PDS package is installed, you will need to load the package using the library function.

library(PDS)

If you need directions for installing the PDS package, see https://github.com/alanarnholt/PDS. Code Books for AddHealth, gapminder, marscrater, and NESARC are in the CodeBooks folder of the installed PDS package. To see where your system has installed the PDS package, type the following at the R prompt:

system.file(package = "PDS")

The Code Books are stored as PDFs. Note: To find a word/variable in a PDF use shift F with Windows or command F with OSX. Type the desired word/variable inside the box and press enter/return.

Once the PDS package is attached with the library() function, one may access the data in one of the data frames by typing the name of the data frame at the R prompt. For example, to see the first five values in the variable incomeperperson from the gapminder data frame enter

gapminder[1:5, 'incomeperperson']  # Show first five rows for incomeperperson
gapminder[1:5, 1:4]  # Show the first five rows for variables 1 through 4

Data Management Using the `dplyr` Package

The package dplyr must first be installed before it can be used. There are two ways one can install the package. The easiest way to install dplyr is to type

install.packages("dplyr")

at the R prompt. If you want the latest version of dplyr, you can install the development version from GitHub provided you have the required tools to build R packages installed on your machine (Rtools for Windows users, XCode for Mac users, etc.).

if (packageVersion("devtools") < 1.6) {
  install.packages("devtools")
}
devtools::install_github("hadley/lazyeval")
devtools::install_github("hadley/dplyr")

Subsetting Your Data

Create a subset of people 25 years old or younger who smoked in the last 12 months. To do this, one can use the dplyr package. Note that CHECK321 == 1 is used to see if a subject has smoked in the past 12 months and !is.na() is used to make sure the subset does not contain NAs for the chosen variables.

library(dplyr)
nesarc <- tbl_df(NESARC) %>% 
  filter(!is.na(CHECK321) & !is.na(AGE) & CHECK321 ==1 & AGE <= 25)
dim(nesarc)

The nesarc data frame contains r dim(nesarc)[2] variables.

Three different approaches to subsetting data will be given. The first approach is to use the dplyr function filter; the second approach is to use indices; and the third approach is to use the function subset. Consider creating a subset of the NESARC data set where a person indicates

He/she has smoked over 100 cigarettes (S3AQ1A == 1)
He/she has smoked in the past year (CHECK321 == 1)
He/she has typically smoked every day over the last year (S3AQ3B1 == 1)
He/she is less than or equal to 25 years old (AGE <= 25)

The first approach uses the filter function with the %>% function. Although it is not a requirement, the data frame NESARC is converted to a data frame tbl per the advice given in the dplyr vignette.

NESARCsub1 <- tbl_df(NESARC) %>%
  filter(S3AQ1A == 1 & CHECK321 == 1 & S3AQ3B1 == 1 & AGE <= 25)
dim(NESARCsub1)

The second approach uses standard indexing.

NESARCsub2 <- NESARC[NESARC$S3AQ1A == 1 & NESARC$CHECK321 == 1 & 
                       NESARC$S3AQ3B1 == 1 & NESARC$AGE <= 25, ]
dim(NESARCsub2)

The third approach uses the subset function.

NESARCsub3 <- subset(NESARC, subset = S3AQ1A == 1 & CHECK321 == 1 & 
                       S3AQ3B1 == 1 & AGE <= 25)
dim(NESARCsub3)

Use the verb select from the dplyr package to select the variables CHECK321, AGE (age of subject), TAB12MDX (tobacco dependence past 12 months), S3AQ3B1 (usual smoking quantity), ETHRACE2A (ethnicity), SEX (gender), MAJORDEPLIFE (Dysthymia-Lifetime), and S3AQ3C1 (usual smoking frequency).

Renaming Variables

Renaming the variables IDNUM, ETOTLCA2, ETHRACE2A, S3AQ3B1, AGE, MAJORDEPLIFE, DYSLIFE, TAB12MDX, and S3AQ3C1 to UniqueID, EthanolConsumption, Ethnicity, SmokingFrequency, Age, MajorDepression, DysthymiaLifetime, TobaccoDependence, and DailyCigsSmoked respectively, is done with the rename function. Although it is not a requirement, the data frame NESARC is converted to a data frame tbl per the advice given in the dplyr vignette. The rename function takes a data frame or data frame tbl as its first argument while the syntax new_name = old_name is used as the second argument.

NESARC[1:5, 1:5]  # Show first 5 rows and first 5 columns of NESARC
NESARCtbl <- tbl_df(NESARC) %>% 
  rename(UniqueID = IDNUM, EthanolConsumption = ETOTLCA2, Ethnicity = ETHRACE2A, 
         SmokingFrequency = S3AQ3B1, Age = AGE, MajorDepression = MAJORDEPLIFE, Sex = SEX, 
         DysthymiaLifetime = DYSLIFE, TobaccoDependence = TAB12MDX, DailyCigsSmoked = S3AQ3C1) %>% 
  select(UniqueID, EthanolConsumption, Ethnicity, SmokingFrequency, Age, MajorDepression, 
         DysthymiaLifetime, TobaccoDependence, DailyCigsSmoked, S3AQ1A, CHECK321, 
         SmokingFrequency, Sex)  
NESARCtbl

Coding Missing Values

Note that the variable S3AQ3B1 renamed to SmokingFrequency uses a 9 to record unknown for smoking frequency.

NESARCtbl$SmokingFrequency[NESARCtbl$SmokingFrequency == 9] <- NA
summary(NESARCtbl$SmokingFrequency)  # Note that 9 still appears
NESARCtbl$SmokingFrequency <- factor(NESARCtbl$SmokingFrequency)[, drop = TRUE]
summary(NESARCtbl$SmokingFrequency)  # Unused level no longer appears

One of the great things about the R language is how it handles a wide variety of data types.
In this case, the variable SmokingFrequency is a factor with numeric labels for the levels of the factor.
To change the numeric labels to text type the following:

NESARCtbl$SmokingFrequency <- factor(NESARCtbl$SmokingFrequency, 
                                     labels = c("Every Day", "5 to 6 Days/week", 
                                                "3 to 4 Days/week", "1 to 2 Days/week", 
                                                "2 to 3 Days/month", "Once a month or less"))
summary(NESARCtbl$SmokingFrequency)
xtabs(~SmokingFrequency, data = NESARCtbl) # Note how the NA's are not printed

The following ggplot2 code is used to graph the values in SmokingFrequency.

library(ggplot2)
ggplot(data = NESARCtbl, aes(x = SmokingFrequency)) + 
  geom_bar(fill = "lightgray") + 
  labs(x = "Smoking Frequency") +
  theme_bw()

If you do not want to see the NA's or have the labels on the $x$- axis overlap, use

ggplot(data = na.omit(NESARCtbl[ , "SmokingFrequency", drop = FALSE]), aes(x = SmokingFrequency)) + 
  geom_bar(fill = "lightgray") + 
  labs(x = "Smoking Frequency") +
  theme_bw() +
  theme(axis.text.x  = element_text(angle = 55, hjust = 1.0))

Collapsing Categories

Consider the variable S1Q6A which has 14 levels that record the highest level of education of the participant. To collapse the categories into a dichotomous variable that indicates the presence of a high school degree, use the ifelse function. The levels 1, 2, 3, 4, 5, 6, and 7 of the variable S1Q6A correspond to education levels less than completing high school.

NESARC$HS_DEGREE <- factor(ifelse(NESARC$S1Q6A %in% c("1", "2", "3", "4", "5", "6", "7"), 
                                  "No", "Yes"))
summary(NESARC$HS_DEGREE)

Creating a Factor from a Numeric Vector

To create a factor with three levels (young adults, adults, older adults) using the variable AGE (a numeric vector) use the function cut.

NESARC$AGEfac <- cut(NESARC$AGE, breaks = c(18, 30, 50, Inf), 
                     labels = c("Young Adult", "Adult", "Older Adult"), 
                     include.lowest = TRUE)
summary(NESARC$AGEfac)

Consider collapsing the numeric vector S3AQ3C1 (usual quantity when cigarettes are smoked) into a categorical variable with 5 levels.

NESARC$S3AQ3C1fac <- cut(NESARC$S3AQ3C1, breaks = c(0, 5, 10, 15, 20, 100), 
                         include.lowest = TRUE)
summary(NESARC$S3AQ3C1fac)

Aggregating Variables using `ifelse`

Consider creating a new variable DepressLife which is Yes if the variable MAJORLIFE or DYSLIFE is a 1.

NESARC$DepressLife <- factor(ifelse( (NESARC$MAJORDEPLIFE == 1 | NESARC$DYSLIFE == 1), "Yes", "No"))
summary(NESARC$DepressLife)

As another example, consider creating a variable PPpanic for past and present panic disorders. The variables APANDX12 and APANDXP12 record a panic disorder within the last 12 months and prior to the last 12 months, respectively, with agoraphobia. The variables PANDX12 and PANDXP12 record a panic disorder within the last 12 months and prior to the last 12 months, respectively, without agoraphobia.

NESARC$PPpanic <- factor(ifelse( (NESARC$APANDX12 == 1 | NESARC$APANDXP12 == 1 | 
                                    NESARC$PANDX12 == 1 | NESARC$PANDXP12 == 1 ), "Yes", "No"))
summary(NESARC$PPpanic)

Consider a new variable AllDeprSymp that records whether a subject exhibits all depression symptoms or not. There are 19 depression symptoms recorded in the variables S4AQ4A1 - S4AQ4A19. Note that the logical operator & is used to test if all depression symptoms are present.

NESARC$AllDeprSymp <- factor(ifelse( (NESARC$S4AQ4A1 == 1 & NESARC$S4AQ4A2 == 1 & 
                                        NESARC$S4AQ4A3 == 1 & NESARC$S4AQ4A4 == 1 & 
                                        NESARC$S4AQ4A5 == 1 & NESARC$S4AQ4A6 == 1 & 
                                        NESARC$S4AQ4A7 == 1 & NESARC$S4AQ4A8 == 1 & 
                                        NESARC$S4AQ4A9 == 1 & NESARC$S4AQ4A10 == 1 & 
                                        NESARC$S4AQ4A11 == 1 & NESARC$S4AQ4A12 == 1 & 
                                        NESARC$S4AQ4A13 == 1 & NESARC$S4AQ4A14 == 1 & 
                                        NESARC$S4AQ4A15 == 1 & NESARC$S4AQ4A16 == 1 & 
                                        NESARC$S4AQ4A17 == 1 & NESARC$S4AQ4A18 == 1 & 
                                        NESARC$S4AQ4A19 == 1), "Yes", "No"))
summary(NESARC$AllDeprSymp)

Creating a Composite Factor

The following code selects the 19 variables that deal with depression using the select function from dplyr. The variable nDS counts the number of depression symptoms a subject displays.

mysum <- function(x){sum(x == 1)}
myadd <- function(x){apply(x, 1, mysum)}
ndf <- NESARC %>%
  select(contains("S4AQ4A"))
nDS <- myadd(ndf)
ndf <- cbind(ndf, nDS)
xtabs(~nDS, data = ndf)

Creating a New Variable with `mutate`

MINI <- tbl_df(NESARC) %>% 
  select(S1Q24FT, S1Q24IN, S1Q24LB, SEX) %>% 
  filter(S1Q24FT < 99, S1Q24IN < 99, S1Q24LB < 999) %>% 
  mutate(Inches = (S1Q24FT*12 + S1Q24IN),
         Sex = factor(SEX, labels = c("Male", "Female"))) %>% 
  rename(Weight = S1Q24LB)
MINI

Convert a factor to numeric

Be careful when converting a factor to a vector of numeric values based on the factor labels. One may be tempted to use as.numeric to convert a factor to a numeric vector. However, using as.numeric(my_factor) returns a numeric vector of the index levels, not the actual values.

To estimate the total number of cigarettes a subject smokes per month, convert S3AQ3B1 renamed to SmokingFrequency in NESARCtbl (a factor with 6 levels) to a numeric variable using as.numeric. DaysSmoke estimates the days per month a subject smokes. The variable TotalCigsSmoked estimates the number of cigarettes a subject smokes per month by multiplying DaysSmoke times DailyCigsSmoked (S3AQ3C1). The numeric variable TotalCigsSmoked is converted into a factor with roughly equivalent numbers stored in each level of the factor CigsSmokedFac using the cut function.

summary(NESARCtbl$SmokingFrequency)
levels(NESARCtbl$SmokingFrequency)
table(as.numeric(NESARCtbl$SmokingFrequency))

NESARCtbl$DaysSmoke <- as.numeric(NESARCtbl$SmokingFrequency)
NESARCtbl$DaysSmoke[NESARCtbl$DaysSmoke == 1] <- 30
NESARCtbl$DaysSmoke[NESARCtbl$DaysSmoke == 2] <- 4*5.5
NESARCtbl$DaysSmoke[NESARCtbl$DaysSmoke == 3] <- 4*3.5
NESARCtbl$DaysSmoke[NESARCtbl$DaysSmoke == 4] <- 4*1.5
NESARCtbl$DaysSmoke[NESARCtbl$DaysSmoke == 5] <- 2.5
NESARCtbl$DaysSmoke[NESARCtbl$DaysSmoke == 6] <- 1
# Using dplyr again
NESARCtbl <- NESARCtbl %>% 
  mutate(TotalCigsSmoked = DaysSmoke*DailyCigsSmoked)
proportions <- quantile(NESARCtbl$TotalCigsSmoked, na.rm = TRUE)
proportions
NESARCtbl$CigsSmokedFac <- cut(NESARCtbl$TotalCigsSmoked, breaks = proportions, 
                               include.lowest = TRUE)
NESARCtbl

Subsetting Again

NESARCtbl <- NESARCtbl %>% 
  filter(S3AQ1A == 1 & CHECK321 == 1 & SmokingFrequency == "Every Day" & Age <= 25)
dim(NESARCtbl)
str(NESARCtbl)
summary(NESARCtbl)

Exploratory Graphs

Numeric data $\rightarrow$ histogram and density plots.

library(ggplot2)
ggplot(data = NESARCtbl, aes(x = EthanolConsumption) ) + 
  geom_histogram(binwidth = 1, fill = "pink") + 
  theme_bw() +
  labs(x = "Ethanol Consumption")
ggplot(data = NESARCtbl, aes(x = EthanolConsumption) ) + 
  geom_density(fill = "pink") + 
  theme_bw() +
  labs(x = "Ethanol Consumption")

Categorical data $\rightarrow$ bar graphs.

ggplot(data = NESARCtbl, aes(x = Ethnicity)) +
  geom_bar(fill = c("snow", "brown4", "red", "yellow", "tan"), color = "black") + 
  theme_bw()

We should provide more descriptive labels for the factors Ethnicity, MajorDepression, DysthymiaLifetime, and TobaccoDependence.

Changing Levels and Labeling Variables

NESARCtbl$TobaccoDependence <- factor(NESARCtbl$TobaccoDependence, 
                         labels = c("No Nicotine Dependence", "Nicotine Dependence"))
xtabs(~TobaccoDependence, data = NESARCtbl)
NESARCtbl$TobaccoDependence <- factor(NESARCtbl$TobaccoDependence, 
                         levels = c("Nicotine Dependence", "No Nicotine Dependence"))
xtabs(~TobaccoDependence, data = NESARCtbl)
NESARCtbl$Ethnicity <- factor(NESARCtbl$Ethnicity, 
                              labels = c("Caucasian", "African American", 
                                         "Native American", "Asian", "Hispanic"))
NESARCtbl$Sex <- factor(NESARCtbl$Sex, labels = c("Male", "Female"))
table(NESARCtbl$Sex)
NESARCtbl$Sex <- factor(NESARCtbl$Sex, levels = c("Female", "Male"))
table(NESARCtbl$Sex)
NESARCtbl$SmokingFrequency <- factor(NESARCtbl$SmokingFrequency, 
                         levels = c("Once a month or less", "2 to 3 Days/month", "1 to 2 Days/week",  
                                    "3 to 4 Days/week", "5 to 6 Days/week", "Every Day"))
NESARCtbl$MajorDepression <- factor(NESARCtbl$MajorDepression, 
                                    labels = c("No Depression", "Yes Depression"))
ggplot(data = NESARCtbl, aes(x = Ethnicity)) +
  geom_bar(fill = c("snow", "brown4", "red", "yellow", "tan"), color = "black") + 
  theme_bw() + 
  theme(axis.text.x  = element_text(angle = 55, hjust = 1)) + 
  labs(x = "")

T1 <- xtabs(~ TobaccoDependence + MajorDepression, data = NESARCtbl)
T1
ggplot(data = NESARCtbl, aes(x = MajorDepression, fill = TobaccoDependence)) +
  geom_bar() + 
  theme_bw()
T2 <- prop.table(T1, 2)
T2
ggplot(data = NESARCtbl, aes(x = MajorDepression, fill = TobaccoDependence)) +
  geom_bar(position = "fill") + 
  labs(x = "", y = "Fraction", 
       title = "Fraction of young adult daily smokers\nwith and without nicotine addiction\nby depression status") +
  theme_bw() +
  scale_fill_manual(values = c("red", "pink"), name = "Tobacco Addiction Status")
###
ggplot(data = NESARCtbl, aes(x = MajorDepression, fill = TobaccoDependence)) +
  geom_bar(position = "fill") + 
  labs(x = "", y = "Fraction", 
       title = "Fraction of young adult daily smokers\nwith and without nicotine addiction\nby depression status") +
  theme_bw() +
  scale_fill_manual(values = c("red", "pink"), name = "Tobacco Addiction Status") + 
  facet_grid(Sex ~ .)

ggplot(data = NESARCtbl, aes(x = MajorDepression, fill = TobaccoDependence)) +
  geom_bar(position = "fill") + 
  labs(x = "", y = "Fraction", 
       title = "Fraction of young adult daily smokers\nwith and without nicotine addiction\nby depression status") +
  theme_bw() +
  scale_fill_manual(values = c("red", "pink"), name = "Tobacco Addiction Status") + 
  facet_grid(Ethnicity ~ Sex) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Merging Data Sets

To merge AddHealth Wave I with addhealth_public4 Wave IV one can use the left_join() function from dplyr. The left_join function returns all rows from the first argument (data frame), and all columns from both the first and second arguments (both data frames). To use left_join, one needs a common variable in both data frames. Note: Although the code book for Wave IV uses upper case for variable names, the variable names in the data frame addhealth_public4 are all lower case.

library(dplyr)
AddHealthW4 <- tbl_df(addhealth_public4) %>% 
  rename(AID = aid)
AddHealthW1and4 <- left_join(AddHealthW4, AddHealth)
dim(AddHealthW1and4)

kable

knitr::kable(NESARCtbl[1300:1305, c("EthanolConsumption", "Sex", "MajorDepression")], 
             align = c("c","c","c"), caption = "Three Selected Columns")

knitr::kable(NESARCtbl[1300:1305, c("EthanolConsumption", "Sex", "MajorDepression")], 
             align = c("c","c","c"), caption = "Three Selected Columns", col.names = c("Ethanol Consuption", "Gender", "Depression"))

alanarnholt/PDS documentation built on May 10, 2019, 8:49 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

alanarnholt/PDS
Data Sets for Passion Driven Statistics

In alanarnholt/PDS: Data Sets for Passion Driven Statistics

Using PDS

Data Management Using the `dplyr` Package

Subsetting Your Data

Renaming Variables

Coding Missing Values

Collapsing Categories

Creating a Factor from a Numeric Vector

Aggregating Variables using `ifelse`

Creating a Composite Factor

Creating a New Variable with `mutate`

Convert a factor to numeric

Subsetting Again

Exploratory Graphs

Changing Levels and Labeling Variables

Merging Data Sets

kable

R Package Documentation

Browse R Packages

We want your feedback!

alanarnholt/PDS Data Sets for Passion Driven Statistics

In alanarnholt/PDS: Data Sets for Passion Driven Statistics

Using PDS

Data Management Using the dplyr Package

Subsetting Your Data

Renaming Variables

Coding Missing Values

Collapsing Categories

Creating a Factor from a Numeric Vector

Aggregating Variables using ifelse

Creating a Composite Factor

Creating a New Variable with mutate

Convert a factor to numeric

Subsetting Again

Exploratory Graphs

Changing Levels and Labeling Variables

Merging Data Sets

kable

R Package Documentation

Browse R Packages

We want your feedback!

alanarnholt/PDS
Data Sets for Passion Driven Statistics

Data Management Using the `dplyr` Package

Aggregating Variables using `ifelse`

Creating a New Variable with `mutate`