NlsyLinks

Manuscript purpose: Proposed Structure for manuscript (possibly submitted to JSS)

Abstract

Introduction

Structural and Topical information

The NlsyLinks package offers two types of datasets: topical and structural. Topical datasets contain predictor and outcome variables typically used to test a focused hypotheses. For instance, the NLSY79 Gen2 variables Rqqq.qq and Rqqq.qq are critical when studying the relationship between conduct disorder and menarche (e.g., Rodgers et al, 2015), but are not relevant to many hypotheses outside these fields.

In contrast, variables in structural datasets are not typically directly stated in the hypotheses, yet are essential to many NLSY-related investigations including:

The NlsyLinks includes small topical datasets which allows the vignettes and examples to be reproducible and more realistic. The structural datasets are intended to be the authoritative representations, and are the product of two NIH grants (for a complete history of the familial relationships, see Rodgers et al., 2016).

Terminology

The package pertains to multiple generations of the 'Nlsy79' and multiple generations of the 'Nlsy97'. Because the NlsyLinks package structures information within and between generations of the NLSY simultaneously, it requires slightly unconventional NLSY terminology to reduce ambiguity.

The 'Nlsy79 sample' refers to both the original 12,686 subjects interviewed in 1979, and their 11,500+ children (termed 'Nlsy79 Gen1' and 'Nlsy79 Gen2', respectively). Data for the 'Nlsy79 Gen1' comes from the original NLSY79 study, while data for the 'Nlsy Gen2' comes from both the NLSY-C study and the NLSY-YA study ('C' stands for children, and 'YA' stands for young adult). More specifically, the Gen2 subjects are the biological offspring of the Gen1 mothers; they initially completed the NLSY-C survey until roughly age 14, and then completed the NLSY-YA survey (the oldest 'Young Adult' respondent was QQ in the 2016 survey). Although the NLSY does not interview 'Nlsy79 Gen0' (the parents of Gen1) or 'Nlsy79 Gen3' (the children of Gen2), it does contain direct and indirect information about them.

The terminology for the 'Nlsy97' sample is similar yet simpler than the 'Nlsy79', because the explicit respondents come from a single generation (i.e., the 'Nlsy97 Gen1'). A few variables reflect Gen0 and Gen2. In contrast to the Nlsy79, the Nlsy97 contains more information about the housemates.

They both are nationally-representative samples.

Common NLSY79 terminology refers second generation subjects as 'children' when they are younger than age 15 (NLSYC), and as 'young adults' when they are 15 and older (NLSY79-YA); though they are the same respondents, different funding mechanisms and different survey items necessitate the distinction. This cohort is sometimes abbreviated as 'NLSY79-C', 'NLSY79C', 'NLSY-C' or 'NLSYC'. This packages uses 'Gen2' to refer to subjects of this generation, regardless of their age at the time of the survey.

Ambiguous Categories: In some cases, the kinship of relatives can be safely narrowed to two categories, but not one. The most common scenario involves the ambiguous siblings (among Nlsy Gen2 siblings); due to the sampling design, we know that the two participants share the same mother, so they are either full siblings (R=.50) or half siblings (R=.25). For the qqq% that we could not satisfactorily classify, we categorize them as "ambiguous siblings" and assign them R=.375. Another common scenario involves the ambiguous twins (among Nlsy97 and both generations of Nlsy79). For qqq% of the twins we comfortably classify their relationship as either MZ or DZ; the remaining qqq% percent are classified as "ambiguous twins" and assigned R=0.75. Empirically, this approach has been successful in the previous 20 years of our previous research, and is discussed further in (Rodgers, qqq).

Within our own team, we've mostly stopped using terms like 'NLSY79', 'NLSY79-C' and 'NLSY79-YA', because we conceptualize it as one big sample containing two related generations. It many senses, the responses collected from the second generation can be viewed as outcomes of the first generation. Likewise, the parents in the first generation provide many responses that can be viewed as explanatory variables for the 2nd generation. Depending on your research, there can be big advantages of using one cohort to augment the other. There are also survey items that provide information about the 3rd generation and the 0th generation.

The SubjectTag variable uniquely identify NLSY79 subjects when a dataset contains both generations. For Gen2 subjects, the SubjectTag is identical to their CID (i.e., C00001.00 -the ID assigned in the NLSY79-Children files). However for Gen1 subjects, the SubjectTag is their CaseID (i.e., R00001.00), with "00" appended. This manipulation is necessary to identify subjects uniquely in inter-generational datasets. A Gen1 subject with an ID of 43 becomes 4300. The SubjectTags of her four children remain 4301, 4302, 4303, and 4304.

The expected coefficient of relatedness of a pair of subjects is typically represented by the statistical variable R. Examples are: Monozygotic twins have R=1; dizygotic twins have R=0.5; full siblings (i.e., those who share both biological parents) have R=0.5; half-siblings (i.e., those who share exactly one biological parent) have R=0.25; adopted siblings have R=0.0. Other uncommon possibilities are mentioned the documentation for Links79Pair. The font (and hopefully their context) should distinguish the variable R from the software R. To make things slightly more confusing the computer variable for R in the Links79Pair dataset is written with a monospace font: R.

A subject's ExtendedID indicates their extended family. Two subjects will be in the same extended family if either: [1] they are Gen1 housemates, [2] they are Gen2 siblings, [3] they are Gen2 cousins (i.e., they have mothers who are Gen1 sisters in the NLSY79), [4] they are mother and child (in Gen1 and Gen2, respectively), or [5] they are (aunt|uncle) and (niece|nephew) (in Gen1 and Gen2, respectively).

An outcome variable is directly relevant to the applied researcher; these might represent constructs like height, IQ, and income. A plumbing variable is necessary to manage BG datasets; examples are R, a subject's ID, and the date of a subject's last survey.

An ACE model is the basic biometrical model used by Behavior Genetic researchers, where the genetic and environmental effects are assumed to be additive. The three primary variance components are (1) the proportion of variability due to a shared genetic influence (typically represented as $a^2$, or sometimes $h^2$), (2) the proportion of variability due to shared common environmental influence (typically $c^2$), and (3) the proportion of variability due to unexplained/residual/error influence (typically $e^2$).

The variables are scaled so that they account for all observed variability in the outcome variable; specifically: $a^2 + c^2 + e^2 = 1$. Using appropriate designs that can logically distinguish these different components (under carefully specified assumptions), the basic biometrical modeling strategy is to estimate the magnitude of $a^2$, $c^2$, and $e^2$ within the context of a particular model. For gentle introductions to Behavior Genetic research, we recommend Plomin (1990) and Carey (2003). For more in-depth ACE model-fitting strategies, we recommend Neale & Maes, (2004).

Retrieving Data with the NLS Investigator

When a researcher pursues a new idea, we suggest to start by exploring what the NLSY can offer by poking around the (a) vast online documentation and (b) NLS Investigator. The documentation online (www.qqq), and has general information (e.g., how to connect the nationally representative sample was collected), topical information (e.g., what medical and health information has been collected across survey waves and subject ages), and descriptive summaries (e.g., attrition over time for different race and ethnic groups). This material has helpful suggestions which variables are available and appropriate.

With these hints, it's time to identify and download the specific variables from the NLS Investigator. The NLS Investigator is described briefly here (see the NLS Investigator vignette for more detailed instruction). Researchers new to the NLSY should expect at least a dozen round trips as they iteratively improve and complete their set of variables. First, select the 'Study', such as 'NLSY79 Child & Young Adult' (which corresponds to 'Nlsy79 Gen2' in our terminology. Second, select your desired variables, out of the tens of thousands available ones.

Before starting the examples, first verify that the NlsyLinks package is installed correctly. If not, please refer to Appendix.

any(.packages(all.available = TRUE) == "NlsyLinks") #Should evaluate to TRUE.
library(NlsyLinks) #Load the package into the current session.

ACE DF Analysis of One Generation (NLSY79 Gen1)

{This very basic analysis that should provide an initial feel for the inputs, mechanics, and goals of the analysis.}

The first example uses a simple statistical model and all available Gen2 subjects. The CreatePairLinksDoubleEntered function will create a data frame where each represents one pair of siblings, respective of order (i.e., there is a row for Subjects 201 and 202, and a second row for Subjects 202 and 201). This function examines the subjects' IDs and determines who is related to whom (and by how much). By default, each row it produces has at least six values/columns: (i) ID for the older member of the kinship pair: Subject1Tag, (ii) ID for the younger member: Subject2Tag, (iii) ID for their extended family: ExtendedID, (iv) their estimated coefficient of genetic relatedness: R, (v and beyond) outcome values for the older member; (vi and beyond) outcome values for the younger member.

A DeFries-Fulker (DF) Analysis uses linear regression to estimate the $a^2$, $c^2$, and $e^2$ of a univariate biometric system. The interpretations of the DF analysis can be found in Rodgers & Kohler (2005) and Rodgers, Rowe, & Li (1999). This example uses the newest variation, which estimates two parameters; the corresponding function is called DeFriesFulkerMethod3. The steps are:

  1. Use the NLS Investigator to select and download a Gen2 dataset.

  2. Open R and create a new script (see Appendix: R Scripts and load the NlsyLinks package. If you haven't done so, install the NlsyLinks package). Within the R script, identify the locations of the downloaded data file, and load it into a data frame.

  3. Within the R script, load the linking dataset. Then select only Gen2 subjects. The 'Pair' version of the linking dataset is essentially an upper triangle of a symmetric sparse matrix.

  4. Load and assign the ExtraOutcomes79 dataset.
  5. Specify the outcome variable name and filter out all subjects who have a negative value in this variable. The NLSY typically uses negative values to indicate different types of missingness (see 'Further Information' below).
  6. Create a double-entered file by calling the 'CreatePairLinksDoubleEntered` function. At minimum, pass the (i) outcome dataset, the (ii) linking dataset, and the (iii) name(s) of the outcome variable(s). (There are occasions when a single-entered file is more appropriate for a DF analysis. See Rodgers & Kohler, 2005, for additional information.)
  7. Use 'DeFriesFulkerMethod3` function (i.e., general linear model) to estimate the coefficients of the DF model.
### R Code for Example DF analysis with a simple outcome and Gen2 subjects
#Step 2: Load the package containing the linking routines.
library(NlsyLinks)

#Step 3: Load the LINKING dataset and filter for the Gen2 subjects
dsLinking <- subset(Links79Pair, RelationshipPath=="Gen2Siblings")
summary(dsLinking) #Notice there are 11,088 records (one for each unique pair).

#Step 4: Load the OUTCOMES dataset, and then examine the summary.
dsOutcomes <- ExtraOutcomes79 #'ds' stands for 'Data Set'
summary(dsOutcomes)

#Step 5: This step isn't necessary for this example, because Kelly Meredith already
#   groomed the values.  If the negative values (which represent NLSY missing or
#   skip patterns) still exist, then:
dsOutcomes$MathStandardized[dsOutcomes$MathStandardized < 0] <- NA

#Step 6: Create the double entered dataset.
dsDouble <- CreatePairLinksDoubleEntered(
  outcomeDataset   = dsOutcomes,
  linksPairDataset = dsLinking,
  outcomeNames     = c('MathStandardized')
)
summary(dsDouble) #Notice there are 22176=(2*11088) records now (two for each unique pair).

#Step 7: Estimate the ACE components with a DF Analysis
ace <- DeFriesFulkerMethod3(
    dataSet  = dsDouble,
    oName_S1 = "MathStandardized_S1",
    oName_S2 = "MathStandardized_S2")
ace

Further Information: If the different reasons of missingness are important, further work is necessary. For instance, some analyses that use item Y19940000 might need to distinguish a response of "Don't Know" (which is coded as -2) from "Missing" (which is coded as -7). For this example, we'll assume it's safe to clump the responses together.

ACE DF Analysis of One Generation (NLSY79 Gen2)

The second example differs from the previous example in two ways. First, the outcome variables are read from a CSV (comma separated values file) that was downloaded from the NLS Investigator. Second, the DF analysis is called through the function AceUnivariate; this function is a wrapper around some simple ACE methods, and will help us smoothly transition to more techniques later.

The steps are:

  1. Use the NLS Investigator to select and download a Gen2 dataset. Select the variables 'length of gestation of child in weeks' (C03280.00), 'weight of child at birth in ounces' (C03286.00), and 'length of child at birth' (C03288.00), and then download the *.zip file to your local computer.

  2. Open R and create a new script and load the NlsyLinks package.

  3. Within the R script, load the linking dataset. Then select only Gen2 subjects.
  4. Read the CSV into R as a data.frame using ReadCsvNlsy79Gen2.
  5. Verify the desired outcome column exists, and rename it something meaningful to your project. It is important that the data.frame is reassigned (i.e., ds <- RenameNlsyColumn(...)). In this example, we rename column C0328800 to BirthWeightInOunces.
  6. Filter out all subjects who have a negative BirthWeightInOunces value. See the 'Further Information' note in the previous example.
  7. Create a double-entered file by calling the CreatePairLinksDoubleEntered function. At minimum, pass the (i) outcome dataset, the (ii) linking dataset, and the (iii) name(s) of the outcome variable(s).
  8. Call the AceUnivariate function to estimate the coefficients.
### R Code for Example of a DF analysis with a simple outcome and Gen2 subjects
#Step 2: Load the package containing the linking routines.
library(NlsyLinks)

#Step 3: Load the linking dataset and filter for the Gen2 subjects
dsLinking <- subset(Links79Pair, RelationshipPath=="Gen2Siblings")

#Step 4: Load the outcomes dataset from the hard drive and then examine the summary.
#   Your path might be: filePathOutcomes <- 'C:/BGResearch/NlsExtracts/gen2-birth.csv'
filePathOutcomes <- file.path(path.package("NlsyLinks"), "extdata", "gen2-birth.csv")
dsOutcomes <- ReadCsvNlsy79Gen2(filePathOutcomes)
summary(dsOutcomes)

#Step 5: Verify and rename an existing column.
VerifyColumnExists(dsOutcomes, "C0328600") #Should return '10' in this example.
dsOutcomes <- RenameNlsyColumn(dsOutcomes, "C0328600", "BirthWeightInOunces")

#Step 6: For this item, a negative value indicates the parent refused, didn't know,
#   invalidly skipped, or was missing for some other reason.
#   For our present purposes, we'll treat these responses equivalently.
#   Then clip/Winsorized/truncate the weight to something reasonable.
dsOutcomes$BirthWeightInOunces[dsOutcomes$BirthWeightInOunces < 0] <- NA
dsOutcomes$BirthWeightInOunces <- pmin(dsOutcomes$BirthWeightInOunces, 200)

#Step 7: Create the double entered dataset.
dsDouble <- CreatePairLinksDoubleEntered(
  outcomeDataset   = dsOutcomes,
  linksPairDataset = dsLinking,
  outcomeNames     = c('BirthWeightInOunces')
)

#Step 8: Estimate the ACE components with a DF Analysis
ace <- AceUnivariate(
  method   = "DeFriesFulkerMethod3",
  dataSet  = dsDouble,
  oName_S1 = "BirthWeightInOunces_S1",
  oName_S2 = "BirthWeightInOunces_S2"
)
ace

For another example of incorporating CSVs downloaded from the NLS Investigator, please see the "Race and Gender Variables" entry in the FAQ.

ACE SEM of One Generation (NLSY79 Gen2)

The example differs from the first one by the statistical mechanism used to estimate the components. The first example uses multiple regression to estimate the influence of the shared genetic and environmental factors, while this example uses structural equation modeling (SEM).

The CreatePairLinksSingleEntered function will create a data.frame where each row represents one unique pair of siblings, irrespective of order. Other than producing half the number of rows, this function is identical to CreatePairLinksDoubleEntered.

The steps are:

(Steps 1-5 proceed identically to the first example.)

  1. Create a single-entered file by calling the CreatePairLinksSingleEntered function. At minimum, pass the (i) outcome dataset, the (ii) linking dataset, and the (iii) name(s) of the outcome variable(s).

  2. Declare the names of the outcome variables corresponding to the two members in each pair. Assuming the variable is called 'ZZZ' and the preceding steps have been followed, the variable 'ZZZ_S1' corresponds to the first members and ZZZ_S2' corresponds to the second members.

  3. Create a GroupSummary data.frame, which identifies the R groups that should be considered by the model. Inspect the output to see if the groups show unexpected or fishy differences.

  4. Create a data.frame with cleaned variables to pass to the SEM function. This data.frame contains only the three necessary rows and columns.

  5. Estimate the SEM with the \pkg{lavaan} package. The function returns an S4 object, which shows the basic ACE information.

  6. Inspect details of the SEM, beyond the ACE components. In this example, we look at the fit stats and the parameter estimates. The \pkg{lavaan} package has additional methods that may be useful for your purposes.

### R Code for Example lavaan estimation analysis with a simple outcome and Gen2 subjects
#Steps 1-5 are explained in the vignette's first example:
library(NlsyLinks)
dsLinking <- subset(Links79Pair, RelationshipPath=="Gen2Siblings")
dsOutcomes <- ExtraOutcomes79
dsOutcomes$MathStandardized[dsOutcomes$MathStandardized < 0] <- NA

#Step 6: Create the single entered dataset.
dsSingle <- CreatePairLinksSingleEntered(
  outcomeDataset   = dsOutcomes,
  linksPairDataset = dsLinking,
  outcomeNames     = c('MathStandardized')
)

#Step 7: Declare the names for the two outcome variables.
oName_S1 <- "MathStandardized_S1" #Stands for Outcome1
oName_S2 <- "MathStandardized_S2" #Stands for Outcome2

#Step 8: Summarize the R groups and determine which groups can be estimated.
dsGroupSummary <- RGroupSummary(dsSingle, oName_S1, oName_S2)
dsGroupSummary

#Step 9: Create a cleaned dataset
dsClean <- CleanSemAceDataset(dsDirty=dsSingle, dsGroupSummary, oName_S1, oName_S2)

#Step 10: Run the model
ace <- AceLavaanGroup(dsClean)
ace
#Notice the 'CaseCount' is 8,390 instead of 17,440.
#  This is because (a) one pair with R=.75 was excluded, and
#  (b) the SEM uses a single-entered dataset instead of double-entered.
#
#Step 11: Inspect the output further
library(lavaan) #Load the package to access methods of the lavaan class.
GetDetails(ace)
#Examine fit stats like Chi-Squared, RMSEA, CFI, etc.
fitMeasures(GetDetails(ace)) #'fitMeasures' is defined in the lavaan package.

#Examine low-level details like each group's individual parameter estimates and standard
#  errors.  Uncomment the next line to view the entire output (which is roughly 4 pages).
#summary(GetDetails(ace))

ACE SEM of Two Generations

{This is a more moderately difficult analysis with a more common estimation mechanism.}

{Benefits of cross-generational analysis}

The example differs from the previous example in one way --all possible pairs are considered for the analysis. Pairs are only excluded (a) if they belong to one of the small R groups that are difficult to estimate, or (b) if the value for adult height is missing. This includes all \Sexpr{scales::comma(nrow(Links79Pair))} relationships in the follow five types of NLSY79 relationships.

xt <- xtable(table(Links79Pair$RelationshipPath, dnn=c("Relationship Frequency")),
             caption="Number of NLSY79 relationship, by `RelationshipPath`.(Recall that 'AuntNiece' also contains uncles and nephews.)")
print.xtable(xt, format.args=list(big.mark=","), type = "html")

In our opinion, using the intergenerational links is one of the most exciting new opportunities for NLSY researchers to pursue. We will be happy to facilitate such research through consult or collaboration, or even by generating new data structures that may be of value. The complete kinship linking file facilitates many different kinds of cross-generational research, using both biometrical and other kinds of modeling methods.

### R Code for Example lavaan estimation analysis with a simple outcome and Gen1 subjects
#Steps 1-5 are explained in the vignette's first example:
library(NlsyLinks)
dsLinking <- subset(Links79Pair, RelationshipPath %in%
                      c("Gen1Housemates", "Gen2Siblings", "Gen2Cousins",
                        "ParentChild", "AuntNiece"))
#Because all five paths are specified, the line above is equivalent to:
#dsLinking <- Links79Pair

dsOutcomes <- ExtraOutcomes79
#The HeightZGenderAge variable is already groomed

#Step 6: Create the single entered dataset.
dsSingle <- CreatePairLinksSingleEntered(
  outcomeDataset   = dsOutcomes,
  linksPairDataset = dsLinking,
  outcomeNames     = c('HeightZGenderAge'))

#Step 7: Declare the names for the two outcome variables.
oName_S1 <- "HeightZGenderAge_S1"
oName_S2 <- "HeightZGenderAge_S2"

#Step 8: Summarize the R groups and determine which groups can be estimated.
dsGroupSummary <- RGroupSummary(dsSingle, oName_S1, oName_S2)
dsGroupSummary

#Step 9: Create a cleaned dataset
dsClean <- CleanSemAceDataset(dsDirty=dsSingle, dsGroupSummary, oName_S1, oName_S2)

#Step 10: Run the model
ace <- AceLavaanGroup(dsClean)
ace

#Step 11: Inspect the output further (see the final step two examples above).

Notice the ACE estimates are very similar to the previous exampl, but the number of pairs has increased by 6x --from 4,185 to 24,700. The number of subjects doubles when Gen2 is added, and the number of relationship pairs really takes off. When an extended family's entire pedigree is considered by the model, many more types of links are possible than if just nuclear families are considered. This increased statistical power is even more important when the population's $a^2$ is small or moderate, instead of something large like 0.7.

You may notice that the analysis has r scales::comma(ace@CaseCount) relationships instead of the entire r scales::comma(nrow(Links79Pair)). This is primarily because not all subjects have a value for 'adult height' (and that's mostly because a lot of Gen2 subjects are too young). There are r scales::comma(sum(!is.na(Links79PairExpanded$RFull))) pairs with a nonmissing value in RFull, meaning that r round(mean(!is.na(Links79PairExpanded$RFull))*100, 1) are classified. We feel comfortable claiming that if a researcher has a phenotype for both members of a pair, there's a 99+% chance we have an RFull for it. For a description of the R and RFull variables, please see the Links79Pair entry in the package reference manual.

More Advanced ACE Analyses

Data Manipulation and Non-Biometric Analyses

SAS Analogues

Additional Resources

Authors & Affiliation

References



nlsy-links/NlsyLinks documentation built on March 13, 2024, 4:05 a.m.