README.md

ALSPAC data interface

Motivation

To obtain ALSPAC variables, the general procedure is:

  1. search through PDFs to find the variable(s)
  2. search through STATA files to extract
  3. load into R files to use

This package combines the search and extraction steps into two functions, which makes the work more reproducible. It works for the curated data in the R:/Current/ and R:/Useful_data directories.

Individuals who have withdrawn consent are removed from the extracted data automatically.

You can browse the variables here: http://variables.alspac.bris.ac.uk/

Limitations

Credits etc

Please report issues or suggestions to Gibran Hemani. Thanks to Tom Gaunt, Matt Suderman, Andrew Simpkin, Paul Yousefi and the ALSPAC team for help in developing this package.

Installation

To install version 0.6.1:

install.packages("devtools")
library(devtools)
install_github("explodecomputer/alspac")

You should then be able to load the package:

library(alspac)

Finding variables

Browsing the variables manually

There are two data objects that come with the package - current and useful - that contain all the variables available in the R:/Current/ and R:/Useful_data directories, respectively. You can search through them manually by loading them directly, e.g. to load the current variables:

data(current)

and to load the useful variables:

data(useful)

The top 6 rows of current look like this:

   obj name                   lab counts    type    cat1  cat2   cat3 cat4                 path
1 a_3c  aln  Pregnancy identifier  13545 integer Current Quest Mother <NA> Current/Quest/Mother
2 a_3c a001 Questionnaire version  13545  factor Current Quest Mother <NA> Current/Quest/Mother
3 a_3c a002    Time lived in Avon  13545  factor Current Quest Mother <NA> Current/Quest/Mother
4 a_3c a003 Years since last move  13545 integer Current Quest Mother <NA> Current/Quest/Mother
5 a_3c a004 Weeks since last move  13545 integer Current Quest Mother <NA> Current/Quest/Mother
6 a_3c a005  NO of moves in 5 YRS  13545 integer Current Quest Mother <NA> Current/Quest/Mother
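Because the dictionary is an ordinary data frame, it can be filtered with base R. The following is a minimal sketch that uses a small stand-in data frame (toy, a name made up for this example) built from the rows shown above; with the package installed you would use data(current) instead.

```r
# Stand-in for the first rows of `current` (values copied from the table above).
# `toy` is a hypothetical name; in practice use data(current).
toy <- data.frame(
  obj    = "a_3c",
  name   = c("aln", "a001", "a002", "a003", "a004", "a005"),
  lab    = c("Pregnancy identifier", "Questionnaire version",
             "Time lived in Avon", "Years since last move",
             "Weeks since last move", "NO of moves in 5 YRS"),
  counts = 13545L,
  type   = c("integer", "factor", "factor", "integer", "integer", "integer"),
  stringsAsFactors = FALSE
)

# Rows whose description mentions "move", ignoring case
moves <- toy[grepl("move", toy$lab, ignore.case = TRUE), ]
print(moves$name)

# Number of variables of each type
print(table(toy$type))
```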

Using the search function

The findVars function is a simple helper for searching current (the default) or useful. Its documentation can be retrieved using:

?findVars

For example, to search for all variables with the word 'height' in the description from the current data:

vars <- findVars("height")

To find variables with height OR length in the description:

vars <- findVars("height", "length", logic="any", whole.word=TRUE, ignore.case=TRUE)

To find variables with height AND length in the description:

vars <- findVars("height", "length", logic="and", whole.word=TRUE, ignore.case=TRUE)

To find anything with sleep somewhere in the description (not necessarily as a whole word):

vars <- findVars("sleep", "slept", logic="any", whole.word=FALSE, ignore.case=TRUE)

To find all variables that have the term "difficulties" from the useful data:

vars <- findVars("difficulties", dictionary="useful")

Some of these arguments have defaults, but they are written out here for illustration.
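To make the difference between matching any term, matching every term and matching whole words concrete, here is a rough base R illustration on a few made-up descriptions. It is only a sketch of the kind of matching these arguments describe; it is an assumption and not a copy of how findVars is implemented.

```r
# Made-up variable descriptions, for illustration only
labs <- c("Height (cm)", "Crown-heel length", "Heightened anxiety",
          "Height and length at birth")

# Word-boundary patterns approximate a whole-word match
pats <- c("\\bheight\\b", "\\blength\\b")

# One column of TRUE/FALSE per pattern, one row per description
hits <- sapply(pats, grepl, x = labs, ignore.case = TRUE)

any_match <- apply(hits, 1, any)  # "height" OR "length"
all_match <- apply(hits, 1, all)  # "height" AND "length"

print(labs[any_match])
print(labs[all_match])
```

Note that "Heightened anxiety" is not matched here because the pattern requires height as a whole word; dropping the word boundaries would include it.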

Filtering a list of variables

findVars may identify multiple variables with the same name. The filterVars function can be used to select among these duplicates.

For example, the search below returns multiple variables with the same name for "kz021", "kz011b" and "c645a".

varnames <- c("kz021","kz011b","ype9670", "c645a")
vars <- findVars(varnames)

As a first clean-up step, I remove any variables whose names do not exactly match one of the names I am looking for.

vars <- subset(vars, subset=tolower(name) %in% varnames)

I then require that "kz021" comes from a STATA file whose name starts with "kz" (the "obj" column in vars); that "kz011b" comes from a file whose name starts with "cp" and has a description (the "lab" column in vars) that includes the word "Participant"; and that "c645a" comes from a questionnaire (the "cat2" column in vars).

vars <- filterVars(vars,
                   kz021=c(obj="^kz"),
                   kz011b=c(obj="^cp", lab="Participant"),
                   c645a=c(cat2="Quest")) 
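Before and after filtering it can help to check which names are still duplicated. The sketch below does this in base R on a hypothetical search result (vars_toy, whose obj and cat2 values are invented for illustration), and applies two of the rules by hand so that the effect of the filtering is visible. It is not a substitute for filterVars.

```r
# Hypothetical search result: each name appears in two different files.
# The obj and cat2 values are invented for illustration only.
vars_toy <- data.frame(
  name = c("kz021", "kz021", "c645a", "c645a"),
  obj  = c("kz_5c", "cp_2b", "c_8a", "xx_1a"),
  cat2 = c("Quest", "Other", "Quest", "Other"),
  stringsAsFactors = FALSE
)

# Names that occur more than once and therefore need filtering
dup_names <- names(which(table(vars_toy$name) > 1))
print(dup_names)

# Base R version of two rules: kz021 from a file starting with "kz",
# c645a from a questionnaire
keep <- (vars_toy$name == "kz021" & grepl("^kz", vars_toy$obj)) |
        (vars_toy$name == "c645a" & vars_toy$cat2 == "Quest")
filtered <- vars_toy[keep, ]

# After filtering, every name should be unique
stopifnot(anyDuplicated(filtered$name) == 0)
```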

Once you have a list of variables in the required format (i.e. the output from findVars), you can extract them as described in the next section.

Extracting variables

For this you need to have mounted the R:/Data drive on your computer. When you load the package (library(alspac)) with the R drive mounted, you should get a message like this:

The data directory has been recognised

Sometimes this might not work - the package tries to guess where the R drive will be mounted but it might guess wrong. If you receive an error message instead and you are already connected to the R drive then run the following command:

setDataDir("/path/to/R drive/data/")
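If you are unsure whether the drive is mounted, one option is to check that the path exists before setting it. This is a minimal sketch, not part of the package; data_dir is a placeholder that you need to replace with the real location on your machine.

```r
# Placeholder path: replace with where the R drive is mounted on your machine
data_dir <- "/path/to/R drive/data/"

if (!dir.exists(data_dir)) {
  # Nothing is set in this case; mount the drive or correct the path first
  message("Directory not found: ", data_dir)
} else if (requireNamespace("alspac", quietly = TRUE)) {
  alspac::setDataDir(data_dir)
} else {
  message("The alspac package is not installed.")
}
```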

Once you have received the message The data directory has been recognised you are able to extract the variables you need from the R drive.

results <- extractVars(vars)

Or you can just extract the row or rows relevant to you:

results <- extractVars(vars[1:3, ])
results <- extractVars(subset(vars, some_conditions_here))

Important note on IDs

Suppose we extract variables measured in mothers, children, fathers and partners, e.g.

x <- subset(current, name %in% c("cf010", "ff1a005a", "fm1a010a", "pc013"))
y <- extractVars(x)

returning e.g.

> head(y, 10)
   alnqlet   aln qlet cf010 mult_dad ff1a005a mult_mum fm1a010a pc013
1    30001 30001 <NA>    NA     <NA>       NA     <NA>       NA     2
2    30004 30004 <NA>    NA     <NA>       NA     <NA>       NA     2
3    30006 30006 <NA>    NA     <NA>       NA     <NA>       NA     2
4    30008 30008 <NA>    NA       No        7       No        4     2
5    30010 30010 <NA>    NA     <NA>       NA       No        1    NA
6    30012 30012 <NA>    NA     <NA>       NA       No        7    NA
7    30013 30013 <NA>    NA     <NA>       NA       No       10     2
8   30013A 30013    A    18     <NA>       NA     <NA>       NA    NA
9    30017 30017 <NA>    NA     <NA>       NA       No        7    NA
10   30019 30019 <NA>    NA       No        7       No        4     1

This has returned the requested variables along with some other columns: alnqlet, aln, qlet, mult_dad and mult_mum.

If you have a better way to present these data do contact me.
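One way to work with output in this shape is to separate the rows and link them back together on aln. The sketch below uses a stand-in data frame (y_toy) built from a few rows and columns of the output above, and assumes, as that output suggests, that rows with a non-missing qlet are children and that aln is shared within a pregnancy. Treat it as an illustration rather than a recommended workflow.

```r
# Stand-in for a few rows and columns of the extracted data shown above
y_toy <- data.frame(
  alnqlet  = c("30008", "30013", "30013A", "30019"),
  aln      = c(30008L, 30013L, 30013L, 30019L),
  qlet     = c(NA, NA, "A", NA),
  cf010    = c(NA, NA, 18L, NA),
  fm1a010a = c(4L, 10L, NA, 4L),
  stringsAsFactors = FALSE
)

# Assumption: a non-missing qlet marks a child row
children <- y_toy[!is.na(y_toy$qlet), c("aln", "qlet", "cf010")]
adults   <- y_toy[is.na(y_toy$qlet), c("aln", "fm1a010a")]

# Put the child and the adult measurements for the same aln on one row
linked <- merge(children, adults, by = "aln")
print(linked)
```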

Using the website to browse variables

You can browse the variables at https://alspac-example.shinyapps.io/alspac-dt/. This contains both the 'Current' and 'Useful_data' variables.

You can also use it to help extract variables:

  1. Select the variables that you want from the table, then click 'Download variable list'. This will download a csv file containing information about the variables chosen.
  2. Next, use the R package to extract those variables. Call the extractWebOutput function, specifying the name of the file that you just downloaded. For example:
extractWebOutput("data-2017-08-22.csv")

Dictionary Maintenance

From time to time the R:\Data\Current\ directory is updated with new files. The variable dictionaries that the package uses can be updated using the createDictionary function.

current <- createDictionary("Current", name="current")
useful <- createDictionary("Useful_data", name="useful")

These updated dictionaries will be saved within the R package for use in later R sessions. In other words, an update only needs to be performed once.

To update the shiny variable app see https://github.com/explodecomputer/alspac-shiny



explodecomputer/alspac documentation built on Sept. 14, 2020, 1:10 a.m.