In LCBC-UiO/MOAS: MOAS convenience functions

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

There are currently two functions for the epigenetic data, one for getting the epigenetic data as is, and one for adding them to a MOAS-type data frame (either the full MOAS or a subsetted MOAS). They function much like the other add and get functions already in this package. These functions are at the moment in development, or rather, the data is in active cleaning procedures. Because of the lack of collection time for some of the samples, any analyses including these data must be interpreted with caution, and at the moment only be used as preliminary results. Updates on the state of the epigenetic cleaning/time matching will be posted in the #moas-updates slack channel.

library(MOAS)

To be able to use these functions you must know where to files exist:

the epigenetic file you want to read in data from (currently only supports xlsx file)
the genetic ID to CrossProject_ID match file (that is in MOAS/data-raw/DNA/gID_MOAS_match.tsv on the lagringshotell)

The `epigen_get` function

The get function will retrieve a cleaned epiegenetics file for you, altering the genetic ID's to proper CrossProject IDs. As opposed to the dbs-functions, this will return multiple rows for any participant that we have epigenetic data on from several data collection waves. You may use the returned data frame from this function to merge with your own data, if you wish to do so your self.

epigen_get(file_path = "../Genetics/LCBC_EPIgenetic_Data/batch1_190919/DNAm_horvath/18092019epiage_lifebrain.xlsx", 
           match_path = "data-raw/DNA/gID_MOAS_match.tsv")

# A tibble: 1,335 x 11
   CrossProject_ID Project_Name Project_Wave EpiGen_DNAmAge EpiGen_noMissingPe… EpiGen_meanMethByS… EpiGen_minMethByS… EpiGen_maxMethByS… EpiGen_predicted… EpiGen_meanXchro… Genetic_ID
             <dbl> <chr>               <dbl>          <dbl>               <dbl>               <dbl>              <dbl>              <dbl> <chr>                         <dbl> <chr>     
 1         1100133 MemP                    3           23.0                1444               0.255           0.000121              0.996 male                          0.293 110013301 
 2         1100241 MemP                    3           21.6                1442               0.256           0.000626              0.995 female                        0.452 110024101 
 3         1100268 MemP                    3           42.8                1447               0.256           0.000001              0.995 female                        0.455 110026801 
 4         1100326 MemP                    3           40.1                1445               0.259           0.000642              0.992 male                          0.293 110032601 
 5         1100334 MemP                    3           43.6                1441               0.255           0.000001              0.997 female                        0.444 110033401 
 6         1100741 MemP                    3           28.3                1450               0.253           0.00200               0.992 male                          0.277 110074101 
 7         1100760 MemP                    3           42.0                1444               0.253           0.00143               0.992 male                          0.288 110076001 
 8         1100768 MemP                    3           40.3                1443               0.245           0.000683              0.996 female                        0.430 110076801 
 9         1100211 MemP                    3           64.2                1444               0.250           0.000741              0.997 female                        0.453 110021101 
10         1100322 MemP                    3           22.6                1453               0.252           0.00127               0.992 male                          0.274 110032201 
# … with 1,325 more rows

This function is also convenient if you are looking to get extra information regarding the samples the data is based on, by toggling the debug argument to TRUE you will get extra information regarding the sample from the matching file. I recommend doing this if youare using the add function and are seeing things you are not expecting (and let me know if there in unexpected behaviour of the function).

epigen_get("../Genetics/LCBC_EPIgenetic_Data/batch1_190919/DNAm_horvath/18092019epiage_lifebrain.xlsx", 
           match_path = "data-raw/DNA/gID_MOAS_match.tsv", 
           debug=TRUE)

# A tibble: 1,335 x 24
   CrossProject_ID Project_Name Project_Wave FID   EpiGen_DNAmAge EpiGen_noMissin… EpiGen_meanMeth… EpiGen_minMethB… EpiGen_maxMethB… EpiGen_predicte… EpiGen_meanXchr… EpiGen_debug_IID
             <dbl> <chr>               <dbl> <chr>          <dbl>            <dbl>            <dbl>            <dbl>            <dbl> <chr>                       <dbl> <chr>           
 1         1100133 MemP                    3 1100…           23.0             1444            0.255         0.000121            0.996 male                        0.293 110013301       
 2         1100241 MemP                    3 1100…           21.6             1442            0.256         0.000626            0.995 female                      0.452 110024101       
 3         1100268 MemP                    3 1100…           42.8             1447            0.256         0.000001            0.995 female                      0.455 110026801       
 4         1100326 MemP                    3 1100…           40.1             1445            0.259         0.000642            0.992 male                        0.293 110032601       
 5         1100334 MemP                    3 1100…           43.6             1441            0.255         0.000001            0.997 female                      0.444 110033401       
 6         1100741 MemP                    3 1100…           28.3             1450            0.253         0.00200             0.992 male                        0.277 110074101       
 7         1100760 MemP                    3 1100…           42.0             1444            0.253         0.00143             0.992 male                        0.288 110076001       
 8         1100768 MemP                    3 1100…           40.3             1443            0.245         0.000683            0.996 female                      0.430 110076801       
 9         1100211 MemP                    3 1100…           64.2             1444            0.250         0.000741            0.997 female                      0.453 110021101       
10         1100322 MemP                    3 1100…           22.6             1453            0.252         0.00127             0.992 male                        0.274 110032201       
# … with 1,325 more rows, and 12 more variables: Genetic_ID <chr>, EpiGen_debug_batch <dbl>, EpiGen_debug_trusted <dbl>, EpiGen_debug_for_gwas <dbl>, EpiGen_debug_for_ewas <dbl>,
#   EpiGen_debug_duplicated <dbl>, EpiGen_debug_genetic_duplicate <chr>, EpiGen_debug_european <dbl>, EpiGen_debug_sample_type <chr>, EpiGen_debug_content <chr>,
#   EpiGen_debug_date <chr>, EpiGen_debug_comment <chr>

The `epigen_add` function

Given you already have a version of the MOAS at hand (either the entire data frame or a subset of it), you can use the epigen_add function to directly add the epigenetic data to it. Simply explained, this function only really call on the get funtion and runs a merging join function to add only the epigenetic data that can be matched to the MOAS-like data provided. This means that if you have already subsetted the MOAS-rows to what you are intereted in, this function will not add any new rows of data. Any timepoint you have previously omitted from the MOAS will not be re-added, and the epigenetic data volumns will be added at end of the data frame.

 MOAS %>% 
  select(1:8) %>%  #reducing the data for the example, taking only first 8 columns
  epigen_add(file_path="../Genetics/LCBC_EPIgenetic_Data/batch1_190919/DNAm_horvath/18092019epiage_lifebrain.xlsx", 
             match_path = "data-raw/DNA/gID_MOAS_match.tsv",)

# A tibble: 4,613 x 14
   CrossProject_ID Birth_Date Sex   Subject_Timepoi… Project_Name Project_Wave EpiGen_DNAmAge EpiGen_noMissin… EpiGen_meanMeth… EpiGen_minMethB… EpiGen_maxMethB… EpiGen_predicte…
   <fct>           <date>     <chr>            <dbl> <chr>               <dbl>          <dbl>            <dbl>            <dbl>            <dbl>            <dbl> <chr>           
 1 1000401         1990-05-18 Fema…                1 NDev                    1           NA                 NA           NA            NA                  NA     NA              
 2 1000401         1990-05-18 Fema…                2 NDev                    3           32.0             1448            0.241         0.000242            0.991 female          
 3 1000401         1990-05-18 Fema…                2 NDev                    3           32.0             1448            0.241         0.000242            0.991 female          
 4 1000402         1995-12-07 Fema…                1 NDev                    1           NA                 NA           NA            NA                  NA     NA              
 5 1000403         1992-04-03 Fema…                1 NDev                    1           NA                 NA           NA            NA                  NA     NA              
 6 1000403         1992-04-03 Fema…                2 NDev                    2           NA                 NA           NA            NA                  NA     NA              
 7 1000403         1992-04-03 Fema…                3 NDev                    3           37.4             1442            0.243         0.000232            0.996 female          
 8 1000403         1992-04-03 Fema…                3 NDev                    3           37.4             1442            0.243         0.000232            0.996 female          
 9 1000404         1989-09-28 Fema…                1 NDev                    1           NA                 NA           NA            NA                  NA     NA              
10 1000404         1989-09-28 Fema…                2 NDev                    2           NA                 NA           NA            NA                  NA     NA              
# … with 4,603 more rows, and 2 more variables: EpiGen_meanXchromosome <dbl>, Genetic_ID <chr>

There are three necessary columns needed in the MOAS-like data for the merging to work correctly. It will fail if these are not all present

MOAS %>% 
  select(1:5) %>% # This will omit the Project_Wave column
  epigen_add(file_path="../Genetics/LCBC_EPIgenetic_Data/batch1_190919/DNAm_horvath/18092019epiage_lifebrain.xlsx", 
             match_path = "data-raw/DNA/gID_MOAS_match.tsv",)
 Error: One of 'CrossProject_ID', 'Project_Name', 'Project_Wave' is missing from the MOAS-like data. These are needed for merging.