knitr::opts_chunk$set(echo = TRUE)
Goal: Check small molecules data for qtl2shiny and other uses.
See Box folder MS data sets for DO project and subfolders
See also Dan Gatti's AttieMetabolomics GitHub and Dan Gatti's FTP site.
Types of molecular measurements:
Here are the files I think are relevant:
size | type | name ---- | -------------- | ------------ 961K | mouse by trait | attie_liver_lipids_zscore_normalized.rds 225K | mouse by trait | attie_liver_metabolites_zscore_normalized.rds 1.2M | mouse by trait | attie_plasma_lipids_zscore_normalized.rds 1.6M | mouse by trait | attie_cecum_lipids_zscore_normalized.rds 176K | peaks | cecum_lipids_jax_norm_qtl_summary_thresh_6.csv 132K | peaks | liver_lipids_jax_norm_qtl_summary_thresh_6.csv 20K | peaks | liver_metabolites_jax_norm_qtl_summary_thresh_6.csv 193K | peaks | plasma_lipids_jax_norm_qtl_summary_thresh_6.csv 55K | description | cecum_lipid_pheno.descr.rds 31K | description | liver_lipid_pheno.descr.rds 2.2K | description | liver_metabolite_pheno.descr.rds 41K | description | plasma_lipids_pheno.descr.rds | untransformed | attie_liver_lipids_normalized.rds | genome scan | liver_lipids_jax_norm_all_qtl.rds
library(stringr) library(dplyr) library(tidyr)
datapath <- "~/Documents/Research/attie_alan/DO/data/DerivedData/"
The pheno.descr
files are different from what Karl used as dictionary, so will have to construct that manually. Phenotypes follow covariates in the mouse by trait tables, and the Mouse.ID
is DO-nnn
, as apposed to nn
and nnn
in Karl's work. All fixable but will take careful examination.
liverz <- readRDS(file.path(datapath, "molecule", "attie_liver_lipids_zscore_normalized.rds")) metabz <- readRDS(file.path(datapath, "molecule", "attie_liver_metabolites_zscore_normalized.rds")) plasmaz <- readRDS(file.path(datapath, "molecule", "attie_plasma_lipids_zscore_normalized.rds")) cecumz <- readRDS(file.path(datapath, "molecule", "attie_cecum_lipids_zscore_normalized.rds"))
covar <- readRDS(file.path(datapath, "covar.rds")) if(!("DOwave" %in% colnames(covar))) { covar <- cbind(covar, DOwave = 1 + covar[,"DOwave2"] + 2 * covar[,"DOwave3"] + 3 * covar[,"DOwave4"] + 4 * covar[,"DOwave5"]) }
Check sex.
sexl <- liverz$sex names(sexl) <- as.numeric(str_replace(liverz$Mouse.ID, "DO-", "")) tmp <- qtl2scan::get_common_ids(covar, sexl) table(covar[tmp, "sex"], sexl[tmp])
check wave and generation
wave <- liverz$wave names(wave) <- as.numeric(str_replace(liverz$Mouse.ID, "DO-", "")) tmp <- qtl2scan::get_common_ids(covar, wave) table(covar[tmp, "DOwave"], wave[tmp])
gen <- liverz$generation names(gen) <- as.numeric(str_replace(liverz$Mouse.ID, "DO-", "")) tmp <- qtl2scan::get_common_ids(covar, gen) table(covar[tmp, "DOwave"], gen[tmp])
table(covar[,"DOwave"]) table(liverz$generation)
tmp2 <- covar[tmp, "DOwave"] < 3 & gen[tmp] == 19 tmp2 <- names(tmp2[tmp2])
covar[tmp2,]
liverz %>% filter(as.numeric(str_replace(Mouse.ID, "DO-", "")) %in% tmp2) %>% select(1:11)
We need the following to tranform into Karl type files:
analyses_tbl
: c("pheno","output","pheno_group","longname","pheno_type","model","transf","offset","winsorize",
"sex","diet_days","Ins_per_islet","Glu_0min","Ins_0min","weight_sac","oGTT_weight",
"DOwave2","DOwave3","DOwave4","DOwave5","Gcg_content")
peaks_tbl
: c("pheno","longname","output","pheno_group","pheno_type","chr","lod","pos")
pheno_tbl
: data frame with mouseID as rownames, pheno as colnamescovar_mx
: matrix with mouseID as rownmaes, covar as colnamesDetails to be added here. Need to modify covar
, save as data frame, and also modify analyses_tbl
. Need to coordinate change to data frame in particular with use of DOwave
as covariate.
Need to extract covariates, compare that Mouse.ID, sex, wave, chrM, chrY agree across all sets. Then need to give Batch unique names and combine.
Need to extract phenos and put in form to be reused.
Need to get analyses info right.
Need to go to CSV files to get peaks.
analyses_tbl
: c("pheno","output","pheno_group","longname","pheno_type","model","transf","offset","winsorize",
"sex","diet_days","Ins_per_islet","Glu_0min","Ins_0min","weight_sac","oGTT_weight",
"DOwave2","DOwave3","DOwave4","DOwave5","Gcg_content")
Strategy is to make covar
a data frame (convert from current format as numeric matrix)
and use model.matrix
. We will add columns for Batch
(note names above) and for chrM
and chrY
, which might be useful in the future. Also add coat_color
. All other covariates are ignored here.
Here, covariates are indicated in liverd$covar_list
as colon-separated names, but all are sex:generation:Batch
.
Batch
is distinct (I think) for each type of molecule.
generation
is DO generation, which should agree with DOwave
(1=17, 2=18, 3=19, 4=21, 5=2_).
Turns out generation
has 5 discrepancies. We will use DOwave
The covariate values are with phenotypes as first several columns and some extra columns for mitochondria and chr Y
(chrM
, chrY
).
liverz %>% count(chrM, chrY) %>% spread(chrM, n)
Comparisons are made below between liver lipids and the other three analytes.
all.equal(liverz %>% select(Mouse.ID,coat_color,Batch,chrM,chrY), plasmaz %>% select(Mouse.ID,coat_color,Batch,chrM,chrY))
table(liverz$Batch,plasmaz$Batch)
all.equal(liverz %>% select(Mouse.ID,coat_color,Batch,chrM,chrY), metabz %>% select(Mouse.ID,coat_color,Batch,chrM,chrY))
matchM <- qtl2scan::get_common_ids(liverz$Mouse.ID, metabz$Mouse.ID) matchM2 <- which(diff(match(metabz$Mouse.ID, matchM)) != 1) matchM2 <- sort(unique(c(matchM2 - 1, matchM2, matchM2 + 1))) tmp <- rbind(metabz$Mouse.ID[matchM2], liverz$Mouse.ID[matchM2]) dimnames(tmp) <- list(c("metab","liver"), matchM2) tmp
matchM2 <- match(matchM, metabz$Mouse.ID) all.equal(liverz %>% select(Mouse.ID,coat_color,Batch,chrM,chrY), metabz[matchM2,] %>% select(Mouse.ID,coat_color,Batch,chrM,chrY))
table(liverz$Batch,as.numeric(str_replace(metabz[matchM2,"Batch"], "Batch", "")))
Liver and metab differ in Batch
for r sum(!(liverz$Batch == as.numeric(str_replace(metabz[matchM2,"Batch"], "Batch", ""))))
mice.
This is a different mouse: r liverz$Mouse.ID[which(!(liverz$Batch == as.numeric(str_replace(metabz[matchM2,"Batch"], "Batch", ""))))]
.
Cecum has 3 fewer mice than the other assays. Batch is substantially different for cecum.
all.equal(liverz %>% select(Mouse.ID,coat_color,Batch,chrM,chrY), cecumz %>% select(Mouse.ID,coat_color,Batch,chrM,chrY))
matchC <- qtl2scan::get_common_ids(liverz$Mouse.ID, cecumz$Mouse.ID) matchC2 <- match(matchC, liverz$Mouse.ID) all.equal(liverz[matchC2,] %>% select(Mouse.ID,coat_color,Batch,chrM,chrY), cecumz %>% select(Mouse.ID,coat_color,Batch,chrM,chrY))
table(liverz[matchC2,"Batch"],cecumz$Batch)
Liver and cecum differ in Batch
for r sum(!(liverz[matchC2,"Batch"] == cecumz$Batch))
mice.
Conclusions:
Batch
for r sum(!(liverz$Batch == plasmaz$Batch))
mice.Mouse.ID
DO-191
in the wrong place. Batch
is coded as Batchnn
rather than nn
,
and differs for mouse r liverz$Mouse.ID[which(!(liverz$Batch == as.numeric(str_replace(metabz[matchM2,"Batch"], "Batch", ""))))]
.Batch
differs for r sum(!(liverz[matchC2,"Batch"] == cecumz$Batch))
mice.Need to add generation
and Batch
, but there will be four Batch
(at least):
Batch_liver_lipid
, Batch_plasma_lipids
, Batch_cecum_lipids
, Batch_liver_metabolites
.
Now need to create analyses_tbl object here. First have to get peaks
Phenotypes are in files with names like attie_liver_lipids_zscore_normalized.rds
. That is, they are Z-scores of Jax-normalized data. These files have phenotypes prepended by covariates and some extra columns for mitochondria and chr Y
(chrM
, chrY
).
Approach here is to match Mouse.ID
and pick up phenotypes. These are large files (1Gb).
library(ggplot2) metabn <- readRDS(file.path(datapath, "molecule", "attie_liver_metabolites_zscore_uwisc_normalized.rds"))
dat <- data.frame(x = unlist(metabn[104,-(1:13)]), y = unlist(metabz[104,-(1:13)]))
ggplot(dat, aes(x,y)) + geom_point()
dat <- data.frame(x = unlist(metabn[161,-(1:13)]), y = unlist(metabz[106,-(1:13)]))
ggplot(dat, aes(x,y)) + geom_point()
dat <- data.frame(x = unlist(metabn[106,-(1:13)]), y = unlist(metabz[106,-(1:13)]))
ggplot(dat, aes(x,y)) + geom_point()
dat <- data.frame(x = unlist(metabn[161,-(1:13)]), y = unlist(metabz[161,-(1:13)]))
ggplot(dat, aes(x,y)) + geom_point()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.