README.md

ozymandias

Currently available (or at least visible) DNA methylation microarray pipelines are missing a number of obvious QC and label-matching steps. Microarrays may not be as fashionable or as expensive as single-cell BSseq and the like, but they benefit from tens of thousands of existing samples in hundreds of existing experiments. After a few dozen kicks and stings, I ended up writing a package to handle batches of DNA methylation measurements (which, if ascertained from Illumina arrays, are accompanied by copy number information and high-MAF SNPs), possibly accompanied by matched RNAseq or mRNA microarray measurements, translocation/inversion/mutation covariates, and tissue type indicators. A great many of these steps have been addressed elsewhere, or previously; the only genuine innovation here is to make them the default.

People rarely do what they know to be right; they do what is convenient, then repent. (Bob Dylan said this first.) Therefore, if you want people (yourself, for example) to do the right thing, make it the most convenient thing. For example, if the chrX/chrY copy number doesn't agree with the X inactivation status by DNA methylation, that gets flagged. If the high-MAF SNPs on an array don't agree across samples supposedly from the same person, that gets flagged. If the "epigenetic age" of normal samples is wildly different from their specified age, that gets flagged. If copy number aberrations for a supposedly identical sample are radically different between the RNA and DNA samples, that gets flagged. All that's left for the user to do is to reconcile the covariates with reality. (Sometimes chrY does fall off! Sometimes Xi suddenly becomes active again! Only you can say if this is expected.)

In an ideal world, you'll have the raw IDAT (and/or CEL and/or BAM/bigWig/h5) files handy. I'm happy to accept patches to accommodate groups that throw away thousands of dollars' worth of information, but get the raw data if you possibly can: it takes less time to do it right than to do it all over.
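To make the flagging idea above concrete, here is a minimal, language-agnostic sketch of two of those checks (SNP agreement across samples labeled as the same person, and predicted versus stated age). It is written in plain Python purely for illustration and is not the package's API: every function name, the 0.2/0.8 genotype cutoffs, the 0.9 concordance cutoff, and the 15-year age gap are assumptions made up for this example. It relies only on the fact that beta values at SNP probes tend to cluster near 0, 0.5, and 1.

```python
from itertools import combinations


def call_genotype(beta, low=0.2, high=0.8):
    """Bin a SNP-probe beta value into a crude genotype call (0, 1, or 2).

    The cutoffs are illustrative assumptions, not tuned values.
    """
    if beta < low:
        return 0
    if beta > high:
        return 2
    return 1


def snp_concordance(betas_a, betas_b):
    """Fraction of SNP probes with the same genotype call in both samples."""
    if not betas_a or len(betas_a) != len(betas_b):
        raise ValueError("need two non-empty beta vectors of equal length")
    matches = sum(
        call_genotype(a) == call_genotype(b) for a, b in zip(betas_a, betas_b)
    )
    return matches / len(betas_a)


def flag_snp_mismatches(samples, min_concordance=0.9):
    """Flag pairs of samples that share a subject label but disagree at SNPs.

    `samples` maps sample ID -> (subject label, list of SNP-probe betas).
    The concordance cutoff is a hypothetical default.
    """
    flags = []
    for (id_a, (subj_a, betas_a)), (id_b, (subj_b, betas_b)) in combinations(
        samples.items(), 2
    ):
        if subj_a == subj_b and snp_concordance(betas_a, betas_b) < min_concordance:
            flags.append((id_a, id_b))
    return flags


def flag_age_mismatches(stated_age, predicted_age, max_gap_years=15):
    """Flag samples whose predicted (epigenetic) age is far from the stated age.

    Both arguments map sample ID -> age in years; the gap is an assumption.
    """
    return [
        sample_id
        for sample_id in stated_age
        if abs(stated_age[sample_id] - predicted_age[sample_id]) > max_gap_years
    ]


# Toy data: s3 is labeled patientA but carries different genotypes.
samples = {
    "s1": ("patientA", [0.05, 0.51, 0.95, 0.48, 0.03]),
    "s2": ("patientA", [0.04, 0.49, 0.97, 0.52, 0.06]),
    "s3": ("patientA", [0.93, 0.05, 0.50, 0.96, 0.47]),
    "s4": ("patientB", [0.50, 0.94, 0.06, 0.05, 0.95]),
}
snp_flags = flag_snp_mismatches(samples)

stated = {"s1": 34, "s2": 35, "s4": 60}
predicted = {"s1": 37.2, "s2": 71.0, "s4": 58.5}
age_flags = flag_age_mismatches(stated, predicted)
```

Note that the sketch only flags; it never drops or relabels anything. Deciding whether a flagged pair is a label swap or real biology is still up to you.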

This is all terribly boring, which might explain why many groups seem not to bother with it. That, in turn, might have something to do with the difficulty reported in reproducing flashy findings in glamor journals. So if you've never been burned by label swaps or horrible batch effects or human error, go right ahead and ignore all of these things. What could possibly go wrong? (Other than losing a year or two of your life chasing ghosts, that is.)



RamsinghLab/ozymandias documentation built on May 9, 2019, 9:21 a.m.