set.seed(22062017) library(omicsPrint) library(BiocStyle) library(GEOquery) library(SummarizedExperiment)
Here we illustrate how to detect data linkage errors with the
r Biocpkg("omicsPrint")
package using both artificially generated data and
experimental data. Several additional vignettes are available that
show the use of the package on further experimental data in different
settings, i.e., 450k DNA methylation and imputed
genotypes.
Here we generate a single vector with 100 randomly drawn integers from the set; 1, 2, 3, representing 100 SNP calls from a single individual. Three additional individuals are generated by randomly swapping a certain fraction of the SNPs. Swapping 5 SNPs will introduce a few mismatches mimicking a situation where the same individual was measured twice (replicate) but with measurement error. Swapping 50% of the SNPs will be similar to the difference in genotypes between parents and offspring. Swapping all SNPs will result in a situation similar to comparing two unrelated individuals.
swap <- function(x, frac=0.05) { n <- length(x) k <- floor(n*frac) x1 <- sample(1:n,k) x2 <- sample(1:n,k) ##could be overlapping x[x2] <- x[x1] x } x1 <- 1 + rbinom(100, size=2, prob=1/3) x2 <- swap(x1, 0.05) ##strongly related e.g. replicate x3 <- swap(x1, 0.5) ##related e.g. parent off spring x4 <- swap(x1, 1) ##unrelated x <- cbind(x1, x2, x3, x4)
Now x
contains the 100 SNPs for the four individuals using head
we
can inspect the first six SNPs.
head(x)
allelesharing
algorithmWe use Identity by State (IBS) for the set of SNPs to infer sample relations. See @Abecasis2001, for the description of this approach applied to genetic data. Briefly, between each sample pair, the IBS-vector is calculated, which is a measure of genetic distance between individuals. Next, the vector is summarized by its mean and variance. A mean of 2 and variance of 0 indicates that the samples are identical.
data <- alleleSharing(x, verbose=TRUE)
The set of SNPs may contain uninformative SNPs, SNPs of bad quality or
even SNPs could be missing. The following pruning steps are
implemented to yield the most informative set of SNPs (thresholds can
be adjusted see ?alleleSharing
):
callRate = 0.95
)coverageRate = 2/3
)alpha = 0
).maf = 0
) Hardy-Weinberg test-statistics is calculated using a $\chi^2$-test and Bonferonni multiple testing correction is performed.
data
By default no relations are assumed except for the self-self relations.
The output is a data.frame
containing all pairwise comparisons with
the mean and variance of the IBS over the set of SNPs and the reported
sample relationship, including the identifiers.
Since we provided a list of known relations and assume that the majority is correct, we can build a classifier to discover misclassified relations. Linear discriminant analysis is used to generate a confusion-matrix, which is subsequently used to graphically represent the classification boundaries and generate an output file with misclassified sample pairs.
mismatches <- inferRelations(data) mismatches
There is one misclassified sample, namely the replicate that we
introduced but was not a priori specified as an existing
relationship. The true relationship with between sample x1
and
sample x2
is an identical relation. Furthermore, we see two sample
pairs with mean IBS of r round(mismatches$mean, 2)
and variance
r round(mismatches$var, 2)
which is an indication that also these pairs
share a considerable number of alleles. If known, such relationships
can be specified prior to analysis.
relations <- expand.grid(idx = colnames(x), idy= colnames(x)) relations$relation_type <- "unrelated" relations$relation_type[relations$idx == relations$idy] <- "identical" relations$relation_type[c(2,5)] <- "identical" ##replicate relations$relation_type[c(3,7,9,10)] <- "parent offspring" relations
Rerun the allelesharing algorithm now provided with the known relations.
data <- alleleSharing(x, relations=relations) data mismatches <- inferRelations(data) mismatches
All misclassified relations were resolved.
The previous example showed how to perform sample relationship
verification within a single omics data type. If a second set of SNPs
is available obtained from a different omic data type (and the SNPs
are partly overlapping), r Biocpkg("omicsPrint")
can be used to verify
relationships between omics types, e.g. to establish whether two omics
data types were indeed generated for the same individual in order to
exclude or detect sample mix-ups.
In this artificial example a random subset of 80 SNPs is selected as the set of SNPs from a different omic type. First running without providing the known relations.
rownames(x) <- paste0("rs", 1:100) y <- x[sample(1:100, 80),] data <- alleleSharing(x, y)
Note that pruning is performed on both data types and automatically a
set of overlapping SNPs (80) is used provided that the rownames of x
and y
are identical (this also holds for sample relations where the
relation identifiers idx
and idy
should match with the columnames
of x
and y
).
data mismatches <- inferRelations(data) mismatches
Now running with providing the known relations.
data <- alleleSharing(x, y, relations) data mismatches <- inferRelations(data) mismatches
Providing the known, true relationships thus yields no missclassified sample relationships.
SummarizedExperiment
Here we will show how you could varify sample relationships on publicly available DNA methylation data. The dataset used here contains pairs of monozygotic twins. We will extract the beta-value matrix from GEO GSE100940, paper in press.
First we extract the data from GEO using the GEOquery-package.
library(GEOquery) library(SummarizedExperiment) file <- tempfile(fileext = ".txt.gz") cnt <- 0 value <- -1 while(value != 0 & cnt < 25) { value = download.file("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE100nnn/GSE100940/matrix/GSE100940_series_matrix.txt.gz", file) cnt <- cnt + 1 } gset <- getGEO(filename=file, getGPL=FALSE)
library(GEOquery) library(SummarizedExperiment) file <- tempfile(fileext = ".txt.gz") download.file("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE100nnn/GSE100940/matrix/GSE100940_series_matrix.txt.gz", file) gset <- getGEO(filename=file, getGPL=FALSE)
Next we convert the returned object into a SummarizedExperiment:
se <- makeSummarizedExperimentFromExpressionSet(gset) se
Sample data can be extracted from the SummarizedExperiment
-object using the
colData
-function and we can see which pair of twins each sample belongs to
through the source_name_ch1
field. Using this knowledge we can construct a
table of expected relationships:
r <- expand.grid(idx=colnames(se), idy=colnames(se)) r$Xpair <- sapply(strsplit(as.character(colData(se)[r$idx, "source_name_ch1"]), split = "_"), head, 1) r$Ypair <- sapply(strsplit(as.character(colData(se)[r$idy, "source_name_ch1"]), split = "_"), head, 1) r$relation_type <- "unrelated" r$relation_type[r$Xpair == r$Ypair] <- "twin" r$relation_type[r$idx == r$idy] <- "identical" head(r)
Several probes on the array contain SNPs occurring frequently in different populations[@Chen2013; @Zhou2016]. We can use these to verify the expected relationships. We have made these data available from inside of this package.
Now we make a selection of CpGs probably affected by polymorphic SNPS in populations from East Asian, as these samples are from South Korea:
data(hm450.manifest.pop.GoNL) cpgs <- names(hm450.manifest.pop.GoNL[ mcols(hm450.manifest.pop.GoNL)$MASK.snp5.EAS]) se <- se[cpgs,]
Next the beta-values are converted to genotypes using our enhanced K-means algorithm:
dnamCalls <- beta2genotype(se, assayName = "exprs") dim(dnamCalls) dnamCalls[1:5, 1:5]
The DNA methylation based genotype calls can be directly supplied to the allelesharing algorithm to perform the intra-omic sample matching:
data <- alleleSharing(dnamCalls, relations = r, verbose = TRUE) mismatches <- inferRelations(data) mismatches
The twins are predicted as being identical to each other. This is not unexpected as they are monozygotic.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.