Note that to get this to really be a vignette, I must insert this at the top:
<!-- %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{quick overview of use/flow of functions} -->
I am just writing this down while I am writing these functions, to remind me of how I am going to use them.
Here are the standard steps that I would take in prepping up all the variables. You will of course want to prepare by loading the library:
library(fullsniplings) library(microbenchmark) library(ggplot2)
snp_genos <- get_snp_genos(fs_dev_test_data$plain_snp_data) snp_genos
So a 0 would be (1,0,0), a 1 would be (0,1,0), a 2 would be (0,0,1), and an NA would be (NA, NA, NA)
snp_indics <- genos_to_indicators(g = snp_genos$mat) snp_indics[, , 1:3]
Note that this creates a 3-D array.
geno_counts <- count_genos(snp_indics) geno_counts
afreqs <- alle_freqs(geno_counts) afreqs
Here is just a little function that returns genotype frequencies under H-W equilibrium given allele frequencies:
gfreqs_from_afreqs(afreqs)
And, here is a function that computes the probabilities of a pair of unrelated genotypes. The output gets stored as a matrix with 9 rows and L columns. Note that this will give you the prior on parental genotypes.
upgf <- unrelated_pair_gfreqs(gfreqs_from_afreqs(afreqs)) upgf
We only need to make a single 3x3x3 array for these. Rows are Parent One, columns are Parent Two, and the next dimension is for Kids. We just have a simple function that fills such an array. Check out its output:
trans_probs()
If we know the genotype frequencies of an unrelated pair, we can combine those with the transmission probabilities from that pair to get the joint probabilities of a pair of full siblings. We have a function for that which actually just starts from the expected genotype frequencies in the population. It takes a genotyping error rate that recycles over loci:
gf <- gfreqs_from_afreqs(afreqs) fsp <- full_sibling_pair_gfreqs(gf, mu = .005) fsp[,]
compare this to the unrelated pair probs and see that there are higher probabilities for states with more indentity between the two individuals
upgf[,]
Of course, most of the time we will want to add genotyping error.
Getting these likelihoods is pretty straightforward. They don't depend on the allele frequencies. These are just the probability of an individuals observed genotype given the (typically unknown) true genotype. This, of course, depends on the genotyping error rate of each locus.
We make a big 3 x 3 x L array of the probs of observing 0, 1, or 2, (the columns) given the true genotype is 0, 1, or 2 (in the rows). Under our simple independent mutation model that we are using, where each gene copy has a chance $\mu$ of being detected incorrectly, our likelihood table for $P(\text{Observed geno}=y~|~\text{True geno}=x)$ for a single locus looks like this: [ \begin{array}{lr|ccc} & & & y & \\ & & 0 & 1 & 2 \\ \hline & 0 & (1-\mu)^2 & 2\mu(1-\mu) & \mu^2 \\ x~~& 1 & \mu(1-\mu) & \mu^2 + (1-\mu)^2 & \mu(1-\mu) \\ & 2 & \mu^2 & 2\mu(1-\mu) & (1-\mu)^2 \\ \end{array} ] It is important to stress that this is just a super simple genotyping error model. However, any error model that you might dream up (so long as it is independent between loci) can be parameterized by a matrix like this. So, if you want a different genotyping error model, it can be specified like this. Note that we can probably estimate the elements of this matrix with very large samples.
We compute this in the function lik_array_from_simple_geno_err()
, and though we think of the
result as a 3 x 3 x L array, we actually return it as a 3 x 3*L matrix as that will facilitate
matters later.
Here is an example of what it looks like to run it for three loci at three different genotyping error rates:
lik_array_from_simple_geno_err(L = 3, mu = c(0.0, .005, .1))
In the output, notice that the column names are such that mu_1.0
describes the situation in
which one observes a genotype of 0 at a locus with genotyping error rate mu_1
.
To actually get genotype likelihoods for each of the individuals, given their
observed genotypes, we call the function get_indiv_geno_lik()
which calls the function
lik_array_from_simple_geno_err
. There is some funky stuff to deal with NAs here, and
a little bit of weird vectorization, but it seems to be working nicely. As long as it doesn't
gobble up too much memory with really large data sets.
It would be reasonable to think of the object this function returns as a
G x L x N array where G is the number of genotypic states (with diploid SNPs
that is going to be 3, but if we ever got into different ploidies, it would be
different). However, in practice, we will want to store it as a matrix with G x L
rows and N columns. But it is nice to be able to print it out and see it as a
G x L x N array with all the dimnames and things on it. So, what we do is
return it as an object of an S3 class called ind_geno_lik_array
that has its matrix representation
but also carries around attributes igla_dimnames
and igla_dim
which can be used
to print it in a nice format.
Here it is in action:
geno_liks <- get_indiv_geno_lik( SI = snp_indics, mu = c(0.0, 0.001, 0.005, 0.01, 0.1) )
The object geno_liks
inherits from an S3 class of geno_qty_array
so when we print it we get a specialized
result since we have defined a print.geno_qty_array
function:
geno_liks
So, let's print it out for all 5 loci and individuals 1, 2, 19, and 20:
gqa_natural(geno_liks)[, , c(1, 2, 19, 20)]
And compare that to their observed genotypes to see how it all works together.
snp_genos$mat[, c(1:2, c(-1,0) + ncol(snp_genos$mat))]
Notice in particular, that the likelihood for the true genotype of an individual at a locus for which data is missing at that individual is constant (1).
In the preceding section we did a calculation that yielded the likelihood that an individual's true genotype is 0, 1, or 2, given the observed genotype of that individual. Here are ambitions are one generation greater. We want to compute the likelihood of the genotypes of the two parents of each individual given the individual's observed genotype. There are 9 possible genotypes of the pair (Paren One can be 0, 1, or 2, and so can parent Two.). These could be condensed to 6 equivalence classes, but that is not worth it here, I feel. So we won't. Here are some example probabilities that we are dealing with:
Parent One | Parent Two | Kid | Prob ----------- | ------------ | -------- | ------ 0 | 0 | 0 | 1 0 | 0 | 1 | 0 0 | 0 | 2 | 0 0 | 1 | 0 | $\frac{1}{2}$ 0 | 1 | 1 | $\frac{1}{2}$ 1 | 1 | 0 | $\frac{1}{4}$
and so forth.
This is something that can probably be done efficiently in R with some wrangling, but guessing that it will be easier in compiled code with some for loops, I did it with Rcpp.
marriage_liks <- per_kid_marriage_likelihoods(geno_liks, trans_probs()) marriage_liks # print it out. You get the first and last two individuals and the first two loci
Notice that I have not figured out how to propagate the dimnames very well in Rcpp, so I just do it there in R. Probably will leave it that way, too.
We can take care of all of the above in one function and send the result back in a list. That function is
prep_all_variables
. Here we prep all the variables for the mykiss data set that has
r dim(fs_dev_test_data$mykiss_genos)[1]
individuals typed at r dim(fs_dev_test_data$mykiss_genos)[2]/2
SNPs.
system.time(mykiss <- prep_all_variables(fs_dev_test_data$mykiss_genos, 0.005))
So, that happens pretty quickly (and it is about two times slower running under knitr
while making this document).
And we can also see what the memory requirements are here:
lapply(mykiss, function(x) format(object.size(x), units="auto"))
We see that this isn't looking too bad. We could toss some of those components, but then there is going to be other overhead. The total use is:
format(object.size(mykiss), units="auto")
So, let's round that to 20 Mb for 1446 fish. Thus, 200,000 fish will require about r 20 * (200000/1446) / 1000
Gb,
so that seems to be well within the "reasonable zone."
One of my ideas for the program is to only propose moving individuals into sibships that already include an individual that they have chance of being a full sib with. In order to figure out who we will say they "have a chance of being a full sib with" we will simulate a bunch of full sib genotypes from the underlying allele frequencies in the population and look at the distribution of their log likelihood ratios under the hypotheses of full sibling versus unrelated.
Here is a function that simulates genotypes of full sibling pairs an unrelated pairs and returns the loglikelihood ratios for those
mykiss_gf <- gfreqs_from_afreqs(mykiss$afreqs) system.time(logls <- simulate_sib_pair_logls(mykiss_gf, 0.005, 100000))
Once we have those we can plot them if we are so inclined:
ggplot(logls, aes(x = LogL_ratio, fill=Relat)) + geom_density(alpha = 0.3)
And, let's say that during Gibbs sampling for individuals we were going to only consider those sibships that included individuals having a Logl_ratio that was at least as high as 99.9% of the simulated values. If we did that, how many unrelated pairs would not make it in there, on average?
q1000 <- quantile(logls$LogL_ratio[logls$Relat=="Full_Sib_Pairs"], probs=.001) q1000 # and see how many of the unrelated pairs have logls below that: mean(logls$LogL_ratio[logls$Relat=="Unrelated_Pairs"] > q1000)
So, that appears to be about 3%. That is pretty significant. That is a pretty large decrease in computational requirements.
This function merely creates a list of full sibling groups. Each sibling group contains the base-0 index of the individuals that are in the sibling group. This function just puts everyone in their own sibling group.
If we have a list of the the (base-0) indices of all the kids belonging together in full sibling groups then we can compute the likelihood of the marriage nodes given all those kids. Here is what one of those sib lists might look like (in this case I have the names of the parents naming the list):
head(fs_dev_test_data$mykiss_sib_list)
And, of course we need to have a corresponding array of individual offspring genotype likelihoods. With those two ingredients we have a function (in Rcpp, as it seems that will be faster) that returns a 3 x 3 x L x M array of multi-kid marriage likelihoods. M is the number of marriages (and is the length of the sib list.)
# this is broken now that I have updated the way fs_dev_test_data$mykiss_sib_list is stored. #multi_mar_lik <- multi_kid_marriage_likelihoods(fs_dev_test_data$mykiss_sib_list, mykiss$pk_marriage_liks) #microbenchmark(multi_kid_marriage_likelihoods(fs_dev_test_data$mykiss_sib_list, mykiss$pk_marriage_liks) ) # now I do it like this: multi_mar_lik <- make_LMMFS_from_FSL_and_LMMI(fs_dev_test_data$mykiss_sib_list, mykiss$pk_marriage_liks) microbenchmark(make_LMMFS_from_FSL_and_LMMI(fs_dev_test_data$mykiss_sib_list, mykiss$pk_marriage_liks))
Note that make_LMMFS_from_FSL_and_LMMI potentially overallocates memory as it uses the dimensions of its LMMI argument to allocate to LMMFS. So, let's chop off the parts it does not need:
Also, I fear that accessing elements of a list is much slower in Rcpp than accessing elements of vectors, so perhaps I will rewrite things to not store the sib_list as a list of lists. We'll see though...
Let us see what the ouput looks like:
multi_mar_lik
Aha, let us print just a part of it. Maybe the first two marriages and the first 3 loci:
gqa_natural(multi_mar_lik)[ , , 1:3, 1:2]
So, I have written the function update_marriage_likelihoods_in_place
which I hope uses Rcpp's call be reference capabilities to modify the marriage likelihoods array in place. Let us test it out:
bb <- multi_mar_lik # make another copy promise bb[ , c(101, 201, 301)] <- 0 # change it. That forces copying the promise from multi_mar_lik, too aa <- bb # promise copy to aa # now let's see how this is looking: which(colSums(bb)==0) which(colSums(aa)==0) which(colSums(multi_mar_lik)==0)
OK, now we will try to reconstitute rows 101, 201, and 301:
update_marriage_likelihoods_in_place(fs_dev_test_data$mykiss_sib_list, mykiss$pk_marriage_liks, bb, c(100, 200, 300))
And see what changed:
which(colSums(bb)==0) which(colSums(aa)==0) which(colSums(multi_mar_lik)==0) identical(multi_mar_lik, bb)
Well, it looks like bb
was modified, and so was aa
. So, that is the dangerous part. Modifying via a pointer in Rcpp does not force
a promise to be copied. Interesting. So, clearly I should rewrite multi_kid_marriage_likelihoods
to just be an R wrapper for
update_marriage_likelihoods_in_place
, and deal with all the dimnames naming and what not in R. OK. I
will get around to that eventually.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.