knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(gRandma)

This will show how to load data into gRandma for analysis (i.e. create a gmaData object) and then describe the structure of the gmaData object.

Load data in

We start with a dataframe of population names, individual names, and then genotypes (codominant, diploid) in a two column per call format.

data_mh_snp[1:5,1:8]

While only microhaps are shown in these columns, this dataset has a mix of microhaps and SNPs (which are just microhaps with length 1 bp). There are seven populations and a total of 4121 individuals.

Consider that these samples from 8 populations represent your "baseline" and you are searching for their descendants. We are estimating error rates before we launch a project of collecting and genotyping the potential descendants to determine if this project is feasible. Now we can create a gmaData object.

We feed in this dataframe as the baseline in createGmaInput. We'll use a per allele error rate of .005 for all loci, and a dropout probability of .005 for all loci and all alleles. If you want to have variable error rates, these can be input as dataframes. See the detailed documentation for createGmaInput. We have a mixture of SNPs and microhaps, so for markerType we select "microhaps". For calculating error rates, we will set the alleleDistFunc to weight all alleles equally regardless of how similar they are.

gData_1 <- createGmaInput(baseline = data_mh_snp, perAlleleError = .005, dropoutProb = .005,
                                          markerType = "microhaps", alleleDistFunc = (function(x) return(1)))
gData_1

We see that we have 364 loci, 285 of them are biallelic, and there are some microhaps thrown in. We now have a gmaData object!

For most things, you will just take this object and use it as input for other functions in gRandma. But some users may want to access and/or manually edit the data being used. Or you may want to just inspect to make sure nothing unintended happened with the input data. Let's walk through the structure of the gmaData object.

gmaData objects are lists with nine entries:

gData_1$baseline[1:5,1:5]
gData_1$genotypeErrorRates[1]
gData_1$genotypeKeys[1]
gData_1$alleleKeys[1]
gData_1$baselineParams[[1]][1]
gData_1$missingParams[1:5]

Now let's pretend we have separate mixture and baseline populations. These should be two separate dataframes, with identical columns expect that the mixture dataframe should not have the column representing populations. We can split our example data into two:

mixtureData <- data_mh_snp[data_mh_snp$Pop == "Pop_4",]
mixtureData <- mixtureData[,-1] # remove Pop column from mixture
baselineData <- data_mh_snp[data_mh_snp$Pop != "Pop_4",]
mixtureData[1:5,1:5]
baselineData[1:5,1:5]

And now we can create a gmaData object with both a baseline and a mixture.

gData_2 <- createGmaInput(baseline = baselineData, mixture = mixtureData, perAlleleError = .005, dropoutProb = .005,
                                          markerType = "microhaps", alleleDistFunc = (function(x) return(1)))
gData_2


delomast/gRandma documentation built on March 8, 2024, 2:26 a.m.