createGmaInput: This creates a gmaData structure from user supplied...

View source: R/createGmaInput_flex.R

createGmaInput R Documentation

This creates a gmaData structure from user supplied dataframes

Description

This creates a gmaData structure from user-supplied dataframes.

Usage

createGmaInput(
  baseline,
  mixture = NULL,
  unsampledPops = NULL,
  perAlleleError = 0.005,
  dropoutProb = 0.005,
  markerType = c("microhaps", "microsats"),
  alleleDistFunc = NULL
)

Arguments

baseline

a dataframe of the baseline individuals used to infer relationships. The first column is the population each individual is from. The second column is the individual's identifier. The following columns are genotypes, with columns 3 and 4 being alleles 1 and 2 for locus 1, columns 5 and 6 being alleles 1 and 2 for locus 2, and so on. This is a "two column per call" type of organization. Microhap and SNP genotypes should be given as concatenated basecalls, with each base represented by a single character. Microsat genotypes should be given as either the allele length or the number of repeats. Some examples of microhap alleles are "ACA", "AAD", "CTCTGGA". Some examples of SNP alleles (which are just microhaplotypes with a length of one base) are "A", "G", "D". In these examples, deletions are represented by "D". Missing genotypes must be NA.
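As a minimal sketch of the layout described above, the following builds a toy baseline dataframe. The population, individual, and locus names are hypothetical and chosen only for illustration; the last individual has a missing genotype at the first locus.

```r
# Hypothetical toy baseline: two populations, one microhap locus and one SNP locus.
# All names and genotypes here are made up for illustration.
baseline <- data.frame(
  Pop      = c("PopA", "PopA", "PopB", "PopB"),  # column 1: population
  Ind      = c("a1", "a2", "b1", "b2"),          # column 2: individual identifier
  Locus1   = c("ACA", "ACA", "AAD", NA),         # allele 1 of locus 1 (microhap)
  Locus1.2 = c("ACA", "AAD", "AAD", NA),         # allele 2 of locus 1
  Locus2   = c("A", "G", "G", "D"),              # allele 1 of locus 2 (SNP, "D" = deletion)
  Locus2.2 = c("G", "G", "D", "D"),              # allele 2 of locus 2
  stringsAsFactors = FALSE
)

# Two genotype columns per locus follow the two identifier columns
n_loci <- (ncol(baseline) - 2) / 2
```

With the gRandma package installed, a dataframe of this shape would be passed as createGmaInput(baseline = baseline, markerType = "microhaps").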

mixture

a dataframe of the mixture individuals to infer relationships for. The first column is the individual's identifier. Following columns are genotypes, in the same manner as for baseline. The order and column names of the loci must be the same in both the baseline and mixture dataframes.
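The requirement that locus order and column names match can be checked before calling the function. The sketch below uses hypothetical toy dataframes; only the column layout follows the description above.

```r
# Hypothetical toy data; names and genotypes are made up for illustration.
baseline <- data.frame(
  Pop = c("PopA", "PopB"),
  Ind = c("a1", "b1"),
  Locus1 = c("ACA", "AAD"), Locus1.2 = c("ACA", "AAD"),
  Locus2 = c("A", "G"),     Locus2.2 = c("G", "G"),
  stringsAsFactors = FALSE
)

# The mixture has no population column: identifier first, then genotypes
mixture <- data.frame(
  Ind = c("m1", "m2"),
  Locus1 = c("ACA", NA), Locus1.2 = c("AAD", NA),
  Locus2 = c("G", "G"),  Locus2.2 = c("G", "D"),
  stringsAsFactors = FALSE
)

# Genotype columns start at column 3 in the baseline and column 2 in the mixture
same_loci <- identical(names(baseline)[-(1:2)], names(mixture)[-1])
stopifnot(same_loci)
```

A check like this fails early if a locus is missing, renamed, or out of order in either dataframe.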

unsampledPops

THIS OPTION IS CURRENTLY EXPERIMENTAL. A dataframe of the individuals sampled from the "unsampled populations", used to estimate allele frequencies in these populations. Column 1 has the name of the baseline population that an individual corresponds to, and column 2 is the individual's identifier (not currently used, but must be present). The following columns are genotypes, in the same manner as for baseline. The order and column names of the loci must be the same as in the baseline and mixture dataframes.

perAlleleError

either a constant value representing the per-allele error rate (the probability of observing any allele other than the correct allele) to use across all loci, or a dataframe with one row per locus. Column 1 is the locus name, and column 2 is the error rate for that locus. The locus name must be the column name of allele 1 for the corresponding locus in the baseline and mixture dataframes.
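A per-locus error table can be built from the genotype column names. This is a sketch under assumed names: the column names and error rates below are hypothetical, and only the two-column layout (locus name, error rate) comes from the description above.

```r
# Hypothetical baseline column names: two identifier columns, then two columns per locus
baseline_cols <- c("Pop", "Ind", "Locus1", "Locus1.2", "Locus2", "Locus2.2")

# The locus name is the column name of allele 1, i.e. columns 3, 5, 7, ...
allele1_names <- baseline_cols[seq(3, length(baseline_cols), by = 2)]

# One row per locus: column 1 = locus name, column 2 = per-allele error rate
perAlleleError <- data.frame(
  locus = allele1_names,
  error = c(0.005, 0.01),  # made-up rates for illustration
  stringsAsFactors = FALSE
)
```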

dropoutProb

either a constant value representing the probability that any allele at any locus drops out, or a dataframe with column 1 being the locus name, column 2 being the allele, and column 3 being the dropout probability. The locus name must be the column name of allele 1 for the corresponding locus in the baseline and mixture dataframes.
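For allele-specific dropout, the three-column layout described above might look like the following sketch. The locus names, alleles, and probabilities are hypothetical.

```r
# One row per locus/allele combination:
# column 1 = locus name (column name of allele 1), column 2 = allele,
# column 3 = dropout probability. All values are made up for illustration.
dropoutProb <- data.frame(
  locus   = c("Locus1", "Locus1", "Locus2", "Locus2", "Locus2"),
  allele  = c("ACA", "AAD", "A", "G", "D"),
  dropout = c(0.005, 0.02, 0.005, 0.005, 0.01),
  stringsAsFactors = FALSE
)

# Sanity check: every entry is a valid probability
stopifnot(all(dropoutProb$dropout >= 0 & dropoutProb$dropout <= 1))
```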

markerType

either "microhaps" if the dataset contains microhaps and/or SNPs, or "microsats" if the dataset contains microsats

alleleDistFunc

a function that takes the distance between two non-identical alleles (either the number of basepair differences for SNPs/microhaps, or the numeric distance for microsats) as input, and outputs the weight to give the probability of misgenotyping one allele as the other. The probabilities of misgenotyping a given allele are the normalized weights (between the given allele and all others) multiplied by the perAlleleError for that locus. The default is for misgenotyping to be equally likely across all alleles. If you wanted, for example, the probability to be inversely proportional to the distance, you could specify alleleDistFunc = function(x) return(1/x).
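The weighting described above can be worked through by hand. The sketch below is not the package's internal code; it only reproduces the stated arithmetic (normalized weights multiplied by perAlleleError) for a hypothetical microhap locus with three alleles, using a helper base_diffs that is introduced here for illustration.

```r
# Hypothetical alleles at one microhap locus, and an assumed error rate
alleles        <- c("ACA", "ACG", "TTG")
perAlleleError <- 0.005
alleleDistFunc <- function(x) 1 / x  # weight inversely proportional to distance

# Illustrative helper: number of differing bases between two equal-length alleles
base_diffs <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

true_allele <- "ACA"
others      <- setdiff(alleles, true_allele)

# Distances from the true allele: ACG differs at 1 base, TTG at 3 bases
d <- vapply(others, base_diffs, integer(1), b = true_allele)

# Normalize the weights, then scale by the per-allele error rate
w     <- alleleDistFunc(d)
probs <- w / sum(w) * perAlleleError
# probs: ACG = 0.00375, TTG = 0.00125 (they sum to perAlleleError)
```

Under this weighting, the allele one base away from the true allele receives three times the misgenotyping probability of the allele three bases away.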


delomast/gRandma documentation built on March 8, 2024, 2:26 a.m.