Fit the multinomial OriGen model and place unknown individuals who may be admixed

Description

This function fits the multinomial OriGen model and places individuals of unknown origin who may be admixed. Rather than the probability that an individual comes from each location, it estimates the admixture fraction contributed by each location.

Usage

FitMultinomialAdmixedModelFindUnknowns(DataArray,SampleCoordinates,UnknownDataArray,
	MaxGridLength=20,RhoParameter=10,LambdaParameter=100,MaskWater=TRUE,NumberLoci=-1)

Arguments

DataArray

An array giving the number of alleles, grouped by sample site, for each locus. The dimension of this array is [MaxAlleles,SampleSites,NumberSNPs].
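As a rough illustration of this layout, the sketch below tallies allele copies into an array of the stated shape. The input data frame and its column names (Site, Locus, Allele) are hypothetical and are not part of the package; only base R is used.

```r
# Hypothetical toy input: one row per observed allele copy.
# The column names Site, Locus and Allele are assumptions for this sketch.
Observations <- data.frame(
  Site   = c(1, 1, 1, 2, 2, 2, 2),
  Locus  = c(1, 1, 2, 1, 1, 2, 2),
  Allele = c(1, 2, 1, 1, 1, 2, 2)
)

MaxAlleles  <- max(Observations$Allele)
SampleSites <- max(Observations$Site)
NumberLoci  <- max(Observations$Locus)

# Tally allele copies into the [MaxAlleles, SampleSites, NumberLoci] layout
DataArray <- array(0, dim = c(MaxAlleles, SampleSites, NumberLoci))
for (k in seq_len(nrow(Observations))) {
  a <- Observations$Allele[k]
  s <- Observations$Site[k]
  l <- Observations$Locus[k]
  DataArray[a, s, l] <- DataArray[a, s, l] + 1
}

dim(DataArray)
```

Each entry DataArray[a, s, l] then holds how many copies of allele a were seen at site s for locus l.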

SampleCoordinates

This is an array which gives the longitude and latitude of each of the found sample sites. The dimension of this array is [SampleSites,2], where the second dimension represents longitude and latitude respectively.

UnknownDataArray

This is an array which gives the alleles for the individuals of unknown origin. The dimension of this array is [NumberUnknowns,2,NumberLoci], where 2 represents the 2 alleles each individual has at each locus. Note that these should not be allele lengths but rather the allele numbers matching the first dimension of DataArray. Values of 0 or below indicate unknown alleles, and it is assumed that at each locus both alleles are either known or unknown.
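The encoding described above can be sketched as follows. The genotypes are made up for illustration, and the consistency check at the end is only a suggested sanity check, not something the package is known to perform.

```r
NumberUnknowns <- 2
NumberLoci <- 3

# Start with every allele marked unknown (0)
UnknownDataArray <- array(0, dim = c(NumberUnknowns, 2, NumberLoci))

# Individual 1: alleles 1 and 2 at locus 1, homozygous for allele 2 at locus 2,
# locus 3 left as 0,0 (both alleles unknown)
UnknownDataArray[1, , 1] <- c(1, 2)
UnknownDataArray[1, , 2] <- c(2, 2)

# Individual 2: locus 2 unknown
UnknownDataArray[2, , 1] <- c(1, 1)
UnknownDataArray[2, , 3] <- c(1, 2)

# Suggested check: at every locus, both alleles are known or both are unknown
Known <- UnknownDataArray > 0
Consistent <- Known[, 1, , drop = FALSE] == Known[, 2, , drop = FALSE]
stopifnot(all(Consistent))
```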

MaxGridLength

An integer giving the maximum number of boxes to fill the longer side of the region. Note that computation time increases quadratically as this number increases. However, the number should also be high enough to separate different sample sites; otherwise they will be binned together as a single site.

RhoParameter

This is a real precision parameter weighting the amount of smoothing in the allele frequency surface. A higher value flattens out the surface while a lower value allows for more fluctuations. The default value of 10 was used in our analysis and should prove a good starting point. To choose a value by cross-validation, please see FindRhoParameterCrossValidation.

LambdaParameter

This is a real precision parameter weighting the admixture fractions algorithm. For the most part, this does not need to be changed as it seems to only affect the time to convergence.

MaskWater

If TRUE, this logical parameter restricts the heat maps to land areas only.

NumberLoci

An integer value giving the number of loci to use in the analysis. If set to -1, which is the default, it uses all loci.

Value

List with the following components:

AdmixtureFractions

An array giving the admixture fraction from the given location. In other words, this is the fractional contribution of the location to the unknown individual's genetic data. The dimension of this array is [NumberLongitudeDivisions, NumberLatitudeDivisions, NumberUnknowns], where either NumberLongitudeDivisions or NumberLatitudeDivisions is equal to MaxGridLength (an input to this function) and the other is scaled so that the geodesic distance between points horizontally and vertically is equal.
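A minimal sketch of reading this array, assuming only the dimension layout stated above: it lists the grid cells that contribute at least 0.01 to one unknown individual. The array here is a made-up stand-in, not real output from the function.

```r
# Mock stand-in for the AdmixtureFractions output (values are made up)
AdmixtureFractions <- array(0, dim = c(4, 3, 2))
AdmixtureFractions[1, 2, 1] <- 0.7
AdmixtureFractions[4, 3, 1] <- 0.3
AdmixtureFractions[2, 1, 2] <- 1.0

# Surface for one unknown individual: [longitude division, latitude division]
UnknownNumber <- 1
Surface <- AdmixtureFractions[, , UnknownNumber]

# Grid cells whose contribution is at least 0.01
Cells <- which(Surface >= 0.01, arr.ind = TRUE)
Contributions <- data.frame(
  LongitudeIndex = Cells[, 1],
  LatitudeIndex  = Cells[, 2],
  Fraction       = Surface[Cells]
)

# Largest contributions first
Contributions <- Contributions[order(-Contributions$Fraction), ]
Contributions
```

The index columns refer to grid divisions; the GridCoordinates output can be used to translate them into longitude and latitude.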

DataArray

An array giving the number of alleles, grouped by sample site, for each locus. The dimension of this array is [MaxAlleles,SampleSites,NumberSNPs].

NumberLoci

This shows the integer number of loci found.

GridLength

An array giving the number of longitudinal and latitudinal divisions. The dimension of this array is [2], where the first number is longitude and the second is latitude.

RhoParameter

A real value showing the inputted RhoParameter value.

SampleSites

This shows the integer number of sample sites found.

MaxGridLength

An integer giving the maximum number of boxes to fill the longer side of the region. Note that computation time increases quadratically as this number increases. However, the number should also be high enough to separate different sample sites; otherwise they will be binned together as a single site. This number was part of the inputs.

SampleCoordinates

This is an array which gives the longitude and latitude of each of the found sample sites. The dimension of this array is [SampleSites,2], where the second dimension represents longitude and latitude respectively.

GridCoordinates

An array showing the corresponding coordinates for each longitude and latitude division. The dimension of this array is [2,MaxGridLength], with longitude coordinates coming first and latitude second. Note that one of these rows may not be filled entirely. The associated output GridLength should be used to find the lengths of the two rows. Rows not filled in entirely will contain zeroes at the end.
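A short sketch of dropping the zero padding, assuming the [2,MaxGridLength] layout and the GridLength output described in this section. The coordinate values below are invented for illustration.

```r
# Mock stand-ins for the GridLength and GridCoordinates outputs
MaxGridLength <- 5
GridLength <- c(5, 3)  # longitude divisions, latitude divisions
GridCoordinates <- array(0, dim = c(2, MaxGridLength))
GridCoordinates[1, 1:5] <- c(-9, 2.75, 14.5, 26.25, 38)
GridCoordinates[2, 1:3] <- c(34, 47, 60)  # last two entries stay 0 (padding)

# Keep only the filled part of each row
Longitudes <- GridCoordinates[1, seq_len(GridLength[1])]
Latitudes  <- GridCoordinates[2, seq_len(GridLength[2])]
```

Trimming by GridLength rather than by testing for zeroes avoids discarding a genuine coordinate of 0.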

NumberUnknowns

This is an integer value showing the number of unknowns found.

UnknownDataArray

This is an array which gives the alleles for the individuals of unknown origin. The dimension of this array is [NumberUnknowns,2,NumberLoci], where 2 represents the 2 alleles each individual has at each locus. Note that these should not be allele lengths but rather the allele numbers matching the first dimension of DataArray.

IsLand

This is a logical valued array that is TRUE when the given coordinates are over land and FALSE when over water. The dimension of this array is [GridLength[1],GridLength[2]].
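For custom plotting outside the package's own plot functions, the mask could be applied by hand along the following lines. Both matrices are made-up stand-ins with matching dimensions, and blanking water cells with NA is an assumption of this sketch rather than what the package does internally.

```r
# Mock stand-ins: a 3 x 2 land mask and a surface of the same shape
IsLand <- matrix(c(TRUE, TRUE, FALSE,
                   TRUE, FALSE, FALSE), nrow = 3, ncol = 2)
Surface <- matrix(c(0.5, 0.2, 0.1,
                    0.1, 0.05, 0.05), nrow = 3, ncol = 2)

# Blank out cells over water so they are skipped when plotting
Masked <- Surface
Masked[!IsLand] <- NA
Masked
```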

MaxAlleles

An integer giving the maximum number of alleles across all loci.

Author(s)

John Michael Ranola, John Novembre, and Kenneth Lange

References

Ranola J, Novembre J, Lange K (2014) Fast Spatial Ancestry via Flexible Allele Frequency Surfaces. Bioinformatics, in press.

See Also

ConvertUnknownPEDData for converting two Plink PED files (known and unknown) into a format appropriate for analysis,

FitOriGenModelFindUnknowns for fitting allele surfaces to the converted data and finding the locations of the given unknown individuals,

PlotUnknownHeatMap for a quick way to plot the resulting unknown heat map surfaces from FitOriGenModelFindUnknowns,

FitMultinomialAdmixedModelFindUnknowns for fitting allele surfaces to the converted data and finding the locations of the given unknown individuals who may be admixed,

PlotAdmixedSurface for a quick way to plot the resulting admixture surfaces from FitAdmixedFindUnknowns.

Examples

#this example not run because it takes longer than 5 secs
#note - type example(FunctionName, run.dontrun=TRUE) to run the example where FunctionName is
#the name of the function

## Not run: 

##Data generation
SampleSites=5
NumberLoci=3
MaxAlleles=2
if(MaxAlleles==2){
	NumberAllelesAtEachLocus=rep(2,NumberLoci)
}else{
	NumberAllelesAtEachLocus=sample(2:MaxAlleles,NumberLoci,replace=TRUE)
}
TestData=array(0,dim=c(MaxAlleles,SampleSites,NumberLoci))
for(i in 1:NumberLoci){
	for(j in 1:NumberAllelesAtEachLocus[i]){
		TestData[j,,i]=sample(1:10,SampleSites,replace=TRUE)
	}
}
##This data is simulated in Europe which is around Longitude -9 to 38 and Latitude 34 to 60
TestCoordinates=array(0,dim=c(SampleSites,2))
TestCoordinates[,1]=runif(SampleSites,-9,38)
TestCoordinates[,2]=runif(SampleSites,34,60)

##This simulates the unknown data
NumberUnknowns=2
UnknownData=array(0,dim=c(NumberUnknowns,2,NumberLoci))
for(i in 1:NumberUnknowns){
	for(j in 1:NumberLoci){
		UnknownData[i,,j]=sample(1:NumberAllelesAtEachLocus[j],2)
	}
}

##MaxGridLength is the maximum number of boxes allowed 
##to span the region in either direction
##Note that this number was reduced to allow the example to run in less than 5 secs
##RhoParameter is a tuning constant
print("MaxGridLength is intentionally set really low for fast examples. 
	Meaningful results will most likely require a higher value.")

##Fits the allele frequency surfaces only
#SurfaceTrials=FitMultinomialModel(TestData,TestCoordinates,
#MaxGridLength=20,RhoParameter=10)
#str(SurfaceTrials)
##Plotting the model
#PlotAlleleFrequencySurface(SurfaceTrials,LocusNumber=1,AlleleNumber=1,
#	MaskWater=TRUE,Scale=FALSE)

##You can generate heat maps of the unknown individuals' placements from the allele
##surfaces using GenerateHeatMaps, or use FitMultinomialModelFindUnknowns
#HeatMapTrials=GenerateHeatMaps(SurfaceTrials,UnknownData,NumberLoci=NumberLoci)
##Plotting the unknown heat map
#PlotUnknownHeatMap(HeatMapTrials,UnknownNumber=1,MaskWater=TRUE)
	
##Fitting the model and finding the unknown locations
#UnknownTrials=FitMultinomialModelFindUnknowns(TestData,TestCoordinates,
#	UnknownData,MaxGridLength=20,RhoParameter=10)
#str(UnknownTrials)
##Plotting the unknown heat map
#PlotUnknownHeatMap(UnknownTrials,UnknownNumber=1,MaskWater=TRUE)

##Fitting the admixed model
##Note that MaxGridLength is intentionally set unusably low so that the example
##runs in under 5 seconds.  The default value of 20 is more reasonable in general
AdmixedTrials=FitMultinomialAdmixedModelFindUnknowns(TestData,TestCoordinates,
	UnknownData,MaxGridLength=8,RhoParameter=10,MaskWater=TRUE)
		
##Plots the admixed surface disregarding fractions less than 0.01
PlotAdmixedSurface(AdmixedTrials,UnknownNumber=1)


## End(Not run)
