# Fit the OriGen model and place unknown individuals

### Description

This function fits the OriGen model and places individuals of unknown origins.

### Usage

```
FitOriGenModelFindUnknowns(DataArray, SampleCoordinates,
                           UnknownData, MaxGridLength = 20, RhoParameter = 10)
```

### Arguments

`DataArray` |
An array giving the counts of major and minor alleles (the major allele being the one occurring most often in the dataset), grouped by sample site, for each SNP. The dimension of this array is [2,SampleSites,NumberSNPs]. |

`SampleCoordinates` |
This is an array which gives the longitude and latitude of each of the found sample sites. The dimension of this array is [SampleSites,2], where the second dimension represents longitude and latitude respectively. |

`UnknownData` |
An array showing the unknown individuals' genetic data. The dimension of this array is [NumberUnknowns,NumberSNPs]. |

`MaxGridLength` |
An integer giving the maximum number of boxes to fill the longer side of the region. Note that computation time increases quadratically as this number increases, but the number should also be high enough to separate different sample sites; otherwise they will be binned together as a single site. |

`RhoParameter` |
This is a real precision parameter weighting the amount of smoothing. A higher value flattens out the surface, while a lower value allows for more fluctuations. The default value of 10 was used in our analysis and should prove a good starting point. To choose a value by cross-validation, please see |
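
The dimension requirements above can be checked before fitting. The following is a minimal sketch, assuming toy inputs; `CheckOriGenInputs` is a hypothetical helper written for illustration and is not part of OriGen.

```r
# Hypothetical helper (not part of OriGen): checks that the three inputs
# have mutually consistent dimensions before the model is fitted.
CheckOriGenInputs <- function(DataArray, SampleCoordinates, UnknownData) {
  stopifnot(
    length(dim(DataArray)) == 3,
    dim(DataArray)[1] == 2,               # major and minor allele counts
    length(dim(SampleCoordinates)) == 2,
    dim(SampleCoordinates)[2] == 2,       # longitude, latitude
    length(dim(UnknownData)) == 2
  )
  SampleSites <- dim(DataArray)[2]
  NumberSNPs <- dim(DataArray)[3]
  stopifnot(
    nrow(SampleCoordinates) == SampleSites,
    ncol(UnknownData) == NumberSNPs
  )
  list(SampleSites = SampleSites, NumberSNPs = NumberSNPs,
       NumberUnknowns = nrow(UnknownData))
}

# Toy inputs (made up): 3 sample sites, 4 SNPs, 2 unknown individuals.
DataArray <- array(1:24, dim = c(2, 3, 4))
SampleCoordinates <- cbind(c(-5, 10, 30), c(40, 50, 55))  # longitude, latitude
UnknownData <- array(c(0, 1, 2, 2, 1, 0, 0, 1), dim = c(2, 4))
Dims <- CheckOriGenInputs(DataArray, SampleCoordinates, UnknownData)
```

If any dimension is inconsistent, `stopifnot` raises an error naming the failed condition, which is usually easier to diagnose than a failure inside the fitting routine.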

### Value

List with the following components:

`UnknownGrids` |
An array giving the probability that an unknown individual comes from the given location. The dimension of this array is [NumberLongitudeDivisions, NumberLatitudeDivisions, NumberUnknowns], where either NumberLongitudeDivisions or NumberLatitudeDivisions is equal to MaxGridLength (an input to this function) and the other is scaled so that the geodesic distance between points horizontally and vertically is equal. |

`DataArray` |
An array giving the counts of major and minor alleles (the major allele being the one occurring most often in the dataset), grouped by sample site, for each SNP. The dimension of this array is [2,SampleSites,NumberSNPs]. |

`NumberSNPs` |
This shows the integer number of SNPs found. |

`GridLength` |
An array giving the number of longitudinal and latitudinal divisions. The dimension of this array is [2], where the first number is longitude and the second is latitude. |

`RhoParameter` |
A real value showing the inputted RhoParameter value. |

`SampleSites` |
This shows the integer number of sample sites found. |

`MaxGridLength` |
An integer giving the maximum number of boxes to fill the longer side of the region. Note that computation time increases quadratically as this number increases, but the number should also be high enough to separate different sample sites; otherwise they will be binned together as a single site. This number was part of the inputs. |

`SampleCoordinates` |
This is an array which gives the longitude and latitude of each of the found sample sites. The dimension of this array is [SampleSites,2], where the second dimension represents longitude and latitude respectively. |

`NumberUnknowns` |
This is an integer value showing the number of unknown individuals found in UnknownData. |

`UnknownData` |
An array showing the unknown individuals' genetic data. The dimension of this array is [NumberUnknowns,NumberSNPs]. |

`GridCoordinates` |
An array showing the corresponding coordinates for each longitude and latitude division. The dimension of this array is [2,MaxGridLength], with longitude coordinates coming first and latitude second. Note that one of these rows may not be filled entirely. The associated output GridLength should be used to find the lengths of the two rows. Rows not filled in entirely will contain zeroes at the end. |
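
As a rough illustration of how these components fit together, the sketch below trims the zero padding from `GridCoordinates` using `GridLength` and then finds the most probable grid cell for one unknown individual. The `Result` list is a hand-built mock with made-up values that only mimics the documented layout; it is not output from the function.

```r
# Mock result following the documented layout (all values are made up).
MaxGridLength <- 4
GridLength <- c(4, 3)                          # longitude, latitude divisions
GridCoordinates <- array(0, dim = c(2, MaxGridLength))
GridCoordinates[1, 1:4] <- c(-9, 6, 21, 36)    # longitudes
GridCoordinates[2, 1:3] <- c(34, 47, 60)       # latitudes; last entry stays 0
UnknownGrids <- array(0.05, dim = c(4, 3, 1))
UnknownGrids[2, 3, 1] <- 0.45                  # most probable cell, unknown 1
Result <- list(UnknownGrids = UnknownGrids, GridLength = GridLength,
               GridCoordinates = GridCoordinates)

# Trim the zero padding: only the first GridLength entries of each row count.
Longitudes <- Result$GridCoordinates[1, seq_len(Result$GridLength[1])]
Latitudes <- Result$GridCoordinates[2, seq_len(Result$GridLength[2])]

# Locate the highest-probability cell for the first unknown individual.
Surface <- Result$UnknownGrids[, , 1]
Best <- which(Surface == max(Surface), arr.ind = TRUE)[1, ]
BestLocation <- c(Longitude = Longitudes[Best[1]],
                  Latitude = Latitudes[Best[2]])
```

Indexing with `seq_len(GridLength[i])` rather than dropping zeros by value avoids discarding a genuine coordinate of 0, such as a grid line on the prime meridian or the equator.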

### Author(s)

John Michael Ranola, John Novembre, and Kenneth Lange

### References

Ranola J, Novembre J, Lange K (2014) Fast Spatial Ancestry via Flexible Allele Frequency Surfaces. Bioinformatics, in press.

### See Also

`ConvertUnknownPEDData` for converting two Plink PED files (known and unknown) into a format appropriate for analysis,

`FitOriGenModelFindUnknowns` for fitting allele surfaces to the converted data and finding the locations of the given unknown individuals,

`PlotUnknownHeatMap` for a quick way to plot the resulting unknown heat map surfaces from `FitOriGenModelFindUnknowns`,

`FitAdmixedModelFindUnknowns` for fitting allele surfaces to the converted data and finding the locations of the given unknown individuals who may be admixed,

`PlotAdmixedSurface` for a quick way to plot the resulting admixture surfaces from `FitAdmixedModelFindUnknowns`.

### Examples

```
# This example is not run because it takes slightly longer than 5 seconds.
# Note: type example(FunctionName, run.dontrun=TRUE) to run the example,
# where FunctionName is the name of the function.
## Not run:
library(OriGen)

# Data generation: random allele counts for each sample site and SNP
SampleSites = 10
NumberSNPs = 5
TestData = array(sample(2 * (1:30), 2 * SampleSites * NumberSNPs,
                        replace = TRUE),
                 dim = c(2, SampleSites, NumberSNPs))

# Europe spans roughly -9 to 38 degrees longitude and 34 to 60 degrees latitude
TestCoordinates = array(0, dim = c(SampleSites, 2))
TestCoordinates[, 1] = runif(SampleSites, -9, 38)   # longitude
TestCoordinates[, 2] = runif(SampleSites, 34, 60)   # latitude

# Simulate the number of major alleles (0, 1, or 2) carried by each
# unknown individual at each SNP
NumberUnknowns = 2
TestUnknowns = array(sample(0:2, NumberUnknowns * NumberSNPs,
                            replace = TRUE),
                     dim = c(NumberUnknowns, NumberSNPs))

# Fitting the model
# MaxGridLength is the maximum number of boxes allowed to span the region
# in either direction
# RhoParameter is a tuning constant
trials4 = FitOriGenModelFindUnknowns(TestData, TestCoordinates,
                                     TestUnknowns, MaxGridLength = 20,
                                     RhoParameter = 10)
str(trials4)

# Plotting the unknown heat map for the first unknown individual
PlotUnknownHeatMap(trials4, UnknownNumber = 1, MaskWater = TRUE)
## End(Not run)
```