parallel_structure: FUNCTION TO RUN PARALLEL JOBS IN STRUCTURE


View source: R/parallel_structure.R

Description

parallel_structure is an R function that calls STRUCTURE automatically for a set of predefined jobs. Jobs are distributed among all available computing units (CPUs or cores) to make the best use of multi-core computers when analysing large data sets in STRUCTURE. Distribution of jobs to multiple CPUs relies on the mclapply function from the R parallel-package, which might not be fully functional on Windows. It should not be used in GUI or embedded environments, because several processes would then share the same GUI, which will likely cause chaos (and possibly crashes). See mclapply in parallel-package for details.

Two inputs are given to the parallel_structure function:

1- The function arguments in R

2- The job specifications written in a "joblist" file; see the example joblist structure_jobs.

Usage

parallel_structure(joblist = NULL, n_cpu = NULL, structure_path = Mac_path, infile = NULL, outpath = NULL,
  numinds = NULL, numloci = NULL, plot_output = 1, label = 1, popdata = 1, popflag = 0, locdata = 0,
  phenotypes = 0, markernames = 0, mapdist = 0, onerowperind = 0, phaseinfo = 0, recessivealleles = 0,
  phased = 0, extracol = 0, missing = -9, ploidy = 2, noadmix = 0, linkage = 0, usepopinfo = 0, locprior = 0,
  inferalpha = 1, alpha = 1, popalphas = 0, unifprioralpha = 1, alphamax = 10, alphapropsd = 0.025, freqscorr = 1,
  onefst = 0, fpriormean = 0.01, fpriorsd = 0.05, inferlambda = 0, lambda = 1, computeprob = 1,
  pfromflagonly = 0, ancestdist = 0, startatpopinfo = 0, metrofreq = 10, updatefreq = 1, printqhat = 0, revert_convert = 0,
  randomize = 1)

Arguments

joblist

Name of the file where the list of jobs is stored; see example data: joblist1

n_cpu

Number of CPU cores to be used

structure_path

Location of the command-line STRUCTURE executable, for example "/Applications/Structure.app/Contents/Resources/Java/bin/" on macOS.

infile

Location of the input data file

outpath

Location of the folder where the output files are written

numinds

Number of individuals in data file

numloci

Number of loci in data file

plot_output

If "1", each STRUCTURE job generates a graph in PDF format in "outpath" (printqhat must be 1). If "0", no graphic output is produced.

label

Input file contains labels (names) for each individual. 1 = Yes; 0 = No

popdata

Input file contains a user-defined population-of-origin for each individual. 1 = Yes; 0 = No

popflag

Input file contains an indicator variable which says whether to use popinfo when USEPOPINFO==1. 1 = Yes; 0 = No

phenotypes

Input file contains a column of phenotype information. 1 = Yes; 0 = No

markernames

Input file contains a row of marker names. 1 = Yes; 0 = No.

mapdist

The next row of the data file (or the first row if markernames==0) contains a list of map distances between neighboring loci. 1 = Yes; 0 = No.

onerowperind

The data for each individual are arranged in a single row. E.g., for diploid data, this would mean that the two alleles for each locus are in consecutive order in the same row, rather than being arranged in the same column, in two consecutive rows

phaseinfo

The row(s) of genotype data for each individual are followed by a row of information about haplotype phase.

extracol

Input file contains an extra column of data. 1 = Yes; 0 = No.

missing

Code for missing genotypes (default = -9)

ploidy

Ploidy of the organism. Default is 2 (diploid).

usepopinfo

Use prior population information to assist clustering. 1 = Yes; 0 = No.

revert_convert

If usepopinfo=1, will convert population IDs back into the original data file IDs. 1 = Yes; 0 = No.

printqhat

The point estimate for Q is not only printed into the main results file, but also into a separate file with suffix "q". 1 = Yes; 0 = No. This file is required in order to run the companion program STRAT and to automatically generate graphic output (see plot_output).

locdata

1 = Yes; 0 = No.

recessivealleles

1 = Yes; 0 = No.

phased

1 = Yes; 0 = No.

noadmix

1 = Yes; 0 = No.

linkage

1 = Yes; 0 = No.

locprior

1 = Yes; 0 = No.

inferalpha

1 = Yes; 0 = No.

alpha

Value of alpha (default = 1.0)

popalphas

Infer a separate alpha for each population. 1 = Yes; 0 = No.

unifprioralpha

1 = Yes; 0 = No.

alphamax

Maximum value of alpha (default = 10.0)

alphapropsd
freqscorr
onefst
fpriormean
fpriorsd
inferlambda

1 = Yes; 0 = No.

lambda
computeprob
pfromflagonly

1 = Yes; 0 = No. Makes it possible to update the allele frequencies, P, using only a prespecified subset of the individuals. To use this, include a POPFLAG column, and set POPFLAG=1 for individuals who should be used to update P, and POPFLAG=0 for individuals who should not be used to update P.

ancestdist
startatpopinfo
randomize

1 = Yes; 0 = No. Use a different random number seed for each run (this is taken from the system clock).

metrofreq
updatefreq
...

See http://pritch.bsd.uchicago.edu/software/readme_2_1/node33.html for the complete list of arguments when running STRUCTURE from the command line.
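
The plot_output entry states that printqhat must be 1 for graphs to be produced, while the defaults shown in Usage are plot_output = 1 and printqhat = 0. The sketch below shows one way to check that combination before launching jobs. The helper check_plot_settings is hypothetical and is not part of ParallelStructure.

```r
# Hypothetical helper (an assumption for illustration, not a package function).
# It mirrors the documented rule: plot_output = 1 needs printqhat = 1.
check_plot_settings <- function(plot_output = 1, printqhat = 0) {
  if (plot_output == 1 && printqhat != 1) {
    return("plot_output = 1 requires printqhat = 1")
  }
  "ok"
}

# With the defaults listed in Usage, the check reports the mismatch.
print(check_plot_settings())

# Setting printqhat = 1, as the Examples do, satisfies the rule.
print(check_plot_settings(plot_output = 1, printqhat = 1))
```

This is why the calls in the Examples section pass printqhat=1 explicitly.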

Details

The function uses the parallel-package to run serial STRUCTURE jobs in parallel. The parallel-package has been distributed with R since version 2.14.0.
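
As a rough illustration of this mechanism, the sketch below distributes a few placeholder jobs with mclapply. It is not the package's internal code: the jobs table and the run_one_job function are hypothetical stand-ins, and no STRUCTURE process is launched.

```r
library(parallel)

# Hypothetical job table: one row per job (placeholder values only).
jobs <- data.frame(id = c("job1", "job2", "job3", "job4"),
                   k  = c(1, 2, 3, 4),
                   stringsAsFactors = FALSE)

# Stand-in for launching one serial STRUCTURE run; here it only builds a label.
run_one_job <- function(i) {
  paste0(jobs$id[i], ": K=", jobs$k[i])
}

# detectCores() may return NA, so fall back to a single core.
available <- detectCores()
if (is.na(available)) available <- 1L

# Forking is not available on Windows, where mclapply() needs mc.cores = 1.
n_cpu <- if (.Platform$OS.type == "windows") 1L else min(2L, available)

# Each job index is handed to a worker; results come back in input order.
results <- mclapply(seq_len(nrow(jobs)), run_one_job, mc.cores = n_cpu)
print(unlist(results))
```

Because mclapply returns results in the order of its input, the output can be matched back to the job list regardless of which core ran each job.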

Note

See the example data file and joblist file:

structure_data

structure_jobs

Author(s)

Francois Besnier, Kevin Glover

References

Besnier F, Glover KA (2013) ParallelStructure: A R Package to Distribute Parallel Runs of the Population Genetics Program STRUCTURE on Multi-Core Computers. PLoS ONE 8(7): e70651. doi:10.1371/journal.pone.0070651

parallel-package

STRUCTURE website: http://pritch.bsd.uchicago.edu/software.html

Examples

## Run according to your platform (windows/Unix) and version of STRUCTURE

##FOR UNIX:

#	system('mkdir structure_results') # create a directory to store results
#	data(structure_data)			  # call data file 
#	data(structure_jobs)			  # call joblist file

## write input files in current working directory ####
#	write(t(structure_jobs),ncolumns=ncol(structure_jobs),file='joblist1.txt')
#	write(t(structure_data),ncolumns=ncol(structure_data),file='example_data.txt')


### call parallel_structure for the given example
### output files are stored in "structure_results/"


## If you have added the path of the STRUCTURE executable to /usr/local/bin:

#	parallel_structure(structure_path=NULL,joblist='joblist1.txt',n_cpu=4,infile='example_data.txt',outpath='structure_results/',numinds=987,numloci=9,printqhat=1)

##OTHERWISE

#	Mac_path="/Applications/Structure.app/Contents/Resources/Java/bin/"

#	parallel_structure(structure_path=Mac_path,joblist='joblist1.txt',n_cpu=4,infile='example_data.txt',outpath='structure_results/',numinds=987,numloci=9,printqhat=1)


##FOR WINDOWS:


#	shell('mkdir structure_results') # create a directory to store results
#	data(structure_data)			  # call data file 
#	data(structure_jobs)			  # call joblist file

## write input files in current working directory ####
#	write(t(structure_jobs),ncolumns=ncol(structure_jobs),file='joblist1.txt')
#	write(t(structure_data),ncolumns=ncol(structure_data),file='example_data.txt')



### call parallel_structure for the given example
### output files are stored in "structure_results/"

## If you have added the path of the STRUCTURE executable to your PATH environment variable:

#	parallel_structure(structure_path=NULL,joblist='joblist1.txt',n_cpu=4,infile='example_data.txt',outpath='structure_results/',numinds=987,numloci=9,printqhat=1)

##OTHERWISE

#	Windows_path="c:/Program Files (x86)/Structure2.3.4/bin/"

# (Check that the "Windows_path" variable is correct for your version of STRUCTURE):
#	parallel_structure(structure_path=Windows_path,joblist='joblist1.txt',n_cpu=4,infile='example_data.txt',outpath='structure_results/',numinds=987,numloci=9,printqhat=1)
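
The setup steps in the examples above (creating a results directory and writing a table with write(t(...))) can be sketched in a platform-independent way. The block below is only an illustration under stated assumptions: toy_jobs holds placeholder values and does not claim to reproduce the real joblist format, and the files go to a temporary directory rather than the working directory.

```r
# Toy stand-in for a joblist table; the values are placeholders only.
toy_jobs <- matrix(c("T1", "1,2", "1", "100", "1000",
                     "T2", "1,2", "2", "100", "1000"),
                   nrow = 2, byrow = TRUE)

# dir.create() works on both Unix and Windows, so no mkdir call is needed.
out_dir <- file.path(tempdir(), "structure_results")
dir.create(out_dir, showWarnings = FALSE)

# write() emits values in column-major order, so the matrix is transposed
# first to keep one original row per output line.
out_file <- file.path(out_dir, "joblist_toy.txt")
write(t(toy_jobs), ncolumns = ncol(toy_jobs), file = out_file)

lines_written <- readLines(out_file)
print(lines_written)
```

Without the transpose, write() would walk down each column first and the rows of the table would be interleaved in the output file.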

ParallelStructure documentation built on May 2, 2019, 5:16 p.m.