ReadMarker: Read marker data.


Description

A function for reading in marker data. Two types of data can be read.

Usage

ReadMarker(filename = NULL, type = "text", missing = NULL, AA = NULL,
  AB = NULL, BB = NULL, availmemGb = 16, quiet = TRUE)

Arguments

filename

contains the name of the marker file. The file name needs to be in quotes.

type

specifies the type of file. Choices are 'text' (the default) and 'PLINK'.

missing

the number or character for a missing genotype in the text file. There is no need to specify this for a PLINK ped file. Missing allele values in a PLINK file must be coded as '0' or '-'.

AA

the character or number corresponding to the 'AA' snp genotype in the marker genotype file. This need only be specified if the file type is 'text'. If a character then it must be in quotes.

AB

the character or number corresponding to the 'AB' snp genotype in the marker genotype file. This need only be specified if the file type is 'text'. This can be left unspecified if there are no heterozygous genotypes (i.e. the individuals are inbred). Only a single heterozygous genotype is allowed ('Eagle' does not distinguish between 'AB' and 'BA'). If specified and a character, it must be in quotes.

BB

the character or number corresponding to the 'BB' snp genotype in the marker genotype file. This need only be specified if the file type is 'text'. If a character, then it must be in quotes.

availmemGb

a numeric value. It specifies the amount of available memory (in Gigabytes). This should be set to be as large as possible for best performance.

quiet

a logical value. If set to FALSE, additional runtime output is printed.

Details

ReadMarker can handle two different types of marker data; namely, genotype data in a plain text file, and PLINK ped files.

Reading in a plain text file containing the marker genotypes

To load a text file that contains snp genotypes, run ReadMarker with filename set to the name of the file, and AA, AB, BB set to the corresponding genotype values. The genotype values in the text file can be numeric, character, or a mix of both.

We make the following assumptions: the file is space separated, the rows are the individuals, and the columns are the snp loci.

For example, suppose we have a space separated text file with marker genotype data collected from five snp loci on three individuals, where the snp genotype AA has been coded 0, the snp genotype AB has been coded 1, the snp genotype BB has been coded 2, and missing genotypes are coded as 99:

0 1 2 0 2
1 1 0 2 0
2 2 1 1 99

The file is called geno.txt and is located in the directory /my/dir/.

To load these data, we would use the command

geno_obj <- ReadMarker(filename='/my/dir/geno.txt', AA=0, AB=1, BB=2, type='text', missing=99)

where the results from running the function are placed in geno_obj.
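Before calling ReadMarker, it can help to confirm that the file contains only the declared genotype codes. The following is a minimal sketch in base R, not part of 'Eagle': the helper check_genotype_codes is a hypothetical name, and a temporary file stands in for /my/dir/geno.txt. The call to ReadMarker is only attempted if 'Eagle' is installed.

```r
# Write the three-individual, five-locus example to a temporary file
# (a stand-in for /my/dir/geno.txt).
geno_file <- tempfile(fileext = ".txt")
writeLines(c("0 1 2 0 2",
             "1 1 0 2 0",
             "2 2 1 1 99"), geno_file)

# Hypothetical helper (not part of 'Eagle'): read every field as text and
# report the dimensions plus any codes outside the declared set.
check_genotype_codes <- function(path, AA, AB = NULL, BB, missing = NULL) {
  fields <- strsplit(trimws(readLines(path)), " +")
  allowed <- as.character(c(AA, AB, BB, missing))
  list(n_individuals = length(fields),
       n_loci = unique(lengths(fields)),
       unexpected = setdiff(unique(unlist(fields)), allowed))
}

chk <- check_genotype_codes(geno_file, AA = 0, AB = 1, BB = 2, missing = 99)
print(chk)

# Only call ReadMarker when 'Eagle' is installed and the file looks clean.
if (requireNamespace("Eagle", quietly = TRUE) && length(chk$unexpected) == 0) {
  geno_obj <- Eagle::ReadMarker(filename = geno_file, type = "text",
                                AA = 0, AB = 1, BB = 2, missing = 99)
}
```

If n_loci has more than one value, the rows of the file do not all have the same number of fields.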

As another example, suppose we have a space separated text file with marker genotype data collected from five snp loci on three individuals, where the snp genotype AA has been coded a/a, the snp genotype AB has been coded a/b, the snp genotype BB has been coded b/b, and missing genotypes are coded as NA:

a/a a/b b/b a/a b/b
a/b a/b a/a b/b a/a
b/b b/b a/b a/b NA

The file is called geno.txt and is located in the same directory from which R is being run (i.e. the working directory).

To load these data, we would use the command

geno_obj <- ReadMarker(filename='geno.txt', AA='a/a', AB='a/b', BB='b/b',
                       type='text', missing='NA')

where the results from running the function are placed in geno_obj.

Reading in a PLINK ped file

PLINK is a well known toolkit for the analysis of genome-wide association data. See https://www.cog-genomics.org/plink2 for details.

Full details of PLINK ped files can be found at https://www.cog-genomics.org/plink/1.9/formats#ped. Briefly, the ped file is a space delimited file (tabs are not allowed). The first six columns are mandatory:

Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Phenotype

Here, these columns can be any values since ReadMarker ignores these columns.

Genotypes (column 7 onwards) can be any character (e.g. 1,2,3,4 or A,C,G,T or anything else) except 0 which is, by default, the missing genotype character. All markers should be biallelic, and every snp must have two alleles specified. Missing alleles (i.e. 0 or -) are allowed. No column headings should be given.

As an example, suppose we have data on three individuals genotyped for four snp loci:

FAM001 101 0 0 1 0 A G C C C G A A
FAM001 201 0 0 2 0 A A C T G G T A
FAM001 300 101 201 2 0 G A T T C G A T

Then to load these data, we would use the command

geno_obj <- ReadMarker(filename='PLINK.ped', type='PLINK')

where geno_obj is used by AM, and the file PLINK.ped is located in the working directory (i.e. the directory from which R is being run).
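The format rules above (space delimited, six leading columns, two alleles per snp, biallelic markers) can be checked before the file is passed to ReadMarker. The following is a minimal sketch in base R under those stated rules; check_ped is a hypothetical helper, not part of 'Eagle' or PLINK, and a temporary file stands in for PLINK.ped.

```r
# Write the three-individual, four-locus example to a temporary ped file.
ped_file <- tempfile(fileext = ".ped")
writeLines(c("FAM001 101 0 0 1 0 A G C C C G A A",
             "FAM001 201 0 0 2 0 A A C T G G T A",
             "FAM001 300 101 201 2 0 G A T T C G A T"), ped_file)

# Hypothetical helper (not part of 'Eagle' or PLINK).
check_ped <- function(path) {
  lines <- readLines(path)
  # Tabs are not allowed in the ped file.
  stopifnot(!any(grepl("\t", lines, fixed = TRUE)))
  fields <- strsplit(trimws(lines), " +")
  n_fields <- unique(lengths(fields))
  # Every row needs the same width: 6 leading columns plus 2 alleles per snp.
  stopifnot(length(n_fields) == 1, (n_fields - 6) %% 2 == 0)
  n_loci <- (n_fields - 6) / 2
  alleles <- do.call(rbind, lapply(fields, function(f) f[-(1:6)]))
  # Count distinct non-missing alleles at each locus ('0' and '-' are missing).
  n_alleles <- vapply(seq_len(n_loci), function(j) {
    a <- as.vector(alleles[, c(2 * j - 1, 2 * j)])
    length(setdiff(unique(a), c("0", "-")))
  }, integer(1))
  list(n_individuals = length(lines),
       n_loci = n_loci,
       all_biallelic = all(n_alleles <= 2))
}

ped_chk <- check_ped(ped_file)
print(ped_chk)
```

The helper stops with an error if it finds tabs or rows of unequal width, so a returned list means the basic layout is consistent.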

Reading in other formats

Having first installed the stand-alone PLINK software, it is possible to convert other file formats into PLINK ped files. See https://www.cog-genomics.org/plink/1.9/formats for details.

For example, to convert a vcf file into a PLINK ped file, at the unix prompt, use the PLINK command

plink --vcf filename.vcf --recode --out newfilename

and to convert a binary ped file (bed) into a ped file, use the PLINK command

plink --bfile filename --recode --tab --out newfilename
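The vcf conversion can also be launched from within R. The following is a minimal sketch, assuming the stand-alone PLINK executable is installed and found on the search path under the name plink; vcf_to_ped_args is a hypothetical helper, and filename.vcf and newfilename are the placeholder names used above. Nothing is run unless both the executable and the input file are found.

```r
# Hypothetical helper: build the arguments for the vcf-to-ped conversion.
vcf_to_ped_args <- function(vcf, out) {
  c("--vcf", vcf, "--recode", "--out", out)
}

plink_args <- vcf_to_ped_args("filename.vcf", "newfilename")

# Assumption: the executable is called 'plink' and is on the search path.
plink_bin <- Sys.which("plink")

if (nzchar(plink_bin) && file.exists("filename.vcf")) {
  status <- system2(plink_bin, plink_args)
  if (status != 0) warning("PLINK exited with status ", status)
} else {
  message("PLINK or filename.vcf not found; conversion skipped.")
}
```

Passing the arguments as a character vector to system2() avoids pasting together a single shell string.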

Value

To allow AM to handle data larger than the memory capacity of a machine, ReadMarker doesn't load the marker data into memory. Instead, it creates a reformatted file of the marker data and a reformatted file of its transpose. The object returned by ReadMarker is a list with three elements: asciifileM, the full file name (name and path) of the reformatted file for the marker data; asciifileMt, the full file name of the reformatted file for the transpose of the marker data; and dim_of_ascii_M, a two-element vector whose first element is the number of individuals and whose second element is the number of marker loci.
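As an illustration of how the returned list might be inspected, the following sketch uses a stand-in list with the element names described above. The file paths and dimensions are made up for illustration and are not output from a real run of ReadMarker.

```r
# Stand-in for the list returned by ReadMarker (illustrative values only).
geno_obj <- list(asciifileM     = file.path(tempdir(), "M.ascii"),
                 asciifileMt    = file.path(tempdir(), "Mt.ascii"),
                 dim_of_ascii_M = c(3, 5))

# First element: number of individuals; second element: number of marker loci.
n_individuals <- geno_obj$dim_of_ascii_M[1]
n_loci        <- geno_obj$dim_of_ascii_M[2]
cat("individuals:", n_individuals, " marker loci:", n_loci, "\n")

# The other two elements are file names, so they can be checked on disk.
# For this stand-in the files were never created.
print(file.exists(geno_obj$asciifileM, geno_obj$asciifileMt))
```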

Examples

  #--------------------------------
  #  Example 1
  #-------------------------------
  #
  # Read in the genotype data contained in the text file geno.txt
  #
  # The function system.file() gives the full file name (name + full path).
  complete.name <- system.file('extdata', 'geno.txt', package='Eagle')
  # 
  # The full path and name of the file is
  print(complete.name)
  
  # Here, 0 values are being treated as genotype AA,
  # 1 values are being treated as genotype AB, 
  # and 2 values are being treated as genotype BB. 
  # 4 gigabytes of memory has been specified. 
  # The file is space separated with the rows the individuals
  # and the columns the snp loci.
  geno_obj <- ReadMarker(filename=complete.name, type='text', AA=0, AB=1, BB=2, availmemGb=4) 
   
  # view list contents of geno_obj
  print(geno_obj)

  #--------------------------------
  #  Example 2
  #-------------------------------
  #
  # Read in the allelic data contained in the PLINK ped file geno.ped
  #
  # The function system.file() gives the full file name (name + full path).
  complete.name <- system.file('extdata', 'geno.ped', package='Eagle')

  # 
  # The full path and name of the file is
  print(complete.name)
  
  # Here,  the first 6 columns are being ignored and the allelic 
  # information in columns 7 -  10002 is being converted into a reformatted file. 
  # 4 gigabytes of memory has been specified. 
  # The file is space separated with the rows the individuals
  # and the columns the snp loci.
  geno_obj <- ReadMarker(filename=complete.name, type='PLINK', availmemGb=4) 
   
  # view list contents of geno_obj
  print(geno_obj)

Eagle documentation built on May 2, 2019, 5:31 p.m.