read.data: Load data stored in the format of a structure file or similar...


read.data    R Documentation

Load data stored in the format of a structure file or similar data table.

Description

Load data stored in the format of a structure file or similar data table.

Usage

read.data(
  file,
  mainparams = NULL,
  extracol.names = NULL,
  precol.headers = 1,
  nprecol,
  markername.dup = 0,
  NUMLOCI.autoAccept = TRUE,
  EXTRACOL = 0,
  INDLABEL = 1,
  LOCDATA = 0,
  MAPDISTANCE = 0,
  MARKERNAME = 1,
  MISSINGVAL,
  NUMINDS = 0,
  NUMLOCI = 0,
  ONEROW = 0,
  PHASED = 0,
  PHENOTYPE = 0,
  PLOIDY = 2,
  POPFLAG = 0,
  POPID = 1,
  RECESSIVEALLELE = 0,
  marker.info.file = NULL,
  sourceAbsent = FALSE
)

Arguments

file

Character string. The name of the data file to read in.

mainparams

Character string. The name of an associated ‘mainparams’ file to read in. Optional as mainparams info can be entered into the function call directly (see below).

extracol.names

Character string or vector. Names of extra (i.e. none of those specifically named here) non-marker columns in the data file. Optional; used when the data file does not contain pre-marker EXTRACOL headers and you wish to add them on input. Default is NULL.

precol.headers

Numeric. Presence (1) or absence (0) of headers for pre-marker columns in the header row of the data file. Default is 1. Set to 0 for ‘plink’ structure files.

nprecol

Numeric. Number of pre-marker columns in the data file. No default and must always be entered. Set to 2 for standard ‘plink’ structure files.

markername.dup

Numeric. Whether each marker name appears twice in the data file header row. Default is 0 (FALSE, suitable for ‘plink’ structure files), alternative is 1 (TRUE).

NUMLOCI.autoAccept

Logical. If the number of loci (NUMLOCI) is not entered, whether to accept the internally calculated NUMLOCI automatically rather than requiring manual acceptance/rejection. Default is TRUE, which accepts the calculated value with a warning so as not to interrupt the data analysis pipeline. If set to FALSE, the calculated NUMLOCI must be accepted with y or rejected with n; rejection stops the function with a blank error message and returns no output.

EXTRACOL

Numeric. Number of extra (i.e. none of those specifically named here) non-marker columns in the data file. Not needed if this info is uploaded in a ‘mainparams’ file. For compatibility with ‘structure’ mainparams info. Default is 0 (suitable for ‘plink’ structure files).

INDLABEL

Numeric. Presence (1) or absence (0) of a column of individual references in the data file. Not needed if the info is uploaded in a ‘mainparams’ file. Default is 1, and this column must always be present. See ‘Details’. The first two columns of ‘plink’ structure files can be treated as INDLABEL and POPID columns.

LOCDATA

Numeric. Presence (1) or absence (0) of a LOCDATA column in the data file. For compatibility with ‘structure’. Not needed if the info is uploaded in a ‘mainparams’ file. Default is 0.

MAPDISTANCE

Numeric. Presence (1) or absence (0) of a MAPDISTANCE row in the data file, which will be removed if present. Not needed if the info is uploaded in a ‘mainparams’ file. Set to 1 for ‘plink’ structure files that include this row below the marker names. Default is 0.

MARKERNAME

Numeric. Presence (1) or absence (0) of a header row containing marker names in the data file. Not needed if the info is uploaded in a ‘mainparams’ file. Default is 1 and is suitable for ‘plink’ structure files.

MISSINGVAL

The identifier for missing values. No default. Not needed if the info is uploaded in a ‘mainparams’ file. Currently, 0 is not allowed and hence ‘plink’ structure files need to be modified before reading in, preferably replacing 0 with NA (no quotation marks needed).

NUMINDS

Numeric. Number of individuals in the data set. Required to check that the marker data are of the appropriate dimensions. Default set to 0.

NUMLOCI

Numeric. Number of loci in the data set. Not needed if the info is uploaded in a ‘mainparams’ file. If the default value (0) is used, the number of loci is calculated internally (using NUMINDS, PLOIDY and the dimensions of the marker data), and the user has the option, as a means of error-checking, to manually accept or reject the calculated value via the NUMLOCI.autoAccept argument above. If the input file is a structure file containing phasing rows (PHASED=1), the correct NUMLOCI must be entered.

ONEROW

Numeric. Whether (1) or not (0) there is a single data row per INDLABEL. If the value is 1, the number of columns per marker = PLOIDY, which is suitable for ‘plink’ structure files. If ONEROW = 0, there is one column per marker and the number of rows per INDLABEL = PLOIDY. Not needed if the info is uploaded in a ‘mainparams’ file. Default is 0.

PHASED

Numeric. Presence (1) or absence (0) of phasing data rows in a ‘structure’ input file. These rows will be removed if present. Not needed if the info is uploaded in a ‘mainparams’ file. Default is 0.

PHENOTYPE

Numeric. Presence (1) or absence (0) of a PHENOTYPE column in the data file. For compatibility with ‘structure’. Not needed if the info is uploaded in a ‘mainparams’ file. Default is 0.

PLOIDY

Numeric. Maximum ploidy among the markers in the data file. Not needed if the info is uploaded in a ‘mainparams’ file. Default is 2.

POPFLAG

Numeric. 0 or 1, for compatibility with ‘structure’. Default 0.

POPID

Numeric. Presence (1) or absence (0) of a column of population identifiers in the data file. Not needed if the info is uploaded in a ‘mainparams’ file. Default is 1. The first two columns of ‘plink’ structure files can be treated as INDLABEL and POPID columns respectively.

RECESSIVEALLELE

Numeric. 0 or 1, for compatibility with ‘structure’. Default 0.

marker.info.file

Character string. The name of an optional file of additional marker information to be read in, with one row per marker and any number of columns. This can be joined to the primary output table (and to the locus table if return.locus.table is set to TRUE) when using the data.prep function. The file should contain parental allele frequencies when these are not to be calculated from the data. In that case, it must contain columns entitled refAllele, alternateAllele, S0.prop_r and S1.prop_r, as well as the obligatory locus column. The first two columns identify an arbitrarily chosen reference allele and the alternate allele, while S0.prop_r and S1.prop_r indicate the allele frequency of the chosen reference allele in each source (parental reference) population. Default is NULL.

sourceAbsent

Logical. Whether source (parental reference) samples are absent from the dataset. If TRUE, the parental allele frequencies of a reference allele for each locus must be present in the marker.info.file in columns named S0.prop_r and S1.prop_r. Default is FALSE.

Details

read.data is designed to be compatible with data input files for the software structure, including those produced by plink. However, ‘plink’ structure files use zero to denote missing data, which is not currently allowed in the data.table package, although work to remove this limitation is under way. For the time being, zeroes should be replaced before calling read.data, preferably with NA (no quotation marks needed).
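The recoding step described above could be sketched in base R as follows. This is a minimal illustration, not part of gghybrid: the helper name recode_zero_missing and its arguments n_header_rows and nprecol are hypothetical, and the sketch assumes a whitespace-delimited file in which every "0" to the right of the pre-marker columns is a missing genotype.

```r
# Hypothetical helper (not part of gghybrid): recode missing genotypes from
# "0" to "NA" in a whitespace-delimited structure-style file. Header rows
# (marker names, MAPDISTANCE) and pre-marker columns are left untouched.
recode_zero_missing <- function(infile, outfile, n_header_rows = 1, nprecol = 2) {
  lines <- readLines(infile)
  is_data <- seq_along(lines) > n_header_rows
  lines[is_data] <- vapply(lines[is_data], function(line) {
    fields <- strsplit(trimws(line), "[[:space:]]+")[[1]]
    geno <- seq_along(fields) > nprecol        # genotype fields only
    fields[geno & fields == "0"] <- "NA"
    paste(fields, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
  writeLines(lines, outfile)
  invisible(outfile)
}

# Usage: the plink-style file from the Examples, but with missing data still
# coded as 0 and a MAPDISTANCE row below the marker names (two header rows).
infile <- tempfile(fileext = ".data")
outfile <- tempfile(fileext = ".data")
cat("chr1:001 chr1:002", "-3 3 2 4", "ind1 pop1 A A A B",
    "ind2 pop1 B B B B", "ind3 pop2 0 B A A",
    file = infile, sep = "\n")
recode_zero_missing(infile, outfile, n_header_rows = 2, nprecol = 2)
recoded <- readLines(outfile)
```

The recoded file can then be passed to read.data with MISSINGVAL=NA. If alleles themselves are coded numerically and 0 could be a valid allele, this simple token replacement would not be appropriate.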

The simplest file format to read in is a rectangular data table with a complete header row, one column per marker and one row per allele copy (equivalent to ONEROW = 0). With this format the only required fields other than those with default settings are file, nprecol, NUMINDS and MISSINGVAL. Furthermore, allele identities are read in as character strings and therefore each allele can be given any name or number, as long as there are only two unique alleles per locus. As with structure, all non-marker columns should be to the left of the marker columns in the data file.
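The layout just described implies some simple arithmetic that can be checked before calling read.data. The base R sketch below is illustrative only and is not taken from the read.data source: it assumes a table with one row per allele copy, so that the number of rows should equal NUMINDS x PLOIDY and the number of marker columns should equal the total number of columns minus nprecol.

```r
# The 'simplest format' table from the Examples: full header row, one column
# per marker, one row per allele copy (ONEROW = 0). Alleles are read as
# character strings; "NA" is treated as missing.
tab <- read.table(
  text = "INDLABEL POPID chr1:001 chr1:002
ind1 pop1 A A
ind1 pop1 A B
ind2 pop1 B B
ind2 pop1 B B
ind3 pop2 NA A
ind3 pop2 B A",
  header = TRUE, colClasses = "character", check.names = FALSE
)

nprecol <- 2
PLOIDY  <- 2
NUMINDS <- length(unique(tab$INDLABEL))
NUMLOCI <- ncol(tab) - nprecol          # one column per marker

# Rows should equal individuals x allele copies
stopifnot(nrow(tab) == NUMINDS * PLOIDY)

# No more than two unique (non-missing) alleles per locus
n_alleles <- vapply(
  tab[, -seq_len(nprecol), drop = FALSE],
  function(x) length(unique(x[!is.na(x)])),
  integer(1)
)
stopifnot(all(n_alleles <= 2))
```

A check of this kind is a cheap way to catch a wrong NUMINDS or nprecol before the file is loaded.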

The data input file is required to have at least one column prior to the marker data columns. Typically there are two columns. The first (required) is referred to as ‘INDLABEL’ (but can be given a different column header) and represents the finest resolution of identification that may be required for the data set (usually a unique individual reference, but it could, for example, be a population-level reference for poolseq data). The second is referred to as ‘POPID’ and indicates the population from which the individual was sampled. The main purpose of the ‘POPID’ column is to declare, downstream, which POPID values represent the parental reference samples, and it can also be useful for plotting hybrid index estimates. If the ‘POPID’ column is absent or if it is desired to use another column to identify parental reference samples, an alternative column must be identified downstream in the data.prep function.

Any ploidy is allowed, but the declared PLOIDY must be the same for all markers. So for example if data are haplodiploid, haploid markers should either be present as diploid homozygotes, or the second allele declared as missing data.
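The two options for haploid markers can be illustrated with a small base R sketch. All object and column names here (long, hap_calls, dip1 and so on) are hypothetical, and the sketch assumes ONEROW = 0 layout with PLOIDY = 2, i.e. two rows per individual.

```r
PLOIDY <- 2

# Hypothetical long-format table: two rows (allele copies) per individual,
# with one diploid marker already in place.
long <- data.frame(
  INDLABEL = rep(c("ind1", "ind2"), each = PLOIDY),
  POPID    = "pop1",
  dip1     = c("A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# A haploid marker has a single observed allele per individual.
hap_calls <- c(ind1 = "A", ind2 = "B")

# Option 1: present the haploid marker as a diploid homozygote
# (the single call is repeated on both rows of each individual).
long$hap_as_homozygote <- unname(hap_calls[long$INDLABEL])

# Option 2: keep the call on the first row and declare the second
# allele copy as missing data.
first_copy <- !duplicated(long$INDLABEL)
long$hap_second_missing <- ifelse(first_copy, long$hap_as_homozygote, NA_character_)
```

Either column fits a table with a declared PLOIDY of 2. Which option is preferable depends on the downstream analysis and is not something this sketch decides.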

If an associated ‘mainparams’ file is read in, it should contain two columns and no header row, with the first column holding the field names and the second the field values. The mandatory fields are the following, each of which must be named using one of the listed synonyms: PLOIDY, MISSINGVAL/MISSING, ONEROW/ONEROWPERIND, INDLABEL, POPID, EXTRACOL/EXTRACOLS, MARKERNAME/MARKERNAMES. If RECESSIVEALLELE or MAPDISTANCE are present, their names can be pluralized.
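As a rough illustration, a file of this shape could be written and sanity-checked in base R as below. The whitespace delimiter, the file name and the MISSINGVAL value of -9 are assumptions made for the example (the text above specifies only two columns, no header row and the field names), and the check is a sketch rather than the validation read.data itself performs.

```r
# Write a hypothetical two-column mainparams file with no header row.
# Assumption: whitespace-delimited; MISSINGVAL = -9 is an arbitrary non-zero code.
mainparams_file <- tempfile(fileext = ".txt")
cat("PLOIDY 2", "MISSINGVAL -9", "ONEROW 0", "INDLABEL 1",
    "POPID 1", "EXTRACOL 0", "MARKERNAME 1",
    file = mainparams_file, sep = "\n")

# Read it back as field/value pairs
mp <- read.table(mainparams_file, header = FALSE,
                 col.names = c("field", "value"),
                 colClasses = "character")

# Each mandatory field may appear under any one of its synonyms
synonyms <- list(
  PLOIDY     = "PLOIDY",
  MISSINGVAL = c("MISSINGVAL", "MISSING"),
  ONEROW     = c("ONEROW", "ONEROWPERIND"),
  INDLABEL   = "INDLABEL",
  POPID      = "POPID",
  EXTRACOL   = c("EXTRACOL", "EXTRACOLS"),
  MARKERNAME = c("MARKERNAME", "MARKERNAMES")
)
present <- vapply(synonyms, function(s) any(s %in% mp$field), logical(1))
missing_fields <- names(present)[!present]
```

An empty missing_fields vector indicates that every mandatory field is present under an accepted name. The file name would then be passed to read.data via the mainparams argument.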

read.data uses the fread function from data.table to rapidly read in the data file. The loaded data file is therefore of class data.table and data.frame. A data.table can be treated the same as a standard data.frame for those not familiar with the package.

Value

read.data returns a list with one component containing the loaded data in ONEROW = 0 format, in the form of a data.table and data.frame, and other components used in downstream functions or otherwise potentially useful to the user.

The list contains the following components:

mainparams

A data.table and data.frame with the inputted mainparams information (or default values if no information provided).

nprecols

Numeric. The number of non-marker columns to the left of the marker columns in the imported data set.

precols

Character vector. The names of non-marker columns to the left of the marker columns in the imported data set.

data

A data.table and data.frame. The imported data.

alleles

A data.table and data.frame. The names of the two alleles at each locus.

loci

Character vector. The names of all loci in the imported data set.

marker.info

A data.table and data.frame. The imported marker.info.file.

Author(s)

Richard Ian Bailey richardianbailey@gmail.com

Examples


## Not run: 
#First create example data files and save to the working directory#

#1. Genomic (SNP) data as a regular table with full header row#
ex <- "INDLABEL,POPID,chr1:001,chr1:002\nind1,pop1,A,A\nind1,pop1,A,B\nind2,pop1,B,B\nind2,pop1,B,B\nind3,pop2,NA,A\nind3,pop2,B,A\n"

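#Save the same table (space-delimited rather than comma-delimited) to the working directory as 'ex.data', to be read in by read.data. The 'ex' string above is for display only and is not used below. Open in e.g. notepad or notepad++ to view in table format#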
cat("INDLABEL POPID chr1:001 chr1:002","ind1 pop1 A A","ind1 pop1 A B","ind2 pop1 B B","ind2 pop1 B B","ind3 pop2 NA A","ind3 pop2 B A", file = "ex.data", sep = "\n")

#2. Now create and save a file, ex2.data, in plink structure format (but with missing data recoded from 0 to NA), with a MAPDISTANCE row present below the marker names#
cat("chr1:001 chr1:002","-3 3 2 4","ind1 pop1 A A A B","ind2 pop1 B B B B","ind3 pop2 NA B A A", file = "ex2.data", sep = "\n")

#Load ex.data, in this case without specifying the number of loci (useful if this is not known in advance). These are the minimum arguments required for a file in this format#
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,MISSINGVAL=NA)#NUMLOCI is not specified so will be calculated, with a warning#

#The same but specifying the number of loci. The resulting object is identical#
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,NUMLOCI=2,MISSINGVAL=NA)#No warning this time as NUMLOCI is specified#

#The minimum arguments required to read in the PLINK structure format file with missing data recoded as 'NA' (gives a warning because NUMLOCI is not specified)#
dat2 <- read.data(file="ex2.data",MISSINGVAL=NA,NUMINDS=3,nprecol=2,ONEROW=1,MAPDISTANCE=1,precol.headers=0)

#Same, but displaying all arguments including those where the default works for this file format#
dat2 <- read.data(
  file="ex2.data",
  mainparams = NULL,              #Default#
  extracol.names = NULL,          #Default#
  precol.headers = 0,
  nprecol=2,
  markername.dup = 0,             #Default#
  NUMLOCI.autoAccept = TRUE,      #Default#
  EXTRACOL = 0,                   #Default#
  INDLABEL = 1,                   #Default#
  LOCDATA = 0,                    #Default#
  MAPDISTANCE = 1,
  MARKERNAME = 1,                 #Default#
  MISSINGVAL = NA,                #Cannot be zero for the time being#
  NUMINDS = 3,
  NUMLOCI = 2,                    #Only mandatory when PHASED=1; otherwise will be calculated internally if not known#
  ONEROW = 1,
  PHASED = 0,                     #Default#
  PHENOTYPE = 0,                  #Default#
  PLOIDY = 2,                     #Default#
  POPFLAG = 0,                    #Default#
  POPID = 1,                      #Default#
  RECESSIVEALLELE = 0,            #Default#
  marker.info.file = NULL,        #Default#
  sourceAbsent = FALSE            #Default#
)

#If NUMLOCI is not specified, the number of loci will be calculated with a warning, which will not interfere with downstream processes. 
#However, if 'NUMLOCI.autoAccept = FALSE' is set, the user is required to manually accept the calculated number of loci. This option is 
#included in case the user wishes to verify that the calculated number of loci is accurate. Stops with an error if the estimated NUMLOCI 
#is manually rejected. Example:
dat2 <- read.data(file="ex2.data",NUMLOCI.autoAccept = FALSE,MISSINGVAL=NA,NUMINDS=3,nprecol=2,ONEROW=1,MAPDISTANCE=1,precol.headers=0)

#Uploading a marker info file.

#Example: Create a marker info file indicating whether the locus is intronic or exonic, and save to the working directory#
cat("locus type","chr1:001 intronic","chr1:002 exonic", file = "ex_marker_info.data", sep = "\n")

#Read it in alongside the SNP data#
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,NUMLOCI=2,MISSINGVAL=NA,marker.info.file = "ex_marker_info.data")

#Loading a data file that does not include parental reference samples. For the situation where parental (S0 and S1) allele frequencies are 
#to be loaded, but not parental reference samples. A marker.info.file can be loaded regardless of whether S0 and S1 samples are present in 
#the dataset, but it is obligatory if they are absent. When 'sourceAbsent = TRUE', as a minimum the following columns with the exact headers 
#in the first set of quotation marks below are required in the marker.info.file (more columns are allowed).
cat("locus refAllele alternateAllele S0.prop_r S1.prop_r type","chr1:001 A B 0.1 0.9 intronic","chr1:002 B A 0.8 0.2 exonic", file = "ex_marker_info2.data", sep = "\n")

#"...prop_r" means the allele frequency of the reference allele. Choice of reference and alternate alleles is arbitrary#
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,NUMLOCI=2,MISSINGVAL=NA,marker.info.file = "ex_marker_info2.data", sourceAbsent = TRUE)

#The data.prep function will then determine which allele has higher frequency in S1, as it would if parental reference samples were included#

unlink("ex_marker_info.data")#Tidy up#
unlink("ex_marker_info2.data")#Tidy up#
unlink("ex.data")#Tidy up#
unlink("ex2.data")#Tidy up#

## End(Not run)

ribailey/gghybrid documentation built on Feb. 2, 2024, 12:53 a.m.