read.data: Load data stored in the format of a structure file or similar...
In ribailey/gghybrid: Evolutionary Analysis of Hybrids and Hybrid Zones

Description

Load data stored in the format of a structure file or similar data table.

Usage

read.data(
  file,
  mainparams = NULL,
  extracol.names = NULL,
  precol.headers = 1,
  nprecol,
  markername.dup = 0,
  NUMLOCI.autoAccept = TRUE,
  EXTRACOL = 0,
  INDLABEL = 1,
  LOCDATA = 0,
  MAPDISTANCE = 0,
  MARKERNAME = 1,
  MISSINGVAL,
  NUMINDS = 0,
  NUMLOCI = 0,
  ONEROW = 0,
  PHASED = 0,
  PHENOTYPE = 0,
  PLOIDY = 2,
  POPFLAG = 0,
  POPID = 1,
  RECESSIVEALLELE = 0,
  marker.info.file = NULL,
  sourceAbsent = FALSE
)

Arguments

`file`	Character string. The name of the data file to read in.
`mainparams`	Character string. The name of an associated ‘mainparams’ file to read in. Optional as `mainparams` info can be entered into the function call directly (see below).
`extracol.names`	Character string or vector. Names of extra (i.e. none of those specifically named here) non-marker columns in the data file. Optional; used when the data file does not contain pre-marker EXTRACOL headers and you wish to add them on input. Default is `NULL`.
`precol.headers`	Numeric. Presence (`1`) or absence (`0`) of headers for pre-marker columns in the header row of the data file. Default is `1`. Set to `0` for ‘plink’ structure files.
`nprecol`	Numeric. Number of pre-marker columns in the data file. No default and must always be entered. Set to `2` for standard ‘plink’ structure files.
`markername.dup`	Numeric. Whether each marker name appears twice in the data file header row. Default is `0` (FALSE, suitable for ‘plink’ structure files), alternative is `1` (TRUE).
`NUMLOCI.autoAccept`	Logical. If the number of loci (NUMLOCI) is not entered, whether to accept the internally calculated NUMLOCI automatically. Default is `TRUE`, which accepts the calculated value with a warning so as not to interrupt the data analysis pipeline. If set to `FALSE`, the calculated NUMLOCI must be accepted manually with `y` or rejected with `n`; rejecting it stops the function with a blank error message and returns no output.
`EXTRACOL`	Numeric. Number of extra (i.e. none of those specifically named here) non-marker columns in the data file. Not needed if this info is uploaded in a ‘mainparams’ file. For compatibility with ‘structure’ `mainparams` info. Default is `0` (suitable for ‘plink’ structure files).
`INDLABEL`	Numeric. Presence (`1`) or absence (`0`) of a column of individual references in the data file. Not needed if the info is uploaded in a ‘mainparams’ file. Default is `1`, and this column must always be present. See ‘Details’. The first two columns of ‘plink’ structure files can be treated as INDLABEL and POPID columns.
`LOCDATA`	Numeric. Presence (`1`) or absence (`0`) of a LOCDATA column in the data file. For compatibility with ‘structure’. Not needed if the info is uploaded in a ‘mainparams’ file. Default is `0`.
`MAPDISTANCE`	Numeric. Presence (`1`) or absence (`0`) of a `MAPDISTANCE` row in the data file, which will be removed if present. Not needed if the info is uploaded in a ‘mainparams’ file. Set to `1` for ‘plink’ structure files that include this row below the marker names. Default is `0`.
`MARKERNAME`	Numeric. Presence (`1`) or absence (`0`) of a header row containing marker names in the data file. Not needed if the info is uploaded in a ‘mainparams’ file. Default is `1` and is suitable for ‘plink’ structure files.
`MISSINGVAL`	The identifier for missing values. No default. Not needed if the info is uploaded in a ‘mainparams’ file. Currently, `0` is not allowed and hence ‘plink’ structure files need to be modified before reading in, preferably replacing `0` with `NA` (no quotation marks needed).
`NUMINDS`	Numeric. Number of individuals in the data set. Required to check that the marker data are of the appropriate dimensions. Default set to `0`.
`NUMLOCI`	Numeric. Number of loci in the data set. Not needed if the info is uploaded in a ‘mainparams’ file. If the default value (`0`) is used, the number of loci is calculated internally from NUMINDS, PLOIDY and the dimensions of the marker data. As a means of error-checking, the user can choose to manually accept or reject the calculated value by setting `NUMLOCI.autoAccept = FALSE` (see above). If the input file is a structure file containing phasing rows (`PHASED=1`), the correct NUMLOCI must be entered.
`ONEROW`	Numeric. Whether (`1`) or not (`0`) there is a single data row per INDLABEL. If the value is `1`, the number of columns per marker = PLOIDY, which is suitable for ‘plink’ structure files. If `ONEROW = 0`, there is one column per marker and the number of rows per INDLABEL = PLOIDY. Not needed if the info is uploaded in a ‘mainparams’ file. Default is `0`.
`PHASED`	Numeric. Presence (`1`) or absence (`0`) of phasing data rows in a ‘structure’ input file. These rows will be removed if present. Not needed if the info is uploaded in a ‘mainparams’ file. Default is `0`.
`PHENOTYPE`	Numeric. Presence (`1`) or absence (`0`) of a `PHENOTYPE` column in the data file. For compatibility with ‘structure’. Not needed if the info is uploaded in a ‘mainparams’ file. Default is `0`.
`PLOIDY`	Numeric. Maximum ploidy among the markers in the data file. Not needed if the info is uploaded in a ‘mainparams’ file. Default is `2`.
`POPFLAG`	Numeric. `0` or `1`, for compatibility with ‘structure’. Default `0`.
`POPID`	Numeric. Presence (`1`) or absence (`0`) of a column of population identifiers in the data file. Not needed if the info is uploaded in a ‘mainparams’ file. Default is `1`. The first two columns of ‘plink’ structure files can be treated as INDLABEL and POPID columns respectively.
`RECESSIVEALLELE`	Numeric. `0` or `1`, for compatibility with ‘structure’. Default `0`.
`marker.info.file`	Character string. The name of an optional file of additional marker information to be read in, with one row per marker and any number of columns. This can be joined to the primary output table (and to the locus table if `return.locus.table` is set to `TRUE`) when using the `data.prep` function. The file should contain parental allele frequencies when these are not to be calculated from the data. In that case, it must contain columns entitled `refAllele`, `alternateAllele`, `S0.prop_r` and `S1.prop_r`, as well as the obligatory `locus` column. The first two columns identify an arbitrarily chosen reference allele and the alternate allele, while `S0.prop_r` and `S1.prop_r` indicate the allele frequency of the chosen reference allele in each source (parental reference) population. Default is `NULL`.
`sourceAbsent`	Logical. Whether source (parental reference) samples are absent from the dataset. If `TRUE`, the parental allele frequencies of a reference allele for each locus must be present in the marker.info.file in columns named `S0.prop_r` and `S1.prop_r`. Default is `FALSE`.
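
As a rough illustration of how NUMINDS, NUMLOCI, PLOIDY and ONEROW relate to the dimensions of the marker data, the sketch below applies the rules stated under `ONEROW`. The helper names `expected_dims` and `infer_numloci` are hypothetical and are not part of gghybrid; the package's internal calculation may differ.

```r
# Hypothetical helpers (not part of gghybrid), based only on the ONEROW rules:
#   ONEROW = 1: one row per individual, PLOIDY columns per marker
#   ONEROW = 0: PLOIDY rows per individual, one column per marker
# Header and MAPDISTANCE rows are not counted here.
expected_dims <- function(NUMINDS, NUMLOCI, PLOIDY = 2, ONEROW = 0) {
  if (ONEROW == 1) {
    c(rows = NUMINDS, marker_cols = NUMLOCI * PLOIDY)
  } else {
    c(rows = NUMINDS * PLOIDY, marker_cols = NUMLOCI)
  }
}

# Number of loci implied by the number of marker columns in the file.
infer_numloci <- function(n_marker_cols, PLOIDY = 2, ONEROW = 0) {
  if (ONEROW == 1) n_marker_cols / PLOIDY else n_marker_cols
}

expected_dims(NUMINDS = 3, NUMLOCI = 2)              # layout of ex.data in the Examples
expected_dims(NUMINDS = 3, NUMLOCI = 2, ONEROW = 1)  # layout of ex2.data in the Examples
infer_numloci(n_marker_cols = 4, ONEROW = 1)
```

A mismatch between these expected dimensions and the file is a hint that NUMINDS, PLOIDY or ONEROW has been entered incorrectly.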

Details

read.data is designed to be compatible with data input files for the software structure, including those produced by plink. However, ‘plink’ structure files use zero to denote missing data, which is not currently allowed in the data.table package; this limitation is being addressed. For the time being, zeroes should be replaced before calling read.data, preferably with NA (no quotation marks needed).
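
One possible way to recode zeroes before calling read.data is sketched below in base R. It assumes a whitespace-delimited file with two header rows (marker names and MAPDISTANCE) and two pre-marker columns; the file names and the variables `n_header` and `n_precol` are illustrative and should be adapted to the actual file. Only the genotype fields are recoded, so header rows and identifier columns are left untouched.

```r
# Create a small plink-style structure file for illustration (temporary files).
infile  <- tempfile(fileext = ".data")
outfile <- tempfile(fileext = ".data")
writeLines(c("chr1:001 chr1:002",
             "-3 3 2 4",
             "ind1 pop1 A A A B",
             "ind2 pop1 B B B B",
             "ind3 pop2 0 B A A"), infile)

n_header <- 2  # assumption: marker-name row plus MAPDISTANCE row
n_precol <- 2  # assumption: INDLABEL and POPID columns

lines  <- readLines(infile)
header <- lines[seq_len(n_header)]
fields <- strsplit(lines[-seq_len(n_header)], "[[:space:]]+")

# Replace "0" with "NA" in the genotype fields only.
recoded <- vapply(fields, function(f) {
  geno <- f[-seq_len(n_precol)]
  geno[geno == "0"] <- "NA"
  paste(c(f[seq_len(n_precol)], geno), collapse = " ")
}, character(1))

writeLines(c(header, recoded), outfile)
readLines(outfile)
```

The recoded file can then be passed to read.data with `MISSINGVAL = NA`.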

The simplest file format to read in is a rectangular data table with a complete header row, one column per marker and one row per allele copy (equivalent to ONEROW = 0). With this format the only required fields other than those with default settings are file, nprecol, NUMINDS and MISSINGVAL. Furthermore, allele identities are read in as character strings and therefore each allele can be given any name or number, as long as there are only two unique alleles per locus. As with structure, all non-marker columns should be to the left of the marker columns in the data file.
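
To make the difference between the two layouts concrete, the base R sketch below reshapes a small ONEROW = 1 table (PLOIDY columns per marker) into the ONEROW = 0 layout (one column per marker, PLOIDY rows per individual). It is an illustration only and is not taken from the read.data source; the object names and placeholder column names are assumptions.

```r
ploidy <- 2
loci   <- c("chr1:001", "chr1:002")

# ONEROW = 1 layout: one row per individual, 'ploidy' adjacent columns per marker.
wide <- data.frame(
  INDLABEL = c("ind1", "ind2", "ind3"),
  POPID    = c("pop1", "pop1", "pop2"),
  m1_a = c("A", "B", NA),  m1_b = c("A", "B", "B"),
  m2_a = c("A", "B", "A"), m2_b = c("B", "B", "A"),
  stringsAsFactors = FALSE
)

precols <- wide[, 1:2]
geno    <- as.matrix(wide[, -(1:2)])

# For allele copy k, take marker columns k, k + ploidy, k + 2 * ploidy, ...
copies <- lapply(seq_len(ploidy), function(k) {
  cols <- seq(k, ncol(geno), by = ploidy)
  out  <- data.frame(precols, geno[, cols, drop = FALSE], stringsAsFactors = FALSE)
  names(out) <- c(names(precols), loci)
  out
})

# Stack the allele copies and group rows by individual (ONEROW = 0 layout).
long <- do.call(rbind, copies)
long <- long[order(long$INDLABEL), ]
rownames(long) <- NULL
long
```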

The data input file is required to have at least one column prior to the marker data columns. Typically there are two columns, the first (required) is referred to as ‘INDLABEL’ (but can be given a different column header) and represents the finest resolution of identification that may be required for the data set (usually a unique individual reference, but could for example be a population-level reference for poolseq data). The second is referred to as ‘POPID’ and indicates the population from which the individual was sampled. The main purpose of the ‘POPID’ column is to declare, downstream, which POPID values represent the parental reference samples, and it can also be useful for plotting hybrid index estimates. If the ‘POPID’ column is absent or if it is desired to use another column to identify parental reference samples, an alternative column must be identified downstream in the data.prep function.

Any ploidy is allowed, but the declared PLOIDY must be the same for all markers. So for example if data are haplodiploid, haploid markers should either be present as diploid homozygotes, or the second allele declared as missing data.
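
The two ways of coding haploid markers under a declared `PLOIDY = 2` can be sketched as follows. This is a base R illustration with made-up genotype calls, not a gghybrid function.

```r
ploidy <- 2
haploid_calls <- c(ind1 = "A", ind2 = "B")  # made-up haploid genotypes

# Option 1: code each haploid call as a diploid homozygote.
as_homozygote <- lapply(haploid_calls, rep, times = ploidy)

# Option 2: keep one allele copy and declare the remaining copies missing.
as_missing <- lapply(haploid_calls, function(a) {
  c(a, rep(NA_character_, ploidy - 1))
})

as_homozygote$ind1
as_missing$ind2
```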

If an associated ‘mainparams’ file is read in, it should contain two columns and no header row, with the first column holding the field names and the second the field values. The only mandatory fields are listed below, and each must be named using one of the synonyms shown: PLOIDY, MISSINGVAL/MISSING, ONEROW/ONEROWPERIND, INDLABEL, POPID, EXTRACOL/EXTRACOLS, MARKERNAME/MARKERNAMES. If RECESSIVEALLELE or MAPDISTANCE are present, they can be pluralized.
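
A minimal ‘mainparams’ file in this two-column, headerless layout could be written from R as sketched below. The file name, the field values and the use of a space as delimiter are assumptions made for illustration; check the result against your own data file before use.

```r
# Mandatory fields only; values chosen to match a file like ex.data in the Examples.
mainparams <- data.frame(
  field = c("PLOIDY", "MISSINGVAL", "ONEROW", "INDLABEL",
            "POPID", "EXTRACOL", "MARKERNAME"),
  value = c("2", "NA", "0", "1", "1", "0", "1"),
  stringsAsFactors = FALSE
)

# Hypothetical file name; written to a temporary directory here.
mp_file <- file.path(tempdir(), "ex_mainparams.txt")

# Two columns, no header row, no quotes.
write.table(mainparams, file = mp_file, quote = FALSE,
            row.names = FALSE, col.names = FALSE)

readLines(mp_file)

# The file name would then be supplied via the 'mainparams' argument, e.g.
# (not run; requires gghybrid and a matching data file):
# dat <- read.data(file = "ex.data", mainparams = mp_file, nprecol = 2, NUMINDS = 3)
```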

read.data uses the fread function from data.table to rapidly read in the data file. The loaded data file is therefore of class data.table and data.frame. A data.table can be treated the same as a standard data.frame for those not familiar with the package.

Value

read.data returns a list with one component containing the loaded data in ONEROW = 0 format, in the form of a data.table and data.frame, and other components used in downstream functions or otherwise potentially useful to the user.

The list contains the following components:

`mainparams`	A `data.table` and `data.frame` with the inputted `mainparams` information (or default values if no information provided).
`nprecols`	Numeric. The number of non-marker columns to the left of the marker columns in the imported data set.
`precols`	Character vector. The names of non-marker columns to the left of the marker columns in the imported data set.
`data`	A `data.table` and `data.frame`. The imported data.
`alleles`	A `data.table` and `data.frame`. The names of the two alleles at each locus.
`loci`	Character vector. The names of all loci in the imported data set.
`marker.info`	A `data.table` and `data.frame`. The imported marker.info.file.

Author(s)

Richard Ian Bailey richardianbailey@gmail.com

Examples


## Not run: 
#First create example data files and save to the working directory#

#1. Genomic (SNP) data as a regular table with full header row#
ex <- "INDLABEL,POPID,chr1:001,chr1:002\nind1,pop1,A,A\nind1,pop1,A,B\nind2,pop1,B,B\nind2,pop1,B,B\nind3,pop2,NA,A\nind3,pop2,B,A\n"

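#'ex' above shows the table as a comma-separated string for reference only. The same table is written below, space-delimited, to the working directory as 'ex.data', to be read in by read.data. Open in e.g. notepad or notepad++ to view in table format#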
cat("INDLABEL POPID chr1:001 chr1:002","ind1 pop1 A A","ind1 pop1 A B","ind2 pop1 B B","ind2 pop1 B B","ind3 pop2 NA A","ind3 pop2 B A", file = "ex.data", sep = "\n")

#2. Now create and save a file, ex2.data, in plink structure format (but with missing data recoded from 0 to NA), with a MAPDISTANCE row present below the marker names#
cat("chr1:001 chr1:002","-3 3 2 4","ind1 pop1 A A A B","ind2 pop1 B B B B","ind3 pop2 NA B A A", file = "ex2.data", sep = "\n")

#Load ex.data, in this case without specifying the number of loci (useful if this is not known in advance). These are the minimum arguments required for a file in this format#
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,MISSINGVAL=NA)#NUMLOCI is not specified so will be calculated, with a warning#

#The same but specifying the number of loci. The resulting object is identical#
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,NUMLOCI=2,MISSINGVAL=NA)#No warning this time as NUMLOCI is specified#

#The minimum arguments required to read in the PLINK structure format file with missing data recoded as 'NA' (gives a warning because NUMLOCI is not specified)#
dat2 <- read.data(file="ex2.data",MISSINGVAL=NA,NUMINDS=3,nprecol=2,ONEROW=1,MAPDISTANCE=1,precol.headers=0)

#Same, but displaying all arguments including those where the default works for this file format#
dat2 <- read.data(
  file="ex2.data",
  mainparams = NULL,              #Default#
  extracol.names = NULL,          #Default#
  precol.headers = 0,
  nprecol=2,
  markername.dup = 0,             #Default#
  NUMLOCI.autoAccept = TRUE,      #Default#
  EXTRACOL = 0,                   #Default#
  INDLABEL = 1,                   #Default#
  LOCDATA = 0,                    #Default#
  MAPDISTANCE = 1,
  MARKERNAME = 1,                 #Default#
  MISSINGVAL = NA,                #Cannot be zero for the time being#
  NUMINDS = 3,
  NUMLOCI = 2,                    #Only mandatory when PHASED=1; otherwise will be calculated internally if not known#
  ONEROW = 1,
  PHASED = 0,                     #Default#
  PHENOTYPE = 0,                  #Default#
  PLOIDY = 2,                     #Default#
  POPFLAG = 0,                    #Default#
  POPID = 1,                      #Default#
  RECESSIVEALLELE = 0,            #Default#
  marker.info.file = NULL,        #Default#
  sourceAbsent = FALSE            #Default#
)

#If NUMLOCI is not specified, the number of loci will be calculated with a warning, which will not interfere with downstream processes. 
#However, if 'NUMLOCI.autoAccept = FALSE' is set, the user is required to manually accept the calculated number of loci. This option is 
#included in case the user wishes to verify that the calculated number of loci is accurate. Stops with an error if the estimated NUMLOCI 
#is manually rejected. Example:
dat2 <- read.data(file="ex2.data",NUMLOCI.autoAccept = FALSE,MISSINGVAL=NA,NUMINDS=3,nprecol=2,ONEROW=1,MAPDISTANCE=1,precol.headers=0)

#Uploading a marker info file.

#Example: Create a marker info file indicating whether the locus is intronic or exonic, and save to the working directory#
cat("locus type","chr1:001 intronic","chr1:002 exonic", file = "ex_marker_info.data", sep = "\n")

#Read it in alongside the SNP data#
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,NUMLOCI=2,MISSINGVAL=NA,marker.info.file = "ex_marker_info.data")

#Loading a data file that does not include parental reference samples. For the situation where parental (S0 and S1) allele frequencies are 
#to be loaded, but not parental reference samples. A marker.info.file can be loaded regardless of whether S0 and S1 samples are present in 
#the dataset, but it is obligatory if they are absent. When 'sourceAbsent = TRUE', as a minimum the following columns with the exact headers 
#in the first set of quotation marks below are required in the marker.info.file (more columns are allowed).
cat("locus refAllele alternateAllele S0.prop_r S1.prop_r type","chr1:001 A B 0.1 0.9 intronic","chr1:002 B A 0.8 0.2 exonic", file = "ex_marker_info2.data", sep = "\n")

#"...prop_r" means the allele frequency of the reference allele. Choice of reference and alternate alleles is arbitrary#
dat <- read.data(file="ex.data",nprecol=2,NUMINDS=3,NUMLOCI=2,MISSINGVAL=NA,marker.info.file = "ex_marker_info2.data", sourceAbsent = TRUE)

#The data.prep function will then determine which allele has higher frequency in S1, as it would if parental reference samples were included#

unlink("ex_marker_info.data")#Tidy up#
unlink("ex_marker_info2.data")#Tidy up#
unlink("ex.data")#Tidy up#
unlink("ex2.data")#Tidy up#

## End(Not run)

ribailey/gghybrid documentation built on Feb. 2, 2024, 12:53 a.m.