In andrewparkermorgan/argyle: Basic interface and QC for genotypes from Illumina Infinium arrays

knitr::opts_chunk$set(
  fig.align = "center",
  fig.width = 6, fig.height = 6
)

Introduction

High-throughput genotyping of tens of thousands of SNPs using microarrays is common practice in both laboratory and population genetics. Genotypes at a dense panel of biallelic markers with a low rate of missing data are a valuable resource for breeding, marker-assisted selection, genetic mapping and analyses of population structure. The Illumina Infinium system is one popular and cost-effective ($\sim 100$/sample) platform. Custom Illumina arrays are available for many organisms of research, agricultural or ecological interest including mouse [@Ccc2012], dog, chicken, cow, pig, horse, sheep, salmon [@Johnston2013] and cotton Hulse-Kemp et al. (2015).

The argyle package provides basic functionality for loading biallelic SNP datasets directly from (1) the output of Illumina BeadStudio; or (2) PLINK binary filesets. For genotypes from Illumina arrays, hybridization intensity data is stored in parallel with genotype calls for use in quality control and downstream analyses. Several functions for quality control at the level of both hybridization intensity (where available) and genotype calls are provided. Data can be exported from argyle to PLINK binary format; or to formats compatible with the R/qtl package for genetic mapping in experimental crosses [@Broman2003], or the DOQTL package [@Gatti2014] for mapping in multi-founder outbred populations.

The design of argyle is inspired by the PLINK software [@Purcell2007]. A PLINK fileset has three parts: a genotype matrix, a marker map and a "pedigree" (sample and family metadata) file. Likewise the central data structure in this package (the genotypes object) stores a genotype matrix in parallel with a marker map and sample metadata.

Note: argyle was designed with mouse genetics in mind, but should be applicable to any diploid organism with an X-Y sex chromosome system (with males XY and females XX.)

Caveats and other software

We created argyle to fill a specific niche: quality control and basic exploratory analysis of genotype data obtained from the Mouse Universal Genotyping Array series (MUGA, MegaMUGA and GigaMUGA). These are custom Illumina Infinium arrays processed by Neogen Inc (Lincoln, NE). The design and content of the MUGA arrays are described elsewhere (Morgan & Welsh (2015)).

This package explicitly favors simplicity and readability of code over raw efficiency. It is appropriate for the "medium-sized" data -- say, tens of thousands of markers and hundreds of individuals -- regularly encountered in experimental genetics. Users with larger datasets routinely collected in human genetics -- say, millions of markers and thousands of individuals -- which do not fit comfortably in memory as plain-vanilla R objects probably want to explore more sophisticated R packages (such as the GenABEL suite).

A zoo of R packages already exists for normalization of raw intensity data from Illumina platforms: see beadarray [@Dunning2007], and crlmm[@Ritchie2009], among others. Users interested in exploiting the copy-number signal from Illumina arrays should consult these packages and related references. Gross copy-number aberrations affecting hundreds or thousands of markers, such as (partial) aneuploidy, can be identified in argyle; see Intensity-based analyses for more.

Although argyle implements some basic frequency calculations for use in genotyping quality control, it makes no effort to duplicate or replace the functionality of existing R packages for statistical and population genetics. More serious calculations -- marker LD, Hardy-Weinbery equilibrium tests, association tests -- can be accessed through the pacakge's PLINK wrappers. (These calls don't require holding the genotypes themselves in memory in the R session, so they can be applied to quite large datasets.)

The `genotypes` object

The central datastructure in the argyle package is the genotypes object. It is simply a matrix of genotypes (markers in rows, samples in columns) with a marker map and sample metadata stored as attributes. Row names (marker names) and column names are strictly required in order to keep the various pieces of the object unambiguously in sync. Because a genotypes is a matrix (ie. is.matrix(x) == TRUE), any function which accepts a matrix will also accept a genotypes. Standard attributes of the genotypes object are as follows.

attr(,"map") -- the marker map, in PLINK format. Required columns are
- chr -- chromosome identifier (anything containing "X" assumed to be chrX; anything with a "Y", chrY); NA for missing
- marker -- globally-unique marker identifier without whitespace
- cM -- genetic position in centimorgans; 0 for missing
- pos -- physical position in basepairs; 0 for missing
- A1 -- allele 1, arbitrarily defined (but we will label it the REF allele)
- A2 -- allele 2, arbitrarily defined (but we will label it ALT)
attr(,"ped") -- sample information, in PLINK format. Required columns are
- fid -- the "family" or group (or batch, or strain, ...) label
- iid -- globally-unnique individual identifier without whitespace
- mom -- individual ID of this sample's mother; 0 for missing
- dad -- individual ID of this sample's mother; 0 for missing
- sex -- sex; 1 = male, 2 = female, 0 = missing
- pheno -- phenotype; 0 or -9 = missing, 1 = "control", 2 = "case"; or any floating-point value for a quantitative trait
attr(,"alleles") -- allele encoding (see Allele encoding schemes later in this vignette), one of "native", "01", "relative" or "parent".
attr(,"filter.sites") -- character vector of length equal to number of rows of genotype matrix, with non-blank values flagging suspicious markers
attr(,"filter.samples") -- character vector of length equal to number of columns of genotype matrix, with non-blank values flagging suspicious samples

When hybridization intensity data are included, the object has several additional attributes.

attr(,"intensity") -- a named list of length 2, whose elements $x and $y are matrices of $x$ and $y$ hybridization intensities. These matrices have the same shape, and the same row and column names, as the main genotypes matrix.
attr(,"normalized") -- logical scalar indicating of any normalization has been applied to the intensity matrices

If the thresholded quantile-normalization procedure (tQN; see section Particulars for Illumina arrays later in this section) has been performed, the result will be stored in two additional attributes. attr(,"baf") -- matrix of B-allele frequency (BAF) values attr(,"lrr") -- matrix of log2-intensity ratio (LRR) values

Several R generics are implemented for working genotypes objects. To demonstrate them, we will use an example dataset containing genotypes from $116$ mouse samples at $14319$ markers across three autosomes (chr17, chr18, chr19) plus the Y chromosome.

library(plyr)
library(argyle)
data(ex)

The summary() method returns a brief overview of the contents of a genotypes: count of samples and markers; how alleles are encoded; whether underlying hybridization-intensity is also included; and any quality filters which have been set.

summary(ex)

The print() method returns the same information as summary(), plus a count of markers by chromosome. It is included only to prevent accidentally flooding an interactive terminal with the contents of a big object.

print(ex) # or just type 'ex'

Use head() to peek at the first $k$ markers in a genotypes, and to see underlying marker and sample

head(ex, n = 6, nsamples = 6)

Several accessor methods provide convenient shortcuts to obtain marker map, sample metadata, etc. without need for the awkward attr(x, "attribute") syntax. These methods offer read-only access to object attributes.

Technical note: Accessor functions are read-only for two reasons. The first is to raise the barrier to accidental overwriting of data. The second is attr<- is an R primitive, so that setting attributes can be performed without making a new copy of an object. However, user-defined functions for setting object attributes would have to pass the object by value rather than by reference, making at least one new copy in the process. This performance hit becomes significant when handling objects containing hundreds of samples genotyped at tens of thousands of markers.

To pull out the marker map, use

## see the marker map
map <- markers(ex)
head(map)

The result is a dataframe whose rows are parallel to those in the genotypes matrix. The map must have rownames which match the rownames of the parent object. Columns 1 through 6 are the required columns for a PLINK marker file (*.bim); remaining columns can store any extra marker metadata of interest.

To pull out sample information, use

## see sample metadata
mice <- samples(ex)
head(mice)

The result is a dataframe whose rows are parallel to the columns of the genotype matrix, with matching names. The first 6 columns are the required columns for a PLINK "family" file (*.fam, *.tfam): group ID, individual ID, mother ID, father ID, sex (1 = male, 2 = female), and phenotype (0/-9 = missing, 1 = control, 2 = case; or any floating-point number for a quantitative trait). For downstream compatibility with PLINK, missing values are encoded as 0 rather than NA. Note that the rownames of the sample metadata must match the column iid (unique individual ID), and the individual IDs cannot contain whitespace characters.

Subsets of a genotypes object can be obtained in two ways. First, the generic [ indexing operator has been implemented for genotypes to allow for taking "slices" of the genotypes matrix and to have that slicing propagated to all the object's attributes. As usual, the index argument(s) to [ can be logical vectors (TRUE to include a row or column in the result), character vectors (row or column names to include), or integer vectors (row or column indices to include). For example, we can extract the first $1000$ markers in the example dataset and call summary() to check the result.

first1k <- ex[ 1:1000, ]
summary(first1k)

We could also subset by samples (columns); here we pick a random $10$ columns.

summary( ex[ ,sample.int(ncol(ex), 10) ] )

A common task is to extract markers or samples by groups defined in the marker map (eg. all markers in some genomic region) or sample metadata (eg. only female samples). The generic subset() has been implemented for genotypes to make the syntax for such operations simpler. Similar to the subset(x, ...) method for dataframes, the indexing expression in ... is evaluated in the scope of the marker map (the default, by = "markers") or sample metadata (by = "samples"). For example, the following pairs of code fragments produce the same result.

## get only markers on chrY
x <- subset(ex, chr == "chrY")
y <- ex[ (markers(ex)$chr == "chrY"), ]
identical(x,y)

## get only the female samples
x <- subset(ex, sex == 2, by = "samples")
y <- ex[ ,(samples(ex)$sex == 2) ]
identical(x,y)

Technical note: If using the subset() syntax, unexpected results can occur if a variable defined in the current environment shares a name with a column in the relevant dataframe. This is due to underlying use of R's "non-standard evaluation" in subset(). Just be careful.

Technical note: A side effect of overloading the [ operator for genotypes is that any operation which repeatedly slices a matrix (including functions in the apply family) will incur the extra cost of slicing the marker map, sample metadata, intensity matrices, etc. on every pass. If this is a problem, use the unexported argyle:::.copy.matrix.noattr() function to make a copy of the genotypes matrix which preseves row and column names but dumps all the other attributes.

The argyle package borrows its notion of a "filter" from the VCF format: blank (in our case "") unless there is evidence for low quality. Unlike VCF, which allows filters for sites (in our case markers) only, argyle defines filters for both sites and samples. A marker (site) or sample can be marked as suspicious by adding single-character codes to its entry in attr(,"filter.sites") or attr(,"filter.samples") respectively. Let us check how many filters are set in the example dataset.

## see quality filters (like VCF's 'FILTER' field)
summarize.filters(ex)

Set filters for the first $10$ sites and check them.

attr(ex, "filter.sites")[1:10] <- "N"
summarize.filters(ex)

To drop all filtered sites and/or samples, use apply.filters(). The default behavior is to apply filters to both sites and samples (apply.to = "both").

fl <- apply.filters(ex, "both")
summary(fl)

After application of filters, all filters in the resulting genotypes are (trivially) set to blanks.

Allele encoding

The argyle package supports both character and numeric representation of genotypes. Character genotypes (the "native" encoding) are expected to take the values ACGTHN, where H represents a heterozygous call and N a no-call (ie. missing data). Since all markers are expected to be biallelic SNPs, character genotypes can be converted to a computationally more convenient numeric encoding in one of two ways.

By reference alleles ("01"). Genotypes are encoded 0 if they match column A1 in the marker map, 1 if heterozygous (H), 2 if they match column A2, and missing (NA) otherwise. This encoding is most useful for analyses of experimental crosses.
By frequency ("relative"). Genotypes are encoded 0 if they match the majority homozygous call (the major allele) at this marker within this genotypes object, 1 if heterozygous, 2 if they match the other homozygous call (the minor allele), and NA if missing. This encoding is most useful for population-based analyses.

All genetic analyses in argyle require that genotypes be first converted to a numeric encoding. Since both allele encoding and reference alleles (A1 and A2) are both required fields in a genotypes object, inter-conversion between the "01" and "native" encodings entails no loss of information. Conversion from "relative" to "native" is not supported, mostly to avoid the ambiguity that results when minor alleles are defined against a genotype matrix which is then subsetted (possibly destroying the definition of the "minor allele.")

To convert between encodings, use recode():

## convert from character ('native') to numeric ('01')
ex.recode <- recode(ex, "01")
summary(ex.recode)

And now convert again, back to the original encoding, and prove that no information was lost.

## convert back
ex.rerecode <- recode(ex.recode, "native")
identical(ex, ex.rerecode)

Particulars for Illumina arrays

The technical aspects of the Illumina Infinium genotyping array platform are described at length in [@Steemers2006]. Briefly, oligonucleotide probes are designed to target $k$ invariant base pairs adjacent to a biallelic SNP. These oligos are conjugated to silica beads which are addressed into a microarray. Sample DNA is hyridized to the array, and a single-base extension reaction is performed at the target SNP using fluorescently-labelled nucleotides. Two-channel fluorescence intensity signals are captured by a scanner.

Raw fluorescence signals are post-processed and normalized prior to genotype-calling. Many normalization algorithms have been proposed that take advantage of specific properties of the Illumina platform to estimate and remove technical artefacts; see [@Du2008] for discussion. Illumina's own BeadStudio software use a procedure described in [@Peiffer2006] that attempts to place samples into three clusters: two (presumably) homozygous clusters, one near the $x$- and another near the $y$-axes, and a (presumably) heterozygous cluster along an arc between them.

Hybridization-intensity signals have carry information about both genotype and copy number. Discrete genotypes for a group of samples can be inferred by first clustering samples in the $x,y$ plane at each marker; then identifying the three canonical clusters (corresponding to genotypes AA, AB and BB at the target SNP); and finally assigning a most probably cluster membership to each sample. The total hybridization intensity along both $x$ and $y$ dimensions -- and, at heterozygous markers, the relative magnitude of $x$ and $y$ signals -- is informative for copy number. "Total intensity" is defined by argyle not as $x+y$ but as $d = \sqrt{x^2+y^2}$, the total distance from the origin. This is ancedotally superior to the simple sum if intensities are saturated along either dimension.

Note The few intensity-based analyses in argyle have BeadStudio output in mind. They are appropriate for identifying failed or contaminated samples, but not much more. The package was designed to complement a custom array for mouse (the Mouse Universal Genotyping Array (MUGA) series) supplied by Neogen Inc. That vendor supplies BeadStudio output to its customers. Users with deeper interest in normalization and copy-number estimation should consult the Bioconductor "Microarray analysis" task view for more information.

For well-performing markers, genotype is obvious in the $x,y$ coordinate system, as demonstrated by the plot below (which we will call a "cluster plot"). The function plot.clusters() takes a genotypes and a list of markers (or any valid row-indexing vector for the genotypes matrix) and generates a plot with one panel per marker.

plot.clusters(ex, "JAX00429559")

We can check that the example dataset has intensity matrices attached as follows.

has.intensity(ex)

To extract the full hybridization intensity matrices from a genotypes, use intensity(). The result is a named list of length 2 with elements $x and $y. Here we just check the dimensions of the result, to see that they match each other and those of the parent genotypes matrix.

## get intensity matrices (x and y)
intens <- intensity(ex)
lapply(intens, dim)

Use of intensity data for genotyping quality control is discussed in Quality control below.

Data import

Genotypes can be bundled into to a genotypes object three ways: manually from an R matrix using the genotypes() constructor; from Illumina BeadStudio output; or from a PLINK binary fileset.

First we demonstrate making a genotypes from scratch. The genotypes() constructor requires a genotypes matrix with appropriate row (marker) and column (sample) names; a properly-formatted marker map; and an explicitly-specified allele encoding. (This is important.) If the dataframe of sample metadata is absent, a mostly-uninformative one will be automatically generated from the column names of the genotypes matrix. In this example we will explicitly provide sample metadata. Recall that sex is encoded as the number of X chromosomes present: 1=male, 2=female.

sm <- c("id1","id2","id3")
mk <- c("snp1","snp2","snp3","snp4")
map <- data.frame(chr = "chr1", marker = mk,
                  cM = c(0.5, 1.0, 1.5, 2.0),
                  pos = c(0.5, 1.0, 1.5, 2.0)*1e6,
                  A1 = c("A","G","A","C"),
                  A2 = c("C","T","G","G"))
rownames(map) <- mk

fam <- data.frame(iid = sm, fid = "testers", sex = c(2,2,1))
rownames(fam) <- sm

G <- matrix( c("A","C","A",
               "T","G","T",
               "A","G","A",
               "C","G","C"),
             byrow = TRUE, nrow = 4, ncol = 3,
             dimnames = list(mk, sm) )

geno <- genotypes(G, map = map, ped = fam, alleles = "native")

To confirm that the object was created as expected, check the summary:

summary(geno)

And check the contents:

head(geno)

From Illumina BeadStudio

Import of genotypes and hybridization intensities from BeadStudio reports is achieved with the read.beadstudio() function. Although the format of the human-readable output from BeadStudio can be customized by the user, argyle assumes that it has the format supplied to customers of Neogen Inc. That format involves two files:

Sample_Map.zip -- a sample manifest, with column headers (at least) Name and Gender.
{prefix_}FinalReport.zip -- the genotyping results. This file has a 9-line header section followed by a line of column headings (expected to have the order below), and then the data itself. The first 6 columns should be:
- SNP Name -- marker identifier, globally-unique
- Sample ID -- sample identifier matching contents of column Name in Sample_Map
- X -- transformed $x$-intensity
- Y -- transformed $y$-intensity
- Allele1 - Forward -- genotype call for allele 1 (one of ACTG-)
- Allele2 - Forward -- genotype call for allele 2 (one of ACTG-)

The - character is assumed to represent a missing genotype, and rows with - in either Allele1 or Allele2 will be marked as missing.

The data.table package, which provides memory-efficient re-implementation of the base-R dataframe, is used to read from these files. It can comfortably handle several million rows at reasonable speed on a modern laptop.

Decompressing the files is not necesssary if command-line zip is available -- argyle will decompress on-the-fly. For platforms without command-line zip (eg. Windows), FinalReport.zip must be unzipped but Sample_Map should not be. (This is due to a quirk of data.table.)

In addition to the files from BeadStudio, a dataframe containing a valid marker map (as discussed in the Introduction) is required to perform the import. Pre-computed ones for the MUGA series of arrays are available from [URL]. Users of other arrays will have to prepare their own.

The running time of read.beadstudio() for a realistic dataset (say $80,000$ markers and $96$ samples) is about $\sim 1$ minute on my 2014 MacBook Air. That's inconveniently long for this vignette but quite reasonable in practice. The code below demonstrates the command, but is not actually run. The parmater prefix corresponds to the * in the filename of *FinalReport.zip.

data(snps) # the marker map
geno <- read.beadstudio(prefix = "", snps, in.path = "path/to/containing/folder")

From `PLINK` a fileset

Reading a PLINK binary fileset with argyle is simple with read.plink(). As an example, we can load genotypes of 28 wild mice from the Mouse Diversity Array [@Yang2011], provided in this package's "data/" directory. Genotypes from PLINK filesets are always read directly to the "01" encoding.

infile <- system.file("extdata", "wild.chr19.bed", package = "argyle")
wild <- read.plink(infile)

print(wild)

Check the contents to see that import worked:

head(wild, n = 6, nsamples = 6)

From your own database

Users who routinely genotype hundreds or thousands of samples will probably want to store genotypes an a relational database. No (direct) database interface has been implemented in argyle yet. However, once a matrix of genotypes is available, a genotypes can be constructed using the "from scracth" method outlined at the top of this section.

Quality control

Careful quality control and sanity checking of microarray genotypes is an essential prerequisite to further analysis. QC can (and should) be performed both marker-wise and sample-wise. When hybridization-intensity data is available, QC should be performed on both intensities -- they are the primary data -- and the genotype calls. The argyle package provides a suite of functions implementing best practices developed in our group for QC of genotypes from the MUGA arrays.

For sample-wise QC, we recommend checking (at least) the following three metrics.

Call rate. Arrays which fail due to poor input DNA quality, mixed or contaminated samples, or technical problems at the hybridization step have an excess of heterozygous (H) and missing (N) calls relative to expectations. Of course those expectations are key: highly inbred samples should have relatively few H calls, while samples from outbred or natural populations should have many Hs.
Intensity distribution. Failed arrays have lower mean intensity and greater variance in intensity across probes than successful arrays. Comparing the distribution of total intensity ($d = \sqrt{x^2 + y^2}$; see Introduction) by sample within a batch, and between batches when multiple batches are available, can reveal failed samples.
Sex-chromosome concordance. Nominally male samples with an excess of heterozygous calls on the X-chromosome, or female samples with nonmissing calls on the Y-chromosome, represent potential failures or sample swaps.

We will demonstrate these procedures on the example dataset in ex, which includes intensity data. To run sample-wise QC checks, use run.sample.qc(), which returns a copy of the input with a new attribute attr(,"qc") containing the results.

data(ex)
ex <- run.sample.qc(ex)

Use qcplot() to generate the default QC plots. (If run.sample.qc() had not been run already, it would be automatically called here.)

qcplot(ex)

The upper panel shows the count of A, B, H and N calls by sample. The lower plot shows intensity percentiles (contours) across all samples. Samples are sorted by their median intensity in both panels. In this example dataset, there are no outlying intensity profiles, but the samples in the left half of the plot have many more H calls than those in the right half despite having similar number of no-calls. This is expected: about half of the samples in the example dataset are inbred mouse strains, and the other half are F1s between them.

We can also check for concordance between the nominal sex of the samples and their sex as predicted by genotype. The current version of argyle uses only a crude threshold for number of non-missing Y-chromosome calls to predict sex. The defaults are calibrated for the GigaMUGA array (from which the example dataset is taken). More sophisticated sex-chromosome predictions based on hybridization intensity will be implemented in the future.

sexing <- predict.sex(ex)
xtabs(~ predicted + nominal, data = sexing)

There is perfect concordance between the nominal and inferred sex of these samples.

By default, run.sample.qc() does not apply any filters. In practice it will be useful to flag samples with excess N and H calls using the arguments max.N and max.H respectively. While the number of no-calls in our dataset should be approximately constant across samples, the number of heterozygous calls will be very different for inbred samples and outbred samples. The argyle package supports either constant thresholds or group-specific thresholds which are applied based on a sample's fid (recall that this is the "family ID" in attr(,"ped")). The snippet below demonstrates the alternative syntax.

## apply same 'N' filter to all samples (max.N is a constant)
## but different 'H' filter to F1 and inbred samples (max.H is a named list)
ex <- run.sample.qc(ex, max.N = 1200, max.H = list(inbred = 80, F1 = 10e3))
qcplot(ex)

Single-character flags are used to describe the thresholds failed by a sample or a marker. The default flags are as follows.

| code | description | samples | markers | |------|-------------|:-------:|:-------:| | N | excess no-calls | x | x | | H | excess heterozygosity | x | x | | I | aberrant intensity pattern | x | | | F | aberrant allele frequency | | x |

After removing failed samples we can identify low-performing markers. A low-performing marker is one with (1) high rate of missingness; (2) low or zero minor-allele frequency in the population of interest; or (3) higher-than-expected heterozygosity given its minor-allele frequencies. (The Illumina genotype-caller is prone to misclassifying no-calls as heterozygous calls.) The freqplot() function generates a graphical summary of call frequencies at markers falling below defined thresholds. A text summary is written to the console.

## plot low-performing markers
freqplot(subset(ex, chr != "chrY"), max.N = 0.2, max.H = 0.5, min.maf = 0.1)

Note that we have excluded Y-linked markers from this analysis: the Y chromosome requires bespoke QC beyond the scop eof this vignette.

Note: Criteria for excluding samples or markers from an analysis depend completely on the samples and the experiment at hand.

Intensity-based analyses

Hybridization intensity to Illumina arrays carries information about both genotype and copy number. Whereas the $x,y$ intensities at a single probe are (usually) sufficient to determine genotype at the corresponding target SNP, noise in the hybridization signal means that intensities from adjacent probes must be aggregated to assess copy number. Inter-probe variability in $x$, $y$ and their relative magnitude is accomodated by two transformations introduced by [@Peiffer2006]: the B-allele (pseudo-)frequency (BAF) and log2-intensity ratio (LRR). Both quantities are calculated with respect to intensity values derived from a large set of reference samples.

The B-allele frequency is not, in fact, anything to do with a true allele frequency, but rather a measure of the distance of a sample from the BB homozygous reference cluster, scaled and truncated to fall in $[0,1]$. At homozygous markers BAF takes values near $0$ or $1$; at heterozygous markers it takes values near $0.5$. The LRR is $\log2{(R/R_{ref})}$ where $R = x + y$ and $R_{ref} = x_{ref} + y_{ref}$, and in turn $x_{ref},y_{ref}$ are the centroid of the reference cluster to which the same is nearest. It is expected to have mean zero across the array, in the absence of aneuploidy or large structural variants.

Technical artefacts at the hybridization or scanning steps can introduce bias in BAF and LRR. The function argyle::tQN() (re-)implements a normalization procedure described in [@Staaf2008], modelled on code from the CLASP package [@Didion2014]. Briefly, quantile normalization (see [@Bolstad2003]) is applied to the array-wide $x$ and $y$ intensities per sample to account for systematic differences in signal intensity from the two fluorophores used in the Infinium chemistry. Relative differences between the pre- and post-normalization intensities are subject to a pre-specified threshold (by default, $1.5$). Then BAF and LRR are computed using a matrix of (AA, AB, BB) reference cluster centroids.

The tQN procedure is somewhat time-consuming; for the purposes of this vignette we show the relevant code but do not actually execute it.

## not run
load("clusters.Rdata")
ex.norm <- tQN(ex, clusters = clusters)

Use the function bafplot() to see the result for a single sample, with smoothed fits superposed in red.

## not run
bafplot(ex.norm, "AD_15423_F")

Analysis of experimental crosses

To demonstrate the use of argyle for designing and analysing a laboratory experiment, we consider a hypothetical F2 cross between the inbred mouse strains A/J (coded 'A' in the example dataset) and NOD/ShiLtJ (coded 'D'). To analyze genotypes from the F2 offspring, we need to know the following:

which markers are homozygous in the parents
which markers are predicted to be segregating between the parents
which markers work as predicted (homozygous in the parents, predicted to segregate, heterozygous in the F1)

First, see that r sum(grepl("^AA", samples(ex)$iid)) A/J samples (AA) and sum(grepl("^DD", samples(ex)$iid)) NOD/ShiLtJ samples (DD), and and r sum(grepl("^[AD][AD]", samples(ex)$iid)) corresponding F1 hybrids are present in the sample dataset.

xtabs(~ fid, subset(samples(ex), grepl("^[AD][AD]", iid)))

Before going any further, genotypes must be converted to numeric encoding. For convenience, we also set the fid field to a sample's genotype.

ex <- recode(ex, "01")
attr(ex, "ped")$fid <- substr(samples(ex)$iid, 1, 2)

The genoapply() function is a generalization of R's apply family for genotypes objects. Just as apply(x, 1, ...) applies a function over the rows of a matrix, genoapply(x, margin = 1, grp, fn, ...) applies fn() by markers (rows in the genotypes matrix.) Unlike apply(), the genoapply() function takes an additional grouping expression grp to apply a function by marker groups (with margin = 1) or sample groups (with margin = 2) defined by the value of grp. We demonstrate genoapply() to compute the consensus genotype of the parental strains in our example cross. For these analyses we will, of course, ignore the Y-chromosome.

ex <- subset(ex, chr != "Y")
parents <- subset(ex, fid == "AA" | fid == "DD", by = "samples")
cons <- genoapply(parents, 2, .(fid), consensus)

Note that, since the grouping expression was wrapped in .(), it was evaluated in the context of the sample data. This behavior is intended to make the syntax less clunky. The consensus() function takes a genotypes as input and returns the consensus genotype across all samples at each marker. (See ?consensus for details.)

The object cons is a list of numeric vectors of consensus genotypes for the parent strains. Combine them into a new genotypes as follows:

cons <- genotypes( do.call(cbind, cons),
                    map = markers(parents),
                    alleles = "01" )

Identify markers with fixed differences between the parents using fixed.diffs(), and count them.

is.diff <- fixed.diffs(cons[ ,c("AA","DD") ])
sum(is.diff)

There are up to r sum(is.diff) markers informative in an F2 cross between these strains, of r nrow(ex) total markers on the array.

Now predict the genotype of the F1s and check how many of those predictions match the observed genotype in the (A/JxNOD/ShiLtJ)F1 sample in dataset. Note that this prediction includes markers with the same homozygous genotype in both parents (which should be homozygous for that same genotype in the offspring), and with opposite homozygous genotypes between the parents (which should be heterozygous in the offspring.) Markers heterozygous in the parents will be marked as missing in the offspring. Really we only care about the informative markers:

f1s <- predict.f1(cons)
table(f1s[is.diff] == ex[ is.diff,"AD_15423_F" ], useNA = "always")

All but a few of the r sum(is.diff) informative markers perform as expected in the heterozygous state.

The default genotype encoding (either arbitrary or with respect to the major allele in some population) is not very useful for genetic analysis of an experimental cross. The argyle package provides recode.to.parent() to encode genotypes with respect to a "parent" sample (in general, an inbred strain.) This is the encoding scheme used by R/qtl.

by.mom <- recode.to.parent(ex[ ,grepl("AD", colnames(x)) ], cons[,"AA"])
summary(by.mom)

Once genotypes are in the "parent" encoding, we could convert them to an R/qtl::cross object to use that package for genetic mapping.

fake.cross <- argyle:::as.rqtl(by.mom)

Population-based analyses (and `PLINK`)

To demonstrate the use of argyle on population-genetic data, we use genotypes (on chr19) from 28 wild mice genotyped on the Mouse Diversity Array [@Yang2011]. These are provided as a PLINK fileset (wild.chr19.*) in this package's extdata/ directory. The fid column of these samples indicates their mouse subspecies of origin.

First load the genotypes from the PLINK fileset. Genotypes from PLINK are always loaded in the "01" encoding.

ff <- system.file("extdata", "wild.chr19.bed", package = "argyle")
wild <- read.plink(ff)

summary(wild)

Check for gross population structure using PCA on the genotypes matrix using pca(). The return object is a dataframe with samples projected onto the top $K$ PCs, and has class pca.result.

pc <- pca(wild, K = 5)

head(pc)

The plot() generic is implemented for pca.result objects.

plot(pc, screeplot = TRUE)

The panel at left shows the usual PCA plot, of samples projected onto principal components of the genotypes matrix. The panel at right is the "scree plot" of the relative magnitude of successive eigenvalues associated with the principal components.

With screeplot = FALSE we can suppress the "scree plot". In this case a ggplot object is returned, to which we can add layers to make a more informative plot.

library(ggplot2)

pc$fid <- factor(pc$fid, levels = c("dom","mus","cas"),
                 labels = c("M.m. domesticus","M.m. musculus","M.m. castaneus"))
fid.cols <- c("#377eb8","#e41a1c","#4daf4a")

plot(pc, screeplot = FALSE) +
    geom_point(aes(colour = fid)) +
    scale_colour_manual("subspecies", values = fid.cols)

The PCA reveals the expected differentiation between the three mouse subspecies: Mus musculus domesticus, M. m. musculus and M. m. castaneus.

We could also, for example, inspect the distribution of minor-allele frequencies within each subspecies by wrapping the function maf() in a call to genoapply().

wild <- recode(wild, "relative")
mafs.by.ss <- genoapply(wild, 2, .(fid), maf)

mafs <- do.call(cbind, mafs.by.ss)
mafs <- as.data.frame(mafs)
mafs$marker <- rownames(mafs)

mafs.m <- reshape2::melt(mafs, id.var = "marker")
mafs.m$variable <- factor(mafs.m$variable, levels = c("dom","mus","cas"),
                          labels = c("M.m. domesticus","M.m. musculus","M.m. castaneus"))

ggplot(mafs.m) +
    geom_freqpoly(aes(x = value, colour = variable), binwidth = 0.1) +
    scale_colour_manual("subspecies", values = fid.cols) +
    theme_classic(base_size = 9) +
    xlab("\nminor-allele frequency")

Wrappers for several PLINK commands are provided in argyle:

pca.plink(...) -- perform PCA
mds.plink(...) -- perform classical multidimensional scaling (MDS)
filter.plink(...) -- filter genotypes by missingness, minor-allele frequency, genomic position...
ld.plink(...) -- calculate pairwise LD via an EM algorithm
prune.plink(...) -- pruner markers by LD, to obtain a set in approximate linkage equilibrium
assoc.plink(...) -- perform association tests on binary or quantitative traits under a variety of models

These wrappers issue a system call to the plink executable (which must be in the users '$PATH' or Windows equivalent), then check that the expected output files are present. If so, they are read into the appropriate R object (usually dataframe or matrix); if not, argyle attempts to fail politely with a useful error message. Command-line output from PLINK is echoed to the R terminal.

Using the PLINK wrappers on an existing PLINK fileset does not require first loading the genotype matrix into the R session. Simply create a pointer to it using plinkfy():

#$ recall that 'ff' is the path to the PLINK fileset of wild-mouse genotypes
rm(wild)
wild <- plinkify(ff, where = tempdir())

summary(wild)

The resulting pointer has class plink. All the PLINK wrappers take such an object as their first argument. The where argument specifies the location to which output should be written; the default is the temporary directory corresponding to the current R session.

Safety warning: PLINK will, by default, place its output in the same directory as the input fileset. This creates risk of overwriting the input if the user is not careful. Unless there is good reason to do otherwise, leave where set to the default to avoid clobbering important data.

References

andrewparkermorgan/argyle documentation built on May 10, 2019, 11:08 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

andrewparkermorgan/argyle
Basic interface and QC for genotypes from Illumina Infinium arrays

In andrewparkermorgan/argyle: Basic interface and QC for genotypes from Illumina Infinium arrays

Introduction

Caveats and other software

The `genotypes` object

Allele encoding

Particulars for Illumina arrays

Data import

From Illumina BeadStudio

From `PLINK` a fileset

From your own database

Quality control

Intensity-based analyses

Analysis of experimental crosses

Population-based analyses (and `PLINK`)

References

R Package Documentation

Browse R Packages

We want your feedback!

andrewparkermorgan/argyle Basic interface and QC for genotypes from Illumina Infinium arrays

In andrewparkermorgan/argyle: Basic interface and QC for genotypes from Illumina Infinium arrays

Introduction

Caveats and other software

The genotypes object

Allele encoding

Particulars for Illumina arrays

Data import

From Illumina BeadStudio

From PLINK a fileset

From your own database

Quality control

Intensity-based analyses

Analysis of experimental crosses

Population-based analyses (and PLINK)

References

R Package Documentation

Browse R Packages

We want your feedback!

andrewparkermorgan/argyle
Basic interface and QC for genotypes from Illumina Infinium arrays

The `genotypes` object

From `PLINK` a fileset

Population-based analyses (and `PLINK`)