knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The fundamental tenet of tidy data that is that each observation should have its own row and each variable its own column, such that each value has its own cell. Applying this logic to population genetic data means that each individual should have its own row, with individual metadata (such as its population, sex, phenotype, etc) as the variables. Genotypes for each locus can also be thought of as variables, however, due to the large number of loci and the restricted values that each genotype can take, it would be very inefficient to store them as individual standard columns.
In tidypopgen
, we represent data as a gen_tbl
, a subclass of tibble
which has
two compulsory columns: id
of the individual (as a character
, which must
be unique for each individual), and genotypes
(stored in a compressed format
as a File-Backed Matrix. The genotypes
column stores a vector of the row indices of the matrix
for each individual and, when printed, shows the first two genotypes for the individual).
The real data reside on disk, and an attribute bigsnp
of
the genotype
column contains all the information to access it. There is also an
additional attribute , loci
which provides all the information about the loci,
including the column indices that represent each locus in the FBM.
The vector of row indices and the table of loci can be subsetted and reordered without
changing the data on disk; thus, any operation on the gen_tibble
is fast as it
shapes the indices of the genotype matrix rather than the matrix itself.
The loci
tibble includes columns big_index
for the index in the FBM,
name
for the locus name (a character
, which must be unique), chromosome
for the chromosome (an integer
, if known,
otherwise set to NA
), position
for the position on the chromosome (an integer
, if known,
otherwise set to NA
), genetic_dist
for the genetic distance on the chromosome
(numeric
, if known, else set to 0), allele_ref
for the the reference allele (a character
),
and allele_alt
for the alternate allele (a character
, which can be 0
for
monomorphic loci, following the same convention as plink). Additional individual metadata can be stored as columns in
the main gen_tbl
, whilst additional loci information (such as the position in
centimorgans) can be added as columns in the loci
attribute table.
In principle, it is possible to use use multiple ways to compress the genotypes.
tidypopgen
currently uses a bigSNP
object from the package bigsnpr
. It is very
fast and well documented, but it is mostly geared towards diploid data. tidypopgen
expands that object to deal with different levels of ploidy, including multiple ploidy
within a single dataset; however, most functions are currently incompatible with
ploidy levels other than 2 (but they will return a clear error message and avoid
computing anything incorrectly).
gen_tibble
Given information about the individuals, their genotypes, and the loci:
library(tidypopgen) example_indiv_meta <- data.frame( id = c("a", "b", "c", "d", "e"), population = c("pop1", "pop1", "pop2", "pop2", "pop2") ) example_genotypes <- rbind( c(1, 1, 0, 1, 1, 0), c(2, 0, 0, 0, NA, 0), c(1, 2, 0, 0, 1, 1), c(0, 2, 0, 1, 2, 1), c(1, 1, NA, 2, 1, 0) ) example_loci <- data.frame( name = c("rs1", "rs2", "rs3", "rs4", "x1", "x2"), chromosome = c(1, 1, 1, 1, 2, 2), position = c(3, 5, 65, 343, 23, 456), genetic_dist = c(0, 0, 0, 0, 0, 0), allele_ref = c("A", "T", "C", "G", "C", "T"), allele_alt = c("T", "C", NA, "C", "G", "A") )
We can create a simple gen_tibble
object (of class gen_tbl
) with:
example_gt <- gen_tibble(example_genotypes, indiv_meta = example_indiv_meta, loci = example_loci, backingfile = tempfile() )
We are provided information on where the three files underlying the genotype information
are stored. As we don't want to keep the files, we used the tmp directory; normally
you will want to use your working directory so that the files will not be cleared
by R at the end of the session. It is important that these
files are not deleted or moved, as gen_tibble
stores their paths for future use.
Now let's have a look at our gen_tibble
:
example_gt
As discussed above, when this tibble is called, the genotypes
column prints
the first two genotypes of the individual. The genotypes
column itself
contains the indices of the individuals in the FBM as values, and the FBM as an attribute.
To retrieve the rest of the genotypes (which are compressed in the FBM), we use:
example_gt %>% show_genotypes()
If we want to extract the information about the loci for which we have genotypes (which are stored as an attribute of that column), we say:
example_gt %>% show_loci()
Note that, if we are passing a gen_tibble
to a function that works on
genotypes, it is generally not necessary to pass the column genotypes
in the call:
example_gt %>% indiv_het_obs()
However, if such a function is used within a dplyr
verb such as mutate
,
we need to pass the genotype
column to the function:
example_gt %>% mutate(het_obs = indiv_het_obs(.data$genotypes))
Or, more simply:
example_gt %>% mutate(het_obs = indiv_het_obs(genotypes))
dplyr
verbs to manipulate the tibbleThe individual metadata can then be processed with the usual tidyverse
grammar.
So, we can filter individuals by population with
example_pop2 <- example_gt %>% filter(population == "pop2") example_pop2
There are a number of functions that compute population genetics quantities for each individual, such as individual observed heterozygosity. We can compute them simply with:
example_gt %>% indiv_het_obs()
Or after filtering:
example_gt %>% filter(population == "pop2") %>% indiv_het_obs()
We can use mutate
to add observed heterozygosity as a column to our
gen_tibble
(again, note that functions that work on genotypes don't need
to be passed any arguments if the tibble is passed directly to them, but
the column genotypes
has to be provided when they are used within
dplyr
verbs such as mutate
):
example_gt %>% mutate(het_obs = indiv_het_obs(genotypes))
There are a number of functions that estimate quantities at the individual
level, and they are prefixed by indiv_
.
Since the genotypes of the loci are stored as a compressed list in one column,
it is not possible to use standard dplyr
verbs on them. However, tidypopgen
provides
a number of specialised verbs, postfixed by _loci
, to manipulate loci.
A key operation on loci is their selection (and removal). The compressed nature
of genotypes imposes some constraints on the possible grammar.
For selection, there are two verbs: select_loci
and select_loci_if
.
select_loci
understands the concise-minilanguage spoken by standard dplyr::select
that allows to easily refer to variables by their names. However, select_loci
criteria can not be based on the actual genotypes (e.g. on heterozygosity or
missingness). For that, we have to use select_loci_if
, which can operate on the
genotypes but is blind to the names of loci.
Let us start by looking at the loci names in our simple dataset:
loci_names(example_gt)
We can see that there are two categories of loci, one starting with "rs" and the other with "x". If we wanted to select only loci that have an "rs" code, we would use:
example_sub <- example_gt %>% select_loci(starts_with("rs")) example_sub
This gives us a gen_tibble
with only 4 loci, as expected. We can confirm that
we have the correct loci with:
loci_names(example_sub)
Let us check that this has indeed impacted the individual heterozygosity
example_sub %>% indiv_het_obs()
We can also subset and reorder by passing indices:
example_gt %>% select_loci(c(2, 6, 1)) %>% show_loci()
This operation could be helpful when merging datasets that do not fully overlap on their loci (more on that later).
example_gt %>% select_loci(c(2, 6, 1)) %>% show_genotypes()
The limit of select_loci
is that it can not directly summarise the genotypes.
We can do that separately and then feed the result as a set of indices. For
example, we might want to impose a minimum minor allele frequency.
loci_maf()
allows us to inspect the minimum allele frequencies in a gen_tibble
:
example_gt %>% loci_maf()
We can now create a vector of indices of loci with a minimum allele frequency (MAF) larger than 0.2, and use it to select:
sel_indices <- which((example_gt %>% loci_maf()) > 0.2) example_gt %>% select_loci(all_of(sel_indices)) %>% show_loci()
Note that passing a variable directly to select
is deprecated, and so we have
to use all_of
to wrap it.
select_loci_if
allows us to avoid creating a temporary variable to store indices:
example_gt_sub <- example_gt %>% select_loci_if(loci_maf(genotypes) > 0.2) example_gt_sub %>% show_genotypes()
Note that, as we need to tidy evaluate loci_maf
within the select_loci_if
verb,
we need to provide it with the column that we want to use (even though it has
to be genotypes
). Also note that, with select_loci_if
, we cannot reorder
the loci.
select_loci_if
is very flexible; for example, we could filter loci with a MAF
greater than 0.2 that are also on chromosome 2.
We can use a similar approach to select only alleles on a given chromosome:
example_gt %>% select_loci_if( loci_chromosomes(genotypes) == 2 & loci_maf(genotypes) > 0.2 ) %>% show_loci()
Incidentally, loci_maf()
is one of several functions that compute quantities
by locus; they can be identified as they start with loci_
.
In population genetics, we are generally interested in computing quantities that describe groups of individuals (i.e. populations). Grouping can be used in a number of ways.
As a starting point, we can group by population and get pop sizes:
example_gt %>% group_by(population) %>% tally()
For functions that return one result per individual (such as
indiv_het_obs
that we used before), we can use summarise
, which returns a
new tibble
with one line per population. For example, we can count the number of individuals
per population, as well as their mean heterozygosity with :
example_gt %>% group_by(population) %>% summarise(n = n(), mean_het = mean(indiv_het_obs(genotypes)))
However, note that this is somewhat inefficient, as computing the pop averages requires multiple access to the data. A more efficient approach is to:
example_gt %>% mutate(het_obs = indiv_het_obs(genotypes)) %>% group_by(population) %>% summarise(n = n(), mean_het = mean(het_obs))
In this way, we compute all individual heterozygosities in one go (optimising our file access time), and then generate the population summaries.
When working with loci, you can use reframe
to apply functions
across groups. To use reframe
correctly,
be sure to select the genotype
column within the loci verbs:
example_gt %>% group_by(population) %>% reframe(loci_hwe = loci_hwe(genotypes))
Some functions, such as loci_maf()
, also have a method for grouped tibbles that
allows an even easier syntax:
example_gt %>% group_by(population) %>% loci_maf()
These methods are coded in C++ to be very efficient, and they should be
preferred to the reframe()
option. By default, their outputs will return a
tidy tibble, with one row per locus per population (i.e. the same as we
would get with reframe()
). However, using the type
argument, it is also possible to return a list:
example_gt %>% group_by(population) %>% loci_maf(type = "list")
Or a matrix:
example_gt %>% group_by(population) %>% loci_maf(type = "matrix")
Certain metrics and analyses are naturally defined by a grouped tibble as they refer to populations, such as population-specific Fst. For example:
example_gt %>% group_by(population) %>% pop_fst()
These type of functions are prefixed with pop
.
The final group of verbs are prefixed by pairwise_
and nwise_
, and they are
designed to compute pairwise statistics between all pairs or combinations of n individuals or populations
(when targeting populations, we start with pairwise_pop_
and nwise_pop_
). For example, to get the Identity by State
for all pairs of individuals, we use:
example_gt %>% pairwise_ibs()
We can save a gen_tibble
with gt_save()
. This command will save a file with
extension .gt
. Together with the .rds
and .bk
files, the .gt
file include
all the information stored in the gen_tibble
. Note that, whilst the .rds
and .bk
file have to share name, the .gt
file can be named differently (but, by default, if
no specific name is given, gt_save
will use the same pattern as for the .rds
and .bk
file).
So, let us save our file:
gt_file_name <- gt_save(example_gt) gt_file_name
And if we ever need to retrieve the location of the .bk
and .rds
files for a
gen_tibble, we can use:
gt_get_file_names(example_gt)
In a later session, we could reload the data with:
new_example_gt <- gt_load(gt_file_name[1]) new_example_gt %>% show_genotypes()
We can see that our genotypes were recovered correctly.
As we saw at the beginning of this vignette, it is possible to create a gen_tibble
with data
in data.frames and tibbles. We can use that function to wrangle small datasets
in custom formats, but more commonly SNP data are stored as PLINK bed files or VCF files.
gen_tibble
can directly read both types of files (including gzipped vcf files), we just need to provide the
path to file as the first argument of gen_tibble
; for example, if we want to
read a PLINK bed file, we can simply use:
bed_path_pop_a <- system.file("extdata/pop_a.bed", package = "tidypopgen") pop_a_gt <- gen_tibble(bed_path_pop_a, backingfile = tempfile("pop_a_"))
For this vignette, we don't want to keep files, so we are using again a temporary
path for the backing files, but in normal instances, we can simply
omit the backingfile
parameter, and the .rds
and .bk
file will be saved with the same name and
path as the original .bed
file.
We can also export data into various formats with the family of functions gt_as_*()
.
Some functions, such as gt_as_hierfstat()
, gt_as_genind()
or gt_as_genlight()
return
an object of the appropriate type; other functions, such as gt_as_plink()
or gt_as_geno_lea()
write a file in the appropriate format, and return the name of that file on completion.
For example, to export to a PLINK .bed file, we simply use:
gt_as_plink(example_gt, file = tempfile("new_bed_"))
This will also write a .bim and .fam file and save them together with the .bed file.
Note that, from the main tibble, only id
, population
and sex
will be preserved in the .fam file.
It is also possible to write .ped and .raw files by specifying type="ped"
or
type="raw"
in gt_as_plink()
(see the help page for gt_as_plink()
for details).
Merging data from different sources is a common problem, especially in human population
genetics where there is a wealth of SNP chips available. In tidypopgen
, merging is
enacted with an rbind
operation between gen_tibbles
. If the datasets have the same
loci, then the merge is trivial. If not, then it is necessary to subset to the same
loci, and ensure that the data are coded with the same reference and alternate
alleles (or swap if needed). Additionally, if data come from SNP chips, there is
the added complication that the strand is not always consistent, so it might also
be necessary to flip strand (in that case, ambiguous SNPs have to be filtered).
The rbind
method for gen_tibbles
has a number of parameters that allow us to
control the behaviour of the merge.
Let us start by bringing in two sample datasets (note that we use tempfiles to store the data; in real applications, we will usually avoid defining a backingfile and let the function create backing files where the bed file is stored):
bed_path_pop_a <- system.file("extdata/pop_a.bed", package = "tidypopgen") bigsnp_path_a <- bigsnpr::snp_readBed(bed_path_pop_a, backingfile = tempfile("pop_a_")) pop_a_gt <- gen_tibble(bigsnp_path_a) bed_path_pop_b <- system.file("extdata/pop_b.bed", package = "tidypopgen") bigsnp_path_b <- bigsnpr::snp_readBed(bed_path_pop_b, backingfile = tempfile("pop_b_")) pop_b_gt <- gen_tibble(bigsnp_path_b)
And inspect them:
pop_a_gt
And the other one:
pop_b_gt
Here we are using very small datasets, but in real life, rbind
operations
are very demanding. Before performing such an operation, we can run
rbind_dry_run
:
report <- rbind_dry_run(pop_a_gt, pop_b_gt, flip_strand = TRUE)
Note that, by default, rbind
will NOT flip strand or remove ambiguous SNPs
(as they are only relevant when merging different SNP chips),
you need to set flip_strand
to TRUE.
The report object contains details about why each locus was either kept
or removed, but usually the report is sufficient to make decisions on whether
we want to go ahead. If we are happy with the likely outcome, we can
proceed with the rbind
. Note that the data will be saved to disk.
We can either provide a path and prefix, to which '.RDS' and '.bk' will be
appended for the bigSNP
file and its backing file; or let the function
save the files in the same path as the original backing file of the first object).
NOTE: In this vignette, we save to the temporary directory, but in real life you want to save in a directory where you will be able to retrieve the file at a later date!!!
# #create merge merged_gt <- rbind(pop_a_gt, pop_b_gt, flip_strand = TRUE, backingfile = file.path(tempdir(), "gt_merged") )
Let's check the resulting gen_tibble
:
merged_gt
We can look at the subsetted loci (note that we used the first population as reference to determine the strand and order of alleles):
merged_gt %>% show_loci()
Note that the big_index
values have changed compared to the original files,
as we generated a new FBM with the merged data.
By default, rbind
using the loci names to match the two datasets. The option
'use_position = TRUE' forces matching by chromosome and position; note that a common
source of problems when merging is the coding of chromosomes (e.g. a dataset has
'chromosome1' and the other 'chr1'). If using 'use_position = TRUE', you should
also make sure that the two datasets were aligned to the same reference genome assembly.
Many genetic analysis (e.g. PCA) do not allow missing data. In many software implementations,
missing genotypes are imputed on the fly, often in a simplistic manner. In tidypopgen
, we
encourage taking this step explicitly, before running any analysis.
Let us start with a dataset that has some missing genotypes:
bed_file <- system.file("extdata", "example-missing.bed", package = "bigsnpr") missing_gt <- gen_tibble(bed_file, backingfile = tempfile("missing_")) missing_gt
We can visualise the amount of missingness with:
missing_gt %>% loci_missingness() %>% hist()
If we attempt to run a PCA on this dataset, we get:
missing_pca <- missing_gt %>% gt_pca_autoSVD()
Note that using gt_pca_autoSVD with a small dataset will likely cause an error, see manual page for details.
It is possible to obviate to this problem by filtering loci with missing data,
but that might lose a lot of loci. The alternative is to impute the missing
the data. tidypopgen
provides a wrapper for fast imputation available in
bigsnpr
, a simple imputation (gt_impute_simple
) based on the frequency of the alleles
at each locus (by random sampling, or using the mean or mode).
These methods are fine to impute a few missing genotypes, but they should not be used for any
sophisticated imputation (e.g. of low coverage genomes).
We use the simple approach to fix our dataset:
missing_gt <- gt_impute_simple(missing_gt, method = "mode")
We can now check that our dataset has indeed been imputed:
gt_has_imputed(missing_gt)
However, note that a gen_tibble
stores both the raw data and the
imputed data. Even after imputation, the imputed data are not used by default:
gt_uses_imputed(missing_gt)
And indeed, if we summarise missingness, we still get:
missing_gt %>% loci_missingness() %>% hist()
We can manually force a gen_tibble
to use the imputed data:
gt_set_imputed(missing_gt, set = TRUE) missing_gt %>% loci_missingness() %>% hist()
However, this is generally not needed, we can keep our gen_tibble
set to use
the raw data:
gt_set_imputed(missing_gt, set = FALSE) missing_gt %>% loci_missingness() %>% hist()
And let functions that need imputation use it automatically:
missing_pca <- missing_gt %>% gt_pca_partialSVD() missing_pca
Note that, when the function is finished, the gen_tibble
is back to using
the raw genotypes:
gt_uses_imputed(missing_gt) missing_gt %>% loci_missingness() %>% hist()
More details about PCA and other analysis is found in the vignette on population genetic analysis.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.