Convert to tidy data frames | R Documentation |
Convert the information in a vcfR object to a long-format data frame suitable for analysis or use with Hadley Wickham's packages, dplyr, tidyr, and ggplot2. These packages have been optimized for operation on large data frames, and, though they can bog down with very large data sets, they provide a good framework for handling and filtering large variant data sets. For some background on the benefits of such "tidy" data frames, see doi:10.18637/jss.v059.i10.
For some filtering operations, such as those where one wants to filter genotypes upon GT fields in combination with INFO fields, or more complex operations in which one wants to filter loci based upon the number of individuals having greater than a certain quality score, it will be advantageous to put all the information into a long format data frame and use dplyr to perform the operations. Additionally, a long data format is required for using ggplot2. These functions convert vcfR objects to long format data frames.
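As a rough illustration of the locus filter described above, the sketch below counts, for each locus, the individuals whose genotype quality exceeds a threshold and keeps only loci with enough such individuals. It assumes the vcfR_test data set that ships with vcfR (which has a GQ FORMAT field) and relies on the fix and gt components and the ChromKey/POS key documented on this page. The thresholds min_gq and min_indiv are hypothetical values chosen only for the example.

```r
library(vcfR)
library(dplyr)

data("vcfR_test")

# Extract only the GT and GQ FORMAT fields; INFO fields are all kept.
Z <- vcfR2tidy(vcfR_test, format_fields = c("GT", "GQ"))

# Hypothetical thresholds, for illustration only.
min_gq    <- 30
min_indiv <- 2

# Count, per locus, the individuals whose GQ exceeds the threshold,
# then keep loci with at least min_indiv such individuals.
good_loci <- Z$gt %>%
  mutate(gq = as.numeric(gt_GQ)) %>%   # defensive coercion
  group_by(ChromKey, POS) %>%
  summarise(n_good = sum(gq > min_gq, na.rm = TRUE), .groups = "drop") %>%
  filter(n_good >= min_indiv)

# Keep the fixed/INFO rows of those loci, using ChromKey and POS as the key.
fix_kept <- semi_join(Z$fix, good_loci, by = c("ChromKey", "POS"))
fix_kept
```

The same pattern should extend to filters that combine INFO columns in Z$fix with genotype columns in Z$gt, since both share the ChromKey and POS columns.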
vcfR2tidy(
  x,
  info_only = FALSE,
  single_frame = FALSE,
  toss_INFO_column = TRUE,
  ...
)
extract_info_tidy(x, info_fields = NULL, info_types = TRUE, info_sep = ";")
extract_gt_tidy(
  x,
  format_fields = NULL,
  format_types = TRUE,
  dot_is_NA = TRUE,
  alleles = TRUE,
  allele.sep = "/",
  gt_column_prepend = "gt_",
  verbose = TRUE
)
vcf_field_names(x, tag = "INFO")
x |
an object of class vcfR |
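info_only |
if TRUE return a list with only the fix and meta components; the FORMAT fields are not parsed. |
single_frame |
if TRUE return a single tidy data frame in list component dat, rather than separate fix and gt components. |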
toss_INFO_column |
if TRUE (the default) the INFO column will be removed from output as its constituent parts will have been parsed into separate columns. |
... |
more options to pass to extract_info_tidy and extract_gt_tidy |
info_fields |
names of the fields to be extracted from the INFO column into a long format data frame. If this is left as NULL (the default) then the function returns a column for every INFO field listed in the metadata. |
info_types |
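named vector of "i" or "n" if you want the fields extracted from the INFO column to be converted to integer or numeric types, respectively.
When set to TRUE an attempt to determine their type will be made from the meta information.
When set to NULL they will be characters.
The names have to be the exact names of the fields.
For example, info_types = c(AN = "i", MQ = "n"), as used in the examples on this page. |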
info_sep |
the delimiter used in the data portion of the INFO fields to separate different entries. By default it is ";", but earlier versions of the VCF standard apparently used ":" as a delimiter. |
format_fields |
names of the fields in the FORMAT column to be extracted from each individual in the vcfR object into a long format data frame. If left as NULL, the function will extract all the FORMAT columns that were documented in the meta section of the VCF file. |
format_types |
named vector of "i" or "n" if you want the fields extracted according to the FORMAT column to be converted to integer or numeric types, respectively.
When set to TRUE an attempt to determine their type will be made from the meta information.
When set to NULL they will be characters.
The names have to be the exact names of the format_fields.
Works equivalently to the info_types argument. |
dot_is_NA |
if TRUE then a single "." in a character field will be set to NA. If FALSE no conversion is done. Note that "." in a numeric or integer field (according to format_types) with Number == 1 is always going to be set to NA. |
alleles |
if TRUE (the default) then this will return a column, gt_GT_alleles in the examples on this page. |
allele.sep |
character which delimits the alleles in a genotype (/ or |) to be passed to
|
gt_column_prepend |
string to prepend to the names of the FORMAT columns in the output so that they do not conflict with any INFO columns in the output. Default is "gt_". Should be a valid R name (i.e. don't start with a number, have a space in it, etc.). |
verbose |
logical to specify if verbose output should be produced |
tag |
name of the lines in the metadata section of the VCF file to parse out. Default is "INFO". The only other one currently tested and supported is "FORMAT". |
The function vcfR2tidy is the main function in this series. It takes a vcfR object and converts the information to a list of long-format data frames. The user can specify whether only the INFO or both the INFO and the FORMAT columns should be extracted, and also which INFO and FORMAT fields to extract. If no specific INFO or FORMAT fields are asked for, then they will all be returned. If single_frame == FALSE and info_only == FALSE (the default), the function returns a list with three components: fix, gt, and meta as follows:
fix
A data frame of the fixed information columns and the parsed INFO columns, and an additional column, ChromKey (an integer identifier for each locus, ordered by their appearance in the original data frame), that serves together with POS as a key back to rows in gt.
gt
A data frame of the genotype-related fields. Column names are the names of the FORMAT fields with gt_column_prepend (by default, "gt_") prepended to them. Additionally there are columns ChromKey and POS that can be used to associate each row in gt with a row in fix.
meta
The meta-data associated with the columns that were extracted from the INFO and FORMAT columns in a tbl_df-ed data frame.
This is the default return object because it might be space-inefficient to return a single tidy data frame if there are many individuals and the CHROM names are long and/or there are many INFO fields. However, if single_frame = TRUE, then the results are returned as a list with component meta as before, but rather than having fix and gt as before, both those data frames have been joined into component dat and a ChromKey column is not returned, because the CHROM column is available.
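Because the joined dat component is already in long format, it can be handed straight to ggplot2. The sketch below assumes the vcfR_test data set (which has a DP FORMAT field) and plots per-individual read depth at each position. Note that the FORMAT field appears as gt_DP, which keeps it distinct from the INFO field DP. The choice of plot is only an illustration.

```r
library(vcfR)
library(ggplot2)

data("vcfR_test")

# One joined data frame (component dat) plus the metadata (component meta).
tidy_vcf <- vcfR2tidy(vcfR_test,
                      single_frame = TRUE,
                      format_fields = c("GT", "DP"))
dat <- tidy_vcf$dat

# Read depth of each individual at each position.
# as.numeric() is a defensive coercion in case gt_DP is not numeric.
p <- ggplot(dat, aes(x = factor(POS), y = as.numeric(gt_DP), fill = Indiv)) +
  geom_col(position = "dodge") +
  labs(x = "POS", y = "gt_DP")
print(p)
```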
If info_only == TRUE, then just the fixed columns and the parsed INFO columns are returned, and the FORMAT fields are not parsed at all. The return value is a list with components fix and meta. No column ChromKey appears.
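A minimal sketch of the info_only case, assuming the vcfR_test data set; it simply checks which components come back and looks at the parsed INFO columns.

```r
library(vcfR)

data("vcfR_test")

# Parse only the fixed and INFO columns; FORMAT fields are skipped.
info <- vcfR2tidy(vcfR_test, info_only = TRUE)

# Component names of the returned list (fix and meta, per the text above).
names(info)

# Fixed columns followed by one column per parsed INFO field.
head(info$fix)
```

This can be a lighter-weight option when only locus-level annotations are needed, since no per-individual rows are created.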
The following functions are called by vcfR2tidy but are documented below because they may be useful individually.
The function extract_info_tidy lets you pass in a vector of the INFO fields that you want extracted to a long format data frame. If you don't tell it which fields to extract it will extract all the INFO columns detailed in the VCF meta section. The function returns a tbl_df data frame of the INFO fields along with an additional integer column Key that associates each row in the output data frame with each row (i.e. each CHROM-POS combination) in the original vcfR object x.
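The sketch below shows one way the Key column might be used to attach CHROM and POS to the extracted INFO fields. It assumes the vcfR_test data set (whose INFO fields include NS, DP, and AF), assumes that Key is the row index into the original object, and uses vcfR's getFIX() accessor to obtain the fixed columns as a matrix.

```r
library(vcfR)

data("vcfR_test")

# Extract three INFO fields; the result carries an integer Key column.
info <- extract_info_tidy(vcfR_test, info_fields = c("NS", "DP", "AF"))

# Fixed columns (CHROM, POS, ...) of the original object, as a matrix.
fixed <- getFIX(vcfR_test)

# Assumption: Key indexes rows of the original object, so it can be
# used to look up CHROM and POS for each row of the INFO table.
annotated <- cbind(
  as.data.frame(fixed[info$Key, c("CHROM", "POS"), drop = FALSE]),
  info
)
annotated
```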
The function extract_gt_tidy lets you pass in a vector of the FORMAT fields that you want extracted to a long format data frame. If you don't tell it which fields to extract it will extract all the FORMAT columns detailed in the VCF meta section. The function returns a tbl_df data frame of the FORMAT fields with an additional integer column Key that associates each row in the output data frame with each row (i.e. each CHROM-POS combination) in the original vcfR object x, and an additional column Indiv that gives the name of the individual.
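As a small usage sketch, assuming the vcfR_test data set (which has a DP FORMAT field), the output of extract_gt_tidy can be summarised per individual with dplyr. The summary column name mean_dp is hypothetical.

```r
library(vcfR)
library(dplyr)

data("vcfR_test")

# One row per locus-by-individual combination, with Key and Indiv columns.
gt <- extract_gt_tidy(vcfR_test,
                      format_fields = c("GT", "DP"),
                      verbose = FALSE)

# Mean read depth per individual; as.numeric() is a defensive coercion.
dp_by_indiv <- gt %>%
  group_by(Indiv) %>%
  summarise(mean_dp = mean(as.numeric(gt_DP), na.rm = TRUE))
dp_by_indiv
```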
The function vcf_field_names is a helper function that parses information from the metadata section of the VCF file to return a data frame with the metadata information about either the INFO or FORMAT tags. It returns a tbl_df-ed data frame with column names: "Tag", "ID", "Number", "Type", "Description", "Source", and "Version".
An object of class tbl_df, or a list where every element is of class tbl_df.
To run all the examples, you can issue this:
example("vcfR2tidy")
Eric C. Anderson <eric.anderson@noaa.gov>
# load the data
data("vcfR_test")
vcf <- vcfR_test
# extract all the INFO and FORMAT fields into a list of tidy
# data frames: fix, gt, and meta. Column types are governed by
# info_types and format_types, which are left at their defaults here.
Z <- vcfR2tidy(vcf)
names(Z)
# here is the meta data in a table
Z$meta
# here is the fixed info
Z$fix
# here are the GT fields. Note that ChromKey and POS are keys
# back to Z$fix
Z$gt
# Note that if you wanted to tidy this data set even further
# you could break up the comma-delimited columns easily
# using tidyr::separate
# here we put the data into a single, joined data frame (list component
# dat in the returned list) and the meta data. Let's just pick out a
# few fields:
vcfR2tidy(vcf,
          single_frame = TRUE,
          info_fields = c("AC", "AN", "MQ"),
          format_fields = c("GT", "PL"))
# note that the "gt_GT_alleles" column is always returned when any
# FORMAT fields are extracted.
# Here we extract a single frame with all fields but we automatically change
# types of the columns according to the entries in the metadata.
vcfR2tidy(vcf, single_frame = TRUE, info_types = TRUE, format_types = TRUE)
# for comparison, here info_types and format_types are left at their
# defaults; compare the column types in the dplyr summary with the call above
vcfR2tidy(vcf, single_frame = TRUE)
# Below are some examples with the vcfR2tidy "subfunctions"
# extract the AC, AN, and MQ fields from the INFO column into
# a data frame and convert the AN values into integers and the MQ
# values into numerics.
extract_info_tidy(vcf, info_fields = c("AC", "AN", "MQ"), info_types = c(AN = "i", MQ = "n"))
# extract all fields from the INFO column but leave
# them as character vectors (info_types = NULL, per the argument description)
extract_info_tidy(vcf, info_types = NULL)
# extract all fields from the INFO column and coerce
# types according to metadata info
extract_info_tidy(vcf, info_types = TRUE)
# get the INFO field metadata in a data frame
vcf_field_names(vcf, tag = "INFO")
# get the FORMAT field metadata in a data frame
vcf_field_names(vcf, tag = "FORMAT")
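The examples above mention that comma-delimited columns can be broken up with tidyr::separate. A minimal sketch of that step follows, assuming the vcfR_test data set, whose HQ FORMAT field holds two comma-separated values per genotype. The new column names gt_HQ_1 and gt_HQ_2 are hypothetical.

```r
library(vcfR)
library(tidyr)

data("vcfR_test")

# Extract GT and the two-valued HQ field.
gt <- extract_gt_tidy(vcfR_test,
                      format_fields = c("GT", "HQ"),
                      verbose = FALSE)

# Defensive coercion so that separate() always sees character input.
gt$gt_HQ <- as.character(gt$gt_HQ)

# Split entries such as "51,51" into two columns; missing entries stay NA,
# and fill = "right" pads entries that have fewer than two pieces.
hq <- separate(gt, gt_HQ,
               into = c("gt_HQ_1", "gt_HQ_2"),
               sep = ",", fill = "right")
hq
```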