knitr::opts_chunk$set(collapse = T, comment = "#>") library(rhapsodi)
rhapsodi is an R package whose overall goal is to exploit low-coverage, single-gamete DNA sequencing data as a means of phasing a diploid donor's haplotypes, then imputing the missing gamete genotypes, and finally discovering gamete-specific meiotic recombination events. Because rhapsodi outputs high-quality information for each of these tasks, rhapsodi is applicable to a wide range of downstream tasks including identifying and exploring recombination hotspots; and scanning across gametes to discover selfish alleles that do not follow classical Mendelian inheritance rules.
In general, the input data for rhapsodi is low-coverage single-cell gamete DNA sequencing data, with all gametes originating from a single diploid donor. This data has been recorded such that for every SNP, the genotype (or lack of genotype) is recorded for each sequenced gamete. This data must be donor- and chromosome-specific, such that for every rhapsodi run, while there are many gamete chromosomes under consideration, all gamete chromosomes included in the input data are from a single diploid donor and the same chromosome of genomic SNPs (e.g. only chr1). This data can either be recorded in (A) a tab delimited file, with header which rhapsodi will load, or (B) a user pre-loaded dataframe.
In both cases, the first column must contain the genomic positions of the SNPs for that chromosome. (Note: there is no chromosome column.) Data can be encoded in two ways with genotypes recorded as 0/1 or A/C/G/T nucleotides; further, NA represents no available genotype for a given gamete at a given SNP. If the data is stored such that genotypes are recorded as 0/1, then the rest of the columns are assumed to be the gamete data. However, if the data are recorded as in nucleotide format with A/C/G/T (or some combination of these nucleotides for any biallelic polymorphism), then the second and third columns must specify the reference and alternate alleles respectively; finally, the rest of the following columns are the gamete data.
There are 5 total steps in the rhapsodi pipeline with 3 main algorithmic steps and 2 bookend data steps (reading in and exporting data). The steps are as follows.
These steps can be performed individually by the user or all together with the rhappsodi_autorun
function. Later in this example, we'll walk-through both of these run options. Note that the 3 algorithmic steps must be performed in order as the output of one is used as input for the next step.
To explore the stages of the rhapsodi pipeline, we'll later use a simulated dataset that was generated to reflect Sperm-seq data from a single donor with the following profile: * 50 gametes * 5000 SNPs * a coverage of 0.1 * an average recombination rate of 1 * a sequencing error rate of 0.005
The data has been both 0/1/NA encoded and A/C/G/T/NA encoded. We'll walk through the steps of rhapsodi with both of these cases.
The first of the two data bookend steps is reading in the sparse gamete DNA sequencing data. The function read_data
is used to read in and prepare data for rhapsodi. The preparation steps involve separating the input dataframe into individual vector(s) for the genomic positions (and the reference and alternate alleles if applicable) and a dataframe of just the gamete genotypes. Further, all vectors and the dataframe are subset to just heterozygous SNPs (hetSNPs), or SNPs for which there is at least one reference observation and one alternate observation. If no hetSNPs remain after filtering, rhapsodi will exit with the following message
no hetSNPs so rhapsodi is exiting
Such an event as no hetSNPs remaining is unlikely unless a dataset is extremely low coverage (<=0.001 (x) in simulations) or there are very few gametes (<= 3 in simulations). However, it is very possible that the number of input SNPs will differ from the number of SNPs that are retained following filtering; yet, generally as the number of gametes or the coverage increases, the number of filtered SNPs decreases.
The main arguments for this function are
input_file
: the path to and name of the tab-delimited file, with header containing the sparse gamete genotypes. Pass NULL
if instead using a user pre-loaded dataframe.use_dt
: A boolean, default is FALSE. Pass TRUE if you want to input a user pre-loaded dataframe.input_dt
: If use_dt
is TRUE, pass the name of the user pre-loaded data frame. acgt
: A boolean, default is FALSE. When FALSE, data is assumed to be 0/1/NA encoded with genomic position and gamete genotypes columns. Pass TRUE if data is A/C/G/T/NA encoded with positions, ref, and alt columns before the gamete genotypes. The output from this function is a named list with 4 elements.
positions
: a vector of the genomic positions of the heterozygous SNPsdt
: a dataframe of the heterozygous SNPs only, 0/1/NA encoded. Number of rows is equal to number of hetSNPs. Number of columns is equal to number of gametes.ref
: a vector of the nucleotide(s) for each heterozygous SNP reference allelealt
: a vector of the nucleotide(s) for each heterozygous SNP alternate alleleFor example, if wanting to use a tab-delimited file (with a header), which was 0/1/NA encoded, with path and name "rhapsodi_input/donor1_chr2_01.txt"
the function would be called as
read_data_out <- rhapsodi::read_data("rhapsodi_input/donor1_chr2_01.txt")
However, if the file was pre-loaded into a dataframe, rhapsodi_dt
, the function would be called as
read_data_out <- rhapsodi::read_data(NULL, use_dt = TRUE, input_dt = rhapsodi_dt)
As another example, if wanting to use a tab-delimited file (with a header), which was A/C/G/T/NA encoded with positions, ref, and alt columns in addition to the gamete columns, with path and name "rhapsodi_input/donor1_chr2_acgt.txt"
, the function would be called as
read_data_out <- rhapsodi::read_data("rhapsodi_input/donor1_chr2_acgt.txt", acgt=TRUE)
And if the file was pre-loaded into a dataframe, rhapsodi_dt_acgt
, the function would be called as
read_data_out <- rhapsodi::read_data(NULL, use_dt = TRUE, input_dt = rhapsodi_dt_acgt, acgt = TRUE)
The first of the three main algorithmic steps is phasing the diploid donor haplotypes from which the gametes originated. The function phase_donor_haplotypes
is used to run phasing. First, the SNP positions are split into windows of length window_length
with overlap of window_length
// overlap_denom
. Then, within each of these windows, binary clustering of the gamete data across SNPs and majority voting are utilized to reconstruct hapltoypes within each window. Finally, adjacent windows are stitched together based on the amount of consensus in the genotypes within the overlapping regions to build the two haplotypes of the donor.
The main arguments for this function are
dt
: This should be the dt
within the named list output from read_data
.positions
: This should be the positions
within the named list output from read_data
.window_length
: Default is 3000. This is the size (in SNP indices) of the windows that will be used for clustering.overlap_denom
: Default is 2. This is the denominator used in calculating the amount of overlap between windows, which will be the number of SNP indices which are considered when looking for amount of consensus in genotypes and stitching together haplotypes. Specifically, the overlap is window_length
// overlap_denom
. threads
: Default is 2. The number of threads to utilize for multi-threading.mcstop
: Default is TRUE. If TRUE, rhapsodi exits if confident stitching of haplotypes isn't possible because the consensus between overlapping windows is <0.9 but >0.1. If FALSE, rhapsodi continues and considers to which threshold the consensus is closer.stringent_stitch
: Default is FALSE. If TRUE, the user can set a specified bifurcating threshold value for stitching, stitch_new_min
.stitch_new_min
: If stringent_stitch
is TRUE, this value becomes the bifurcating threshold for stitching such that if the consensus is greater than this value, the windows are merged as the same haplotype. If consensus, is less than this value, the windows are treated as originating from different haplotypes.The output of this function is a data frame with column names index
, pos
, h1
and h2
where index
is the SNP index, pos
is the genomic SNP positions, h1
is haplotype1, and h2
is haplotype2. The number of rows is equal to the number of hetSNPs.
The recommended usage of this function is to pass the output from read_data
and keep the defaults for the rest of the arguments.
complete_haplotypes <- rhapsodi::phase_donor_haplotypes(read_data_out$dt, read_data_out$positions)
However, if rhapsodi exits with a message similar to the following, then the recommended usage is to additionally pass the argument mcstop = FALSE
.
Haplotypes within overlapping windows are too discordant to merge with a mean concordance of [some number < 0.1 but > 0.9]. rhapsodi is exiting.
By changing mcstop
to FALSE, rhapsodi will continue stitching the overlapping haplotype windows together, considering whether the concordance between two overlapping windows is closer to 0.1 or 0.9.
complete_haplotypes <- rhapsodi::phase_donor_haplotypes(read_data_out$dt, read_data_out$positions, mcstop = FALSE)
Note that rhapsodi will still display a message similar to the one below, but it will not exit.
Haplotypes within overlapping windows are too discordant for confident merging with a mean concordance of [some number < 0.1 but > 0.9], but continuing and [merging as the same haplotype | stitching windows as opposite haplotypes].
Alternatively, the user can set a different bifurcating threshold value by setting stringent_stitch
to FALSE, and passing the new threshold value to stitch_new_min
, like in the following example. Because stitch_new_min = 0.6
, windows with a concordance greater than 0.6 will be merged as originating from the same haplotype, but windows with a concordance less than 0.6 will be separated, assumed to originate from different haplotypes.
complete_haplotypes <- rhapsodi::phase_donor_haplotypes(read_data_out$dt, read_data_out$positions, stringent_stitch = FALSE, stitch_new_min = 0.6)
Setting mcstop
to FALSE, theoretically should be equivalent to stringent_stitch = FALSE
and setting stitch_new_min
to 0.5.
The second of the three main algorithmic steps is imputing the missing gamete genotypes. The function impute_gamete_genoyptes
is used to drive imputation. The function first builds a Hidden Markov Model (HMM) with emission probabilities controlled by the expected sequencing error rate and the transition probabilities controlled by the expected average recombination rate (or the number of expected meiotic recombination events per chromosome per gamete divided by the number of hetSNPs). Then the HMM is applied to the sparse gamete data, tracing the most likely path along the phased donor haplotypes for each gamete. Internals NAs are filled if bordering the same haplotypes. By default, edge NAs are assigned haplotypes matching the first non-NA SNP on the chromosome's end. Alternatively, these NAs at the terminal edges of chromosomes can be left as NAs rather than being filled/assigned genotypes. Finally, by default, if original sequencing observations disagree with imputed genotypes, the imputation is overwritten by the original sequencing read in a process termed unsmoothing. All of these steps together recover the dense gamete genotype matrix.
The main arguments for this function are
original_gamete_data
: The dt
output from the function read_data
.complete_haplotypes
: The output from the function phase_donor_haplotypes
.positions
: The positions
output from the function read_data
.sequencing_error
: Default is 0.005. The expected sequencing error rate which controls emission probabilities. avg_recomb
: Default is 1. The expected average number of recombination events per chromosome per gamete, which controls transition probabilities. smooth_imputed_genotypes
: A boolean, default is FALSE. If TRUE, unsmoothing process is not applied. If FALSE, unsmoothing is applied such that original sequencing observations are used in place of imputed genotypes if the two disagree.fill_ends
: A boolean, default is TRUE, If TRUE, fills the NAs at the terminal edges of chromosomes with the first/last known or imputed SNP for the beginning/ending of chromosomes respectively; if FALSE, leaves these genotypes as NA in the final imputed genotypes. Default is TRUE. threads
: Default is 2. The number of threads to utilize for multi-threading.The output of this function is a named list with 4 elements
filled_gametes
: Dataframe with imputed donor genotypes/recovered dense matrix where genotypes are encoded 0/1/NA. Number of rows equal to number of hetSNPs; number of columns equal to number of gametes.filled_gametes_haps
: Dataframe with imputed donor genotypes/recovered dense matrix where genotypes are encoded as "h1"/"h2"/NA, specifying specifically which donor haplotype that genotype originates from for each SNP and gamete. Number of rows equal to number of hetSNPs; number of columns equal to number of gametes.unsmoothed_gametes
: Unsmoothed version of filled_gametes
if smooth_imputed_genotypes
is FALSE. NULL if smooth_imputed_genotypes
is TRUE.unsmoothed_gametes_haps
: Unsmoothed version of filled_gametes_haps
if smooth_imputed_genotypes
is FALSE. NULL if smooth_imputed_genotypes
is TRUE.The recommended usage to call this function is to pass it the output from both read_data
and phase_donor_haployptes
. Doing this, keeping the default sequencing error rate and average number of recombination breakpoints per chromosome per gamete, is accomplished by the following example.
filled_gametes_list <- rhapsodi::impute_gamte_genotypes(read_data_out$dt, complete_haplotypes, read_data_out$positions)
If you expect a non-default sequencing error rate, for example an error rate of 0.01, you should specify the expected sequencing error rate which the HMM will utilize for emission probabilities. To do this, the function should be called in the following manner:
filled_gametes_list <- rhapsodi::impute_gamete_genotypes(read_data_out$dt, complete_haplotypes, read_data_out$positions, sequencing_error = 0.01)
Similarly, if you expect a non-default average number of recombination events per chromosome per gamete, for example 3 crossovers, you should specify the expected number of recombination events which the HMM will utilize for transition probabilities. To do this, the function should be called in the following way:
filled_gametes_list <- rhapsodi::impute_gamete_genotypes(read_data_out$dt, complete_haplotypes, read_data_out$positions, avg_recomb = 3)
The third and final of the three main algorithmic steps is discovering gamete-specific meiotic recombination events from the imputed gamete genotypes. The function discover_meiotic_recombination
is used for the discovery process.
The main arguments for this function are
original_gamete_data
: The dt
output from the function read_data
.complete_haplotypes
: The output from the function phase_donor_haplotypes
.filled_gamete_data_list
: The output from the function impute_gamete_genotypes
.positions
: The positions
output from the function read_data
.smooth_crossovers
: A boolean, default is TRUE. If TRUE, the filled gamete data that directly resulted from the HMM and NA filling is used. If FALSE, the filled gamete data that was unsmooothed is used.smooth_imputed_genotypes
: This boolean should be the same as the one that is passed to impute_gamete_genotypes
.sampleName
: a string, default is "sampleT". The sample name will be reported in the output from this function.chrom
: a string, default is "chrT". The chromosome will be reported in the output from this function.threads
: Default is 2. The number of threads to utilize for multi-threading.The output of this function is a dataframe with three columns that specifies the predicted recombination breakpoints for each gamete, using genomic coordinates. The first column is the ident
column which specifies the sampleName
, the chrom
, and the gamete where each recombination breakpoint was observed. These three pieces of information are concatenated together, separated by an underscore. The second column is the Genomic_start
column which specifies the genomic position of the start of each recombination breakpoint. The third column is the Genomic_end
column which specifies the genomic position of the end of each recombination breakpoint. If no recombination breakpoint is observed within a gamete, an NA is printed in both the Genomic_start
and Genomic_end
columns. Therefore, the number of rows in this dataframe is equal to the number of observed recombination breakpoints + the number of gametes without any observed recombination breakpoints.
The recommended usage to call this function is to pass the output from read_data
, phase_donor_haplotypes
, and impute_gamete_genotypes
combined with the sample name, the chromosome, and whatever the smoooth_imputed_genotypes
argument was when calling the impute_gamete_genotypes
function earlier. Assuming default values were used for impute_gamete_genotypes
, this would be as follows for a sample named "donorA" and its chromosome 1.
``` {r, eval = FALSE} recomb_breaks <- rhapsodi::discover_meiotic_recombination(read_data_out$dt, complete_haplotypes, filled_gametes_list, read_data_out$positions, sampleName = "donorA", chrom = "chr1")
If `smooth_imputed_genotypes` was specified as TRUE when calling `impute_gamete_genotypes`, the function should be called as such ```r recomb_breaks <- rhapsodi::discover_meiotic_recombination(read_data_out$dt, complete_haployptes, filled_gametes_list, read_data_out$positions, sampleName = "donorA", chrom = "chr1", smooth_imputed_genotypes = TRUE)
Another alternative usage of this function allows the user to specify whether recombination discovery is run on the smoothed or the unsmoothed gamete filled data. The default is to run discovery with the smoothed data directly from the HMM and NA filling functions. This is because with the unsmoothed data, there will likely be more false discoveries of recombination. We expect this because the overwriting of mismatches between the imputed data and the original sequencing reads could potentially re-introduce sequencing errors that the HMM corrected. If you wish to run meiotic recombination discovery with the unsmoothed data, smooth_crossovers
should be set to FALSE. However, this is not recommended as it likely will result in more false discoveries. If the impute_gamete_genotypes
function was run with smooth_imputed_genotypes = TRUE
, then the following call should be used.
recomb_breaks <- rhapsodi::discover_meiotic_recombination(read_data_out$dt, complete_haplotypes, filled_gametes_list, read_data_out$positions, sampleName = "donorA", chrom = "chr1", smooth_crossovers = FALSE)
Otherwise, if you wish to run meiotic recombination discovery with the unsmoothed data, and you ran impute_gamete_genotypes
with smooth_imputed_genotypes = FALSE
, then the following call should be used.
recomb_breaks <- rhapsodi::discover_meiotic_recombination(read_data_out$dt, complete_haplotypes, filled_gametes_list, read_data_out$positions, sampleName = "donorA", chrom = "chr1", smooth_crossovers = FALSE, smooth_imputed_genotypes = FALSE)
The last of the two data bookend steps is exporting the output from rhapsodi. The function export_data
is used to ornament and export the rhapsodi output. Specifically, the function will first add index (index
) and genomic position (pos
) columns to the datasets from impute_gamete_genotypes
. Then, the function will add a0
and a1
columns for 0/1/NA encoded data, if the original input data was A/C/G/T/NA encoded. These a0
and a1
columns specify the allele for the 0 genotype and the allele for the 1 genotype respectively. Lastly, this function combines all of the data from the 3 main tasks into a single named list.
The main arguments for this function are
input_data
: the named list returned from read_data
complete_haplotypes
: the output of phase_donor_haplotypes
filled_gametes
: the named list output from impute_gamete_genotypes
recomb_breaks
: the output of discover_meiotic_recombination
acgt
: A boolean, default is FALSE. Argument should match what was used in read_data
. When FALSE, original input data was assumed to be 0/1/NA encoded with genomic position and gamete genotypes columns. When TRUE input data was assumed to be A/C/G/T/NA encoded with positions, ref, and alt columns before the gamete genotypes. If TRUE, output 0/1/NA encoded matrices will include a0 and a1 columns that essentially add back the ref and alt columns. The output of this function is a named list with 6 elements
donor_haps
: the direct output from phase_donor_haplotypes
; a data frame with column names index
, pos
, h1
and h2
where index
is the SNP index, pos
is the genomic SNP positions, h1
is haplotype1, and h2
is haplotype2. The number of rows is equal to the number of hetSNPs.gamete_haps
: an output from impute_gamete_genotypes
with ornaments from export data
; Dataframe with imputed donor genotypes/recovered dense matrix where genotypes are encoded as "h1"/"h2"/NA, specifying specifically which donor haplotype that genotype originates from for each SNP and gamete. Number of rows equal to number of hetSNPs; number of columns equal to number of gametes + 2 (index
and pos
) gamete_genotypes
: an output from impute_gamete_genotypes
with ornaments from export data
; Dataframe with imputed donor genotypes/recovered dense matrix where genotypes are encoded 0/1/NA. Number of rows equal to number of hetSNPs; number of columns equal to number of gametes + 2 if acgt
was FALSE (index
and pos
) or + 4 if acgt
was TRUE (index
, pos
, a0
, a1
) unsmoothed_gamete_haps
: an output from impute_gamete_genotypes
with ornaments from export_data
; Unsmoothed version of gamete_haps
if smooth_imputed_genotypes
is FALSE. NULL if smooth_imputed_genotypes
is TRUE. Number of rows equal to number of hetSNPs; number of columns equal to number of gametes + 2 (index
and pos
) unsmoothed_gamete_genotypes
: an output from impute_gamete_genoyptes
with ornaments from export_data
; Unsmoothed version of gamete_genotypes
if smooth_imputed_genotypes
is FALSE. NULL if smooth_imputed_genotypes
is TRUE. Number of rows equal to number of hetSNPs; number of columns equal to number of gametes + 2 if acgt
was FALSE (index
and pos
) or + 4 if acgt
was TRUE (index
, pos
, a0
, a1
) recomb_breaks
: the direct output from discover_meiotic_recombination
; the output of this function is a dataframe with three columns that specifies the predicted recombination breakpoints for each gamete, using genomic coordinates. Column names Ident
, Genomic_start
, and Genomic_end
, with number of rows equal to the number of observed recombination breakpoints + the number of gametes without any observed recombination breakpoints.The recommended usage to call this function is to pass the output from the read_data
, phase_donor_haplotypes
, impute_gamete_genotypes
, and discover_meiotic_recombination
functions, together with the boolean that was used for the acgt
argument in the read_data
function.
rhapsodi_out <- rhapsodi::export_data(read_data_out, complete_haplotypes, filled_gametes_list, recomb_breaks)
Because the acgt
argument by default is FALSE, acgt=TRUE
is only needed if acgt
was TRUE when calling the read_data
function, as was done earlier with the theoretical inputs "rhapsodi_input/donor1_chr2_acgt.txt"
or rhapsodi_dt_acgt
.
rhapsodi_out <- rhapsodi::export_data(read_data_out, complete_haplotypes, filled_gametes_list, recomb_breaks, acgt=TRUE)
All 5 of the previous steps can be run by calling a single function rhapsodi_autorun
. This autorun function should display a message after each task is completed, updating the user that that step successfully ran.
The input arguments for this function are a combination and subset of the input arguments of all the 5 individual functions. Unlike when the user runs the 5 steps separately, the autorun function handles passing the output of previous steps to the next rhapsodi stage, so the only data the user must provide is the original path to and name of a tab delimited file, with header or user pre-loaded dataframe. The rest of the arguments control the behavior of rhapsodi (e.g. mcstop
, avg_recomb
, smooth_imputed_genotypes
, etc.) and what rhapsodi eventually reports (e.g. sampleName
and chrom
). The only additional argument is verbose
which controls if the user is updated on successful completion of each step.
input_file
: the path to and name of the tab-delimited file, with header containing the sparse gamete genotypes. Pass NULL
if instead using a user pre-loaded dataframeuse_dt
: A boolean, default is FALSE. Pass TRUE if you want to input a user pre-loaded dataframeinput_dt
: If use_dt
is TRUE, pass the name of the user pre-loaded data frame. acgt
: A boolean, default is FALSE. When FALSE, data is assumed to be 0/1/NA encoded with genomic position and gamete genotypes columns. Pass TRUE if data is A/C/G/T/NA encoded with positions, ref, and alt columns before the gamete genotypes.window_length
: Default is 3000. This is the size (in SNP indices) of the windows that will be used for clusteringoverlap_denom
: Default is 2. This is the denominator used in calculating the amount of overlap between windows, which will be the number of SNP indices which are considered when looking for amount of consensus in genotypes and stitching together haplotypes. Specifically, the overlap is window_length
// overlap_denom
. threads
: Default is 2. The number of threads to utilize for multi-threadingmcstop
: Default is TRUE. If TRUE, rhapsodi exits if confident stitching of haplotypes isn't possible because the consensus between overlapping windows is <0.9 but >0.1. If FALSE, rhapsodi continues and considers to which threshold the consensus is closer.stringent_stitch
: Default is FALSE. If TRUE, the user can set a specified bifurcating threshold value for stitching, stitch_new_min
stitch_new_min
: If stringent_stitch
is TRUE, this value becomes the bifurcating threshold for stitching such that if the consensus is greater than this value, the windows are merged as the same haplotype. If consensus, is less than this value, the windows are treated as originating from different haplotypes.seqError_model
: Default is 0.005. (This is what governs sequencing_error
in the impute_gamete_genotypes
function.) The expected sequencing error rate which controls emission probabilities. avg_recomb_model
: Default is 1. (This is the what governs avg_recomb
in the impute_gamete_genotypes
function.) The expected average number of recombination events per chromosome per gamete, which controls transition probabilities. smooth_imputed_genotypes
: A boolean, default is FALSE. If TRUE, unsmoothing process is not applied. If FALSE, unsmoothing is applied such that original sequencing observations are used in place of imputed genotypes if the two disagree.fill_ends
: A boolean, default is TRUE, If TRUE, fills the NAs at the terminal edges of chromosomes with the first/last known or imputed SNP for the beginning/ending of chromosomes respectively; if FALSE, leaves these genotypes as NA in the final imputed genotypes. Default is TRUE.smooth_crossovers
: A boolean, default is TRUE. If TRUE, the filled gamete data that directly resulted from the HMM and NA filling is used. If FALSE, the filled gamete data that was unsmooothed is used.sampleName
: a string, default is "sampleT". The sample name will be reported in the output from this function.chrom
: a string, default is "chrT". The chromosome will be reported in the output from this function.verbose
: a bool, default is FALSE. If TRUE, will print progress updates after each step is successfully completed.The output of this function is the same named list with 6 elements that export_data
would return.
donor_haps
: the direct output from phase_donor_haplotypes
; a data frame with column names index
, pos
, h1
and h2
where index
is the SNP index, pos
is the genomic SNP positions, h1
is haplotype1, and h2
is haplotype2. The number of rows is equal to the number of hetSNPs.gamete_haps
: an output from impute_gamete_genotypes
with ornaments from export data
; Dataframe with imputed donor genotypes/recovered dense matrix where genotypes are encoded as "h1"/"h2"/NA, specifying specifically which donor haplotype that genotype originates from for each SNP and gamete. Number of rows equal to number of hetSNPs; number of columns equal to number of gametes + 2 (index
and pos
) gamete_genotypes
: an output from impute_gamete_genotypes
with ornaments from export data
; Dataframe with imputed donor genotypes/recovered dense matrix where genotypes are encoded 0/1/NA. Number of rows equal to number of hetSNPs; number of columns equal to number of gametes + 2 if acgt
was FALSE (index
and pos
) or + 4 if acgt
was TRUE (index
, pos
, a0
, a1
) unsmoothed_gamete_haps
: an output from impute_gamete_genotypes
with ornaments from export_data
; Unsmoothed version of gamete_haps
if smooth_imputed_genotypes
is FALSE. NULL if smooth_imputed_genotypes
is TRUE. Number of rows equal to number of hetSNPs; number of columns equal to number of gametes + 2 (index
and pos
) unsmoothed_gamete_genotypes
: an output from impute_gamete_genoyptes
with ornaments from export_data
; Unsmoothed version of gamete_genotypes
if smooth_imputed_genotypes
is FALSE. NULL if smooth_imputed_genotypes
is TRUE. Number of rows equal to number of hetSNPs; number of columns equal to number of gametes + 2 if acgt
was FALSE (index
and pos
) or + 4 if acgt
was TRUE (index
, pos
, a0
, a1
) recomb_breaks
: the direct output from discover_meiotic_recombination
; the output of this function is a dataframe with three columns that specifies the predicted recombination breakpoints for each gamete, using genomic coordinates. Column names Ident
, Genomic_start
, and Genomic_end
, with number of rows equal to the number of observed recombination breakpoints + the number of gametes without any observed recombination breakpoints.Because so many combinations of the parameters exist, a single example call will be provided here. For this example, assume that we are running rhapsodi on the pre-loaded dataframe rhapsodi_dt_acgt
which is A/C/G/T/NA encoded. Therefore
input_file
(the file + path argument) will be passed NULL
use_dt = TRUE
input_dt = rhapsodi_dt_acgt
acgt = TRUE
In addition for this example, assume that (A) we want phasing to continue even if stitching encounters non-stringent concordance values, (B) we expect a sequencing error rate of 0.05, (C) we expect an average of number of 2 recombination events per gamete per chromosome, (D) due to the higher sequencing error rate, we decide to skip unsmoothing. Thus
mcstop = FALSE
seqError_model = 0.05
avg_recomb_model = 2
smooth_imputed_genotypes = TRUE
Then, just generally, we want to use 8 threads, would like rhapsodi to update us on its progress with messages, and are running data from donorC, chromosome 17.
threads=8
verbose = TRUE
sampleName = "donorC"
chrom="chr17"
With all of this together, keeping all other arguments as default, our example call would be the following.
rhapsodi_out <- rhapsodi::rhapsodi_autorun(NULL, use_dt = TRUE, input_dt = rhapsodi_dt_acgt, acgt = TRUE, threads = 8, sampleName = 'donorC', chrom="chr17", seqError_model = 0.05, avg_recomb_model = 2, smooth_imputed_genotypes = TRUE, verbose = TRUE)
The 0/1/NA encoded data we'll be using for this is the sim01NA42
data.
head(sim01NA42)
dim(sim01NA42)
The dataframe has 4136 rows or heterozygous SNPs (hetSNPs) and 51 columns. The first column is the positions
or the SNP index for this simulated data. For real data, this first column should be the genomic position. The rest of the 50 columns are the gamete genotypes.
read_data_out <- rhapsodi::read_data(NULL, use_dt = TRUE, input_dt = sim01NA42) complete_haplotypes <- rhapsodi::phase_donor_haplotypes(read_data_out$dt, read_data_out$positions)
head(complete_haplotypes)
filled_gametes <- rhapsodi::impute_gamete_genotypes(read_data_out$dt, complete_haplotypes, read_data_out$positions) names(filled_gametes) dim(filled_gametes$filled_gametes)
recomb_breaks <- rhapsodi::discover_meiotic_recombination(read_data_out$dt, complete_haplotypes, filled_gametes, read_data_out$positions)
rhapsodi_out_sbs_01 <- rhapsodi::export_data(read_data_out, complete_haplotypes, filled_gametes, recomb_breaks) names(rhapsodi_out_sbs_01)
rhapsodi_out_auto_01 <- rhapsodi::rhapsodi_autorun(NULL, use_dt = TRUE, input_dt = sim01NA42, verbose = TRUE) names(rhapsodi_out_auto_01)
As should be, the outputs from running the ACGT data step-by-step and with the autorun function are identical to one another.
identical(rhapsodi_out_sbs_01, rhapsodi_out_auto_01)
The donor_haps
output contains numerical index
and pos
columns for the SNP index and genomic positions respectively. Finally, the numerical h1
and h2
columns record what rhapsodi determines to be the co-inherited genotypes for each haplotype.
head(rhapsodi_out_auto_01$donor_haps)
gamete_haps
and unsmoothed_gamete_haps
both contain numerical index
and pos
columns as well, followed by character columns for each gamete recording what rhapsodi determines to be the donor haplotype ("h1"
or "h2"
) which was inherited for each SNP.
head(rhapsodi_out_auto_01$gamete_haps) head(rhapsodi_out_auto_01$unsmoothed_gamete_haps)
gamete_genotypes
and unsmoothed_gamete_genotypes
both contain numerical index
and pos
columns as well. Finally, this is followed by numerical columns for each gamete recording what rhapsodi determines to be the inherited genotype for each SNP.
head(rhapsodi_out_auto_01$gamete_genotypes) head(rhapsodi_out_auto_01$unsmoothed_gamete_genotypes)
recomb_breaks
is the final output. The first columns is a character column, Ident
which records the sample name, the chromosome, and the gamete for which rhapsodi observed a recombination breakpoint. Genomic_start
is a numerical column recording the genomic start position of the breakpoint. Genomic_end
is a numerical column recording the genomic end position of the breakpoint.
head(rhapsodi_out_auto_01$recomb_breaks)
The A/C/G/T/NA encoded data we'll be using for this is the dataACGT
data, which is actually derived from the sim01NA42
data, just after randomly assigning alleles for the genotypes of 0 and 1 at each SNP.
head(dataACGT)
dim(dataACGT)
The dataframe has 4136 rows or hetSNPs and 53 columns. The first column is the positions
or the SNP index for this simulated data. For real data, this first column should be the genomic position. The second column is the ref
or reference allele, specifying the reference allele for each SNP/genomic position. The third columns is the alt
or alternate allele, specifying the alternate allele for each SNP/genomic position. The rest of the 50 columns are the gamete genotypes.
read_data_out <- rhapsodi::read_data(NULL, use_dt = TRUE, input_dt = dataACGT, acgt = TRUE) complete_haplotypes <- rhapsodi::phase_donor_haplotypes(read_data_out$dt, read_data_out$positions)
head(complete_haplotypes)
filled_gametes <- rhapsodi::impute_gamete_genotypes(read_data_out$dt, complete_haplotypes, read_data_out$positions) names(filled_gametes) dim(filled_gametes$filled_gametes)
recomb_breaks <- rhapsodi::discover_meiotic_recombination(read_data_out$dt, complete_haplotypes, filled_gametes, read_data_out$positions)
rhapsodi_out_sbs <- rhapsodi::export_data(read_data_out, complete_haplotypes, filled_gametes, recomb_breaks, acgt = TRUE) names(rhapsodi_out_sbs)
rhapsodi_out_auto <- rhapsodi::rhapsodi_autorun(NULL, use_dt = TRUE, input_dt = dataACGT, acgt = TRUE, verbose = TRUE) names(rhapsodi_out_auto)
As should be, the outputs from running the ACGT data step-by-step and with the autorun function are identical to one another.
identical(rhapsodi_out_sbs, rhapsodi_out_auto)
The donor_haps
output still contains the numerical index
and pos
columns for the SNP index and genomic positions respectively. In addition, this output now also includes the character a0
and a1
columns which record the allele for genotypes of 0 and 1 respectively. Finally, the numerical h1
and h2
columns record what rhapsodi determines to be the co-inherited genotypes for each haplotype.
head(rhapsodi_out_auto$donor_haps)
gamete_haps
and unsmoothed_gamete_haps
both contain numerical index
and pos
columns as well, followed by character columns for each gamete recording what rhapsodi determines to be the donor haplotype ("h1"
or "h2"
) which was inherited for each SNP.
head(rhapsodi_out_auto$gamete_haps) head(rhapsodi_out_auto$unsmoothed_gamete_haps)
gamete_genotypes
and unsmoothed_gamete_genotypes
both contain numerical index
and pos
columns as well. In addition, this output now also includes the character a0
and a1
columns which record the allele for genotypes of 0 and 1 respectively. Finally, this is followed by numerical columns for each gamete recording what rhapsodi determines to be the inherited genotype for each SNP.
head(rhapsodi_out_auto$gamete_genotypes) head(rhapsodi_out_auto$unsmoothed_gamete_genotypes)
Finally, recomb_breaks
is the same format as when non-acgt encoded data is used. The first columns is a character column, Ident
which records the sample name, the chromosome, and the gamete for which rhapsodi observed a recombination breakpoint. Genomic_start
is a numerical column recording the genomic start position of the breakpoint. Genomic_end
is a numerical column recording the genomic end position of the breakpoint.
head(rhapsodi_out_auto$recomb_breaks)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.