sim_RVstudy: Simulate sequence data for a sample of pedigrees

View source: R/sim_StudySeqFunctions.R

Description

Simulate single-nucleotide variant (SNV) data for a sample of pedigrees.

Usage

sim_RVstudy(ped_files, SNV_data, affected_only = TRUE,
  remove_wild = TRUE, pos_in_bp = TRUE, gamma_params = c(2.63,
  2.63/0.5), burn_in = 1000, SNV_map = NULL, haplos = NULL)

Arguments

ped_files

Data frame. A data frame of pedigrees for which to simulate sequence data, see details.

SNV_data

SNVdata. An object of class SNVdata created by SNVdata.

affected_only

Logical. When affected_only = TRUE, we only simulate SNV data for the disease-affected individuals and the family members that connect them along a line of descent. When affected_only = FALSE, SNV data is simulated for the entire study. By default, affected_only = TRUE.

remove_wild

Logical. When remove_wild = TRUE the data is reduced by removing SNVs which are not observed in any of the study participants; otherwise if remove_wild = FALSE no data reduction occurs. By default, remove_wild = TRUE.

pos_in_bp

Logical. This argument indicates if the positions in SNV_map are listed in base pairs. By default, pos_in_bp = TRUE. If the positions in SNV_map are listed in centiMorgan please set pos_in_bp = FALSE instead.

gamma_params

Numeric vector of length 2. The respective shape and rate parameters of the gamma distribution used to simulate the distance between chiasmata. By default, gamma_params = c(2.63, 2.63/0.5), which is equivalent to c(2.63, 2*2.63), as discussed in Voorrips and Maliepaard (2012).
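As a rough illustration of what these two parameters imply, the sketch below draws inter-chiasma distances from the default gamma distribution using base R only. It is not taken from the package source; the variable names are hypothetical, and reading the resulting mean of 0.5 as a distance in Morgans is an assumption.

```r
# Sketch only: draw inter-chiasma distances from the default gamma distribution.
set.seed(1)
gamma_params <- c(2.63, 2.63 / 0.5)   # shape, rate (the default shown in Usage)

# The two ways of writing the default rate agree.
stopifnot(isTRUE(all.equal(gamma_params, c(2.63, 2 * 2.63))))

# The theoretical mean of a gamma distribution is shape / rate.
mean_distance <- gamma_params[1] / gamma_params[2]

# Simulated distances between successive chiasmata (hypothetical variable name;
# assumed here to be on the Morgan scale).
distances <- rgamma(10000, shape = gamma_params[1], rate = gamma_params[2])
summary(distances)
```

With the default values the shape-to-rate ratio is exactly 0.5, so changing only the rate rescales the average spacing while the shape controls how regular the spacing is.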

burn_in

Numeric. The "burn-in" distance in centiMorgan, as defined by Voorrips and Maliepaard (2012), which is required before simulating the location of the first chiasma with interference. By default, burn_in = 1000.

SNV_map

This argument has been deprecated. Users now supply objects of class SNVdata to argument SNV_data.

haplos

This argument has been deprecated. Users now supply objects of class SNVdata to argument SNV_data.

Details

The sim_RVstudy function is used to simulate single-nucleotide variant (SNV) data for a sample of pedigrees. Please note: this function is NOT appropriate for users who wish to simulate genotype conditional on phenotype. Instead, sim_RVstudy employs the following algorithm.

  1. For each pedigree, we sample a single causal rare variant (cRV) from a pool of SNVs specified by the user.

  2. Upon identifying the familial cRV we sample founder haplotypes from haplotype data conditional on the founder's cRV status at the familial cRV locus.

  3. Proceeding forward in time, from founders to more recent generations, for each parent/offspring pair we:

    1. simulate recombination and formation of gametes, according to the model proposed by Voorrips and Maliepaard (2012), and then

    2. perform a conditional gene drop to model inheritance of the cRV.

It is important to note that due to the forwards-in-time algorithm used by sim_RVstudy, certain types of inbreeding and/or loops cannot be accommodated. Please see examples.
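To make the forwards-in-time constraint concrete, the following base R sketch orders pedigree members so that every parent is processed before its offspring, and stops when no such ordering exists. The helper order_founders_first is hypothetical and is not part of SimRVSequences; it is only one plausible reading of why some loops cannot be handled, under the assumption that founders have missing (NA) parent IDs.

```r
# Hypothetical helper (not part of SimRVSequences): return IDs ordered so that
# both parents of every non-founder appear before that individual.
order_founders_first <- function(ped) {
  done <- ped$ID[is.na(ped$dadID) & is.na(ped$momID)]   # founders come first
  todo <- setdiff(ped$ID, done)
  while (length(todo) > 0) {
    rows  <- ped[ped$ID %in% todo, ]
    ready <- rows$ID[rows$dadID %in% done & rows$momID %in% done]
    if (length(ready) == 0) {
      stop("cannot order pedigree: an individual precedes its own parent")
    }
    done <- c(done, ready)
    todo <- setdiff(todo, ready)
  }
  done
}

# A toy three-generation pedigree: 1 and 2 are parents of 3; 3 and 4 of 5.
ped <- data.frame(ID    = c(1, 2, 3, 4, 5),
                  dadID = c(NA, NA, 1, NA, 3),
                  momID = c(NA, NA, 2, NA, 4))
ordering <- order_founders_first(ped)

# Listing individual 5 as a parent of individual 3 creates a cycle, because
# 3 is already a parent of 5. No founders-first ordering exists.
loop_ped <- ped
loop_ped$momID[loop_ped$ID == 3] <- 5
res <- try(order_founders_first(loop_ped), silent = TRUE)
inherits(res, "try-error")
```

Inbreeding between relatives who can still be ordered in this way (for example, cousins) passes this check, whereas a parent who is also a descendant of their own child does not.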

For a detailed description of the model employed by sim_RVstudy, please refer to section 6 of the vignette.

The data frame of pedigrees, ped_files, supplied to sim_RVstudy must contain the variables:

name      type     description
FamID     numeric  family identification number
ID        numeric  individual identification number
sex       numeric  sex identification variable: sex = 0 for males, and sex = 1 for females
dadID     numeric  identification number of father
momID     numeric  identification number of mother
affected  logical  disease status indicator: set affected = TRUE if individual has disease
DA1       numeric  paternally inherited allele at the cRV locus: DA1 = 1 if the cRV is inherited, and 0 otherwise
DA2       numeric  maternally inherited allele at the cRV locus: DA2 = 1 if the cRV is inherited, and 0 otherwise

If ped_files does not contain the variables DA1 and DA2 the pedigrees are assumed to be fully sporadic. Hence, the supplied pedigrees will not segregate any of the SNVs in the user-specified pool of cRVs.
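The layout described above can be sketched with a small hand-built data frame. The helper check_ped_vars below is hypothetical (it is not a SimRVSequences function), the trio is made-up data, and coding founders' parent IDs as NA is an assumption.

```r
# Variables named in the table above; DA1 and DA2 may be absent.
required_vars <- c("FamID", "ID", "sex", "dadID", "momID", "affected")
optional_vars <- c("DA1", "DA2")

# Hypothetical helper: stop if a required variable is missing; return TRUE
# when both cRV allele columns are present and FALSE otherwise.
check_ped_vars <- function(ped) {
  missing_required <- setdiff(required_vars, names(ped))
  if (length(missing_required) > 0) {
    stop("missing variables: ", paste(missing_required, collapse = ", "))
  }
  all(optional_vars %in% names(ped))
}

# A made-up trio: the father (ID 1) carries the cRV and transmits it to the
# child (ID 3) as the paternally inherited allele.
trio <- data.frame(
  FamID    = c(1, 1, 1),
  ID       = c(1, 2, 3),
  sex      = c(0, 1, 0),              # 0 = male, 1 = female
  dadID    = c(NA, NA, 1),            # NA for founders (assumption)
  momID    = c(NA, NA, 2),
  affected = c(TRUE, FALSE, TRUE),
  DA1      = c(1, 0, 1),
  DA2      = c(0, 0, 0)
)

has_cRV_info <- check_ped_vars(trio)                   # DA1 and DA2 present
sporadic     <- check_ped_vars(trio[, required_vars])  # DA1 and DA2 dropped
```

Here has_cRV_info is TRUE, while the second call returns FALSE, corresponding to the case in which the pedigree would be treated as fully sporadic.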

Pedigrees simulated by the sim_RVped and sim_ped functions of the SimRVPedigree package are properly formatted for the sim_RVstudy function. That is, the pedigrees generated by these functions contain all of the variables required for ped_files (including DA1 and DA2).

The data frame SNV_map catalogs the SNVs in haplos. The variables in SNV_map must be formatted as follows:

name        type       description
colID       numeric    associates the rows in SNV_map to the columns of haplos
chrom       numeric    the chromosome that the SNV resides on
position    numeric    the position of the SNV, in base pairs when pos_in_bp = TRUE or in centiMorgan when pos_in_bp = FALSE
marker      character  (Optional) a unique character identifier for the SNV. If missing, this variable will be created from chrom and position.
pathwaySNV  logical    (Optional) identifies SNVs located within the pathway of interest as TRUE
is_CRV      logical    identifies causal rare variants (cRVs) as TRUE

Please note that when the variable is_CRV is missing from SNV_map, we sample a single SNV to be the causal rare variant for all pedigrees in the study, which is identified in the returned famStudy object.
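For orientation, here is a toy SNV_map built by hand in base R, together with a small haplotype matrix whose columns it catalogs. All values are invented, the dense matrix is used only for simplicity, and the chrom_position label format used to fill in a missing marker is an assumption rather than documented behaviour.

```r
# Toy SNV_map with one row per SNV (all values are made up).
SNV_map <- data.frame(
  colID      = 1:4,
  chrom      = c(1, 1, 2, 2),
  position   = c(1050, 20340, 515, 7781),   # base pairs, i.e. pos_in_bp = TRUE
  pathwaySNV = c(TRUE, FALSE, TRUE, FALSE),
  is_CRV     = c(TRUE, FALSE, FALSE, FALSE)
)

# marker is optional; build one from chrom and position when it is absent.
# The "chrom_position" format is an assumption made for this sketch.
if (is.null(SNV_map$marker)) {
  SNV_map$marker <- paste(SNV_map$chrom, SNV_map$position, sep = "_")
}

# Toy haplotypes: one row per haplotype, one column per SNV in SNV_map.
haplos <- matrix(c(1, 0, 0, 0,
                   0, 1, 0, 1,
                   0, 0, 1, 0), nrow = 3, byrow = TRUE)
stopifnot(ncol(haplos) == nrow(SNV_map))   # colID must index the columns

# The pool from which familial cRVs would be sampled.
cRV_pool <- SNV_map$marker[SNV_map$is_CRV]
```

Flagging several rows with is_CRV = TRUE enlarges the pool, so that different pedigrees may segregate different cRVs.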

Value

An object of class famStudy. Objects of class famStudy are lists that include the following named items:

ped_files

A data frame containing the sample of pedigrees for which sequence data was simulated.

ped_haplos

A sparse matrix that contains the simulated haplotypes for each pedigree member in ped_files.

haplo_map

A data frame that maps the haplotypes (i.e. rows) in ped_haplos to the individuals in ped_files.

SNV_map

A data frame cataloging the SNVs in ped_haplos.

Objects of class famStudy are discussed in detail in section 5.2 of the vignette.

References

Roeland E. Voorrips and Chris A Maliepaard. (2012). The simulation of meiosis in diploid and tetraploid organisms using various genetic models. BMC Bioinformatics, 13:248.

Christina Nieuwoudt, Angela Brooks-Wilson, and Jinko Graham. (2019). SimRVSequences: an R package to simulate genetic sequence data for pedigrees. <doi:10.1101/534552>.

See Also

sim_RVped, read_slim, summary.famStudy

Examples

library(SimRVSequences)

#load pedigree, haplotype, and mutation data
data(study_peds)
data(EXmuts)
data(EXhaps)

# create variable 'is_CRV' in EXmuts.  This variable identifies the pool of
# causal rare variants  from which to sample familial cRVs.
EXmuts$is_CRV = FALSE
EXmuts$is_CRV[c(26, 139, 223, 228, 472)] = TRUE

# create object of class SNVdata
my_SNVdata <- SNVdata(Haplotypes = EXhaps,
                      Mutations = EXmuts)

#supply required inputs to the sim_RVstudy function
seqDat = sim_RVstudy(ped_files = study_peds,
                     SNV_data = my_SNVdata)


# Inbreeding examples
# Due to the forward-in-time model used by sim_RVstudy certain types of
# inbreeding and/or loops *may* cause fatal errors when using sim_RVstudy.
# The following examples demonstrate: (1) inbreeding that can be accommodated
# under this model, and (2) when this limitation is problematic.

# Create inbreeding in family 3 of study_peds
imb_ped1 <- study_peds[study_peds$FamID == 3, ]
imb_ped1[imb_ped1$ID == 18, c("momID")] = 7
plot(imb_ped1)

# Notice that this instance of inbreeding can be accommodated by our model.
seqDat = sim_RVstudy(ped_files = imb_ped1,
                     SNV_data = my_SNVdata)

# Create a different type of inbreeding in family 3 of study_peds
imb_ped2 <- study_peds[study_peds$FamID == 3, ]
imb_ped2[imb_ped2$ID == 8, c("momID")] = 18
plot(imb_ped2)

# Notice that inbreeding in imb_ped2 will cause a fatal
# error when the sim_RVstudy function is executed
## Not run: 
seqDat = sim_RVstudy(ped_files = imb_ped2,
                     SNV_data = my_SNVdata)

## End(Not run)

SimRVSequences documentation built on July 1, 2020, 6:03 p.m.