In genostats/FEstim: Genetic Data Analyses in Presence of Inbreeding

Introduction
- Principal concepts
Getting started
- Installing Fantasio
- Installing example data:HGDP-CEPH
Running Fantasio
Plotting results
Step by step usage of the package Fantasio
Parallelism with the package

Introduction

Fantasio is composed of several functions. Its goals are:

For population genetic studies: estimating and detecting inbreeding on individuals without known genealogy, estimating the population proportion of mating types and the individual probability to be offspring of different mating types
For rare disease studies: performing homozygosity mapping with heterogeneity
For multifactorial disease studies: HBD-GWAS strategy

Fantasio implements the creation of several random sparse submaps on genome-wide data (to remove linkage disequilibrium). It also provides graphical outputs to facilitate interpretations of homozygosity mapping results and plots.

In this vignette, we illustrate how to use the package, using the data set HGDP-CEPH, a ship which contains 1043 individuals and 660918 markers. In order to access the data, you will need to download the HGDP.CEPH package (see below how) and load it.

Not all options of the functions are described here, but rather their basic usage. The reader is advised to look at the manual page of the function for details.

Principal concepts

Fantasio implements a maximum likelihood method that uses a hidden Markov chain to model the dependencies along the genome between the (observed) marker genotypes of an individual, and its (unobserved) homozygous by descent (HBD) status. The emission probabilities of this hidden Markov model (HMM) depend on the allele frequencies. The transition probabilities depend on the genetic distance between two adjacent markers. This model allows estimating the inbreeding coefficient $f$ of an individual, and a parameter $a$, where $af$ is the instantaneous rate of change per unit map length (here cM) from no HBD to HBD. Both HBD and non-HBD segment lengths are assumed to be distributed exponentially with mean lengths $\frac{1}{a(1-f)}$ and $\frac{1}{af}$, respectively.

The method requires the markers to be in minimal linkage disequilibrium (LD). Otherwise biased estimations of $f$ are produced. A strategy consisting of generating multiple random sparse genome maps (submaps) has been proposed to avoid this bias (Leutenegger et al. 2011). When several submaps are considered, $f$ is estimated as the median value on all the f estimation obtained on the different maps after removing submaps with $a > 1$ (having an average HBD segment length of 1 cM is unlikely to be detected with a SNP density of 1 per 0.5 cM). This strategy has the advantage of not requiring any LD computation on the sample and of minimizing loss of information, as compared with a strategy that based on a single map of markers in minimal LD.

Fantasio statistical framework allows fixing HMM parameters to compute the likelihood of a mating type. These likelihoods can be used for:

Inferring an individual as inbred by comparing the maximized likelihood with the one to be outbred with a likelihood ratio test
Estimating the population proportion of mating types
Estimating the individual probability to be into different mating types

When multiple submaps are used, the median p-values/probabilities are considered. See Leutenegger et al. 2011 for more details on the calculations.

Homozygosity mapping (Lander and Botstein 1987) consists in focusing on inbred affected individuals and searching for a region of the genome of shared homozygosity. Original homozygosity mapping requires that the genealogy of patients be known so that inbred patients can be identified and their respective $f$ estimated. Leutenegger et al. (Leutenegger et al. 2006) proposed using the $f$ estimated on genome-wide genetic data to compute a FLOD score, similar to Morton's LOD score (Morton 1955).

Genin et al. (Genin et al. 2012) adapted the FLOD formula for multiple submaps. $FLOD^{(i)}(m,s)$ is computed for each individual $i$, each marker $m$ on each submap $s$, using the equation (1):

$FLOD^{(i)}(m,s) = log_{10}\frac{P\left(Y_{m,s}^{(i)} | H_{1}\right)}{P\left(Y_{m,s}^{(i)} | H_{0}\right)} = log_{10}\frac{P\left(X_{m,s}^{(i)}=1 | Y_{m,s}^{(i)}\right) + q.P\left(X_{m,s}^{(i)}=0 | Y_{m,s}^{(i)}\right)}{\hat{f}_s^{(i)} + q.\left(1-\hat{f}_s^{(i)}\right)}$

With the following parameters :

$Y_{m,s}^{(i)}$ the observed genotype of individual i at marker m on submap s
$H_{1}$ the hypothesis where marker $m$ is linked to the disease, and $H_{0}$ the one where it is not
$X_{m,s}^{(i)}$ the HBD status of individual i at marker m on submap s that is estimated together with the inbreeding coefficient using the HMM of the package
$\hat{f}_s^{(i)}$ the estimated inbreeding coefficient of individual i on submap s
q the assumed frequency of the mutation involved in the disease for this individual.

Results are then averaged over the different submaps to obtain a single $FLOD^{(i)}(m)$ at each marker m.

Genin et al. (Genin et al. 2012) proposed to detect fully penetrant rare recessive variants by performing homozygosity mapping on inbred cases from Genome-Wide Association Study (GWAS) data. Linkage evidence is then evaluated over the entire set I of inbred cases by computing a FLOD score, $HFLOD(m,\alpha))$, at each marker m, in presence of genetic heterogeneity using a parameter $(\alpha)$ (2):

$HFLOD(m,\alpha)=\sum log_{10} \left[\alpha.\frac{P\left(Y_{m,s}^{(i)} | H_{1}\right)}{P\left(Y_{m,s}^{(i)} | H_{0}\right)}+ (1 - \alpha)\right ]= \sum log_{10} \left[\alpha . exp \left(FLOD^{(i)}(m) \times \log(10)\right)+(1-\alpha)\right]$

This heterogeneity score is then maximized over $(\alpha)$ to evaluate the evidence of linkage at marker m where $(\alpha)$ is the estimate of the proportion of cases linked to this locus (3):

$HFLOD(m)=max_{\alpha}(HFLOD(m,\alpha))$

Getting started

The package Fantasio depends on gaston. Hereafter the functions of this package are used for data manipulation. If you are not familiar with gaston, please refer to the vignette of this package for more information.

Installing Fantasio

First and foremost install the package with the following command:

devtools::install_github("genostats/Fantasio@margot", build_vignettes = TRUE)

After that we can load the package.

require(Fantasio)

## Loading required package: Fantasio
## Loading required package: parallel
## Loading required package: gaston
## Loading required package: Rcpp
## Loading required package: RcppParallel
## 
## Attaching package: 'RcppParallel'
## The following object is masked from 'package:Rcpp':
## 
##     LdFlags
## Gaston set number of threads to 8. Use setThreadOptions() to modify this.
## 
## Attaching package: 'gaston'
## The following object is masked from 'package:stats':
## 
##     sigma
## The following objects are masked from 'package:base':
## 
##     cbind, rbind

Installing example data: HGDP-CEPH

We illustrate the usage of the package with the data provided in the data package HGDP-CEPH. First, we need to install it:

install.packages("HGDP.CEPH", repos="https://genostats.github.io/R/")

We can load the HGDP-CEPH data as follows:

require(HGDP.CEPH)

## Loading required package: HGDP.CEPH

filepath <- system.file("extdata", "hgdp_ceph.bed", package="HGDP.CEPH")
x <- read.bed.matrix(filepath)

## Reading /home/margot/R/x86_64-pc-linux-gnu-library/3.6/HGDP.CEPH/extdata/hgdp_ceph.rds 
## Reading /home/margot/R/x86_64-pc-linux-gnu-library/3.6/HGDP.CEPH/extdata/hgdp_ceph.bed

The object x is a bed.matrix object (see gaston package for details). The statistics on SNPs and individual callrates, etc, have not been computed; run

x <- set.stats(x)

## ped stats and snps stats have been set. 
## 'p' has been set. 
## 'mu' and 'sigma' have been set.

to add them in the bed matrix. These data have genetic distances (centiMorgans) set in the @snps slot, column dist:

head(x@snps)

##   chr         id       dist     pos A1 A2  N0  N1  N2 NAs N0.f N1.f N2.f NAs.f
## 1   1  rs3094315 0.09162711  742429  C  T  75 346 620   2   NA   NA   NA    NA
## 2   1 rs12562034 0.09918407  758311  A  G  81 275 674  13   NA   NA   NA    NA
## 3   1  rs3934834 0.49630053  995669  T  C  58 268 717   0   NA   NA   NA    NA
## 4   1  rs9442372 0.50394850 1008567  A  G 149 423 471   0   NA   NA   NA    NA
## 5   1  rs3737728 0.50800870 1011278  T  C  45 290 708   0   NA   NA   NA    NA
## 6   1 rs11260588 0.50867275 1011521  A  G   3  49 991   0   NA   NA   NA    NA
##    callrate        maf         hz
## 1 0.9980825 0.23823247 0.33237272
## 2 0.9875360 0.21213592 0.26699029
## 3 1.0000000 0.18408437 0.25695110
## 4 1.0000000 0.34563758 0.40556088
## 5 1.0000000 0.18216683 0.27804410
## 6 1.0000000 0.02636625 0.04697987

Most datasets come with physical positions (column pos in @snps), but no genetic distance. They can be added with Gaston. First, you need install the HumanGeneticMap package with install.packages("HumanGeneticMap", repos="https://genostats.github.io/R/").
Then, you would run

x <- set.dist(x, HumanGeneticMap::genetic.map.b36)

## Setting distances for chromosome 1 
## Setting distances for chromosome 2 
## Setting distances for chromosome 3 
## Setting distances for chromosome 4 
## Setting distances for chromosome 5 
## Setting distances for chromosome 6 
## Setting distances for chromosome 7 
## Setting distances for chromosome 8 
## Setting distances for chromosome 9 
## Setting distances for chromosome 10 
## Setting distances for chromosome 11 
## Setting distances for chromosome 12 
## Setting distances for chromosome 13 
## Setting distances for chromosome 14 
## Setting distances for chromosome 15 
## Setting distances for chromosome 16 
## Setting distances for chromosome 17 
## Setting distances for chromosome 18 
## Setting distances for chromosome 19 
## Setting distances for chromosome 20 
## Setting distances for chromosome 21 
## Setting distances for chromosome 22 
## No map for chromosome 23 
## No map for chromosome 24 
## No map for chromosome 25 
## No map for chromosome 26

as the SNP positions in these data are on hg18/b36 reference.

We are going to work on the Bedouin population. To select this population run:

x.be <- select.inds(x, population == "Bedouin")

Running Fantasio

Two differents methods for submaps creation are implemented in the package:

"By Hotspots": with this method we use recombination hotspots to segment the genome. The package contains recombination hotspots for the Human Genome in builds 35 (hg17), 36 (hg18) and 37 (hg19) in datasets hotspots_hg17, hotspots_hg18,hotspots_hg19. The default recombination hotspots dataset used is hotspots_hg19. Segments should contain at least a user defined number of markers. Markers are then randomly selected within each segment along the genome. By doing this process we obtain a submap (a list of marker).
"By Distance" or "by SNPs": with this method we use a fixed step based on genetic or physiscal distance (0.5 cM by default) to pick a marker randomly along the genome. More technically, segments are created whenever there is a gap larger than the step (0.5 cM) between adjacent markers. Each segment is then subdivided in several mini-segments. By default we create 20 mini-segments, each containing at least 50 markers. If this is not possible (not enough markers), we do not create mini-segments. After this process is done, we loop over the mini-segments, pick a random marker and walk through the mini-segments by picking the nearest marker after taking a step (default 0.5 cM) downstream and upstream the mini-segments.

The function Fantasio calls two different functions for submaps creation, based on the value of segments: segmentsListByHotspots and segmentsListByDistance. The first function segmentsListByDistance is used to create a list of segments though the genome. The second function makeAtlasByDistance or makeAllSubsmapsbyHotspots is used to create the submaps.

Hotspots (Default)

By default, the submaps are created using the file of recombination hotspots and summarizing the results for each snp that appears in a submap.

F1 <- Fantasio(bedmatrix=x.be, segments="Hotspots", 
           segment.options = list(hotspots = hotspot_hg18), n = 100, 
           verbose = FALSE)

## Warning in setSummary(h, probs = run.proba, recap.by.segments = recap.by.segments, : No individual with pheno = 2.
## Using all inbred individuals with good estimation quality.

The function markerRepresentation gives for each marker of the original bed matrix the number of times it has been selected. It can be used in conjonction with table to obtain a summary of the number of markers never seen, seen once, twice, etc:

head(markerRepresentation(F1), 10)

##            id chr     pos       dist Freq
## 1   rs3094315   1  742429 0.09162711    2
## 2  rs12562034   1  758311 0.09918407    2
## 3   rs3934834   1  995669 0.49630053    0
## 4   rs9442372   1 1008567 0.50394850    2
## 5   rs3737728   1 1011278 0.50800870    6
## 6  rs11260588   1 1011521 0.50867275    0
## 7   rs9442398   1 1011558 0.50878962    0
## 8   rs6687776   1 1020428 0.54737207    0
## 9   rs9651273   1 1021403 0.54793723    1
## 10  rs4970405   1 1038818 0.55291100    2

table(markerRepresentation(F1)$Freq)

## 
##      0      1      2      3      4      5      6      7      8      9     10 
## 217897 154511 101646  62532  39023  24346  16145  10899   7659   5467   4090 
##     11     12     13     14     15     16     17     18     19     20     21 
##   3081   2267   1771   1474   1107    947    774    632    549    459    369 
##     22     23     24     25     26     27     28     29     30     31     32 
##    336    286    274    207    210    166    147    110    132    115    109 
##     33     34     35     36     37     38     39     40     41     42     43 
##    103     67     68     68     48     31     32     22     33     27     20 
##     44     45     46     47     48     49     50     51     52     53     54 
##     30     35     28     30     40     42     38     42     38     27     27 
##     55     56     57     58     59     60     61     62     63     64     68 
##     30     25     14      9      3      3      4      3      3      2      1 
##    100 
##    188

The slot F1@submap_summary contains a summmary of results accross submaps for all individuals:

head(F1@submap_summary, 10)

##        famid        id pheno submaps quality      f_min      f_max       f_mean
## 1  HGDP00607 HGDP00607     1     100     100 0.01968740 0.03266918 0.0249375555
## 2  HGDP00608 HGDP00608     1     100     100 0.03329754 0.04466905 0.0378773806
## 3  HGDP00609 HGDP00609     1     100     100 0.03555332 0.04570728 0.0395621560
## 4  HGDP00610 HGDP00610     1     100     100 0.04405081 0.05855172 0.0496007528
## 5  HGDP00611 HGDP00611     1      96      96 0.00000000 0.01027520 0.0003100903
## 6  HGDP00612 HGDP00612     1     100     100 0.04563621 0.06452940 0.0541143211
## 7  HGDP00613 HGDP00613     1     100     100 0.00000000 0.00000000 0.0000000000
## 8  HGDP00614 HGDP00614     1     100     100 0.02374184 0.03547134 0.0289076895
## 9  HGDP00615 HGDP00615     1     100     100 0.08069546 0.09608753 0.0888941435
## 10 HGDP00616 HGDP00616     1     100     100 0.06384241 0.08046822 0.0709961142
##      f_median   a_median   pLRT_median inbred pLRT_inf_0.05
## 1  0.02456151 0.13107792  1.534069e-37   TRUE           100
## 2  0.03771483 0.06290722  4.902576e-69   TRUE           100
## 3  0.03913424 0.10664411  1.144885e-72   TRUE           100
## 4  0.04932863 0.15372793  1.366378e-79   TRUE           100
## 5  0.00000000 0.01000000  1.000000e+00  FALSE             1
## 6  0.05363621 0.28137287  5.006136e-56   TRUE           100
## 7  0.00000000 0.01000000  1.000000e+00  FALSE             0
## 8  0.02847253 0.13542484  6.424485e-41   TRUE           100
## 9  0.08883544 0.09133073 5.673614e-168   TRUE           100
## 10 0.07040706 0.15074459 3.685643e-104   TRUE           100

If you are interested in the values of $f$ and $a$ accross the submaps, you can find them in F1@estimation_summary.

The function also computed HBD probabilities that can be found in F1@HBD_recap, which are used to determine the position of HBD segments, which are recapitulated in F1@HBDsegments. It is required that at least n.consecutive.marker markers are HBD before calling a HBD segment. Default value isn.consecutive.marker=5.

Moreover, FLOD and HFLOD linkage statistics are computed. The FLOD of each individual is in F1@FLOD_recap; they are used to compute the whole sample statistics HFLOD in F1@HFLOD.

By default, the package only computes HBD, FLOD statistics and HFLOD statistics for affected individuals (phenotype = 2). In case there are no affected individuals, all individuals are taken into account (a warning is issued). The list.id argument allows to specify a list of individuals.

Hotspots by Segments

For the "Hotspots" method, the results can also be summarized globally for each segment using option recap.by.segments=TRUE. In that case, n.consecutive.marker should be set to 1.

F2 <- Fantasio(bedmatrix = x.be, segments = "Hotspots", 
          segment.options = list(hotspots = hotspot_hg18), n = 100, 
          recap.by.segments = TRUE, n.consecutive.marker = 1, verbose = FALSE)

## Warning in setSummary(h, probs = run.proba, recap.by.segments = recap.by.segments, : No individual with pheno = 2.
## Using all inbred individuals with good estimation quality.

Distance

F3 <- Fantasio(bedmatrix=x.be, segments="Distance", n = 100, verbose = FALSE)

## Warning in setSummary(h, probs = run.proba, recap.by.segments = recap.by.segments, : No individual with pheno = 2.
## Using all inbred individuals with good estimation quality.

In addition to all the above mentionned summaries, the function markerSummary returns the number of SNPs in each submap. Note that when `segments = "Hotspots", it's always the number of hotspot segments, but here it varies from submap to submap.

head(markerSummary(F3))

##          number_of_markers_used
## Submap 1                   6229
## Submap 2                   6239
## Submap 3                   6236
## Submap 4                   6238
## Submap 5                   6221
## Submap 6                   6233

How to run logistic regression

To run logistic regression use run.logistic = TRUE in Fantasio function. (default = FALSE) Specify your explanatory variable in expl_var. It can be HBD_prob or FLOD. You can import a dataframe with covariates such as (age, sex, BMI ...) incovariates argument to run adjusted glm.

How to use the `segment.option` argument

In order to use the segment.option argument you need to pass a list of arguments, each variable name in the list must be an argument name in the function. The function that will be called is either createsSegmentsListByDistance if segments argument is equal to "Distance" or segmentsListByHotspots if segments argument is equal to "Hotspots" and the arguments list will be passed to it. So refer to these functions for possible arguments.

l <- list(hotspots = hotspot_hg18, minMarkers = 50) # default is 0
F4 <- Fantasio(bedmatrix = x.be, segments = "Hotspots", n = 100, segment.options = l, verbose = FALSE)

## Warning in setSummary(h, probs = run.proba, recap.by.segments = recap.by.segments, : No individual with pheno = 2.
## Using all inbred individuals with good estimation quality.

In the case of "Hotspots", by default, we do not require to have a minimum number of markers in each segment (minMarkers = 0). With the above command line, we impose to have at least 50 markers in a segment. Otherwise it is merged with the previous segment.

Plotting results

We illustrate the plotting function on the F1 object obtained with segments="Hotspots" and F2 obtained with segments="Hotspots" and recap.by.segments=TRUE.

HFLOD plots

This plot shows the values of the HFLOD linkage statistics along the genome (contained in F1@HFLOD).

HFLODManhattanPlot(F1)

The points are the HFLOD values. In the case of F1, they are computed at every marker included in at least one submaps. The red line is the value of their moving average (see below).

To plot a single chromosome, use the following function:

HFLODPlotChr(F1, chr=15)

The red line plotted is the moving average of the HFLOD, calculated on moving windows of 50 markers. This allows checking the consistency of HFLOD calculations (i.e. checking the fact that a high HFLOD score is not due to one submap only).
For method "Hotspots by segments", the use of the moving average does not make much sense as results are already averaged over the SNPs of a segment between hospots regions. We recommend using MA=FALSE as this:

HFLODPlotChr(F2, chr=15, MA=FALSE)

HBD plots

The HBD plots allow to visualize the positions of the HBD segments on the genome, contained in F1@HBDsegments. You can plot multiple individuals for one chromosome (by default, all individuals with $f$ significantly positive are plotted), as illustrated below with F1 and F2:

HBDPlotChr(F1, chr=15)

HBDPlotChr(F2, chr=15)

It is also possible to look at all HBD segments of a given individual (again we show the results for F1 and F2):

HBDPlotId(F1, id = "HGDP00649", famid = "HGDP00649")

HBDPlotId(F2, id = "HGDP00649", famid = "HGDP00649")

GLM plots

Current data doesn't allow to use this funcitonnality of Fantasio but - when it's allowed you can plot QQ-plots and Manhattan-plots using the function glmHBDPlot. See documentation about this function for more information.

Step by step usage of the package Fantasio

Hotspots

Creation of the segments list

We will now create segments, which will be use to create the submaps later, further explication below, for now use this command :

s1 <- segmentsListByHotspots(x.be, hotspots = hotspot_hg18)

## Using hotspots from  hotspot_hg18 
## Gathering all hotspots for the genome : ......................
## Gathering all the genome's markers : ......................
## Finding which markers are between two hotspots : ......................

This function creates a list of chromosomes, in each, you have a list of several segments created thanks to the hotspots file given in argument (files are given with the package), in each segments you have SNPs index.

You can watch a summary of what was done with :

segmentsListSummary(s1)

##    chromosome number_of_segments number_of_markers
## 1           1               1143             47146
## 2           2               1075             51611
## 3           3                939             42555
## 4           4                882             38215
## 5           5                870             39191
## 6           6                834             41404
## 7           7                726             34011
## 8           8                706             35659
## 9           9                699             29513
## 10         10                782             32753
## 11         11                663             30521
## 12         12                739             30331
## 13         13                574             23931
## 14         14                481             20314
## 15         15                509             18429
## 16         16                549             18487
## 17         17                475             15527
## 18         18                552             18911
## 19         19                323             10004
## 20         20                465             15759
## 21         21                262              9025
## 22         22                271              8956

This function creates a dataframe with three colums :

chromosome
number_of_segments
number_of_marker

Creation of the submaps and computation

We will now head toward the creation of submaps using the following commands :

F1 <- makeAtlasByHotspots(x.be, 5, s1)
F1 <- festim(F1)
F1 <- setSummary(F1, list.id = "all")

For the sake of clarity we have only created 5 submaps, but generally we do 100.

This function will creates 5 submaps, all the parameters can be modified (use args(makeAtlasByHotspots) for more informations).

The variable submaps becomes an submapsList object, you can watch the different elements of it with :

str(F1) #careful it can become huge depending on your data sizes

Descrition of the object F1

This object contains all the results of the different computations executed during the process of creating n submaps. Here is a complete description of each structure in this object :

str(F1@segments_list) #careful it can become huge depending on your data sizes

segments_list : the object Segments created previously (s1)

str(F1@submaps_list) #careful it can become huge depending on your data sizes

submaps_list : the list of all the submaps created during the process. Each element of the list is a S4 object. Depending on the method you used the object can be either a snsp.matrix or a HostspotsMatrix (here we use the hotspots method). Each submaps contains 15 slots :
- submap : the index of each marker picked (index from the bed.matrix object)
- ncol : the total number of marker picked
- nrow : the number of individuals
- ped : a dataframe with all the individuals' genotypes
- map : a dataframe with all the SNP information
- epsilon : genotyping error rate
- delta.dist : distance between each marker in cM/bp
- log.emiss : log of all the emission probabilities of the hidden Markov model
- a : a matrix with all the a's estimation
- f : a matrix with all the f's estimation
- likelihood0 : a matrix with all the likehood under the null hypothesis ($f=0$)
- likelihood1 : a matrix with all the likehood under the inbred hypothesis ($f=1$)
- p.lrt : p value of the likelihood ratio test
- HBD.prob : a matrix with all the HBD probabilities computed for each individual
- FLOD : a matrix with all the FLOD score computed

str(F1@likelihood_summary)

likelihood_summary : a dataframe with all the likelihood0 and likelihood1 computed over the submaps.

str(F1@estimation_summary)

estimation_summary : a dataframe with all the a and f computed over the submaps

submaps_summary : a dataframe which gives several informations about the a and f computed over the submaps. The dataframe contains 13 columns:
- famid: family identifier
- id: individual identifier
- pheno: status (1 non-affected, 2 affected, 0 unknown)
- submaps: number of submaps used
- quality: percentage of valid submaps (i.e. submaps with a < 1)
- f_min: minimum f on valid submaps
- f_max: maximum f on valid submaps
- f_mean: mean f on valid submaps
- f_median: median f on valid submaps (recommended to estimate f)
- a_median : median a on valid submaps (recommended to estimate a)
- pLRT_median: median p-value of LRT tests on valid submaps
- inbred : a flag indicating if the individual is inbred (pLRT_median < 0.05) or not
- pLRT_<0.05: number of valid submaps with a LRT (likelihood ratio test) having a p-value below 0.05

head(F1@submap_summary)

##       famid        id pheno submaps quality      f_min       f_max      f_mean
## 1 HGDP00607 HGDP00607     1       5     100 0.02307054 0.028227601 0.024997474
## 2 HGDP00608 HGDP00608     1       5     100 0.03481927 0.040759254 0.038219872
## 3 HGDP00609 HGDP00609     1       5     100 0.03694378 0.040824604 0.038914434
## 4 HGDP00610 HGDP00610     1       5     100 0.04457247 0.053895480 0.050580150
## 5 HGDP00611 HGDP00611     1       5     100 0.00000000 0.005577867 0.001115573
## 6 HGDP00612 HGDP00612     1       5     100 0.05372673 0.061493384 0.056387572
##     f_median   a_median  pLRT_median inbred pLRT_inf_0.05
## 1 0.02363142 0.12652141 7.107875e-39   TRUE             5
## 2 0.03775612 0.06013269 6.020340e-64   TRUE             5
## 3 0.03849243 0.10540997 3.697390e-74   TRUE             5
## 4 0.05170420 0.15971819 1.573828e-79   TRUE             5
## 5 0.00000000 0.01000000 1.000000e+00  FALSE             1
## 6 0.05515614 0.29336335 1.012755e-58   TRUE             5

bySegments : a boolean indicating whether the creation of summary statistics for HBD and FLOD has to be made by segments or not. By default, it is FALSE. The description below of HBD_recap and FLOD_recap depends on this choice.
HBD_recap : a dataframe with individuals in row and marker positions in column. It contains the mean of the HBD probability at a marker over the submaps for each individual. If the marker only appears in one submap (no mean needed), then it is just the HBD proba for that marker in that submap.

F1@HBD_recap[1:10, 1:10] # an individual * marker matrix

##                      rs11260588    rs6603793   rs7525092    rs4648808
## HGDP00607:HGDP00607 0.004177668 2.496864e-06 0.013449101 1.303554e-03
## HGDP00608:HGDP00608 0.004681580 7.370413e-06 0.002462725 2.144115e-03
## HGDP00609:HGDP00609 0.006472987 6.699717e-03 0.004678171 1.981378e-03
## HGDP00610:HGDP00610 0.014112102 9.672567e-06 0.014375800 4.413431e-06
## HGDP00611:HGDP00611 0.000000000 1.749224e-03 0.000000000 0.000000e+00
## HGDP00612:HGDP00612 0.588190216 6.450754e-01 0.742525650 8.549851e-01
## HGDP00613:HGDP00613 0.000000000 0.000000e+00 0.000000000 0.000000e+00
## HGDP00614:HGDP00614 0.768097811 8.940857e-01 0.943150847 8.654859e-01
## HGDP00615:HGDP00615 0.012682905 7.809544e-06 0.298069056 2.112174e-05
## HGDP00616:HGDP00616 0.021159744 1.379115e-04 0.017317536 3.134844e-02
##                       rs10910047     rs903919   rs10910050   rs12082516
## HGDP00607:HGDP00607 1.589246e-06 1.125124e-02 6.753533e-08 2.179563e-05
## HGDP00608:HGDP00608 5.612408e-07 1.872173e-06 2.873811e-04 1.708830e-07
## HGDP00609:HGDP00609 6.353068e-07 3.549626e-06 1.775769e-03 2.255507e-07
## HGDP00610:HGDP00610 3.291959e-06 6.216004e-03 3.539811e-04 4.981109e-05
## HGDP00611:HGDP00611 0.000000e+00 0.000000e+00 2.829025e-06 0.000000e+00
## HGDP00612:HGDP00612 7.737806e-01 8.713890e-01 7.719123e-01 7.947139e-01
## HGDP00613:HGDP00613 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## HGDP00614:HGDP00614 9.491040e-01 9.723467e-01 9.810801e-01 9.606419e-01
## HGDP00615:HGDP00615 2.029699e-03 2.848096e-01 1.783233e-05 9.361770e-06
## HGDP00616:HGDP00616 1.014055e-04 7.419373e-03 1.869893e-02 2.371933e-03
##                        rs2645065    rs2840530
## HGDP00607:HGDP00607 1.361605e-05 6.883217e-06
## HGDP00608:HGDP00608 1.557253e-03 1.320757e-03
## HGDP00609:HGDP00609 2.420824e-06 5.038964e-04
## HGDP00610:HGDP00610 2.596836e-03 1.064889e-04
## HGDP00611:HGDP00611 0.000000e+00 0.000000e+00
## HGDP00612:HGDP00612 8.800096e-01 8.715920e-01
## HGDP00613:HGDP00613 0.000000e+00 0.000000e+00
## HGDP00614:HGDP00614 9.129762e-01 8.904461e-01
## HGDP00615:HGDP00615 2.550489e-05 5.212088e-04
## HGDP00616:HGDP00616 7.551350e-03 1.909570e-02

FLOD_recap : a dataframe like above. It contains the mean of the FLOD score at a marker over the submaps for each individual. Again if a marker only appears in one submap, no mean is needed.

F1@FLOD_recap[1:10, 1:10] # an individual * marker matrix

##                     rs11260588  rs6603793  rs7525092  rs4648808 rs10910047
## HGDP00607:HGDP00607 -0.7356160 -2.3645709 -0.2330256 -1.2839118  -2.445319
## HGDP00608:HGDP00608 -0.8985638 -2.5803655 -1.1633156 -1.1920183  -2.606459
## HGDP00609:HGDP00609 -0.7942212 -0.7525045 -0.9233675 -1.2503623  -2.583709
## HGDP00610:HGDP00610 -0.5796989 -2.6848496 -0.5369504 -2.6312421  -2.700256
## HGDP00611:HGDP00611  0.0000000 -0.4871941  0.0000000  0.0000000   0.000000
## HGDP00612:HGDP00612  1.0385936  1.0672972  1.1396964  1.1424752   1.125804
## HGDP00613:HGDP00613  0.0000000  0.0000000  0.0000000  0.0000000   0.000000
## HGDP00614:HGDP00614  1.4390193  1.4926120  1.5179023  1.4920049   1.479503
## HGDP00615:HGDP00615 -0.8429585 -2.9191408  0.5165932 -2.8809159  -1.623471
## HGDP00616:HGDP00616 -0.5407423 -2.4970489 -0.6376400 -0.3489034  -2.580316
##                       rs903919 rs10910050 rs12082516  rs2645065  rs2840530
## HGDP00607:HGDP00607 -0.3098928 -2.3749872  -2.366543 -2.3113462 -2.4021939
## HGDP00608:HGDP00608 -2.5639212 -2.0231409  -2.608148 -1.3587449 -1.3905425
## HGDP00609:HGDP00609 -2.5874397 -1.3118144  -2.585480 -2.6015539 -1.7877458
## HGDP00610:HGDP00610 -0.8971502 -2.0679398  -2.538792 -1.3015013 -2.3351223
## HGDP00611:HGDP00611  0.0000000 -1.7420280   0.000000  0.0000000  0.0000000
## HGDP00612:HGDP00612  1.2091884  1.1452436   1.137395  1.2135387  1.1508289
## HGDP00613:HGDP00613  0.0000000  0.0000000   0.000000  0.0000000  0.0000000
## HGDP00614:HGDP00614  1.5311409  1.5329331   1.484750  1.5140532  1.5043511
## HGDP00615:HGDP00615  0.4968378 -2.8805369  -2.912886 -2.8508932 -2.1709293
## HGDP00616:HGDP00616 -1.0024447 -0.5993492  -1.491371 -0.9845619 -0.5632986

HBDsegments : a list of dataframe, each datafram is for an individual and it contains a list of HBDsegments (start and end positions). By default, a region is HBD if it contains at least 5 markers (n.consecutive.marker=5) with a HBD probability larger than 0.5 (threshold=0.5).

str(F1@HBDsegments[[1]])

## 'data.frame':    20 obs. of  11 variables:
##  $ id        : Factor w/ 1 level "HGDP00607": 1 1 1 1 1 1 1 1 1 1 ...
##  $ famid     : Factor w/ 1 level "HGDP00607": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pheno     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ start     : num  18788 20590 20600 20626 20649 ...
##  $ end       : num  19127 20595 20605 20647 20654 ...
##  $ size      : num  340 6 6 22 6 7 7 5 7 14 ...
##  $ chromosome: int  5 5 5 5 5 5 5 6 6 6 ...
##  $ start_pos : int  34611504 146164325 146571759 147249909 148176930 148622110 149027979 150577395 150743620 151206457 ...
##  $ end_pos   : int  58142052 146505732 146612787 148149539 148352615 148671072 149227931 150695815 150793567 151383747 ...
##  $ start_dist: num  53.2 148.8 149.3 150.6 151.8 ...
##  $ end_dist  : num  69.4 149.1 149.6 151.6 152.2 ...

HFLOD : a dataframe with the value of HFLOD scores for every markers that appeared at least once in a submap. It uses the formula (2) with FLOD_recap to obtain HFLOD.

str(F1@HFLOD)

## 'data.frame':    63644 obs. of  6 variables:
##  $ chr   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ snps  : Factor w/ 63644 levels "rs1000000","rs10000010",..: 7589 46931 53318 39873 5306 59074 5307 10842 32111 33711 ...
##  $ pos_cM: num  0.509 1.06 1.153 1.791 1.849 ...
##  $ pos_Bp: int  1011521 1495118 1799950 2030796 2201709 2212443 2214736 2241020 2251082 2275590 ...
##  $ HFLOD : num  0.0442 0.0685 0.0801 0.0725 0.1149 ...
##  $ ALPHA : num  0.0205 0.0235 0.0245 0.0239 0.0318 ...

bedmatrix : the bedmatrix object

str(F1@bedmatrix)

unit : the unit of the marker (cM or Bp).

Hotspots by segments

We implemented a second inner method for the "Hotspots" method. The only paramater that changes is recap.by.segments, it is put to TRUE.

In the default "Hotspots" method the HBD probabilities and FLOD scores are computed for each marker (see above).

With the "Hotspots by segment" method, HBD probabilities and FLOD scores correspond to the mean of HBD probabilities and FLOD score of all the markers of a segment that have been selected in one of the sumaps. There is hence only one value for a segment delimited by hotspots. So we also recommend to set n.consecutive.marker to 1. Since results over multiple markers have already been averaged.

Creation of the segments list

We use the same segment list that is used before (s1).

Creation of the submaps and computation

As said before the only argument that changes is "recap.by.segments", it is put to TRUE. And we recommend to adjust "n.consecutive.marker".

F2 <- makeAtlasByHotspots(x.be, 100, s1)
F2 <- festim(F2)
F2 <- setSummary(F2, recap.by.segments = TRUE, n.consecutive.marker = 1, list.id = "all")

Distance

Creation of the segments list

We will now create segments, which will be used to create the submaps later :

s3 <- segmentsListByDistance(x.be)

## Finding segments for the genome : .......................
## Finding which markers are between two segments: .......................
## Finding mini segments .......................

This function creates segments based on the markers available in the data and the gaps between them. The value of the gap is imposed by the minimun step required between markers. If 2 adjacent markers have an inter-distance larger than the step, then there is a gap between them. The first marker of the pair is the end of a segment and the second marker is the start of a new segment. The function creates an object which will contain three slots :

gap : the value of the minimal fix step required between sampled markers of a submap
unit : the unit of the markers ("cM" or "Bp") snpsSegments : for each chromosome, the
list of segments; each segment being a list of SNP indexes

You can watch a summary of what was done with :

segmentsListSummary(s3)

##    chromosome number_of_segments number_of_markers
## 1           1                  7             49639
## 2           2                  3             53765
## 3           3                  2             44564
## 4           4                  3             39942
## 5           5                  1             40976
## 6           6                  1             43239
## 7           7                  3             35507
## 8           8                  2             37282
## 9           9                  5             31192
## 10         10                  2             34493
## 11         11                  2             32005
## 12         12                  1             31873
## 13         13                  1             25191
## 14         14                  2             21450
## 15         15                  5             19594
## 16         16                  2             19727
## 17         17                  1             16629
## 18         18                  2             20165
## 19         19                  1             10739
## 20         20                  1             16911
## 21         21                  2              9645
## 22         22                  5              9730
## 23         25                  1                15

This function creates a dataframe with three colums :

chromosome
number_of_segments
number_of_marker

Creation of the submaps and computation

We will now head toward the creation of submaps using the following commands :

F3 <- makeAtlasByDistance(x.be, 5, s3)
F3 <- festim(F3)
F3 <- setSummary(F3, list.id = "all")

The variable submaps becomes an submapsList object, you can watch the different elements of it with :

str(F3) # careful it can become huge depending on your data sizes

Parallelism with the package

We implemented a paralellism method to make the creation of the submaps more efficient. We paralellized the creation of the submaps, that is to say, the selection of markers. Make sure to have an environment that can support the usage of multiple CPU.

Use the n.cores argument, i.e the number of CPU that will be used to make the differents submaps in the functions Fantasio, makeAtlasByHotspots, makeAtlasByDistance :

F1 <- Fantasio(bedmatrix = x.be, segments = "Hotspots", 
               segment.options = list(hotspots = hotspot_hg18), n = 100, 
               verbose = FALSE, n.cores = 10)

genostats/FEstim documentation built on Feb. 3, 2023, 7:33 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

genostats/FEstim
Genetic Data Analyses in Presence of Inbreeding

In genostats/FEstim: Genetic Data Analyses in Presence of Inbreeding

Introduction

Principal concepts

Getting started

Installing Fantasio

Installing example data: HGDP-CEPH

Running Fantasio

Hotspots (Default)

Hotspots by Segments

Distance

How to run logistic regression

How to use the `segment.option` argument

Plotting results

HFLOD plots

HBD plots

GLM plots

Step by step usage of the package Fantasio

Hotspots

Creation of the segments list

Creation of the submaps and computation

Descrition of the object F1

Hotspots by segments

Creation of the segments list

Creation of the submaps and computation

Distance

Creation of the segments list

Creation of the submaps and computation

Parallelism with the package

R Package Documentation

Browse R Packages

We want your feedback!

genostats/FEstim Genetic Data Analyses in Presence of Inbreeding

In genostats/FEstim: Genetic Data Analyses in Presence of Inbreeding

Introduction

Principal concepts

Getting started

Installing Fantasio

Installing example data: HGDP-CEPH

Running Fantasio

Hotspots (Default)

Hotspots by Segments

Distance

How to run logistic regression

How to use the segment.option argument

Plotting results

HFLOD plots

HBD plots

GLM plots

Step by step usage of the package Fantasio

Hotspots

Creation of the segments list

Creation of the submaps and computation

Descrition of the object F1

Hotspots by segments

Creation of the segments list

Creation of the submaps and computation

Distance

Creation of the segments list

Creation of the submaps and computation

Parallelism with the package

R Package Documentation

Browse R Packages

We want your feedback!

genostats/FEstim
Genetic Data Analyses in Presence of Inbreeding

How to use the `segment.option` argument