estNetwork: Function to estimate & plot haplotype network

View source: R/other_useful_functions.R

estNetworkR Documentation

Function to estimate & plot haplotype network

Description

Function to estimate & plot haplotype network

Usage

estNetwork(
  blockInterest = NULL,
  gwasRes = NULL,
  nTopRes = 1,
  gene.set = NULL,
  indexRegion = 1:10,
  chrInterest = NULL,
  posRegion = NULL,
  blockName = NULL,
  nHaplo = NULL,
  pheno = NULL,
  geno = NULL,
  ZETA = NULL,
  chi2Test = TRUE,
  thresChi2Test = 0.05,
  plotNetwork = TRUE,
  distMat = NULL,
  distMethod = "manhattan",
  evolutionDist = FALSE,
  complementHaplo = "phylo",
  subpopInfo = NULL,
  groupingMethod = "kmedoids",
  nGrp = 3,
  nIterClustering = 100,
  iterRmst = 100,
  networkMethod = "rmst",
  autogamous = FALSE,
  probParsimony = 0.95,
  nMaxHaplo = 1000,
  kernelTypes = "addNOIA",
  n.core = parallel::detectCores() - 1,
  parallel.method = "mclapply",
  hOpt = "optimized",
  hOpt2 = "optimized",
  maxIter = 20,
  rangeHStart = 10^c(-1:1),
  saveName = NULL,
  saveStyle = "png",
  plotWhichMDS = 1:2,
  colConnection = c("grey40", "grey60"),
  ltyConnection = c("solid", "dashed"),
  lwdConnection = c(1.5, 0.8),
  pchBase = c(1, 16),
  colCompBase = c(2, 4),
  colHaploBase = c(3, 5, 6),
  cexMax = 2,
  cexMin = 0.7,
  ggPlotNetwork = FALSE,
  cexMaxForGG = 0.025,
  cexMinForGG = 0.008,
  alphaBase = c(0.9, 0.3),
  verbose = TRUE
)

Arguments

blockInterest

A n \times M matrix representing the marker genotype that belongs to the haplotype block of interest. If this argument is NULL, this argument will automatically be determined by 'geno',

gwasRes

You can use the results (data.frame) of haplotype-based (SNP-set) GWAS by 'RGWAS.multisnp' function.

nTopRes

Haplotype blocks (or gene sets, SNP-sets) with top 'nTopRes' p-values by 'gwasRes' will be used.

gene.set

If you have information of gene (or haplotype block), you can use it to perform kernel-based GWAS. You should assign your gene information to gene.set in the form of a "data.frame" (whose dimension is (the number of gene) x 2). In the first column, you should assign the gene name. And in the second column, you should assign the names of each marker, which correspond to the marker names of "geno" argument.

indexRegion

You can specify the haplotype block (or gene set, SNP-set) of interest by the marker index in 'geno'.

chrInterest

You can specify the haplotype block (or gene set, SNP-set) of interest by the marker position in 'geno'. Please assign the chromosome number to this argument.

posRegion

You can specify the haplotype block (or gene set, SNP-set) of interest by the marker position in 'geno'. Please assign the position in the chromosome to this argument.

blockName

You can specify the haplotype block (or gene set, SNP-set) of interest by the name of haplotype block in 'geno'.

nHaplo

Number of haplotypes. If not defined, this is automatically defined by the data. If defined, k-medoids clustering is performed to define haplotypes.

pheno

Data frame where the first column is the line name (gid). The remaining columns should be a phenotype to test.

geno

Data frame with the marker names in the first column. The second and third columns contain the chromosome and map position. Columns 4 and higher contain the marker scores for each line, coded as [-1, 0, 1] = [aa, Aa, AA].

ZETA

A list of covariance (relationship) matrix (K: m \times m) and its design matrix (Z: n \times m) of random effects. Please set names of list "Z" and "K"! You can use more than one kernel matrix. For example,

ZETA = list(A = list(Z = Z.A, K = K.A), D = list(Z = Z.D, K = K.D))

Z.A, Z.D

Design matrix (n \times m) for the random effects. So, in many cases, you can use the identity matrix.

K.A, K.D

Different kernels which express some relationships between lines.

For example, K.A is additive relationship matrix for the covariance between lines, and K.D is dominance relationship matrix.

chi2Test

If TRUE, chi-square test for the relationship between haplotypes & subpopulations will be performed.

thresChi2Test

The threshold for the chi-square test.

plotNetwork

If TRUE, the function will return the plot of haplotype network.

distMat

You can assign the distance matrix of the block of interest. If NULL, the distance matrix will be computed in this function.

distMethod

You can choose the method to calculate distance between accessions. This argument corresponds to the 'method' argument in the 'dist' function.

evolutionDist

If TRUE, the evolution distance will be used instead of the pure distance. The 'distMat' will be converted to the distance matrix by the evolution distance when you use 'complementHaplo = "phylo"'.

complementHaplo

how to complement unobserved haplotypes. When 'complementHaplo = "all"', all possible haplotypes will be complemented from the observed haplotypes. When 'complementHaplo = "never"', unobserved haplotypes will not be complemented. When 'complementHaplo = "phylo"', unobserved haplotypes will be complemented as nodes of phylogenetic tree. When 'complementHaplo = "TCS"', unobserved haplotypes will be complemented by TCS methods (Clement et al., 2002).

subpopInfo

The information of subpopulations. This argument should be a vector of factor.

groupingMethod

If 'subpopInfo' argument is NULL, this function estimates subpopulation information from marker genotype. You can choose the grouping method from 'kmeans', 'kmedoids', and 'hclust'.

nGrp

The number of groups (or subpopulations) grouped by 'groupingMethod'. If this argument is 0, the subpopulation information will not be estimated.

nIterClustering

If 'groupingMethod' = 'kmeans', the clustering will be performed multiple times. This argument specifies the number of classification performed by the function.

iterRmst

The number of iterations for RMST (randomized minimum spanning tree).

networkMethod

Either one of 'mst' (minimum spanning tree), 'msn' (minimum spanning network), and 'rmst' (randomized minimum spanning tree). 'rmst' is recommended.

autogamous

This argument will be valid only when you use 'complementHaplo = "all"' or 'complementHaplo = "TCS"'. This argument specifies whether the plant is autogamous or not. If autogamous = TRUE, complemented haplotype will consist of only homozygous sites ([-1, 1]). If FALSE, complemented haplotype will consist of both homozygous & heterozygous sites ([-1, 0, 1]).

probParsimony

Equal to the argument 'prob' in 'haplotypes::parsimnet' function:

A numeric vector of length one in the range [0.01, 0.99] giving the probability of parsimony as defined in Templeton et al. (1992). In order to set maximum connection steps to Inf (to connect all the haplotypes in a single network), set the probability to NULL.

nMaxHaplo

The maximum number of haplotypes. If the number of total (complemented + original) haplotypes are larger than 'nMaxHaplo', we will only show the results only for the original haplotypes to reduce the computational time.

kernelTypes

In the function, similarlity matrix between accessions will be computed from marker genotype to estimate genotypic values. This argument specifies the method to compute similarity matrix: If this argument is 'addNOIA' (or one of other options in 'methodGRM' in 'calcGRM'), then the 'addNOIA' (or corresponding) option in the 'calcGRM' function will be used, and if this argument is 'diffusion', the diffusion kernel based on Laplacian matrix will be computed from network. You can assign more than one kernelTypes for this argument; for example, kernelTypes = c("addNOIA", "diffusion").

n.core

Setting n.core > 1 will enable parallel execution on a machine with multiple cores. This argument is not valid when 'parallel.method = "furrr"'.

parallel.method

Method for parallel computation in optimizing hyperparameters for estimating haplotype effects. We offer three methods, "mclapply", "furrr", and "foreach".

When 'parallel.method = "mclapply"', we utilize pbmclapply function in the 'pbmcapply' package with 'count = TRUE' and mclapply function in the 'parallel' package with 'count = FALSE'.

When 'parallel.method = "furrr"', we utilize future_map function in the 'furrr' package. With 'count = TRUE', we also utilize progressor function in the 'progressr' package to show the progress bar, so please install the 'progressr' package from github (https://github.com/HenrikBengtsson/progressr). For 'parallel.method = "furrr"', you can perform multi-thread parallelization by sharing memories, which results in saving your memory, but quite slower compared to 'parallel.method = "mclapply"'.

When 'parallel.method = "foreach"', we utilize foreach function in the 'foreach' package with the utilization of makeCluster function in 'parallel' package, and registerDoParallel function in 'doParallel' package. With 'count = TRUE', we also utilize setTxtProgressBar and txtProgressBar functions in the 'utils' package to show the progress bar.

We recommend that you use the option 'parallel.method = "mclapply"', but for Windows users, this parallelization method is not supported. So, if you are Windows user, we recommend that you use the option 'parallel.method = "foreach"'.

hOpt

Optimized hyper parameter for constructing kernel when estimating haplotype effects. If hOpt = "optimized", hyper parameter will be optimized in the function. If hOpt is numeric, that value will be directly used in the function.

hOpt2

Optimized hyper parameter for constructing kernel when estimating complemented haplotype effects. If hOpt2 = "optimized", hyper parameter will be optimized in the function. If hOpt2 is numeric, that value will be directly used in the function.

maxIter

Max number of iterations for optimization algorithm.

rangeHStart

The median of off-diagonal of distance matrix multiplied by rangeHStart will be used as the initial values for optimization of hyper parameters.

saveName

When drawing any plot, you can save plots in png format. In saveName, you should substitute the name you want to save. When saveName = NULL, the plot is not saved.

saveStyle

This argument specifies how to save the plot of phylogenetic tree. The function offers 'png', 'pdf', 'jpg', and 'tiff'.

plotWhichMDS

We will show the MDS (multi-dimensional scaling) plot, and this argument is a vector of two integers specifying that will define which MDS dimension will be plotted. The first and second integers correspond to the horizontal and vertical axes, respectively.

colConnection

A vector of two integers or characters specifying the colors of connection between nodes for the original and complemented haplotypes, respectively.

ltyConnection

A vector of two characters specifying the line types of connection between nodes for the original and complemented haplotypes, respectively.

lwdConnection

A vector of two integers specifying the line widths of connection between nodes for the original and complemented haplotypes, respectively.

pchBase

A vector of two integers specifying the plot types for the positive and negative genotypic values respectively.

colCompBase

A vector of two integers or characters specifying color of complemented haplotypes for the positive and negative genotypic values respectively.

colHaploBase

A vector of integers or characters specifying color of original haplotypes for the positive and negative genotypic values respectively. The length of the vector should equal to the number of subpopulations.

cexMax

A numeric specifying the maximum point size of the plot.

cexMin

A numeric specifying the minimum point size of the plot.

ggPlotNetwork

If TRUE, the function will return the ggplot version of haplotype network. It offers the precise information on subgroups for each haplotype.

cexMaxForGG

A numeric specifying the maximum point size of the plot for the ggplot version of haplotype network, relative to the range of x and y-axes (0 < cexMaxForGG <= 1).

cexMinForGG

A numeric specifying the minimum point size of the plot for the ggplot version of haplotype network, relative to the range of x and y-axes (0 < cexMaxForGG <= 1).

alphaBase

alpha (parameter that indicates the opacity of a geom) for original haplotype with positive / negative effects. alpha for complemented haplotype will be same as the alpha for original haplotype with negative effects.

verbose

If this argument is TRUE, messages for the current steps will be shown.

Value

A list / lists of

$haplotypeInfo

A list of haplotype information with

$haploCluster

A vector indicating each individual belongs to which haplotypes.

$haploMat

A n x h matrix where n is the number of genotypes and h is the number of haplotypes.

$haploBlock

Marker genotype of haplotype block of interest for the representing haplotypes.

$subpopInfo

The information of subpopulations.

$pValChi2Test

A p-value of the chi-square test for the dependency between haplotypes & subpopulations. If 'chi2Test = FALSE', 'NA' will be returned.

$mstResults

A list of estimated results of MST / MSN / RMST:

$mstRes

Estimated results of MST / MSN / RMST for the data including original haplotypes.

$mstResComp

Estimated results of MST / MSN / RMST for the data including both original and complemented haplotype.

$distMats

A list of distance matrix:

$distMat

Distance matrix between haplotypes.

$distMatComp

Distance matrix between haplotypes (including unobserved ones).

$laplacianMat

Laplacian matrix between haplotypes (including unobserved ones).

$gvTotal

Estimated genotypic values by kernel regression for each haplotype.

$gvTotalForLine

Estimated genotypic values by kernel regression for each individual.

$minuslog10p

-log_{10}(p) for haplotype block of interest. p is the p-value for the siginifacance of the haplotype block effect.

$hOpts

Optimized hyper parameters, hOpt1 & hOpt2.

$EMMResults

A list of estimated results of kernel regression:

$EM3Res

Estimated results of kernel regression for the estimation of haplotype effects. (1st step)

$EMMRes

Estimated results of kernel regression for the estimation of haplotype effects of nodes. (2nd step)

$EMM0Res

Estimated results of kernel regression for the null model.

$clusterNosForHaplotype

A list of cluster Nos of individuals that belong to each haplotype.


RAINBOWR documentation built on July 4, 2024, 1:11 a.m.