    knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
    library(clusterhap)

The *clusterhap* function identifies haplotypes within a QTL
(Quantitative Trait Locus). A haplotype is a combination of SNPs
(Single Nucleotide Polymorphisms) within the QTL. The function
groups together all individuals of a population that share the same
haplotype. Each group contains the individuals with the same allele at
every SNP for which they have data, whether or not some of their SNPs
are missing. Thus, *clusterhap* groups individuals that, if imputed,
would have a non-zero probability of carrying the same alleles across
the entire sequence of SNPs.
Unlike packages such as *alleHap*, *hapassoc* and *hsphase*, *clusterhap* does not impute missing data automatically.

The function returns five reports:
a. **haplotypes** (all haplotypes identified),
b. **haplotypes_frequencies** (the frequency of each haplotype),
c. **duplicates** (individuals assigned to more than one haplotype),
d. **haplotypes_result** (a matrix with the individuals assigned to each haplotype, sorted by haplotype); the function also plots this matrix,
e. **underterminates** (individuals not assigned to any haplotype).

*Note*: the user must decide which haplotype to keep for the
individuals that were assigned to more than one haplotype, and
must then manually remove the duplicate assignments from this
matrix.
One possible decision criterion is to look at the haplotype
frequencies and select the most frequent haplotype.
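
The frequency criterion mentioned in the note could be applied along the following lines. This is only a sketch: the `assigned` table (one row per individual/haplotype pair) is a hypothetical layout chosen for illustration, not the format returned by *clusterhap*, and the frequencies here are counted over all assignments, duplicates included.

```r
# Hypothetical assignment table: one row per (individual, haplotype) pair.
# This layout is an assumption for illustration, not clusterhap's output format.
assigned <- data.frame(
  ind = c("ind.1", "ind.2", "ind.2", "ind.3", "ind.5"),
  haplotype = c(1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

# Frequency of each haplotype among all assignments.
freq <- table(assigned$haplotype)
assigned$freq <- as.vector(freq[as.character(assigned$haplotype)])

# For each individual keep the row with the most frequent haplotype
# (ties are broken by keeping the first row after sorting).
ordered <- assigned[order(assigned$ind, -assigned$freq), ]
resolved <- ordered[!duplicated(ordered$ind), c("ind", "haplotype")]
resolved
```

Other criteria (for example, prior knowledge about the population) may be more appropriate; the choice remains with the user.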

*clusterhap* uses a simple numeric matrix of individuals (rows) and
SNPs (columns). Given a QTL, the user should recode every SNP
allele of each individual as a number: the bases A, C, G and T
become 1, 2, 3 and 4, respectively. A zero (0) is assigned to
every SNP with missing data (NA). Therefore, the entries of the
matrix are 0, 1, 2, 3 or 4.
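
The recoding can be done in base R before calling the function. The sketch below is not part of the package: `encode_alleles` is a hypothetical helper, and it assumes that the first column holds the individual identifier and that the remaining columns hold single-letter alleles or NA.

```r
# Hypothetical helper (not part of clusterhap): recode A/C/G/T as 1/2/3/4
# and missing values (NA) as 0. Assumes column 1 is the individual ID.
encode_alleles <- function(geno) {
  codes <- c(A = 1, C = 2, G = 3, T = 4)
  snps <- as.matrix(geno[, -1, drop = FALSE])
  num <- matrix(unname(codes[as.character(snps)]), nrow = nrow(snps),
                dimnames = dimnames(snps))
  num[is.na(num)] <- 0
  data.frame(ind = geno[[1]], num, stringsAsFactors = FALSE)
}

raw <- data.frame(
  ind   = c("ind.1", "ind.2"),
  SNP.1 = c("A", NA),
  SNP.2 = c("C", "C"),
  SNP.3 = c("G", NA),
  stringsAsFactors = FALSE
)
encoded <- encode_alleles(raw)
encoded
```

Any allele code other than A, C, G or T is also turned into 0 by this sketch, so unexpected symbols should be checked beforehand.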

The *clusterhap* function first builds the set *Y*, composed of all
the complete SNP sequences of the QTL; its elements are called
haplotypes. To do so, the function counts the zeros in each
individual's vector. When there are none, the individual has no
missing data at any SNP and its sequence is added as an element
of set *Y*.
Once set *Y* is built, *clusterhap* assigns to each haplotype
every vector of the population that contains the same numbers
at all of its nonzero coordinates, i.e. *clusterhap* associates
with each complete SNP sequence of the QTL all individuals that
carry the same allele at every locus for which they have data.
For example, *clusterhap* assigns individual 1, defined by
[0,2,0,1,1,4,2], to haplotype 1, defined by [1,2,3,1,1,4,2],
since they match at all nonzero coordinates. On the other hand,
*clusterhap* does not associate individual 2, defined by
[0,2,0,1,1,4,3], with this haplotype, since the last coordinate
differs: in the original alleles, haplotype 1 has a C at the last
SNP and individual 2 a G. Therefore, even after imputation,
individual 2 could never coincide with haplotype 1. An individual
may be associated with more than one haplotype, or with none, in
which case it is labeled as indeterminate. Most individuals,
however, are uniquely determined and associated with a single
haplotype.
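
The matching rule described above can be written as a short predicate. This is a sketch for illustration, not the package's code: `matches_haplotype` is a hypothetical name, and the vectors are the ones used in the example.

```r
# Hypothetical helper: TRUE when the individual agrees with the haplotype
# at every coordinate where the individual has data (nonzero).
matches_haplotype <- function(individual, haplotype) {
  observed <- individual != 0
  all(individual[observed] == haplotype[observed])
}

haplotype_1  <- c(1, 2, 3, 1, 1, 4, 2)
individual_1 <- c(0, 2, 0, 1, 1, 4, 2)
individual_2 <- c(0, 2, 0, 1, 1, 4, 3)

matches_haplotype(individual_1, haplotype_1)  # TRUE
matches_haplotype(individual_2, haplotype_1)  # FALSE
```

With this definition an individual with no data at all would match every haplotype, because there is no observed coordinate that could disagree.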

The `sim_qtl` data.frame included with the package has simulated
results for 7 SNPs on 5 individuals.

ind   | SNP.1 | SNP.2 | SNP.3 | SNP.4 | SNP.5 | SNP.6 | SNP.7
------|-------|-------|-------|-------|-------|-------|------
ind.1 | A     | C     | G     | A     | A     | T     | C
ind.2 | NA    | C     | NA    | A     | A     | T     | C
ind.3 | C     | C     | T     | A     | A     | T     | C
ind.4 | NA    | C     | NA    | A     | A     | T     | G
ind.5 | C     | NA    | NA    | A     | A     | T     | C

It is transformed to:

ind   | SNP.1 | SNP.2 | SNP.3 | SNP.4 | SNP.5 | SNP.6 | SNP.7
------|-------|-------|-------|-------|-------|-------|------
ind.1 | 1     | 2     | 3     | 1     | 1     | 4     | 2
ind.2 | 0     | 2     | 0     | 1     | 1     | 4     | 2
ind.3 | 2     | 2     | 4     | 1     | 1     | 4     | 2
ind.4 | 0     | 2     | 0     | 1     | 1     | 4     | 3
ind.5 | 2     | 0     | 0     | 1     | 1     | 4     | 2

Only individuals 1 and 3 have a complete SNP sequence for this QTL. Therefore:

Y = {H.1 = [1,2,3,1,1,4,2], H.2 = [2,2,4,1,1,4,2]}

Individuals 1 and 2 are assigned to H.1. The first has the
complete sequence, and the second has the same alleles at all
of its nonzero coordinates. No other individual is assigned to
this haplotype, since all remaining individuals differ from it
in at least one SNP. Following the same reasoning,
individuals 2, 3 and 5 are assigned to H.2. Notice that
individual 2 is assigned to both haplotypes, because its nonzero
SNPs coincide with both H.1 and H.2. On the other hand,
individual 4 is not assigned to any haplotype, because its SNP.7
does not match any element of set Y. The *clusterhap*
function classifies individual 4 as indeterminate.
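
The assignments above can be reproduced by hand in base R. The sketch below rebuilds the encoded table as a matrix so that it does not depend on the package; the object names (`membership`, `duplicates`, `indeterminate`) are illustrative and are not the names of the reports returned by *clusterhap*.

```r
# The encoded example data, rebuilt by hand so the sketch does not
# depend on the package being installed.
Q <- rbind(
  ind.1 = c(1, 2, 3, 1, 1, 4, 2),
  ind.2 = c(0, 2, 0, 1, 1, 4, 2),
  ind.3 = c(2, 2, 4, 1, 1, 4, 2),
  ind.4 = c(0, 2, 0, 1, 1, 4, 3),
  ind.5 = c(2, 0, 0, 1, 1, 4, 2)
)

# Set Y: the distinct rows with no missing data.
complete <- Q[rowSums(Q == 0) == 0, , drop = FALSE]
Y <- complete[!duplicated(complete), , drop = FALSE]
rownames(Y) <- paste0("H.", seq_len(nrow(Y)))

# membership[i, j] is TRUE when individual i agrees with haplotype j
# at every nonzero coordinate of the individual.
membership <- sapply(seq_len(nrow(Y)), function(j) {
  apply(Q, 1, function(ind) all(ind[ind != 0] == Y[j, ind != 0]))
})
colnames(membership) <- rownames(Y)
membership

# Individuals matching several haplotypes, and individuals matching none.
duplicates    <- rownames(Q)[rowSums(membership) > 1]
indeterminate <- rownames(Q)[rowSums(membership) == 0]
```

Here `duplicates` contains ind.2 and `indeterminate` contains ind.4, in agreement with the reasoning above.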

    library(clusterhap)
    data("sim_qtl")
    clusterhap(sim_qtl, Print=TRUE)

The `rice_qtl` data.frame is also included with the package. The
Uruguayan Rice Breeding GWAS (URiB) population is composed of
637 genotypes from INIA's Rice Breeding Program (IRBP).
In this example we use the information from one of its
populations, which has 324 indica lines and 2 indica
cultivars (El Paso 144 and INIA Olimar). The population was
genotyped with SNPs obtained by genotyping-by-sequencing
(GBS). The data correspond to a QTL for grain quality.

    data("rice_qtl")
    clusterhap(rice_qtl)

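**1.** The *clusterhap* function counts the number of null (zero) components (*cq*) of each individual in the data set.

    H_data <- x          # x is the data frame
    cq <- NULL
    Q <- H_data[, -1]    # drop the first column, keep the SNP columns
    for (i in 1:nrow(Q)) {
      for (j in 1:nrow(Y)) {
        cq <- rowSums(Q[i, ] == 0)   # zeros (missing SNPs) of individual i
        # (excerpt; the loop body continues in steps 2 and 3)
      }
    }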

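**2.** Next, *clusterhap* subtracts each haplotype (each element of set *Y*) from the individual, coordinate by coordinate, and counts the number of zeros in the difference (*cr*).

    y <- which(c.Q == 0)      # c.Q is defined earlier in the function (not shown in this excerpt)
    b <- Q[y, ]               # rows selected by c.Q == 0
    Y <- b[!duplicated(b), ]  # set Y: distinct rows of b
    for (i in 1:nrow(Q)) {
      for (j in 1:nrow(Y)) {
        cq <- rowSums(Q[i, ] == 0)
        w <- Q[i, ] - Y[j, ]       # coordinate-wise difference
        cr <- rowSums(w == 0)      # coordinates where both vectors agree
        # (excerpt; the loop body continues in step 3)
      }
    }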

**3.** If the sum *cq + cr* equals the number of SNPs in the QTL, the individual is associated with the haplotype. The difference vector has a zero at every coordinate where the individual and the haplotype contain the same number, and a nonzero value wherever they differ. In particular, it is nonzero wherever the individual has a zero (missing data), because the haplotype is a complete SNP sequence with no missing data. The sum *cq + cr* can therefore equal the number of SNPs only when the individual and the haplotype differ exclusively at the individual's null coordinates, i.e. when every non-null coordinate of the individual coincides with the haplotype. Hence, if the individual's missing SNPs were imputed, the individual would have a non-zero probability of matching each haplotype assigned to it by the function. Individuals assigned to no haplotype are classified as indeterminate.

    # hp, hp.1, hplq and id are initialised earlier in the function
    for (i in 1:nrow(Q)) {
      for (j in 1:nrow(Y)) {
        cq <- rowSums(Q[i, ] == 0)
        w <- Q[i, ] - Y[j, ]
        cr <- rowSums(w == 0)
        zeros <- cq + cr
        if (zeros == ncol(Q)) {         # individual i matches haplotype j
          hp <- cbind(hp, i)
          hp.1 <- cbind(hp.1, j)
          hpl <- cbind(Q[i, ], j)
          hpl1 <- cbind(id[i], hpl)
          hplq <- rbind(hplq, hpl1)
        }
      }
    }
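
The counting argument of step 3 can be checked on plain vectors. The sketch below is a simplified restatement, not the package's code: `count_rule` is a hypothetical name, and it relies on the haplotype containing no zeros, as every element of set *Y* does.

```r
# Counting rule: an individual matches a haplotype when
# (zeros in the individual) + (zeros in individual - haplotype)
# equals the number of SNPs. Assumes the haplotype has no zeros.
count_rule <- function(individual, haplotype) {
  cq <- sum(individual == 0)               # missing SNPs in the individual
  cr <- sum(individual - haplotype == 0)   # coordinates where both agree
  cq + cr == length(individual)
}

haplotype   <- c(1, 2, 3, 1, 1, 4, 2)
matching    <- c(0, 2, 0, 1, 1, 4, 2)  # cq = 2, cr = 5, sum = 7
mismatching <- c(0, 2, 0, 1, 1, 4, 3)  # cq = 2, cr = 4, sum = 6

count_rule(matching, haplotype)     # TRUE
count_rule(mismatching, haplotype)  # FALSE
```

Because a missing coordinate of the individual is 0 and the haplotype is never 0 there, that coordinate is counted in *cq* but never in *cr*, so no coordinate is counted twice.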

Burkett, K. *et al*. (2006) hapassoc:
Software for Likelihood Inference of Trait Associations
with SNP Haplotypes and Other Attributes. *J Stat Soft*,
**16**, 1-19.

Ferdosi, M. H. *et al*. (2014) hsphase: an R package for
pedigree reconstruction, detection of recombination events,
phasing and imputation of half-sib family groups.
*BMC Bioinformatics*, **15**, 172.

Medina-Rodriguez, N. and Santana, A. (2015) alleHap: Allele Imputation and Haplotype Reconstruction from Pedigree Databases. R package version 0.9.2.
