View source: R/DistanceMatrix.R
Calculates a distance matrix for an XStringSet. Each element of the distance matrix corresponds to the dissimilarity between two sequences in the XStringSet.
myXStringSet | An
type | Character string indicating the type of output desired. This should be either
includeTerminalGaps | Logical specifying whether or not to include terminal gaps ("-" or "." characters on each end of the sequence) into the calculation of distance.
penalizeGapLetterMatches | Logical specifying whether or not to consider gap-to-letter matches as mismatches. If
penalizeGapGapMatches | Logical specifying whether or not to consider gap-to-gap matches as mismatches. If
correction | The substitution model used for distance correction. This should be (an abbreviation of) either
processors | The number of processors to use, or
verbose | Logical indicating whether to display progress.
The uncorrected distance matrix represents the Hamming distance between each pair of sequences in myXStringSet. Ambiguity can be represented using the characters of the IUPAC_CODE_MAP for DNAStringSet and RNAStringSet inputs, or using the AMINO_ACID_CODE for an AAStringSet input. For example, the distance between an 'N' and any other nucleotide base is zero. The letters B (N or D), J (I or L), Z (Q or E), and X (any letter) are degenerate in the AMINO_ACID_CODE.
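The idea of an uncorrected distance that tolerates ambiguity can be illustrated with a minimal base R sketch. The helper uncorrected_distance below is hypothetical and is not how DistanceMatrix is implemented; it assumes two already-aligned strings of equal length and handles only the 'N' ambiguity code rather than the full IUPAC_CODE_MAP.

```r
# Hypothetical helper (not part of any package): fraction of aligned
# positions that differ, treating 'N' as compatible with any letter.
uncorrected_distance <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))  # inputs must already be aligned
  # a position counts as a mismatch only if neither letter is 'N'
  mismatch <- x != y & x != "N" & y != "N"
  sum(mismatch) / length(x)
}

uncorrected_distance("ACTG", "ACCG")  # one difference in four positions
uncorrected_distance("ANTG", "ACTG")  # 'N' is compatible with 'C'
```

Dividing the mismatch count by the alignment length gives a fraction between 0 and 1, which is the scale used by the distances in the example output.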
If includeTerminalGaps = FALSE, then terminal gaps ("-" or "." characters) are not included in sequence length. This can be faster since only the positions common to each pair of sequences are compared. Sequences with no overlapping region in the alignment are given a value of NA, unless includeTerminalGaps = TRUE, in which case distance is 100%.
The arguments penalizeGapGapMatches and penalizeGapLetterMatches specify whether to penalize these special mismatch types and include them in the total length when calculating distance. Both "-" and "." characters are interpreted as gaps. The default behavior is to calculate distance as the fraction of positions that differ across the region of the alignment shared by both sequences (not including gap-to-gap matches).
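The arithmetic behind the terminal-gap and gap-to-gap options can be worked through with a small base R sketch. The helper gap_aware_distance is hypothetical and only mimics the behavior described above for a single pair of aligned strings; as simplifying assumptions, it always counts gap-to-letter pairs as mismatches and recognizes only 'N' as an ambiguity code.

```r
# Hypothetical helper, not DECIPHER's implementation.
# Assumptions: gap-to-letter pairs always count as mismatches,
# and 'N' is compatible with any letter.
gap_aware_distance <- function(a, b, includeTerminalGaps = FALSE,
                               penalizeGapGapMatches = FALSE) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))  # inputs must already be aligned
  gap_x <- x %in% c("-", ".")        # both characters are treated as gaps
  gap_y <- y %in% c("-", ".")
  keep <- rep(TRUE, length(x))
  if (!includeTerminalGaps) {
    # restrict the comparison to the region covered by both sequences
    if (all(gap_x) || all(gap_y)) return(NA_real_)
    start <- max(min(which(!gap_x)), min(which(!gap_y)))
    end <- min(max(which(!gap_x)), max(which(!gap_y)))
    if (start > end) return(NA_real_)  # no overlapping region
    keep <- seq_along(x) >= start & seq_along(x) <= end
  }
  gap_gap <- gap_x & gap_y
  # unpenalized gap-to-gap positions are dropped from the total length
  if (!penalizeGapGapMatches) keep <- keep & !gap_gap
  mismatch <- gap_gap | (gap_x != gap_y) |
    (!gap_x & !gap_y & x != y & x != "N" & y != "N")
  sum(mismatch & keep) / sum(keep)
}

# shared internal region only: 1 mismatch in 4 positions
gap_aware_distance("ANGCT-", "-ACCT-")
# whole alignment, gap-to-gap penalized: 3 mismatches in 6 positions
gap_aware_distance("ANGCT-", "-ACCT-", includeTerminalGaps = TRUE,
                   penalizeGapGapMatches = TRUE)
# whole alignment, gap-to-gap ignored: 2 mismatches in 5 positions
gap_aware_distance("ANGCT-", "-ACCT-", includeTerminalGaps = TRUE)
# no overlapping region: NA by default, 100% with terminal gaps included
gap_aware_distance("AC--", "--GT")
gap_aware_distance("AC--", "--GT", includeTerminalGaps = TRUE)
```

The first three calls reproduce the fractions quoted in the comments of the Examples section (0.25, 0.50 and 0.40) for the same pair of sequences.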
The elements of the distance matrix can be referenced by dimnames corresponding to the names of the XStringSet. Additionally, an attribute named "correction" specifying the method of correction used can be accessed using the function attr.
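Lookup by name and retrieval of the attribute can be sketched in base R with a hand-built stand-in for the result. The matrix below is not produced by DistanceMatrix: its values are copied from the AAStringSet example output, and the sequence names seq1 to seq3 are hypothetical.

```r
# Stand-in for a DistanceMatrix result (values copied from the AAStringSet
# example output; the sequence names are hypothetical).
ids <- c("seq1", "seq2", "seq3")
d <- matrix(c(0.00, 0.25, 1.00,
              0.25, 0.00, 0.75,
              1.00, 0.75, 0.00),
            nrow = 3, byrow = TRUE, dimnames = list(ids, ids))
attr(d, "correction") <- "none"

d["seq1", "seq2"]      # look up a distance by sequence name
attr(d, "correction")  # retrieve the recorded correction method
```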
If type is "matrix", a symmetric matrix where each element is the distance between the sequences referenced by the respective row and column. The dimnames of the matrix correspond to the names of the XStringSet.
If type is "dist", an object of class "dist" that contains one triangle of the distance matrix as a vector. Since the distance matrix is symmetric, storing only one triangle is more memory efficient.
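The storage difference between the two output types can be counted directly. This base R sketch uses stats::as.dist on a hand-built symmetric matrix as a stand-in for the two return types; the values are copied from the AAStringSet example output and nothing here calls DistanceMatrix.

```r
# Hand-built symmetric matrix standing in for type = "matrix" output.
m <- matrix(c(0.00, 0.25, 1.00,
              0.25, 0.00, 0.75,
              1.00, 0.75, 0.00),
            nrow = 3, byrow = TRUE)
d <- as.dist(m)  # keeps only the lower triangle, as a "dist" object

n <- nrow(m)
length(m)        # n^2 stored values for the full matrix
length(d)        # n * (n - 1) / 2 stored values for one triangle
class(d)
```

For n sequences the full matrix holds n^2 values while one triangle holds n(n - 1)/2, so the saving approaches a factor of two as n grows.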
Erik Wright eswright@pitt.edu
library(DECIPHER)
# example of using the defaults:
dna <- DNAStringSet(c("ACTG", "ACCG"))
dna
DistanceMatrix(dna)
# changing the output type to "dist":
d <- DistanceMatrix(dna, type="dist")
d
length(d) # minimal memory space required
m <- as.matrix(d)
length(m) # more memory space required
# supplying an AAStringSet
aa <- AAStringSet(c("ASYK", "ATYK", "CTWN"))
aa
DistanceMatrix(aa)
# defaults compare intersection of internal ranges:
dna <- DNAStringSet(c("ANGCT-", "-ACCT-"))
dna
d <- DistanceMatrix(dna)
# d[1,2] is 1 base in 4 = 0.25
# compare the entire sequence, including gaps:
dna <- DNAStringSet(c("ANGCT-", "-ACCT-"))
dna
d <- DistanceMatrix(dna, includeTerminalGaps=TRUE,
penalizeGapGapMatches=TRUE)
# d[1,2] is now 3 bases in 6 = 0.50
# compare union of internal positions, without terminal gaps:
dna <- DNAStringSet(c("ANGCT-", "-ACCT-"))
dna
d <- DistanceMatrix(dna, includeTerminalGaps=TRUE,
penalizeGapGapMatches=FALSE)
# d[1,2] is now 2 bases in 5 = 0.40
# gap ("-") and unknown (".") characters are interchangeable:
dna <- DNAStringSet(c("ANGCT.", ".ACCT-"))
dna
d <- DistanceMatrix(dna, includeTerminalGaps=TRUE,
penalizeGapGapMatches=FALSE)
# d[1,2] is still 2 bases in 5 = 0.40
DNAStringSet object of length 2:
width seq
[1] 4 ACTG
[2] 4 ACCG
================================================================================
Time difference of 0 secs
[,1] [,2]
[1,] 0.00 0.25
[2,] 0.25 0.00
attr(,"correction")
[1] "none"
================================================================================
Time difference of 0 secs
1 2
1 0.00 0.25
2 0.25 0.00
[1] 1
[1] 4
AAStringSet object of length 3:
width seq
[1] 4 ASYK
[2] 4 ATYK
[3] 4 CTWN
================================================================================
Time difference of 0 secs
[,1] [,2] [,3]
[1,] 0.00 0.25 1.00
[2,] 0.25 0.00 0.75
[3,] 1.00 0.75 0.00
attr(,"correction")
[1] "none"
DNAStringSet object of length 2:
width seq
[1] 6 ANGCT-
[2] 6 -ACCT-
================================================================================
Time difference of 0 secs
DNAStringSet object of length 2:
width seq
[1] 6 ANGCT-
[2] 6 -ACCT-
================================================================================
Time difference of 0 secs
DNAStringSet object of length 2:
width seq
[1] 6 ANGCT-
[2] 6 -ACCT-
================================================================================
Time difference of 0 secs
DNAStringSet object of length 2:
width seq
[1] 6 ANGCT.
[2] 6 .ACCT-
================================================================================
Time difference of 0 secs