Score a New Sequence Given an EMM
Description
Calculates a score of how likely it is that a new sequence was generated by the same process as the sequences used to build the EMM.
Usage
1 2 3 4 5 6 7 8 9 10  ## S4 method for signature 'EMM,matrix'
score(x, newdata, method = c("product", "log_sum", "sum",
"log_odds", "supported_transitions", "supported_states",
"sum_transitions", "log_loss", "likelihood", "log_likelihood", "AIC"),
match_cluster = "exact", prior=TRUE, normalize=TRUE,
initial_transition = FALSE, threshold = NA)
## S4 method for signature 'EMM,EMM'
score(x, newdata, method = c("product", "log_sum", "sum",
"supported_transitions"), match_cluster = "exact", prior=TRUE,
initial_transition = FALSE)

Arguments
x 
an 
newdata 
sequence or another 
method 
method to calculate the score (see details) 
match_cluster 
do the new observations have to fall within
the threshold of the cluster ( 
prior 
add one to each transition count. This is equal to start with a count of one for each transition, i.e. initially all transitions are equally likely. It prevents the product of probabilities to be zero if a transition was never observed. 
normalize 
normalize the score by the length of the sequence. 
initial_transition 
include the initial transition in the computation? 
threshold 
minimum count threshold used by supported transitions and supported states. 
Details
The scores for a new sequence x of length l can be computed
by the following methods. For match_cluster="exact"
or "nn"
:
 "product"

Product of transition probabilities along the path of x in the model. A single missing transition (transition probability of zero) will result in a score of 0. Use
prior
to avoid this.S_product = prod(a_s(i),s(i+1))^(1/(l1))
where a_s(i),s(j) is the transition probability between the state representing positions i and j in the sequence.
 "sum"

Sum of transition probabilities along the path of x in the model.
S_sum = 1/(l1) sum(a_s(i),s(i+1))
 "log_sum"

Sum of the log of the transition probabilities along the path of x in the model. The ranking of the scores is equivalent to the product of probabilities, however, the calculation is more reliable since the product of probabilities might become a very small number.
A single missing transition (transition probability of zero) will result in a score of neg. infinity. Use
prior
to avoid this.S_sum = 1/(l1) sum(log(a_s(i),s(i+1)))
 "supported_transitions"
Fraction of transitions in the new sequence x supported (present) in the model after assigning each data point in x to a state in the model.
S_supported_transitions = 1/(l1) sum(I(a_s(i),s(i+1)))
 "supported_states"
Fraction of points in the new sequence x for which a state (cluster) exists in the model.
match_cluster
is always"exact"
because for"nn"
this measure would always give 1. Note that this measure ignores transition information.If threshold is given, then only states with a count greater than the given threshold are counted as supported.
 "sum_transitions"
Sum of the counts on the edges in the model on the path of sequence x normalized by the total number of transition counts in the model.
S_sum_transitions = 1/(l1) sum(c_s(i),s(i+1))
where c_s(i),s(i+1) is the transition count between the state representing positions i and j in the sequence.
If threshold is given, then only transitions with a count greater than the given threshold are counted as supported.
 "likelihood", "log_likelihood"
The likelihood of the model given the new data is the unnormalized product score (product of transition probabilities).
 "log_loss"
The average log loss is defined as
sum(log2(a_s(i),s(i+1)))/(l1)
It represents the average compression rate of the new sequence given the model.
 "AIC"
Akaike Information Criterion corrected for finite sample size.
2k  2log(L) 2k(k1)/(nk1)
where n=l1 and k is the model complexity measured by the number of nonzero entries in the transition matrix. We use the likelihood of the model given by the proportion of supported transitions. AIC can be used for model selection where the smallest value indicates the preferred model.
where x_i represents the ith data point in the new sequence, a(i,j) is the transition probability from state i to state j in the model, s(i) is the state the ith data point (x_i) in the new sequence is assigned to. I(v) is an indicator function which is 0 for v=0 and 1 otherwise.
For match_cluster="weighted"
:
 "product"

Weighted version of the product of probabilities. The weight is the similarity between a new data point and the state in the model it is assigned to.
P_weighted_product = prod(simil(x_i,s(i))simil(x_i,s(i+1)) a_s(i),s(i+1))^(1/(l1))
 "sum"

Weighted version of the sum of probabilities.
S_weighted_sum = 1/(l1) sum(simil(x_i,s(i))simil(x_i,s(i+1)) a_s(i),s(i+1))
 "log_sum"

Weighted version of the sum of the log of probabilities.
S_sum = 1/(l1) sum(simil(x_i,s(i))simil(x_i,s(i+1)) a_s(i),s(i+1))
 "supported_states"

Same as
"supported_states"
but instead of counting the supported states, the similarity simil(x_i,s(i)) is used as a weight. Threshold is not implemented.
where simil(.) is a modified and normalized similarity function given by simil(x,s)=11/(1+exp((d(x,s)/t1.5)/.2)) where d is the distance measure and t is the threshold that was used to create the model.
Value
A scalar score value.
See Also
transition
to access transition probabilities
and find_clusters
for assigning observations to states/clusters.
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48  data("EMMsim")
emm < EMM(threshold=.2)
emm < build(emm, EMMsim_train)
score(emm, EMMsim_test) # default is method "product"
### create shuffled data (destroy temporal relationship)
### and create noisy data
test_shuffled < EMMsim_test[sample(1:nrow(EMMsim_test)),]
test_noise < jitter(EMMsim_test, amount=.3)
### helper for plotting
mybars < function(...) {
oldpar < par(mar=c(5,10,4,2))
ss < rbind(...)
barplot(ss[,ncol(ss):1], xlim=c(1,4), beside=TRUE,
horiz=TRUE, las=2,
legend = rownames(ss))
par(oldpar)
}
### compare various scores
methods < c("product",
"sum",
"log_sum",
"supported_states",
"supported_transitions",
"sum_transitions",
"log_loss",
"likelihood")
### default is exact matching
clean < sapply(methods, FUN=function(m) score(emm, EMMsim_test, method=m))
shuffled < sapply(methods, FUN=function(m) score(emm, test_shuffled, method=m))
noise < sapply(methods, FUN=function(m) score(emm, test_noise, method=m))
mybars(shuffled, noise, clean)
### weighted matching is better for noisy data
clean < sapply(methods, FUN=function(m) score(emm, EMMsim_test, method=m,
match="weighted"))
shuffled < sapply(methods, FUN=function(m) score(emm, test_shuffled, method=m,
match="weighted"))
noise < sapply(methods, FUN=function(m) score(emm, test_noise, method=m,
match="weighted"))
mybars(shuffled, noise, clean)

Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker. Vote for new features on Trello.