#' @title dnamatch2
#' @author Oyvind Bleka <oyvble.at.hotmail.com>
#' @description dnamatch2 is a function for performing large-scale DNA database searches between trace samples (including mixture profiles) and between trace samples and reference samples.
#' @details dnamatch2 automatically imports DNA profiles in GeneMapper format and feeds them into a structure for efficient comparison matching. Before the comparisons are carried out, every trace sample is optionally filtered: alleles with peak heights below a specified threshold (threshHeight), or alleles in stutter position with peak heights below a specified threshold (threshStutt), are removed. The comparison match algorithm first counts the number of alleles of the reference profiles which are included in each of the trace samples. Then a candidate list of matching situations is created, consisting of the comparisons which exceed a given threshold (threshMAC). Here all candidate matches having the same CID (case ID) can optionally be removed. If wanted, candidate matches with a time difference (based on last-edited file dates) outside a specified window (timediff) can also be removed. The second part of the comparison match algorithm first estimates the number of contributors of the remaining trace samples in the candidate list using the likelihood function of the qualitative model (likEvid in forensim), based on the AIC. Then the likelihood ratio for all comparisons (between all reference and trace samples) is calculated based on the same qualitative model, and an updated candidate list is created by keeping all comparisons with LR greater than a specified threshold (threshLR[1]). Last, the likelihood ratio for all remaining comparisons in the candidate list is calculated based on the quantitative model (likEvidMLE in euroformix). The final match candidates are the comparisons which have an LR greater than a specified threshold (threshLR[2]). Finally, detailed results of these match candidates are automatically printed to files.
#'
#' To search between trace samples, a major component of each trace sample is extracted (based on relative peak heights): the allele with the largest peak height always belongs to the major. If the second largest allele has a ratio (relative to the largest allele) above a specified threshold (threshMaj), the second allele is part of the major profile as well. If the ratio between the third and second largest allele also exceeds this threshold, no major profile is assigned for the marker.
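#'
#' As an illustrative example of the major extraction (hypothetical peak heights, threshMaj=0.6):
#' \preformatted{
#' # Heights 1000/700/150 at a marker:
#' #   700/1000 = 0.70 > 0.6  -> second allele joins the major
#' #   150/700  = 0.21 < 0.6  -> third allele does not; major = two largest alleles
#' # Heights 1000/700/650 at a marker:
#' #   650/700  = 0.93 > 0.6  -> ambiguous; no major assigned for this marker
#' }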
#'
#' It is optional whether matches between trace samples should be considered (boolean betweensamples). If this is turned off, the speed is drastically increased (because far fewer comparisons are performed).
#'
#' Encoding idea: reference alleles are coded with prime numbers, while stain alleles/heights are coded with their original values, collapsed into "/"-separated strings per marker.
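#'
#' A minimal sketch of the prime encoding (illustrative allele names and primes only): a genotype is stored as the product of the primes assigned to its alleles, so testing whether a profile carries an allele reduces to a modulo test:
#' \preformatted{
#' prim <- c(2, 3, 5)         # primes assigned to alleles "12", "13", "14"
#' geno <- prim[1] * prim[3]  # genotype (12,14) encoded as 2*5 = 10
#' geno %% prim[3] == 0       # TRUE: allele "14" is present
#' geno %% prim[2] == 0       # FALSE: allele "13" is absent
#' }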
#'
#' Note 1: Trace samples must have the structured format "SID_RID_CID", separated by underscores (the separator can be changed with argument IDsep). The trace name must be unique and consists of SID=Sample ID, RID=Replicate ID (SID and RID together must be unique across all traces) and CID=Case ID. Hence the IDs must not contain underscores themselves.
#' Note 2: The names of the files containing trace samples must start with pattern specified by variable BIDptrn (BatchID pattern). The variable SIDptrn is required to recognize specific sample type (i.e. it can be used as a filter).
#' Note 3: Marker names of Amelogenin must contain the name "AM" (not case-sensitive).
#'
#' Timestamp="YY-MM-DD-HH-MM-SS" is used as identification key of when the dnamatch2 function was run.
#' Match results are stored as a table in matchfile.csv with the timestamp. Matches which were found earlier in matchfile.csv will not be stored again. The column "Checked" can be used for comments.
#' More details about a given dnamatch2 run are stored in the folder "session" with corresponding timestamp.
#'
#' @param evidfold A folder with stain files (possibly a vector of folders). The full path must be given.
#' @param freqfile A file containing population frequencies for alleles. The full path must be given.
#' @param reffold A folder with reference files (possibly a vector of folders). Default is no references. The full path must be given.
#' @param sameCID Boolean whether matches within same case ID number should be allowed.
#' @param betweensamples A boolean of whether comparisons between trace samples are searched. Default is TRUE.
#' @param Thist Number of days back in time for a batch file (stains) to be imported.
#' @param threshMAC Threshold for the matching allele count (MAC) score. A proportion in [0,1].
#' @param threshLR Threshold for a match. Can be a vector (qual,quan).
#' @param threshHeight Acceptable peak height (rfu) in stains.
#' @param threshStutt Acceptable stutter ratio in stains (relative peak heights).
#' @param threshMaj If the second largest allele has a ratio (relative to the largest allele) above this threshold, the second allele is part of the major profile (used for extracting a major from a mixture). If the ratio between the third and second largest allele also exceeds this threshold, no major is assigned.
#' @param minLocStain Minimum number of loci required in a stain profile for it to be evaluated.
#' @param minLocMaj Minimum number of loci required for a major component (extracted from each stain profile) to be considered.
#' @param pC Assumed drop-in rate per marker (parameter in the LR models).
#' @param lambda Assumed hyperparameter lambda for the peak height drop-in model (parameter in the quantitative model).
#' @param kit The shortname of the kit used for the samples (default is no kit specified). This will use the euroformix degradation model on the evidence samples. NB: Don't use if evidence profiles are run with several kits!
#' @param minFreq Minimum frequency for rare alleles; used to assign frequencies to new alleles. Default is 0.001.
#' @param searchtime An object with format as returned from Sys.time(). Default is Sys.time().
#' @param SIDvec A vector with Sample-ID numbers which are considered in the search (i.e. a search filter).
#' @param BIDvec A vector with Batch-files, which are considered in the search (i.e. a search filter).
#' @param CIDvec A vector with Case-ID numbers which are considered in the search (i.e. a search filter).
#' @param timediff Time difference (in days) allowed between matching reference and target. Default is NULL (not used).
#' @param IDsep Separator sign between SID, RID and CID in SampleName. Default is "_", giving SampleName structure SID_RID_CID.
#' @param BIDptrn Filename structure of stain files to search. For instance BIDptrn="TA-".
#' @param SIDptrn Pattern specifying which sample IDs should be searched. Example: (SID_RID_CID)=("01-S0001_BES00001-14_2014234231"). Here SIDptrn="-S" means that only SIDs containing "-S" are searched. Can be a vector.
#' @param printHistPlots Boolean of whether to show plots of the scores (number of matching alleles or LRs) for each comparison.
#' @param writeScores Writes detailed score information to file. Default is FALSE.
#' @param maxK The maximum number of contributors used in the search. Can be a vector for (qual,quan) methods. Default is c(4,3)
#' @param matchfile The name of the matchfile where multiple searches are stored in one place. Default is "matchfile.csv".
#' @param sessionfold The name of the folder with run sessions. Default is "sessions".
#' @param searchoption Selected search module to carry out search (1=MAC, 2=MAC+QUAL, 3=MAC+QUAL+QUAN)
#' @param nDone Number of required optimizations for euroformix model (contLikMLE)
#' @param ignoreEmptyLoci Whether empty loci should be ignored in the calculation. NB: This should only be FALSE if all evidence profiles contain all markers as specified in the population frequency data (i.e. all evidence profiles contain the same markers)
#' @param searchSubFoldersEvid Whether to extract files from subfolders for evidence profiles
#' @param searchSubFoldersRef Whether to extract files from subfolders for reference profiles
#' @param importEvidFile A file name including a function for reading evidence files (function euroformix::tableReader used if NULL)
#' @param importRefFile A file name including a function for reading reference files (function euroformix::tableReader used if NULL)
#' @references dnamatch2: An open source software to carry out large scale database searches of mixtures using qualitative and quantitative models.
#' @export
#library(roxygen2);roxygenize("dnamatch2")
dnamatch2 <- function (evidfold, freqfile, reffold = NULL, sameCID = FALSE ,betweensamples=TRUE,
Thist = Inf, threshMAC=0.75, threshLR = c(10,100), threshHeight = 200,threshStutt = 0.1, threshMaj = 0.6,
minLocStain=3,minLocMaj=3, pC = 0.05, lambda=0.01,kit=NULL,
minFreq = 0.001, searchtime=Sys.time(),SIDvec=NULL,BIDvec=NULL,CIDvec=NULL,timediff=NULL,IDsep = "_", BIDptrn=NULL, SIDptrn=NULL,
printHistPlots=FALSE,writeScores=FALSE,maxK=c(4,3),matchfile="matchfile.csv",sessionfold="sessions",searchoption=3,
nDone=4, ignoreEmptyLoci=TRUE, searchSubFoldersEvid=FALSE , searchSubFoldersRef=FALSE, importEvidFile=NULL, importRefFile=NULL)
{
#library(euroformix) #v4.0.0 is required
newf0 <- minFreq #5/(2 * N) #allele frequency assigned to new alleles
if (!dir.exists(sessionfold) ) dir.create(sessionfold) #create folder if not existing
#get search stamp as time when function is executed
stamp <- format(searchtime, "%y-%m-%d-%H-%M-%S") #timestamp: Year-month-day-hour-minute-second
#CREATE LOG-FILE (saved in session). User, versions, input for search.
txt = "THIS IS A LOG FOR A dnamatch2 RUN"
txt <- paste0(txt,"\ndnamatch2 version: ",packageVersion("dnamatch2"))
txt <- paste0(txt,"\nOther packages: (euroformix_",packageVersion("euroformix"),")")
txt <- paste0(txt,"\nR-version used: ",R.version.string) #Add R-version used
txt <- paste0(txt,"\nUser: ",Sys.getenv("USERNAME"))
txt <- paste0(txt,"\nCreated: ",Sys.time(),"\n")
txt <- paste0(txt,"\n-------FUNCTION CALL-------\n")
args = names(as.list( match.call() )) #get list of arguments
for(i in 2:length(args)) txt = paste0(txt,args[i],": ", paste0(get(args[i]),collapse="/"),"\n") #Show each arguments
write.table(txt, file = paste0(sessionfold,"/searchLog_",stamp,".csv"), row.names = FALSE, col.names = FALSE, sep = ";", append = FALSE,quote=FALSE)
###############
#HELPFUNCTIONS#
###############
#helpfunction to obtain the evidence/reference file reader functions from selected R-files:
obtainFun = function(Rfile) {
ret = source(Rfile,new.env(parent = globalenv() ),echo=FALSE) #obtain function with source
return( ret$value )
}
#Boolean of whether a string can be converted to a number
isNum = function(x) suppressWarnings({ !is.na(as.numeric(x)) })
#split with multiple separators
strsplit2 <- function(x, spl) {
if (nchar(x) == 0) return("")
txt <- x
for (j in 1:length(spl)) {
txt <- unlist(strsplit(txt, split = spl[j]))
}
return(txt)
}
makematrix <- function(x) { #not necessary if you use X[1,,drop=F]
if (is.null(dim(x))) x <- t(x)
return(x)
}
prim = as.integer(c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31,
37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97,
101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151,
157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211,
223, 227, 229, 233, 239, 241, 251, 257, 263, 269, 271,
277, 281, 283, 293, 307, 311, 313, 317, 331, 337, 347,
349, 353, 359, 367, 373, 379, 383, 389, 397, 401, 409,
419, 421, 431, 433, 439, 443, 449, 457, 461, 463, 467,
479, 487, 491, 499, 503, 509, 521, 523, 541, 547, 557,
563, 569, 571, 577, 587, 593, 599, 601, 607, 613, 617,
619, 631, 641, 643, 647, 653, 659, 661, 673, 677, 683,
691, 701, 709, 719, 727, 733, 739, 743, 751, 757, 761,
769, 773, 787, 797, 809, 811, 821, 823, 827, 829, 839,
853, 857, 859, 863, 877, 881, 883, 887, 907, 911, 919,
929, 937, 941, 947, 953, 967, 971, 977, 983, 991, 997,
1009, 1013, 1019, 1021, 1031, 1033, 1039, 1049, 1051,
1061, 1063, 1069, 1087, 1091, 1093, 1097, 1103, 1109,
1117, 1123, 1129, 1151, 1153, 1163, 1171, 1181, 1187,
1193, 1201, 1213, 1217, 1223, 1229, 1231, 1237, 1249,
1259, 1277, 1279, 1283, 1289, 1291, 1297, 1301, 1303,
1307, 1319, 1321, 1327, 1361, 1367, 1373, 1381, 1399,
1409, 1423, 1427, 1429, 1433, 1439, 1447, 1451, 1453,
1459, 1471, 1481, 1483, 1487, 1489, 1493, 1499, 1511,
1523, 1531, 1543, 1549)) #max number of allele outcome (per marker) is 244
#0) Configure file reader functions (evid/ref)
fileReaderEvid <- fileReaderRef <- euroformix::tableReader #set default function
if(!is.null(importEvidFile)) {
fileReaderEvid = obtainFun(importEvidFile) #obtain function from file
if(!is.function(fileReaderEvid)) stop("Failed to obtain function for reading evidence profiles")
}
if(!is.null(importRefFile)) {
fileReaderRef = obtainFun(importRefFile) #obtain function from file
if(!is.function(fileReaderRef)) stop("Failed to obtain function for reading reference profiles")
}
#1) Configure kit:
popFreq = euroformix::freqImport(freqfile)[[1]] #read frequencies from file using function in EFM
locs = names(popFreq) #locus names to use in data (AMEL can be ignored)
#encode allele names with prime numbers:
popFreqP = popFreq
for (loc in locs) names(popFreqP[[loc]]) <- prim[1:length(popFreq[[loc]])]
DBcolN <- c("ID","SID","CID","BID","Time") #column names of DBrefN and DBstainN: (ID, sample ID, case ID, batch file, time)
#2) IMPORT References:
DBref <- matrix(nrow = 0, ncol = length(locs)) #encoded matrix
DBrefN <- numeric() #corresponding names
#IMPORT Rfiles (Reference files)
Rfiles <- NULL
if (!is.null(reffold)) { #only run if reffolders are specified
for( reffold1 in reffold) { #loop through all selected folders
Rfiles <- list.files(reffold1, include.dirs = FALSE, recursive = searchSubFoldersRef) #list files
for (Rfile in Rfiles) {
#Rfile=Rfiles[1]
fname = paste0(reffold1, "/", Rfile)
tryCatch( {
X = fileReaderRef(fname)
#X[is.na(X)] = "" #remove NA
cn = colnames(X)
sind = grep("ID", toupper(cn)) #sample col-ind
if (length(sind) == 0) sind = grep("SAMPLE", toupper(cn))
if (length(sind) > 1) stop("More than one sample ID column was found in the file")
sn = unique(as.character(X[, sind]))
if (length(sn) == 1) { #only one sample was found
sn <- strsplit(Rfile, "\\.")[[1]][1] #get samplename of filename
X[, sind] <- sn #update sample name within file!
}
lind = grep("marker", tolower(cn)) #locus col-ind
if (length(lind) == 0) lind = grep("LOC", toupper(cn))
X[, lind] <- toupper(X[, lind]) #make all loci upper-case
ln = intersect(locs,unique(X[, lind])) #import common markers for evid/refs
#if (!all(locs %in%ln)) {
# print(paste0("Reference-File '", Rfile, "' had another kit-format than given in the frequency file!"))
# next
#}
Aind = grep("allele", tolower(cn))[1:2] #allele col-ind. Only consider the two first
ishom <- X[, Aind[2]] == "" | is.na(X[, Aind[2]])
X[ishom, Aind[2]] <- X[ishom, Aind[1]] #add homozygote to both
#Introduce fast encoding: 1 means that it is no alleles
smallREF <- matrix(1, ncol = length(locs), nrow = length(sn))
for (loc in ln) { #traverse each locus in file
locind <- which(X[, lind] == loc) #get index of loci
tmp <- popFreq[[loc]]
tmpP <- popFreqP[[loc]]
av = unique( as.character(unlist(X[locind, Aind]) )) #vectorize and get unique alleles
av = av[!is.na(av) & av!=""] #remove empty alleles
newA <- av[!av%in%names(popFreq[[loc]])] #get new Alleles
if (length(newA) > 0) {
tmp <- popFreq[[loc]]
tmpP <- popFreqP[[loc]]
popFreqP[[loc]] <- popFreq[[loc]] <- c(popFreq[[loc]], rep(newf0, length(newA))) #insert new freqs.
names(popFreq[[loc]]) <- c(names(tmp), newA) #insert new a-names
names(popFreqP[[loc]]) <- c(names(tmpP), prim[(length(tmp) + 1):length(popFreq[[loc]])]) #insert new p-names
}
Pname <- as.integer(names(popFreqP[[loc]])) #get primenumbers
Aname <- names(popFreq[[loc]]) #get allelenames
for (an in Aname) { #for each alleles in population
for (ai in Aind) { #for each alleles
aind <- which( X[locind, ai] == an ) #get rows which has corresponding allele (NA removed)
srow <- which(sn %in% X[locind, sind]) #get belonging rows in smallREF
smallREF[srow[aind], which(locs == loc)] <- smallREF[srow[aind], which(locs == loc)] * Pname[which(Aname == an)]
}
} #end for each alleles
}#end for each loci
smallREF[smallREF == 1] = NA #insert missing loci as NA
DBref <- rbind(DBref, smallREF)
DBrefN <- rbind(DBrefN, cbind(sn,"0","0","0","0") )
},error = function(e) {
print(paste0("Failed to read file ",fname) )
print("Data import continues...")
})
} #end for each R-file
} #end for each Reffold
} #end if Reffold specified
if (length(DBrefN) == 0) {
print("NOTE: No reference samples were imported!")
if( !betweensamples) {
print("Reference samples needed for further search since not searching between samples.")
return(FALSE)
}
} else {
colnames(DBrefN) <- DBcolN
#Check for duplicated R-ID:
if (length(unique(DBrefN[,DBcolN=="ID"])) != nrow(DBrefN)) print("Some imported references had same R-ID. This is ignored.")
print(paste0("Number of imported references: ",nrow(DBrefN)))
}
colnames(DBref) <- locs
#3) IMPORT Stains:
DBstainA <- DBstainH <- DBstainN <- numeric() #matrices to store all stains, DBstainN stores SID(Sample),RID(replicate),CID(case),BID(batch)
for(ff in evidfold) { #for each evidence folder
#ff <- evidfold[1]
Bfiles <- list.files(ff, pattern = BIDptrn, include.dirs = FALSE, recursive = searchSubFoldersEvid) #Uses BID-pattern in files.
if(!is.null(BIDvec) && length(Bfiles)>0) { #extract Bfiles which are selected in BIDvec
induse = numeric()
for(bid in BIDvec) induse = c(induse, grep(bid,Bfiles) ) #store index of matches
Bfiles = Bfiles[induse] #update
}
for (Bfile in Bfiles) { #for each file
#Bfile = Bfiles[16]
if (which(Bfiles == Bfile)%%100 == 0) print(paste0(round(which(Bfiles == Bfile)/length(Bfiles) * 100), "% import complete")) #print progress every 100 files
fname <- paste0(ff, "/", Bfile)
tryCatch( { #avoid crashing program
#get time-info from file (to this early):
Tfile <- file.info(fname)$mtime #time when file was last modified!
Fdate <- format(Tfile)
Tdiff <- difftime(searchtime, Tfile, units = "days") #use day resolution
Tdiff <- as.integer(strsplit(format(Tdiff, digits = 1)," ")[[1]][1]) #check roundoff (day resolution)
if (Tdiff<0 || Tdiff > Thist) next #don't import if file was modified more than Thist days ago, or newer than Searchtime
X = fileReaderEvid(fname) #reading table using inbuilt efm function
cn = colnames(X)
sind = grep("sample", tolower(cn)) #sample col-ind
if (length(sind) > 1) sind <- sind[grep("name", tolower(cn)[sind])] #if multiple sind
sn = unique(X[, sind]) #sample names
if (length(sn) == 0) next #this file did not contain valid samples!
#Obtain marker info:
lind = grep("marker", tolower(cn)) #locus col-ind
if (length(lind) == 0) lind = grep("LOC", toupper(cn))
X[, lind] <- toupper(X[, lind]) #make all loci upper-case (THIS IS VERY IMPORTANT!)
ln = intersect(locs,unique(X[, lind])) #import common markers for evid/refs
if (length(ln)==0 || (searchoption==3 && !is.null(kit) && !all(locs%in%ln))) { #check that marker names are same as in the freq-file
print(paste0("Evidence-File '", Bfile, "' had another kit-format than given in the frequency file (can't use when kit is selected with searchoption=3)"))
next
}
#Update data table to only include relevant markers:
X = X[X[, lind]%in%ln,,drop=FALSE]
Aind = grep("allele", tolower(cn)) #allele col-ind
Hind = grep("height", tolower(cn)) #height col-ind
if (length(Aind) == 0 | length(Hind) == 0) next #if no allele or peak height info
#Extract SID,RID,CID,BID
tmp = strsplit(sn, IDsep)
SID <- sapply(tmp , function(x) x[1]) #extract SID
RID <- sapply(tmp , function(x) x[2]) #extract RID
CID <- sapply(tmp , function(x) x[3]) #extract CID
RID[is.na(RID) | RID==""] = "0" #default if missing
CID[is.na(CID) | CID==""] = "0" #default if missing
if( !all(length(SID)==length(RID)) || !all(length(CID)==length(RID)) ) stop("Requiring correct sampleName formats. See details ?dnamatch2.")
tmp <- strsplit2(Bfile, c(BIDptrn, "\\."))[1] #Get BatchID (remove extension)
BID <- paste0(BIDptrn, tmp) #Update BID
induse2 = 1:length(sn) #default range to use
if(!is.null(SIDptrn) && length(SIDptrn)>0) {
induse2 = numeric()
for(pat in SIDptrn) induse2 = c(induse2, grep(pat ,SID) ) #store indices corresponds to in SIDvec
}
if(!is.null(SIDvec) || !is.null(CIDvec) ) {
induse2 = numeric()
if(!is.null(SIDvec) && length(SIDvec)>0) {
for(sid in SIDvec) induse2 = c(induse2, grep(sid,SID) ) #store indices corresponds to in SIDvec
}
if(!is.null(CIDvec) && length(CIDvec)>0) {
for(cid in CIDvec) induse2 = c(induse2, grep(cid,CID) ) #store indices corresponds to in SIDvec
}
}
induse2 = unique(induse2) #consider only unique indices
for (ss in induse2) { #for each combination of SID and RID (unique stains)
subX <- X[X[, sind] == sn[ss],,drop=FALSE] #get submatrix
SSvec <- rep(NA, length(locs)) #place to store allele-info into MAJOR/single source-matrix
Avec <- Hvec <- rep(NA, length(locs)) #place to store allele/height-info into evidence-matrix
for (ii in 1:nrow(subX)) { #for each marker-row
loc <- subX[ii, lind]
locind <- match(loc, locs) #find correct locus (NB: grep causes potential MORE candidates)
if (length(locind) == 0) next
if (length(locind) > 1) stop("Several locus names were observed! Please check.")
Ainfo <- as.character(subX[ii, Aind]) #extract allele info
Hinfo <- as.numeric(subX[ii, Hind]) #extract height info
usedInd <- !(is.na(Hinfo) | Ainfo == "" | Ainfo == "NA" | Ainfo == "OL")
Ainfo <- Ainfo[usedInd]
Hinfo <- Hinfo[usedInd]
#(1) Peak height above threshold
keep <- Hinfo >= threshHeight #required minimum peak height
Ainfo <- Ainfo[keep]
Hinfo <- Hinfo[keep]
if (length(Ainfo) == 0) next
isNumAllele = rep(FALSE,length(Ainfo))
for(i in 1:length(Ainfo)) isNumAllele[i] = isNum(Ainfo[i]) #check if allele can be converted to number
#Check if alleles are both numeric and non-numeric
if(length(unique(isNumAllele))==2) {
cat(paste0("Both numerical/non-numerical alleles found for:\nBatch=",BID,", Sample ",sn[ss],", Marker ",loc,"\nProgram will stop!\n") )
return(NULL)
}
#(2) Stutter-filter (only if all alleles are numeric)
if(all(isNumAllele) ) {
AllsNum <- as.numeric(Ainfo) #convert to numbers
stuttindL <- which(as.character(AllsNum) %in% as.character(AllsNum - 1)) #indices of alleles one repeat unit below another allele (potential stutters)
stuttindH <- which(as.character(AllsNum - 1) %in% as.character(AllsNum)) #indices of the corresponding parent alleles
stuttR <- Hinfo[stuttindL]/Hinfo[stuttindH] #stutter ratio compares observed peak heights
remove <- stuttindL[stuttR < threshStutt] #alleles to remove
if (length(remove) > 0) {
Ainfo <- Ainfo[-remove]
Hinfo <- Hinfo[-remove]
}
}
#Insert new alleles
Anew <- unique(Ainfo[!Ainfo %in% names(popFreq[[loc]])])
if (length(Anew) > 0) {
tmp <- popFreq[[loc]]
tmpP <- popFreqP[[loc]]
popFreqP[[loc]] <- popFreq[[loc]] <- c(popFreq[[loc]], rep(newf0, length(Anew)))
names(popFreq[[loc]]) <- c(names(tmp), Anew)
names(popFreqP[[loc]]) <- c(names(tmpP), prim[(length(tmp) + 1):length(popFreq[[loc]])])
}
Pname <- as.integer(names(popFreqP[[loc]])) #remember to update Pname (used for SS-vec)
#Collapse alleles and heights and save in vector
Avec[locind] <- paste0(Ainfo,collapse="/")
Hvec[locind] <- paste0(Hinfo,collapse="/")
#Extract major profiles: Put into an own matrix:
if(betweensamples) {
ord <- order(Hinfo, decreasing = TRUE)
Ainfo <- Ainfo[ord]
Hinfo <- Hinfo[ord]
SS <- Ainfo[1] #largest allele always included
nA <- length(Ainfo)
if (nA > 1 && Hinfo[2]/Hinfo[1] > threshMaj) { #if more than 1 allele and ratios of two largest suffice
SS <- c(SS, Ainfo[2]) #add second allele
if (nA > 2 && Hinfo[3]/Hinfo[2] > threshMaj) { #if more than 2 major alleles and also 3. largest suffice
SS <- NA #couldn't decide anything for this marker
}
}
if (!any(is.na(SS))) {
tmp <- prod(Pname[names(popFreq[[loc]]) %in% SS])
if(length(SS)==1) tmp <- tmp*tmp #homozygous multiplied twice
SSvec[locind] <- tmp #store SS-info
}
} #end if between samples
} #end for each markers
if( sum(!is.na(Avec)) < minLocStain ) { #if too few accepted markers (below a given threshold)
print(paste0("Sample ",sn[ss]," was not included: Num.mark=",sum(!is.na(Avec)))) #print number of markers
next
}
stainN <- cbind(sn[ss],SID[ss],CID[ss], BID, Fdate) #stain names (include date for file also)
DBstainN <- rbind(DBstainN, stainN )
DBstainA <- rbind(DBstainA, Avec )
DBstainH <- rbind(DBstainH, Hvec )
if(betweensamples && sum(!is.na(SSvec))>=minLocMaj ) {
DBref <- rbind(DBref, SSvec ) #add single source to DBref
DBrefN <- rbind(DBrefN , stainN) #add full sample name (as in v1.0)
}
} #end for each sample
},error = function(e) { #report failure and continue with the next file
print(paste0("Failed to read file ",fname) )
print("Data import continues...")
})
} #end for each Bfiles
} #end for each fn path
if (length(DBstainN) == 0) {
print("No evidence was imported! Search stops..")
return(FALSE)
}
colnames(DBstainN) <- DBcolN
colnames(DBstainA) <- colnames(DBstainH) <- locs
#Check for duplicated trace names:
dupind <- duplicated(DBstainN[, DBcolN=="ID"])
if (sum(dupind) > 0) {
print(paste0("There were ", sum(dupind), " duplicated stain samples. Program continues after removing the following duplicates:"))
print(DBstainN[dupind,])
DBstainN <- DBstainN[!dupind, ,drop=FALSE]
DBstainA <- DBstainA[!dupind, ,drop=FALSE]
DBstainH <- DBstainH[!dupind, ,drop=FALSE]
}
########################################
#Showing metadata for imported samples:#
########################################
nS <- nrow(DBstainN)
nR <- nrow(DBrefN)
nL <- ncol(DBstainA) #number of loci
print(paste0("Number of loci to use: ",nL))
print(paste0("Number of samples to search: ",nS))
print(paste0("Number of reference profiles to search: ",nR))
############################################################################################################################################
######################################COMPARISON#########################################################################
#############################################################################################################################################
#helpfunction to store matchfile results to file
storeResults = function(Ctab0) {
Ctab0 <- Ctab0[order(Ctab0[,ncol(Ctab0)],decreasing=T),,drop=F] #order results
nLstain <- rowSums(!is.na(DBstainA[Ctab0[,1],,drop=F]) & !is.na(DBref[Ctab0[,2],,drop=F])) #get number of non-zero loci in both stain and reference
#Create result matrix:
resMat <- cbind( DBrefN[Ctab0[,2], DBcolN=="BID"], DBstainN[Ctab0[,1], DBcolN=="BID"], DBrefN[Ctab0[,2], DBcolN=="ID"], DBstainN[Ctab0[,1], DBcolN=="ID"],nLstain,Ctab0[,-(1:2),drop=F])#,stamp)
if(ncol(resMat)==6) resMat = cbind(resMat,NA,NA,NA,NA) #filling empty with NA
if(ncol(resMat)==8) resMat = cbind(resMat,NA,NA) #filling empty with NA
resMat = cbind(resMat,stamp)
cn <- c("refBID","tarBID","refID","tarID","nLocs","MAC","nContr","LRqual","LRquan","Mx","Searchtime") #colnames (searchoption 3)
colnames(resMat) = cn
#FILTER AWAY SYMMETRIC MATCHES
allSID <- cbind(resMat[,cn=="refID"],resMat[,cn=="tarID"] )
indrev <- as.character(allSID[, 1]) > as.character(allSID[, 2])
allSID[indrev, ] <- allSID[indrev, 2:1] #sort columns
isdup <- duplicated(allSID) #find (redundant/symmetrical) matches already found before
resMat <- resMat[!isdup,,drop=F] #get only unique matches!
#5) STORE MATCHES
write.table( resMat[ resMat[,cn=="refBID"]!="0",,drop=F] , file = paste0(sessionfold,"/stainmatches_", stamp, ".csv"), row.names = FALSE, sep = ";") #store matches not personell:
write.table( resMat[ resMat[,cn=="refBID"]=="0",,drop=F] , file = paste0(sessionfold,"/refmatches_", stamp, ".csv"), row.names = FALSE, sep = ";") #store matches only personell:
matchingprofiles = function(resMat) {
if (length(resMat) == 0) return()
cn <- colnames(resMat)
refind <- which("refID" == cn)
tarind <- which("tarID" == cn)
refindBID <- which("refBID" == cn)
tarindBID <- which("tarBID" == cn)
fulltab <- numeric()
for (ss in 1:nrow(resMat)) {
refID <- resMat[ss,refind]
refA <- DBref[DBrefN[,DBcolN=="ID"]==refID,,drop=F][1,] #get reference data: Make robust if several refs
tarID <- resMat[ss,tarind]
tarA <- DBstainA[DBstainN[,DBcolN=="ID"]==tarID,] #get reference data
tarH <- DBstainH[DBstainN[,DBcolN=="ID"]==tarID,] #get reference data
nL0 = as.numeric(resMat[ss,cn=="nLocs"])
nL <- paste0("nLocs=",nL0)
mac0 = as.numeric(resMat[ss,cn=="MAC"])
mac <- paste0("MAC=", round(mac0,3)) #get rounded mac
mac2 <- paste0("MM=", round(2*nL0*(1-mac0)) ) #number of mismatches
mactxt = paste0(c(mac,mac2),collapse=" - ")
Khat <- paste0("nContr=", round( as.numeric(resMat[ss,cn=="nContr"]),2))
lr1 <- paste0("LRqual=",signif(as.numeric(resMat[ss,cn=="LRqual"]),digits=5))
lr2 <- paste0("LRquan=",signif(as.numeric(resMat[ss,cn=="LRquan"]),digits=5))
mx <- paste0("Mx=",signif(as.numeric(resMat[ss,cn=="Mx"]),digits=3))
rowtab <- rbind(c(resMat[ss, refind],resMat[ss, tarind]),c(nL,mactxt),c(Khat,lr1),c(mx,lr2),c("----REFERENCE----","---------TARGET------"))
for (loc in locs) {
Pname <- as.integer(names(popFreqP[[loc]]))
Ei <- paste0(tarA[loc==locs]," - ",tarH[loc==locs])
RPi <- as.integer(refA[loc == locs])
Ri <- ifelse(!is.na(RPi),paste0(names(popFreq[[loc]])[RPi%%Pname == 0], collapse = "/"),NA)
if(any(is.na(Ri))) next #don't show if ref is missing
if(!grepl("/",Ri)) Ri <- paste0(Ri,"/",Ri)
rowtab <- rbind(rowtab, c(paste0(loc,": ",Ri), Ei)) #v1.4 updated
}
fulltab <- rbind(fulltab, rep(paste0("----------", ss, "----------"), 2))
fulltab <- rbind(fulltab, rowtab)
fulltab <- rbind(fulltab, rep(paste0("------------------------------"), 2))
}
return(fulltab)
}
#store details of all matches:
write.table(matchingprofiles(resMat), file = paste0(sessionfold,"/matchinfo_", stamp, ".csv"), col.names = FALSE,row.names = FALSE, sep = "\t")
#Number of matches:
print(paste0("Number of matches=", nrow(resMat)))
#THIS BLOCK: Loads matches in matchfile. Makes sure that new matches don't duplicate old ones (not even symmetric ones)
if (!file.exists(matchfile)) { #create empty file if not found
x <- matrix(NA, ncol = length(cn), nrow = 0)
colnames(x) <- cn
write.table(x, file = matchfile, row.names = FALSE, col.names = TRUE, sep = ";")
}
if(length(resMat) > 0) {
matchlist2 <- read.table(file = matchfile, sep = ";", header = TRUE, stringsAsFactors = FALSE)
cn2 <- colnames(matchlist2)
if(length(cn)!=length(cn2)) print("Existing file had wrong number of columns!")
SID <- cbind(resMat[,cn=="refID"],resMat[,cn=="tarID"] )
SID2 <- cbind(matchlist2[,cn2=="refID"],matchlist2[,cn2=="tarID"] )
allSID <- rbind(SID2, SID) #combine ID from file and ID from new matches
allSID <- as.matrix(allSID) #vectors in a matrix may already have been converted to factors; necessary to make strings again!
indrev <- as.character(allSID[, 1]) > as.character(allSID[,2])
allSID[indrev, ] <- allSID[indrev, 2:1] #sort columns
isdup <- duplicated(allSID) #find (redundant/symmetrical) matches already found before
resMat2 <- resMat[!isdup[(nrow(SID2) + 1):length(isdup)], ,drop=F] #get only NEW matches!
print(paste0("Number of new matches=", nrow(resMat2)))
if (length(resMat2) > 0) { #only if any new matches
write.table(resMat2, file = matchfile, row.names = FALSE, col.names = FALSE, sep = ";", append = TRUE)
if(length(resMat2)!=length(resMat) ) write.table(matchingprofiles(resMat), file = paste0(sessionfold,"/matchinfoNEW_", stamp, ".csv"), row.names = FALSE, col.names = FALSE, sep = ";") #Only provide if different number of matches
} else {
print("NO NEW MATCHES!")
}
}
}
######################################################################
#Part 1: Compare number of alleles in references included in evidence#
######################################################################
#Create full MAC matrix: all combinations of rows in DBstainN and rows in DBrefN:
#DBrefLocs <- rowSums(!is.na(DBref))
print(paste0("Calculating MAC for all ",nS*nR," comparisons: All refs against all stains"))
bigMAC <- rep(0,nS*nR) #keep MAC in a vector (sample1-ref1,sample1-ref2,...,sample2-ref1 etc.)
#all(colnames(DBstainA)==colnames(DBref))
systime <- system.time( {
for(ss in 1:nS) { #for each sample: Limited in how the samples are looking
#ss=2
bigInd <- nR*(ss-1) + 1:nR #index in bigMAC matrix
macS <- nLocs <- rep(0,nR) #make vector for all references
for(ll in 1:ncol(DBstainA)) { #for each locus: Vectorizing as much as possible!
#ll=2
if(is.na(DBstainA[ss,ll])) next #skip if no data
sttmp <- unlist(strsplit(DBstainA[ss,ll],"/")) #get alleles
sttmpP <- as.integer(names(popFreqP[[ll]][match( sttmp , names(popFreq[[ll]]) )])) #get prime numbers for the stain alleles
isna <- is.na(DBref[,ll]) #identify refs with missing data at this locus
numWithin <- rep(0,sum(!isna)) #number of unique reference alleles that are in the stain
for(aa in sttmpP) numWithin <- numWithin + as.integer(DBref[!isna,ll]%%aa==0) #for each allele in the stain
isHom <- sqrt(DBref[!isna,ll])
isHom <- abs(round(isHom) - isHom)<1e-6 #which are homozygous
macS[!isna] <- macS[!isna] + numWithin #add number of matching alleles
macS[!isna][isHom & numWithin==1] <- macS[!isna][isHom & numWithin==1] + 1 #count homozygous references twice if their allele matched
nLocs[!isna] <- nLocs[!isna] + 1 #add locus
} #end for each locus
bigMAC[bigInd] <- macS/(2*nLocs) #divide number of matching reference alleles by the maximum possible (2 per locus)
#hist(macS)
#insert MAC to long vector
if (ss%%100 == 0) print(paste0(round(ss/nS* 100), "% MAC calculation complete"))
} #end number of samples
})[3]
print(paste0("Calculating MAC scores took ",ceiling(systime), " seconds"))
#End compare with MAC
if(printHistPlots && length(bigMAC)>0) {
dev.new()
hist(bigMAC,main="Frequency of MAC score",xlab="MAC")
abline(v=threshMAC,col=2)
op <- par(no.readonly = TRUE)
par(op)
}
#max(bigMAC)
###########
#FILTER 1:#
###########
keepInd <- which(bigMAC>=threshMAC)
#bigMAC[keepInd]
#(1:length(bigMAC)-1)%%nR + 1
#floor( (1:length(bigMAC)-1)/nR ) + 1
Ctab <- cbind( floor((keepInd-1)/nR) + 1, (keepInd-1)%%nR + 1 , bigMAC[keepInd]) #convert flat indices back to (stain,ref) pairs
colnames(Ctab) <- c("stain","ref","score") #note the order: stain index first
#nR*(Ctab[,1]-1) + Ctab[,2] #check
#cbind(DBstainN[Ctab[,1],4],DBrefN[Ctab[,2]]) #
#Ctab <- Ctab[rowSums(is.na(DBref[Ctab[,2],]))>0,] #consider these only
#FILTER 1b: remove for other reasons:
#(1) Remove because it was the same stain:
sameSID <- DBrefN[ Ctab[,2],DBcolN=="SID"]==DBstainN[ Ctab[,1],DBcolN=="SID"] #check for matches which have the same sample ID (SID)
Ctab <- Ctab[!sameSID,,drop=FALSE] #remove matches from the same sample
#(2) Remove because it was the same CID
if (!sameCID) {
sameCID2 <- DBrefN[ Ctab[,2],DBcolN=="CID"]==DBstainN[ Ctab[,1],DBcolN=="CID"] #check for matches which have the same case ID (CID)
Ctab <- Ctab[!sameCID2,,drop=FALSE] #remove matches from the same case
}
#(3) Remove because it was outside time difference
if (betweensamples && !is.null(timediff)) {
isREF <- DBrefN[Ctab[,2],DBcolN=="Time"]=="0" #get samples which are ref
if(!all(isREF)) {
diff <- abs(difftime(DBrefN[Ctab[!isREF,2],DBcolN=="Time"], DBstainN[ Ctab[!isREF,1],DBcolN=="Time"], units = "days"))
isREF[!isREF] <- diff<=timediff #if ref-profile or inside time
Ctab <- Ctab[isREF,,drop=FALSE] #keep only ref-profiles or inside time
}
}
nK <- nrow(Ctab) #number of comparisons to continue with
print(paste0("Number of comparisons satisfying (after filters) threshMAC=",threshMAC,": ", nK))
if(nK==0 ) { #if not more comparisons
print("There were no more comparisons to do after simple allele matching. Program stops!")
return(FALSE)
}
#Write results to file:
if( 0 ) { #disabled: writing all MAC results could create very large files
ord = order(Ctab[,3],decreasing = TRUE) #sort results by MAC score
out = cbind( DBstainN[Ctab[ord,1]], DBrefN[Ctab[ord,2]],signif(Ctab[ord,3],3)) #apply the same ordering to all columns
colnames(out) = c("Stain","Reference","MAC")
write.table(out, file = paste0(sessionfold,"/MACresults_",stamp,".csv"), row.names = FALSE, col.names = TRUE, sep = ";", append = FALSE,quote=FALSE)
rm(out);gc()
} #end disabled block
if( searchoption==1 ) { #write to file only if searchoption=1
print("Search completed! Storing results...")
storeResults(Ctab0=Ctab) #store results
return(TRUE)
}
#####################################################
#Part 2: Calculate LRqual for relevant combinations:#
#####################################################
unStain <- unique(Ctab[,1]) #unique stain profiles:
nU <- length(unStain) #number of unique
print(paste0("Estimating num. contr. for ", nU," stains"))
print(paste0("Calculating LRqual for ", nK," combinations"))
#2.1) Estimate number of contr. using qual model
#New settings to make more robust optimization;
maxIter = 5 #max number of iterations in optimization
Kqual <- log10LRqual <- rep(0,nK) #for storing estimated num. contr. and log10 LR (qual model)
systime <- system.time( {
for(ss in unStain) { #for each unique stain we estimate number of contr and calculate the LR for all the references
# ss = unStain[1]
dat <- DBstainA[ss,]
locUseS <- !is.na(dat) #loci with data for this sample
if(!ignoreEmptyLoci) locUseS <- rep(TRUE,length(locUseS)) #use all loci (assumes kit corresponds to popFreq); empty loci carry dropout information
ln = locs[locUseS] #get loci to use in all calculations
samples <- list(sample=lapply(dat[locUseS], function(x) { #create sample data list
a0 <- unlist(strsplit(x,"/"))
if(all(is.na(a0))) a0 <- numeric()
list(adata=a0)
})) #strsplit(dat[locUseS],"/")
data <- euroformix::Qassignate(samples, popFreq[ln],incS=FALSE)
#Optimization:
K <- 1 #start with one contributor
done <- FALSE
hdval <- matrix(NA,ncol=4,nrow=0) #columns: nContr, dropProb, logLik, logLik - K (penalized logLik)
while(!done) { #model selection for LRmix: Find MLE under hd
data$nU <- rep(K,length(ln)) #equal for all loci
foo = euroformix::calcQualMLE(K,data$samples,data$popFreq,prC=pC,maxIter = maxIter) #max number of iterations
hdval <- rbind(hdval,c(K,foo$pDhat,foo$loglik, foo$loglik - K)) #append to table
#print(hdval)
nval <- nrow(hdval) #obtain number of rows
if(nval>1 && as.numeric(hdval[nval,4]) < as.numeric(hdval[nval-1,4]) ) {
done <- TRUE #We keep number of contributors
} else {
if(K==maxK[1]) { #maximum was obtained
done = TRUE
} else {
K <- K + 1 #Increment number of contributors if not done
}
}
}
indAIC <- which.max(hdval[,4])
Khat <- hdval[indAIC,1] #estimated contr.
pDhat <- hdval[indAIC,2] #estimated dropout
loghd <- hdval[indAIC,3] #Problem: Not always global since ref may miss some markers.
Kqual[ ss == Ctab[,1] ] <- Khat #estimated number used in quanLR directly
#Calc. Hp to get LR:
whatR <- which(ss == Ctab[,1]) #what ref ind to consider
subRef <- DBref[ Ctab[whatR,2], ,drop=FALSE] #get references
for(r in 1:length(whatR) ) { #for each reference to compare with selected
#fix refData:
subRef2 <- subRef[r,] #get reference vector (with alleles)
refD <- list()
for(loc in ln ) {
pri <- subRef2[loc==locs] #get prime number
if(is.na(pri)) {
a0 <- numeric()
} else {
an <- names(popFreq[[loc]])
anP <- as.integer( names(popFreqP[[loc]]) )
a0 <- an[pri%%anP==0]
}
if(length(a0)==1) a0 <- rep(a0,2) #homozygous: repeat the single allele
refD[[loc]] <- list(ref1=a0)
}
data = euroformix::Qassignate(samples, popFreq[ln],refD,incS=FALSE,incR=FALSE)
foo = euroformix::calcQualMLE(Khat,data$samples,data$popFreq,data$refData,condOrder=1, prC=pC,maxIter = maxIter) #max number of iterations
log10LRqual[whatR[r]] = (foo$loglik - loghd)/log(10) #get calculated LR
} #end for each references given unique stain
ii <- which(ss==unStain)
if (ii%%10 == 0) print(paste0(round(ii/nU* 100), "% LR qual calculation complete"))
} #end for each stains
#Kqual
})[3]
print(paste0("Calculating LR (qual) took ",ceiling(systime), " seconds"))
if(printHistPlots && length(log10LRqual)>0) {
dev.new()
hist(log10LRqual,main="Frequency of LRqual score",xlab="log10LR")
abline(v=log10(threshLR[1]),col=2)
op <- par(no.readonly = TRUE)
par(op)
}
###########
#FILTER 2:#
###########
keepInd2 <- which(log10LRqual>=log10(threshLR[1]))
Ctab2 <- cbind(Ctab[keepInd2,,drop=F], Kqual[keepInd2], log10LRqual[keepInd2]) #include LRqual (on original scale)
nK2 <- nrow(Ctab2) #update nK (number of combinations)
print(paste0("Number of comparisons satisfying threshLRqual>",threshLR[1],": ", nK2))
if(nK2==0) {
print("There were no more comparisons to do after qualitative comparison. Program stops!")
return(FALSE)
}
if(writeScores && searchoption>2 ) { #if write temporary scorings
out = cbind( DBstainN[Ctab2[,1]], DBrefN[Ctab2[,2]],round(Ctab2[,3],3),Ctab2[,4],round(Ctab2[,5],3))
colnames(out) = c("Stain","Reference","MAC","nContr","LRqual")
write.table(out, file = paste0(sessionfold,"/LRqualResults_",stamp,".csv"), row.names = FALSE, col.names = TRUE, sep = ";", append = FALSE,quote=FALSE)
rm(out);gc()
}
if(searchoption==2) {
print("Search completed! Storing results...")
storeResults(Ctab0=Ctab2) #store results and stop program
return(TRUE)
}
#####################################################
#Part 3: Calculate LRquan for relevant combinations:#
#####################################################
#Using estimated contr. in Kqual for quan LR:
#Note: this also requires all markers for the references; more advanced handling is not implemented!
log10LRquan <- mxhat <- rep(0,nK2) #store LR and mx
unStain <- unique(Ctab2[,1]) #unique stain profiles
nU <- length(unStain) #number of unique stains
print(paste0("Calculating LRquan for ", nK2," combinations (",nU," unique samples)"))
useDEG = FALSE #whether Degradation model should be used or not
if( !is.null(kit) ) { #if kit is specified (degradation model)
kitinfo = euroformix::getKit(kit)
if(length(kitinfo)==1) {
print("Kit was not recognized. Please specify a valid kit to use the degradation model.")
} else {
if( !all( toupper(locs)%in%toupper(unique(kitinfo$Marker)) ) ) {
print("The user has specified markers which were not recognized by the getKit function. Please rename the marker names in order to use the degradation model!")
return(FALSE) #return from function
}
useDEG = TRUE #enable the degradation model only when the kit and all markers are recognized
}
}
#Note: Empty markers are important because they carry information about allele dropouts.
systime <- system.time( {
for(ss in unStain) { #for each unique stain we estimate number of contr.
# ss = unStain[1]
datA <- DBstainA[ss,]
datH <- DBstainH[ss,] #get peak heights
#Specifying which markers to use:
locUseS <- !is.na(datA) #loci with data for this sample
if(!ignoreEmptyLoci) locUseS <- rep(TRUE,length(locUseS)) #use all loci (assumes kit corresponds to popFreq); empty loci carry dropout information
ln = locs[locUseS] #get loci to use
#Obtain assumed number of contributors:
nC <- Kqual[ ss == Ctab[,1] ][1] #number of contr as for qual model
if(length(maxK)>1) nC = min(nC,maxK[2]) #use minimum of these if maxK specified for quan
samples <- list()
for(loc in ln) {
ind <- which(locs==loc)
a0 <- unlist(strsplit(datA[ind],"/"))
h0 <- as.numeric(unlist(strsplit(datH[ind],"/")))
if(is.na(datA[ind])) a0 <- h0 <- numeric()
samples[[loc]] <- list(adata=a0,hdata=h0)
}
samples <- list(sample=samples) #select only loci to use
#Fit under Hd:
data <- euroformix::Qassignate(samples, popFreq=popFreq[ ln ],incS=FALSE) #don't include stutters
fitHd <- euroformix::calcMLE(nC,data$samples,data$popFreq, DEG = useDEG,BWS = FALSE, FWS = FALSE, pC=pC,lambda=lambda,nDone=nDone,AT=threshHeight,kit=kit,verbose=FALSE)
loghd <- fitHd$fit$loglik #used under all combination for this stain.
#Calc. Hp to get LR:
whatR <- which(ss == Ctab2[,1]) #what ref ind to consider
subRef <- DBref[ Ctab2[whatR,2], ,drop=F] #get references
for(r in 1:length(whatR) ) { #for each reference to compare against this stain
#r=1
#fix refData:
subRef2 <- subRef[r,]
refD <- list()
for(loc in ln) { #for each loci (selected)
pri <- subRef2[loc==locs] #get prime number
if(is.na(pri)) {
a0 <- numeric() #if empty
} else {
an <- names(popFreq[[loc]])
anP <- as.integer( names(popFreqP[[loc]]) )
a0 <- an[pri%%anP==0]
if(length(a0)==1) a0 <- rep(a0,2)
}
refD[[loc]] <- list(ref1=a0)
}
#locUseR <- names(popFreq)%in%names(refD) #loci in ref.
data <- euroformix::Qassignate(samples, popFreq[ln],refD,incS=FALSE,incR=FALSE) #popFreq must be given with correct order?
fitHp <- euroformix::calcMLE(nC,data$samples,data$popFreq, data$refData,condOrder=1, DEG = useDEG,BWS = FALSE, FWS = FALSE, pC=pC,lambda=lambda,nDone=nDone,AT=threshHeight,kit=kit,verbose=FALSE)
mxhat[whatR[r]] <- fitHp$fit$thetahat2[1] #obtain mixture proportions
#Optimized Values under Hd:
log10LRquan[whatR[r]] <- (fitHp$fit$loglik - loghd)/log(10)
} #end for each references given unique stain
ii <- which(ss==unStain)
if (ii%%5 == 0) print(paste0(round(ii/nU* 100), "% LR quan calculation complete"))
} #end for each stains
})[3]
print(paste0("Calculating LR (quan) took ",ceiling(systime), " seconds"))
#hist(log10LRquan)
if(printHistPlots && length(log10LRquan)>0) {
dev.new()
hist(log10LRquan,main="Frequency of LRquan score",xlab="log10LR")
thresh = threshLR
if(length(threshLR)>1) thresh = threshLR[2]
abline(v=log10(thresh),col=2)
op <- par(no.readonly = TRUE)
par(op)
}
###########
#FILTER 3:#
###########
threshQuan <- threshLR[1]
if(length(threshLR)>1) threshQuan <- threshLR[2]
keepInd3 <- which(log10LRquan>=log10(threshQuan))
Ctab3 <- cbind(Ctab2[keepInd3,,drop=F], log10LRquan[keepInd3],mxhat[keepInd3]) #include LRquan (on original scale)
nK3 <- nrow(Ctab3)
print(paste0("Number of comparisons satisfying threshLRquan>",threshQuan,": ", nK3))
if(nK3==0) {
print("There were no match candidates. Program stops!")
return(FALSE)
}
if(writeScores) {
out = cbind( DBstainN[Ctab3[,1]], DBrefN[Ctab3[,2]],round(Ctab3[,3],3),Ctab3[,4],signif(Ctab3[,5],5),round(Ctab3[,6],3),signif(Ctab3[,7],3))
colnames(out) = c("Stain","Reference","MAC","nContr","LRqual","LRquan","Mx")
write.table(out, file = paste0(sessionfold,"/LRquanResults_",stamp,".csv"), row.names = FALSE, col.names = TRUE, sep = ";", append = FALSE,quote=FALSE)
rm(out);gc()
}
print("Search completed! Storing results...")
storeResults(Ctab3)
return(TRUE) #return successfully after completions
} #end dnamatch2
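The allele-matching step in Part 1 relies on encoding each reference profile as a product of per-allele prime numbers, so allele membership reduces to a modulo test and homozygosity to a perfect-square test. The following standalone sketch uses made-up allele names and primes (in `dnamatch2` these come from `popFreqP`):

```r
# Minimal sketch of the prime-number allele encoding used in Part 1 (hypothetical data).
# Each allele maps to a unique prime; a profile is stored as the product of its primes,
# so "allele a is in profile p" reduces to p %% prime(a) == 0.
alleles <- c("12", "13", "14", "15")      # allele names at one locus
primes  <- c(2, 3, 5, 7)                  # one prime per allele
names(primes) <- alleles

refProfile <- unname(primes["12"] * primes["14"])  # heterozygous 12/14 -> 2*5 = 10
homProfile <- unname(primes["13"]^2)               # homozygous 13/13 -> 3*3 = 9

stain <- c("12", "13", "14")              # alleles observed in the stain
stainPrimes <- primes[stain]

# Count reference alleles contained in the stain (cf. numWithin in the source):
numWithin <- sum(refProfile %% stainPrimes == 0)
# Homozygosity check: the profile product is a perfect square (cf. isHom):
isHom <- abs(round(sqrt(homProfile)) - sqrt(homProfile)) < 1e-6

print(numWithin)  # 2: both reference alleles (12 and 14) are in the stain
print(isHom)      # TRUE
```

This is why `numWithin` and the homozygote correction can be computed with vectorized arithmetic instead of string comparisons.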
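Part 1 stores all nS*nR MAC scores in one flat vector, ordered sample1-ref1, sample1-ref2, ..., sample2-ref1, and converts surviving flat indices back to (stain, ref) pairs when building `Ctab`. A minimal sketch of the two index mappings, with hypothetical sizes:

```r
# Flat-index bookkeeping from Part 1 (hypothetical sizes).
nR <- 3                              # number of references (assumed)
k  <- 5                              # flat index into bigMAC
stain <- floor((k - 1) / nR) + 1     # recover stain index (cf. Ctab construction)
ref   <- (k - 1) %% nR + 1           # recover reference index
print(c(stain, ref))                 # both equal 2: sample 2 vs reference 2
# Forward mapping used when filling bigMAC (cf. bigInd):
stopifnot(nR * (stain - 1) + ref == k)
```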
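Part 2 selects the number of contributors by incrementing K while the penalized log-likelihood `loglik - K` (an AIC-type criterion) keeps improving, then keeps the best K. The sketch below mimics that stopping rule with a made-up likelihood function standing in for `euroformix::calcQualMLE`:

```r
# Sketch of the stepwise contributor-number selection in Part 2 (hypothetical values).
loglikFun <- function(K) c(-120, -95, -94.5, -94.4)[K]  # made-up maximized log-likelihoods
maxK <- 4
K <- 1
hdval <- matrix(NA, ncol = 3, nrow = 0)  # columns: K, logLik, logLik - K
repeat {
  ll <- loglikFun(K)
  hdval <- rbind(hdval, c(K, ll, ll - K))
  n <- nrow(hdval)
  # stop when the penalized fit worsens, or the maximum K is reached:
  if ((n > 1 && hdval[n, 3] < hdval[n - 1, 3]) || K == maxK) break
  K <- K + 1
}
Khat <- hdval[which.max(hdval[, 3]), 1]  # best penalized fit
print(Khat)  # 2: a third contributor does not pay for the extra penalty
```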
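Both LR calculations (Parts 2 and 3) maximize log-likelihoods on the natural-log scale and convert the Hp-vs-Hd difference to a base-10 LR by dividing by log(10). A one-line sketch with made-up log-likelihoods:

```r
# log10 LR from natural-log likelihoods (made-up values; cf. log10LRqual/log10LRquan).
loghp <- -80.2   # maximized log-likelihood under Hp
loghd <- -92.7   # maximized log-likelihood under Hd
log10LR <- (loghp - loghd) / log(10)
print(round(log10LR, 2))  # 5.43
```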