divDyn: Diversity Dynamics using Fossil Sampling Data

Documented in subsample

#' Subsampling wrapper function
#' 
#' The function will take a function that has an occurrence dataset as an argument, and reruns it iteratively on the subsets of the dataset. 
#' 
#' The \code{subsample} function implements the iterative framework of the sampling standardization procedure. 
#' The function 1. takes the dataset \code{x}, 2. runs function \code{FUN} on the dataset and creates a container for results of trials
#' 3. runs one of the subsampling trial functions (e.g. \code{\link{subtrialCR}}) to get a subsampled 'trial dataset'
#' 4. runs \code{FUN} on the trial dataset and
#' 5. averages the results of the trials for a simple output of step 4. such as \code{vector}s, \code{matrices} and \code{data.frames}. For averaging, the \code{vectors} and \code{matrices} have to have the same output dimensions in the subsampling, as in the original object. For \code{data.frames}, the bin-specific information have to be in rows and the \code{bin} numbers have to be given in a variable \code{bin} in the output of \code{FUN}.
#' For a detailed treatment on what the function does, please see the vignette ('Handout to the R package 'divDyn' v0.5.0 for diversity dynamics from fossil occurrence data'). Currently the Classical Rarefaction (\code{"cr"}, Raup, 1975), the occurrence weighted by-list subsampling (\code{"oxw"}, Alroy et al., 2001) and the Shareholder Quorum Subsampling methods are implemented (\code{"sqs"}, Alroy, 2010).
#'
#' \strong{References:}
#'
#' Alroy, J., Marshall, C. R., Bambach, R. K., Bezusko, K., Foote, M., Fürsich, F. T., … Webber, A. (2001). Effects of sampling standardization on estimates of Phanerozoic marine diversification. Proceedings of the National Academy of Science, 98(11), 6261-6266.
#' 
#' Alroy, J. (2010). The Shifting Balance of Diversity Among Major Marine Animal Groups. Science, 329, 1191-1194. https://doi.org/10.1126/science.1189910
#'
#' Raup, D. M. (1975). Taxonomic Diversity Estimation Using Rarefaction. Paleobiology, 1, 333-342. https: //doi.org/10.2307/2400135
#' 
#' @param q (\code{numeric)}: Subsampling level argument (mandatory). Depends on the subsampling function, it is the number of occurrences for \code{"cr"}, and the number of desired occurrences to the power of \code{xexp} for O^x^W. It is also the quorum of the SQS method.
#' @param x (\code{data.frame}): Occurrence dataset, with \code{bin}, \code{tax} and \code{coll} as column names.
#' @param iter (\code{numeric}): The number of iterations to be executed. 
#' @param bin (\code{character}): The name of the subsetting variable (has to be integer). For time series, this is the time-slice variable. Rows with \code{NA} entries in this column will be omitted.
#' @param tax (\code{character}): The name of the taxon variable.
#' @param useFailed (\code{logical}): If the bin does not reach the subsampling quota, should the bin be used? 
#' @param type (\code{character}): The type of subsampling to be implemented. By default this is classical rarefaction (\code{"cr"}). (\code{"oxw"}) stands for occurrence weighted by-list subsampling. If set to (\code{"sqs"}), the program will execute the shareholder quorum subsampling algorithm as it was suggested by Alroy (2010). Setting the argument to \code{"none"} will invoke no subsamling, but the applied function will be iterated on the trials, nevertheless.
#' @param FUN (\code{function}): The function to be iteratively executed on the results of the subsampling trials. If set to \code{NULL}, no function will be executed, and the subsampled datasets will be returned as a \code{list}. By default set to the \code{\link{divDyn}} function. The function must have an argument called \code{x}, that represents the dataset resulting from a subsampling trial (or the entire dataset). Arguments of the \code{subsample} function call will be searched for potential arguments of this function, which means that already provided variables (e.g. \code{bin} and \code{tax}) will also be used. You can also provide additional arguments (similarly to the \code{\link[base]{apply}} iterator). Functions that allow arguments to pass through (that have argument '...') are not allowed, as well as functions that have the same arguments as \code{subsample} but would require different values.
#' @param output (\code{character}): If the function output are vectors or matrices, the \code{"arit"} and \code{"geom"} values will trigger simple averaging with arithmetic or geometric means. If the function output of a single trial is again a \code{vector} or a \code{matrix}, setting the output to \code{"dist"} will return the calculated results of every trial, organized in a \code{list} of independent variables (e.g. if the function output is value, the return will contain a single \code{vector}, if it is a \code{vector}, the output will be a list of \code{vector}s, if the function output is a \code{data.frame}, the output will be a \code{list} of \code{matrix} class objects). If \code{output="list"}, the structure of the original function output will be retained, and the results of the individual trials will be concatenated to a \code{list}.
#' @param keep (\code{numeric}): The bins which will not be subsampled but will be added to the subsampling trials. If the number of occurrences does not reach the subsampling quota, by default it will not be represented in the subsampling trials. You can force their inclusion with the \code{keep} argument separetely (for all, see the \code{useFailed} argument). 
#' @param rem  (\code{numeric}): The bins, which will be removed from the dataset before the subsampling trials.
#' @param duplicates (\code{logical} ): Toggles whether multiple entries from the same taxon (\code{"tax"}) and collection (\code{"coll"}) variables should be omitted. Useful for omitting occurrences of multiple species-level occurrences of the same genus. By default these are allowed through analyses (\code{duplicates=TRUE}), setting this to \code{FALSE} will require you to provide a collection variable. (\code{coll})
#' @param coll (\code{character}): The variable name of the collection identifiers. 
#' @param na.rm (\code{logical}): The function call includes more column names that might contain missing values. If this flag is set to \code{TRUE}, all rows will be dropped that have missig values in the specificed columns. This might lead to the exclusion of some data you do not want to exclude. 
#' @param FUN.args (\code{list}): Arguments passed to the applied function \code{FUN} but not used by the subsampling wrapper. Normally, the arguments of \code{FUN} can be added to the call of \code{subsample}, but in case you want to use different values for the same argument, then the arguments added here will be used for the call of \code{FUN}. For instance, if you want to call \code{subsample} with \code{bin=NULL}, but want to run \code{FUN=divDyn} with a valid \code{bin} column then you can add the column name here, e.g. \code{FUN.args=list(bin="stg")}.
#' @param counter (\code{logical}): Should the loop counting be visible?
#' @param ... arguments passed to \code{FUN} and the type-specific subsampling functions: \code{\link{subtrialCR}}, \code{\link{subtrialOXW}}, \code{\link{subtrialSQS}} 
#' 
#' @examples
#' 
#' data(corals)
#' data(stages)
#' # Example 1-calculate metrics of diversity dynamics
#'   dd <- divDyn(corals, tax="genus", bin="stg")
#'   rarefDD<-subsample(corals,iter=30, q=50,
#'   tax="genus", bin="stg", output="dist", keep=95)
#' 	
#' # plotting
#'   tsplot(stages, shading="series", boxes="sys", xlim=c(260,0), 
#'   ylab="range-through diversity (genera)", ylim=c(0,230))
#'   lines(stages$mid, dd$divRT, lwd=2)
#'   shades(stages$mid, rarefDD$divRT, col="blue")
#'   legend("topleft", legend=c("raw","rarefaction"),
#'     col=c("black", "blue"), lwd=c(2,2), bg="white")
#'   
#' \donttest{
#' # Example 2-SIB diversity 
#' # draft a simple function to calculate SIB diversity
#' sib<-function(x, bin, tax){
#'   calc<-tapply(INDEX=x[,bin], X=x[,tax], function(y){
#'     length(levels(factor(y)))
#'   })
#'   return(calc[as.character(stages$stg)])
#' }
#' sibDiv<-sib(corals, bin="stg", tax="genus")
#' 
#' # calculate it with subsampling
#' rarefSIB<-subsample(corals,iter=25, q=50,
#'   tax="genus", bin="stg", output="arit", keep=95, FUN=sib)
#' rarefDD<-subsample(corals,iter=25, q=50,
#'   tax="genus", bin="stg", output="arit", keep=95)
#' 
#' # plot
#' tsplot(stages, shading="series", boxes="sys", xlim=c(260,0), 
#'   ylab="SIB diversity (genera)", ylim=c(0,230))
#' 
#' lines(stages$mid, rarefDD$divSIB, lwd=2, col="black")
#' lines(stages$mid, rarefSIB, lwd=2, col="blue")
#' 
#' 
#' # Example 3 - different subsampling types with default function (divDyn)
#' # compare different subsampling types
#'   # classical rarefaction
#'   cr<-subsample(corals,iter=25, q=20,tax="genus", bin="stg", output="dist", keep=95)
#'   # by-list subsampling (unweighted) - 3 collections
#'   UW<-subsample(corals,iter=25, q=3,tax="genus", bin="stg", coll="collection_no", 
#'     output="dist", keep=95, type="oxw", xexp=0)
#'   # occurrence weighted by list subsampling
#'   OW<-subsample(corals,iter=25, q=20,tax="genus", bin="stg", coll="collection_no", 
#'     output="dist", keep=95, type="oxw", xexp=1)
#'  
#'   SQS<-subsample(corals,iter=25, q=0.4,tax="genus", bin="stg", output="dist", keep=95, type="sqs")
#'
#' # plot
#'   tsplot(stages, shading="series", boxes="sys", xlim=c(260,0), 
#'   ylab="range-through diversity (genera)", ylim=c(0,100))
#'   shades(stages$mid, cr$divRT, col="red")
#'   shades(stages$mid, UW$divRT, col="blue")
#'   shades(stages$mid, OW$divRT, col="green")
#'   shades(stages$mid, SQS$divRT, col="cyan")
#'   
#'   legend("topleft", bg="white", legend=c("CR (20)", "UW (3)", "OW (20)", "SQS (0.4)"), 
#'     col=c("red", "blue", "green", "cyan"), lty=c(1,1,1,1), lwd=c(2,2,2,2))
#' }
#'
#' @rdname subsample
#' @return Either a list of replicates or an object matching the class of \code{FUN}.
#' @export
subsample<- function(x,
	q,
	tax=NULL,
	bin=NULL,
	FUN=divDyn,
	coll=NULL,
	iter=50,
	type="cr",
	keep=NULL,
	rem=NULL,
	duplicates=TRUE,
	output="arit",
	useFailed=FALSE,
	FUN.args=NULL,
	na.rm=FALSE,
	counter=TRUE,
	...){
	
	# 0. rename arguments to make sure it is not interpreted as a closure anywhere
	quoVar <- q

	# arguments to be available later
	loop <- "for"

	# 1. SYSTEMATIC DEFENSE of main arguments
		# x
		datVect <- FALSE
		if(is.null(tax) & is.null(bin) & is.vector(x)){
			x <- data.frame(x, stringsAsFactors=FALSE)
			# keep in mind that the user will expect FUN to be applicable to a vector!
			datVect <- TRUE
			if(identical(FUN, divDyn)) stop("The default 'divDyn()' function cannot be run on vectors.")
		}

		if(!is.data.frame(x)) stop("'x' has to be a data.frame class object.")
		
		# get the call as list
		ca <- as.list(match.call())
			
		# tax
		if(any("tax"==names(ca))){
			# is it even a column?
			if(is.null(x[[tax]])) stop("You have to provide a valid taxon column.")
			if(length(tax) > 1) stop("Only one taxon column is allowed.")
			
			# logical
			bTaxNA <- is.na(x[,tax, drop=TRUE])
			if(na.rm){
				if(sum(bTaxNA)>0) x <- x[!bTaxNA,]
			}else{
				if(sum(bTaxNA)>0){
					warning("The 'tax' variable includes NAs.")
					x <- x[!bTaxNA,]
				}
			}
		}

		# bin
		if(any("bin"==names(ca))){
			if(!is.null(bin)){
				if(is.null(x[[bin]])) stop("The argument 'bin' is neither a column name in 'x' nor NULL.")
				if(length(bin) > 1) stop("Only one 'bin' column is allowed.")	
	

				# logical
				bVar<- x[,bin, drop=TRUE]
				bBinNA <- is.na(bVar)

				# depending on whether this is indicated, provide a warning
				if(na.rm){
					if(sum(bBinNA)>0) x <- x[!bBinNA,]
				}else{
					if(sum(bBinNA)>0){
						warning("The 'bin' variable includes NAs.")
						x <- x[!bBinNA,]
					}
				}

				if(sum(x[,bin, drop=TRUE]%%1)>0) stop("The bin column may only contain integers.")
			}
		}

		# q=quoVar
		if(!is.numeric(quoVar) | length(quoVar)>1) stop("Only a single numeric quota is allowed.")

		# FUN
		# check validity
		if(!is.null(FUN) & !is.function(FUN)) stop("The argument 'FUN' has to be a function or 'NULL'.")
		
		# match function
		if(!is.null(FUN)) {
			FUN<-match.fun(FUN)

			# stop if bins are not provided for divDyn()
			if(identical(FUN, divDyn) & is.null(bin)) stop("You cannot run the divDyn() function without bins.")
		}

		
		# iter - complete
		if(length(iter)!=1) stop("Only a single number of iterations is allowed.")
		if(!is.numeric(iter)) stop("The entered number of iterations is not numeric.")
		if(iter<1 | iter%%1!=0) stop("Only a positive integers are allowed.")
		
		# type
		if(!type%in%c("sqs", "cr", "oxw", "none") | length(type)!=1) stop("Invalid subsampling type.")
		
		# keep- below

		# rem -below
	
		# duplicates
		if(!is.logical(duplicates)) stop("'duplicates' has to be a logical value.")
	
		# output
		if(length(output)!=1) stop("A single 'output' type is needed/allowed.")
		if(!output%in%c("arit", "geom", "list", "dist")) stop("Invalid 'output' argument.")
		
		# useFailed
		if(!is.logical(useFailed) | length(useFailed)>1) stop("'useFailed' has to be a logical value.")
		

	# 2. DATAFRAME MODIFICATION, FILTERING and further checks
		
		# duplicates - omit multiple instances of the same genus, if desired!
		if(!duplicates){
			
			# requires the coll argument
			if(!is.null(coll)){
				if(is.null(x[[coll]])) stop("The argument 'coll' has to be either 'NULL' or a column name of 'x'. ")
				if(length(coll)>1) stop("Only a single collection variable is allowed.")
			} 
			# no need to check for NAs

			# execute the function
			bDupl<-duplicated(x[, c(tax, coll)])
			x<-x[!bDupl,]
		}

		# additional columns might need to be emptied for subsampling, but the prototype should run with these. 
		# as the columns will be omitted below, save the original dataset for prototyping
		origDat <- x
	
		# interval-specific checking
		if(!is.null(bin)){
			
			# for testing and later reuse
			level<-unique(x[,bin, drop=TRUE])
			level<-level[!is.na(level)]
		
			# keep
			if(!is.null(keep)){
				if(sum(!keep%in%level)>0) stop("Invalid keep argument.")
			}
			
			# rem
			if(!is.null(rem)){
				if(sum(!rem%in%level)>0) stop("Invalid rem argument.")
				if(length(rem)>0){
					x<-x[!x[, bin, drop=TRUE]%in%rem,]
				}
			}
		}

	# 3. ARGUMENT DISTRIBUTION and SUBFUNCTION DEFENSE

		# the list of all function arguments
		passedArgs <- list(...)
		
		# additional columns that need to be checked
		check <- NULL

		if(type=="none"){
			funList <- list(FUN)
			fuNames <- "FUN"
		} 
		if(type=="cr"){
			funList <- list(FUN, checkCR)
			fuNames <- c("FUN", "checkCR")
			
			# if present, add the 'unit' column to the columns to be checked
		
			# is the unit variable NA or erroneous?
			if(any("unit"==names(ca))){

				if(!is.null(passedArgs$unit)){
					if(is.null(x[[passedArgs$unit]])) stop("The argument 'unit' has to be either 'NULL' or a column name of 'x'. ")
					if(length(passedArgs$unit)>1) stop("Only a single 'unit' variable is allowed.")

					check<- c(check, (passedArgs$unit))
				} 
			}
		
		} 
	
		if(type=="oxw"){
			funList <- list(FUN, checkOXW)
			fuNames <- c("FUN", "checkOXW")
			
			# collection is required!
			if(is.null(coll)) stop("You cannot do oxw subsampling without providing a collection 'coll' variable.")
			if(is.null(x[[coll]])) stop("The argument 'coll' has to be either 'NULL' or a column name of 'x'. ")
			if(length(coll)>1) stop("Only a single collection variable is allowed.")

			check <- c(check,coll)
			
		} 
		if(type=="sqs"){
			funList <- list(FUN, checkSQSinexact, checkFreq)
			fuNames <- c("FUN", "checkSQSinexact", "checkFreq")

			if(quoVar>1 | quoVar<=0) stop("Shareholder Quorum has to be in the 0-1 interval.")

			# by-list argument
			if(any("byList"==names(ca))){
				if(!is.logical(passedArgs$byList) | length(passedArgs$byList)!=1) stop("Invalid 'byList' argument.")
				
				# this is incomplete!
				if(passedArgs$byList){
					if(is.null(coll)) stop("The byList method requires a collection variable.")
					if(is.null(x[[coll]])) stop("The argument 'coll' has to be a column name of 'x'. ")
					if(length(coll)>1) stop("Only a single collection variable is allowed.")
					
					check <- c(check,coll)
				}
			}

			# largestColl argument
			if(any("largestColl"==names(ca))){
				if(!is.logical(passedArgs$largestColl) | length(passedArgs$largestColl)!=1) stop("Invalid 'largestColl' argument.")
				
				# this is incomplete!
				if(passedArgs$largestColl){
					if(is.null(coll)) stop("You cannot correct with the largest collection, if you do not provide collection variable.")
					if(is.null(x[[coll]])) stop("The argument 'coll' has to be a column name of 'x'. ")
					if(length(coll)>1) stop("Only a single collection variable is allowed.")
					check <- c(check,coll)
				}
			}

			if(any("singleton"==names(ca))){
				if(!is.null(passedArgs$singleton)){
					if(passedArgs$singleton=="ref"){
						if(is.null(passedArgs$ref)) stop("The 'singleton=\"ref\"' option requires a reference variable.")
						if(is.null(x[[passedArgs$ref]])) stop("The argument 'ref' has to be a column name of 'x'. ")
						if(length(passedArgs$ref)>1) stop("Only a single reference variable is allowed.")
						check <- c(check,passedArgs$ref)
					}

				}

			}

		}
	
		# check these columnns
		if(length(check)>0){
			# check the entered argument
			for(i in 1:length(check)){
				# check for NAs
				bColNA <- is.na(x[,check[i], drop=TRUE])

				# remove every NA that can be found...
				if(na.rm){
					if(sum(bColNA)>0) x <- x[!bColNA,]
				# or check whether NAs pose a threat
				}else{
					# do you have to run subsampling on the interval where the NAs are?
					# separate check when not all intervals are included
					if(!is.null(bin) & length(keep)>0){
						if(sum(is.na(x[!x[,bin, drop=TRUE]%in%keep,check[i]]))>0) 
							stop(paste("The variable '",check[i],"' includes NAs in at least one interval-subset that you want to subsample.\nOmit these, or include the interval in 'keep'. "), sep="")
					}else{
						if(sum(bColNA)>0) stop(paste("The '", check[i],"'' variable includes NAs.", sep=""))
					}
				}
			}

		}
		
		allArgs<- c(list(
			x=x, 
			quoVar=q, 
			tax=tax,
			bin=bin,  
			FUN=FUN, 
			coll=coll, 
			iter=iter, 
			type=type, 
			keep=keep, 
			rem=rem, 
			duplicates=duplicates,  
			output=output,  
			useFailed=useFailed, 
			FUN.args=FUN.args), passedArgs)

		# the distributed output
		distributedArgs <- argdist(funList, fuNames, allArgs, unused=TRUE)

		# evaluate these
		appliedArgs <- distributedArgs$FUN

		# the prototype should use the original data structure
		# in case was a vector
		if(datVect){
			appliedArgs$x <- origDat[,1, drop=TRUE]
		# or a data.frame
		}else{
			appliedArgs$x <- origDat
		}
		
		# arguments that for the applied function (FUN)
		if(is.list(FUN.args)){
			# remove more specific arguments
			appliedArgs<- appliedArgs[!names(appliedArgs)%in%names(FUN.args)]
			appliedArgs<- c(appliedArgs, FUN.args)
		}

		# check the formals of the function itself, and what is not used just state as a warning
		unUsedArgs <- distributedArgs[["0"]]
		unUsed <- names(unUsedArgs)
		remain <- unUsed[!unUsed%in%names(formals(subsample))]

		# there is an empty quote '' if you do not do this. This is something I do not understand...
	#	remain<- remain[-1]

		if(length(remain)>0) warning(paste("The argument(s) '", paste(remain, collapse="', '"), "' is/are not used with current configuration.", sep=""))

		# defend the subfunctions
			if(type=="cr") do.call("checkCR", distributedArgs$checkCR)
		#	if(type=="bs") do.call("checkBS", distributedArgs$checkBS)
			if(type=="oxw") do.call("checkOXW", distributedArgs$checkOXW)
			if(type=="sqs"){
				do.call("checkFreq", distributedArgs$checkFreq)
				do.call("checkSQSinexact", distributedArgs$checkSQSinexact)
			} 
	
	# 4. PROTOTYPING - create a replicate based on the total dataset
	# only if there is a function and if there is a matching phase at the end
	if(!is.null(FUN) & output!="list"){
		prototype<-do.call(FUN, appliedArgs)

		# criteria for FUN!!! now is the time to restrain people from using bullshit
		# matching directive, and failreplacement settings
	

		# should the bin-specific results be replaced? default:NO
		replaceFailed <- FALSE
			
		# vectors
		if(is.vector.like(prototype)){

			# try dimensions for matching if no names are present
			if(is.null(names(prototype))){
				matchdir <- "dim"
			}else{
				matchdir <- "name"
				
				
				if(!is.null(bin)){
					# if all the bins are in the names, or if they are omitted they should be omitted if failed
					if(sum(!level%in%names(prototype))==0){
						replaceFailed <- TRUE
					}
				}
			}
		}
		
		# matrix and data.frames
		if(is.data.frame(prototype) | is.matrix(prototype)){
			
			# use dimensions for matching if at least one name is missing
			if(is.null(colnames(prototype)) | is.null(rownames(prototype))){
				matchdir <- "dim"
			}else{
				matchdir <- "name"

				if(!is.null(bin)){
					# if bins represent rows
					if(sum(!level%in%rownames(prototype))==0){
							replaceFailed <- TRUE
					}
				}
			}
		}

		# special prototyping should be used for divDyn()
		if(identical(FUN, divDyn)) a<-NA

	}


	# 5. PREPARATORY CALCULATIONS and SETTINGS

	# 5A. binned subsampling
	if(!is.null(bin)){
		
		# 1. Classical Rarefaction - CR (already implemented in Rcpp)
		if(type=="cr"){
			# order the rows 
			oBin <- order(x[,bin, drop=TRUE])
			
			# reorder dataset
			x <- x[oBin,]
			
			# by-unit subsampling?
			if("unit"%in%names(distributedArgs$checkCR)){
				byUnit <- TRUE
				unitVar <- x[,distributedArgs$checkCR$unit, drop=TRUE]
				rows <- 1:nrow(x)
						
			}else{
				byUnit <- FALSE
			}

			# the only important variable
			binVar <- x[,bin, drop=TRUE]
			
			# intact rows
			keepRows<-binVar%in%keep
			
			# just organize
			oneResult<-NULL
			

		}

		# 2. Occurrence-weighted by list subsampling - OXW (not yet implemented in Rcpp)
		if(type=="oxw"){
			
			# add the arguments that were not available previously
			oxwArgs <- c(
				distributedArgs$checkOXW, 
				list(intact=keep, binVar=x[,bin, drop=TRUE],collVar=x[, coll, drop=TRUE]))
		}		

		# 3. Shareholder Quorum Subsamping - SQS (not yet implemented in Rcpp)
		if(type=="sqs"){
			# frequency calculation
			freqVar<-do.call(frequencies, distributedArgs$checkFreq)

			# by-list is by default: FALSE - to make coll necessary, you have to include it in the added arguments
			collVar <- NULL

			if(!is.null(distributedArgs$checkSQSinexact$byList)){
				if(distributedArgs$checkSQSinexact$byList){
					# and reassing the collection variable
					collVar <- x[, coll, drop=TRUE]
				}
			}
			
			# and create argument basis for SQS
			sqsArgs<-c(
				distributedArgs$checkSQSinexact,
				list(
					"binVar"=x[,bin, drop=TRUE],
					"freqVar"=freqVar,
					"collVar"=collVar,
					"intact"=keep
				)
			)

		}

		# 4. a fake subsampling trial dataset index container
		if(type=="none"){
			oneResult<- NULL
		}

		# x4. Failure-tracking setup
		failMat<-matrix(FALSE, ncol=iter, nrow=length(level))
		rownames(failMat) <- level


	# 5B. not bin-specific subsampling	
	}else{
		if(type=="cr"){
			# by-unit subsampling?
			if("unit"%in%names(distributedArgs$checkCR)){
				byUnit <- TRUE

				unitVar <- x[,distributedArgs$checkCR$unit, drop=TRUE]
				tUn <- table(unitVar)
				

				# if useFailed=FALSE, check whether the quota is low enough
				if(!useFailed & length(tUn)<quoVar) stop("The entered quota is too high. Set 'useFailed=TRUE' to get a result with the unsubsampled dataset.")
				if(length(tUn)< quoVar) quoVar <- length(tUn)


			}else{
				# if useFailed=FALSE, check whether the quota is low enoug
				if(!useFailed & nrow(x)<quoVar) stop("The entered quota is too high. Set 'useFailed=TRUE' to get a result with the unsubsampled dataset.")
				if(nrow(x)< quoVar) quoVar <- nrow(x)
				byUnit <- FALSE
				rows <- 1:nrow(x)
			}
		}
		if(type=="oxw"){
			# collecction - number present?
			if(is.null(coll))stop("You cannot do oxw subsampling without providing a collection 'coll' variable.")
			collVar <- x[[coll]]
			if(is.null(collVar)) stop("The 'coll' argument is not a variable in 'x'.")
			if(sum(is.na(collVar))>0) stop("The 'coll' variable includes NAs, please omit these manually.")

			# number of items in a collection + raise to the exponent
			collTab <- table(collVar)^distributedArgs$checkOXW$xexp
			quoVar <- quoVar
			
			# defense against high quotas
			if(!useFailed & sum(collTab)<quoVar) stop("The entered quota is too high. Set 'useFailed=TRUE' to get a result with the unsubsampled dataset.")
			if(sum(collTab)< quoVar) quoVar <- length(collTab)

		}
		if(type=="sqs"){

			# frequency calculation , but without the bin=NULL
			freqVar<-do.call(frequenciesOne, distributedArgs$checkFreq[names(distributedArgs$checkFreq)!="bin"])

			# by-list is by default: FALSE - to make coll necessary, you have to include it in the added arguments
			collVar <- NULL

			if(!is.null(distributedArgs$checkSQSinexact$byList)){
				if(distributedArgs$checkSQSinexact$byList){
					# and reassing the collection variable
					collVar <- x[, coll, drop=TRUE]
				}
			}
			
			# and create argument basis for SQS
			sqsArgs<-c(
				distributedArgs$checkSQSinexact,
				list(
					"freqVar"=freqVar,
					"collVar"=collVar
				)
			)



		}
		

	}

	# 5D. divDyn-specific optimization
	# override noNAStart, to make sure it that plotting-related empty rows do not affect performance
	if(identical(FUN, divDyn) & output!="list"){
		appliedArgs$noNAStart <- TRUE
	}

	
	# 6. ITERATION - rerun the subsampling function in loops, and run the applied function

	# In case there is random number generation in RCpp
	if(type=="cr"){
		# in case the seed was not set earlier
		if(!exists(".Random.seed")) set.seed(NULL)
	}

	# 6A.  FOR-type iteration
	if(loop=="for"){
	
		# container for iteration results
		containList<-list()

		# a. bin-specific subsampling
		if(!is.null(bin)){
			
			# for every loop 
			for(k in 1:iter){

				# 1. Subsampling subphase
				# classical rarefaction
				if(type=="cr"){
					# switch, which subsampling method should be used - by-unit or by-row
					
					#by-row method
					if(!byUnit){

						# generate a single random number,it will move the generator state forward
						# as long as the state can be linked to the R RNG seed, this should work fine
						seed <- sample(max(.Random.seed),1)
						# message(seed)

						# this is probably not necessary,but just in case
						set.seed(seed)

						# output of cpp function
						binCRres<-.Call('_divDyn_CRbinwise', PACKAGE = 'divDyn', binVar, quoVar, seed)
					
						# increase indexing (R)
						binCRres[,1]<-binCRres[,1, drop=TRUE]+1
					
						# rows where quota not reached
						failOutput <- binCRres[,1, drop=TRUE]==-8
						
						# bins where quota not reached
						oneResult$fail<-binCRres[failOutput,2]
						
						# the rows where 
						oneResult$rows <-rep(FALSE, nrow(x))
						
						# output by subsampling
						oneResult$rows[binCRres[!failOutput,1, drop=TRUE]] <- TRUE
						
						oneResult$rows[keepRows] <- TRUE

					
					# by-unit
					}else{
						
						# subsampling a tapply() loop			
						tap <- tapply(INDEX=binVar, X=rows, function(w){
							unitSlice<-unitVar[w]
							
							# random sample
							uniqueUnit<-sample(unique(unitVar[w]))
							
							if(quoVar<=length(uniqueUnit)){
								# rows corresponding t
								return(w[unitSlice%in%uniqueUnit[1:quoVar]])
							}else{
								return(NULL)
							}
						
						})
						
						# what was not enough for the subsampling
						failed<-unlist(lapply(tap, is.null))
						failed<-as.numeric(names(failed[failed]))
						failed<-failed[!failed%in%keep]
						
						# the rows that should be passed
						trialRows <- rows%in%unlist(tap)
						
						# these rows should be present regardless of the subsampling
						intactRows<-binVar%in%keep
						
						# combine the two
						subsampleRows<-trialRows | intactRows
						
						# result
						oneResult<-list(rows=subsampleRows, fail=failed)
						
					} # end by-unit CR

				} # end CR

				# Occurrence-weighted by list subsampling
				if(type=="oxw"){
					
					# trial dataset rows + failed
					oneResult<-do.call(subsampleOXW, args=oxwArgs)

				} # end OXW
				
				# Shareholder-quorum subsampling
				if(type=="sqs"){
					
					# trial dataset rows + failed
					oneResult<-do.call(subsampleSQSinexact, args=sqsArgs)
					
				} # end SQS


				# 2. make the trial dataset from the subsampling trial outputs
					# subset the function-namespace x and to copy to appliedArgs
					appliedArgs$x<-x[oneResult$rows,]


				# 3. Failure registration subphase
				# add the failed stuff if so required
				if(useFailed){

					# add the failed bins' original data to the trial dataset
					appliedArgs$x<-rbind(appliedArgs$x, x[x[,bin, drop=TRUE]%in%oneResult$fail,])

				# or keep track
				}else{
					# save in the matrix
					failMat[as.character(oneResult$fail),k] <-TRUE
				}

				# 4. Application subphase

				# nothing is done
				if(is.null(FUN)){
					trialRes<-appliedArgs$x

				# apply FUN
				}else{
					trialRes<-do.call(FUN,appliedArgs)
				}

				# 5. Storage subphase
				containList[[k]]<-trialRes

				# 6. loop display
				if(counter){
					cat(k, "\r")
					utils::flush.console()
				}

			} # end for iteration

		# b. no bin subsampling
		}else{

			# for every iteration
			for(k in 1:iter){
				
				# A. CR method
				if(type=="cr"){
					# by-row CR subsampling
					if(!byUnit){
						index <- sample(rows,quoVar, replace=FALSE) 
	
					# by-unit CR subsampling
					}else{
						# randomize the unit table
						randTab <- sample(tUn)
						passed <- randTab[1:quoVar]

						# boolean index
						index <- as.character(unitVar)%in%names(passed)
					}
					

				} # end CR method

				if(type=="oxw"){
					randTab <- sample(collTab)

					# raise to exponent
					summed <- cumsum(randTab)
					bNeed<- summed <= quoVar

					index <- collVar%in%(names(randTab)[bNeed])
				}

				# Shareholder-quorum subsampling
				if(type=="sqs"){
					
					# trial dataset rows or FAILED
					index<-do.call(subsampleSQSinexactOne, args=sqsArgs)

					# NULL indicates that the trial failed
					if(is.null(index)){
						if(!useFailed){
							stop("The entered quorum is too high. Set 'useFailed=TRUE' to get a result with the unsubsampled dataset.")
						}else{
							index<- rep(TRUE, length(freqVar))
						}
					}
					
				} # end SQS

				# 2. the result of subsampling - trial dataset
				# as a vector, if it was so originally
				if(datVect){
					appliedArgs$x <- x[index,1, drop=TRUE]
				# or a data.frame
				}else{
					appliedArgs$x <- x[index,]
				}
			
				
				# 3. Application subphase
				# nothing is done
				if(is.null(FUN)){
					trialRes<-appliedArgs$x

				# apply FUN
				}else{
					trialRes<-do.call(FUN,appliedArgs)
				}

				# 4. Storage subphase
				containList[[k]]<-trialRes

				# 5. loop display
				if(counter){
					cat(k, "\r")
					utils::flush.console()
				}

			} # end iteration for loop

		}# end no bins

	} # end of for-type looping

	# 7. POST-PROCESS
	# which bins failed the trials?
	if(!is.null(bin)){
		failedBins<-sort(rownames(failMat)[which(apply(failMat,1, sum)>0)])
	}

	# 8. MATCHING subphase

	# flag, will the match succeed after going through everything?
		matchFailed <- FALSE

	# a. if output should not be a list
	# originally specified or function to be applied
	if(output!="list" & !is.null(FUN)){
		
		# choose the appropriate averagin function
		if(output=="arit") AFU <- match.fun(mean)
		if(output=="geom") AFU <- match.fun(geom)
		if(output=="dist") AFU <- NULL

		# try to match replicates to prototype, and apply the averaging function
		finalResult<- tryCatch(repmatch(containList, FUN=AFU, proto=prototype, direct=matchdir, na.rm=TRUE), error=function(a){
			message("Replicate matching failed, returning a list.")
			NULL
		})

		# if it succeeds
		if(!is.null(finalResult)){ 
			if(!is.null(bin)){
				# make the failed replacements.
				if(replaceFailed & !useFailed){
				
					# vector-specific 
					if(is.vector.like(prototype)){
						if(output!="dist"){
							finalResult[failedBins] <- NA
						}else{
							finalResult[failedBins,] <- NA
						}
					}

					# data.frame-specific things
					if(is.data.frame(prototype) | is.matrix(prototype)){
						if(output!="dist"){
							finalResult[failedBins,] <- NA
						}else{
							finalResult <- lapply(finalResult, function(w){
								w[failedBins,] <- NA
								w
							})
						}
					}
				}

				# divDyn-specific optimization
				# add the complete bin numbers back, except when distribution list!
				if(identical(divDyn, FUN) & output!="dist"){
					finalResult[,bin] <- prototype[,bin]
				}
			}

		}else {
			matchFailed <- TRUE
		}

	}

	# b. if 1.list output is asked, matching fails, or no applied FUN
	if(output=="list" | matchFailed | is.null(FUN)){
		if(!is.null(bin)){
			finalResult <- list(results=containList, failed=failedBins)
		}else{
			finalResult<-containList
		}
	}

	# 9. RETURN
	return(finalResult)
}





########################################################
# CHECKING FUNCTIONS


# These functions help the argdist() function to select the appropriate arguments
# after the arguments are separated they are called to check whether they are appropriate or not.

# only for checking!
checkCR<-function(x, quoVar, unit=NULL){
	# quota
	if(quoVar<=1 | quoVar%%1!=0) stop("The quota has to be a natural number larger than 1.")
}

#checkBS<-function(x, unit=NULL){
#	
#	if(!is.null(unit)){
#		# check whether this is present in the broader function scope 'x'
#		if(is.null(x[[unit]])) stop("The 'x' object has no column 'unit'. ")
#
#		# check if the unit column contains NAs
#		if(sum(is.na(x[,unit, drop=TRUE])>0)) stop("Variable 'unit' includes NAs, please omit these rows manually.")
#	}
#}


# for checking and distribution, additional arguments to subsampleOXW will be added within the subsample() function.
checkOXW<- function(quoVar, intact=NULL,xexp=1){
	# the quota variable
	if(quoVar<=1 | quoVar%%1!=0) stop("The quota has to be a natural number larger than 1.")
			
	# quota check
	# the xexp 
		if(!is.numeric(xexp) | length(xexp)!=1) stop("Invalid 'xexp' argument.")
		if(xexp<0) stop("Invalid 'xexp' argument.")

	# intact is an internal argument
	intact <- NULL

}

checkFreq <- function(x,bin, tax, coll=NULL, ref=NULL, singleton="occ", excludeDominant=FALSE, largestColl=FALSE, fcorr="good"){

	# ref is only checked here
	if(!is.null(ref)) if(is.null(x[[ref]])) stop("The argument 'ref' has to be either 'NULL' or a column name of 'x'. ")
	if(length(ref)>1) stop("Only a single reference variable is allowed.")

	# singleton
	if(!singleton%in%c("ref", "occ", FALSE) | length(singleton)!=1) stop("Invalid 'singleton' value.")
	if(singleton=="ref" & is.null(ref)) stop("You cannot calculate coverage based on references, if you do not provide a references variable.")

	# excludeDominant
	if(!is.logical(excludeDominant) | length(excludeDominant)!=1) stop("Invalid 'excludeDominant' argument.")
	
	if(!fcorr%in%c("good", "alroy") | length(fcorr)!=1) stop("Invalid 'fcorr' argument.")

	# things that are present but do not have to be checked - checked before
	x <- NULL
	bin <- NULL
	tax <- NULL
	
}

checkSQSinexact <- function(quoVar, byList=FALSE, intact=NULL, appr="under", trialRet="occ"){
	
	# check whether coll is existing in the broader namespace
	
	
	intact <- NULL
	appr <- NULL
	trialRet<-NULL
}


########################################################
# Type-specific functions

# excludeDominant sets the Alroy's 2nd correction, sets the frequency of the dominant taxon's to 0
frequencies<-function(x,bin, tax, coll=NULL, ref=NULL, singleton="occ", excludeDominant=FALSE, largestColl=FALSE, fcorr="good"){

	# the number of occurrences
		O<-table(x[,bin, drop=TRUE])
		rowIndex<-1:nrow(x)
	
	#calculate the frequencies
	if(!excludeDominant){
		# calculate the taxon frequencies in the bin, and assign the value to each occurrence
		freqVarList <-tapply(INDEX=x[,bin, drop=TRUE], X=rowIndex, function(w){
		#	w<-rowIndex[unique(x[,bin])[1]==x[,bin]]
			
			sliceTaxa<-x[w,tax, drop=TRUE]
			taxTab<-table(sliceTaxa)
			freq<-taxTab/length(sliceTaxa)
			
			occWise <- freq[sliceTaxa]
			names(occWise)<-w
			return(occWise)
		
		})
		
		# frequencies
		freqVar<-unlist(freqVarList, use.names=F)
		
		# row identifiers
		nameVec<-unlist(lapply(freqVarList,function(w){
			names(w)
	
		}))
		names(freqVar)<-nameVec
	
		# reorder to reflect the order of occurrences
		freqVar<-freqVar[as.character(rowIndex)]
		
	# adjust dominant to 0
	}else{
		freqVarList <-tapply(INDEX=x[,bin], X=rowIndex, function(w){
			sliceTaxa<-x[w,tax, drop=TRUE]
			taxTab<-table(sliceTaxa)
			freq<-taxTab/length(sliceTaxa)
			domIndex<-which((max(freq)==freq))
			freq[domIndex[1]]<-0
			
			occWise <- freq[sliceTaxa]
			names(occWise)<-w
			return(occWise)
		
		})
		
		freqVar<-unlist(freqVarList, use.names=F)
		nameVec<-unlist(lapply(freqVarList,function(w){
			names(w)
	
		}))
		names(freqVar)<-nameVec
	
		# reorder to reflect the order of occurrences
		freqVar<-freqVar[as.character(rowIndex)]
		
	}
	
	# Alroy 1st correction
	if(singleton%in%c("ref", "occ")){
		
		# number of sampled taxa
		vectS<-tapply(INDEX=x[,bin, drop=TRUE], X=x[,tax, drop=TRUE], function(w){
			return(length(unique(w)))	
		})
			
		# use single reference taxa
		if(singleton=="ref"){
			# count single reference taxa
			refTax<-unique(x[,c(ref,tax)])
			tTax<-table(refTax[,tax, drop=TRUE])
			
			# single reference taxa
			singRefTaxa<-names(tTax)[tTax==1]
			
			# the number of occurrences of single reference taxa
			p1<-tapply(INDEX=x[,bin, drop=TRUE], X=rowIndex, function(w){
				sliceTax<-x[w,tax, drop=TRUE]
				return(sum(sliceTax%in%singRefTaxa))
			})
			
			# two reference taxa
			twoRefTaxa<-names(tTax)[tTax==2]
			
			# the number of occurrences of single reference taxa
			p2<-tapply(INDEX=x[,bin, drop=TRUE], X=rowIndex, function(w){
				sliceTax<-x[w,tax, drop=TRUE]
				return(sum(sliceTax%in%twoRefTaxa))
			})
			
			
		}
		
		# use single occurrence taxa
		if(singleton=="occ"){
		
			# depending on whether there is a collection entry
			if(!is.null(coll)){
				occTax<-unique(x[,c(tax, bin, coll)])
			}else{
				occTax<-x[, c(tax, bin)]
			}
			
			
			p1<-tapply(INDEX=occTax[,bin, drop=TRUE], X=occTax[,tax, drop=TRUE], function(w){
				tTax<-table(w)
				return(sum(tTax==1))
			})
			
			p2<-tapply(INDEX=occTax[,bin, drop=TRUE], X=occTax[,tax, drop=TRUE], function(w){
				tTax<-table(w)
				return(sum(tTax==2))
			})
			
			# 'singRefTaxa' taxa that occurr only in single collections
			singRefTaxa<-tapply(INDEX=occTax[,bin, drop=TRUE], X=occTax[,tax, drop=TRUE], function(w){
				return(names(table(w)))
				
			})
		
		}
	
		# Alroy 2nd correction
		if(excludeDominant){
			# count the dominant taxon 
			n1 <-tapply(INDEX=x[,bin, drop=TRUE], X=rowIndex, function(w){
				sliceTaxa<-x[w,tax, drop=TRUE]
				taxTab<-table(sliceTaxa)
				domIndex<-which((max(taxTab)==taxTab))
				ret<-taxTab[domIndex]
				return(ret[1])
			})
			
			# then set 
			# for the original solution
			uP <-(O-p1-n1)/(O-n1)
			
			# for Alroy's solution
			correction<-(((p1+p2)/2)/vectS)/(O-n1)
			
			# Alroy's 3rd correction
			if(largestColl){
			
				#number of collections a taxon belong to (total dataset)
				taxTab<-table(x[,tax, drop=TRUE])
				
				# taxa in the most diverse collection - that are single-reference and do not occur anywhere else
				tMax<-tapply(INDEX=x[,bin, drop=TRUE], X=rowIndex, function(w){
				#	w<-rowIndex[x[,bin]==61]
				
					# unique collection/taxon table
					collTax<-unique(x[w,c(coll,tax)])
					# how many taxon/collection
					tColl<-sort(table(collTax[,coll, drop=TRUE]), decreasing=T)
					
					# which are the taxa that are there
					taxInMost<-collTax[collTax[,coll, drop=TRUE]==names(tColl[1]), tax]
					
					# which are single reference taxa
					taxInMostAndSingRef<-taxInMost[taxInMost%in%singRefTaxa]
					bTaxInMostAndSingRef<-names(taxTab)%in%taxInMostAndSingRef
					
					# taxa only ever found in the most diverse collection
					return(sum(taxTab==1 & bTaxInMostAndSingRef))
					
					
				})
				
				# calculate u'
				uP <- (O-p1-n1+tMax)/(O-n1)
		
				# for Alroy's solution (exclude taxa found in the largest collection from the single-reference taxa)
				correction<-((((p1-tMax)+p2)/2)/vectS)/(O-n1)
				
			}
		
		}else{
			uP = 1-p1/O
			correction<-(((p1+p2)/2)/vectS)/O
		}
		# map uP to freqVar and adjust
		if(fcorr=="good"){
			freqVar<-freqVar*uP[as.character(x[,bin, drop=TRUE])]
		}
		if(fcorr=="alroy"){
			
			freqVar <- freqVar-correction[as.character(x[,bin, drop=TRUE])]
		
		}
	}
	names(freqVar)<-x[,tax, drop=TRUE]
	return(freqVar)
}



# excludeDominant sets the Alroy's 2nd correction, sets the frequency of the dominant taxon's to 0
frequenciesOne<-function(x, tax=NULL, coll=NULL, ref=NULL, singleton="occ", excludeDominant=FALSE, largestColl=FALSE, fcorr="good"){
	# just a single variable passed
	if(ncol(x)==1) tax<-1

	# the number of occurrences
		O<-nrow(x)
		bin<-NULL
	#calculate the frequencies
	if(!excludeDominant){
		# calculate the taxon frequencies in the bin, and assign the value to each occurrence
		sliceTaxa<-x[,tax, drop=TRUE]
		taxTab<-table(sliceTaxa)
		freq<-taxTab/length(sliceTaxa)
		
		freqVar <- freq[sliceTaxa]
		
	# adjust dominant to 0
	}else{
		domIndex<-which((max(freqVar)==freqVar))
		freqVar[domIndex[1]]<-0
	}
	
	# Alroy 1st correction
	if(singleton%in%c("ref", "occ")){
		# number of sampled taxa
		vectS <- length(taxTab)
			
		# use single reference taxa
		if(singleton=="ref"){
			# count single reference taxa
			refTax<-unique(x[,c(ref,tax)])
			tTax<-table(refTax[,tax, drop=TRUE])
			
			# single reference taxa
			singRefTaxa<-names(tTax)[tTax==1]
			
			# the number of occurrences of single reference taxa
			sliceTax<-x[,tax, drop=TRUE]
			p1 <- sum(sliceTax%in%singRefTaxa)

			# two reference taxa
			twoRefTaxa<-names(tTax)[tTax==2]
			p2 <- sum(sliceTax%in%twoRefTaxa)
			
		}
		
		# use single occurrence taxa
		if(singleton=="occ"){
		
			# depending on whether there is a collection entry
			if(!is.null(coll)){
				occTax<-unique(x[,c(tax, coll)])
			}else{
				occTax<-x[, c(tax), drop=FALSE]
			}
			# table
			tTax<-table(occTax[,tax, drop=TRUE])
			p1 <- sum(tTax==1)
			p2 <- sum(tTax==2)

			# 'singRefTaxa' taxa that occurr only in single collections
			singRefTaxa<-names(table(occTax[,tax, drop=TRUE]))
		
		}
	
		# Alroy 2nd correction
		if(excludeDominant){
			# count the dominant taxon 
			sliceTaxa<-x[,tax, drop=TRUE]
			taxTab<-table(sliceTaxa)
			domIndex<-which((max(taxTab)==taxTab))
			ret<-taxTab[domIndex]
			n1<- ret[1]
			names(n1) <-NULL
			
			
			# then set 
			# for the original solution
			uP <-(O-p1-n1)/(O-n1)
			
			# for Alroy's solution
			correction<-(((p1+p2)/2)/vectS)/(O-n1)
			
			# Alroy's 3rd correction
			if(largestColl){
			
				#number of collections a taxon belong to (total dataset)
				taxTab<-table(x[,tax, drop=TRUE])
				
				# taxa in the most diverse collection - that are single-reference and do not occur anywhere else
			
					# unique collection/taxon table
					collTax<-unique(x[,c(coll,tax)])
					# how many taxon/collection
					tColl<-sort(table(collTax[,coll, drop=TRUE]), decreasing=T)
					
					# which are the taxa that are there
					taxInMost<-collTax[collTax[,coll, drop=TRUE]==names(tColl[1]), tax]
					
					# which are single reference taxa
					taxInMostAndSingRef<-taxInMost[taxInMost%in%singRefTaxa]
					bTaxInMostAndSingRef<-names(taxTab)%in%taxInMostAndSingRef
					
					# taxa only ever found in the most diverse collection
					tMax<- sum(taxTab==1 & bTaxInMostAndSingRef)
					
				# calculate u'
				uP <- (O-p1-n1+tMax)/(O-n1)
		
				# for Alroy's solution (exclude taxa found in the largest collection from the single-reference taxa)
				correction<-((((p1-tMax)+p2)/2)/vectS)/(O-n1)
				
			}
		
		}else{
			uP = 1-p1/O
			correction<-(((p1+p2)/2)/vectS)/O
		}
		# map uP to freqVar and adjust
		if(fcorr=="good"){
			freqVar<-freqVar*uP
		}
		if(fcorr=="alroy"){
			
			freqVar <- freqVar-correction
		
		}
	}
	names(freqVar)<-x[,tax, drop=TRUE]
	return(freqVar)
}


# Subsampling utility functions:
subsampleOXW<-function(binVar, collVar, quoVar, intact=NULL,xexp=1){
	
	# future argument: appr represents how the subsampling quota is approached
	# appr can be "over", "under", "optimize"
	appr="over"
	
	# the chosen quota
	if(xexp==0){
		appr <- "other"
	}
	rows<- 1:length(binVar)
	
	#keep track of not enough
	ne<-NULL
	tap<-tapply(X=rows, INDEX=binVar, FUN=function(y){
	#	y<-rows[binVar==i]
		# because indices are subsetted
		collection<-collVar[y]
		
		# get the weight of each collection
		expTab<-table(collection)^xexp
		
		# shuffle
		expTab<-sample(expTab)
		
		# add up
		cumulative<-cumsum(expTab)


		# there are enough 
		if(length(cumulative)>0){
			if(cumulative[length(cumulative)]>quoVar){
				bSelect<-cumulative<=quoVar
				# potential forking!!! 
				if(appr=="over"){
					if(sum(bSelect)!=length(bSelect)){
						bSelect[sum(bSelect)+1] <- TRUE
					}
				}
				
				# the collections to keep
				keepCollections<-as.numeric(names(cumulative)[bSelect])
				return(y[collection%in%keepCollections])
				
			}else{
				return(NULL)
			}
		}else{
			return(NULL)
			
		}
	
	})
	
	# what was not enough for the subsampling
	failed<-unlist(lapply(tap, is.null))
	failed<-as.numeric(names(failed[failed]))
	failed<-failed[!failed%in%intact]
	
	# the rows that should be passed
	trialRows <- rows%in%unlist(tap)
	
	# these rows should be present regardless of the subsampling
	intactRows<-binVar%in%intact
	
	# combine the two
	subsampleRows<-trialRows | intactRows
	
	# result
	res<-list(rows=subsampleRows, fail=failed)

	# return
	return(res)

}

# single-bin sqs
subsampleSQSinexactOne<-function(freqVar, quoVar, collVar=NULL, byList=FALSE, intact=NULL, appr="under", trialRet="occ"){
	# non by-list SQS
	if(!byList){

		# the frequencies
		frequencies<-freqVar
		
		# the taxa
		taxa<-names(freqVar)
		
		# shuffle, we want to go by occurrences (random order),
		shuffledOccs<-sample(1:length(taxa))
		shuffledSp<-taxa[shuffledOccs]
		shuffledFreq<-frequencies[shuffledOccs]
		
		# but should only count new species' coverage
		speciesFirstEncounterOrder<-shuffledSp[!duplicated(shuffledSp)]
		coverageOrder<-shuffledFreq[!duplicated(shuffledSp)]
		
		# add up
		cumulative<-cumsum(coverageOrder)
		
		if(length(cumulative)>0){
			if(cumulative[length(cumulative)]>quoVar){
				bSelect<-cumulative<=quoVar

				# potential forking!!! 
				if(appr=="over"){
					if(sum(bSelect)!=length(bSelect)){
						bSelect[sum(bSelect)+1] <- TRUE
					}
				}

				# if first selected species have larger then quorum frequency 
				if(appr=="under" & sum(bSelect)==0){
					bSelect[1]<-TRUE	
				}
				
				# the species to keep
				keepTaxa<-speciesFirstEncounterOrder[bSelect]
				
				# what should the function return? all occurrences of the 
				# species, or just those that were skimmed by the algorithm?
				if(trialRet=="occ"){
					# first encounter of the last:
					lastTaken<-keepTaxa[max(which(bSelect))]
					# keep occurrences before that
					lastIndex<-min(which(shuffledSp==lastTaken))
					res<- rep(FALSE, length(freqVar))
					res[shuffledOccs[1:lastIndex]] <- TRUE
				}
				
				if(trialRet=="taxon"){
					res<- (taxa%in%keepTaxa)
				}
				
			}else{
				res<-NULL
			}
		}else{
			res<-NULL
		}
			
	# by-list SQS
	}else{
		# the frequencies
		frequencies<-freqVar
		
		# the taxa
		taxa<-names(freqVar)
		
		# collections
		colls<-collVar
		
		colTab<-table(taxa, colls)
		class(colTab) <- "matrix"
		
		# ensure that it will run if there is no collection info
		if(ncol(colTab)>1){
			
			# shuffle collections
			newOrd<-sample(1:ncol(colTab))
			colTab<-colTab[,newOrd, drop=FALSE]
			
			firstOccur<-apply(colTab, 1, function(b){
				min(which(as.logical(b)))
			})
			# empty
			colTab[,]<-0
			
			# frequencies should be calculated when species occur for the first time
			colTab[cbind(1:nrow(colTab),firstOccur)]<-frequencies[names(firstOccur)]
			
			# frequencies to be cumulated
			coverageOrder<-apply(colTab, 2, sum)
			
			# add up
			cumulative<-cumsum(coverageOrder)
			
			if(length(cumulative)>0){
				if(cumulative[length(cumulative)]>quoVar){
					bSelect<-cumulative<=quoVar
					# potential forking!!! 
					if(appr=="over"){
						if(sum(bSelect)!=length(bSelect)){
							bSelect[sum(bSelect)+1] <- TRUE
						}
					}
					
					# the collections to keep
					keepColls<-colnames(colTab)[bSelect]
					
					res<-(colls%in%keepColls)
					
					
				}else{
					res<-NULL
				}
			}else{
				res<-NULL
			}
		}
		
		
	} # end bylist
	
	# return
	return(res)

}


#- multi-bin inexact sqs
subsampleSQSinexact<-function(binVar, freqVar, quoVar, collVar=NULL, byList=FALSE, intact=NULL, appr="under", trialRet="occ"){
	
	rows<- 1:length(binVar)
	#keep track of not enough
	ne<-NULL
	
	# non by-list SQS
	if(!byList){
		tap<-tapply(X=rows, INDEX=binVar, FUN=function(y){
			# the frequencies
			frequencies<-freqVar[y]
			
			# the taxa
			taxa<-names(freqVar)[y]
			
			# shuffle, we want to go by occurrences (random order),
			shuffledOccs<-sample(1:length(taxa))
			shuffledSp<-taxa[shuffledOccs]
			shuffledFreq<-frequencies[shuffledOccs]
			
			# but should only count new species' coverage
			speciesFirstEncounterOrder<-shuffledSp[!duplicated(shuffledSp)]
			coverageOrder<-shuffledFreq[!duplicated(shuffledSp)]
			
			# add up
			cumulative<-cumsum(coverageOrder)
			
			if(length(cumulative)>0){
				if(cumulative[length(cumulative)]>quoVar){
					bSelect<-cumulative<=quoVar

					# potential forking!!! 
					if(appr=="over"){
						if(sum(bSelect)!=length(bSelect)){
							bSelect[sum(bSelect)+1] <- TRUE
						}
					}

					# if first selected species have larger then quorum frequency 
					if(appr=="under" & sum(bSelect)==0){
						bSelect[1]<-TRUE	
					}
					
					# the species to keep
					keepTaxa<-speciesFirstEncounterOrder[bSelect]
					
					# what should the function return? all occurrences of the 
					# species, or just those that were skimmed by the algorithm?
					if(trialRet=="occ"){
						# first encounter of the last:
						lastTaken<-keepTaxa[max(which(bSelect))]
						# keep occurrences before that
						lastIndex<-min(which(shuffledSp==lastTaken))
						rowShuffle<-y[shuffledOccs]
						return(rowShuffle[1:lastIndex])
					}
					
					if(trialRet=="taxon"){
						return(y[taxa%in%keepTaxa])
					}
					
				}else{
					return(NULL)
				}
			}else{
				return(NULL)
			}
				
			
		})
	# by-list SQS
	}else{
		tap<-tapply(X=rows, INDEX=binVar, FUN=function(y){
		#	y<-rows[binVar==tempVar[zui]]
			# the frequencies
			frequencies<-freqVar[y]
			
			# the taxa
			taxa<-names(freqVar)[y]
			
			# collections
			colls<-collVar[y]
			
			colTab<-table(taxa, colls)
			class(colTab) <- "matrix"
			
			# ensure that it will run if there is no collection info
			if(ncol(colTab)>1){
				
				# shuffle collections
				newOrd<-sample(1:ncol(colTab))
				colTab<-colTab[,newOrd, drop=FALSE]
				
				firstOccur<-apply(colTab, 1, function(b){
					min(which(as.logical(b)))
				})
				# empty
				colTab[,]<-0
				
				# frequencies should be calculated when species occur for the first time
				colTab[cbind(1:nrow(colTab),firstOccur)]<-frequencies[names(firstOccur)]
				
				# frequencies to be cumulated
				coverageOrder<-apply(colTab, 2, sum)
				
				# add up
				cumulative<-cumsum(coverageOrder)
				
				if(length(cumulative)>0){
					if(cumulative[length(cumulative)]>quoVar){
						bSelect<-cumulative<=quoVar
						# potential forking!!! 
						if(appr=="over"){
							if(sum(bSelect)!=length(bSelect)){
								bSelect[sum(bSelect)+1] <- TRUE
							}
						}
						
						# the collections to keep
						keepColls<-colnames(colTab)[bSelect]
						
						return(y[colls%in%keepColls])
						
						
					}else{
						return(NULL)
					}
				}else{
					return(NULL)
				}
			}
			
		}) # end tapply
		
	} # end bylist
	
	# what was not enough for the subsampling
	failed<-unlist(lapply(tap, is.null))
	failed<-as.numeric(names(failed[failed]))
	failed<-failed[!failed%in%intact]
	
	# the rows that should be passed
	trialRows <- rows%in%unlist(tap)
	
	# these rows should be present regardless of the subsampling
	intactRows<-binVar%in%intact
	
	# combine the two
	subsampleRows<-trialRows | intactRows

	
	# result
	res<-list(rows=subsampleRows, fail=failed)

	# return
	return(res)

}
adamkocsis/divDyn documentation built on Nov. 23, 2024, 4:41 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
adamkocsis/divDyn
Diversity Dynamics using Fossil Sampling Data

R/subsample.R
In adamkocsis/divDyn: Diversity Dynamics using Fossil Sampling Data

Defines functions subsampleSQSinexact subsampleSQSinexactOne subsampleOXW frequenciesOne frequencies checkSQSinexact checkFreq checkOXW checkCR subsample

Documented in subsample

R Package Documentation

Browse R Packages

We want your feedback!

adamkocsis/divDyn Diversity Dynamics using Fossil Sampling Data

R/subsample.R In adamkocsis/divDyn: Diversity Dynamics using Fossil Sampling Data

Defines functions subsampleSQSinexact subsampleSQSinexactOne subsampleOXW frequenciesOne frequencies checkSQSinexact checkFreq checkOXW checkCR subsample

Documented in subsample

R Package Documentation

Browse R Packages

We want your feedback!

adamkocsis/divDyn
Diversity Dynamics using Fossil Sampling Data

R/subsample.R
In adamkocsis/divDyn: Diversity Dynamics using Fossil Sampling Data