prepareSequences: Prepare sequences for a comparison analysis.
In distantia: Assessing Dissimilarity Between Multivariate Time Series


This function prepares two or more multivariate time series for comparison. It handles two scenarios:

Two dataframes: The user provides two separate dataframes, each containing a multivariate time series. These time series can be regular or irregular, aligned or unaligned, but they must share at least a few columns with the same names and units (watch for differences in case between column names that represent the same entity). This mode uses only the following arguments: sequence.A, sequence.A.name (optional), sequence.B, sequence.B.name (optional), and merge.mode.
One long dataframe: The user provides a single dataframe, through the sequences argument, with two or more multivariate time-series identified by a grouping.column.
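
The two input layouts can be illustrated with toy data in base R. This is only a sketch: the column names (`time`, `oak`, `pine`, `fir`, `site`) and the values are invented for illustration, and the `prepareSequences` calls are shown as comments using the arguments documented on this page.

```r
# Toy data; all column names and values are made up for this sketch.
seq.a <- data.frame(time = 1:3, oak = c(10, 12, 9), pine = c(3, 0, 5))
seq.b <- data.frame(time = c(1, 2, 4, 6), oak = c(7, 8, 6, 9), fir = c(1, 2, 2, 0))

# Scenario 1: two separate dataframes that share some column names.
shared.columns <- intersect(colnames(seq.a), colnames(seq.b))
print(shared.columns)
# These would be passed as:
# prepareSequences(sequence.A = seq.a, sequence.B = seq.b, time.column = "time")

# Scenario 2: one long dataframe in which a grouping column identifies each series.
long <- rbind(
  data.frame(site = "A", seq.a[, shared.columns]),
  data.frame(site = "B", seq.b[, shared.columns])
)
print(table(long$site))
# This would be passed as:
# prepareSequences(sequences = long, grouping.column = "site", time.column = "time")
```

Note that the two toy series have different lengths and sampling times, which the description above allows (irregular and unaligned series).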

prepareSequences(
  sequence.A = NULL,
  sequence.A.name = "A",
  sequence.B = NULL,
  sequence.B.name = "B",
  merge.mode = "complete",
  sequences = NULL,
  grouping.column = NULL,
  time.column = NULL,
  exclude.columns = NULL,
  if.empty.cases = "zero",
  transformation = "none",
  paired.samples = FALSE,
  same.time = FALSE
  )

`sequence.A`	dataframe containing a multivariate time-series.
`sequence.A.name`	character string with the name of `sequence.A`. Will be used as the identifier in the `id` column of the output dataframe.
`sequence.B`	dataframe containing a multivariate time-series. Must share columns with `sequence.A`, with the same column names and units.
`sequence.B.name`	character string with the name of `sequence.B`. Will be used as the identifier in the `id` column of the output dataframe.
`merge.mode`	character string, one of: "overlap", "complete" (default option). If "overlap", `sequence.A` and `sequence.B` are merged by their common columns, and non-common columns are dropped. If "complete", columns absent in one dataset but present in the other are added, with values equal to 0. This argument is ignored if `sequences` is provided instead of `sequence.A` and `sequence.B`.
`sequences`	dataframe with multiple sequences identified by a grouping column.
`grouping.column`	character string, name of the column in `sequences` used to identify separate sequences within the file. If two sequences are provided through the arguments `sequence.A` and `sequence.B`, this argument defines the name of the grouping column in the output dataframe. If two or more sequences are provided as a single dataframe through the argument `sequences`, then `grouping.column` must be a column in this dataset.
`time.column`	character string, name of the column with time/depth/rank data. If `sequence.A` and `sequence.B` are provided, `time.column` must have the same name and units in both dataframes.
`exclude.columns`	character string or character vector with column names in `sequences`, or in `sequence.A` and `sequence.B`, to be excluded from the transformation.
`if.empty.cases`	character string with two possible values: "omit", or "zero". If "zero" (default), `NA` values are replaced by zeroes. If "omit", rows with `NA` data are removed.
`transformation`	character string. Defines what data transformation is to be applied to the sequences. One of: "none" (default), "percentage", "proportion", "hellinger", and "scale" (the latter centers and scales the data using the `scale` function).
`paired.samples`	boolean. If `TRUE`, the function will test if the datasets have paired samples. This means that each dataset must have the same number of rows/samples, and that, if available, the `time.column` must have the same values in every dataset. This is only mandatory when using the functions `distancePairedSamples` or `workflowPsi` with `paired.samples = TRUE` after preparing the sequences. The default setting is `FALSE`.
`same.time`	boolean. If `TRUE`, samples in the sequences to compare will be tested to check if they have the same time/age/depth according to `time.column`. This argument is only useful when the user needs to compare two sequences taken at different sites but over the same time frames.
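
The effect of `merge.mode` and of the "hellinger" transformation can be sketched in base R. The helpers `merge_sketch` and `hellinger_sketch` below are hypothetical names written for this illustration; they are not the package's implementation. The transformation uses the standard Hellinger definition (square root of each row's relative abundances), which is assumed here to match what `transformation = "hellinger"` does.

```r
# Hypothetical helper (not package code): stack two dataframes by columns.
merge_sketch <- function(a, b, merge.mode = "complete") {
  if (merge.mode == "overlap") {
    # Keep only the columns present in both dataframes.
    keep <- intersect(colnames(a), colnames(b))
    return(rbind(a[, keep, drop = FALSE], b[, keep, drop = FALSE]))
  }
  # "complete": add the columns missing from each dataframe, filled with 0.
  all.columns <- union(colnames(a), colnames(b))
  for (column in setdiff(all.columns, colnames(a))) a[[column]] <- 0
  for (column in setdiff(all.columns, colnames(b))) b[[column]] <- 0
  rbind(a[, all.columns, drop = FALSE], b[, all.columns, drop = FALSE])
}

# Hypothetical helper: standard Hellinger transformation, applied row by row.
# Rows that sum to zero would produce NaN.
hellinger_sketch <- function(x) {
  m <- as.matrix(x)
  sqrt(m / rowSums(m))
}

# Toy data; column names and values are made up.
a <- data.frame(oak = c(4, 0), pine = c(1, 3))
b <- data.frame(oak = c(2, 2), fir = c(6, 1))

overlap  <- merge_sketch(a, b, merge.mode = "overlap")   # only "oak" survives
complete <- merge_sketch(a, b, merge.mode = "complete")  # oak, pine, fir
h <- hellinger_sketch(complete)

print(overlap)
print(complete)
print(round(h, 3))
```

After the Hellinger transformation the squared values in each row sum to 1, which gives a quick sanity check on transformed data.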

A dataframe with the multivariate time series. If sequence.A and sequence.B are provided, the column identifying the sequences is named "id". If sequences is provided, the time-series are identified by grouping.column.

Blas Benito <blasbenito@gmail.com>

# load the package that provides prepareSequences and the example data
library(distantia)

# two sequences as inputs
data(sequenceA)
data(sequenceB)

AB.sequences <- prepareSequences(
 sequence.A = sequenceA,
 sequence.A.name = "A",
 sequence.B = sequenceB,
 sequence.B.name = "B",
 merge.mode = "complete",
 if.empty.cases = "zero",
 transformation = "hellinger"
 )


# several sequences in a single dataframe
data(sequencesMIS)
MIS.sequences <- prepareSequences(
 sequences = sequencesMIS,
 grouping.column = "MIS",
 if.empty.cases = "zero",
 transformation = "hellinger"
 )