While this vignette details all command line functions available in `santaR` for reference and potential development work, these functions are not expected to be used on a day-to-day basis (more details are available in each function's help page).

To analyse time-series data, refer to the graphical user interface or the automated command line functions, which implement these functions.

This vignette will detail the following underlying functions:

- Preparing input data
- DF search functions
- Fitting, Confidence bands and plotting
- P-values calculation

As `santaR` is a univariate approach, this vignette will use one variable from the acute inflammation dataset detailed in How to prepare input data for santaR.

```
library(santaR)

# data (keep the 3rd variable)
var1_data <- acuteInflammation$data[,3]
# metadata (common to all variables)
var1_meta <- acuteInflammation$meta

# 7 unique time-points
unique(var1_meta$time)
# 8 individuals
unique(var1_meta$ind)
# 2 groups
unique(var1_meta$group)
# 72 measurements for the given variable
var1_data
```

The first step is to generate the input matrix by converting the vector of observations (*y* response for the variable at one time-point for one individual) into a matrix IND (*row*) x TIME (*column*) using `get_ind_time_matrix()`:

```
var1_input <- get_ind_time_matrix(Yi=var1_data, ind=var1_meta$ind, time=var1_meta$time)
var1_input

# optionally, render the matrix as a formatted table
pander::pandoc.table(var1_input)
```
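To make the reshaping step concrete, here is a minimal base R sketch of the same idea on toy data. It is not how `get_ind_time_matrix()` is implemented; the names `toy_meta`, `toy_y` and `toy_matrix` and all values are hypothetical.

```r
# Toy long-format data: 3 individuals x 4 time-points (hypothetical values)
toy_meta <- data.frame(
  ind  = rep(c("ind_1", "ind_2", "ind_3"), each = 4),
  time = rep(c(0, 4, 8, 12), times = 3)
)
toy_y <- seq_len(nrow(toy_meta)) / 10

# Pivot the observation vector into an IND (row) x TIME (column) matrix;
# (ind, time) combinations that were never measured would be left as NA
toy_matrix <- tapply(toy_y, list(toy_meta$ind, toy_meta$time), mean)
toy_matrix
```

Each row of the result is one individual's trajectory, which is the layout the fitting functions later in this vignette expect.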

In order to compare 2 groups, it is necessary to create a grouping matrix that lists group membership for all individuals using `get_grouping()`:

```
var1_group <- get_grouping(ind=var1_meta$ind, group=var1_meta$group)
var1_group

# optionally, render the grouping as a formatted table
pander::pandoc.table(var1_group)
```
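As an illustration of what a grouping table contains, the following base R sketch derives one row per individual from long-format metadata. It only mirrors the concept, not the internals of `get_grouping()`; `toy_meta` and `toy_group` are hypothetical names.

```r
# Toy metadata: 4 individuals with 3 measurements each (hypothetical)
toy_meta <- data.frame(
  ind   = rep(c("ind_1", "ind_2", "ind_3", "ind_4"), each = 3),
  group = rep(c("Group1", "Group2", "Group1", "Group2"), each = 3)
)

# One row per individual: drop repeated (ind, group) pairs
toy_group <- unique(toy_meta[, c("ind", "group")])
rownames(toy_group) <- NULL

# Sanity check: no individual should belong to more than one group
stopifnot(!any(duplicated(toy_group$ind)))
toy_group
```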

The degree of freedom (*df*) is the parameter that controls how closely each individual's time-trajectory fits each data point, balancing the fitting of the raw data against the smoothing of measurement errors. An optimal *df* value ensures that the spline is neither overfitted nor underfitted on the measurements. The *df* should be established once for a dataset, as it is a function of the 'complexity' of the time-trajectories under study and does not change between variables (same metadata, number of time-points, ...).

Refer to santaR theoretical background and Selecting an optimal number of degrees of freedom for more details on *df* and an intuitive approach for its selection.
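The trade-off controlled by *df* can be seen on a single simulated trajectory. The sketch below uses `stats::smooth.spline()` directly on hypothetical data (the trajectory, the noise level and the *df* values tried are all assumptions for illustration) and records how the residual sum of squares shrinks as *df* grows.

```r
# Hypothetical trajectory: 7 time-points, smooth trend plus noise
set.seed(1)
time <- c(0, 4, 8, 12, 24, 48, 72)
y    <- sin(time / 15) + rnorm(length(time), sd = 0.1)

# Fit the same data at increasing df and record the residual sum of squares
dfs <- c(3, 4, 5, 6)
rss <- sapply(dfs, function(d) {
  fit <- smooth.spline(x = time, y = y, df = d)
  sum((y - predict(fit, time)$y)^2)
})
data.frame(df = dfs, rss = rss)
```

A lower residual is not better in itself: at high *df* the spline starts to follow the noise, which is the overfitting the text warns against.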

In order to assist in the selection of an optimal *df* and visualise its impact, the following functions:

- extract the eigen-splines across all time-trajectories of all individuals and all variables
- provide metrics to select the *df* that gives the best fit on the eigen-splines (as a guide value for the whole dataset)

First we extract the eigen-splines across the whole dataset using `get_eigen_spline()`:

```
var_eigen <- get_eigen_spline(inputData=acuteInflammation$data,
                              ind=acuteInflammation$meta$ind,
                              time=acuteInflammation$meta$time)

# The projection of each eigen-spline at each time-point:
var_eigen$matrix
# optionally, render it as a formatted table
pander::pandoc.table(var_eigen$matrix)

# The variance explained by each eigen-spline
var_eigen$variance

# PCA summary
summary(var_eigen$model)
```
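The idea of summarising many trajectories by a few shared time profiles can be sketched with a plain PCA. This is only an illustration using `stats::prcomp()` on simulated data, under the assumption that time-points are treated as observations; it is not the implementation behind `get_eigen_spline()`, and `traj`, `eigen_proj` and `var_explained` are hypothetical names.

```r
set.seed(2)
time <- c(0, 4, 8, 12, 24, 48, 72)

# 20 hypothetical trajectories (rows) observed at 7 time-points (columns),
# built from two underlying shapes plus noise
traj <- t(replicate(20, rnorm(1) * sin(time / 15) + rnorm(1) * time / 72 +
                          rnorm(length(time), sd = 0.05)))
colnames(traj) <- time

# PCA with time-points as observations: the scores give the projection of
# each component at each time-point
pca <- prcomp(t(traj), center = TRUE, scale. = FALSE)
eigen_proj    <- t(pca$x)                      # components x time-points
var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # share of variance per component
round(var_explained, 3)
```

Because the simulated trajectories are built from two shapes, most of the variance should concentrate in the first components, which is what makes a handful of eigen-splines a useful guide for the whole dataset.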

It is then possible to estimate the *df* that minimises a given metric (penalised residuals under cross-validation, penalised residuals under generalised cross-validation, AIC, BIC or AICc) using `get_eigen_DF()`. The best *df* can either be averaged over all eigen-splines (`df`) or weighted by the variance explained by each eigen-spline (`wdf`):

```
# The optimal df for each metric:
get_eigen_DF(var_eigen)

# the same result rendered as formatted tables
tmpDF <- get_eigen_DF(var_eigen)
# $df
pander::pandoc.table(tmpDF$df)
# $wdf
pander::pandoc.table(tmpDF$wdf)
```
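The two steps described above, picking the *df* that minimises a metric and then combining the per-eigen-spline optima, can be sketched in base R. The example uses the generalised cross-validation score returned by `stats::smooth.spline()` as the metric and made-up optima and variance weights for the averaging step; every name and value here is hypothetical and the code does not reproduce `get_eigen_DF()`.

```r
set.seed(3)
time <- c(0, 4, 8, 12, 24, 48, 72)
y    <- sin(time / 15) + rnorm(length(time), sd = 0.1)

# Step 1: scan a grid of df and keep the one minimising the GCV score
df_grid <- seq(3, 6, by = 0.5)
gcv     <- sapply(df_grid, function(d) smooth.spline(time, y, df = d)$cv.crit)
best_df <- df_grid[which.min(gcv)]

# Step 2: combine per-eigen-spline optima (hypothetical values)
best_per_spline <- c(4, 6, 3)        # optimal df found for 3 eigen-splines
variance_share  <- c(0.7, 0.2, 0.1)  # variance explained by each of them
df_mean     <- mean(best_per_spline)                            # plain average
df_weighted <- weighted.mean(best_per_spline, variance_share)   # variance-weighted
c(df = df_mean, wdf = df_weighted)
```

The weighted value leans towards the optimum of the dominant eigen-spline, which is why the two summaries can differ.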

The evolution of these metrics (*y*) depending on *df* (*x*) can be plotted for each eigen-spline using `get_param_evolution()` and `plot_param_evolution()`:

```
library(gridExtra)

# generate all the parameter values across df
var_eigen_paramEvo <- get_param_evolution(var_eigen, step=0.1)

# plot the metric evolution
plot(arrangeGrob(grobs=plot_param_evolution(var_eigen_paramEvo, scaled=FALSE)))

# Scale the metrics for each eigen-spline between 0 and 1
plot(arrangeGrob(grobs=plot_param_evolution(var_eigen_paramEvo, scaled=TRUE)))
```

As we can see, the recommended *df* can vary widely depending on the metric selected. `get_eigen_DFoverlay_list()` will plot all eigen-projections (green points), a manually selected *df* (blue line) and automatically fitted *df* (red line), while grey lines represent splines at 0.2 *df* intervals (default value):

```
library(gridExtra)

# plot all eigen-projections
plot(arrangeGrob(grobs=get_eigen_DFoverlay_list(var_eigen, manualDf = 5)))
```

It should be noted that *df=2* corresponds to a linear model. *df=number(time-points)* corresponds to a curve that will go through all points (overfitted).
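The linear limit can be checked numerically with `stats::smooth.spline()`. In this sketch a very large smoothing penalty is imposed through the `lambda` argument (an illustrative choice, as are the simulated data), and the resulting curve is compared with an ordinary linear regression.

```r
set.seed(7)
time <- c(0, 4, 8, 12, 24, 48, 72)
y    <- 0.02 * time + rnorm(length(time), sd = 0.2)

# A very large smoothing penalty forces the spline towards a straight line
heavy  <- smooth.spline(time, y, lambda = 1e6)
linear <- lm(y ~ time)

# Effective df of the heavily penalised fit
heavy$df
# Largest gap between the spline and the regression line at the time-points
max(abs(predict(heavy, time)$y - fitted(linear)))
```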

A final factor to take into account is the number of points needed for each individual depending on the *df* selected:

- the minimum number of time-points needed is the *df*
- if for example *df=10*, all individuals that have fewer than 10 time-points have to be rejected
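The rejection rule in the list above amounts to counting the non-missing time-points of each individual. A minimal base R sketch on a hypothetical IND x TIME matrix (`toy`, `df_cutoff` and `kept` are illustrative names):

```r
# Hypothetical IND x TIME matrix with missing measurements
toy <- matrix(c(1,  2,  3,  4, 5,
                1, NA,  3, NA, 5,
                1,  2, NA,  4, 5),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("ind_1", "ind_2", "ind_3"),
                              c("0", "4", "8", "12", "24")))

df_cutoff <- 4

# Number of measured time-points per individual
n_tp <- rowSums(!is.na(toy))

# Keep only individuals with at least df_cutoff time-points
kept <- toy[n_tp >= df_cutoff, , drop = FALSE]
rownames(kept)
```

Here `ind_2` has only 3 measured time-points and is dropped, while the other two individuals are kept.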

Using `plot_nbTP_histogram()` we can visualise how many samples would have to be rejected for a given *df*. Due to the lack of missing values in the `acuteInflammation` dataset, the plot is not very informative.

```
# dfCutOff controls which cut-off is to be applied
plot_nbTP_histogram(var_eigen, dfCutOff=5)
```

As it does not seem possible to select the degree of freedom automatically, the most sensible approach is to choose it by visualising the splines, while being careful of overfitting and keeping in mind the 'expected' evolution of the underlying process.

Fitting of each individual and group mean curve is achieved with `santaR_fit()`, which generates a `SANTAObj` that is then used for processing:

```
var1 <- santaR_fit(var1_input, df=5, grouping=var1_group)

# it is possible to access the SANTAObj structure, which will be filled in the following steps
var1$properties
var1$general
var1$groups$Group1
```
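To give an intuition for the fitting step, the sketch below fits one smoothing spline per individual with `stats::smooth.spline()` and summarises them into a group curve. The simulated data and the use of a pointwise mean for the group curve are assumptions of this sketch, not a description of what `santaR_fit()` does internally.

```r
set.seed(4)
time <- c(0, 4, 8, 12, 24, 48, 72)

# Hypothetical IND x TIME matrix for one group of 4 individuals
grp <- t(replicate(4, sin(time / 15) + rnorm(length(time), sd = 0.1)))

# Common grid on which all curves are evaluated
grid <- seq(min(time), max(time), length.out = 50)

# One smoothing spline per individual (rows of grp), evaluated on the grid
ind_curves <- apply(grp, 1, function(y) {
  predict(smooth.spline(time, y, df = 5), grid)$y
})                                   # grid points x individuals

# Group mean curve, taken here as the pointwise mean of individual curves
mean_curve <- rowMeans(ind_curves)
```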

Confidence bands on the group mean curves can be calculated by bootstrapping using `santaR_CBand()`:

```
var1 <- santaR_CBand(var1)
```
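The principle of a bootstrapped confidence band can be sketched in base R: resample individuals with replacement, recompute the group mean curve each time, and take pointwise percentiles. The simulated data, the number of resamples, the 95% level and the helper `mean_curve()` are all assumptions made for illustration and do not reflect the defaults of `santaR_CBand()`.

```r
set.seed(5)
time <- c(0, 4, 8, 12, 24, 48, 72)
grid <- seq(0, 72, length.out = 30)

# Hypothetical IND x TIME matrix for one group of 6 individuals
grp <- t(replicate(6, sin(time / 15) + rnorm(length(time), sd = 0.1)))

# Hypothetical helper: pointwise mean of the individual spline fits
mean_curve <- function(m) {
  rowMeans(apply(m, 1, function(y) {
    predict(smooth.spline(time, y, df = 5), grid)$y
  }))
}

# Resample individuals with replacement and recompute the mean curve each time
n_boot <- 200
boot <- replicate(n_boot, {
  resampled <- grp[sample(nrow(grp), replace = TRUE), , drop = FALSE]
  mean_curve(resampled)
})                                   # grid points x resamples

# Pointwise 95% percentile band: row 1 is the lower limit, row 2 the upper
band <- apply(boot, 1, quantile, probs = c(0.025, 0.975))
```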

Plotting is achieved using `santaR_plot()`; for more details see Plotting options:

```
santaR_plot(var1)
```

The *p*-values are calculated by comparing the distance between group mean curves under random sampling of individuals. Due to the stochastic nature of the test, the *p*-value obtained can vary slightly depending on the random draw. This variability can be accounted for by using the lower and upper confidence range on the *p*-value, which is estimated at the same time.

`santaR_pvalue_dist()` will calculate the significance of the difference between two groups:

```
var1 <- santaR_pvalue_dist(var1)

# p-value
var1$general$pval.dist
# lower p-value confidence range
var1$general$pval.dist.l
# upper p-value confidence range
var1$general$pval.dist.u
# curve correlation coefficient
var1$general$pval.curveCorr
```
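The logic of a resampling-based *p*-value with a confidence range can be illustrated with a simple permutation test in base R. Everything in this sketch is an assumption made for illustration (the simulated groups, the distance measure, the number of permutations and the normal approximation used for the confidence range); it is not the procedure implemented in `santaR_pvalue_dist()`.

```r
set.seed(6)
time <- c(0, 4, 8, 12, 24, 48, 72)
grid <- seq(0, 72, length.out = 30)

# Two hypothetical groups of 4 individuals, the second shifted upwards
make_group <- function(n, shift) {
  t(replicate(n, sin(time / 15) + shift + rnorm(length(time), sd = 0.1)))
}
dat    <- rbind(make_group(4, 0), make_group(4, 1))
labels <- rep(c("Group1", "Group2"), each = 4)

# Hypothetical helper: pointwise mean of the individual spline fits
mean_curve <- function(m) {
  rowMeans(apply(m, 1, function(y) {
    predict(smooth.spline(time, y, df = 5), grid)$y
  }))
}

# Distance between the two group mean curves (mean absolute gap on the grid)
curve_dist <- function(lab) {
  mean(abs(mean_curve(dat[lab == "Group1", , drop = FALSE]) -
           mean_curve(dat[lab == "Group2", , drop = FALSE])))
}

observed <- curve_dist(labels)

# Null distribution: shuffle group labels between individuals
n_perm <- 500
null   <- replicate(n_perm, curve_dist(sample(labels)))

# Add-one estimate so the p-value is never exactly zero
p_val <- (sum(null >= observed) + 1) / (n_perm + 1)

# Approximate confidence range on the estimated p-value (normal approximation)
p_ci <- p_val + c(-1, 1) * qnorm(0.975) * sqrt(p_val * (1 - p_val) / n_perm)
c(p = p_val, lower = p_ci[1], upper = p_ci[2])
```

Re-running the block with a different seed changes the estimate slightly, which is the variability the confidence range is meant to convey.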

- Getting Started with santaR
- How to prepare input data for santaR
- santaR theoretical background
- Graphical user interface use
- Automated command line analysis
- Plotting options
- Selecting an optimal number of degrees of freedom
