Automated command line analysis

The santaR package is designed for the detection of significantly altered time trajectories between study groups, in short time-series. Command line parallelisation and reporting functions allow the automated analysis of multiple variables.

The automated command line functions are preferable to the GUI when processing a very large number of variables, as they are more efficient and can be integrated into scripts.

Using an example dataset, this vignette covers parallel processing, automated reporting and saving results for the GUI.

Parallel processing

Within a single experiment, multiple variables can be measured and explored dynamically (e.g. NMR or MS features, genes). As santaR's analysis is a univariate approach, each variable can be fitted independently. This lack of dependency makes santaR's analysis an embarrassingly parallel workload.

The santaR_auto_fit() function is a wrapper for each of the analytical functions (i.e. get_ind_time_matrix(), santaR_fit(), santaR_CBand(), santaR_pvalue_dist() and santaR_pvalue_fit()), executing them in parallel (for details of each individual function, see its help page and the advanced command line options vignette). The parallelisation relies on the doParallel package for the instantiation of worker nodes and foreach for the distribution of tasks. Together, these packages enable parallelisation on all operating systems (Windows, Mac OS and most Linux distributions).
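The pattern described above (worker nodes instantiated through doParallel, tasks distributed by foreach, one task per variable) can be sketched on toy data. This is a minimal illustration and not santaR's internal code: the data, the object names (toy_time, toy_data, slopes) and the use of a linear slope in place of the spline fit are all assumptions made for the example.

```r
# Assumption: the doParallel and foreach packages are installed
library(doParallel)  # also attaches foreach and parallel

# Hypothetical toy data: 4 samples as rows, 2 variables as columns
toy_time <- c(0, 1, 2, 3)
toy_data <- data.frame(var_1 = c(1, 2, 3, 4), var_2 = c(2, 4, 6, 8))

# Instantiate 2 worker nodes and register them as the foreach backend
cl <- makeCluster(2)
registerDoParallel(cl)

# Each column is processed independently of the others;
# a linear slope stands in for the per-variable spline fit
slopes <- foreach(v = names(toy_data), .combine = c) %dopar% {
  unname(coef(lm(toy_data[[v]] ~ toy_time))[2])
}
names(slopes) <- names(toy_data)

# Release the worker nodes once the tasks are done
stopCluster(cl)

print(slopes)
```

Because no task needs the result of another, the same loop gives identical results when run sequentially by replacing %dopar% with %do%.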

Observation values are expected as a data frame with samples as rows and variables as columns, with parallelisation taking place over the columns. For a selected number of CPU cores (ncores parameter), santaR_auto_fit() first instantiates the worker nodes; if ncores=0, the procedure is applied sequentially, without parallelisation. The conversion of inputs by get_ind_time_matrix() is, however, not parallelised by default, as the parallelisation overhead exceeds the time gained for all but the most complex datasets. When the number of individuals, unique time points or variables is high, the forceParIndTimeMat parameter enables the parallelisation of this step. All subsequent analytical steps are automatically parallelised, with the calculation of confidence bands on the group mean curves and the identification of altered trajectories activated by default.
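To make the input layout concrete, the sketch below reshapes one variable column, together with its individual and time metadata, into a matrix with one row per individual and one column per unique time point. The exact output of get_ind_time_matrix() is not described in this vignette, so the individual-by-time layout is an assumption based on the function name; the data and object names are hypothetical and only base R is used.

```r
# Hypothetical metadata and one variable column (6 samples)
ind  <- c("ind_1", "ind_1", "ind_1", "ind_2", "ind_2", "ind_2")
time <- c(0, 4, 8, 0, 4, 8)
y    <- c(1.0, 1.5, 2.0, 0.5, 0.7, 0.9)

# Pivot to one row per individual and one column per unique time point;
# an individual/time combination with no sample would appear as NA
ind_time <- tapply(y, list(ind = ind, time = time), mean)

print(ind_time)
```

This reshaping is cheap for a handful of individuals and time points, which is consistent with the remark that parallelising the conversion only pays off for large inputs.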

santaR_auto_fit() returns a list of SANTAObj containing each variable's analysis results. In practice, santaR_auto_fit() is the function employed for command line analysis as it caters for all possible use cases.


# Load the package
library(santaR)

# Load example data
tmp_data  <- acuteInflammation$data
tmp_meta  <- acuteInflammation$meta

# Analyse data, with confidence bands and p-value
res_acuteInf_df5 <- santaR_auto_fit(inputData=tmp_data, ind=tmp_meta$ind, time=tmp_meta$time,
                                    group=tmp_meta$group, df=5, ncores=4, CBand=TRUE, pval.dist=TRUE)
# Input data generated: 0.13 secs
# Spline fitted:        1.05 secs
# ConfBands done:      18.98 secs
# p-val dist done:     35.43 secs
# total time:          55.59 secs

# Number of variables analysed
length(res_acuteInf_df5)
# [1] 22

# Names of the variables analysed
names(res_acuteInf_df5)
#  [1] "var_1"  "var_2"  "var_3"  "var_4"  "var_5"  "var_6"  "var_7"  "var_8"  "var_9"  "var_10" "var_11" "var_12" "var_13" "var_14" "var_15" "var_16" "var_17" "var_18"
# [19] "var_19" "var_20" "var_21" "var_22"

Automated reporting

After multiple variables have been analysed using santaR_auto_fit(), a reporting function helps assess significant results and summarise them in an easily interpretable fashion. santaR_auto_summary() takes a list of SANTAObj as generated by santaR_auto_fit() as input.

First, correction for multiple testing can be applied to generate Bonferroni, Benjamini-Hochberg or Benjamini-Yekutieli corrected p-values. P-values can be returned by the function, and also automatically saved to disk as .csv. For a given significance cut-off (plotCutOff parameter), the number of significantly altered variables is reported and plots are automatically saved to disk in order of increasing p-value. The appearance of the plots can be altered using multiple options, such as the representation of confidence bands (showConfBand parameter) or the generation of a mean curve across all samples (showTotalMeanCurve parameter), which can help assess differences between groups when group sizes are unbalanced.
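The three corrections named above are available in base R through p.adjust(), which makes it easy to see how they compare on a small example. The sketch below is independent of santaR and does not claim to reproduce how santaR_auto_summary() computes its output; the p-values and object names (p_raw, p_adj, n_signif) are hypothetical.

```r
# Hypothetical raw p-values for 5 variables
p_raw <- c(var_1 = 0.001, var_2 = 0.008, var_3 = 0.02, var_4 = 0.03, var_5 = 0.5)

# Bonferroni, Benjamini-Hochberg and Benjamini-Yekutieli corrections
p_adj <- data.frame(
  raw        = p_raw,
  bonferroni = p.adjust(p_raw, method = "bonferroni"),
  BH         = p.adjust(p_raw, method = "BH"),
  BY         = p.adjust(p_raw, method = "BY")
)
print(p_adj)

# Number of variables strictly below each significance cut-off, per correction
cutoffs  <- c(0.05, 0.01, 0.001)
n_signif <- sapply(cutoffs, function(k) colSums(p_adj < k))
colnames(n_signif) <- paste("p <", cutoffs)
print(n_signif)
```

Bonferroni and Benjamini-Yekutieli are the more conservative choices here, so fewer variables pass a given cut-off than with Benjamini-Hochberg.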

# Generate a summary
#   without a defined 'targetFolder', no csv or plots can be saved
pval_acuteInf_df5 <- santaR_auto_summary(SANTAObjList=res_acuteInf_df5, targetFolder=NA)
# p-value dist found
# Benjamini-Hochberg corrected p-value

# Content of the summary
names(pval_acuteInf_df5)
# [1] "pval.all"     "pval.summary"

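# Content of the two summary tables
#   'Inf 0.05' is the number of variables with a p-value below 0.05 (likewise for 0.01 and 0.001)
pval.summary            <- data.frame(matrix(c('dist', 'dist_BH', 17, 16, 8, 0, 0, 0), ncol=4))
colnames(pval.summary)  <- c('Test', 'Inf 0.05', 'Inf 0.01', 'Inf 0.001')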
pval.all <- data.frame(matrix(c(0.009990010, 0.007992008, 0.006993007, 0.209790210, 0.005994006, 0.008991009, 0.013986014, 0.009990010, 0.038961039, 0.034965035, 0.013986014, 0.214785215, 0.066933067, 0.154845155, 0.008991009, 0.015984016, 0.019980020, 0.029970030, 0.053946054, 0.023976024, 0.022977023, 0.007992008, 0.01829662, 0.01569580, 0.01436896, 0.23611241, 0.01302000, 0.01700412, 0.02334465, 0.01829662, 0.05282484, 0.04824640, 0.02334465, 0.24130467, 0.08413827, 0.17858350, 0.01700412, 0.02581244, 0.03066597, 0.04246854, 0.06973190, 0.03543451, 0.03424914, 0.01569580, 0.005433704, 0.004053809, 0.003390296, 0.185689133, 0.002748896, 0.004735847, 0.008347097, 0.005433704, 0.028625807, 0.025242819, 0.008347097, 0.190448652, 0.053042348, 0.133748457, 0.004735847, 0.009860016, 0.012967910, 0.021068901, 0.041574088, 0.016160798, 0.015355810, 0.004053809, -0.2429725352, 0.0006572238, -0.1309866546, -0.3878298395, -0.5634863016, -0.4766589789, -0.5628753031, -0.4678733066, -0.3890447845, -0.0501685235, 0.0568042664, 0.1530029385, -0.4077714803, -0.0650366487, 0.1268468873,  0.5054671665, 0.2797620452,  0.4027539783, 0.5014823976, 0.3899306066, 0.1458163093, -0.2074773622, 0.02747253, 0.02747253, 0.02747253, 0.21478521, 0.02747253, 0.02747253, 0.03076923, 0.02747253, 0.05042017, 0.04807692, 0.03076923, 0.21478521, 0.07750145, 0.17032967, 0.02747253, 0.03196803, 0.03663004, 0.04395604, 0.06593407, 0.03767661, 0.03767661, 0.02747253), ncol=5))
colnames(pval.all) <- c("dist", "dist_upper", "dist_lower", "curveCorr", "dist_BH")
rownames(pval.all) <- c("var_1", "var_2", "var_3", "var_4", "var_5", "var_6", "var_7", "var_8", "var_9", "var_10", "var_11", "var_12", "var_13", "var_14", "var_15", "var_16", "var_17", "var_18", "var_19", "var_20", "var_21", "var_22")
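Once a table such as pval.all is available, the significant variables can be selected and ranked with a few lines of base R. The sketch below copies four rows (columns dist and dist_BH only) from the table above so that it runs on its own; the cut-off of 0.05 and the object names (pval_subset, cut_off, signif_vars) are choices made for this example.

```r
# Subset of the 'pval.all' table shown above (4 of the 22 variables)
pval_subset <- data.frame(
  dist      = c(0.009990010, 0.209790210, 0.005994006, 0.066933067),
  dist_BH   = c(0.02747253, 0.21478521, 0.02747253, 0.07750145),
  row.names = c("var_1", "var_4", "var_5", "var_13")
)

# Keep variables below the cut-off on the corrected p-value
cut_off     <- 0.05
signif_vars <- pval_subset[pval_subset$dist_BH < cut_off, ]

# Most significant first; ties on the corrected p-value are broken by the raw p-value
signif_vars <- signif_vars[order(signif_vars$dist_BH, signif_vars$dist), ]

print(rownames(signif_vars))
```

The tie-break matters because the Benjamini-Hochberg correction often assigns the same corrected value to several variables, as it does for var_1 and var_5 here.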

Save results for GUI

In practice, time-dependent patterns for a given biological question (e.g. a grouping of individuals) are assessed by parallelised fitting and analysis using santaR_auto_fit() and reporting using santaR_auto_summary(). When results are available, the most significantly altered variables can be identified using the reports and visually inspected for confirmation using the plots already saved to disk.

Additionally, analysis results can be loaded into the GUI for interactive visualisation or generation of plots. To do so, the list of SANTAObj generated by santaR_auto_fit() must be saved under the variable name inSp in a .RData file:

# Rename the results
inSp        <- res_acuteInf_df5
# Save to disk
outputPath  <- file.path('path_to_my_output_folder', 'acuteInf_results.rdata') 
save(inSp, file=outputPath, compress=TRUE)
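Since the saved object must be called inSp, it can be worth checking the file before opening it in the GUI. The sketch below performs a save and reload round trip with base R; a placeholder list stands in for the real output of santaR_auto_fit(), and the temporary folder and the names check_env and loaded are hypothetical.

```r
# Placeholder standing in for the list of SANTAObj returned by santaR_auto_fit()
inSp <- list(var_1 = "placeholder", var_2 = "placeholder")

# Save to a temporary folder (replace with the real output folder)
outputPath <- file.path(tempdir(), "acuteInf_results.rdata")
save(inSp, file = outputPath, compress = TRUE)

# Reload into a separate environment to inspect what the file contains
check_env <- new.env()
loaded    <- load(outputPath, envir = check_env)

# load() returns the names of the restored objects: only 'inSp' is expected
stopifnot(identical(loaded, "inSp"))
stopifnot(identical(names(check_env$inSp), names(inSp)))
```

Loading into a separate environment avoids overwriting an inSp object that already exists in the current session.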

santaR documentation built on May 24, 2022, 1:06 a.m.