choose_optimal_reference: Select the optimal reference for full alignments of peak...

View source: R/choose_optimal_reference.R

choose_optimal_reference    R Documentation

Select the optimal reference for full alignments of peak lists


Full alignments of peak lists require a fixed reference to which all other samples are aligned. This function provides a simple algorithm to find the most suitable sample in a dataset. The reference defined in this way can be used for full alignments with linear_transformation. The function is invoked internally by align_chromatograms if no reference was specified by the user.


choose_optimal_reference(data = NULL, rt_col_name = NULL, sep = "\t")



data: Dataset containing peaks that need to be aligned and matched. For every peak, an arbitrary number of numerical variables can be included (e.g. peak height, peak area) in addition to the mandatory retention time. The standard format is a tab-delimited text file with the following layout: (1) the first row contains sample names, (2) the second row contains the column names of the corresponding peak lists. Starting with the third row, peak lists are included for every sample that needs to be incorporated in the dataset. Here, a peak list contains data for individual peaks in rows, whereas columns specify variables in the order given in the second row of the text file. Peak lists of individual samples are concatenated horizontally and need to be of the same width (i.e. the same number of columns in consistent order).

Alternatively, the input may be a list of data frames. Each data frame contains the peak data for a single individual. Variables (i.e. columns) are named consistently across data frames. The names of elements in the list are used as sample identifiers. Cells may be filled with numeric or integer values, but no factors or characters are allowed. NA and 0 may be used to indicate empty rows.


rt_col_name: A character string giving the name of the column containing the retention times. The decimal separator needs to be a point.


sep: The field separator character. The default is a tab (sep = "\t"). See the "sep" argument in read.table for details.
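The list-of-data-frames input described for the data argument can be illustrated with a small toy example. This is only a sketch: the sample names (ind_A, ind_B), the column names (time, area) and all values are made up for illustration, and the two checks at the end are not part of GCalignR; they merely restate the documented requirements (consistent column names, numeric cells only).

```r
# Toy peak lists, one data frame per individual (all values are invented).
# A row of zeros marks an empty row, as allowed by the description above.
samples <- list(
  ind_A = data.frame(time = c(4.21, 5.87, 9.13), area = c(1200, 340, 5600)),
  ind_B = data.frame(time = c(4.25, 5.90, 0),    area = c(980, 410, 0))
)

# Variables must be named consistently across data frames.
same_columns <- all(vapply(
  samples,
  function(df) identical(names(df), names(samples[[1]])),
  logical(1)
))

# Cells must be numeric or integer; factors and characters are not allowed.
all_numeric <- all(vapply(
  samples,
  function(df) all(vapply(df, is.numeric, logical(1))),
  logical(1)
))

stopifnot(same_columns, all_numeric)
```

With such a list, the names of the list elements (here ind_A and ind_B) would serve as sample identifiers, and "time" would be passed as rt_col_name.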


Every sample is considered as a potential reference and is compared with all other samples. For a reference-sample pair, the deviation in retention time between each reference peak and its nearest peak in the sample is summed and divided by the number of reference peaks. The resulting value is a similarity score that approaches zero as reference and sample become more similar. For every potential reference, the median score across all pairwise comparisons is used as a similarity proxy. The optimal reference is the sample with the minimum median score. This function is used internally in align_chromatograms to select a reference if none was specified by the user.
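The scoring procedure described above can be sketched in a few lines of R. This is a minimal illustration written from the description only, not the package's actual implementation: the helper names pair_score and sketch_reference, the returned element names, and the toy data are hypothetical, and the sketch assumes that "deviation" means the absolute difference in retention time and that NA and 0 entries are skipped as empty rows.

```r
# Hypothetical helper: mean gap between each reference peak and the
# nearest peak in the sample (assumption: absolute difference).
pair_score <- function(ref_rt, sample_rt) {
  # Drop empty rows, which are coded as NA or 0
  ref_rt    <- ref_rt[!is.na(ref_rt) & ref_rt > 0]
  sample_rt <- sample_rt[!is.na(sample_rt) & sample_rt > 0]
  gaps <- vapply(ref_rt, function(r) min(abs(sample_rt - r)), numeric(1))
  sum(gaps) / length(ref_rt)
}

# Hypothetical helper: treat every sample as a candidate reference, take
# the median of its pairwise scores, and pick the minimum.
sketch_reference <- function(peak_list, rt_col_name) {
  rts <- lapply(peak_list, function(df) df[[rt_col_name]])
  med <- vapply(seq_along(rts), function(i) {
    pairwise <- vapply(rts[-i], function(s) pair_score(rts[[i]], s), numeric(1))
    stats::median(pairwise)
  }, numeric(1))
  names(med) <- names(rts)
  list(reference = names(med)[which.min(med)], median_score = min(med))
}

# Toy data: B lies between A and C, so it is closest to both.
toy <- list(
  A = data.frame(time = c(1.0, 2.0, 3.0)),
  B = data.frame(time = c(1.1, 2.1, 3.1)),
  C = data.frame(time = c(1.2, 2.2, 3.2))
)
sketch_reference(toy, rt_col_name = "time")
```

In this toy case sample B is selected, because its retention times sit between those of A and C and its median pairwise score is therefore the smallest.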


A list with the following elements:


Name of the sample with the highest average similarity to all other samples


Median number of shared peaks with other samples


Martin Stoffel & Meinolf Ottensmann


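## 1.) input is a list
## using a list of samples
## (assumes peak_data is the example dataset shipped with GCalignR)
library(GCalignR)
data("peak_data")
## subset for faster processing
peak_data <- peak_data[1:3]
choose_optimal_reference(peak_data, rt_col_name = "time")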

GCalignR documentation built on Feb. 16, 2023, 5:23 p.m.