select_rate: Select rate results based on a range of criteria

View source: R/select_rate.R

select_rateR Documentation

Select rate results based on a range of criteria

Description

The functions in respR are powerful, but outputs can be large and difficult to explore, especially when there are hundreds to thousands of results, for example the output of auto_rate on large datasets, or the outputs of calc_rate.int from long intermittent-flow experiments.

The select_rate and select_rate.ft functions help explore, reorder, and filter convert_rate and convert_rate.ft results according to various criteria. For example, extracting only positive or negative rates, only the highest or lowest rates, only those from certain data regions, and numerous other methods that allow advanced filtering of results so the final selection of rates is well-defined towards the research question of interest. This also allows for highly consistent reporting of results and rate selection criteria.

Multiple selection criteria can be applied by saving the output and processing it through the function multiple times using different methods, or alternatively via piping (⁠%>%⁠ or ⁠%>%⁠). See Examples.

Note: when choosing a method, keep in mind that to remain mathematically consistent, respR outputs oxygen consumption (i.e. respiration) rates as negative values. This is particularly important in the difference between highest/lowest and minimum/maximum methods. See Details.

When a rate result is omitted by the selection criteria, it is removed from the ⁠$rate.output⁠ element of the convert_rate object, and the associated data in ⁠$summary⁠ (i.e. that row) is removed. Some methods can also be used with an n = NULL input to reorder the ⁠$rate⁠ and ⁠$summary⁠ elements in various ways.

Replicate and Rank columns

The summary table ⁠$rank⁠ column is context-specific, and what it represents depends on the type of experiment analysed or the function used to determine the rates. If numeric values were converted, it is the order in which they were entered. Similarly, if calc_rate was used, it is the order of rates as entered using from and to (if multiple rates were determined). For auto_rate it relates to the method input. For example it indicates the kernel density ranking if the linear method was used, the ascending or descending ordering by absolute rate value if lowest or highest were used, or the numerical order if minimum or maximum were used. For intermittent-flow experiments analysed via calc_rate.int and auto_rate.int these will be ranked within each replicate as indicated in the ⁠$rep⁠ column. The ⁠$rep⁠ and ⁠$rank⁠ columns can be used to keep track of selection or reordering because the original values will be retained unchanged through selection or reordering operations. The original order can always be restored by using method = "rep" or method = "rank" with n = NULL. In both these cases the ⁠$summary⁠ table and ⁠$rate.output⁠ will be reordered by ⁠$rep⁠ (if used) then ⁠$rank⁠ to restore the original ordering.

Note that if you are analysing intermittent-flow data and used auto_rate.int but changed the n input to output more than one rate result per replicate, the selection or reordering operations will not take any account of this. You should carefully consider if or why you need to output multiple rates per replicate in the first place. If you have, you can perform selection on individual replicates by using method = "rep" to select individual replicates then apply additional selection criteria.

Usage

select_rate(x, method = NULL, n = NULL)

select_rate.ft(x, method = NULL, n = NULL)

Arguments

x

list. An object of class convert_rate or convert_rate_select.

method

string. Method by which to select or reorder rate results. For most methods matching results are retained in the output. See Details.

n

numeric. Number, percentile, or range of results to retain or omit depending on method. Default is NULL, in which case some methods will instead reorder the results. See Details.

Details

These are the current methods by which rates in convert_rate objects can be selected. Matching results are retained in the output. Some methods can also be used to reorder the results. Note that the methods selecting by rate value operate on the ⁠$rate.output⁠ element, that is the final converted rate value.

positive, negative

Selects all positive (>0) or negative (<0) rates. n is ignored. Useful, for example, in respirometry on algae where both oxygen consumption and production rates are recorded. Note, respR outputs oxygen consumption (i.e. respiration) rates as negative values, production rates as positive.

nonzero, zero

Retains all nonzero rates (i.e. removes any zero rates), or retains only zero rates (i.e. removes all rates with any value). n is ignored.

lowest, highest

These methods can only be used when rates all have the same sign, that is are all negative or all positive. These select the lowest and highest absolute rate values. For example, if rates are all negative, method = 'highest' will retain the highest magnitude rates regardless of the sign. n should be an integer indicating the number of lowest/highest rates to retain. If n = NULL the results will instead be reordered by lowest or highest rate without any removed. See minimum and maximum options for extracting numerically lowest and highest rates.

lowest_percentile, highest_percentile

These methods can also only be used when rates all have the same sign. These retain the n'th lowest or highest percentile of absolute rate values. For example, if rates are all negative method = 'highest_percentile' will retain the highest magnitude n'th percentile regardless of the sign. n should be a percentile value between 0 and 1. For example, to extract the lowest 10th percentile of absolute rate values, you would enter ⁠method = 'lowest_percentile', n = 0.1⁠.

minimum, maximum

In contrast to lowest and highest, these are strictly numerical options which take full account of the sign of the rate, and can be used where rates are a mix of positive and negative. For example, method = 'minimum' will retain the minimum numerical value rates, which would actually be the highest oxygen uptake rates. n is an integer indicating how many of the min/max rates to retain. If n = NULL the results will instead be reordered by minimum or maximum rate without any removed.

minimum_percentile, maximum_percentile

Like min and max these are strictly numerical inputs which retain the n'th minimum or maximum percentile of the rates and take full account of the sign. Here n should be a percentile value between 0 and 1. For example, if rates are all negative (i.e. typical uptake rates), to extract the lowest 10th percentile of rates, you would enter ⁠method = 'maximum_percentile', n = 0.1⁠. This is because the lowest negative rates are numerically the maximum rates (highest/lowest percentile methods would be a better option in this case however).

rate

Allows you to enter a value range of output rates to be retained. Matching regressions in which the rate value falls within the n range (inclusive) are retained. n should be a vector of two values. For example, to retain only rates where the rate value is between 0.05 and 0.08: ⁠method = 'rate', n = c(0.05, 0.08)⁠. Note this operates on the ⁠$rate.output⁠ element, that is converted rate values.

rep, rank

These refer to the respective columns of the ⁠$summary⁠ table. For these, n should be a numeric vector of integers of rep or rank values to retain. To retain a range use regular R syntax, e.g. n = 1:10. If n = NULL no results will be removed, instead the results will be reordered ascending by rep (if it contains values) then rank. Essentially this restores the original ordering if other reordering operations have been performed.

The values in these columns depend on the functions used to calculate rates. If calc_rate was used, rep is NA and rank is the order of rates as entered using from and to (if multiple rates were determined). For auto_rate, rep is NA and rank relates to the method input. For example it indicates the kernel density ranking if the linear method was used, the ascending or descending ordering by absolute rate value if lowest or highest were used, or by numerical order if minimum or maximum were used. If calc_rate.int or auto_rate.int were used, rep indicates the replicate number and the rank column represents rank within the relevant replicate, and will generally be filled with the value 1. Therefore you need to adapt your selection criteria appropriately towards which of these columns is relevant.

rep_omit, rank_omit

These refer to the rep and rank columns of the ⁠$summary⁠ table and allow you to exclude rates from particular replicate or rank values. For these, n should be a numeric vector of integers of rep or rank values to OMIT. To omit a range use regular R syntax, e.g. n = 1:10.

rsq, row, time, density

These methods refer to the respective columns of the ⁠$summary⁠ data frame. For these, n should be a vector of two values. Matching regressions in which the respective parameter falls within the n range (inclusive) are retained. To retain all rates with a R-Squared 0.90 or above: ⁠method = 'rsq', n = c(0.9, 1)⁠. The row and time ranges refer to the ⁠$row⁠-⁠$endrow⁠ or ⁠$time⁠-⁠$endtime⁠ columns and the original raw data (⁠$dataframe⁠ element of the convert_rate object), and can be used to constrain results to rates from particular regions of the data (although usually a better option is to subset_data() prior to analysis). Note time is not the same as duration - see later section - and row refers to rows of the raw data, not rows of the summary table - see manual method for this. For all of these methods, if n = NULL no results will be removed, instead the results will be reordered by that respective column (descending for rsq and density, ascending for row, and time).

intercept, slope

These methods are similar to the above and refer to the intercept_b0 and slope_b1 summary table columns. Note these linear model coefficients represent different things in flowthrough vs. other analyses. In non-flowthrough analyses slopes represent rates and coefficients such as a high r-squared are important. In flowthrough, slopes represent the stability of the data region, in that the closer the slope is to zero, the less the delta oxygen values in that region vary, which is an indication of a region of stable rates. In addition, intercept values close to the calculated mean delta of the region also indicate a region of stable rates. Therefore these methods are chiefly useful in selection of flowthrough results, for example slopes close to zero. If n = NULL no results will be removed, instead the results will be reordered by ascending value by that column.

time_omit, row_omit

These methods refer to the original data, and are intended to exclude rates determined over particular data regions. This is useful in the case of, for example, a data anomaly such as a spike or sensor dropout. For these inputs, n are values (a single value, multiple values, or a range) indicating data timepoints or rows of the original data to exclude. Only rates (i.e. regressions) which do not utilise those particular values are retained in the output. For example, if an anomaly occurs precisely at timepoint 3000, time_omit = 3000 means only rates determined solely over regions before or after this will be retained. If it occurs over a range this can be entered as, time_omit = c(3000,3200). If you want to exclude a regular occurrence, for example the flushes in intermittent-flow respirometry, or any other non-continuous values they can be entered as a vector, e.g. row_omit = c(1000, 2000, 3000). Note this last option can be extremely computationally intensive when the vector or dataset is large, so should only be used when a range cannot be entered as two values, which is much faster. For both methods, input values must match exactly to values present in the dataset.

oxygen

This can be used to constrain rate results to regions of the data based on oxygen values. n should be a vector of two values in the units of oxygen in the raw data. Only rate regressions in which all datapoints occur within this range (inclusive) are retained. Any which use even a single value outside of this range are excluded. Note the summary table columns oxy and endoxy refer to the first and last oxygen values in the rate regression, which should broadly indicate which results will be removed or retained, but this method examines every oxygen value in the regression, not just first and last.

oxygen_omit

Similar to time_omit and row_omit above, this can be used to omit rate regressions which use particular oxygen values. For this n are values (single or multiple) indicating oxygen values in the original raw data to exclude. Every oxygen value used by each regression is checked, and to be excluded an n value must match exactly to one in the data. Therefore, note that if a regression is fit across the data region where that value would occur, it is not necessarily excluded unless that exact value occurs. You need to consider the precision of the data values recorded. For example, if you wanted to exclude any rate using an oxygen value of 7, but your data are recorded to two decimals, a rate fit across these data would not be excluded: c(7.03, 7.02, 7.01, 6.99, 6.98, ...). To get around this you can use regular R syntax to input vectors at the correct precision, such as seq, e.g. seq(from = 7.05, to = 6.96, by = -0.01). This can be used to input ranges of oxygen values to exclude.

duration

This method allows selection of rates which have a specific duration range. Here, n should be a numeric vector of two values. Use this to set minimum and maximum durations in the time units of the original data. For example, n = c(0,500) will retain only rates determined over a maximum of 500 time units. To retain rates over a minimum duration, set this using the minimum value plus the maximum duration or simply infinity. For example, for rates determined over a minimum of 500 time units n = c(500,Inf))

manual

This method simply allows particular rows of the ⁠$summary⁠ data frame to be manually selected to be retained. For example, to keep only the top row ⁠method = 'manual', n = 1⁠. To keep multiple rows use regular R selection syntax: n = 1:3, n = c(1,2,3), n = c(5,8,10), etc. No value of n should exceed the number of rows in the ⁠$summary⁠ data frame. Note this is not necessarily the same as selecting by the rep or rank methods, as the table could already have undergone selection or reordering.

manual_omit

As above, but this allows particular rows of the ⁠$summary⁠ data frame to be manually selected to be omitted.

overlap

This method removes rates which overlap, that is regressions which are partly or completely fit over the same rows of the original data. This is useful in particular with auto_rate results. The auto_rate linear method may identify multiple linear regions, some of which may substantially overlap, or even be completely contained within others. In such cases summary operations such as taking an average of the rate values may be questionable, as certain values will be weighted higher due to these multiple, overlapping results. This method removes overlapping rates, using n as a threshold to determine degree of permitted overlap. It is recommended this method be used after all other selection criteria have been applied, as it is quite aggressive about removing rates, and can be very computationally intensive when there are many results.

While it can be used with auto_rate results determined via the rolling, lowest, or highest methods, by their nature these methods produce all possible overlapping regressions, ordered in various ways, so other selection methods are more appropriate. The overlap method is generally intended to be used in combination with the auto_rate linear results, but may prove useful in other analyses.

Permitted overlap is determined by n, which indicates the proportion of each particular regression which must overlap with another for it to be regarded as overlapping. For example, n = 0.2 means a regression would have to overlap with at least one other by at least 20% of its total length to be regarded as overlapping.

The "overlap" method performs two operations:

First, regardless of the n value, any rate regressions which are completely contained within another are removed. This is also the only operation if n = 1.

Secondly, for each regression in ⁠$summary⁠ starting from the bottom of the summary table (usually the lowest ranked result, but this depends on the analysis used and if any reordering has been already occurred), the function checks if it overlaps with any others (accounting for n). If not, the next lowest is checked, and the function progresses up the summary table until it finds one that does. The first to be found overlapping is then removed, and the process repeats starting again from the bottom of the summary table. If no reordering to the results has occurred, this means lower ranked results are removed first. This is repeated iteratively until only non-overlapping rates (accounting for n) remain.

If n = 0, only rates which do not overlap at all, that is share no data, are retained. If n = 1, only rates which are 100% contained within at least one other are removed.

Reordering results

Several methods can be used to reorder results rather than select them, by not entering an n input (that is, letting the n = NULL default be applied). Several of these methods are named the same as those in auto_rate for consistency and have equivalent outcomes, so this allows results to be reordered to the equivalent of that method's results without re-running the auto_rate analysis.

The "row" and "rolling" methods reorder sequentially by the starting row of each regression (⁠$row⁠ column).

The "time" method reorders sequentially by the starting time of each regression (⁠$time⁠ column).

"linear" and "density" are essentially identical, reordering by the ⁠$density⁠ column. This metric is only produced by the auto_rate linear method, so will not work with any other results.

"rep" or "rank" both reorder by the ⁠$rep⁠ then ⁠$rank⁠ columns. What these represents is context dependent - see Replicate and Rank columns section above. Each summary row rep and rank value is retained unchanged regardless of how the results are subsequently selected or reordered, so this will restore the original ordering after other methods have been applied.

"rsq" reorders by ⁠$rsq⁠ from highest value to lowest.

"intercept" and "slope" reorder by the ⁠$intercept_b0⁠ and ⁠$slope_b1⁠ columns from lowest value to highest.

"highest" and "lowest" reorder by absolute values of the ⁠$rate.output⁠ column, that is highest or lowest in magnitude regardless of the sign. They can only be used when rates all have the same sign.

"maximum" and "minimum" reorder by numerical values of the ⁠$rate.output⁠ column, that is maximum or minimum in numerical value taking account of the sign, and can be used when rates are a mix of negative and positive.

Numeric input conversions

For convert_rate objects which contain rates which have been converted from numeric values, the summary table will contain a limited amount of information, so many of the selection or reordering methods will not work. In this case a warning is given and the original input is returned.

Plot

There is no plotting functionality in select_rate. However since the output is a convert_rate object it can be plotted. See the Plot section in help("convert_rate"). To plot straight after a selection operation, pipe or enter the output in plot(). See Examples.

More

This help file can be found online here, where it is much easier to read.

For additional help, documentation, vignettes, and more visit the respR website at https://januarharianto.github.io/respR/

Value

The output of select_rate is a list object which retains the convert_rate class, with an additional convert_rate_select class applied.

It contains two additional elements: ⁠$original⁠ contains the original, unaltered convert_rate object, which will be retained unaltered through multiple selection operations, that is even after processing through the function multiple times. ⁠$select_calls⁠ contains the calls for every selection operation that has been applied to the ⁠$original⁠ object, from the first to the most recent. These additional elements ensure the output contains the complete, reproducible history of the convert_rate object having been processed.

Examples


## Object to filter
 ar_obj <- inspect(intermittent.rd, plot = FALSE) %>%
   auto_rate(plot = FALSE) %>%
   convert_rate(oxy.unit = "mg/L",
                time.unit = "s",
                output.unit = "mg/h",
                volume = 2.379) %>%
   summary()

 ## Select only negative rates
 ar_subs_neg <- select_rate(ar_obj, method = "negative") %>%
   summary()

 ## Select only rates over 1000 seconds duration
 ar_subs_dur <- select_rate(ar_obj, method = "duration", n = c(1000, Inf)) %>%
   summary()

 ## Reorder rates sequentially (i.e. by starting row)
 ar_subs_dur <- select_rate(ar_obj, method = "row") %>%
   summary()

 ## Select rates with r-squared higher than 0.99,
 ## then select the lowest 10th percentile of the remaining rates,
 ## then take the mean of those
 inspect(squid.rd, plot = FALSE) %>%
   auto_rate(method = "linear",
             plot = FALSE) %>%
   convert_rate(oxy.unit = "mg/L",
                time.unit = "s",
                output.unit = "mg/h",
                volume = 2.379) %>%
   summary() %>%
   select_rate(method = "rsq", n = c(0.99, 1)) %>%
   select_rate(method = "lowest_percentile", n = 0.1) %>%
   mean()
   

januarharianto/respR documentation built on April 20, 2024, 4:34 p.m.