# Introduction to boundary line analysis" In BLA: Boundary Line Analysis

y<-SoilP$yield vals <- data.frame(x,y) The variables x and y and now be extracted from the results of the bag plot with outliers removed x<-vals[,1] y<-vals[,2] ### Test for presence of boundary in dataset The function exp_boundary() can be used to evaluate evidence that observations are clustered near an upper boundary in a data set, testing this against an unbounded bivariate normal distribution as a null hypothesis. The standard deviation (sd) of the the Euclidean distance of the boundary points to the centre of the data set is used to measure the density of the points at the upper edges of the data. The smaller the sd value, the denser the distribution. This function uses the convex hull to select the points at the upper boundary. The default is selecting the first 10 consecutive convex hulls (shells=10). The convex hulls are then splits into two sections, the right and left sections, and evidence of boundary existence in both sections is checked by determining the probability of having the observed density of points at the upper edges of the data under the bivariate normal null hypothesis. More detail is provided by Miti et al. 2024b. We reject the null hypothesis if p\< 0.05. bound_test<-expl_boundary(x,y,shells = 10, simulations = 100, pch=16, col="grey") # bound_test The p-values in the left and right sections are both less than 0.05. These results indicate evidence of an upper boundary in both the left and right sections of the scatter. This suggests that there is a justification to fitting the boundary to the data. A graphical representation of the scatter plot with the boundary points is also given as well as the density histograms showing the observed standard deviation given 10000 simulated standard deviations from normal unbounded data . ## Fitting boundary line using different methods The exploratory tests indicated that the data provides evidence of an upper boundary, there are no outliers and the variables, x and y, appear normally distributed. We therefore, proceed to fit a boundary line model to the data set. There are several methods that can be used to fit a boundary line to the data set which can be classified as heuristic (Bolides, Binning & quantile regression) and statistical methods (censored bivariate normal). Miti et al. (2024a) give more detail on each of these methods. ### Bolides algorithm This method fit the boundary line following the boundary line determination technique proposed by Schnug et al. (1995). To fit the boundary line using the BOLIDES algorithm , the bolides() function can be used. To check the required arguments for the function, the help page can be launched. ?bolides The arguments x and y are the independent and dependent variable respectively and start is a vector of starting values . The model argument is used to specify the model of the boundary line e.g. "blm" for the linear model. The xmax is an argument that describes the maximum value of the independent variable beyond which the relation of x and y is no longer theoretically feasible. Other arguments relate to the plot parameters as in the plot() function. All boundary fitting methods require initial starting values for the parameters of a proposed model. The initial starting values are optimized to find the parameters of the proposed model as in the optim() function in base R. To get the start starting values, the bolides() function is run with the argument model="explore". This allows us to view the selected boundary points using the boundary line determination technique. bolides(x,y,model = "explore", pch=16, col="grey") From the plot, it can be seen that a "trapezium" model will be more appropriate for this data set. The function startValues() can be used to determine initial start values. For more information on startValues() function see ?startValues(). ?startValues() With a scatter plot of y against x active in the plot window in R, run the function startValues("trapezium"), then use the mouse to click on the plot at five boundary points that follow the trapezium model in order of increasing x values. startValues("trapezium") # then select the five points at the edge of the dataset that make up the trapezium model in order of increasing x values. The proposed start values will be produced. Note that this can be done for other models as well. Once all the arguments are set, the function can be run start<-c(4,3,14,104,-22) # start values is a vector of five consists of intercept, slope, plateau yield, intercept2 and slope2. model1<-bolides(x,y, start = start,model = "trapezium", xlab=expression("Phosphorus/ln(mg L"^-1*")"), ylab=expression("Yield/ t ha"^-1), pch=16, col="grey", bp_col="grey") model1 The results show that the optimized parameters and plot of the fitted model. There is no uncertainty in the parameters because this is a heuristic method. These parameters can then be used to determine boundary line response for any given value of x. Say you want to predict the maximum possible yield response at soil P values of 4.5, 7.4, 12.2, 20.1 and 54.5 mg/kg. Remember that our model was fitted on values of log soil P and therefore, these values must first be log transformed before the prediction is made. We can use the function predictBL() for this purpose. For more information on this function, see ?predictBL() . P<-c(4.5, 7.4, 12.2, 20.1, 54.5) P_log<-log(P) Max_Response<-predictBL(model1, P_log) # the argument inputs are the boundary line model and the independent values (in this case P_log) Max_Response ### Binning method The binning methodology involves splitting the data into several sections in the x-axis and selecting a boundary point in each section based on a set criteria (mostly the 95$^{\rm th}$and 99$^{\rm th}$percentile) (Shatar and McBratney, 2004). To fit the boundary line using the binning method, the blbin() function can be used. To check the required arguments for the function, the help page can be launched. ?blbin The arguments x and y are the independent and dependent variable respectively and start is a vector of starting values . The model argument is used to specify the model of the boundary line e.g. model="blm" for the linear model. The bins argument describes the size of the bins with a vector of length 3 containing the minimum and maximum independent variable values, and the size of bins to be used for the data respectively. We assume that the 99$^{\rm th}$percentile (tau=0.99) is the boundary. The initial start start values can be determined as previously shown in the previous section bins<-c(1.61,4.74,0.313) blbin(x,y, bins,model = "explore", tau=0.99, pch=16, col="grey") From the plot, it can be seen that a "trapezium" model will be more appropriate for this data set. startValues("trapezium") The values for start can now be obtained and the function can now be run. start<-c(4.75, 3.23, 13.3, 24.87,-2.95 ) model2<-blbin(x,y, bins,start = start,model = "trapezium", tau=0.99, ylab=expression("t ha"^-1), xlab=expression("Phosphorus/ln(mg L"^-1*")"), pch=16, col="grey", bp_col="grey") model2 The results show that the optimized parameters and plot. There is no uncertainty in the parameters because this is a heuristic method. These parameters can then be used to determine boundary line response for any given value of x using the predictBL() function. ### Quantile regression method This method fits the boundary line using the principle of quantile regression (Cade and Noon, 2003). To fit the boundary line using the quantile regression method, the blqr() function can be used. To check the required arguments for the function, the help page can be launched. ?blqr The arguments x and y are the independent and dependent variable respectively and start is a vector of starting values . The model argument is used to specify the model of the boundary line e.g. "blm" for the linear model. The argument tau describes the quantile value described as boundary. We assume that the 99$^{\rm th}\$ quantile (tau=0.99) value is the boundary. This is an arbitrary assumption, and for this reason we treat the method as heuristic.

The initial start start values can be determined as previously shown in the previous section. however, the blqr() function does not have the explore option and hence the startValues() function is used just on the plot of x and y directly according to the suggested model from the structure at the upper edge of the data. The trapezium model will be used for this data.

plot(x,y)

startValues("trapezium")

The startvalues can now be used in the blqr() function.

start<-c(4,3,13.5,31,-4.5)

model3<-blqr(x,y, start = start, model = "trapezium",
xlab=expression("Phosphorus/ mg L"^-1),
ylab=expression("Phosphorus/ln(mg L"^-1*")"),
pch=16,tau=0.99, col="grey") # may take a few seconds to ran

model3

The results show that the optimized parameters and plot. The quantile regression method will produce measures of uncertainty for parameters, but BLA does not report these because they are conditional on the arbitrary choice of tau. These parameters can then be used to determine boundary line response for any given value of x using the predictBL() function.

### Censored bivariate normal model

To fit the boundary line using the Censored bivariate normal model method, see the vignette("Censored_bivariate_normal_model").

### Using your own defined model

The illustrated methods for fitting the boundary line have some in-built models. These include the linear, linear plateau, mitscherlich, schmidt, logistic, logistic model proposed by Nelder (1961), the inverse logistic, double logistic, quadratic and the trapezium models. However, there are some cases where one wants to fit another model which is not part of the built in models. The following steps will illustrate how to fit a custom model. Though this will be illustrated using the bolides() function, it also applies for the blbin(), blqr() and cbvn() functions.

Assuming that the initial data exploratory step have been done, the first step is to check the structure of the boundary points using the argument model="explore" in the bolides() function.

bolides(x,y,model="explore", pch=16, col="grey")

Lets say you want to fit a model

$$y=\beta_0 - \beta_1(x-\beta_2)^2$$

The model is written in form of an R function and the parameters should always be written in alphabetical order as a*, b and c for a three parameter function, a, b,c and d*** for four parameter function and so on.

custom_function<-function(x,a,b,c){
y<- a - b*(x-c)^2
}

The next step is to suggest the initial start start values. These should be sensible values else the function will not converge. These should be arranges in alphabetical order as start=c(a,b,c). Replace a, b and c with numeric values of your choice.

The arguments of bolides() function can now be added. In this case, the argument model while be set to "other". The arguments equation is now set to your custom function (equation=custom_function)

start<-c(13.5,3,3.3)

model4<-bolides(x,y, start = start,model = "other",
equation=custom_function,
xlab=expression("Phosphorus/mg L"^-1),
ylab=expression("Phosphorus/ln(mg L"^-1*")"),
pch=16, ylim=c(3.8,14.5), col="grey",bp_col="grey")

model4

The parameters of the models are shown in the results. A prediction of the boundary response values for each value of x can the be done as previously shown using the predictBL() function.

## Closing remarks

The boundary line fitting methods illustrated here all use the optimization function to determine the parameters of the proposed models. To remove the risk of local optimum parameters, it is advised that the models are ran on several starting values and the results with the smallest error can be selected. Each method produces the error value in the output. It is residue mean square (RMS) for blbin() and bolides() while blqr() it is the residue sum squares (RSS). For the cbvn(), use the likelihood value.

options(scipen = original_scipen)# Restore the original scipen value

# References

1. Cade, B. S., & Noon, B. R. (2003). A gentle introduction to quantile regression for ecologists. Frontiers in Ecology and the Environment, 1(8), 412-420. https://doi.org/10.1890/1540-9295(2003)001[0412:AGITQR]2.0.CO;2{.uri}

2. Lark, R. M., Gillingham, V., Langton, D., & Marchant, B. P. (2020). Boundary line models for soil nutrient concentrations and wheat yield in national-scale datasets. European Journal of Soil Science, 71 , 334-351. https://doi.org/10.1111/ejss.12891

3. Milne, A. E., Wheeler, H. C., & Lark, R. M. (2006). On testing biological data for the presence of a boundary. Annals of Applied Biology, 149 , 213-222. https://doi.org/10.1111/j.1744-7348.2006.00085.x

4. Miti, C., Milne, A., Giller, K., Sadras, V., & Lark, R. (2024). Exploration of data for analysis using boundary line methodology. Computers and Electronics in Agriculture, 219. https://doi.org/10.1016/j.compag.2024.108794

5. Miti, C., Milne, A., Giller, K., & Lark, R. (2024). The concepts and quantification of yield gap using boundary lines. a review. Field Crops Research, 311. https://doi.org/10.1016/j.fcr.2024.109365.

6. Nelder, J.A. 1961. The fitting of a generalization of the logistic curve. Biometrics 17: 89–110. https://doi.org/10.2307/2527498

7. Rousseeuw, P. J., Ruts, I., & Tukey, J. W. (1999). The bagplot: A bivariate boxplot. The American Statistician, 53, 382–387. https://doi.org/10.1080/00031305.1999.10474494

8. Shatar, T. M., & McBratney, A. B. (2004). Boundary-line analysis of field-scale yield response to soil properties. Journal of Agricultural Science, 142 , 553-560.

9. Schnug, E., Heym, J. M., & Murphy, D. P. L. (1995). Boundary line determination technique (bolides). In P. C. Robert, R. H. Rust, & W. E. Larson (Eds.), site specific management for agricultural systems (p. 899-908). Wiley Online Library. https://doi.org/10.2134/1995.site-specificmanagement.c66

10. Webb, R. A. (1972). Use of the boundary line in analysis of biological data. Journal of Horticultural Science, 47, 309--319. https://doi.org/10.1080/00221589.1972.11514472

## Try the BLA package in your browser

Any scripts or data that you put into this service are public.

BLA documentation built on May 29, 2024, 10:32 a.m.