knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(regtreeseg)
knitr::opts_chunk$set(fig.height = 8, fig.width = 12)

Introduction to Regression Tree Segmentation using regtreeseg Routines

Introduction

This document aims to give an overview on the methods and functions in the regtreeseg package. The regtreeseg package utilizes many of the ideas and uses the linear regression tree function from RPART by Terry M. Therneau and Elizabeth J. Atkinson.

The regtreeseg package functions segment the $log_2$ ratio relative to the copy number variants (CNV) signal from the genomic data.The functions depend on the the $log_2$ ratio between the sample genome and the reference genome data, as well as the genomic position. The functions use a regression tree approach with optimized complexity parameter values, iterative regression trees, weighting, or a combination to segment the data.

The package has functions built to segment the whole genome and also segment individual chromosomes.

Data

data("genomesample")
data("chr1sample")
data("chr3sample")

The dataframes inputted into the regtreeseg functions are required to have a columns labeled "log2r", "Start.Pos", and "Chr". These columns contain the $log_2$ ratio data, the genomic location, and the Chromosome. The chromosome column should include information in the format "chr1","chr2",..."chr22", "chrX". In building these functions Mayo Clinic's WANDY formatted dataframes with and added "log2r" column were used.

Fitting the Regression Tree

Regression trees partition the data into branches based on the explanatory data. The regression tree evaluates a split at each explanatory variable and value and calculates the MSE for that split. The tree then chooses to split at the explanatory variable and value that has the lowest MSE. Then this process continues and the tree is split into smaller partitions. Each partition predicts the response variable for the values in that partition. In this case the response variable is the $log_2$ ratio and the explanatory variable is the genomic location. Therefore the data is partitioned into segments that predict the $log_2$ ratio for the genomic positions in each segment.

There are multiple parameters that affect when the tree stops splitting. The key parameter focused on in the linear regression trees used for segmentation in regtreeseg is the complexity parameter. The complexity parameter can be thought of as the "minimum benefit" that a the $R^2$ must improve by adding the split in the tree for the split in the tree to be included. The cpopt function returns the complexity parameter (cp) value that has the lowest cross validation error. The cross validation error is referred to by xerror in the rpart package.

When the function is at default (when conserve == FALSE), the function looks at a list of cp values provided by rpart where the smallest value on that list is around .001. Based on the $log_2$ ratio and the genomic position data, the function determines which value has the smallest cross validation error and returns that value.

cpopt(chr1sample)

The larger the cp value, the larger the improvement must be for splitting the tree. As a result, a larger cp value creates a tree with fewer splits than a tree with a smaller cp value. The option to have a more conservative cp value is available (conserve == TRUE). When this option is chosen the function looks at a list of cp values provided by rpart where the smallest cp value is around .01. The function returns the value from the list that has the smallest cross validation error.

cpopt(chr1sample, conserve = TRUE)

Segmenting CNV's using the optimal cp values

Whole Genome

To segment the whole genome, there is one regression tree for each chromosome that predicts the segments in the data. Each regression tree uses it's own optimal cp value based on the $log_2$ ratio and genomic position data from that chromosome. A report of the optimal cp values can be found (cpdf).

example1a <- seg.genome(genomesample, png_filename = "example.png")
example1a$cpdf

There are other ways to customize the cp values. One way to use the conservative optimized cp value. This is the equivalent to using the cp value from cpopt(sample, conserve == TRUE) for each individual chromosome.

example1b <- seg.genome(genomesample, "example.png", conserve = TRUE)
example1b$cpdf

The user can also specify the cp value that is used for all of the chromosomes.

example1c <- seg.genome(genomesample, "example.png", cpvalue = .005)
example1c$cpdf

After segmenting the the genomic data according to the assigned cp value, a .png file is exported with a plot of the $log_2$ ratio data and the genomic position. The colors of the data alternate by even and odd chromosome to designate where one chromosome starts and where one begins. The predictions from each regression tree are bound together to create a prediction for the whole genome. The predictions are shown smoothed over the genomic data picking up the spikes and dips in the data.

The predictions from each of the regression trees, combined into one dataframe can be found (regtreepred). The predictions dataframe includes the columns of the inputted dataframe, the predicted $log_2$ ratio from the regression trees for each genomic postion("pred"), the chromosome number as a string ("chrom"), and the chromosome number as a numeric value ("chrN").

head(example1a$regtreepred,5)

The predictions are then converted into a dataframe with information on the segments. The information in the dataframe is the chromsome("chr"), the chromosome number ("chrnumber"), the start of segment ("start"), the end of the segment ("end"), the $log_2$ ratio predicted by the regression tree ("meanlog2ratio"), the median location of the segment ("location"), and the width of each of the segments ("widths).

head(example1a$segments,5)

Single Chromosome

To segment a specific chromosome, the user can specify the chromosome of interest. The user should enter the chromid in the same format as the Chr column (e.g "chr1", "chr2",...."chr22", "chrX").

To segment a single chromosome the user can input a dataframe with only one chromosome or a dataframe with multiple chromosomes. If a dataframe has multiple chromosomes, all the chromosomes will be segmented, but only the information for the specified chromosome will be returned and only the specified chromosome will be plotted.To reduce computation time, inputting a dataframe with only the chromosome of interest is recommended.

example2 <- seg.chr(chr3sample, chromid = "chr3")

Just as in the whole genome example, there are the same options for choosing a cp value. The default will be to use the optimal cp value, but there are options to specify a cp value or use the conservative optimal cp value.

Also, like the whole genome example this returns a dataframe with the predictions from the regression tree (regtreepred), a dataframe with the segments and corresponding information (segments), a dataframe with the cp value used to segment any of the data in the inputted dataframe (cpdf), and a plot of the chromosome data segmented (chrplot.

head(example2$regtreepred,5)
head(example2$segments, 5)
example2$cpdf
example2$chrplot

Segmenting CNV's using a 3 iterative regression tree approach (and optimal cp values)

As seen in the segmentation using only the optimized cp, the obvious spikes in the data are segmented out, but there are still segments that the model misses. To help remedy this an iterative regression tree approach to segment the data was implemented. Similar to the segmenting approach using the optimal cp value shown, this option is available for the whole genome and a single chromosome. The iterative method uses the optimal cp approach, but uses three iterations, and therefore three regression trees, to pick up spikes that were potentially missed in the first pass. The steps of the iterative regression tree approach for a single chromosome sample are as follows:

  1. A regression tree using the optimal cp value based on the chromosome data is built for the chromosome
  2. The residual error from the regression tree's predictions (pred1) and the actual points are calculated
  3. A regression tree using the optimal cp value based on the residual data is built to fit the residual error
  4. The residual error from the regression tree's predictions (pred2) and the actual residual points are calculated
  5. The two iterations of regression tree's predictions are added together (pred1 + pred2)
  6. The residual error from the added regression tree's predictions (pred1+pred2) and the actual points are calculated
  7. A regression tree using the optimal cp value based on the residual data is built to fit the residual error
  8. The third iteration's regression tree predictions are added to the first two regression tree predictions. The three regression tree predictions added together are the final prediction.

This approach is set to 3 iterations. At some point the residual error from the prediction does not have any spikes, so the regression tree is a straight, horizontal line. Thus, more iterations are not beneficial.

Using the iterseg.chr function this will return a dataframe with the regression tree prediction information for the chromosome of interest (regtreepred), a dataframe with the segments information for the chromosome of interest (segments), a dataframe that includes the cp values used for the first regression tree iteration (cpdf), a plot with the chromosome data segmented after 3 iterations (chrplot), and a list of plots that show the iterations (plots).

example3 <- iterseg.chr(chr3sample, "chr3")
head(example3$regtreepred,5)
head(example3$segments, 5)
example3$cpdf
example3$chrplot
example3$plots

Using the iterseg.genome function works in the same way as the iterseg.chr function, but returns slightly different information. Each chromosome is segmented individually and then the predictions are compiled together to create a whole genome report. This function returns a dataframe that includes the regression tree prediction information (regtreepred), a dataframe with the segments information (segments), and a dataframe that includes the cp values used for the first regression tree iteration (cpdf), and a .png file with the segmentation after 3 iterations for the whole genome.

NOTE: Segmenting the entire genome could take up to a minute for computation.

example4 <- iterseg.genome(genomesample, "example4.png")
head(example4$regtreepred, 5)
head(example4$segments, 5)
example4$cpdf

By using the iterative approach any spikes that are missed in the first regression tree are then hopefully picked up in the second regression tree. Then , this idea applies again for the third regression tree to pick up any spikes missed by the first or second regression tree.

Weighting

While the cp optimizing method is working to minimize the cross validation, the clinical purpose of the regression tree is to segment out the spikes in the $log_2$ ratio data. Iteration helps catch missed spikes, but weighting helps to incentivize the regression tree to reach for the points that are farther from a log2ratio of 0. If the log2ratio has a magnitude less than or equal to the mean absolute deviation difference (mad.diff) of all the log2ratio points, then the weight of the point is 1 (it is a “regular” point). If the log2ratio has a magnitude greater than the mad.diff, then the weight is the magnitude of the log2r/mad.diff. Thus, a larger log2r will have a larger weight to act as incentive for the regression tree.

Using the weighted approach allows for the model to reach up closer to the top point of each segment. The weighted approach may also be incentivized to follow spikes that have a high magnitude, but fewer points, a scenario where the model may previously have not segmented out any data.

The iterseg.chr.weighted and iterseg.genome.weighted functions return the same items as their corresponding unweighted functions. Weighting assists for some sets of genomic data and does not help or makes no change for other sets of data. The weight can work to help identify more spikes and allow the model to reach to the full extent of each spikes, but also has the potential to overfit the data (especially when the data is variable and has many extreme values). The tool of the weighted feature is up to the user.

example5 <- iterseg.chr.weighted(chr3sample, "chr3")
head(example5$regtreepred,5)
head(example5$segments, 5)
example5$cpdf
example5$chrplot
example5$plots

References

  1. T.M Therneau and E.J Atkinson. An introduction to recursive partitioning using the rpart routines. Mayo Foundation, 2019.


annikacleven/regtreesegpackage documentation built on Dec. 19, 2021, 3:40 a.m.