knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = TRUE
)
library(tidyverse)
library(mvpart)
# source("C:/Users/dora/Downloads/mvpart_1.1-1/mvpart/R/mvpart.R")

Installation

The mvpart package is no longer active on CRAN but can be installed from the archives.

# install.packages("devtools")
devtools::install_github("cran/mvpart")

Or download a realease from https://github.com/cran/mvpart/releases and install it with something like:

install.packages(
  "C:/Users/dora/Desktop/mvpart-1.6-2.tar.gz", 
  repos = NULL, type = "source"
)

Some people reported installation issues (https://goo.gl/oDjjz8).

Data

devtools::load_all()
KC3spp20 <- read_csv_alt("KC3spp20.csv", "./data-raw/mvpart/KC3spp20.csv")
kc.hab <- read_csv_alt("kc.hab.csv", "./data-raw/mvpart/kc.hab.csv")

Overview data

2 files attached:

# See a few columns from the beginning, middle and end
KC3spp20 %>% dplyr::select(1:2, 250:252, 583:585)

# For a cleaner output for interpretation, normalized tree spp data to a mean of
# 0 and standard deviation of 1 (https://goo.gl/zDLdMi).
kc.hab %>% dplyr::glimpse()

Results

In this section, we first explore one result in detail; then we'll re-run the exact same model twice more and we'll compare the results.

Build and run the model

abundance <- data.matrix(KC3spp20)
environmental_variables <- kc.hab

formula <- abundance ~ RB_NO3 + RB_NH4 + RB_PO4 + Al + pH_water + Na + Mn + 
  Mg + K + Fe + Ca + BS + ECEC + Bray_P + meanelev + convex + slope

# Set a new seed for random numbers to ensure results are reproducible
set.seed(1221)

# See `?mvpart()` for argument details
mvpart_run1 <- mvpart(
  form = formula, 
  data = environmental_variables,
  all.leaves = TRUE,  # annotate all nodes
  rsq = TRUE,  # give "rsq" plot
  pca = TRUE,  # plot PCA of group means and add species and site information
  wgt.ave.pca = TRUE  # plot weighted averages acorss sites for species
)

Model details

Structure

str(mvpart_run1)

Full summary

summary(mvpart_run1)

Groupings

table(mvpart_run1$where)

Map

plot_where <- function(run) {
  run <- get(run)
  # Data
  index <- 1:600
  grouped_quadrats <- index %>% 
    index.to.gxgy(plotdim = c(600, 400)) %>%
    as_tibble() %>% 
    mutate(
      index = index,
      group = as.factor(run$where)
    )

  # Plot
  ggplot(grouped_quadrats, aes(gx, gy)) +
    geom_raster(aes(fill = group)) +
    geom_text(aes(label = index)) +
    theme_minimal()
}
plot_where("mvpart_run1")

Evaluation

Compare multiple runs of the same model

Two runs of the same model are different because the algorithm uses random numbers.

set.seed(1234)  # Set a new seed for random numbers
mvpart_run2 <- mvpart(form = formula, data = environmental_variables)
# Set a different seed and re-run the exact same model
set.seed(4321)  
mvpart_run3 <- mvpart(form = formula, data = environmental_variables)
# compare all models run
all.equal(mvpart_run1, mvpart_run2)
all.equal(mvpart_run1, mvpart_run3)
all.equal(mvpart_run2, mvpart_run3)

Visualize grouping of three different runs

paste0("mvpart_run", 1:3) %>% purrr::map(plot_where)

mvpart() fails with scaled abundance data

The clustering actually is weird. It groups the top of the hill and the bottom
of the hill (red and green on the map) together in the second split. The
species are really different between these habitats, which suggests there is
an error.

The classification tree seems to be entirely driven by the soils, and the
species don’t carry much weight.

I wonder if the results would differ after scaling the species data (scale function in base package).

These are data for each tree species normalized to a mean of 0 and standard deviation of 1, which creates a cleaner output for interpretation

--https://goo.gl/zDLdMi

# Standarize

abundance <- data.matrix(KC3spp20)
abund_scaled <- scale(abundance)

# Ad-hoc function to check that all columns have mean = 0 and sd = 1
are_all_columns_near <- function(.data, .near, .f) {
  # Arguments:
  #   .data: Dataframe or matrix.
  #   .near: Scalar.
  #   .f: Summary function such as mean and sd to apply to all columns of data.
  # Value:
  #   Returns TRUE if all columns are near .near; FALSE otherwise.
  .data %>% 
    as_tibble() %>%
    summarise_all(.f) %>% 
    purrr::map(dplyr::near, .near) %>%
    as.logical() %>% 
    all()
}
abundance %>% are_all_columns_near(0, mean)
abund_scaled %>% are_all_columns_near(0, mean)
abundance %>% are_all_columns_near(1, sd)
abund_scaled %>% are_all_columns_near(1, sd)
# Re-run analysis that resulted in mvpart_run1

formula_scaled <- abund_scaled ~ RB_NO3 + RB_NH4 + RB_PO4 + Al + pH_water +
  Na + Mn + Mg + K + Fe + Ca + BS + ECEC + Bray_P + meanelev + convex + slope

set.seed(1221) # same seed as for mvpart_run1

# This one fails
mvpart_run1_scaled <- mvpart(
  form = formula_scaled, 
  data = environmental_variables,
  all.leaves = TRUE,  # annotate all nodes
  rsq = TRUE,  # give "rsq" plot
  pca = TRUE,  # plot PCA of group means and add species and site information
  wgt.ave.pca = TRUE  # plot weighted averages acorss sites for species
)

The scaled data fails. Repeating mvpart_run1 to check it runs again:

<<mvpart-run1>>

Miscelaneas

Example: Path from root to leaf node in mvpart

I was recently asked by a R user about how one could extract the “rule” in a classification/regression tree. The requirement was to obtain the path traced from the root node to the leaf nodes and obtain all the paths or “rules”

path.rpart() function in the mvpart package provides this convenience

--https://goo.gl/DgTNFE

library(mvpart)

# Create a classification tree
ozone <- mvpart(Ozone ~ ., data = airquality)
print(ozone) # Gives you the various splits in the tree

# Issue the two commands below to see the graphical representation
plot(ozone)
text(ozone)

# To obtain the summary of the created tree
summary(ozone)

# To obtain the path to the leaf nodes
ozone$frame
leafnodeRows <- grepl("leaf", ozone$frame$var)
nodevals <- as.numeric(rownames(ozone$frame)[leafnodeRows])
rules <- path.rpart(ozone, nodevals)

rulesdf <- do.call(
  "rbind", 
  lapply(rules, function(x) paste(x, collapse = " -AND- "))
)
rulesdf <- data.frame(
  nodeNumber = rownames(rulesdf), 
  rule = rulesdf[, 1], 
  stringsAsFactors = FALSE
)
rulesdf

Package rpart does not handle multi variate data

The rpart package seems like a good alternative because ?mvpart::mvpart() says it's a wrapper of rpart(). However, mvpart::rpart() works with multivariate data, but rpart::rpart() does not.

mvpart::rpart(form = formula, data = environmental_variables)  # passes
rpart::rpart(form = formula, data = environmental_variables)  # fails

rpart is active but mvpart is not. Maybe the authors can inform where else mvpart functions can be found.

Access rpart's vignettes from R with:

browseVignettes("rpart")

# An Introduction to Recursive Partitioning Using the RPART Rutines (62 pages)
vignette("longintro")

Learning more



forestgeo/ctfs documentation built on May 3, 2019, 6:44 p.m.