Motivation"


Objective

Define a Distance Measure for Numerical and Categorical Features

Let $x_i = \left(x_i^1,\cdots,x_i^p\right)^T$ and $x_j = \left(x_j^1,\cdots,x_j^p\right)^T$ be the feature vectors of two distinct patients $i$ and $j$. A first rough idea may be to calculate the $L_1$-norm of these two feature vectors: $$ \|x_i - x_j\|_{L_1}=\frac{1}{p}\sum\limits_{k=1}^p|x_i^k - x_j^k| $$
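
As a toy illustration, this averaged $L_1$ distance for two purely numerical feature vectors can be written in a few lines of R; the patient values are made up for this sketch:

```r
# Two hypothetical patients with three numerical features (height in metres)
x_i <- c(height = 1.70, age = 56, weight = 72)
x_j <- c(height = 1.82, age = 60, weight = 85)

# averaged L1 distance: mean of the absolute feature-wise differences
mean(abs(x_i - x_j))
```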

But this naive approach raises some problems:

  1. This is well defined for numerical features, but what about categorical features, e.g. gender (male/female)? How should we define the distance here?

  2. $|x_i^k - x_j^k|$ is not scale invariant, e.g. if you change from metres to centimetres, the contribution of this feature explodes by a factor of 100. Features with large values will dominate the distance (see the short example after this list).

  3. All features are treated the same way, e.g. for lung cancer, smoker (no/yes) is presumably more important than the city the patient lives in.
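
A short sketch of the scale problem from point 2, reusing the made-up patients from above but recording height once in metres and once in centimetres:

```r
# Same two hypothetical patients, height in metres vs. centimetres
x_i_m  <- c(height = 1.70, age = 56, weight = 72)
x_j_m  <- c(height = 1.82, age = 60, weight = 85)
x_i_cm <- c(height = 170,  age = 56, weight = 72)
x_j_cm <- c(height = 182,  age = 60, weight = 85)

mean(abs(x_i_m  - x_j_m))   # height contributes only 0.12 to the sum
mean(abs(x_i_cm - x_j_cm))  # height now contributes 12 and dominates
```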

We propose a machine learning approach for overcoming these issues.

Weighted Distance Measure

Two steps have to be taken to overcome the above problems:

  1. Generalize the $L_1$-distance on the feature scale, such that it can deal with different variable types.

  2. Weight each feature suitably, in order to

      a. rule out the dependency on the scale,
      b. account for the varying importance across features, and
      c. reflect the impact size of each feature.

This leads to following weighted distance measure:

$$ d(x_i, x_j) = \sum\limits_{k=1}^p |\alpha(x_i^k, x_j^k) d(x_i^k, x_j^k)| $$

Statistical Model

Let $\alpha(x_i^k, x_j^k)$ be the weighting and $d(x_i^k, x_j^k)$ the distance for feature $k$. They are defined as follows:

If feature $k$ is numerical, then

$$d(x_i^k, x_j^k) = x_i^k - x_j^k$$ and

$$\alpha(x_i^k, x_j^k) = \frac{\hat{\beta_k}}{\sum \hat{\beta_k}}$$

If feature $k$ is categorical, then we define

$$d(x_i^k, x_j^k) = 1 \text{ when } x_i^k \neq x_j^k \text{ else } 0$$ and $$\alpha(x_i^k, x_j^k) = \frac{\hat{\beta_k}^i - \hat{\beta_k}^j}{\sum \hat{\beta_k}},$$ where $\hat{\beta_k}^i$ and $\hat{\beta_k}^j$ are the coefficients of the factor levels observed for patients $i$ and $j$ in a regression model (linear, logistic, or Cox model).

This approach solves all three problems described above.
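
A minimal sketch of this weighted distance in R, assuming a Cox model fitted with the survival package on the lung data; the data set, the selected features, and the normalisation by the sum of absolute coefficients are illustrative choices for this sketch, not part of the method's definition:

```r
library(survival)

# illustrative data: two numerical features (age, ph.ecog), one categorical (sex)
lung2 <- na.omit(lung[, c("time", "status", "age", "ph.ecog", "sex")])
lung2$sex <- factor(lung2$sex)

# regression model supplying the feature weights (here a Cox model)
fit  <- coxph(Surv(time, status) ~ age + ph.ecog + sex, data = lung2)
beta <- coef(fit)
norm <- sum(abs(beta))   # normalising constant (absolute values, an assumption)

weighted_dist <- function(i, j, data) {
  xi <- data[i, ]; xj <- data[j, ]
  # numerical features: |alpha_k| * |x_i^k - x_j^k|
  d <- abs(beta["age"])     / norm * abs(xi$age     - xj$age) +
       abs(beta["ph.ecog"]) / norm * abs(xi$ph.ecog - xj$ph.ecog)
  # categorical feature: |beta^i - beta^j| / norm (reference level has beta = 0)
  b_sex <- c("1" = 0, "2" = unname(beta["sex2"]))
  unname(d + abs(b_sex[as.character(xi$sex)] - b_sex[as.character(xj$sex)]) / norm)
}

weighted_dist(1, 2, lung2)   # weighted distance between patients 1 and 2
```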

Random Survival Forests

Proximity Measure

Two observations $i$ and $j$ are considered similar when the fraction of trees in which patients $i$ and $j$ share the same terminal node is close to one (Breiman, 2002).

$$ d(x_i, x_j)^2 = 1 - \frac{1}{M}\sum\limits_{t=1}^T 1_{[x_i \text{ and } x_j \text{ share the same terminal node in tree } t]}, $$ where $M$ is the number of trees that contain both observations and $T$ is the total number of trees. A drawback of this measure is that the decision is binary, so possibly similar observations may be counted as not similar: e.g. assume a final cut-off is always made around age 58 and observation 1 has age 56 while observation 2 has age 60; then the distance from observation 1 to observation 2 is the same as the distance from observation 1 to an observation with age 80.
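
A minimal sketch of this proximity-based distance, assuming the ranger package and using the iris data purely for illustration (here every tree contains every observation, so $M = T$):

```r
library(ranger)

rf <- ranger(Species ~ ., data = iris, num.trees = 500)

# terminal node IDs: one row per observation, one column per tree
term <- predict(rf, iris, type = "terminalNodes")$predictions

# proximity of observations 1 and 2: fraction of trees in which they share
# the same terminal node; the squared distance is one minus this fraction
prox_12 <- mean(term[1, ] == term[2, ])
d_sq_12 <- 1 - prox_12
```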

Depth Measure: A modified proximity measure

Instead of only the terminal node in each tree, the depth measure uses the number of edges between the terminal nodes of the two observations, averaged over all trees. More precisely, this distance measure is defined as:

$$ d(x_i, x_j)^2 = 1 - \frac{1}{M}\sum\limits_{t=1}^T \exp{(-\omega\cdot g_{ij})}, $$ where $\omega \geq 0$ is a weighting factor and $g_{ij}$ is the number of edges between the terminal nodes of observations $i$ and $j$ in tree $t$. More details can be found in Englund and Verikas, "A novel approach to estimate proximity in a random forest: An exploratory study".
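
A minimal sketch of $g_{ij}$ and the resulting depth distance, again assuming the ranger package and the iris data for illustration; the tree structure is read from ranger::treeInfo(), and the edge count is obtained via the lowest common ancestor of the two terminal nodes:

```r
library(ranger)

rf   <- ranger(Species ~ ., data = iris, num.trees = 50)
term <- predict(rf, iris, type = "terminalNodes")$predictions

# path from a node up to the root of one tree (vector of node IDs)
path_to_root <- function(tree, node) {
  parent <- rep(NA_integer_, nrow(tree))
  kids <- !is.na(tree$leftChild)
  parent[tree$leftChild[kids]  + 1L] <- tree$nodeID[kids]
  kids <- !is.na(tree$rightChild)
  parent[tree$rightChild[kids] + 1L] <- tree$nodeID[kids]
  path <- node
  while (!is.na(parent[node + 1L])) {
    node <- parent[node + 1L]
    path <- c(path, node)
  }
  path
}

# g_ij for one tree: edges from each terminal node up to their lowest common ancestor
edge_distance <- function(tree, node_i, node_j) {
  p_i <- path_to_root(tree, node_i)
  p_j <- path_to_root(tree, node_j)
  lca <- intersect(p_i, p_j)[1]
  (match(lca, p_i) - 1L) + (match(lca, p_j) - 1L)
}

# depth-based squared distance between observations 1 and 2 with omega = 1
omega <- 1
g <- sapply(seq_len(rf$num.trees), function(t) {
  edge_distance(treeInfo(rf, t), term[1, t], term[2, t])
})
1 - mean(exp(-omega * g))
```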


