Let \(x_i = \left(x_i^1,\cdots,x_i^p\right)^T\) and \(x_j = \left(x_j^1,\cdots,x_j^p\right)^T\) be the feature vectors of two distinct patients \(i\) and \(j\). A first rough idea is to calculate the (normalized) \(L_1\)-distance between these two feature vectors: \[ \|x_i - x_j\|_{L_1}=\frac{1}{p}\sum\limits_{k=1}^p|x_i^k - x_j^k| \]
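As a concrete reference point, here is a minimal Python sketch of this naive distance (assuming purely numerical feature vectors):

```python
import numpy as np

def naive_l1(x_i, x_j):
    """Normalized L1 distance: mean absolute difference over the p features."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.mean(np.abs(x_i - x_j))

# Two patients with p = 3 numerical features (age, height in m, weight in kg):
# the height difference contributes almost nothing on this scale.
print(naive_l1([56, 1.70, 80], [60, 1.65, 72]))
```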
But this naive approach raises several problems:
- This is well defined for numerical features, but what about categorical features, e.g. gender (male/female)? How should we define the distance here?
- \(|x_i^k - x_j^k|\) is not scale invariant: if you change from meters to centimetres, the contribution of this feature explodes by a factor of 100, so features with large values will dominate the distance.
- All features are treated the same way, although their relevance may differ, e.g. for lung cancer, smoker (no/yes) seems to be more important than the city the patient lives in.
We propose a machine learning approach to overcome these issues.
Two steps have to be done to overcome the above problems:

1. Generalize the L1-distance on the feature scale, such that it can deal with different variable types.
2. Weight each feature suitably, in order to
    a. rule out dependency on the scale,
    b. account for varying importance across features,
    c. reflect the impact size of each feature.
This leads to the following weighted distance measure:
\[ d(x_i, x_j) = \sum\limits_{k=1}^p |\alpha(x_i^k, x_j^k) d(x_i^k, x_j^k)| \]
Here \(\alpha(x_i^k, x_j^k)\) denotes the weight and \(d(x_i^k, x_j^k)\) the distance for feature \(k\). They are defined as follows:
If feature \(k\) is numerical, then
\[d(x_i^k, x_j^k) = x_i^k - x_j^k\] and
\[\alpha(x_i^k, x_j^k) = \frac{\hat{\beta}_k}{\sum_{l=1}^{p} \hat{\beta}_l}\]
If feature \(k\) is categorical, then we define
\[d(x_i^k, x_j^k) = \begin{cases} 0 & \text{if } x_i^k = x_j^k \\ 1 & \text{else} \end{cases}\] and \[\alpha(x_i^k, x_j^k) = \frac{\hat{\beta}_k^i - \hat{\beta}_k^j}{\sum_{l=1}^{p} \hat{\beta}_l},\] where \(\hat{\beta}_k^i\) denotes the coefficient of the category that patient \(i\) takes in feature \(k\). All \(\hat{\beta}\)'s are coefficients of a regression model (linear, logistic, or Cox model). Note that the weight vanishes exactly when both patients fall into the same category.
This approach solves all three problems described above.
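A minimal sketch of this weighted distance in Python. The data layout is an assumption: `beta` maps each numerical feature to its fitted coefficient and each categorical feature to a dict of per-level coefficients; the normalizing sum is taken over absolute coefficient values, a choice the formulas above leave open:

```python
def weighted_distance(x_i, x_j, beta, categorical):
    """Weighted distance d(x_i, x_j) = sum_k |alpha_k * d_k| as defined above.

    x_i, x_j    : dicts mapping feature name -> value
    beta        : fitted coefficients; a float per numerical feature,
                  a dict {level: coefficient} per categorical feature
    categorical : set of names of the categorical features
    """
    # Normalizing constant: sum of all coefficients in absolute value
    # (an assumption -- the formulas above leave the exact sum open).
    norm = sum(
        sum(abs(b) for b in beta[k].values()) if k in categorical else abs(beta[k])
        for k in beta
    )
    total = 0.0
    for k in beta:
        if k in categorical:
            d = 0.0 if x_i[k] == x_j[k] else 1.0              # indicator distance
            alpha = (beta[k][x_i[k]] - beta[k][x_j[k]]) / norm
        else:
            d = x_i[k] - x_j[k]                               # signed difference
            alpha = beta[k] / norm
        total += abs(alpha * d)
    return total

# Toy usage with hypothetical coefficients from a logistic model:
beta = {"age": 0.04, "smoker": {"yes": 1.2, "no": 0.0}}
x1 = {"age": 56, "smoker": "yes"}
x2 = {"age": 60, "smoker": "no"}
print(weighted_distance(x1, x2, beta, categorical={"smoker"}))
```

Because the regression coefficient \(\hat{\beta}_k\) carries the inverse unit of feature \(k\), the product \(\alpha \cdot d\) is scale invariant, which is exactly the point of this weighting.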
An alternative is to measure similarity with a random forest: two observations \(i\) and \(j\) are considered the more similar, the closer the fraction of trees in which they share the same terminal node is to one (Breiman, 2002).
\[ d(x_i, x_j)^2 = 1 - \frac{1}{M}\sum\limits_{t=1}^T 1_{[x_i \text{ and } x_j \text{ share the same terminal node in tree } t]}, \] where \(M\) is the number of trees that contain both observations and \(T\) is the total number of trees. A drawback of this measure is that the decision is binary, so possibly similar observations may be counted as not similar: assume the final cut-off is always made around age 58, observation 1 has age 56 and observation 2 has age 60; then the distance between observations 1 and 2 is the same as between observation 1 and an observation aged 80.
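To make this concrete, here is a small sketch using scikit-learn (the text names no library, so treat the API choice as an assumption). `forest.apply(X)` returns the terminal-node index of every observation in every tree; for simplicity we average over all \(T\) trees, i.e. \(M = T\), whereas tracking which observations each tree was actually trained on would be needed to match the \(M\) above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_proximity_distance(forest, X):
    """Squared distance matrix 1 - proximity, where proximity is the
    fraction of trees in which two observations share a terminal node."""
    leaves = forest.apply(X)                      # shape (n_samples, n_trees)
    n = leaves.shape[0]
    prox = np.zeros((n, n))
    for t in range(leaves.shape[1]):
        # Pairwise comparison of leaf indices in tree t.
        prox += leaves[:, t][:, None] == leaves[:, t][None, :]
    prox /= leaves.shape[1]                       # average over all T trees
    return 1.0 - prox                             # d(x_i, x_j)^2

# Toy usage:
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4)); y = (X[:, 0] > 0).astype(int)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
D2 = rf_proximity_distance(rf, X)
```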
The depth measure uses, instead of only the terminal node of each tree, the number of edges between the terminal nodes of the two observations, averaged over all trees. More precisely, this distance measure is defined as:
\[ d(x_i, x_j)^2 = 1 - \frac{1}{M}\sum\limits_{t=1}^T \exp{(-\omega\cdot g_{ij})}, \] where \(\omega \geq 0\) is a weighting factor and \(g_{ij}\) is the number of edges between the terminal nodes of observations \(i\) and \(j\) in tree \(t\). More details can be found in Englund and Verikas, "A novel approach to estimate proximity in a random forest: An exploratory study".
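A possible sketch of the depth measure with scikit-learn trees (illustrative and unoptimized, quadratic in the number of observations; Englund and Verikas may define details such as the choice of \(\omega\) differently). The `tree_.children_left/right` arrays give the tree structure; leaf depths plus a lowest-common-ancestor walk yield the edge count \(g_{ij}\):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def depth_distance(forest, X, omega=1.0):
    """Squared depth-based distance 1 - (1/M) * sum_t exp(-omega * g_ij),
    with g_ij the number of edges between the leaves of i and j in tree t."""
    n = X.shape[0]
    sims = np.zeros((n, n))
    for est in forest.estimators_:
        tree = est.tree_
        # Parent pointers and node depths; sklearn stores children after
        # their parent, so one forward pass fills both arrays.
        parent = np.full(tree.node_count, -1)
        depth = np.zeros(tree.node_count, dtype=int)
        for node in range(tree.node_count):
            for child in (tree.children_left[node], tree.children_right[node]):
                if child != -1:
                    parent[child] = node
                    depth[child] = depth[node] + 1
        leaves = est.apply(X)                 # terminal node of each observation
        for i in range(n):
            for j in range(n):
                a, b = leaves[i], leaves[j]
                while a != b:                 # climb to lowest common ancestor
                    if depth[a] >= depth[b]:
                        a = parent[a]
                    else:
                        b = parent[b]
                g = depth[leaves[i]] + depth[leaves[j]] - 2 * depth[a]
                sims[i, j] += np.exp(-omega * g)
    return 1.0 - sims / len(forest.estimators_)

# Toy usage:
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4)); y = (X[:, 0] > 0).astype(int)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
D2 = depth_distance(rf, X, omega=0.5)
```

Observations sharing a leaf get \(g_{ij} = 0\) and hence the maximal similarity \(\exp(0) = 1\), so this measure coincides with the binary one in that case but degrades gracefully with separation depth.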