In the first example, we use the Cox model and the ovarian data set from the survival package. In the first step, we initialize the R6 data object.
library(tidyverse)
library(survival)
library(CaseBasedReasoning)
ovarian$resid.ds <- factor(ovarian$resid.ds)
ovarian$rx <- factor(ovarian$rx)
ovarian$ecog.ps <- factor(ovarian$ecog.ps)
# initialize R6 object
coxBeta <- CoxBetaModel$new(Surv(futime, fustat) ~ age + resid.ds + rx + ecog.ps)
All cases with missing values in the learning and end point variables are dropped (na.omit), and the reduced data set without missing values is saved internally. A text message reports how many cases were dropped. Variables of type character are converted to factor.
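The preprocessing described above can be approximated in base R. This is a minimal sketch, not the package's internal code: the helper prepare_cases is hypothetical, and it assumes that only the variables named in the formula are checked for missing values.

```r
# Pristine copy of the data set, independent of the factor conversions above
toy <- survival::ovarian
toy$age[c(2, 5)] <- NA            # introduce two missing values for illustration
toy$rx <- as.character(toy$rx)    # a character column that should become a factor

# Hypothetical helper: mimics the described preprocessing, not the package internals
prepare_cases <- function(data, vars) {
  # convert character columns to factors
  is_chr <- vapply(data, is.character, logical(1))
  data[is_chr] <- lapply(data[is_chr], factor)
  # drop rows with missing values in the model variables only
  keep <- complete.cases(data[, vars, drop = FALSE])
  message("Dropped cases with missing values: ", sum(!keep))
  data[keep, , drop = FALSE]
}

model_vars <- c("futime", "fustat", "age", "resid.ds", "rx", "ecog.ps")
cleaned <- prepare_cases(toy, model_vars)
nrow(cleaned)      # two rows fewer than nrow(toy)
class(cleaned$rx)  # "factor"
```

Handling missing values explicitly like this before passing data to the model keeps the row indices of your own data under your control.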
After the initialization, we may want to find, for each case in the query data, the most similar cases in the learning data.
n <- nrow(ovarian)
trainID <- sample(1:n, floor(0.8 * n), replace = FALSE)
testID <- (1:n)[-trainID]
# fit model
ovarian[trainID, ] %>%
  coxBeta$fit()
## Dropped cases with missing values: 0
## Start learning...
## Learning finished in: 0.49 seconds.
# get similar cases
ovarian[trainID, ] %>%
  coxBeta$get_similar_cases(queryData = ovarian[testID, ], k = 3) -> matchedData
## Start caclulating similar cases...
## Warning: package 'bindrcpp' was built under R version 3.4.4
## Similar cases calculation finished in: 0.19 seconds.
knitr::kable(head(matchedData))
futime | fustat | age | resid.ds | rx | ecog.ps | scDist | caseId | scCaseId | group |
---|---|---|---|---|---|---|---|---|---|
59 | 1 | 72.3315 | 2 | 1 | 1 | 0.0000000 | 0 | 1 | Query Data |
115 | 1 | 74.4932 | 2 | 1 | 1 | 0.6225564 | 1 | 1 | Matched Data |
156 | 1 | 66.4658 | 2 | 1 | 2 | 2.4957528 | 2 | 1 | Matched Data |
365 | 1 | 64.4247 | 2 | 2 | 1 | 4.1382432 | 3 | 1 | Matched Data |
477 | 0 | 64.1753 | 2 | 1 | 1 | 0.0000000 | 0 | 2 | Query Data |
156 | 1 | 66.4658 | 2 | 1 | 2 | 0.1468168 | 1 | 2 | Matched Data |

You may then extract the similar cases and the verum data and put them together:
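One possible reading of putting the similar cases and the verum (query) data together is sketched below in base R. It is an assumption, not the author's code: the data frame matched is a toy stand-in for matchedData built from the first four rows of the table above (reduced to a few columns), and the suffixes .query and .match are made up for illustration.

```r
# Toy stand-in for matchedData: first query case and its three matches
matched <- data.frame(
  futime   = c(59, 115, 156, 365),
  age      = c(72.3315, 74.4932, 66.4658, 64.4247),
  scDist   = c(0, 0.6225564, 2.4957528, 4.1382432),
  caseId   = 0:3,
  scCaseId = 1,
  group    = c("Query Data", rep("Matched Data", 3))
)

# Split into the verum (query) rows and the matched rows
verum   <- matched[matched$group == "Query Data", ]
similar <- matched[matched$group == "Matched Data", ]

# Join them side by side on the grouping number scCaseId:
# one row per (query case, matched case) pair
combined <- merge(verum, similar, by = "scCaseId",
                  suffixes = c(".query", ".match"))
combined[, c("scCaseId", "age.query", "age.match", "scDist.match")]
```

Each row of combined then pairs one query case with one of its matched cases, which is convenient for comparing covariates or outcomes within a pair.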
Note 1: In the initialization step, all cases with missing values in the variables of data and endPoint are dropped. If dropping them is not what you want, make sure that you handle missing values yourself beforehand.

Note 2: The data.table returned from coxBeta$get_similar_cases has four additional columns:

caseId: Position of a case within its group. In the output above, the query case has caseId 0 and its k = 3 matched cases have caseId 1, 2, and 3, ordered by increasing distance.
scDist: The calculated distance.
scCaseId: Grouping number that links a query case to its matched data.
group: Indicates whether a row is matched data or query data.

Alternatively, if you are only interested in the distance matrix, you can compute it directly:
ovarian %>%
  coxBeta$calc_distance_matrix() -> distMatrix
## Start calculating distance matrix...
## Distance matrix calculation finished in: 0.01 seconds.
coxBeta$calc_distance_matrix() calculates the full distance matrix. This matrix has the dimension: cases of data versus cases of query data. If the query data set is not available, this function calculates an n times n distance matrix of all pairs in data. The distance matrix is saved internally in the CoxBetaModel object: coxBeta$distMat.
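The two shapes of the distance matrix can be illustrated with a small stand-alone sketch. It is not the package's implementation: the helper weighted_dist is hypothetical, and it assumes that the absolute Cox coefficients serve as weights in a weighted Manhattan distance, which may differ from the formula used by CoxBetaModel. Only numeric covariates are used to keep the example short.

```r
library(survival)

ov <- survival::ovarian  # pristine copy with numeric covariates

# Assumption: absolute Cox coefficients act as variable weights
fit <- coxph(Surv(futime, fustat) ~ age + ecog.ps, data = ov)
w <- abs(coef(fit))

# Hypothetical helper: rows = cases of data, columns = cases of query.
# Without query data, all pairs within data are compared.
weighted_dist <- function(data, query = data, w) {
  x <- as.matrix(data[, names(w), drop = FALSE])
  q <- as.matrix(query[, names(w), drop = FALSE])
  d <- matrix(0, nrow(x), nrow(q))
  for (j in seq_along(w)) {
    # weighted absolute difference for covariate j, summed over covariates
    d <- d + w[j] * abs(outer(x[, j], q[, j], "-"))
  }
  d
}

d_full  <- weighted_dist(ov, w = w)                     # n x n
d_query <- weighted_dist(ov[1:20, ], ov[21:26, ], w = w) # data x query
dim(d_full)
dim(d_query)
```

With query data, each column holds the distances from one query case to all cases in data; without it, the result is a symmetric matrix with zeros on the diagonal.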
pp <- coxBeta$check_ph()
pp
TBD