Creating FFTrees

Nathaniel Phillips

2016-08-23

The FFTrees() function is at the heart of the FFTrees package. It takes a training dataset as an argument and generates several FFTs (more details about the algorithms coming soon…)

heartdisease example

Let’s start with an example: we’ll create FFTs fitted to the heartdisease dataset. This dataset contains data from 303 patients suspected of having heart disease. Here’s how the dataset looks:

head(heartdisease)
##   age sex cp trestbps chol fbs     restecg thalach exang oldpeak slope ca
## 1  63   1 ta      145  233   1 hypertrophy     150     0     2.3  down  0
## 2  67   1  a      160  286   0 hypertrophy     108     1     1.5  flat  3
## 3  67   1  a      120  229   0 hypertrophy     129     1     2.6  flat  2
## 4  37   1 np      130  250   0      normal     187     0     3.5  down  0
## 5  41   0 aa      130  204   0 hypertrophy     172     0     1.4    up  0
## 6  56   1 aa      120  236   0      normal     178     0     0.8    up  0
##     thal diagnosis
## 1     fd         0
## 2 normal         1
## 3     rd         1
## 4 normal         0
## 5 normal         0
## 6 normal         0

The critical dependent variable is diagnosis, which indicates whether a patient has heart disease or not. The other variables in the dataset (e.g., sex, age, and several biological measurements) will be used as predictors.

Now we’ll split the original dataset into a training dataset and a testing dataset. We will create the trees with the training set, then test their performance on the test set:

set.seed(100)
# Randomly assign each row to the training (TRUE) or test (FALSE) set
samples <- sample(c(TRUE, FALSE), size = nrow(heartdisease), replace = TRUE)
heartdisease.train <- heartdisease[samples, ]
heartdisease.test <- heartdisease[!samples, ]
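
The split logic above can be checked on a small stand-in data frame. The following is a minimal sketch in base R; the `toy` data frame and its columns are hypothetical and exist only for illustration. Because each row is assigned independently, the two sets are only roughly equal in size rather than exactly half each.

```r
# Hypothetical stand-in for heartdisease (10 rows, made-up values)
set.seed(1)
toy <- data.frame(id = 1:10, diagnosis = rep(c(0, 1), 5))

# TRUE rows go to the training set, FALSE rows to the test set
in_train <- sample(c(TRUE, FALSE), size = nrow(toy), replace = TRUE)
toy_train <- toy[in_train, ]
toy_test <- toy[!in_train, ]

# Every row lands in exactly one of the two sets
nrow(toy_train) + nrow(toy_test) == nrow(toy)
length(intersect(toy_train$id, toy_test$id)) == 0
```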

We’ll create a new FFTrees object called heart.fft using the FFTrees() function. We’ll specify diagnosis as the (binary) dependent variable, and include all independent variables with formula = diagnosis ~ .:

heart.fft <- FFTrees(
  formula = diagnosis ~ .,
  data = heartdisease.train,
  data.test = heartdisease.test
)

Elements of an FFTrees object

As you can see, FFTrees() returns an object with the FFTrees class:

class(heart.fft)
## [1] "FFTrees"

There are many elements in an FFTrees object. Here are their names:

names(heart.fft)
##  [1] "formula"        "data.train"     "data.test"      "cue.accuracies"
##  [5] "tree.stats"     "lr.stats"       "cart.stats"     "auc"           
##  [9] "lr.model"       "cart.model"     "decision.train" "decision.test" 
## [13] "levelout.train" "levelout.test"

Printing an FFTrees object

You can view basic information about the FFTrees object by printing its name. This gives you a quick summary of the object, including how many trees it has, which cues the trees use, and how well they performed.

heart.fft
## [1] "An FFTrees object containing 8 trees using 4 cues {thal,cp,exang,slope} out of an original 13"
## [1] "Data were trained on 149 exemplars, and tested on 154 new exemplars"
## [1] "Trees AUC: (Train = 0.88, Test = 0.85)"
## [1] "My favorite tree is #5 [Training: HR = 0.91, FAR = 0.24], [Testing: HR = 0.79, FAR = 0.25]"

Cue accuracy statistics: cue.accuracies

You can obtain marginal cue accuracy statistics from the cue.accuracies dataframe. Each row describes one cue on its own: for each cue, the threshold that maximizes the v-statistic (HR - FAR) is chosen, and the reported statistics are those obtained at that threshold.

heart.fft$cue.accuracies
##    cue.name cue.class      level.threshold level.sigdirection hi.train
## 1       age   numeric                53.89                 >=       43
## 2        ca   numeric                    0                  >       41
## 3      chol   numeric               252.32                  >       36
## 4        cp character             np,aa,ta                 !=       50
## 5     exang   numeric                    1                 >=       40
## 6       fbs   numeric                    0                  >       12
## 7   oldpeak   numeric                 0.98                  >       42
## 8   restecg character hypertrophy,abnormal                  =       39
## 9       sex   numeric                    1                 >=       51
## 10    slope character                   up                 !=       52
## 11     thal character               normal                 !=       47
## 12  thalach   numeric               144.32                 <=       36
## 13 trestbps   numeric               138.74                  >       27
##    mi.train fa.train cr.train hr.train far.train    v.train dprime.train
## 1        21       42       43 0.671875 0.4941176 0.17775735    0.2299210
## 2        23       21       64 0.640625 0.2470588 0.39356618    0.5219521
## 3        28       26       59 0.562500 0.3058824 0.25661765    0.3324334
## 4        14       22       63 0.781250 0.2588235 0.52242647    0.7116992
## 5        24       11       74 0.625000 0.1294118 0.49558824    0.7239078
## 6        52       11       74 0.187500 0.1294118 0.05808824    0.1210148
## 7        22       24       61 0.656250 0.2823529 0.37389706    0.4890579
## 8        25       34       51 0.609375 0.4000000 0.20937500    0.2655188
## 9        13       47       38 0.796875 0.5529412 0.24393382    0.3487076
## 10       12       31       54 0.812500 0.3647059 0.44779412    0.6165273
## 11       17       17       68 0.734375 0.2000000 0.53437500    0.7338601
## 12       28       21       64 0.562500 0.2470588 0.31544118    0.4205425
## 13       37       16       69 0.421875 0.1882353 0.23363971    0.3436595
##    correction hr.weight hi.test mi.test fa.test cr.test   hr.test
## 1        0.25       0.5      57      18      29      50 0.7600000
## 2        0.25       0.5      52      23      13      66 0.6933333
## 3        0.25       0.5      60      15      54      25 0.8000000
## 4        0.25       0.5      55      20      17      62 0.7333333
## 5        0.25       0.5      36      39      12      67 0.4800000
## 6        0.25       0.5      65      10      67      12 0.8666667
## 7        0.25       0.5      51      24      21      58 0.6800000
## 8        0.25       0.5      44      31      35      44 0.5866667
## 9        0.25       0.5      63      12      45      34 0.8400000
## 10       0.25       0.5      51      24      27      52 0.6800000
## 11       0.25       0.5      54      21      17      62 0.7200000
## 12       0.25       0.5      48      27      14      65 0.6400000
## 13       0.25       0.5      74       1      74       5 0.9866667
##     far.test     v.test dprime.test
## 1  0.3670886 0.39291139  0.52293838
## 2  0.1645570 0.52877637  0.74061064
## 3  0.6835443 0.11645570  0.18199408
## 4  0.2151899 0.51814346  0.70573386
## 5  0.1518987 0.32810127  0.48908518
## 6  0.8481013 0.01856540  0.04122383
## 7  0.2658228 0.41417722  0.54659740
## 8  0.4430380 0.14362869  0.18112497
## 9  0.5696203 0.27037975  0.40952522
## 10 0.3417722 0.33822785  0.43766510
## 11 0.2151899 0.50481013  0.68569175
## 12 0.1772152 0.46278481  0.64224441
## 13 0.9367089 0.04995781  0.34432180
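
To make the columns above concrete, here is a minimal base R sketch of how hr, far, and v follow from the four counts (hi, mi, fa, cr), checked against the training counts for age in row 1. The helper names `cue_stats()` and `best_threshold()` are hypothetical, not part of the package, and the threshold search is a simplification that only tries the ">" direction on a made-up cue; it is not meant to reproduce how the package itself searches.

```r
# Hit rate, false-alarm rate, and v from the four classification counts
cue_stats <- function(hi, mi, fa, cr) {
  hr <- hi / (hi + mi)   # hits / all cases that truly have the condition
  far <- fa / (fa + cr)  # false alarms / all cases that truly do not
  c(hr = hr, far = far, v = hr - far)
}

# Training counts for age (row 1 of the table above)
cue_stats(hi = 43, mi = 21, fa = 42, cr = 43)

# Hypothetical helper: threshold that maximizes v for a numeric cue,
# predicting "signal" whenever cue > threshold (one direction only)
best_threshold <- function(cue, criterion) {
  candidates <- sort(unique(cue))
  v <- sapply(candidates, function(th) {
    pred <- cue > th
    stats <- cue_stats(
      hi = sum(pred & criterion == 1),
      mi = sum(!pred & criterion == 1),
      fa = sum(pred & criterion == 0),
      cr = sum(!pred & criterion == 0)
    )
    unname(stats["v"])
  })
  candidates[which.max(v)]
}

# Made-up cue values and outcomes, for illustration only
toy_cue <- c(1, 2, 3, 4, 5, 6)
toy_criterion <- c(0, 0, 0, 1, 1, 1)
best_threshold(toy_cue, toy_criterion)
```

The first call gives the same hr.train, far.train, and v.train values as row 1 of the table (0.671875, 0.4941176, and 0.17775735).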

You can also view the cue accuracies in an ROC-type plot with showcues():

showcues(heart.fft, 
         main = "Heartdisease Cue Accuracy")

Tree definitions and accuracy statistics: tree.stats

The tree.stats dataframe contains the definitions and training (and, if test data were supplied, test) statistics for all \(2^{max.levels - 1}\) trees. For our heart.fft example, there are \(2^{4 - 1} = 8\) trees.

heart.fft$tree.stats

You can also use the generic summary() function to get the trees dataframe:

summary(heart.fft)  # Same thing as heart.fft$tree.stats

Tree definitions (exit directions, cue order, and cue thresholds) are contained in columns 1 through 6. Training statistics are contained in columns 7:15 and have the .train suffix. For our heart disease dataset, it looks like tree 2 had the highest training v (HR - FAR) value. Test statistics are contained in columns 16:24 and have the .test suffix. It looks like trees 2 and 6 also had the highest test v (HR - FAR) values.
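
The tree count of \(2^{max.levels - 1}\) can be illustrated by listing every combination of exit directions. This is a sketch under an assumption: that the trees differ only in the exit direction (0 or 1) chosen at each level except the last, which exits in both directions. The column names are hypothetical.

```r
max_levels <- 4  # value used in the heart.fft example

# One row per tree: exit direction at each of the first (max_levels - 1) levels
exits <- expand.grid(rep(list(c(0, 1)), max_levels - 1))
names(exits) <- paste0("level.", seq_len(max_levels - 1))

exits
nrow(exits)  # 2^(4 - 1) = 8 combinations
```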

Area under the curve (AUC): auc

AUC (area under the curve) statistics are in the auc dataframe:

heart.fft$auc
##             fft        lr      cart
## train 0.8848346 0.8611213 0.8533088
## test  0.8462447 0.7973840 0.7390717
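
As a rough illustration of what an AUC value summarizes, the sketch below computes the trapezoidal area under a set of (FAR, HR) points in base R. The function name `roc_auc()` and the example points are hypothetical, and this is not necessarily the computation the package performs.

```r
# Trapezoidal area under a set of (FAR, HR) points
roc_auc <- function(far, hr) {
  # Anchor the curve at (0, 0) and (1, 1), then sort by FAR (ties by HR)
  far <- c(0, far, 1)
  hr <- c(0, hr, 1)
  ord <- order(far, hr)
  far <- far[ord]
  hr <- hr[ord]
  # Sum of trapezoids: width times mean height of neighbouring points
  sum(diff(far) * (head(hr, -1) + tail(hr, -1)) / 2)
}

# Two made-up trees at (FAR = 0.1, HR = 0.6) and (FAR = 0.3, HR = 0.9)
roc_auc(far = c(0.1, 0.3), hr = c(0.6, 0.9))
```

A single point on the diagonal, such as `roc_auc(0.5, 0.5)`, gives 0.5, the value expected from guessing.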

Other information

decision.train, decision.test

The decision.train and decision.test dataframes contain the raw classification decisions of each tree for each training (and test) case.

Here are the decisions of each of the 8 trees for the first 5 training cases:

heart.fft$decision.train[1:5,]
##   tree.1 tree.2 tree.3 tree.4 tree.5 tree.6 tree.7 tree.8
## 1      0      0      0      1      1      1      1      1
## 2      0      0      0      0      1      1      1      1
## 3      0      0      0      0      0      0      0      1
## 4      0      0      0      0      0      0      0      0
## 5      0      0      0      0      0      0      0      0
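
A decision table of this shape can be turned into per-tree accuracy statistics by comparing each column with the true outcomes. The sketch below uses made-up decisions and diagnoses for two trees and six cases, and assumes that 1 codes a "disease" decision; none of these values come from heart.fft.

```r
# Hypothetical decisions of two trees for six cases (1 = predict disease)
decisions <- data.frame(
  tree.1 = c(0, 0, 1, 0, 1, 0),
  tree.2 = c(1, 0, 1, 1, 1, 0)
)
# Hypothetical true diagnoses for the same six cases
truth <- c(1, 0, 1, 0, 1, 1)

# Hit rate: share of true disease cases classified as disease
hr <- sapply(decisions, function(d) mean(d[truth == 1] == 1))
# False-alarm rate: share of healthy cases classified as disease
far <- sapply(decisions, function(d) mean(d[truth == 0] == 1))

rbind(hr = hr, far = far, v = hr - far)
```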

levelout.train, levelout.test

The levelout.train and levelout.test dataframes contain the level at which each case was classified by each tree.

Here are the levels at which the first 5 training cases were classified:

heart.fft$levelout.train[1:5,]
##   tree.1 tree.2 tree.3 tree.4 tree.5 tree.6 tree.7 tree.8
## 1      2      2      3      4      1      1      1      1
## 2      1      1      1      1      3      3      2      2
## 3      1      1      1      1      2      2      3      4
## 4      1      1      1      1      2      2      3      4
## 5      1      1      1      1      2      2      3      4
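
One way to read this table is as a measure of how frugal each tree is. The sketch below re-enters the five rows printed above by hand and averages the exit level per tree; interpreting that average as the mean number of cues looked up per case assumes one cue per level.

```r
# The five rows printed above, re-entered by hand
levelout <- data.frame(
  tree.1 = c(2, 1, 1, 1, 1),
  tree.2 = c(2, 1, 1, 1, 1),
  tree.3 = c(3, 1, 1, 1, 1),
  tree.4 = c(4, 1, 1, 1, 1),
  tree.5 = c(1, 3, 2, 2, 2),
  tree.6 = c(1, 3, 2, 2, 2),
  tree.7 = c(1, 2, 3, 3, 3),
  tree.8 = c(1, 2, 4, 4, 4)
)

# Mean exit level per tree across these five cases
colMeans(levelout)

# How many of the five cases exit at each level of tree 8
table(factor(levelout$tree.8, levels = 1:4))
```

With only five cases these numbers are illustrative; the same calls applied to the full heart.fft$levelout.train would summarize all training cases.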

Selecting cues

If you want to select specific cues for a tree, just include them in the formula argument.

For example, the following tree heart.as.fft will only consider the cues sex and age:

heart.as.fft <- FFTrees(
  formula = diagnosis ~ age + sex,
  data = heartdisease
)
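
To see which cues a formula selects, base R's terms() can be used on its own. The sketch below uses a small made-up data frame with a few columns named after heartdisease; it shows how R itself expands formulas, and assumes (without confirming) that FFTrees() reads them the same way.

```r
# Hypothetical four-row data frame with a subset of heartdisease-like columns
toy <- data.frame(
  diagnosis = c(0, 1, 1, 0),
  age = c(63, 67, 37, 41),
  sex = c(1, 1, 1, 0),
  chol = c(233, 286, 250, 204)
)

# Cues named explicitly: only age and sex are selected
attr(terms(diagnosis ~ age + sex, data = toy), "term.labels")

# The dot expands to every column except the dependent variable
attr(terms(diagnosis ~ ., data = toy), "term.labels")
```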

Plotting trees

Once you’ve created an FFTrees object using FFTrees(), you can visualize a tree (and ROC curves) using plot(). The following code will visualize the best training tree (tree 2) applied to the test data:

plot(heart.fft,
     main = "Heart Disease",
     decision.names = c("Healthy", "Disease")
     )

For more details, see the vignette on plot.fft: vignette("fft_plot", package = "fft").

Additional arguments

The FFTrees() function has several additional arguments that change how trees are built. Note: Not all of these arguments have been fully tested yet!