The FFTrees() function is at the heart of the FFTrees package. It takes a training dataset as an argument and generates several FFTs (more details about the algorithms coming soon…).
Let’s start with an example: we’ll create FFTs fitted to the heartdisease dataset. This dataset contains data from 303 patients suspected of having heart disease. Here’s how the dataset looks:
head(heartdisease)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 1 63 1 ta 145 233 1 hypertrophy 150 0 2.3 down 0
## 2 67 1 a 160 286 0 hypertrophy 108 1 1.5 flat 3
## 3 67 1 a 120 229 0 hypertrophy 129 1 2.6 flat 2
## 4 37 1 np 130 250 0 normal 187 0 3.5 down 0
## 5 41 0 aa 130 204 0 hypertrophy 172 0 1.4 up 0
## 6 56 1 aa 120 236 0 normal 178 0 0.8 up 0
## thal diagnosis
## 1 fd 0
## 2 normal 1
## 3 rd 1
## 4 normal 0
## 5 normal 0
## 6 normal 0
The critical dependent variable is diagnosis, which indicates whether a patient has heart disease (1) or not (0). The other variables in the dataset (e.g., sex, age, and several biological measurements) will be used as predictors.
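As a quick sanity check (this snippet isn’t part of the original walkthrough), we can confirm the sample size and the base rate of the outcome:
nrow(heartdisease)             # 303 patients
table(heartdisease$diagnosis)  # counts of 0 (no disease) vs. 1 (disease)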
Now we’ll split the original dataset into a training dataset and a testing dataset. We will create the trees with the training set, then test their performance on the test set:
set.seed(100)
# Randomly assign each case to the training (TRUE) or test (FALSE) set
samples <- sample(c(TRUE, FALSE), size = nrow(heartdisease), replace = TRUE)
heartdisease.train <- heartdisease[samples, ]
heartdisease.test <- heartdisease[!samples, ]
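As an added check, the resulting sample sizes should match the counts reported by the print method further below (149 training and 154 test cases):
nrow(heartdisease.train)  # 149
nrow(heartdisease.test)   # 154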
We’ll create a new FFTrees object called heart.fft using the FFTrees() function. We’ll specify diagnosis as the (binary) dependent variable, and include all independent variables with formula = diagnosis ~ .:
heart.fft <- FFTrees(formula = diagnosis ~ .,
                     data = heartdisease.train,
                     data.test = heartdisease.test)
As you can see, FFTrees() returns an object of class FFTrees:
class(heart.fft)
## [1] "FFTrees"
There are many elements in an FFTrees object; here are their names:
names(heart.fft)
## [1] "formula" "data.train" "data.test" "cue.accuracies"
## [5] "tree.stats" "lr.stats" "cart.stats" "auc"
## [9] "lr.model" "cart.model" "decision.train" "decision.test"
## [13] "levelout.train" "levelout.test"
You can view basic information about the FFTrees object by printing its name. This will give you a quick summary of the object, including how many trees it has, which cues the tree(s) use, and how well they performed.
heart.fft
## [1] "An FFTrees object containing 8 trees using 4 cues {thal,cp,exang,slope} out of an original 13"
## [1] "Data were trained on 149 exemplars, and tested on 154 new exemplars"
## [1] "Trees AUC: (Train = 0.88, Test = 0.85)"
## [1] "My favorite tree is #5 [Training: HR = 0.91, FAR = 0.24], [Testing: HR = 0.79, FAR = 0.25]"
You can obtain marginal cue accuracy statistics from the cue.accuracies dataframe. For each cue, the threshold that maximizes the v-statistic (v = HR - FAR) is chosen, and the dataframe reports the resulting marginal accuracies.
heart.fft$cue.accuracies
## cue.name cue.class level.threshold level.sigdirection hi.train
## 1 age numeric 53.89 >= 43
## 2 ca numeric 0 > 41
## 3 chol numeric 252.32 > 36
## 4 cp character np,aa,ta != 50
## 5 exang numeric 1 >= 40
## 6 fbs numeric 0 > 12
## 7 oldpeak numeric 0.98 > 42
## 8 restecg character hypertrophy,abnormal = 39
## 9 sex numeric 1 >= 51
## 10 slope character up != 52
## 11 thal character normal != 47
## 12 thalach numeric 144.32 <= 36
## 13 trestbps numeric 138.74 > 27
## mi.train fa.train cr.train hr.train far.train v.train dprime.train
## 1 21 42 43 0.671875 0.4941176 0.17775735 0.2299210
## 2 23 21 64 0.640625 0.2470588 0.39356618 0.5219521
## 3 28 26 59 0.562500 0.3058824 0.25661765 0.3324334
## 4 14 22 63 0.781250 0.2588235 0.52242647 0.7116992
## 5 24 11 74 0.625000 0.1294118 0.49558824 0.7239078
## 6 52 11 74 0.187500 0.1294118 0.05808824 0.1210148
## 7 22 24 61 0.656250 0.2823529 0.37389706 0.4890579
## 8 25 34 51 0.609375 0.4000000 0.20937500 0.2655188
## 9 13 47 38 0.796875 0.5529412 0.24393382 0.3487076
## 10 12 31 54 0.812500 0.3647059 0.44779412 0.6165273
## 11 17 17 68 0.734375 0.2000000 0.53437500 0.7338601
## 12 28 21 64 0.562500 0.2470588 0.31544118 0.4205425
## 13 37 16 69 0.421875 0.1882353 0.23363971 0.3436595
## correction hr.weight hi.test mi.test fa.test cr.test hr.test
## 1 0.25 0.5 57 18 29 50 0.7600000
## 2 0.25 0.5 52 23 13 66 0.6933333
## 3 0.25 0.5 60 15 54 25 0.8000000
## 4 0.25 0.5 55 20 17 62 0.7333333
## 5 0.25 0.5 36 39 12 67 0.4800000
## 6 0.25 0.5 65 10 67 12 0.8666667
## 7 0.25 0.5 51 24 21 58 0.6800000
## 8 0.25 0.5 44 31 35 44 0.5866667
## 9 0.25 0.5 63 12 45 34 0.8400000
## 10 0.25 0.5 51 24 27 52 0.6800000
## 11 0.25 0.5 54 21 17 62 0.7200000
## 12 0.25 0.5 48 27 14 65 0.6400000
## 13 0.25 0.5 74 1 74 5 0.9866667
## far.test v.test dprime.test
## 1 0.3670886 0.39291139 0.52293838
## 2 0.1645570 0.52877637 0.74061064
## 3 0.6835443 0.11645570 0.18199408
## 4 0.2151899 0.51814346 0.70573386
## 5 0.1518987 0.32810127 0.48908518
## 6 0.8481013 0.01856540 0.04122383
## 7 0.2658228 0.41417722 0.54659740
## 8 0.4430380 0.14362869 0.18112497
## 9 0.5696203 0.27037975 0.40952522
## 10 0.3417722 0.33822785 0.43766510
## 11 0.2151899 0.50481013 0.68569175
## 12 0.1772152 0.46278481 0.64224441
## 13 0.9367089 0.04995781 0.34432180
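To make the v statistic concrete, here is a minimal sketch (not a package function) that recovers hr.train, far.train, and v.train for the age cue from its four training counts above:
# Training counts for the age cue (row 1): hits, misses, false alarms, correct rejections
hi <- 43; mi <- 21; fa <- 42; cr <- 43
hr  <- hi / (hi + mi)   # hit rate = 43 / 64 = 0.671875
far <- fa / (fa + cr)   # false-alarm rate = 42 / 85 = 0.4941176
hr - far                # v = 0.1777574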
You can also view the cue accuracies in an ROC-type plot with showcues():
showcues(heart.fft,
         main = "Heartdisease Cue Accuracy")
The tree.stats dataframe contains all tree definitions and training (and possibly test) statistics for all \(2^{max.levels - 1}\) trees. For our heart.fft example, there are \(2^{4 - 1} = 8\) trees.
heart.fft$tree.stats
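To see where the \(2^{max.levels - 1}\) count comes from: each of the first max.levels - 1 levels exits in one of two directions, while the final level always has both exits. A small illustration, independent of the package:
# With 4 levels, each of the first 3 exits toward either 0 or 1
exits <- expand.grid(level.1 = 0:1, level.2 = 0:1, level.3 = 0:1)
nrow(exits)  # 8 possible exit structures, hence 8 candidate trees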
You can also use the generic summary() function to get the trees dataframe:
summary(heart.fft)  # Same as heart.fft$tree.stats
Tree definitions (exit directions, cue order, and cue thresholds) are contained in columns 1 through 6. Training statistics are contained in columns 7 through 15 and have the .train suffix. For our heart disease dataset, it looks like tree 2 had the highest training v (HR - FAR) value. Test statistics are contained in columns 16 through 24 and have the .test suffix. It looks like trees 2 and 6 also had the highest test v (HR - FAR) values.
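For example, to inspect the tree definitions alone, you could select just those columns (a sketch, assuming summary() returns the trees dataframe described above):
summary(heart.fft)[, 1:6]  # exit directions, cue order, and cue thresholds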
AUC (area under the curve) statistics are in the auc dataframe:
heart.fft$auc
## fft lr cart
## train 0.8848346 0.8611213 0.8533088
## test 0.8462447 0.7973840 0.7390717
The decision.train and decision.test dataframes contain the raw classification decisions of each tree for each training (and test) case.
Here are each of the 8 tree decisions for the first 5 training cases.
heart.fft$decision.train[1:5,]
## tree.1 tree.2 tree.3 tree.4 tree.5 tree.6 tree.7 tree.8
## 1 0 0 0 1 1 1 1 1
## 2 0 0 0 0 1 1 1 1
## 3 0 0 0 0 0 0 0 1
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
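Because these are raw 0/1 decisions, one per training case, you can score any tree by hand. Here is a sketch (assuming the rows align with heartdisease.train) of the training accuracy of tree 5, the "favorite" tree from the summary above:
# Proportion of training cases where tree 5's decision matches the true diagnosis
mean(heart.fft$decision.train$tree.5 == heartdisease.train$diagnosis)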
The levelout.train and levelout.test dataframes contain the levels at which each case was classified for each tree.
Here are the levels at which the first 5 training cases were classified:
heart.fft$levelout.train[1:5,]
## tree.1 tree.2 tree.3 tree.4 tree.5 tree.6 tree.7 tree.8
## 1 2 2 3 4 1 1 1 1
## 2 1 1 1 1 3 3 2 2
## 3 1 1 1 1 2 2 3 4
## 4 1 1 1 1 2 2 3 4
## 5 1 1 1 1 2 2 3 4
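One use of these levels is to quantify each tree’s frugality. For example, here is an illustrative calculation (not a built-in statistic) of the mean number of cues consulted per case:
colMeans(heart.fft$levelout.train)  # average exit level for each tree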
If you want to select specific cues for a tree, just include them in the formula argument. For example, the following tree heart.as.fft will only consider the cues sex and age:
heart.as.fft <- FFTrees(formula = diagnosis ~ age + sex,
                        data = heartdisease)
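Printing the new object should confirm that only age and sex were considered:
heart.as.fft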
Once you’ve created an FFTrees object using FFTrees(), you can visualize the tree (and ROC curves) using plot(). The following code will visualize the best training tree applied to the test data:
plot(heart.fft,
     main = "Heart Disease",
     decision.names = c("Healthy", "Disease"))
See the vignette on plotting FFTrees objects for more details:
vignette("fft_plot", package = "FFTrees")
The FFTrees() function has several additional arguments that change how trees are built. Note: Not all of these arguments have been fully tested yet!
train.p: What percent of the data should be used for training? train.p = .1 will randomly select 10% of the data for training and leave the remaining 90% for testing. Setting train.p = 1 will fit the trees to the entire dataset (with no testing).
rank.method: As trees are being built, should cues be selected based on their marginal accuracy (rank.method = "m") applied to the entire dataset, or on their conditional accuracy (rank.method = "c") applied to all cases that have not yet been classified? Each method has potential pros and cons: the marginal method is much faster to implement and may be less prone to over-fitting, while the conditional method can capture important conditional dependencies between cues that the marginal method misses. Both arguments are illustrated in the sketch below.
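As a combined illustration of both arguments, here is a sketch of a call using a 50% training split and conditional ranking (the object name heart.cond.fft is just an example, and per the note above this behavior is not fully tested):
heart.cond.fft <- FFTrees(formula = diagnosis ~ .,
                          data = heartdisease,
                          train.p = .5,       # fit on a random 50% of the data
                          rank.method = "c")  # conditional cue ranking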