The simfft() function simulates multiple cross-validations of FFTs created with the fft() function.
Let’s start with an example: we’ll create FFTs fitted to the breastcancer dataset. Here’s how the dataset looks:
head(breastcancer)

## thickness cellsize.unif cellshape.unif adhesion epithelial nuclei.bare
## 1 5 1 1 1 2 1
## 2 5 4 4 5 7 10
## 3 3 1 1 1 2 2
## 4 6 8 8 1 3 4
## 5 4 1 1 3 2 1
## 6 8 10 10 8 7 10
## chromatin nucleoli mitoses diagnosis
## 1 3 1 1 B
## 2 3 2 1 B
## 3 3 1 1 B
## 4 3 7 1 B
## 5 3 1 1 B
## 6 9 7 1 M
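Before fitting anything, it can help to glance at the base rate of the criterion we’ll be predicting. Here’s a quick sketch using base R (just a sanity check, not part of the simulation itself):

table(breastcancer$diagnosis)       # counts of benign (B) vs. malignant (M) cases
mean(breastcancer$diagnosis == "M") # base rate of the criterion we will predict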
We’ll create a new simulation object called bcancer.fft.sim using the simfft() function. We’ll set the criterion to breastcancer$diagnosis == "M" and use all other columns (breastcancer[, names(breastcancer) != "diagnosis"]) as potential predictors. Additionally, we’ll define two parameters:
train.p = .1: Train the trees on a random sample of 10% of the original dataset, and test the trees on the remaining 90% of cases.

sim.n = 10: Do 10 simulations.

set.seed(100) # For reproducibility
bcancer.fft.sim <- simfft(
train.cue.df = breastcancer[,names(breastcancer) != "diagnosis"],
train.criterion.v = breastcancer$diagnosis == "M",
train.p = .1,
sim.n = 10
)

The function will return a dataframe with fitting and test results for each simulation:
bcancer.fft.sim

## train.p sim fft.hr.train fft.far.train fft.hr.test fft.far.test
## 1 0.1 1 0.9523810 0.04255319 0.9220183 0.04534005
## 2 0.1 2 0.9545455 0.04347826 0.7926267 0.02261307
## 3 0.1 3 1.0000000 0.05000000 0.9573460 0.10396040
## 4 0.1 4 0.9523810 0.04255319 0.9587156 0.06045340
## 5 0.1 5 1.0000000 0.02040816 0.8636364 0.05063291
## 6 0.1 6 1.0000000 0.04444444 0.9722222 0.08020050
## 7 0.1 7 1.0000000 0.00000000 0.6711712 0.01526718
## 8 0.1 8 1.0000000 0.06976744 0.9953271 0.18703242
## 9 0.1 9 1.0000000 0.04255319 0.9954128 0.16372796
## 10 0.1 10 0.9583333 0.06818182 0.9395349 0.06250000
## fft.level.class fft.level.name fft.level.exit
## 1 integer;integer cellsize.unif;cellshape.unif 1;0.5
## 2 integer;integer cellshape.unif;epithelial 0;0.5
## 3 integer;integer cellsize.unif;cellshape.unif 1;0.5
## 4 integer;integer cellsize.unif;cellshape.unif 0;0.5
## 5 integer;integer epithelial;cellsize.unif 0;0.5
## 6 numeric;integer nuclei.bare;cellshape.unif 1;0.5
## 7 integer;integer cellshape.unif;cellsize.unif 0;0.5
## 8 integer;numeric cellsize.unif;nuclei.bare 1;0.5
## 9 integer;numeric epithelial;nuclei.bare 1;0.5
## 10 numeric;integer nuclei.bare;chromatin 1;0.5
## fft.level.threshold fft.level.sigdirection lr.hr.train lr.far.train
## 1 4;3 >=;> 1 0
## 2 3;2 >;> 1 0
## 3 3;4 >=;>= 1 0
## 4 1;2 >;> 1 0
## 5 3;3 >=;>= 1 0
## 6 4;4 >=;>= 1 0
## 7 4;3 >;> 1 0
## 8 2;2 >;>= 1 0
## 9 2;2 >;> 1 0
## 10 5;3 >=;> 1 0
## lr.hr.test lr.far.test cart.hr.train cart.far.train cart.hr.test
## 1 0.7844037 0.05289673 0.9523810 0.04255319 0.8348624
## 2 0.8341014 0.02010050 1.0000000 0.08695652 0.8525346
## 3 0.9241706 0.04950495 1.0000000 0.05000000 0.9431280
## 4 0.8761468 0.04534005 1.0000000 0.08510638 0.9816514
## 5 0.9181818 0.03291139 1.0000000 0.06122449 0.9454545
## 6 0.7500000 0.03759398 0.9565217 0.04444444 0.8333333
## 7 0.8603604 0.02290076 1.0000000 0.00000000 0.7162162
## 8 0.9345794 0.05985037 0.9200000 0.02325581 0.9532710
## 9 0.7706422 0.03778338 0.9523810 0.02127660 0.9036697
## 10 0.8232558 0.03250000 0.8750000 0.04545455 0.7767442
## cart.far.test
## 1 0.02267003
## 2 0.03768844
## 3 0.08910891
## 4 0.17884131
## 5 0.08860759
## 6 0.02255639
## 7 0.01781170
## 8 0.09226933
## 9 0.11335013
## 10 0.03500000
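One simple way to summarise this data frame is to average the hit rates (HR) and false-alarm rates (FAR) across simulations. Here’s a minimal sketch using base R and the FFT columns named above:

# Mean fitting and prediction performance of the FFTs across the 10 simulations
colMeans(bcancer.fft.sim[, c("fft.hr.train", "fft.far.train",
                             "fft.hr.test", "fft.far.test")])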
You can plot the results using the simfftplot() function.
If you set roc = F, you’ll see a bar chart showing how often each of the possible cues was used in the simulated trees. This gives you an indication of how important each cue is. For example, if a cue is used in >95% of simulations, this suggests that the cue is a consistently good predictor of the criterion across a wide range of training samples.
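If you want the raw counts behind such a plot, you can also tally cue usage directly from the simulation output. Here’s a rough sketch, assuming (as in the printout above) that fft.level.name stores each tree’s cues as a ";"-separated string:

# Count how often each cue appears across the 10 simulated trees
cue.usage <- table(unlist(strsplit(as.character(bcancer.fft.sim$fft.level.name), ";")))
sort(cue.usage, decreasing = TRUE)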
simfftplot(bcancer.fft.sim,
roc = F
)

If you set roc = T, you’ll see a distribution of hit rates and false-alarm rates for the trees across all simulations. You can also specify which data (training or test) to display with the which.data argument.
Here, we can see the distribution of HR and FAR for the training data:
simfftplot(bcancer.fft.sim,
roc = T,
which.data = "train"
)

Now let’s do the testing data. We should expect the trees to do a bit worse here:
simfftplot(bcancer.fft.sim,
roc = T,
which.data = "test"
)

You can add curves for CART and logistic regression (LR) by including the arguments cart = T and lr = T. Let’s look at the performance of CART and LR compared to the trees for the training data:
simfftplot(bcancer.fft.sim,
roc = T,
lr = T,
cart = T,
which.data = "train"
)

It looks like LR dominated both CART and FFTs for the training data (in fact, for this simulation, LR always gave a perfect fit). Now let’s look at the test data:
simfftplot(bcancer.fft.sim,
roc = T,
lr = T,
cart = T,
which.data = "test"
)

Here, we can see that for the testing data, all three algorithms performed similarly well.
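To put a number on that impression, you could again average the test columns across simulations. Here’s a minimal sketch with base R, using the column names from the simulation data frame above:

# Mean test hit rates and false-alarm rates for FFTs, logistic regression, and CART
colMeans(bcancer.fft.sim[, c("fft.hr.test", "fft.far.test",
                             "lr.hr.test", "lr.far.test",
                             "cart.hr.test", "cart.far.test")])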