FeatureFinder
In featurefinder: Feature Finder

FeatureFinder is designed to give comprehensive and accurate sets of features which can be used in modelling, either to build a new model or to enhance and diagnose an existing model. Both methods are available through a single function, FindFeatures. The following give examples for each method.

Identifying Features for a model target

If you have not yet built a model, findFeatures can be used to identify promising features. A typical modelling scenario involves a table consisting of a set of predictors ${x_i}$ and a model target $y$.

Identifying Features for a model residual

If you already have a model, and want to find features to further improve it, findFeatures can be (repeatedly) run using residuals. A typical modelling scenario involves a table consisting of a set of predictors ${x_i}$ and a model residual $r = y - p$, where $p$ is a prediction from a previously-fit model and $y$ is the model target.

Scanning over partitions of the data

The function generates a decision tree for the entire table, as well as decision trees for every possible subset of the table. Subsets are defined using factor-valued columns in the data. Fators can either be user-defined or already included as predictors. The more factors that are created in the data, the more partitions that will be tested and so it is helpful to create a comprehensive set of factors for this purpose. One easy way to create factors is to bin each predictor into 10 bands and create a factor for each. Factor labels should be prefixed with a string character so that they are interpreted as factors and not numerics, for example "s1", "s2" rather than "1","2" labels.

The feature-finding process

For the case where no model has been fitted yet, we simply define $r=y$, and for the case when a model has been fitted already we use the residual $r = y - p$. We supply a single table consisting of all predictors together with $residual=r, actual=y,expected=p$ and call the findFeatures function.

Each decision tree will consist of an rpart tree as shown in the following example. Leaves are labelled with a residual and a leaf volume $n$. Nodes are also labelled with the cut-rule for the node, and these are used to identify the leaves. The code scans each leaf for cases with sufficiently high volume and a residual value which exceeds a user-specified threshold. These parameters are outlined in help(findFeatures). When a leaf meeting the criteria is found, it is printed in the txt file for the partition being scanned. These text files can then be manually or automatically parsed and included in models as required. Often a leaf will be a clue rather than the final form of a feature, and so manual inspection can be of assistance.

knitr::include_graphics("./vignettefigures/vignettefigure.png")

The summary of residual nodes according to user-specified criteria for residual value and leaf volume will be generated in txt files (for example treesAll.txt and allfactors\.[partitionvariable]factor.txt). These contain a summary of each significant term with its definition, volume and other parameters as shown:

Partition variable (categorical factor)
The level of the partition variable
Residual
Leaf volume vs partition volume (percentage)
Leaf volume
Partition volume
Expected value average in leaf
Actual value average in leaf
Residual (checkval)
Leaf definition within this partition of the partition variable

In the examples, partitioning enables significant leaves to be found for each partition, although the full dataset does not yield leaves in the fitted tree. This illustrates the benefits of the partitioning technique.

Standard dataset

Once features are found, they can be customised and added to the model as shown:

library(featurefinder)
# mpg cars example
data(mpgdata)
data=mpgdata

# define some categorical factors here, for use in partition scanning. Define as many as desired.
data$transfactor=paste("trans",as.matrix(data$trans),sep="")
data$transfactor=as.factor(data$transfactor)

# define data dimensions
n=dim(data)[1] # total dimension
nn=floor(dim(data)[1]/2) # split point for training and test
data=as.data.frame(data)
nm=names(data)
nm[8]='y' ## select a column to be the target of the model
names(data)=nm

data0=data # retain full dataset
data=data[c("manufacturer","displ","year","transfactor","y") ] # select a subset for our first model

firstmodel=lm(formula=y ~ .,data=data)
expected=predict(firstmodel,data)
actual=data$y
residual=actual-expected
summary(firstmodel)

# drop terms that are not significant and refit model
data$manufacturerchevrolet=(data$manufacturer=='chevrolet')
data$manufacturerford=(data$manufacturer=='ford')
data$manufacturerhonda=(data$manufacturer=='honda')
data$manufacturernissan=(data$manufacturer=='nissan')
data$manufacturerpontiac=(data$manufacturer=='pontiac')
data$manufacturertoyota=(data$manufacturer=='toyota')
data$manufacturervolkswagen=(data$manufacturer=='volkswagen')
#data$displ
data$transfactortransautol4=(data$transfactor=='transauto(l4)')
data$transfactortransautol5=(data$transfactor=='transauto(l5)')
firstmodel=lm(formula=y ~ manufacturerchevrolet+
                manufacturerford+
                manufacturerhonda+
                manufacturernissan+
                manufacturerpontiac+
                manufacturertoyota+
                manufacturervolkswagen+
                displ+
                year
                #transfactortransautol4
                #transfactortransautol5
               , data=data)
expected=predict(firstmodel,data)
actual=data$y
residual=actual-expected
summary(firstmodel)

CSVPath=tempdir()
data1=cbind(data0,expected, actual, residual)
fcsv=paste(CSVPath,"/mpgdata.csv",sep="")
write.csv(data1[(nn+1):(length(data1$y)),],file=fcsv,row.names=FALSE)

exclusionVars="\"residual\",\"expected\", \"actual\",\"y\""
factorToNumericList=c()

# Now the dataset is prepared, try to find new features
tempDir=findFeatures(outputPath="NoPath", fcsv, exclusionVars,factorToNumericList,                     
         treeGenerationMinBucket=20,
         treeSummaryMinBucket=30,
         useSubDir=FALSE,
         tempDirFolderName="mpg")  

# potential terms identified in residual scan
# RESIDUAL: ALL,ALL,0.575,34.2,40,117,16,16.6,0.575,model< 16.5 and hwy< 28.5 and 
#                                                   manufacturer=jeep,lincoln,mercury,nissan,pontiac,subaru
# RESIDUAL: fl,r,1.11,38.4,33,86,20.1,21.2,1.11,hwy>=26.5
# RESIDUAL: class,compact,0.816,100,32,32,0,0,0,NA and root
# RESIDUAL: manufacturer,toyota,7.14e-13,100,34,34,0,0,0,NA and root
# RESIDUAL: trans,manual(m5),0.566,100,36,36,0,0,0,NA and root
# RESIDUAL: transfactor,transmanual(m5),0.566,100,36,36,0,0,0,NA and root

# add terms to dataset and refit
data$hwy=data0$hwy
data$fl=data0$fl
data$model=as.numeric(as.factor(data0$model))
data$model16hwy28manufacturer=(data$model< 16.5) & (data$hwy< 28.5)&(data$manufacturer=="jeep"|data$manufacturer=="lincoln"|data$manufacturer=="mercury"|data$manufacturer=="nissan"|data$manufacturer=="pontiac"|data$manufacturer=="subaru")
data$flr_hwy26=(data$fl=="r") & (data$hwy>=26.5)
data$transfactortransmanualm5=(data$transfactor=='transmanual(m5)')
data$manufacturertoyota=(data$manufacturer=='toyota')
data$classcompact=(data0$class=='compact')
data$flr=(data$fl=='r')
secondmodel=lm(formula=y ~ manufacturerchevrolet+
                            manufacturerford+
                            manufacturerhonda+
                            manufacturernissan+
                            manufacturerpontiac+
                            manufacturertoyota+
                            manufacturervolkswagen+
                            displ+
                            year+
                            # new terms
                            #model16hwy28manufacturer+
                            flr_hwy26+
                            transfactortransmanualm5+
                            manufacturertoyota
                            #classcompact+
                            #flr
               , data=data)
expected=predict(secondmodel,data)

summary(firstmodel)
summary(secondmodel)
# Append new features from the scan to a dataframe automatically
dataWithNewFeatures = addFeatures(df=data0, path=tempDir, prefix="auto_")
head(dataWithNewFeatures)

# https://vincentarelbundock.github.io/Rdatasets/datasets.html
# http://www.public.iastate.edu/~hofmann/data_in_r_sortable.html

# move example to a subfolder 
unlink("mpgexample", recursive=TRUE)
dir.create("mpgexample", showWarnings = FALSE)
file.copy(Sys.glob("./*.Rdata"), "./mpgexample/")
file.copy(Sys.glob("./*.png"), "./mpgexample/", recursive=TRUE)
file.copy(Sys.glob("./*.txt"), "./mpgexample/", recursive=TRUE)
unlink(Sys.glob("./*.Rdata"), recursive=FALSE)
unlink(Sys.glob("./*.png"), recursive=FALSE)
unlink(Sys.glob("./*.txt"), recursive=FALSE)

The adjusted R-squared has improved from 0.721 to 0.745 as a result of the newly found features.

A more challenging dataset

A more challenging dataset is stock index data, where the target is to predict future relative movements of two indices, DAX and SMI. FindFeatures is able to identify features which can be added to the model as shown:

library(featurefinder)
data(futuresdata)
data=futuresdata
data$SMIfactor=paste("smi",as.matrix(data$SMIfactor),sep="")
n=length(data$DAX)
nn=floor(length(data$DAX)/2)

# Can we predict the relative movement of DAX and SMI?
data$y=data$DAX*0 # initialise the target to 0
data$y[1:(n-1)]=((data$DAX[2:n])-(data$DAX[1:(n-1)]))/
  (data$DAX[1:(n-1)])-(data$SMI[2:n]-(data$SMI[1:(n-1)]))/(data$SMI[1:(n-1)])

# Fit a simple model
firstmodel=lm(formula=y ~ DAX+SMI+
                            #CAC+
                            FTSE,
                            #SMIfactorsmi1
                            data=data)
expected=predict(firstmodel,data)
actual=data$y
residual=actual-expected
data0=data
data=cbind(data,expected, actual, residual)

CSVPath=tempdir()
fcsv=paste(CSVPath,"/futuresdata.csv",sep="")
write.csv(data[(nn+1):(length(data$y)),],file=fcsv,row.names=FALSE)

exclusionVars="\"residual\",\"expected\", \"actual\",\"y\""
factorToNumericList=c()

# Now the dataset is prepared, try to find new features
tempDir=findFeatures(outputPath="NoPath", fcsv, exclusionVars,factorToNumericList,                     
         treeGenerationMinBucket=30,
         treeSummaryMinBucket=50,
         useSubDir=FALSE,
         tempDirFolderName="futures")  

newfeat1=((data$SMIfactor=="smi0") & (data$CAC < 2253) & (data$CAC< 1998) & (data$CAC>=1882)) * 1.0
newfeat2=((data$SMIfactor=="smi1") & (data$SMI < 7837) & (data$SMI >= 7499)) * 1.0
newfeatures=cbind(newfeat1, newfeat2) # create columns for the newly found features
datanew=cbind(data0,newfeatures)

secondmodel=lm(formula=y ~ DAX+SMI+
                           #CAC+
                           FTSE+
                           #SMIfactorsmi1+
                           newfeat1+newfeat2,
                data=datanew[,])
expectednew=predict(secondmodel,datanew)

require(Metrics)
OriginalRMSE = rmse(data$y,expected)
NewRMSE = rmse(data$y,expectednew)

print(paste("OriginalRMSE = ",OriginalRMSE))
print(paste("NewRMSE = ",NewRMSE))


summary(firstmodel)
summary(secondmodel)

# Append new features from the scan to a dataframe automatically
dataWithNewFeatures = addFeatures(df=data0, path=tempDir, prefix="auto_")
head(dataWithNewFeatures)

The newly discovered features are statistically significant, with improved adjusted R-squared, lower residual errors and improved F-statistic and p-value.

# move example to a subfolder 
unlink("daxexample", recursive=TRUE)
dir.create("daxexample", showWarnings = FALSE)
file.copy(Sys.glob("./*.Rdata"), "./daxexample/")
file.copy(Sys.glob("./*.png"), "./daxexample/", recursive=TRUE)
file.copy(Sys.glob("./*.txt"), "./daxexample/", recursive=TRUE)
unlink(Sys.glob("./*.Rdata"), recursive=FALSE)
unlink(Sys.glob("./*.png"), recursive=FALSE)
unlink(Sys.glob("./*.txt"), recursive=FALSE)