--- title: "Custom Predict and Model Functions" author: "Patrick Schratz" # date: "June 10 2017" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Custom Predict and Model Functions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Introduction {sperrorest} is a generic framework which aims to work with all R models/packages. In statistical learning, model setups, their formulas and error measures all depend on the family of the response variable. Various families exist (numeric, binary, multiclass) which again include sub-families (e.g. gaussian or poisson distribution of a numeric response). This detail needs to be specified via the respective function, e.g. when using `glm()` with a binary response, one needs to set `family = "binomial"` to make sure that the model does something meaningful. Most of the time, the same applies to the generic `predict()` function. For the `glm()` case, one would need to set `type = "response"` if the predicted values should reflect probabilities instead of log-odds. These settings can be specified using `model_args` and `pred_args` in `sperrorest()`. So fine, "why do we need to write all these wrappers and custom model/predict functions then?!" ## User-defined Model Functions ### Problem `model_fun` expects at least formula argument and a data.frame with the learning sample. All arguments, including the additional ones provided via `model_args`, are getting passed to `model_fun` via a `do.call()` call. However, if `model_fun` does not have an argument named `formula` but e.g. `fixed` (like it is the case for `glmmPQL()`) the `do.call()` call will fail because `sperrorest()` tries to pass an argument named `formula` but `glmmPQL` expects an argument named `fixed`. ### Solution In this case, we need to write a wrapper function for `glmmPQL` (named `glmmPQL_modelfun` here) which accounts for this naming problem. Here, we are passing the `formula` argument to our custom model function which then does the actual call to `glmmPQL()` using the supplied `formula` object as the `fixed` argument of `glmmPQL`. By default, `glmmPQL()` has further arguments like `family` or `random`. If we want to use these, we pass them to `model_args` which then appends these to the arguments of `glmmPQL_modelfun`. ```{r} glmmPQL_modelfun <- function(formula = NULL, data = NULL, random = NULL, family = NULL) { fit <- glmmPQL(fixed = formula, data = data, random = random, family = family) return(fit) } ``` ## User-defined Predict Functions ### Problem Unless specified explicitly, `sperrorest()` tries to use the generic `predict()` function. This function works differently depending on the class of the provided fitted model, i.e. many models slightly differ in the naming (and availability) of their arguments. For example, when fitting a Support Vector Machine (SVM) with a binary response variable, package `kernlab` expects an argument `type = "probabilities"` in its `predict()` call to receive predicted probabilities while in package `e1071` it is `"probability = TRUE"`. Similar to `model_args`, this can be accounted for in the `pred_args` of `sperrorest()`. However, `sperrorest()` expects that the predicted values (of any response type) are stored directly in the returned object of the `predict()` function. While this is the case for many models, mainly with a numeric response, classification cases often behave differently. Here, the predicted values (classes in this case) are often stored in a sub-object named `class` or `predicted`. ### Solution Since there is no way to account for this in a general way (when every package may return the predicted values in a different format/column), we need to account for it by providing a custom predict function which returns only the predicted values so that `sperrorest()` can continue properly. This time we are showing two examples. The first takes again a binary classification using `randomForest`. #### randomForest When calling predict on a fitted `randomForest` model with a binary response variable, the predicted values are actually stored in the resulting object returned by `predict()` (here called `pred`). So why do we have trouble here then? Simply because `pred` is a matrix containing both probabilities for the `FALSE` (= 0) and `TRUE` (= 1) case. `sperrorest()` needs a vector containing only the predicted values of the `TRUE` case to pass these further onto `err_fun()` which then takes care of calculating all the error measures. So the important part is to subset the resulting matrix in the `pred` object to `TRUE` cases only and return the result. ```{r, eval = FALSE} rf_predfun <- function(object = NULL, newdata = NULL, type = NULL) { pred <- predict(object = object, newdata = newdata, type = type) pred <- pred[, 2] } ``` #### svm The same case (binary response) using `svm` from the `e1071` package. Here, the predicted probabilities are stored in a sub-object of `pred`. We can address it using the `attr()` function. Then again, we only need the `TRUE` cases for `sperrorest()`. ```{r} svm_predfun <- function(object = NULL, newdata = NULL, probability = NULL) { pred <- predict(object, newdata = newdata, probability = TRUE) pred <- attr(pred, "probabilities")[, 2] } ```