In this exercise session, we will consider multiple advanced machine learning models. Our base model will not be a penalized logistic regression as in Session 05, but rather a random forest. The models we consider subsequently are also widely used in practice, as their performance on classification tasks is superb. However, similar to random forests, their explainability is still subpar compared to a simple logistic regression or classification tree. Before we learn how to train and finetune these models, we will discuss some theoretical aspects.
6.1.1 Confusion matrices in R
Since we will be working on a classification task in the exercises, being able to construct a confusion matrix is crucial.
Consider the following confusion matrix, computed from an example data set that is part of the {yardstick} library:
           Truth
Prediction  Positive  Negative
  Positive       227        50
  Negative        31       192
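If you want to reproduce this matrix yourself, a minimal sketch using the two_class_example data from {yardstick} could look as follows. Note that relabelling the original Class1/Class2 levels as Positive/Negative is an assumption made purely for this illustration.

library(yardstick)
library(dplyr)

# Relabel the original Class1/Class2 levels as Positive/Negative
# (assumption made for this illustration)
two_class_relabelled <- two_class_example %>%
  mutate(
    truth     = factor(truth,     levels = c("Class1", "Class2"), labels = c("Positive", "Negative")),
    predicted = factor(predicted, levels = c("Class1", "Class2"), labels = c("Positive", "Negative"))
  )

# Cross-tabulate the predicted classes against the truth
cm <- conf_mat(two_class_relabelled, truth = truth, estimate = predicted)
cm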
We can also use the ggplot function to create a visually more appealing version of this matrix. To do so, we first have to convert the confusion matrix into a proper data frame and set the levels of the Prediction and Truth variables.
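A minimal sketch of this conversion, assuming the confusion matrix is stored in an object called cm (as in the sketch above):

cm_tib <- as_tibble(cm$table) %>%
  mutate(
    Prediction = factor(Prediction),
    Truth      = factor(Truth),
    # Reverse the factor levels of Prediction so that "Positive" comes first
    Prediction = factor(Prediction, levels = rev(levels(Prediction)))
  )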
In the snippet above we had to reverse the levels of the variable Prediction so that we can place the TP values in the top left, and TN values in the bottom right of the confusion matrix.
Once the confusion matrix has been converted to a data frame, we can pass it into the ggplot function with the argument fill set to n. The geom_tile() function places a tile at each coordinate provided by the data frame (note: the coordinates are discrete and given by Negative and Positive). The argument colour = "gray50" adds a gray border to each tile. By adding the geom_text() function, where the aesthetics are provided by the label argument, we can add the number of samples falling into each class (TP, TN, FP, and FN) to each tile. The scale_fill_gradient() function allows us to change the colors of the tiles with respect to the value of n. Here, a low value of n means that the tile will be white, and a high value of n means that the tile will be light green (with HEX code "#9AEBA3"). Setting the theme to minimal and removing the legend yields a cleaner representation of the confusion matrix.
cm_tib %>%
  ggplot(aes(x = Prediction, y = Truth, fill = n)) +
  geom_tile(colour = "gray50") +
  geom_text(aes(label = n)) +
  scale_fill_gradient(low = "white", high = "#9AEBA3") +
  theme_minimal() +
  theme(legend.position = "none")
6.1.2 Tuning an XGBoost model
Since we have covered random forests intensively in previous exercises, we only consider an XGBoost model in this introduction. AdaBoost is rarely used in practice anymore, which is why we will directly move towards training an XGBoost model. The approach is similar to training and tuning any other model, but compared to previous exercises we will not perform cross-validation; instead, we use a simple training/validation/test split to save some time.
We can create an XGBoost model by using the boost_tree function. Looking at the documentation, you will notice that there are quite a few parameters for us to consider:
Parameter        Description
trees            Number of trees contained in the ensemble
tree_depth       Integer for the maximum depth of the trees
min_n            Minimum number of data points in a node required for a split
mtry             Number of predictors randomly selected at each split
The parameters above are not new to us. In fact, they are the exact same parameters we use for training a random forest model. That is why we will not go into detail with respect to the ones above.
There are, however, a few new parameters that are worth an explanation:
Parameter        Description
loss_reduction   Number for the reduction in loss that is required to split further
sample_size      Subsample ratio of the training instances
learn_rate       Rate at which the algorithm adapts from iteration to iteration
The three parameters above have only been referenced in the lectures so far, so let’s quickly describe them in a bit more detail.
6.1.2.0.1 loss_reduction:
In Exercise 6.2, we will derive the optimal expansion coefficient (similar to the coefficients in linear regression) which solves the minimization problem

$$(\beta_m, b_m) = \arg\min_{\beta,\, b} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i)\big).$$

Here, $L$ denotes a loss function that we aim to minimize with respect to the expansion coefficient $\beta$ and an additional (potentially weak) learner $b$ that we add to the previous estimator $f_{m-1}$.

If the term

$$\sum_{i=1}^{N} L\big(y_i,\, f_{m-1}(x_i)\big) - \sum_{i=1}^{N} L\big(y_i,\, f_m(x_i)\big),$$

i.e., the loss reduction between step $m-1$ and step $m$, is smaller than the parameter loss_reduction, the algorithm stops.
6.1.2.0.2 sample_size:
Let $s \in (0, 1]$ denote the sample_size parameter and $N$ the number of samples in our training data. Then, XGBoost randomly selects $s \cdot N$ samples prior to growing the trees. This subsampling occurs once in every boosting iteration.
6.1.2.0.3 learn_rate:
In simple terms, the learning rate specifies how quickly the model adapts to the training data. An analogy can be drawn to gradient based models that use gradient descent on a loss function. Here, the goal is to minimize the loss function by stepping towards its minimum. To illustrate the learning rate in a gradient descent context, consider the following examples where we can imagine the polynomial of degree four to be a loss function that we try to minimize.
Choosing a learning rate that is too high might result in missing an optimal model because it is stepped over.
Conversely, choosing a learning rate that is too low might result in the global minimum never being reached at all.
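As a purely illustrative sketch (not code from the lecture), the effect of the learning rate can be reproduced with a few lines of gradient descent on a hypothetical quartic loss; the function, starting point, and learning rates below are made up for this illustration.

# f(x) = x^4 - 3x^2 + x has a local minimum near x = 1.13 and its global
# minimum near x = -1.30
f      <- function(x) x^4 - 3 * x^2 + x
grad_f <- function(x) 4 * x^3 - 6 * x + 1

gradient_descent <- function(x0, learn_rate, n_steps = 200) {
  x <- x0
  for (i in seq_len(n_steps)) {
    x <- x - learn_rate * grad_f(x)   # step against the gradient
  }
  x
}

gradient_descent(x0 = 2, learn_rate = 0.0001, n_steps = 50)  # too small: barely moves away from the start
gradient_descent(x0 = 2, learn_rate = 0.01)                  # moderate: converges to the nearby (local) minimum
gradient_descent(x0 = 2, learn_rate = 0.20)                  # too large: the iterates overshoot and blow up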
The learning rate in the XGBoost algorithm is a factor that scales the output of the most recently fitted tree before it is added to the model. In simple terms, it acts as a shrinkage parameter.
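To tie the parameter overview together, a model specification for the regression example below could look roughly like this. This is only a sketch: which parameters are fixed and which are tuned is an illustrative choice, not necessarily the setup used later.

library(tidymodels)

xgb_spec <- boost_tree(
  trees          = 1000,     # number of trees in the ensemble
  tree_depth     = tune(),   # maximum depth of each tree
  min_n          = tune(),   # minimum number of data points in a node to allow a split
  mtry           = tune(),   # number of predictors sampled at each split
  loss_reduction = tune(),   # minimum loss reduction required to split further
  sample_size    = 0.8,      # subsample ratio of the training instances
  learn_rate     = tune()    # shrinkage applied to each new tree
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")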
In the following example, we will try to predict the base rent prices in Munich using an XGBoost model. The data Apartment rental offers in Germany is the same as in Exercise 04.
Instead of using a cross validation approach, we will use a simple training/validation/test split to reduce computing time.
By using the validation_split() function on the training data, we split the training data into a training and validation subset. The resulting data_val object can then be passed into the tune_grid function in the same fashion as a cross-validation object.
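A sketch of this step, assuming the training data, the recipe, and the model specification are stored in objects called data_train, rent_recipe, and xgb_spec (these names are placeholders, not necessarily the ones used in the original solution):

library(tidymodels)

set.seed(123)
data_val <- validation_split(data_train, prop = 0.8)  # 80% training / 20% validation

xgb_wf <- workflow() %>%
  add_recipe(rent_recipe) %>%
  add_model(xgb_spec)

xgb_tune_res <- tune_grid(
  xgb_wf,
  resamples = data_val,               # used just like a cross-validation object
  grid      = 20,
  metrics   = metric_set(rmse, rsq)
)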
After tuning the model parameters, we use the optimal candidate hyperparameters to train a final model on all the training data and evaluate it on the test data.
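One way this final step could look, again with the placeholder names from above and assuming the initial training/test split is stored in data_split:

# Select the best hyperparameter candidates (here with respect to the RMSE)
best_params <- select_best(xgb_tune_res, metric = "rmse")

# Refit on the whole training data and evaluate once on the test data
final_xgb_wf <- finalize_workflow(xgb_wf, best_params)
last_xgb_fit <- last_fit(final_xgb_wf, split = data_split)

collect_metrics(last_xgb_fit)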
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 543. Preprocessor1_Model1
2 rsq standard 0.723 Preprocessor1_Model1
In Exercise 04, where we performed the same regression task with a decision tree, the OOS performance was substantially worse:
Metric    Estimate
RMSE      622
R²        0.616
We can, therefore, conclude that the XGBoost model is more suitable for estimating the base rent for rental apartments in Munich.
6.2 Exercises
6.2.1 Theoretical exercises
Exercise 6.1 Explain in your own words the difference between Boosting (Trees), Bagging (Trees), and Random Forests.
Exercise 6.2 On slide 89 in the lecture notes, the AdaBoost algorithm is stated as follows.
Figure 6.1: AdaBoost Algorithm
In the third line of Figure 6.1, the scaling coefficients at step $m$ are set to

$$\beta_m = \frac{1}{2} \log\left(\frac{1 - \text{err}_m}{\text{err}_m}\right).$$
The goal of this exercise is to figure out why the scaling coefficients are defined that way. The essence of this derivation lies in the more general idea of boosting, where the minimization problem at step $m$ is given by (cf. Slide 91)

$$(\beta_m, G_m) = \arg\min_{\beta,\, G} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, G(x_i)\big). \tag{6.1}$$
For AdaBoost, the loss function is defined by

$$L\big(y, f(x)\big) = \exp\big(-y\, f(x)\big).$$
By minimizing Equation 6.1 with respect to $\beta$, we obtain the desired coefficient.
Show that

$$(\beta_m, G_m) = \arg\min_{\beta,\, G} \sum_{i=1}^{N} w_i^{(m)} \exp\big(-\beta\, y_i\, G(x_i)\big), \qquad \text{where } w_i^{(m)} := \exp\big(-y_i\, f_{m-1}(x_i)\big). \tag{6.3}$$
Show that the objective function of the right hand side of Equation 6.3 can be expressed as

$$e^{-\beta} \sum_{i:\, y_i = G(x_i)} w_i^{(m)} + e^{\beta} \sum_{i:\, y_i \neq G(x_i)} w_i^{(m)}. \tag{6.5}$$
Finally, plug the objective function Equation 6.5 into Equation 6.3 and show that minimizing the loss function for $\beta$ yields

$$\beta_m = \frac{1}{2} \log\left(\frac{1 - \text{err}_m}{\text{err}_m}\right),$$

where

$$\text{err}_m = \frac{\sum_{i=1}^{N} w_i^{(m)}\, \mathbb{1}\{y_i \neq G_m(x_i)\}}{\sum_{i=1}^{N} w_i^{(m)}}.$$
Hint: You can assume that the candidate for $\beta_m$ is indeed a minimizer.
6.2.2 Programming Exercises
The following exercise is similar to Exercise 5.3.2. However, instead of fitting a penalized logistic regression and a classification tree, we fit an XGBoost and a LightGBM model on the credit card data.
The dataset we will consider in this exercise will be the Credit Card Customers data set that we already used in previous exercises. You can either download it again using the provided link or the button below.
Recall that the data set consists of 10,127 entries that represent individual customers of a bank including but not limited to their age, salary, credit card limit, and credit card category.
The goal is to find out whether a customer will stay or leave the bank given the above features.
The following training, validation and test split should be used for training the models of the subsequent exercises.
Note that we encoded the target variable Attrition_Flag with new labels, namely Positive and Negative. Positive corresponds to a customer leaving the bank, while Negative corresponds to a customer staying with the bank.
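A sketch of such a split (object names like credit_data are placeholders, and the exact proportions used in the original solution may differ):

library(tidymodels)

set.seed(123)

# Initial training/test split, stratified by the target variable
data_split <- initial_split(credit_data, prop = 0.8, strata = Attrition_Flag)
data_train <- training(data_split)
data_test  <- testing(data_split)

# Carve a validation set out of the training data for tuning
data_val <- validation_split(data_train, prop = 0.8, strata = Attrition_Flag)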
Exercise 6.3 Create and train a random forest model using the provided recipe with a fixed number of trees, and tune the parameters mtry and min_n.
Tune the model on a grid of size 20 using the tune_grid function on the validation split generated with the training data.
Find the best model by evaluating the tuning results with respect to the models’ accuracy.
Based on these parameters, train a model on the whole training data.
Exercise 6.4 Create two tibbles containing the data necessary to plot a ROC- and PR curve. When creating the tibbles, add a column containing the model name "Random forest", so that we can correctly identify the models later during model evaluation.
Exercise 6.5 Tune an XGBoost model in the same fashion as the random forest. Set the number of trees to 1000, and every other parameter, except for sample_size, to tune().
After tuning and refitting the best model on the whole training data, repeat Exercise 6.4 for this XGBoost model on the test data.
Note
The following model is not relevant for the exam. However, it is extremely relevant in today’s ML landscape, so I encourage you to solve the following exercises as well.
Exercise 6.6 (Bonus Exercise) The last model we want to train is called LightGBM. It was developed by Microsoft and is, like XGBoost, a gradient-based ensemble learner. An advantage compared to XGBoost is its focus on performance and scalability, meaning that it is designed to work well on CPUs while trying to at least match the predictive performance of XGBoost.
The steps for training a LightGBM model are exactly the same as for training an XGBoost model, except for the model specification. Here we set the engine to "lightgbm" instead of "xgboost". Every other parameter stays the same, thanks to the {tidymodels} framework.
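A sketch of such a specification, assuming the {bonsai} package is installed (it supplies the "lightgbm" engine for parsnip); the tuned versus fixed parameters mirror the XGBoost setup from Exercise 6.5:

library(tidymodels)
library(bonsai)  # provides the "lightgbm" engine for boost_tree()

lgbm_spec <- boost_tree(
  trees          = 1000,
  tree_depth     = tune(),
  min_n          = tune(),
  mtry           = tune(),
  loss_reduction = tune(),
  learn_rate     = tune()
) %>%
  set_engine("lightgbm") %>%
  set_mode("classification")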
If you get stuck recreating the following plots, revisit the solutions to Exercise Sheet 05, where we created the same plot for a penalized logistic regression, a classification tree, and a random forest.
Exercise 6.7 Create a plot showing the ROC- and PR-curve for each of the models we trained in the previous exercises (Random Forest, XGBoost, LightGBM). Compare the performances visually and decide which model performed the best. For reference, you can find what such a plot could look like below.
Exercise 6.8 For each of the previously trained models (Random Forest, XGBoost, LightGBM), create a confusion matrix based on the test sets to evaluate which model performed best on unseen data.
Exercise 6.9 Based on the confusion matrices above, find out which model has the best overall out-of-sample performance. For this best model, calculate the following metrics:
Solution 6.1 (Exercise 6.1). Bagging (bootstrap aggregation) is a special case of random forests. Here, we also create a predetermined number of trees. However, the main difference is that in Bagging the full set of features is considered when creating a split for a node. In a random forest, only a subset of all features is randomly considered when creating a split for a new node.
Boosting (Trees) combines many weak learners, e.g., tree stumps, to make a prediction. Compared to Bagging and Random forests, those weak learners are weighted, e.g., one tree stump has more say than another when making a final decision. Furthermore, weak learners are not created independently because each weak learner is built by considering the previous learners’ mistakes.
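To make the Bagging versus random forest distinction concrete, here is a sketch in {tidymodels} syntax; the number of trees and the ranger engine are arbitrary choices for this illustration, and .cols() is a parsnip descriptor for the number of predictor columns.

library(tidymodels)

# Bagging: every split may consider the full set of predictors (mtry = number of columns)
bagging_spec <- rand_forest(trees = 500, mtry = .cols()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Random forest: every split only considers a random subset of the predictors
rf_spec <- rand_forest(trees = 500, mtry = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")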
Solution 6.2 (Exercise 6.2). Since $y_i \in \{-1, 1\}$ and $G(x_i) \in \{-1, 1\}$ as well, either $y_i\, G(x_i) = 1$ if $y_i = G(x_i)$, or $y_i\, G(x_i) = -1$ if $y_i \neq G(x_i)$ (since one of the two terms is equal to $1$ and the other equal to $-1$). The condition above can be formalized as

$$\exp\big(-\beta\, y_i\, G(x_i)\big) = e^{-\beta}\, \mathbb{1}\{y_i = G(x_i)\} + e^{\beta}\, \mathbb{1}\{y_i \neq G(x_i)\},$$

and rewriting the right hand side of Equation 6.3 using these conditions yields

$$\sum_{i=1}^{N} w_i^{(m)} \exp\big(-\beta\, y_i\, G(x_i)\big) = e^{-\beta} \sum_{i:\, y_i = G(x_i)} w_i^{(m)} + e^{\beta} \sum_{i:\, y_i \neq G(x_i)} w_i^{(m)}.$$

By expanding and rearranging Equation 6.5, we obtain

$$\big(e^{\beta} - e^{-\beta}\big) \sum_{i=1}^{N} w_i^{(m)}\, \mathbb{1}\{y_i \neq G(x_i)\} + e^{-\beta} \sum_{i=1}^{N} w_i^{(m)}.$$

Using the results of subtasks 1)-3), we can rewrite the minimization problem of Equation 6.1 as follows:

$$(\beta_m, G_m) = \arg\min_{\beta,\, G} \left[\big(e^{\beta} - e^{-\beta}\big) \sum_{i=1}^{N} w_i^{(m)}\, \mathbb{1}\{y_i \neq G(x_i)\} + e^{-\beta} \sum_{i=1}^{N} w_i^{(m)}\right]. \tag{6.7}$$

For any fixed $\beta > 0$, Equation 6.7 only contains one term that depends on $G$, i.e.,

$$\sum_{i=1}^{N} w_i^{(m)}\, \mathbb{1}\{y_i \neq G(x_i)\},$$

so $G_m$ is the classifier that minimizes this weighted misclassification error.

To minimize Equation 6.5 with respect to $\beta$, we have to set the derivative of Equation 6.5 with respect to $\beta$ to 0. Define $A := \sum_{i:\, y_i = G_m(x_i)} w_i^{(m)}$ and $B := \sum_{i:\, y_i \neq G_m(x_i)} w_i^{(m)}$. Then,

$$\frac{\partial}{\partial \beta}\left(e^{-\beta} A + e^{\beta} B\right) = -e^{-\beta} A + e^{\beta} B \overset{!}{=} 0.$$

Now,

$$e^{\beta} B = e^{-\beta} A \iff e^{2\beta} = \frac{A}{B} \iff \beta = \frac{1}{2} \log\left(\frac{A}{B}\right).$$

Defining $\text{err}_m := \frac{B}{A + B}$ yields

$$\frac{A}{B} = \frac{(A + B) - B}{B} = \frac{1 - \text{err}_m}{\text{err}_m}.$$

Finally, re-substituting $A$ and $B$ in $\beta$ yields

$$\beta_m = \frac{1}{2} \log\left(\frac{1 - \text{err}_m}{\text{err}_m}\right).$$
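As a quick numerical sanity check of this derivation (purely illustrative, with made-up weight sums A and B), the closed-form minimizer agrees with a numerical one:

# Hypothetical weight sums for correctly (A) and incorrectly (B) classified samples
A <- 7.5
B <- 2.5
err <- B / (A + B)

# Objective from Equation 6.5 as a function of beta
objective <- function(beta) exp(-beta) * A + exp(beta) * B

0.5 * log((1 - err) / err)                        # closed-form solution: 0.5 * log(3)
optimize(objective, interval = c(-5, 5))$minimum  # numerical minimizer, approximately the same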
cols <-c("#80003A","#506432","#FFC500")names(cols) <-c("lgbm", "rf", "xgb")title_tib <-tibble(x=0,y=1,label = glue::glue("<p><b>Confusion matrices for a <span style='color:{cols['rf']};'>random forest</span>, <br/> <span style='color:{cols['xgb']};'>XGBoost model</span>, and <span style='color:{cols['lgbm']};'>LightGBM model</span>.</b></p> <p> Looking at the number of <b>True Positives </b>(top left panel) <br/> and <b>True Negatives</b> (bottom right panel), it becomes <br /> clear that the <span style='color:{cols['lgbm']};'>LightGBM model</span> performs best.<br /> Additionally, the <b>True Positive rate</b> (ratio of customers <br /> that have been correctly identified to truely leave the bank)<br /> is the highest, and the number of <b>False Positives</b> <br /> (top right panel) is the lowest for the <span style='color:{cols['lgbm']};'>LightGBM model</span>.</p>"))cm_plot <-function(last_fit_model,high){ cm <- last_fit_model %>%collect_predictions() %>%conf_mat(Attrition_Flag, .pred_class) cm_tib <-as_tibble(cm$table)%>%mutate(Prediction =factor(Prediction),Truth =factor(Truth),Prediction =factor(Prediction, levels =rev(levels(Prediction))) ) cm_tib %>%ggplot(aes(x = Prediction, y = Truth,fill = n)) +geom_tile( colour ="gray50")+geom_text(aes(label = n))+scale_fill_gradient(low ="white", high = high)+theme_minimal()+theme(legend.position ="none")}# Random Forestcm1<-cm_plot(last_rf_fit,"#506432")# XGBoostcm2<-cm_plot(last_xgb_fit,"#FFC500")# LightGBMcm3 <-cm_plot(last_lightgbm_fit,"#80003A")title_pane <-ggplot()+geom_richtext(data = title_tib,aes(x, y, label = label),hjust =0, vjust =1, label.color =NA ) +xlim(0, 1) +ylim(0, 1)+theme_void()cm1+cm2+cm3+title_pane+plot_layout(ncol =2, widths =c(1,1.04))
Solution 6.9 (Exercise 6.9). According to the confusion matrices, ROC-, and PR-Curve the LightGBM model performs best.