A note on reproduction of the content of this article

The r-markdown file of this article (and some other useful examples) can be found here: https://github.com/a-ghorbani/h2o_examples

You will also find a Dockerfile there, which contains all the instructions needed to set up this example, such as installing H2O, RStudio and all the required R packages.

You can pull the Docker image and run it with the following command:

> docker run -p 8787:8787 -p 54321:54321 -p 8080:8080 aghorbani/rstudio-h2o
Then you should be able to browse to http://localhost:8787, where you get a web-based RStudio. After logging in with user guest and password guest, there should be a folder named h2o_examples-master. In this folder you can open Telco_Customer_Churn_H2O.Rmd, and … yoohoo, run it.

Introduction

This is a data science example using H2O (h2o.ai) to investigate a classification problem, namely customer churn.

H2O is open-source machine learning software for big data, first released in 2011 by the start-up H2O.ai. Its performance makes various data science analyses feasible: fitting hundreds of models, fast prediction on large amounts of data, and a lot of other interesting things, some of which we will see in this article.

The example I chose to investigate is the classification of a telecom company's customers into those who will stay as customers and those who will leave. Many companies are interested in knowing when, why and which customers are going to leave, so that they can take appropriate actions accordingly.

I chose churn analysis here just as an example; any other binary classification problem can be handled in the same way. A multi-class classification requires some adjustments, sketched briefly below.
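A minimal multi-class sketch, assuming the x and train_frame objects defined later in this article and a hypothetical factor target "PlanType" with more than two levels (not part of the churn data set): with such a target, h2o.gbm fits a multinomial model, and AUC no longer applies as a metric.

# Hypothetical sketch: multi-class GBM ("PlanType" is an invented column).
gbm_multi <- h2o.gbm(
  x              = x,
  y              = "PlanType",
  training_frame = train_frame,
  distribution   = "multinomial")

# AUC is undefined here; inspect e.g. the confusion matrix instead.
h2o.confusionMatrix(gbm_multi)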

I am going to cover the following analyses:

  • prediction of customer churn probability using a gradient boosting machine (GBM),
  • hyper-parameter tuning using Bayesian optimization,
  • interpretation of the model:
    • identifying the co-variables that have high importance for the analysis overall,
    • identifying the co-variables that contribute most to the churn probability of each individual customer,
    • constructing a single decision tree (for interpretation of the black-box GBM model),
    • (partial) dependence of the churn probability on each co-variable.

The purpose of this note is to show how the above-mentioned analyses can be carried out using H2O.

Required packages

The following packages are necessary for this analysis.

#=========================================
# TODO: remove unnecessary packages
#=========================================
library(rpart)  
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(partykit)
library(caret)
library(party)
library(rBayesianOptimization)
require(readr)
require(data.table)
require(reshape2)
library(pROC)
require(ggplot2)

require(h2o)

Init H2O (connect to a running H2O cluster)

We need to download the H2O package if we haven't installed it before. At the time of writing, the latest version was 3.11.0.3717, which I downloaded from here. After downloading, we need to unzip it.

There are different ways to start an H2O cluster. I usually prefer to start it from the command line. For more details on how to start H2O from the command line see here.

> java -jar h2o.jar -nthreads 2 -ice_root ./tmp -port 54321 -name AwesomeCloud

Additionally, you need to install the H2O R API package in R. If you would like to use H2O from Python, you have to install the Python API instead; H2O supports APIs for various languages. Here I use R to connect to the H2O cluster, therefore we need to install its R API.

install.packages("path_to_h2o_dir/h2o-3.11.0.3717/R/h2o_3.11.0.3717.tar.gz", repos=NULL, type = "source")

Finally, use h2o.init() to connect to the running H2O cluster (in our case with only one node).

# Connect to h2o cluster
h2o.init(port = 54321)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 days 21 hours 
##     H2O cluster version:        3.11.0.3719 
##     H2O cluster version age:    3 days  
##     H2O cluster name:           H2O_started_from_R_valar_gde774 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.37 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.2.3 (2015-12-10)
# Remove all objects in the h2o cluster.
h2o.removeAll()
## [1] 0

Load the data into h2o

In the example here, we use the data from http://www.dataminingconsultant.com/data/churn.txt. The description of the columns can be found here.

data_frame <- 
  h2o.importFile(
    path              = "http://www.dataminingconsultant.com/data/churn.txt",
    sep               = ",", 
    destination_frame = "data_frame")

Let’s have a look at the first few rows of the data we just loaded:
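One way to do this: head() also works on H2O frames and pulls only the requested rows into the R session.

head(data_frame, 5)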

State Account Length Area Code Phone Int’l Plan VMail Plan VMail Message Day Mins Day Calls Day Charge Eve Mins Eve Calls Eve Charge Night Mins Night Calls Night Charge Intl Mins Intl Calls Intl Charge CustServ Calls Churn?
KS 128 415 382-4657 no yes 25 265.1 110 45.07 197.4 99 16.78 244.7 91 11.01 10.0 3 2.70 1 False.
OH 107 415 371-7191 no yes 26 161.6 123 27.47 195.5 103 16.62 254.4 103 11.45 13.7 3 3.70 1 False.
NJ 137 415 358-1921 no no 0 243.4 114 41.38 121.2 110 10.30 162.6 104 7.32 12.2 5 3.29 0 False.
OH 84 408 375-9999 yes no 0 299.4 71 50.90 61.9 88 5.26 196.9 89 8.86 6.6 7 1.78 2 False.
OK 75 415 330-6626 yes no 0 166.7 113 28.34 148.3 122 12.61 186.9 121 8.41 10.1 3 2.73 3 False.

We can use h2o.describe to obtain some basic statistics about the variables, like mean, min, max, number of missing values, etc.

h2o.describe(data_frame)
Label Type Missing Zeros PosInf NegInf Min Max Mean Sigma Cardinality
State enum 0 52 0 0 0 50 NA NA 51
Account Length int 0 0 0 0 1 243 101.064806480648 39.8221059285957 NA
Area Code int 0 0 0 0 408 510 437.182418241824 42.3712904856066 NA
Phone string 0 0 0 0 NaN NaN NA NA NA
Int’l Plan enum 0 3010 0 0 0 1 0.0969096909690969 0.295879145484415 2
VMail Plan enum 0 2411 0 0 0 1 0.276627662766277 0.447397870380064 2
VMail Message int 0 2411 0 0 0 51 8.09900990099009 13.6883653720386 NA
Day Mins real 0 2 0 0 0 350.8 179.775097509751 54.4673892023715 NA
Day Calls int 0 2 0 0 0 165 100.435643564356 20.0690842073009 NA
Day Charge real 0 2 0 0 0 59.64 30.5623072307231 9.2594345539305 NA
Eve Mins real 0 1 0 0 0 363.7 200.980348034804 50.713844425812 NA
Eve Calls int 0 1 0 0 0 170 100.114311431143 19.9226252939431 NA
Eve Charge real 0 1 0 0 0 30.91 17.0835403540353 4.31066764311035 NA
Night Mins real 0 0 0 0 23.2 395 200.87203720372 50.5738470136584 NA
Night Calls int 0 0 0 0 33 175 100.107710771077 19.5686093460585 NA
Night Charge real 0 0 0 0 1.04 17.77 9.03932493249326 2.27587283766003 NA
Intl Mins real 0 18 0 0 0 20 10.2372937293729 2.79183954840842 NA
Intl Calls int 0 18 0 0 0 20 4.47944794479448 2.4612142705461 NA
Intl Charge real 0 18 0 0 0 5.4 2.76458145814581 0.753772612663045 NA
CustServ Calls int 0 697 0 0 0 9 1.56285628562856 1.31549104486648 NA
Churn? enum 0 2850 0 0 0 1 0.144914491449145 0.352067423624126 2

The column Churn? specifies whether the customer has left the plan or not. For the exact meaning of the other columns, see here.

In order to build and assess the model, we are going to split the data into training, validation and testing sets. I assume that the analysis here would be applied to a large data set. That is not true in this example, though: we have only 3333 rows, which is not that big. In such a case it is better to split the data into only training and testing sets, and to use cross-validation for model selection and hyper-parameter tuning, as in the sketch below.
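For the small-data route, a cross-validated setup might look like the following sketch (train_frame, x and y are the objects defined further below; nfolds replaces the separate validation frame):

# Sketch: cross-validation instead of a validation frame (for small data).
gbm_cv <- h2o.gbm(
  x              = x,
  y              = y,
  training_frame = train_frame,
  nfolds         = 5,
  seed           = 2016)

# Score measured on the held-out folds.
h2o.auc(gbm_cv, xval = TRUE)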

Since I assume the example here will be applied to a very large data set (otherwise why would somebody want to use H2O ;) ), the data set is split into training, validation and testing sets with ratios of 50%, 25% and 25%, respectively.

split_df  <- 
  h2o.splitFrame(
    data               = data_frame, 
    ratios             = c(0.5,0.25), 
    destination_frames = c("train_frame", "valid_frame", "test_frame"), 
    seed               = 2016)

train_frame <- split_df[[1]]
valid_frame <- split_df[[2]]
test_frame  <- split_df[[3]]

Building the model

First, the target variable, Churn?, and the co-variables (all the remaining variables) are defined.

y <- "Churn?"
x <- setdiff(names(data_frame),  c(y))

Building a model using GBM

Let's first build a model using the gradient boosting machine (GBM) with arbitrarily chosen hyper-parameters. GBM is a machine learning algorithm for regression and classification problems. It builds a model in the form of an ensemble of (weak) prediction models, typically decision trees. For more details refer to this, this and this.

gbm.1 <- h2o.gbm(  
  x                   = x,
  y                   = y,
  training_frame      = train_frame,
  validation_frame    = valid_frame,
  ntrees              = 100,
  max_depth           = 3,
  learn_rate          = 0.2)

In the above example we used train_frame to train the model and valid_frame to evaluate its performance. We can see the performance in the following:

 h2o.performance(gbm.1, valid = T)
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.05118193
## RMSE:  0.2262342
## LogLoss:  0.2170323
## Mean Per-Class Error:  0.126074
## AUC:  0.8941919
## Gini:  0.7883838
## 
## Confusion Matrix for F1-optimal threshold:
##        False. True.    Error     Rate
## False.    689    24 0.033661  =24/713
## True.      26    93 0.218487  =26/119
## Totals    715   117 0.060096  =50/832
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.226842 0.788136 113
## 2                       max f2  0.156576 0.789902 132
## 3                 max f0point5  0.530581 0.843373  70
## 4                 max accuracy  0.226842 0.939904 113
## 5                max precision  0.999967 1.000000   0
## 6                   max recall  0.000327 1.000000 399
## 7              max specificity  0.999967 1.000000   0
## 8             max absolute_mcc  0.226842 0.753161 113
## 9   max min_per_class_accuracy  0.068766 0.857143 178
## 10 max mean_per_class_accuracy  0.156576 0.878811 132
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Each algorithm has a number of hyper-parameters to be set; in the case of GBM, for example, the number of trees, the learning rate, etc. As always, one of the important questions is which values to choose for the hyper-parameters. There are different ways to find optimal values, such as grid search, random search or other general optimization algorithms.

We can use Bayesian optimization to find an approximation of the optimal hyper-parameters.

Bayesian Optimization

H2O supports grid search and random search. Grid search is a kind of brute force (its cost grows exponentially with the number of parameters), and random search, as its name suggests, samples the hyper-parameter domain randomly, as in the sketch below.
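For reference, a random search with H2O's own h2o.grid might look like this sketch (the hyper-parameter values are only illustrative):

# Sketch: H2O random search over a few GBM hyper-parameters.
gbm_grid <- h2o.grid(
  algorithm        = "gbm",
  x                = x,
  y                = y,
  training_frame   = train_frame,
  validation_frame = valid_frame,
  hyper_params     = list(max_depth  = 2:8,
                          learn_rate = c(0.01, 0.05, 0.1, 0.2)),
  search_criteria  = list(strategy   = "RandomDiscrete",
                          max_models = 20,
                          seed       = 2016))

# Inspect the models sorted by validation AUC.
h2o.getGrid(gbm_grid@grid_id, sort_by = "auc", decreasing = TRUE)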

Bayesian optimization (BO) is used for automatic tuning of hyper-parameters. In this approach, the information obtained from previous function evaluations (i.e. building and evaluating models with various hyper-parameters) is used to choose the most promising hyper-parameters to try next. BO builds a probabilistic model of the objective function, which here is the performance of the model. For more information have a look here and here.

I wish the H2O folks had implemented Bayesian optimization. But no worries: there are a bunch of Bayesian optimization packages one can use in R, Python, C++, etc. The heavy-lifting part, namely the function evaluations (i.e. building the models), will still be carried out by H2O.

On the other hand, there are some discussions about whether BO is worth the effort, e.g. see here. Anyway, I think it is at least worth showing how we can combine BO with H2O.

In the following, it is shown how we can use the rBayesianOptimization package in R to find the optimal hyper-parameters.

The function h2o_bayes is defined as a wrapper around the h2o.gbm function, which builds and evaluates a GBM model. All the parameters that need to be optimized are defined as arguments of this function. The validation data set is used to score each model. However, for a small data set (as in this example) it is usually better to use cross-validation (instead of a validation data set) to decrease the variance of the model score.

#============================
# Define the wrapper function
#============================
h2o_bayes <- function(
  max_depth, learn_rate, sample_rate, 
  col_sample_rate, balance_classes){
  bal.cl <- as.logical(balance_classes)
  gbm <- h2o.gbm(  
    x                   = x,
    y                   = y,
    training_frame      = train_frame,
    validation_frame    = valid_frame,
    #nfolds              = 3,
    ntrees              = 900,
    max_depth           = max_depth,
    learn_rate          = learn_rate,
    sample_rate         = sample_rate,
    col_sample_rate     = col_sample_rate,
    score_tree_interval = 5,
    stopping_rounds     = 2,
    stopping_metric     = "logloss",
    stopping_tolerance  = 0.005,
    balance_classes     = bal.cl)
    
  score <- h2o.auc(gbm, valid = T)
  list(Score = score,
       Pred  = 0)
}

#============================
# Find optimal values for the 
# parameters in the given range. 
#============================
OPT_Res <- BayesianOptimization(
  h2o_bayes,
  bounds = list(
    max_depth   = c(2L, 8L), 
    learn_rate  = c(1e-4, 0.2),
    sample_rate = c(0.4, 1), 
    col_sample_rate = c(0.4, 1), 
    balance_classes = c(0L, 1L)),
  init_points = 10,  n_iter = 10,
  acq = "ucb", kappa = 2.576, eps = 0.0,
  verbose = FALSE)
## 
##  Best Parameters Found: 
## Round = 10   max_depth = 6.0000  learn_rate = 0.1223 sample_rate = 0.7886    col_sample_rate = 0.4054    balance_classes = 0.0000    Value = 0.9001

Building the model using optimal parameters

After searching for the best hyper-parameters using Bayesian optimization algorithm, we use the values that were found above to train a GBM model.

gbm <- h2o.gbm(
  x                   = x,
  y                   = y,
  training_frame      = train_frame,
  validation_frame    = valid_frame,
  ntrees              = 900,
  max_depth           = OPT_Res$Best_Par["max_depth"],
  learn_rate          = OPT_Res$Best_Par["learn_rate"],
  sample_rate         = OPT_Res$Best_Par["sample_rate"],
  col_sample_rate     = OPT_Res$Best_Par["col_sample_rate"],
  balance_classes     = as.logical(OPT_Res$Best_Par["balance_classes"]),
  score_tree_interval = 5,
  stopping_rounds     = 2,
  stopping_metric     = "logloss",
  stopping_tolerance  = 0.005,
  model_id         = "my_awesome_GBM")

Training a model using only the important co-variables

In the previous section we used all the columns to train a GBM model. Often, removing irrelevant data from the training data will improve the results, for various reasons. Even if it doesn't improve the results, it will reduce the amount of resources needed for the calculation, e.g. CPU time and memory.

Now we remove the variables that are not important. Note that the variable importance is obtained from the previous calculation (the model trained on all co-variables).

var.imp <- h2o.varimp(gbm)[h2o.varimp(gbm)$scaled_importance > 0.05, "variable"]
# The value of 0.05 is arbitrary, you might want to use other values.

setdiff(x, var.imp)
## [1] "Account Length" "Area Code"      "Phone"          "Day Calls"     
## [5] "Eve Calls"      "Night Calls"
gbm_varImp <- h2o.gbm(
  x                   = var.imp,
  y                   = y,
  training_frame      = train_frame,
  validation_frame    = valid_frame,
  ntrees              = 900,
  max_depth           = OPT_Res$Best_Par["max_depth"],
  learn_rate          = OPT_Res$Best_Par["learn_rate"],
  sample_rate         = OPT_Res$Best_Par["sample_rate"],
  col_sample_rate     = OPT_Res$Best_Par["col_sample_rate"],
  balance_classes     = as.logical(OPT_Res$Best_Par["balance_classes"]),
  score_tree_interval = 5,
  stopping_rounds     = 2,
  stopping_metric     = "logloss",
  stopping_tolerance  = 0.005,
  model_id         = "my_awesome_GBM_varImp")

Comparing the two models

The two models are compared in the plots below, in terms of RMSE and AUC. It can be seen that if we use only the important variables, the performance of the model is not reduced significantly, if at all.

shist <- gbm@model$scoring_history[, c("duration", "validation_rmse", "validation_auc")]
shist$algorithm <- "GBM" 
scoring_history <- shist

shist <- gbm_varImp@model$scoring_history[, c("duration", "validation_rmse", "validation_auc")]
shist$algorithm <- "GBM with var.imp." 
scoring_history <- rbind(scoring_history,shist)

scoring_history$duration <- as.numeric(
  gsub("sec", "", scoring_history$duration))

scoring_history <- melt(scoring_history, id = c("duration", "algorithm"))

ggplot(data = scoring_history, 
       aes(x     = duration, 
           y     = value, 
           color = algorithm,
           group = algorithm)) + 
  geom_line() + geom_point() +
  facet_grid(. ~ variable, scales = "free",shrink = TRUE,space = "free")

Choose the best model based on the AUC measure (you could choose MSE, logloss or another measure that you find more appropriate for your case).

AUC_gbm        <- h2o.performance(gbm, valid = T)@metrics$AUC
AUC_gbm_varImp <- h2o.performance(gbm_varImp, valid = T)@metrics$AUC
if(AUC_gbm > AUC_gbm_varImp){
  bestModel <- gbm
}else{
  bestModel <- gbm_varImp
}
cat("The best model is '", bestModel@model_id, 
    "' with AUC of ", max(AUC_gbm, AUC_gbm_varImp), 
    " vs ",  min(AUC_gbm, AUC_gbm_varImp), "\n" )
## The best model is ' my_awesome_GBM_varImp ' with AUC of  0.8956946  vs  0.8955178

Model Assessment

Now we use the test data to measure the performance of the trained model:

bestPerf <- h2o.performance(bestModel, test_frame)

perfDF <- melt(as.data.frame(bestPerf@metrics$thresholds_and_metric_scores), 
           id = "threshold")

Performance of the model on the test data:

as.data.frame(
    bestPerf@metrics[c("MSE", "RMSE", "AUC", "r2", "logloss", "Gini", "mean_per_class_error")]
    )
MSE RMSE AUC r2 logloss Gini mean_per_class_error
0.066 0.257 0.902 0.485 0.242 0.804 0.172

As you know, the model predicts the probability that a customer churns. In order to get a visual feel for the performance of the model, the distribution of the predictions is plotted against the true outcome.

Let's have a look at the prediction distribution for each class:

The horizontal axis shows the true values, i.e. the values of the Churn? column. The vertical axis shows the model's predicted churn probability. In principle we can define a threshold such that a customer with a higher predicted probability is identified as a churning customer, and one with a lower value as a loyal customer. This threshold could be a value like 0.5, or the value with the best accuracy, which is 0.3259223 in this case. We would like all the data points to be either in the top right (true positives (TP)) or the bottom left (true negatives (TN)). The data points in the top left and bottom right are misclassified.

I saw this kind of plot for the first time here.
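A minimal sketch of how such a plot can be produced, assuming the bestModel and test_frame objects from above (for this data set the positive-class probability column of the prediction is named True.):

# Predicted churn probability vs. the true label, one point per customer.
pred  <- as.data.frame(h2o.predict(bestModel, test_frame))
truth <- as.data.frame(test_frame[, "Churn?"])

plot_df <- data.frame(churn = truth[[1]], prob = pred[["True."]])
ggplot(data = plot_df, aes(x = churn, y = prob)) +
  geom_jitter(width = 0.2, alpha = 0.3)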

So, which customers are going to leave? As mentioned above, those whose predicted value is more than 0.5, or whatever other threshold we settle on. Here are some of the criteria and the maximum value we can achieve depending on which one we choose:

kable(bestPerf@metrics$max_criteria_and_metric_scores, digits = 3)
metric threshold value idx
max f1 0.237 0.725 111
max f2 0.122 0.767 147
max f0point5 0.459 0.785 74
max accuracy 0.326 0.921 96
max precision 0.993 1.000 0
max recall 0.003 1.000 394
max specificity 0.993 1.000 0
max absolute_mcc 0.237 0.679 111
max min_per_class_accuracy 0.061 0.844 202
max mean_per_class_accuracy 0.122 0.863 147

and here are the plots of a few of them against various thresholds.

scores_to_plot <- c("accuracy", "precision", "recall", "min_per_class_accuracy")

ggplot(data = perfDF[perfDF$variable %in% scores_to_plot, ],  
       aes(x     = threshold, 
           y     = value, 
           color = variable,
           group = variable)) + 
  geom_line() + geom_point() 


If we know the actual business costs of false positives and false negatives (or the benefits of true positives), we can make a better decision about which threshold to choose, as in the toy sketch below.
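As a toy sketch (the two costs below are made up, purely for illustration), one could pick the threshold minimizing the expected cost, using the per-threshold confusion counts that H2O reports:

# Toy sketch: choose the threshold that minimizes a hypothetical cost.
tm <- as.data.frame(bestPerf@metrics$thresholds_and_metric_scores)

cost_fp <- 10   # hypothetical cost of targeting a loyal customer
cost_fn <- 100  # hypothetical cost of missing a churning customer

tm$cost <- cost_fp * tm$fps + cost_fn * tm$fns
tm$threshold[which.min(tm$cost)]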

Interpretation of the Model

So far so good: we built the model and evaluated it. The algorithm we used above was GBM, which is a black-box model. That means it only gives us predictions of which customers are going to leave, i.e. the probabilities. The interpretation of the model, like the whys, is not that straightforward.

However, we would like to gain some more insight and be able to answer questions like: what are the reasons that customers are leaving, or why might a specific customer leave?

Variable Importance

In many situations not all the co-variables are equally important; some variables are more important than others. Using variable importance we can focus on the important ones.

h2o.varimp_plot(bestModel)

Variable Importance per Customer

The above variable importances were measured over the whole training data set and are therefore only valid on average. However, the importance of each variable might differ for each individual customer.

We can use the trained model to approximate the variable importance based on each individual customer.

In this approach, a prediction is first made on the data set. Then we leave one column out, make the prediction again, and measure the difference in the predicted probabilities. This difference indicates the importance of that variable for each customer; a naive sketch of the idea follows.
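A naive R-only sketch of this idea on a small sample (slow, and only an approximation of what rowVarImp.h2o below does; leaving a column "out" is emulated by setting it to missing, which H2O's GBM can handle):

# Naive sketch: per-customer change in predicted churn probability
# when each important variable is set to missing.
sample_h2o <- test_frame[1:100, ]
base_p <- as.data.frame(h2o.predict(bestModel, sample_h2o))[["True."]]

imp <- sapply(var.imp, function(col) {
  tmp <- as.data.frame(sample_h2o)
  tmp[[col]][] <- NA  # keep the column type, make all values missing
  p <- as.data.frame(h2o.predict(bestModel, as.h2o(tmp)))[["True."]]
  p - base_p          # one importance value per customer for this column
})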

Doing this at scale involves a fair bit of computation on the R side, like calculating the differences and ranking them within each row. Hence, in order to speed things up, I implemented a C++ routine (get_high_rank_values.cpp), which also needs to be sourced. Using the Rcpp package one can easily call C++ code from an R script.

require(Rcpp)
Rcpp::sourceCpp("get_high_rank_values.cpp")
source("rowVarImp.R")

df <- rowVarImp.h2o(model=bestModel, cols=var.imp, data_frame=test_frame, n=3)

Here, only the 3 most important variables for each customer are computed. For each of them, the table below shows the variable name (colName), its value for that customer (value) and the change in the predicted probability when the variable is left out (imp).

Knowing this information, the business unit can take measures tailored to each individual customer (rather than averaged over all customers, as with the general variable importance).

head(df)
colName_VI1 value_VI1 imp_VI1 colName_VI2 value_VI2 imp_VI2 colName_VI3 value_VI3 imp_VI3
Intl Mins 12.7 -0.095 Eve Charge 19.42 -0.059 Intl Charge 3.43 -0.037
Day Mins 187.7 -0.010 Intl Charge 2.46 -0.009 Eve Mins 163.4 -0.002
State IA -0.375 Intl Mins 13.1 -0.112 Eve Charge 26.11 -0.091
Intl Mins 10.6 -0.008 CustServ Calls 0 -0.004 Day Charge 26.37 -0.002
State CO -0.120 Intl Calls 6 -0.017 Intl Charge 1.54 -0.012
Day Mins 124.3 -0.053 Eve Charge 23.55 -0.052 CustServ Calls 3 -0.044

Partial Dependence Plots

In the previous section we talked about the importance of each variable, but we didn't answer the question of how changing a variable would change the outcome. However, it is very important to know what the impact on the outcome will be if we change the value of a variable.

This kind of question can be answered by so-called partial dependence plots, which depict the marginal effect of the variables. In the following, the partial dependence plots are shown for the 5 most important variables.

In order to calculate partial dependence we can use partialDependencePlot.R. However, since at least version 3.11.0.3717, h2o has the h2o.partialPlot function to calculate one-dimensional partial dependence. I like my own implementation better because of its functionality; however, h2o.partialPlot is much faster.

cols <- var.imp[1:5]
pdps <- h2o.partialPlot(bestModel, data_frame, cols, nbins = 100, plot = F)
mean.orig <- mean(h2o.predict(bestModel, data_frame)[,3])

lapply(1:length(cols), function(x){
  x_ <- pdps[[x]] 
  colnames(x_) <- c("variable", "diff")
  x_$diff <- x_$diff - mean.orig
  if(class(x_$variable) == 'factor' || class(x_$variable) == 'character'){
    # order based on diff
    x_ <- within(x_, variable <- factor(variable, levels=variable[order(-diff)]))
    p <- ggplot(data = x_, aes(x = variable, y = diff)) +
      geom_bar(stat="identity" )
  }else{
    p <- ggplot(data = x_, aes(x = variable, y = diff)) +
      geom_line()
  }
  p + ggtitle(cols[x])
})

The above plots only show the impact of a single variable while the others are kept constant. But in many cases, especially if there are interactions between variables, we are interested in the impact of changing two or more variables. For this we can also use partialDependencePlot.R. As an example, in the following we can see the impact of "Day Charge" for each state (the colored thin lines), and also on average (the thick black line). It can be seen that there is quite some variance among the states.

source("partialDependencePlot.R")
cols <- c("State", "Day Charge")
pdp1 <- partialDependencePlot(bestModel, cols, data_frame)

pdp2 <- partialDependencePlot(bestModel, "Day Charge", data_frame)

pdp1 <- melt(pdp1)
pdp2 <- melt(pdp2)
pdp2$State <- "AVG"

colnames(pdp1) <- c("State", "Day.Charge", "diff")
colnames(pdp2) <- c("Day.Charge", "diff", "State")
ggplot(data = pdp1, 
       aes(x     = Day.Charge, 
           y     = diff,   
           color = State,
           group = State)) + 
  geom_line() +
  geom_line(data = pdp2, size=3, color='black')

A Simplified Single Decision Tree

Alright, so far we have done quite a lot: we found the important variables and the impact of changing each variable on the outcome. However, we would also like to have the big picture, a high-level insight into the model.

For that, we can build a single decision tree. But we already know that a model built with a single decision tree wouldn't be that accurate.

In order to improve the accuracy of the single decision tree a little, we use it to approximate the GBM model instead of training it on the original data: we train the decision tree on the same data, but with the target variable replaced by the predictions of the GBM. The premise behind this, according to what a friend of mine, Armin Monecke, once told me, is that using the predictions of a GBM (or another complex model) in effect removes noise from the data, which makes it easier for a less complex model (like a single decision tree) to train.

Additionally, here we use only the rows that are predicted with high confidence, i.e. with probabilities close to 0 or 1.

Note: since we sample the data points, we need to take care of the prior distribution (e.g. by sampling with the same distribution as the target column, or by specifying it, for example, via a prior parameter of the algorithm).

denoise.h2o.df <- function(model, data_frame, destination_frame, lower.q=0.2, higher.q=0.8){
  # Removes the data points that cannot be classified significantly.
  # That means the data points for which the prediction is far away from 0 and 1, 
  # like those close to 0.5
  #
  # Args:
  #   model: A H2O model.
  #   data_frame: A H2O data frame.
  #   destination_frame: ID for the result frame.
  #   lower.q: The lower quantile, from which the data points will be ignored.
  #   higher.q: The higher quantile, to which the data points will be ignored.
  #
  # Returns:
  #   The resulting H2O data frame.
  
  pred   <- h2o.predict(model, data_frame)
  predCol <- names(pred)[3]
  quant  <- h2o.quantile(pred[,predCol], probs = c(lower.q, higher.q))
  lower  <- min(quant)
  higher <- max(quant)
  data_pred <- h2o.cbind(data_frame, pred)
  data_denoised <- data_pred[
    data_pred[,predCol] < lower | 
      data_pred[,predCol] > higher,]
  data_denoised <- h2o.assign(data_denoised, destination_frame) 
  return(data_denoised)  
}

denoised_train_df <- 
  denoise.h2o.df(
    model             = bestModel, 
    data_frame        = train_frame, 
    destination_frame = "denoised_train_frame", 
    lower.q           = 0.1,
    higher.q          = 0.9)

denoised_test_df <- 
  denoise.h2o.df(
    model             = bestModel, 
    data_frame        = test_frame, 
    destination_frame = "denoised_test_frame", 
    lower.q           = 0.1,
    higher.q          = 0.9)
train_df <- as.data.frame(denoised_train_df)
test_df  <- as.data.frame(test_frame)
test_df1 <- as.data.frame(denoised_test_df)

# Ignore the columns that are not used in training
tmp.df <- train_df[, -which(names(train_df)  %in% c("Churn.","False.","True.", "Phone" ))]
colnames(tmp.df)[colnames(tmp.df) == "predict"] <- "Churn."

counts <- as.data.frame(h2o.table(data_frame[,"Churn?"]))

priorDist <- counts$Count[1] / sum(counts$Count)
priorDist <- c(priorDist, 1 - priorDist)
               
tree <- rpart(
    formula  = Churn. ~ ., 
    data     = tmp.df,
    maxdepth = 5, 
    cp       = 0.02,
    parms    = list(prior = priorDist, 
                    split = "information"))

fancyRpartPlot(tree)

id <- which(!(test_df$State %in% levels(train_df$State)))
test_df$State[id] <- NA

Pred1 <- predict(tree, test_df)
test_df$Prediction1 <- Pred1[,"True."]
ROC1 <- roc(test_df$Churn., test_df$Prediction1)
ROC1$auc
## Area under the curve: 0.6869

Conclusion

The purpose of this article was to show how we can go through a data science task for a classification problem using H2O, one of the most powerful open-source libraries for big data. Its speed and performance allow us to accomplish analytics tasks on huge amounts of data, like:

  • building many models,
  • parameter tuning,
  • variable importance at customer level,
  • (low dimensional) partial dependence analysis.

For example, note that for a two-dimensional partial dependence analysis with only 20 points per dimension, we need to perform prediction over the whole data set 400 (i.e. 20 × 20) times.

Acknowledgement

I would like to thank Jan Simen for his comments that improved the content of this post.

Disclaimer

I am not affiliated, associated, authorized, endorsed by, or in any way officially connected with h2o.ai. I just enjoy using this product.



Published

24 November 2016
