A note on reproduction of the content of this article

The r-markdown file of this article (and some other useful examples) can be found here: https://github.com/a-ghorbani/h2o_examples

You will also find a Dockerfile there, which contains all the instructions needed to set up this example, such as installing H2O, RStudio and all the required R packages.

You can pull the Docker image and run it with the following command:

> docker run -p 8787:8787 -p 54321:54321 -p 8080:8080 aghorbani/rstudio-h2o
Then you should be able to browse to http://localhost:8787, where you get a web-based RStudio. After logging in with user guest and password guest, there should be a folder named h2o_examples-master. In this folder you can open Telco_Customer_Churn_H2O.Rmd, and … yoohoo, run it.

Introduction

This is a data science example using H2O (h2o.ai) to investigate a classification problem, namely customer churn.

H2O is open-source machine learning software for big data, first released in 2011 by the start-up H2O.ai. Its performance makes various data science analyses feasible: fitting hundreds of models, fast prediction on large amounts of data, and a lot of other interesting things, some of which we will see in this article.

The example I chose to investigate is the classification of a telecom company's customers into those who will stay as customers and those who will leave. Many companies are interested in knowing when, why and which customers are going to leave, so that they can take appropriate actions accordingly.

I chose churn analysis here just as an example; any other binary classification problem can be handled in the same way. A multi-class classification requires some adjustments, sketched briefly below.
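A minimal multi-class sketch, assuming the x and train_frame objects defined later in this article and a hypothetical factor target "PlanType" with more than two levels (not part of the churn data set): with such a target, h2o.gbm fits a multinomial model, and AUC no longer applies as a metric.

# Hypothetical sketch: multi-class GBM ("PlanType" is an invented column).
gbm_multi <- h2o.gbm(
  x              = x,
  y              = "PlanType",
  training_frame = train_frame,
  distribution   = "multinomial")

# AUC is undefined here; inspect e.g. the confusion matrix instead.
h2o.confusionMatrix(gbm_multi)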

I am going to cover the following analyses:

  • prediction of customer churn probability using a gradient boosting machine (GBM),
  • hyper-parameter tuning using Bayesian optimization,
  • interpretation of the model:
    • identifying the co-variables that have high importance for the analysis overall,
    • identifying the co-variables that contribute most to the churn probability of each individual customer,
    • constructing a single decision tree (for interpretation of the black-box GBM model),
    • (partial) dependence of the churn probability on each co-variable.

The purpose of this note is to show how the above-mentioned analyses can be carried out using H2O.

Required packages

The following packages are necessary for this analysis.

#=========================================
# TODO: remove unnecessary packages
#=========================================
library(rpart)  
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(partykit)
library(caret)
library(party)
library(rBayesianOptimization)
require(readr)
require(data.table)
require(reshape2)
library(pROC)
require(ggplot2)

require(h2o)

Init H2O (connect to a running H2O cluster)

We need to download the H2O package if we haven't installed it before. At the time of writing, the latest version was 3.11.0.3717, which I downloaded from here. After downloading, we need to unzip it.

There are different ways to start an H2O cluster. I usually prefer to start it from the command line. For more details on how to start H2O from the command line see here.

> java -jar h2o.jar -nthreads 2 -ice_root ./tmp -port 54321 -name AwesomeCloud

Additionally, you need to install the H2O R API package in R. If you would like to use H2O from Python, you have to install the Python API instead; H2O supports APIs for various languages. Here I use R to connect to the H2O cluster, therefore we need to install its R API.

install.packages("path_to_h2o_dir/h2o-3.11.0.3717/R/h2o_3.11.0.3717.tar.gz", repos=NULL, type = "source")

Finally, use h2o.init() to connect to the running H2O cluster (in our case with only one node).

# Connect to h2o cluster
h2o.init(port = 54321)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 days 21 hours 
##     H2O cluster version:        3.11.0.3719 
##     H2O cluster version age:    3 days  
##     H2O cluster name:           H2O_started_from_R_valar_gde774 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.37 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     R Version:                  R version 3.2.3 (2015-12-10)
# Remove all objects in the h2o cluster.
h2o.removeAll()
## [1] 0

Load the data into h2o

In the example here, we use the data from http://www.dataminingconsultant.com/data/churn.txt. The description of the columns can be found here.

data_frame <- 
  h2o.importFile(
    path              = "http://www.dataminingconsultant.com/data/churn.txt",
    sep               = ",", 
    destination_frame = "data_frame")

Let’s have a look at the first few rows of the data we just loaded:
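One way to do this: head() also works on H2O frames and pulls only the requested rows into the R session.

head(data_frame, 5)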

State Account Length Area Code Phone Int’l Plan VMail Plan VMail Message Day Mins Day Calls Day Charge Eve Mins Eve Calls Eve Charge Night Mins Night Calls Night Charge Intl Mins Intl Calls Intl Charge CustServ Calls Churn?
KS 128 415 382-4657 no yes 25 265.1 110 45.07 197.4 99 16.78 244.7 91 11.01 10.0 3 2.70 1 False.
OH 107 415 371-7191 no yes 26 161.6 123 27.47 195.5 103 16.62 254.4 103 11.45 13.7 3 3.70 1 False.
NJ 137 415 358-1921 no no 0 243.4 114 41.38 121.2 110 10.30 162.6 104 7.32 12.2 5 3.29 0 False.
OH 84 408 375-9999 yes no 0 299.4 71 50.90 61.9 88 5.26 196.9 89 8.86 6.6 7 1.78 2 False.
OK 75 415 330-6626 yes no 0 166.7 113 28.34 148.3 122 12.61 186.9 121 8.41 10.1 3 2.73 3 False.

We can use h2o.describe to obtain some basic statistics about the variables, like mean, min, max, number of missing values, etc.

h2o.describe(data_frame)
Label Type Missing Zeros PosInf NegInf Min Max Mean Sigma Cardinality
State enum 0 52 0 0 0 50 NA NA 51
Account Length int 0 0 0 0 1 243 101.064806480648 39.8221059285957 NA
Area Code int 0 0 0 0 408 510 437.182418241824 42.3712904856066 NA
Phone string 0 0 0 0 NaN NaN NA NA NA
Int’l Plan enum 0 3010 0 0 0 1 0.0969096909690969 0.295879145484415 2
VMail Plan enum 0 2411 0 0 0 1 0.276627662766277 0.447397870380064 2
VMail Message int 0 2411 0 0 0 51 8.09900990099009 13.6883653720386 NA
Day Mins real 0 2 0 0 0 350.8 179.775097509751 54.4673892023715 NA
Day Calls int 0 2 0 0 0 165 100.435643564356 20.0690842073009 NA
Day Charge real 0 2 0 0 0 59.64 30.5623072307231 9.2594345539305 NA
Eve Mins real 0 1 0 0 0 363.7 200.980348034804 50.713844425812 NA
Eve Calls int 0 1 0 0 0 170 100.114311431143 19.9226252939431 NA
Eve Charge real 0 1 0 0 0 30.91 17.0835403540353 4.31066764311035 NA
Night Mins real 0 0 0 0 23.2 395 200.87203720372 50.5738470136584 NA
Night Calls int 0 0 0 0 33 175 100.107710771077 19.5686093460585 NA
Night Charge real 0 0 0 0 1.04 17.77 9.03932493249326 2.27587283766003 NA
Intl Mins real 0 18 0 0 0 20 10.2372937293729 2.79183954840842 NA
Intl Calls int 0 18 0 0 0 20 4.47944794479448 2.4612142705461 NA
Intl Charge real 0 18 0 0 0 5.4 2.76458145814581 0.753772612663045 NA
CustServ Calls int 0 697 0 0 0 9 1.56285628562856 1.31549104486648 NA
Churn? enum 0 2850 0 0 0 1 0.144914491449145 0.352067423624126 2

The column Churn? specifies whether the customer has left the plan or not. For the exact meaning of the other columns, see here.

In order to build and assess the model, we are going to split the data into training, validation and testing sets. I assume that the analysis here would be applied to a large data set. That is not true in this example, though: we have only 3333 rows, which is not that big. In such a case it is better to split the data into only training and testing sets, and to use cross-validation for model selection and hyper-parameter tuning, as in the sketch below.
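For the small-data route, a cross-validated setup might look like the following sketch (train_frame, x and y are the objects defined further below; nfolds replaces the separate validation frame):

# Sketch: cross-validation instead of a validation frame (for small data).
gbm_cv <- h2o.gbm(
  x              = x,
  y              = y,
  training_frame = train_frame,
  nfolds         = 5,
  seed           = 2016)

# Score measured on the held-out folds.
h2o.auc(gbm_cv, xval = TRUE)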

Since I assume the example here will be applied to a very large data set (otherwise why would somebody want to use H2O ;) ), the data set is split into training, validation and testing sets with ratios of 50%, 25% and 25%, respectively.

split_df  <- 
  h2o.splitFrame(
    data               = data_frame, 
    ratios             = c(0.5,0.25), 
    destination_frames = c("train_frame", "valid_frame", "test_frame"), 
    seed               = 2016)

train_frame <- split_df[[1]]
valid_frame <- split_df[[2]]
test_frame  <- split_df[[3]]

Building the model

First, the target variable, Churn?, and the co-variables (all the remaining variables) are defined.

y <- "Churn?"
x <- setdiff(names(data_frame),  c(y))

Building a model using GBM

Let's first build a model using the gradient boosting machine (GBM) with arbitrarily chosen hyper-parameters. GBM is a machine learning algorithm for regression and classification problems. It builds a model in the form of an ensemble of (weak) prediction models, typically decision trees. For more details refer to this, this and this.

gbm.1 <- h2o.gbm(  
  x                   = x,
  y                   = y,
  training_frame      = train_frame,
  validation_frame    = valid_frame,
  ntrees              = 100,
  max_depth           = 3,
  learn_rate          = 0.2)

In the above example we used train_frame to train the model and valid_frame to evaluate its performance. We can see the performance in the following:

 h2o.performance(gbm.1, valid = T)
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.05118193
## RMSE:  0.2262342
## LogLoss:  0.2170323
## Mean Per-Class Error:  0.126074
## AUC:  0.8941919
## Gini:  0.7883838
## 
## Confusion Matrix for F1-optimal threshold:
##        False. True.    Error     Rate
## False.    689    24 0.033661  =24/713
## True.      26    93 0.218487  =26/119
## Totals    715   117 0.060096  =50/832
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.226842 0.788136 113
## 2                       max f2  0.156576 0.789902 132
## 3                 max f0point5  0.530581 0.843373  70
## 4                 max accuracy  0.226842 0.939904 113
## 5                max precision  0.999967 1.000000   0
## 6                   max recall  0.000327 1.000000 399
## 7              max specificity  0.999967 1.000000   0
## 8             max absolute_mcc  0.226842 0.753161 113
## 9   max min_per_class_accuracy  0.068766 0.857143 178
## 10 max mean_per_class_accuracy  0.156576 0.878811 132
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Each algorithm has a number of hyper-parameters to be set; in the case of GBM, for example, the number of trees, the learning rate, etc. As always, one of the important questions is which values to choose for the hyper-parameters. There are different ways to find optimal values, such as grid search, random search or other general optimization algorithms.

We can use Bayesian optimization to find an approximation of the optimal hyper-parameters.

Bayesian Optimization

H2O supports grid search and random search. Grid search is a kind of brute force (its cost grows exponentially with the number of parameters), and random search, as its name suggests, samples the hyper-parameter domain randomly, as in the sketch below.
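For reference, a random search with H2O's own h2o.grid might look like this sketch (the hyper-parameter values are only illustrative):

# Sketch: H2O random search over a few GBM hyper-parameters.
gbm_grid <- h2o.grid(
  algorithm        = "gbm",
  x                = x,
  y                = y,
  training_frame   = train_frame,
  validation_frame = valid_frame,
  hyper_params     = list(max_depth  = 2:8,
                          learn_rate = c(0.01, 0.05, 0.1, 0.2)),
  search_criteria  = list(strategy   = "RandomDiscrete",
                          max_models = 20,
                          seed       = 2016))

# Inspect the models sorted by validation AUC.
h2o.getGrid(gbm_grid@grid_id, sort_by = "auc", decreasing = TRUE)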

Bayesian optimization (BO) is used for automatic tuning of hyper-parameters. In this approach, the information obtained from previous function evaluations (i.e. building and evaluating models with various hyper-parameters) is used to choose the most promising hyper-parameters to try next. BO builds a probabilistic model of the objective function, which here is the performance of the model. For more information have a look here and here.

I wish the H2O folks had implemented Bayesian optimization. But no worries: there are a bunch of Bayesian optimization packages one can use in R, Python, C++, etc. The heavy-lifting part, namely the function evaluations (i.e. building the models), will still be carried out by H2O.

On the other hand, there are some discussions about whether BO is worth the effort, e.g. see here. Anyway, I think it is at least worth showing how we can combine BO with H2O.

In the following, it is shown how we can use the rBayesianOptimization package in R to find the optimal hyper-parameters.

The function h2o_bayes is defined as a wrapper around the h2o.gbm function, which builds and evaluates a GBM model. All the parameters that need to be optimized are defined as arguments of this function. The validation data set is used to score each model. However, for a small data set (as in this example) it is usually better to use cross-validation (instead of a validation data set) to decrease the variance of the model score.

#============================
# Define the wrapper function
#============================
h2o_bayes <- function(
  max_depth, learn_rate, sample_rate, 
  col_sample_rate, balance_classes){
  bal.cl <- as.logical(balance_classes)
  gbm <- h2o.gbm(  
    x                   = x,
    y                   = y,
    training_frame      = train_frame,
    validation_frame    = valid_frame,
    #nfolds              = 3,
    ntrees              = 900,
    max_depth           = max_depth,
    learn_rate          = learn_rate,
    sample_rate         = sample_rate,
    col_sample_rate     = col_sample_rate,
    score_tree_interval = 5,
    stopping_rounds     = 2,
    stopping_metric     = "logloss",
    stopping_tolerance  = 0.005,
    balance_classes     = bal.cl)
    
  score <- h2o.auc(gbm, valid = T)
  list(Score = score,
       Pred  = 0)
}

#============================
# Find optimal values for the 
# parameters in the given range. 
#============================
OPT_Res <- BayesianOptimization(
  h2o_bayes,
  bounds = list(
    max_depth   = c(2L, 8L), 
    learn_rate  = c(1e-4, 0.2),
    sample_rate = c(0.4, 1), 
    col_sample_rate = c(0.4, 1), 
    balance_classes = c(0L, 1L)),
  init_points = 10,  n_iter = 10,
  acq = "ucb", kappa = 2.576, eps = 0.0,
  verbose = FALSE)
## 
##  Best Parameters Found: 
## Round = 10   max_depth = 6.0000  learn_rate = 0.1223 sample_rate = 0.7886    col_sample_rate = 0.4054    balance_classes = 0.0000    Value = 0.9001

Building the model using optimal parameters

After searching for the best hyper-parameters using Bayesian optimization algorithm, we use the values that were found above to train a GBM model.

gbm <- h2o.gbm(
  x                   = x,
  y                   = y,
  training_frame      = train_frame,
  validation_frame    = valid_frame,
  ntrees              = 900,
  max_depth           = OPT_Res$Best_Par["max_depth"],
  learn_rate          = OPT_Res$Best_Par["learn_rate"],
  sample_rate         = OPT_Res$Best_Par["sample_rate"],
  col_sample_rate     = OPT_Res$Best_Par["col_sample_rate"],
  balance_classes     = as.logical(OPT_Res$Best_Par["balance_classes"]),
  score_tree_interval = 5,
  stopping_rounds     = 2,
  stopping_metric     = "logloss",
  stopping_tolerance  = 0.005,
  model_id         = "my_awesome_GBM")

Training a model using only the important co-variables

In the previous section we used all the columns to train a GBM model. Often, removing irrelevant data from the training data will improve the results, for various reasons. Even if it doesn't improve the results, it will reduce the amount of resources needed for the calculation, e.g. CPU time and memory.

Now we remove the variables that are not important. Note that the variable importance is obtained from the previous calculation (the model trained on all co-variables).

var.imp <- h2o.varimp(gbm)[h2o.varimp(gbm)$scaled_importance > 0.05, "variable"]
# The value of 0.05 is arbitrary, you might want to use other values.

setdiff(x, var.imp)
## [1] "Account Length" "Area Code"      "Phone"          "Day Calls"     
## [5] "Eve Calls"      "Night Calls"
gbm_varImp <- h2o.gbm(
  x                   = var.imp,
  y                   = y,
  training_frame      = train_frame,
  validation_frame    = valid_frame,
  ntrees              = 900,
  max_depth           = OPT_Res$Best_Par["max_depth"],
  learn_rate          = OPT_Res$Best_Par["learn_rate"],
  sample_rate         = OPT_Res$Best_Par["sample_rate"],
  col_sample_rate     = OPT_Res$Best_Par["col_sample_rate"],
  balance_classes     = as.logical(OPT_Res$Best_Par["balance_classes"]),
  score_tree_interval = 5,
  stopping_rounds     = 2,
  stopping_metric     = "logloss",
  stopping_tolerance  = 0.005,
  model_id         = "my_awesome_GBM_varImp")

Comparing the two models

The two models are compared in the plots below, in terms of RMSE and AUC. It can be seen that if we use only the important variables, the performance of the model is not reduced significantly, if at all.

shist <- gbm@model$scoring_history[, c("duration", "validation_rmse", "validation_auc")]
shist$algorithm <- "GBM" 
scoring_history <- shist

shist <- gbm_varImp@model$scoring_history[, c("duration", "validation_rmse", "validation_auc")]
shist$algorithm <- "GBM with var.imp." 
scoring_history <- rbind(scoring_history,shist)

scoring_history$duration <- as.numeric(
  gsub("sec", "", scoring_history$duration))

scoring_history <- melt(scoring_history, id = c("duration", "algorithm"))

ggplot(data = scoring_history, 
       aes(x     = duration, 
           y     = value, 
           color = algorithm,
           group = algorithm)) + 
  geom_line() + geom_point() +
  facet_grid(. ~ variable, scales = "free",shrink = TRUE,space = "free")

Choose the best model based on the AUC measure (you could choose MSE, logloss or another measure that you find more appropriate for your case).

AUC_gbm        <- h2o.performance(gbm, valid = T)@metrics$AUC
AUC_gbm_varImp <- h2o.performance(gbm_varImp, valid = T)@metrics$AUC
if(AUC_gbm > AUC_gbm_varImp){
  bestModel <- gbm
}else{
  bestModel <- gbm_varImp
}
cat("The best model is '", bestModel@model_id, 
    "' with AUC of ", max(AUC_gbm, AUC_gbm_varImp), 
    " vs ",  min(AUC_gbm, AUC_gbm_varImp), "\n" )
## The best model is ' my_awesome_GBM_varImp ' with AUC of  0.8956946  vs  0.8955178

Model Assessment

Now we use the test data to measure the performance of the trained model:

bestPerf <- h2o.performance(bestModel, test_frame)

perfDF <- melt(as.data.frame(bestPerf@metrics$thresholds_and_metric_scores), 
           id = "threshold")

Performance of the model on the test data:

as.data.frame(
    bestPerf@metrics[c("MSE", "RMSE", "AUC", "r2", "logloss", "Gini", "mean_per_class_error")]
    )
MSE RMSE AUC r2 logloss Gini mean_per_class_error
0.066 0.257 0.902 0.485 0.242 0.804 0.172

As you know, the model predicts the probability that a customer churns. In order to get a visual feel for the performance of the model, the distribution of the predictions is plotted against the true outcome.

Let's have a look at the prediction distribution for each class:

The horizontal axis shows the true values, i.e. the values of the Churn? column. The vertical axis shows the model's predicted churn probability. In principle we can define a threshold such that a customer with a higher predicted probability is identified as a churning customer, and one with a lower value as a loyal customer. This threshold could be a value like 0.5, or the value with the best accuracy, which is 0.3259223 in this case. We would like all the data points to be either in the top right (true positives (TP)) or the bottom left (true negatives (TN)). The data points in the top left and bottom right are misclassified.

I saw this kind of plot for the first time here.
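A minimal sketch of how such a plot can be produced, assuming the bestModel and test_frame objects from above (for this data set the positive-class probability column of the prediction is named True.):

# Predicted churn probability vs. the true label, one point per customer.
pred  <- as.data.frame(h2o.predict(bestModel, test_frame))
truth <- as.data.frame(test_frame[, "Churn?"])

plot_df <- data.frame(churn = truth[[1]], prob = pred[["True."]])
ggplot(data = plot_df, aes(x = churn, y = prob)) +
  geom_jitter(width = 0.2, alpha = 0.3)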

So, which customers are going to leave? As mentioned above, those whose predicted value is more than 0.5, or whatever other threshold we settle on. Here are some of the criteria and the maximum value we can achieve depending on which one we choose:

kable(bestPerf@metrics$max_criteria_and_metric_scores, digits = 3)
metric threshold value idx
max f1 0.237 0.725 111
max f2 0.122 0.767 147
max f0point5 0.459 0.785 74
max accuracy 0.326 0.921 96
max precision 0.993 1.000 0
max recall 0.003 1.000 394
max specificity 0.993 1.000 0
max absolute_mcc 0.237 0.679 111
max min_per_class_accuracy 0.061 0.844 202
max mean_per_class_accuracy 0.122 0.863 147

and here are the plots of a few of them against various thresholds.

scores_to_plot <- c("accuracy", "precision", "recall", "min_per_class_accuracy")

ggplot(data = perfDF[perfDF$variable %in% scores_to_plot, ],  
       aes(x     = threshold, 
           y     = value, 
           color = variable,
           group = variable)) + 
  geom_line() + geom_point() 


If we know the actual business costs of false positives and false negatives (or the benefits of true positives), we can make a better decision about which threshold to choose, as in the toy sketch below.
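As a toy sketch (the two costs below are made up, purely for illustration), one could pick the threshold minimizing the expected cost, using the per-threshold confusion counts that H2O reports:

# Toy sketch: choose the threshold that minimizes a hypothetical cost.
tm <- as.data.frame(bestPerf@metrics$thresholds_and_metric_scores)

cost_fp <- 10   # hypothetical cost of targeting a loyal customer
cost_fn <- 100  # hypothetical cost of missing a churning customer

tm$cost <- cost_fp * tm$fps + cost_fn * tm$fns
tm$threshold[which.min(tm$cost)]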

Interpretation of the Model

So far so good: we built the model and evaluated it. The algorithm we used above was GBM, which is a black-box model. That means it only gives us predictions of which customers are going to leave, i.e. the probabilities. The interpretation of the model, like the whys, is not that straightforward.

However, we would like to gain some more insight and be able to answer questions like: what are the reasons that customers are leaving, or why might a specific customer leave?

Variable Importance

In many situations not all the co-variables are equally important; some variables are more important than others. Using variable importance we can focus on the important ones.

h2o.varimp_plot(bestModel)

Variable Importance per Customer

The above variable importances were measured over the whole training data set and are therefore only valid on average. However, the importance of each variable might differ for each individual customer.

We can use the trained model to approximate the variable importance based on each individual customer.

In this approach, a prediction is first made on the data set. Then we leave one column out, make the prediction again, and measure the difference in the predicted probabilities. This difference indicates the importance of that variable for each customer; a naive sketch of the idea follows.
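A naive R-only sketch of this idea on a small sample (slow, and only an approximation of what rowVarImp.h2o below does; leaving a column "out" is emulated by setting it to missing, which H2O's GBM can handle):

# Naive sketch: per-customer change in predicted churn probability
# when each important variable is set to missing.
sample_h2o <- test_frame[1:100, ]
base_p <- as.data.frame(h2o.predict(bestModel, sample_h2o))[["True."]]

imp <- sapply(var.imp, function(col) {
  tmp <- as.data.frame(sample_h2o)
  tmp[[col]][] <- NA  # keep the column type, make all values missing
  p <- as.data.frame(h2o.predict(bestModel, as.h2o(tmp)))[["True."]]
  p - base_p          # one importance value per customer for this column
})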

Doing this at scale involves a fair bit of computation on the R side, like calculating the differences and ranking them within each row. Hence, in order to speed things up, I implemented a C++ routine (get_high_rank_values.cpp), which also needs to be sourced. Using the Rcpp package one can easily call C++ code from an R script.

require(Rcpp)
Rcpp::sourceCpp("get_high_rank_values.cpp")
source("rowVarImp.R")

df <- rowVarImp.h2o(model=bestModel, cols=var.imp, data_frame=test_frame, n=3)

Here, only the 3 most important variables for each customer are computed. For each of them, the table below shows the variable name (colName), its value for that customer (value) and the change in the predicted probability when the variable is left out (imp).

Knowing this information, the business unit can take measures tailored to each individual customer (rather than averaged over all customers, as with the general variable importance).

head(df)
colName_VI1 value_VI1 imp_VI1 colName_VI2 value_VI2 imp_VI2 colName_VI3 value_VI3 imp_VI3
Intl Mins 12.7 -0.095 Eve Charge 19.42 -0.059 Intl Charge 3.43 -0.037
Day Mins 187.7 -0.010 Intl Charge 2.46 -0.009 Eve Mins 163.4 -0.002
State IA -0.375 Intl Mins 13.1 -0.112 Eve Charge 26.11 -0.091
Intl Mins 10.6 -0.008 CustServ Calls 0 -0.004 Day Charge 26.37 -0.002
State CO -0.120 Intl Calls 6 -0.017 Intl Charge 1.54 -0.012
Day Mins 124.3 -0.053 Eve Charge 23.55 -0.052 CustServ Calls 3 -0.044

Partial Dependence Plots

In the previous section we talked about the importance of each variable, but we didn't answer the question of how changing a variable would change the outcome. However, it is very important to know what the impact on the outcome will be if we change the value of a variable.

This kind of question can be answered by so-called partial dependence plots, which depict the marginal effect of the variables. In the following, the partial dependence plots are shown for the 5 most important variables.

In order to calculate partial dependence we can use partialDependencePlot.R. However, since at least version 3.11.0.3717, h2o has the h2o.partialPlot function to calculate one-dimensional partial dependence. I like my own implementation better because of its functionality; however, h2o.partialPlot is much faster.

cols <- var.imp[1:5]
pdps <- h2o.partialPlot(bestModel, data_frame, cols, nbins = 100, plot = F)
mean.orig <- mean(h2o.predict(bestModel, data_frame)[,3])

lapply(1:length(cols), function(x){
  x_ <- pdps[[x]] 
  colnames(x_) <- c("variable", "diff")
  x_$diff <- x_$diff - mean.orig
  if(class(x_$variable) == 'factor' || class(x_$variable) == 'character'){
    # order based on diff
    x_ <- within(x_, variable <- factor(variable, levels=variable[order(-diff)]))
    p <- ggplot(data = x_, aes(x = variable, y = diff)) +
      geom_bar(stat="identity" )
  }else{
    p <- ggplot(data = x_, aes(x = variable, y = diff)) +
      geom_line()
  }
  p + ggtitle(cols[x])
})

The above plots only show the impact of a single variable while the others are kept constant. But in many cases, especially if there are interactions between variables, we are interested in the impact of changing two or more variables. For this we can also use partialDependencePlot.R. As an example, in the following we can see the impact of "Day Charge" for each state (the colored thin lines), and also on average (the thick black line). It can be seen that there is quite some variance among the states.

source("partialDependencePlot.R")
cols <- c("State", "Day Charge")
pdp1 <- partialDependencePlot(bestModel, cols, data_frame)

pdp2 <- partialDependencePlot(bestModel, "Day Charge", data_frame)

pdp1 <- melt(pdp1)
pdp2 <- melt(pdp2)
pdp2$State <- "AVG"

colnames(pdp1) <- c("State", "Day.Charge", "diff")
colnames(pdp2) <- c("Day.Charge", "diff", "State")
ggplot(data = pdp1, 
       aes(x     = Day.Charge, 
           y     = diff,   
           color = State,
           group = State)) + 
  geom_line() +
  geom_line(data = pdp2, size=3, color='black')

A Simplified Single Decision Tree

Alright, so far we have done quite a lot: we found the important variables and the impact of changing each variable on the outcome. However, we would also like to have the big picture, a high-level insight into the model.

For that, we can build a single decision tree. But we already know that a model built with a single decision tree wouldn't be that accurate.

In order to improve the accuracy of the single decision tree a little, we use it to approximate the GBM model instead of training it on the original data: we train the decision tree on the same data, but with the target variable replaced by the predictions of the GBM. The premise behind this, according to what a friend of mine, Armin Monecke, once told me, is that using the predictions of a GBM (or another complex model) in effect removes noise from the data, which makes it easier for a less complex model (like a single decision tree) to train.

Additionally, here we use only the rows that are predicted with high confidence, i.e. with probabilities close to 0 or 1.

Note: since we sample the data points, we need to take care of the prior distribution (e.g. by sampling with the same distribution as the target column, or by specifying it, for example, via a prior parameter of the algorithm).

denoise.h2o.df <- function(model, data_frame, destination_frame, lower.q=0.2, higher.q=0.8){
  # Removes the data points that cannot be classified significantly.
  # That means the data points for which the prediction is far away from 0 and 1, 
  # like those close to 0.5
  #
  # Args:
  #   model: A H2O model.
  #   data_frame: A H2O data frame.
  #   destination_frame: ID for the result frame.
  #   lower.q: The lower quantile, from which the data points will be ignored.
  #   higher.q: The higher quantile, to which the data points will be ignored.
  #
  # Returns:
  #   The resulting H2O data frame.
  
  pred   <- h2o.predict(model, data_frame)
  predCol <- names(pred)[3]
  quant  <- h2o.quantile(pred[,predCol], probs = c(lower.q, higher.q))
  lower  <- min(quant)
  higher <- max(quant)
  data_pred <- h2o.cbind(data_frame, pred)
  data_denoised <- data_pred[
    data_pred[,predCol] < lower | 
      data_pred[,predCol] > higher,]
  data_denoised <- h2o.assign(data_denoised, destination_frame) 
  return(data_denoised)  
}

denoised_train_df <- 
  denoise.h2o.df(
    model             = bestModel, 
    data_frame        = train_frame, 
    destination_frame = "denoised_train_frame", 
    lower.q           = 0.1,
    higher.q          = 0.9)

denoised_test_df <- 
  denoise.h2o.df(
    model             = bestModel, 
    data_frame        = test_frame, 
    destination_frame = "denoised_test_frame", 
    lower.q           = 0.1,
    higher.q          = 0.9)
train_df <- as.data.frame(denoised_train_df)
test_df  <- as.data.frame(test_frame)
test_df1 <- as.data.frame(denoised_test_df)

# Ignore the columns that are not used in training
tmp.df <- train_df[, -which(names(train_df)  %in% c("Churn.","False.","True.", "Phone" ))]
colnames(tmp.df)[colnames(tmp.df) == "predict"] <- "Churn."

counts <- as.data.frame(h2o.table(data_frame[,"Churn?"]))

priorDist <- counts$Count[1] / sum(counts$Count)
priorDist <- c(priorDist, 1 - priorDist)
               
tree <- rpart(
    formula  = Churn. ~ ., 
    data     = tmp.df,
    maxdepth = 5, 
    cp       = 0.02,
    parms    = list(prior = priorDist, 
                    split = "information"))

fancyRpartPlot(tree)

id <- which(!(test_df$State %in% levels(train_df$State)))
test_df$State[id] <- NA

Pred1 <- predict(tree, test_df)
test_df$Prediction1 <- Pred1[,"True."]
ROC1 <- roc(test_df$Churn., test_df$Prediction1)
ROC1$auc
## Area under the curve: 0.6869

Conclusion

The purpose of this article was to show how we can go through a data science task for a classification problem using H2O, one of the most powerful open-source libraries for big data. Its speed and performance allow us to accomplish analytics tasks on huge amounts of data, like:

  • building many models,
  • parameter tuning,
  • variable importance at customer level,
  • (low dimensional) partial dependence analysis.

For example, note that for a two-dimensional partial dependence analysis with only 20 points per dimension, we need to perform prediction over the whole data set 400 (i.e. 20 × 20) times.

Acknowledgement

I would like to thank Jan Simen for his comments that improved the content of this post.

Disclaimer

I am not affiliated, associated, authorized, endorsed by, or in any way officially connected with h2o.ai. I just enjoy using this product.



Published

24 November 2016
