Data science with H2O A classification problem: Telco Customer Churn
A note on reproducing the content of this article
The r-markdown file of this article (and some other useful examples) can be found here: https://github.com/a-ghorbani/h2o_examples
You will also find a Dockerfile there, which contains all the instructions necessary to run this example, such as installing H2O, RStudio, and the required R packages.
You can pull the Docker image and run it with the following command:
> docker run -p 8787:8787 -p 54321:54321 -p 8080:8080 aghorbani/rstudio-h2o
Then you should be able to browse to http://localhost:8787, where you get a web-based RStudio. After logging in with user guest and password guest, there should be a folder named h2o_examples-master. In this folder you can open Telco_Customer_Churn_H2O.Rmd, and … yoohoo, run it.
Introduction
This is a data science example using H2O (h2o.ai) to investigate a classification problem, namely customer churn.
H2O is open-source machine learning software for big data, first released in 2011 by the start-up H2O.ai. Its performance makes a variety of data science analyses feasible: fitting hundreds of models, fast prediction on large amounts of data, and many other interesting things, some of which we will see in this article.
The example I chose to investigate is classifying a telecom company's customers by whether they will stay or leave. Many companies want to know when, why, and which customers are going to leave; with this information they can take appropriate action.
I chose churn analysis here just as an example: any other binary classification problem can be handled the same way, and multi-class classification requires only some adjustments.
I am going to cover the following analyses:
- prediction of customer churn probability using gradient boosting machine (GBM),
- parameter tuning using Bayesian optimization,
- interpretation of the model:
- identifying the co-variables that are most important for the analysis,
- identifying the most important co-variables responsible for the churn probability of each individual customer,
- constructing a single decision tree (to interpret the black-box GBM model),
- (partial) dependence of the churn probability on each co-variable.
The purpose of this note is to show how the above-mentioned analyses can be carried out using H2O.
Required packages
The following packages are necessary for this analysis.
#=========================================
# TODO: remove unnecessary packages
#=========================================
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(partykit)
library(caret)
library(party)
library(rBayesianOptimization)
require(readr)
require(data.table)
require(reshape2)
library(pROC)
require(ggplot2)
require(h2o)
Init H2O (connect to a running H2O cluster)
We need to download the H2O package if we haven't installed it before. At the time of writing, the latest version was 3.11.0.3717, which I downloaded from here. After downloading, we need to unzip it.
There are different ways to start H2O cluster. I usually prefer to start it from command line. For more details on how to start H2O from command line see here.
> java -jar h2o.jar -nthreads 2 -ice_root ./tmp -port 54321 -name AwesomeCloud
Additionally, you need to install the H2O R API package in R (or, if you would like to use H2O from Python, the Python API; H2O supports various language APIs). Here I use R to connect to the H2O cluster, so we need to install its R API.
install.packages("path_to_h2o_dir/h2o-3.11.0.3717/R/h2o_3.11.0.3717.tar.gz", repos=NULL, type = "source")
Finally, use h2o.init() to connect to the running H2O cluster (in our case with only one node).
# Connect to h2o cluster
h2o.init(port = 54321)
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 days 21 hours
## H2O cluster version: 3.11.0.3719
## H2O cluster version age: 3 days
## H2O cluster name: H2O_started_from_R_valar_gde774
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.37 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## R Version: R version 3.2.3 (2015-12-10)
# Remove all objects in the h2o cluster.
h2o.removeAll()
## [1] 0
Load the data into h2o
In the example here, we use the data from http://www.dataminingconsultant.com/data/churn.txt. The description of the columns can be found here.
data_frame <-
h2o.importFile(
path = "http://www.dataminingconsultant.com/data/churn.txt",
sep = ",",
destination_frame = "data_frame")
Let’s have a look at the first few rows of the data we just loaded:
State | Account Length | Area Code | Phone | Int’l Plan | VMail Plan | VMail Message | Day Mins | Day Calls | Day Charge | Eve Mins | Eve Calls | Eve Charge | Night Mins | Night Calls | Night Charge | Intl Mins | Intl Calls | Intl Charge | CustServ Calls | Churn? |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False. |
OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False. |
NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False. |
OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False. |
OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False. |
We can use h2o.describe to obtain some basic statistics about the variables, such as mean, min, max, number of missing values, etc.
h2o.describe(data_frame)
Label | Type | Missing | Zeros | PosInf | NegInf | Min | Max | Mean | Sigma | Cardinality |
---|---|---|---|---|---|---|---|---|---|---|
State | enum | 0 | 52 | 0 | 0 | 0 | 50 | NA | NA | 51 |
Account Length | int | 0 | 0 | 0 | 0 | 1 | 243 | 101.064806480648 | 39.8221059285957 | NA |
Area Code | int | 0 | 0 | 0 | 0 | 408 | 510 | 437.182418241824 | 42.3712904856066 | NA |
Phone | string | 0 | 0 | 0 | 0 | NaN | NaN | NA | NA | NA |
Int’l Plan | enum | 0 | 3010 | 0 | 0 | 0 | 1 | 0.0969096909690969 | 0.295879145484415 | 2 |
VMail Plan | enum | 0 | 2411 | 0 | 0 | 0 | 1 | 0.276627662766277 | 0.447397870380064 | 2 |
VMail Message | int | 0 | 2411 | 0 | 0 | 0 | 51 | 8.09900990099009 | 13.6883653720386 | NA |
Day Mins | real | 0 | 2 | 0 | 0 | 0 | 350.8 | 179.775097509751 | 54.4673892023715 | NA |
Day Calls | int | 0 | 2 | 0 | 0 | 0 | 165 | 100.435643564356 | 20.0690842073009 | NA |
Day Charge | real | 0 | 2 | 0 | 0 | 0 | 59.64 | 30.5623072307231 | 9.2594345539305 | NA |
Eve Mins | real | 0 | 1 | 0 | 0 | 0 | 363.7 | 200.980348034804 | 50.713844425812 | NA |
Eve Calls | int | 0 | 1 | 0 | 0 | 0 | 170 | 100.114311431143 | 19.9226252939431 | NA |
Eve Charge | real | 0 | 1 | 0 | 0 | 0 | 30.91 | 17.0835403540353 | 4.31066764311035 | NA |
Night Mins | real | 0 | 0 | 0 | 0 | 23.2 | 395 | 200.87203720372 | 50.5738470136584 | NA |
Night Calls | int | 0 | 0 | 0 | 0 | 33 | 175 | 100.107710771077 | 19.5686093460585 | NA |
Night Charge | real | 0 | 0 | 0 | 0 | 1.04 | 17.77 | 9.03932493249326 | 2.27587283766003 | NA |
Intl Mins | real | 0 | 18 | 0 | 0 | 0 | 20 | 10.2372937293729 | 2.79183954840842 | NA |
Intl Calls | int | 0 | 18 | 0 | 0 | 0 | 20 | 4.47944794479448 | 2.4612142705461 | NA |
Intl Charge | real | 0 | 18 | 0 | 0 | 0 | 5.4 | 2.76458145814581 | 0.753772612663045 | NA |
CustServ Calls | int | 0 | 697 | 0 | 0 | 0 | 9 | 1.56285628562856 | 1.31549104486648 | NA |
Churn? | enum | 0 | 2850 | 0 | 0 | 0 | 1 | 0.144914491449145 | 0.352067423624126 | 2 |
The column Churn? specifies whether the customer has left the plan or not. For the exact meaning of the other columns see here.
To build and assess the model, we split the data into training, validation, and test sets. I assume the analysis here will be applied to a large data set; that is not actually true in this example, since we have only 3333 rows, which is not that big. For such small data it is better to split the data only into training and test sets, and to use cross-validation for model selection and hyper-parameter tuning.
Since I assume the example will be applied to a very large data set (otherwise why would anybody use H2O ;)), the data is split into training, validation, and test sets with ratios of 50%, 25%, and 25%, respectively.
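For the small-data route mentioned above, H2O models can cross-validate directly via the nfolds argument instead of using a separate validation frame. A minimal sketch (in this article we use the train/validation/test split instead, so this model is not used later):

```r
# Cross-validation alternative for small data sets: let H2O build
# nfolds CV models internally. Target/predictor setup mirrors what
# is done later in this article.
y <- "Churn?"
x <- setdiff(names(data_frame), y)
gbm_cv <- h2o.gbm(x = x, y = y,
                  training_frame = data_frame,  # no validation split needed
                  nfolds = 5, seed = 2016)
h2o.auc(gbm_cv, xval = TRUE)  # cross-validated AUC
```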
split_df <-
h2o.splitFrame(
data = data_frame,
ratios = c(0.5,0.25),
destination_frames = c("train_frame", "valid_frame", "test_frame"),
seed = 2016)
train_frame <- split_df[[1]]
valid_frame <- split_df[[2]]
test_frame <- split_df[[3]]
Building the model
First, the target variable Churn? and the co-variables (all remaining variables) are defined.
y <- "Churn?"
x <- setdiff(names(data_frame), c(y))
Building a model using GBM
Let's first build a model using the gradient boosting machine (GBM) with arbitrarily chosen hyper-parameters. GBM is a machine learning algorithm for regression and classification problems. It builds a model in the form of an ensemble of weak prediction models, typically decision trees. For more details refer to this, this and this.
gbm.1 <- h2o.gbm(
x = x,
y = y,
training_frame = train_frame,
validation_frame = valid_frame,
ntrees = 100,
max_depth = 3,
learn_rate = 0.2)
In the above example we used train_frame to train the model and valid_frame to evaluate its performance. We can see the performance in the following:
h2o.performance(gbm.1, valid = T)
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 0.05118193
## RMSE: 0.2262342
## LogLoss: 0.2170323
## Mean Per-Class Error: 0.126074
## AUC: 0.8941919
## Gini: 0.7883838
##
## Confusion Matrix for F1-optimal threshold:
## False. True. Error Rate
## False. 689 24 0.033661 =24/713
## True. 26 93 0.218487 =26/119
## Totals 715 117 0.060096 =50/832
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.226842 0.788136 113
## 2 max f2 0.156576 0.789902 132
## 3 max f0point5 0.530581 0.843373 70
## 4 max accuracy 0.226842 0.939904 113
## 5 max precision 0.999967 1.000000 0
## 6 max recall 0.000327 1.000000 399
## 7 max specificity 0.999967 1.000000 0
## 8 max absolute_mcc 0.226842 0.753161 113
## 9 max min_per_class_accuracy 0.068766 0.857143 178
## 10 max mean_per_class_accuracy 0.156576 0.878811 132
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Each algorithm has a number of hyper-parameters to set; in the case of GBM, for example, the number of trees and the learning rate. As always, one of the important questions is which values to choose for these hyper-parameters. There are different ways to find good values, such as grid search, random search, or other general optimization algorithms.
We can use Bayesian optimization to approximate the optimal hyper-parameters.
Bayesian Optimization
H2O supports grid search and random search. Grid search is a kind of brute force (its cost grows exponentially with the number of parameters), while random search, as its name suggests, samples the hyper-parameter domain randomly.
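For comparison, H2O's built-in random search can be set up with h2o.grid. The following is a sketch; the grid id, parameter ranges, and budget of 20 models are arbitrary choices, not values used in this article:

```r
# Random search over GBM hyper-parameters with h2o.grid.
# Ranges and the 20-model budget are illustrative.
hyper_params <- list(
  max_depth       = 2:8,
  learn_rate      = c(0.01, 0.05, 0.1, 0.2),
  sample_rate     = c(0.4, 0.6, 0.8, 1.0),
  col_sample_rate = c(0.4, 0.6, 0.8, 1.0))
search_criteria <- list(
  strategy   = "RandomDiscrete",
  max_models = 20,
  seed       = 2016)
grid <- h2o.grid(
  algorithm        = "gbm",
  grid_id          = "gbm_random_grid",
  x = x, y = y,
  training_frame   = train_frame,
  validation_frame = valid_frame,
  hyper_params     = hyper_params,
  search_criteria  = search_criteria)
# Models sorted by validation AUC:
h2o.getGrid("gbm_random_grid", sort_by = "auc", decreasing = TRUE)
```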
Bayesian optimization (BO) is used for automatic tuning of hyper-parameters. In this approach, the information obtained from previous function evaluations (i.e. building and evaluating models with various hyper-parameters) is used to choose the next hyper-parameters to try. BO builds a probabilistic model of the objective function, which here is the performance of the model. For more information have a look here and here.
I wish the H2O team had implemented Bayesian optimization. But no worries: there are plenty of Bayesian optimization implementations available in R, Python, C++, etc. The heavy-lifting part, the function evaluations (i.e. building the models), is still carried out by H2O.
That said, there is some debate about whether BO is worth the effort, e.g. see here. Anyway, I think it is at least worth showing how we can combine BO with H2O.
In the following I show how the rBayesianOptimization package in R can be used to find the optimal hyper-parameters.
The function h2o_bayes is defined as a wrapper around h2o.gbm; it builds and evaluates a GBM model. All the parameters to be optimized are arguments of this function. The validation data set is used to score each model; for a small data set (as in this example), however, it is usually better to score via cross-validation instead of a validation set, to reduce the variance of the model score.
#============================
# Define the wrapper function
#============================
h2o_bayes <- function(
max_depth, learn_rate, sample_rate,
col_sample_rate, balance_classes){
bal.cl <- as.logical(balance_classes)
gbm <- h2o.gbm(
x = x,
y = y,
training_frame = train_frame,
validation_frame = valid_frame,
#nfolds = 3,
ntrees = 900,
max_depth = max_depth,
learn_rate = learn_rate,
sample_rate = sample_rate,
col_sample_rate = col_sample_rate,
score_tree_interval = 5,
stopping_rounds = 2,
stopping_metric = "logloss",
stopping_tolerance = 0.005,
balance_classes = bal.cl)
score <- h2o.auc(gbm, valid = T)
list(Score = score,
Pred = 0)
}
#============================
# Find optimal values for the
# parameters in the given range.
#============================
OPT_Res <- BayesianOptimization(
h2o_bayes,
bounds = list(
max_depth = c(2L, 8L),
learn_rate = c(1e-4, 0.2),
sample_rate = c(0.4, 1),
col_sample_rate = c(0.4, 1),
balance_classes = c(0L, 1L)),
init_points = 10, n_iter = 10,
acq = "ucb", kappa = 2.576, eps = 0.0,
verbose = FALSE)
##
## Best Parameters Found:
## Round = 10 max_depth = 6.0000 learn_rate = 0.1223 sample_rate = 0.7886 col_sample_rate = 0.4054 balance_classes = 0.0000 Value = 0.9001
Building the model using optimal parameters
After searching for the best hyper-parameters with the Bayesian optimization algorithm, we use the values found above to train a GBM model.
gbm <- h2o.gbm(
x = x,
y = y,
training_frame = train_frame,
validation_frame = valid_frame,
ntrees = 900,
max_depth = OPT_Res$Best_Par["max_depth"],
learn_rate = OPT_Res$Best_Par["learn_rate"],
sample_rate = OPT_Res$Best_Par["sample_rate"],
col_sample_rate = OPT_Res$Best_Par["col_sample_rate"],
balance_classes = as.logical(OPT_Res$Best_Par["balance_classes"]),
score_tree_interval = 5,
stopping_rounds = 2,
stopping_metric = "logloss",
stopping_tolerance = 0.005,
model_id = "my_awesome_GBM")
Train a model using only important co-variables
In the previous section we used all the columns to train a GBM model. Removing irrelevant data from the training data often improves the results, for various reasons. Even when it doesn't improve the result, it reduces the resources needed for the computation, e.g. CPU time and memory.
Now we remove the variables that are not important. Note that the variable importance is obtained from the previous calculation.
var.imp <- h2o.varimp(gbm)[h2o.varimp(gbm)$scaled_importance > 0.05, "variable"]
# The value of 0.05 is arbitrary, you might want to use other values.
setdiff(x, var.imp)
## [1] "Account Length" "Area Code" "Phone" "Day Calls"
## [5] "Eve Calls" "Night Calls"
gbm_varImp <- h2o.gbm(
x = var.imp,
y = y,
training_frame = train_frame,
validation_frame = valid_frame,
ntrees = 900,
max_depth = OPT_Res$Best_Par["max_depth"],
learn_rate = OPT_Res$Best_Par["learn_rate"],
sample_rate = OPT_Res$Best_Par["sample_rate"],
col_sample_rate = OPT_Res$Best_Par["col_sample_rate"],
balance_classes = as.logical(OPT_Res$Best_Par["balance_classes"]),
score_tree_interval = 5,
stopping_rounds = 2,
stopping_metric = "logloss",
stopping_tolerance = 0.005,
model_id = "my_awesome_GBM_varImp")
Compare the two models
The two models are compared in the plots below in terms of RMSE and AUC. It can be seen that using only the important variables does not reduce the performance of the model significantly, if at all.
shist <- gbm@model$scoring_history[, c("duration", "validation_rmse", "validation_auc")]
shist$algorithm <- "GBM"
scoring_history <- shist
shist <- gbm_varImp@model$scoring_history[, c("duration", "validation_rmse", "validation_auc")]
shist$algorithm <- "GBM with var.imp."
scoring_history <- rbind(scoring_history,shist)
scoring_history$duration <- as.numeric(
gsub("sec", "", scoring_history$duration))
scoring_history <- melt(scoring_history, id = c("duration", "algorithm"))
ggplot(data = scoring_history,
aes(x = duration,
y = value,
color = algorithm,
group = algorithm)) +
geom_line() + geom_point() +
facet_grid(. ~ variable, scales = "free",shrink = TRUE,space = "free")
Choose the best model based on the AUC measure (you could choose mse, logloss, or other measures that you find more appropriate in your case).
AUC_gbm <- h2o.performance(gbm, valid = T)@metrics$AUC
AUC_gbm_varImp <- h2o.performance(gbm_varImp, valid = T)@metrics$AUC
if(AUC_gbm > AUC_gbm_varImp){
bestModel <- gbm
}else{
bestModel <- gbm_varImp
}
cat("The best model is '", bestModel@model_id,
"' with AUC of ", max(AUC_gbm, AUC_gbm_varImp),
" vs ", min(AUC_gbm, AUC_gbm_varImp), "\n" )
## The best model is ' my_awesome_GBM_varImp ' with AUC of 0.8956946 vs 0.8955178
Model Assessment
Now we use the test data to measure the performance of the trained model:
bestPerf <- h2o.performance(bestModel, test_frame)
perfDF <- melt(as.data.frame(bestPerf@metrics$thresholds_and_metric_scores),
id = "threshold")
Performance of the model on the test data:
as.data.frame(
bestPerf@metrics[c("MSE", "RMSE", "AUC", "r2", "logloss", "Gini", "mean_per_class_error")]
)
MSE | RMSE | AUC | r2 | logloss | Gini | mean_per_class_error |
---|---|---|---|---|---|---|
0.066 | 0.257 | 0.902 | 0.485 | 0.242 | 0.804 | 0.172 |
As you know, the model predicts the probability of customer churn. To get a visual feel for the performance of the model, the distribution of the predictions is plotted against the true outcome.
Let's have a look at the prediction distribution for each class:
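The plot can be produced, for example, like this (a sketch; geom_jitter is just one reasonable choice for showing the per-class distribution):

```r
# Predicted churn probability vs. the true label, one point per customer.
# Column 3 of h2o.predict is the probability of the "True." class.
pred_df <- data.frame(
  actual     = as.vector(test_frame[, "Churn?"]),
  churn_prob = as.vector(h2o.predict(bestModel, test_frame)[, 3]))
ggplot(pred_df, aes(x = actual, y = churn_prob, color = actual)) +
  geom_jitter(width = 0.3, alpha = 0.5) +
  labs(x = "True value of Churn?", y = "Predicted churn probability")
```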
The horizontal axis shows the true values, i.e. the values of the Churn? column. The vertical axis shows the model's predicted churn probability. In principle we can define a threshold such that a customer with a higher value is identified as a churned customer and one with a lower value as a loyal customer. This threshold could be a value like 0.5, or the value with the best accuracy, which is 0.3259223 in this case. We would like all the data points to be either in the top right (true positives, TP) or bottom left (true negatives, TN). The data points in the top left and bottom right are misclassified.
I have seen this kind of plot for the first time here.
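Turning the predicted probabilities into class labels with such a threshold is essentially a one-liner; a minimal sketch, using the accuracy-optimal threshold reported above:

```r
# Label a customer as churned when the predicted probability exceeds
# the chosen threshold (0.3259223 is the max-accuracy threshold above).
classify_churn <- function(probs, threshold = 0.3259223) {
  ifelse(probs > threshold, "True.", "False.")
}
classify_churn(c(0.05, 0.50, 0.90))
# "False." "True."  "True."
```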
So, which customers are going to leave? As mentioned above, those whose predicted value exceeds 0.5, or a threshold based on some other criterion. Here are some of the criteria and the maximum values we can get depending on which criterion we choose:
kable(bestPerf@metrics$max_criteria_and_metric_scores, digits = 3)
metric | threshold | value | idx |
---|---|---|---|
max f1 | 0.237 | 0.725 | 111 |
max f2 | 0.122 | 0.767 | 147 |
max f0point5 | 0.459 | 0.785 | 74 |
max accuracy | 0.326 | 0.921 | 96 |
max precision | 0.993 | 1.000 | 0 |
max recall | 0.003 | 1.000 | 394 |
max specificity | 0.993 | 1.000 | 0 |
max absolute_mcc | 0.237 | 0.679 | 111 |
max min_per_class_accuracy | 0.061 | 0.844 | 202 |
max mean_per_class_accuracy | 0.122 | 0.863 | 147 |
and here are plots of a few of them against the threshold.
scores_to_plot <- c("accuracy", "precision", "recall", "min_per_class_accuracy")
ggplot(data = perfDF[perfDF$variable %in% scores_to_plot, ],
aes(x = threshold,
y = value,
color = variable,
group = variable)) +
geom_line() + geom_point()
#kable(bestPerf@metrics, digits = 3)
If we know the actual business costs of false positives and false negatives (or the benefits of true positives), we can make a better decision on which threshold to choose.
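As a sketch of how that decision could be made: the thresholds_and_metric_scores table of the performance object also contains the counts of false positives and false negatives at every threshold (columns fps and fns in recent h2o versions), so an expected cost can be minimized directly. The cost values below are purely illustrative.

```r
# Pick the threshold with minimal expected cost, given per-error costs.
best_cost_threshold <- function(thresholds, fps, fns,
                                cost_fp = 10, cost_fn = 100) {
  cost <- fps * cost_fp + fns * cost_fn
  thresholds[which.min(cost)]
}
# ts <- as.data.frame(bestPerf@metrics$thresholds_and_metric_scores)
# best_cost_threshold(ts$threshold, ts$fps, ts$fns)
```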
Interpretation of the Model
So far so good: we built the model and evaluated it. The algorithm we used, GBM, is a black-box model. It only gives us predictions of which customers are going to leave, i.e. the probabilities. Interpreting the model, the whys, is not that straightforward.
However, we would like to gain more insight and be able to answer questions like: what are the reasons customers are leaving, or why a specific customer might leave?
Variable Importance
In many situations not all co-variables are equally important; some variables matter more than others. Using variable importance we can focus on the important ones.
h2o.varimp_plot(bestModel)
Variable Importance per Customer
The variable importances above are measured on the whole training data set and are therefore only valid on average. The importance of each variable can differ for each individual customer.
We can use the trained model to approximate the variable importance for each individual customer.
In this approach, a prediction is made on the data set; then one column is left out, the prediction is made again, and the difference between the two probabilities is measured. This difference indicates the importance of that variable for each customer.
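The idea can be sketched in plain R as follows. This is a conceptual toy version, not the implementation used below; it assumes a prediction function that tolerates NA in a column:

```r
# For each row, find the column whose removal (here: set to NA)
# changes the prediction the most.
row_var_imp_sketch <- function(predict_fun, newdata, cols) {
  base  <- predict_fun(newdata)
  diffs <- sapply(cols, function(col) {
    tmp <- newdata
    tmp[[col]] <- NA            # "leave the column out"
    predict_fun(tmp) - base     # change in prediction per row
  })
  diffs <- matrix(diffs, ncol = length(cols))
  cols[max.col(abs(diffs))]    # most influential column per row
}
```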
This requires a bit of computation on the R side, such as computing differences and ranking each row based on them. To speed things up, I implemented a C++ function (get_high_rank_values.cpp), which also needs to be sourced. Using the Rcpp package one can easily use C++ code in an R script.
require(Rcpp)
Rcpp::sourceCpp("get_high_rank_values.cpp")
source("rowVarImp.R")
df <- rowVarImp.h2o(model=bestModel, cols=var.imp, data_frame=test_frame, n=3)
Here only the 3 most important variables for each customer are calculated. For each of the three variables, the table shows the variable name, its value, and the change in the predicted probability when that variable is left out.
With this information the business unit can take measures tailored to each individual customer (rather than averaged over all customers, as with the general variable importance).
head(df)
colName_VI1 | value_VI1 | imp_VI1 | colName_VI2 | value_VI2 | imp_VI2 | colName_VI3 | value_VI3 | imp_VI3 |
---|---|---|---|---|---|---|---|---|
Intl Mins | 12.7 | -0.095 | Eve Charge | 19.42 | -0.059 | Intl Charge | 3.43 | -0.037 |
Day Mins | 187.7 | -0.010 | Intl Charge | 2.46 | -0.009 | Eve Mins | 163.4 | -0.002 |
State | IA | -0.375 | Intl Mins | 13.1 | -0.112 | Eve Charge | 26.11 | -0.091 |
Intl Mins | 10.6 | -0.008 | CustServ Calls | 0 | -0.004 | Day Charge | 26.37 | -0.002 |
State | CO | -0.120 | Intl Calls | 6 | -0.017 | Intl Charge | 1.54 | -0.012 |
Day Mins | 124.3 | -0.053 | Eve Charge | 23.55 | -0.052 | CustServ Calls | 3 | -0.044 |
Partial Dependence Plots
In the previous section we discussed the importance of each variable, but we did not answer the question of how changing a variable changes the outcome.
It is, however, very important to know what impact changing the value of a variable has on the outcome.
This kind of question can be answered by so-called partial dependence plots, which depict the marginal effect of the variables. Below, partial dependence plots are shown for the 5 most important variables.
To calculate partial dependence we can use partialDependencePlot.R. However, since at least version 3.11.0.3717, h2o provides the h2o.partialPlot function to calculate one-dimensional partial dependence. I like my own implementation better because of its functionality, but h2o.partialPlot is much faster.
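Conceptually, one-dimensional partial dependence boils down to a few lines; a toy sketch (not the h2o implementation):

```r
# Partial dependence of the prediction on `col`: for each candidate
# value v, set the whole column to v, predict, and average.
partial_dependence_sketch <- function(predict_fun, data, col, values) {
  sapply(values, function(v) {
    d <- data
    d[[col]] <- v
    mean(predict_fun(d))
  })
}
```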
cols <- var.imp[1:5]
pdps <- h2o.partialPlot(bestModel, data_frame, cols, nbins = 100, plot = F)
mean.orig <- mean(h2o.predict(bestModel, data_frame)[,3])
lapply(1:length(cols), function(x){
x_ <- pdps[[x]]
colnames(x_) <- c("variable", "diff")
x_$diff <- x_$diff - mean.orig
if(class(x_$variable) == 'factor' || class(x_$variable) == 'character'){
# order based on diff
x_ <- within(x_, variable <- factor(variable, levels=variable[order(-diff)]))
p <- ggplot(data = x_, aes(x = variable, y = diff)) +
geom_bar(stat="identity" )
}else{
p <- ggplot(data = x_, aes(x = variable, y = diff)) +
geom_line()
}
p + ggtitle(cols[x])
})
The above plots show the impact of a single variable only, with the others held constant. In many cases, especially when variables interact, we want to know the impact of changing two or more variables together. For this too we can use partialDependencePlot.R. As an example, below we see the impact of "Day Charge" for each state (the thin colored lines) and on average (the thick black line). There is clearly quite some variance among the states.
source("partialDependencePlot.R")
cols <- c("State", "Day Charge")
pdp1 <- partialDependencePlot(bestModel, cols, data_frame)
pdp2 <- partialDependencePlot(bestModel, "Day Charge", data_frame)
pdp1 <- melt(pdp1)
pdp2 <- melt(pdp2)
pdp2$State <- "AVG"
colnames(pdp1) <- c("State", "Day.Charge", "diff")
colnames(pdp2) <- c("Day.Charge", "diff", "State")
ggplot(data = pdp1,
aes(x = Day.Charge,
y = diff,
color = State,
group = State)) +
geom_line() +
geom_line(data = pdp2, size=3, color='black')
A Simplified Single Decision Tree
Alright, we have done quite a lot so far: we found the important variables and the impact of changing each variable on the outcome. Still, we would like to have the big picture, a high-level insight into the model.
For that, we can build a single decision tree. But we already know that a model consisting of a single decision tree would not be very accurate.
To improve the accuracy of the single decision tree a little, we use it to approximate the GBM model instead of training it on the original data: we train the decision tree on the same data, but with the target variable replaced by the predictions of the GBM. The premise behind this, as a friend of mine, Armin Monecke, once told me, is that using the predictions of the GBM (or another complex model) effectively removes noise from the data, making it easier for a less complex model (such as a single decision tree) to learn.
Additionally, here, we use only rows that are predicted with high confidence, i.e. probabilities close to 0 and 1.
Note: since we sample the data points, we need to take care of the prior distribution (e.g. by sampling with the same distribution as the target column, or by specifying it explicitly, for example via the prior parameter of the algorithm).
denoise.h2o.df <- function(model, data_frame, destination_frame, lower.q=0.2, higher.q=0.8){
# Removes the data points that cannot be classified significantly.
# That means the data points for which the prediction is far away from 0 and 1,
# like those close to 0.5
#
# Args:
# model: A H2O model.
# data_frame: A H2O data frame.
# destination_frame: ID for the result frame.
# lower.q: The lower quantile of the predictions.
# higher.q: The upper quantile; data points with predictions between
# the two quantiles are dropped.
#
# Returns:
# The resulting H2O data frame.
pred <- h2o.predict(model, data_frame)
predCol <- names(pred)[3]
quant <- h2o.quantile(pred[,predCol], probs = c(lower.q, higher.q))
lower <- min(quant)
higher <- max(quant)
data_pred <- h2o.cbind(data_frame, pred)
data_denoised <- data_pred[
data_pred[,predCol] < lower |
data_pred[,predCol] > higher,]
data_denoised <- h2o.assign(data_denoised, destination_frame)
return(data_denoised)
}
denoised_train_df <-
denoise.h2o.df(
model = bestModel,
data_frame = train_frame,
destination_frame = "denoised_train_frame",
lower.q = 0.1,
higher.q = 0.9)
denoised_test_df <-
denoise.h2o.df(
model = bestModel,
data_frame = test_frame,
destination_frame = "denoised_test_frame",
lower.q = 0.1,
higher.q = 0.9)
train_df <- as.data.frame(denoised_train_df)
test_df <- as.data.frame(test_frame)
test_df1 <- as.data.frame(denoised_test_df)
# Ignore the columns that are not used in training
tmp.df <- train_df[, -which(names(train_df) %in% c("Churn.","False.","True.", "Phone" ))]
colnames(tmp.df)[colnames(tmp.df) == "predict"] <- "Churn."
counts <- as.data.frame(h2o.table(data_frame[,"Churn?"]))
priorDist <- counts$Count[1] / sum(counts$Count)
priorDist <- c(priorDist, 1 - priorDist)
tree <- rpart(
formula = Churn. ~ .,
data = tmp.df,
maxdepth = 5,
cp = 0.02,
parms = list(prior = priorDist,
split = "information"))
fancyRpartPlot(tree)
id <- which(!(test_df$State %in% levels(train_df$State)))
test_df$State[id] <- NA
Pred1 <- predict(tree, test_df)
test_df$Prediction1 <- Pred1[,"True."]
ROC1 <- roc(test_df$Churn., test_df$Prediction1)
ROC1$auc
## Area under the curve: 0.6869
Conclusion
The purpose of this article was to show how to work through a data science classification problem using H2O, one of the most powerful open-source libraries for big data. Its speed and performance make it possible to run analytics tasks on huge amounts of data, such as:
- building many models,
- parameter tuning,
- variable importance at customer level,
- (low dimensional) partial dependence analysis.
For example, note that a two-dimensional partial dependence analysis with only 20 points per dimension requires predicting over the whole data set 400 times (i.e. 20×20).
Acknowledgement
I would like to thank Jan Simen for his comments that improved the content of this post.
Disclaimer
I am not affiliated, associated, authorized, endorsed by, or in any way officially connected with h2o.ai. I just enjoy using this product.