Deploy XGBoost Model as SQL Query

Chengjun Hou, Abhishek Bishoyi

2019-03-08

You fit a boosting tree model in R with your favourite package xgboost, the validation results looks great, so next question comes up as how to deploy this model into production so that others could utilize it to help with the business. Incorporating the model into a shiny app would certainly be a good idea, but sometimes the model needs to be integrated into some other systems that the company is heavily relied on. Plus moving large amount of data between database and R could be time and memory consuming. So we propose R package xgb2sql enabling in-database scoring of XGBoost models built in R by translating trained model objects into SQL query.

CRAN Task View: Model Deployment with R categorizes the process of deploying models to various environments for scoring or inferencing on new data into two categories. The first category is Deployment through Different Types of Artifacts, which basically means exporting the model as an object, then using supported software/platform to consume this object scoring out the model predictions. The other category is Deployment through Cloud/Server, which includes a). providing an R interface to third-party managed services such as Google Cloud Machine Learning Engine; b). turning R code into web API and opening service on the server. Our approach provides SQL query producing model predictions, which can be taken as a combination of the model itself plus the scoring process. The output SQL query can be treated as an artifact, but we can easily set up service for it on the database server.

The SQL query generated by this tool is basic enough to be compatible with all SQL-based database products and services. Other than this tool, there are two R packages providing modeling and predicting capability inside database:

Here is the outline for the rest of this vignette:

Prepare Data in Both R and Database

As we know, xgboost only consumes numeric input for its model fitting function 1. So after transferring raw table in database to R as a data.frame/data.table, same one-hot encoding needs to be performed on both the table and the data.frame/data.table. Here we have function onehot2sql() to perform one-hot encoding on the training data in R, producing at the same the SQL query performing the exact transformation for the raw table in database. Let’s start with loading the sample dataset from ggplot2:

library(data.table)
library(xgboost)
library(xgb2sql)
df <- data.frame(ggplot2::diamonds)
head(df)
#>   carat       cut color clarity depth table price    x    y    z
#> 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#> 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#> 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
#> 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

Funtion onehot2sql() is built upon base R functions model.frame() and model.matrix(). Other than consuming a data.frame as input, this function has been optimized to work with a data.table with greater efficiency. It outputs a matrix ready for model fitting following rules listed below:

  1. The function treats any non-numeric columns, i.e., columns with class not being numeric and integer, as categorical and performs one-hot encoding for them.

  2. The one-hot encoding doesn’t remove one feature-level combination for each categorical feature, as model.matrix does in order to avoid issues caused by multicollinearity. Although the output matrix conveys same amount of information with and without one feature-level combination removed, thus producing similar model performance, system knowledge gained along this modeling practice is very different. If let’s say the “cut” of diamonds being “Ideal” has a huge impact on its price as the target/response, removing binary column “cut.Ideal” in the output matrix would result in the predictive power of “cut.Ideal” being scattered among other “cut” columns 2. The model performance would be comparative, but we would miss the information that “cut” being “Ideal” is the dominate factor of price, by studying the variable importance. So as multicollinearity wouldn’t be a problem for tree-based model 3, we believe performing full one-hot encoding is more appropriate for XGBoost modeling.

  3. The function keeps NAs inside both categorical and numeric features preserved. As pointed by the author of xgboost, the algorithm will automatically learn what is the best direction to go when a value is missing, which can be viewed as automatically “learn” what is the best imputation value for missing values based on reduction on training loss 4. This is one of the reasons of XGBoost being so powerful, so we are keeping all NAs in the output matrix.

  4. The function outputs meta information tracking all the levels for each categorical feature. If it is given to the function as an input, the exact feature-level combinations will be populated, even if the new data is missing one level for a particular categorical feature, or having a new level never seen before.

Available arguments of this function are, which will be explained with examples:

onehot2sql(data, meta=NULL, sep="_", ws_replace=TRUE, ws_replace_with="",
           unique_id=NULL, output_file_name=NULL, input_table_name=NULL)

Output of this function is a list containing:

  1. meta data tracking the transformation.
  2. matrix model.matrix being the data after processing which is ready for XGBoost fitting.
  3. SQL query sql performing the exact one-hot encoding in the database.

So let’ take a look of its basic usage:

out <- onehot2sql(df)
print(out$meta)
#> $num.vec
#> [1] "carat" "depth" "table" "price" "x"     "y"     "z"    
#> 
#> $catg.vec
#> [1] "cut"     "color"   "clarity"
#> 
#> $contrasts
#> $contrasts$cut_
#>           Fair Good Very Good Premium Ideal
#> Fair         1    0         0       0     0
#> Good         0    1         0       0     0
#> Very Good    0    0         1       0     0
#> Premium      0    0         0       1     0
#> Ideal        0    0         0       0     1
#> 
#> $contrasts$color_
#>   D E F G H I J
#> D 1 0 0 0 0 0 0
#> E 0 1 0 0 0 0 0
#> F 0 0 1 0 0 0 0
#> G 0 0 0 1 0 0 0
#> H 0 0 0 0 1 0 0
#> I 0 0 0 0 0 1 0
#> J 0 0 0 0 0 0 1
#> 
#> $contrasts$clarity_
#>      I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
#> I1    1   0   0   0   0    0    0  0
#> SI2   0   1   0   0   0    0    0  0
#> SI1   0   0   1   0   0    0    0  0
#> VS2   0   0   0   1   0    0    0  0
#> VS1   0   0   0   0   1    0    0  0
#> VVS2  0   0   0   0   0    1    0  0
#> VVS1  0   0   0   0   0    0    1  0
#> IF    0   0   0   0   0    0    0  1
head(out$model.matrix)
#>   (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2
#> 1           1  0.23          0          0           0           1
#> 2           1  0.21          0          0           1           0
#> 3           1  0.23          0          0           0           0
#> 4           1  0.29          0          0           0           0
#> 5           1  0.31          0          0           0           1
#> 6           1  0.24          0          0           0           0
#>   clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E
#> 1           0           0            0            0       0       1
#> 2           0           0            0            0       0       1
#> 3           1           0            0            0       0       1
#> 4           0           1            0            0       0       0
#> 5           0           0            0            0       0       0
#> 6           0           0            0            1       0       0
#>   color_F color_G color_H color_I color_J cut_Fair cut_Good cut_Ideal
#> 1       0       0       0       0       0        0        0         1
#> 2       0       0       0       0       0        0        0         0
#> 3       0       0       0       0       0        0        1         0
#> 4       0       0       0       1       0        0        0         0
#> 5       0       0       0       0       1        0        1         0
#> 6       0       0       0       0       1        0        0         0
#>   cut_Premium cut_VeryGood depth price table    x    y    z
#> 1           0            0  61.5   326    55 3.95 3.98 2.43
#> 2           1            0  59.8   326    61 3.89 3.84 2.31
#> 3           0            0  56.9   327    65 4.05 4.07 2.31
#> 4           1            0  62.4   334    58 4.20 4.23 2.63
#> 5           0            0  63.3   335    58 4.34 4.35 2.75
#> 6           0            1  62.8   336    57 3.94 3.96 2.48

It should be noted that level “Very Good” for feature “cut” has been replaced with “VeryGood”, with the white-space removed. This behaviour is controlled by function arguments ws_replace=TRUE and ws_replace_with="", where other symbol can be specified to replace the white-space inside levels of categorical features. Such processing is very necessary as SQL database usually doesn’t allow white-space inside its table column names. And symbol separating the feature and its levels is controlled by sep="_". The output model.matrix would have all its columns reordered alphabetically.

The SQL query performing one-hot encoding for the raw table is:

cat(out$sql)
#> SELECT ROW_KEY, [carat], [depth], [table], [price], [x], [y], [z], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Fair' then 1 else 0 end) AS [cut_Fair], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Good' then 1 else 0 end) AS [cut_Good], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Very Good' then 1 else 0 end) AS [cut_VeryGood], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Premium' then 1 else 0 end) AS [cut_Premium], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Ideal' then 1 else 0 end) AS [cut_Ideal], 
#> (case when [color] IS NULL then NULL when [color] = 'D' then 1 else 0 end) AS [color_D], 
#> (case when [color] IS NULL then NULL when [color] = 'E' then 1 else 0 end) AS [color_E], 
#> (case when [color] IS NULL then NULL when [color] = 'F' then 1 else 0 end) AS [color_F], 
#> (case when [color] IS NULL then NULL when [color] = 'G' then 1 else 0 end) AS [color_G], 
#> (case when [color] IS NULL then NULL when [color] = 'H' then 1 else 0 end) AS [color_H], 
#> (case when [color] IS NULL then NULL when [color] = 'I' then 1 else 0 end) AS [color_I], 
#> (case when [color] IS NULL then NULL when [color] = 'J' then 1 else 0 end) AS [color_J], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'I1' then 1 else 0 end) AS [clarity_I1], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI2' then 1 else 0 end) AS [clarity_SI2], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI1' then 1 else 0 end) AS [clarity_SI1], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS2' then 1 else 0 end) AS [clarity_VS2], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS1' then 1 else 0 end) AS [clarity_VS1], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS2' then 1 else 0 end) AS [clarity_VVS2], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS1' then 1 else 0 end) AS [clarity_VVS1], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'IF' then 1 else 0 end) AS [clarity_IF] 
#> FROM INPUT_TABLE

We want to emphasise here that an unique row identifier inside the raw table is crucial for in-database scoring of XGBoost model. Column name of the identifier can be specified by the function argument unique_id, which will be passed along to the table after one-hot encoding. If it is not given, SQL query will be populated with column name “ROW_KEY” for the identifier. Similarly, “INPUT_TABLE” is used in the query if name of the raw table input_table_name is NULL. Given a valid value, the SQL query will be written to the file specified by output_file_name.

Let’s have another example with NAs and a date column:

d2 <- data.table(ggplot2::diamonds)
# change column class
d2[, cut:=factor(cut, ordered=FALSE)]
d2[, clarity:=as.character(clarity)]
# create IDate column
d2[, tsdt:=as.IDate('2017-01-05')]
d2[1:3, tsdt:=tsdt-1]
# add NAs
d2[1, clarity:=NA]
d2[2, depth:=NA]
head(d2)
#>    carat       cut color clarity depth table price    x    y    z
#> 1:  0.23     Ideal     E    <NA>  61.5    55   326 3.95 3.98 2.43
#> 2:  0.21   Premium     E     SI1    NA    61   326 3.89 3.84 2.31
#> 3:  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 4:  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
#> 5:  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 6:  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
#>          tsdt
#> 1: 2017-01-04
#> 2: 2017-01-04
#> 3: 2017-01-04
#> 4: 2017-01-05
#> 5: 2017-01-05
#> 6: 2017-01-05
out2 <- onehot2sql(d2)
head(out2$model.matrix)
#>   (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2
#> 1           1  0.23         NA         NA          NA          NA
#> 2           1  0.21          0          0           1           0
#> 3           1  0.23          0          0           0           0
#> 4           1  0.29          0          0           0           0
#> 5           1  0.31          0          0           0           1
#> 6           1  0.24          0          0           0           0
#>   clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E
#> 1          NA          NA           NA           NA       0       1
#> 2           0           0            0            0       0       1
#> 3           1           0            0            0       0       1
#> 4           0           1            0            0       0       0
#> 5           0           0            0            0       0       0
#> 6           0           0            0            1       0       0
#>   color_F color_G color_H color_I color_J cut_Fair cut_Good cut_Ideal
#> 1       0       0       0       0       0        0        0         1
#> 2       0       0       0       0       0        0        0         0
#> 3       0       0       0       0       0        0        1         0
#> 4       0       0       0       1       0        0        0         0
#> 5       0       0       0       0       1        0        1         0
#> 6       0       0       0       0       1        0        0         0
#>   cut_Premium cut_VeryGood depth price table tsdt_20170104 tsdt_20170105
#> 1           0            0  61.5   326    55             1             0
#> 2           1            0    NA   326    61             1             0
#> 3           0            0  56.9   327    65             1             0
#> 4           1            0  62.4   334    58             0             1
#> 5           0            0  63.3   335    58             0             1
#> 6           0            1  62.8   336    57             0             1
#>      x    y    z
#> 1 3.95 3.98 2.43
#> 2 3.89 3.84 2.31
#> 3 4.05 4.07 2.31
#> 4 4.20 4.23 2.63
#> 5 4.34 4.35 2.75
#> 6 3.94 3.96 2.48

Then let’s look at when meta is given to data with new elements, whether onehot2sql() will output model.matrix with identical columns as the training data, in order to apply predict() to the trained model on the new data:

newdata <- d2[1:5,]
# newdata has columns with new elements
newdata[5, clarity:='NEW']; newdata[1,tsdt:=as.IDate('2018-05-01')]
# newdata has a new column
newdata[, new_col:=1]
# newdata is lacking a column
newdata[, cut:=NULL]
head(newdata)
#>    carat color clarity depth table price    x    y    z       tsdt new_col
#> 1:  0.23     E    <NA>  61.5    55   326 3.95 3.98 2.43 2018-05-01       1
#> 2:  0.21     E     SI1    NA    61   326 3.89 3.84 2.31 2017-01-04       1
#> 3:  0.23     E     VS1  56.9    65   327 4.05 4.07 2.31 2017-01-04       1
#> 4:  0.29     I     VS2  62.4    58   334 4.20 4.23 2.63 2017-01-05       1
#> 5:  0.31     J     NEW  63.3    58   335 4.34 4.35 2.75 2017-01-05       1
onehot2sql(newdata, meta=out2$meta)$model.matrix
#> Warning in onehot2sql(newdata, meta = out2$meta): Following columns are populated with NAs: 
#> cut
#>   (Intercept) carat clarity_I1 clarity_IF clarity_SI1 clarity_SI2
#> 1           1  0.23         NA         NA          NA          NA
#> 2           1  0.21          0          0           1           0
#> 3           1  0.23          0          0           0           0
#> 4           1  0.29          0          0           0           0
#> 5           1  0.31          0          0           0           0
#>   clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2 color_D color_E
#> 1          NA          NA           NA           NA       0       1
#> 2           0           0            0            0       0       1
#> 3           1           0            0            0       0       1
#> 4           0           1            0            0       0       0
#> 5           0           0            0            0       0       0
#>   color_F color_G color_H color_I color_J cut_Fair cut_Good cut_Ideal
#> 1       0       0       0       0       0       NA       NA        NA
#> 2       0       0       0       0       0       NA       NA        NA
#> 3       0       0       0       0       0       NA       NA        NA
#> 4       0       0       0       1       0       NA       NA        NA
#> 5       0       0       0       0       1       NA       NA        NA
#>   cut_Premium cut_VeryGood depth price table tsdt_20170104 tsdt_20170105
#> 1          NA           NA  61.5   326    55             0             0
#> 2          NA           NA    NA   326    61             1             0
#> 3          NA           NA  56.9   327    65             1             0
#> 4          NA           NA  62.4   334    58             0             1
#> 5          NA           NA  63.3   335    58             0             1
#>      x    y    z
#> 1 3.95 3.98 2.43
#> 2 3.89 3.84 2.31
#> 3 4.05 4.07 2.31
#> 4 4.20 4.23 2.63
#> 5 4.34 4.35 2.75

We can see from this example that

  1. any new levels will have value of 0s on all the columns related to that feature.
  2. any new features will not be in the output model.matrix.
  3. the entire feature will be imputed with NAs if it is missing in the new data, and warnings will be given.

We recommend any feature engineering and/or missing imputation work to be done before applying function onehot2sql() to the training data in R. It should be the last step before kicking off the model fitting. And SQL query for feature engineering and/or missing imputation can be placed as a sub-query inside the one-hot query. For example, replacing “INPUT_TABLE” inside out$sql with following sub-query will do one-hot encoding together with missing imputation for feature “clarity”:

(SELECT ROW_KEY, [cut], [color], 
  (case when [clarity] IS NULL then 'MISS' else [clarity] end) as [clarity],
  [carat], [depth], [table], [price], [x], [y], [z]
FROM INPUT_TABLE) AS IMPUTED_TABLE

Transform XGBoost Model into SQL Query

Before taking a close look at function booster2sql() translating XGBoost model into SQL query, we want to illustrate the suggested work-flow for the whole process of model fitting and scoring with package xgb2sql:

  1. We start with transferring raw table from database to R as a data.frame/data.table. There are many packages supporting database connection, we recommend dplyr and DBI here.
  2. After all feature engineering and missing imputation is done, apply function onehot2sql() to the data.frame/data.table, obtaining the model.matrix and storing the one-hot query.
  3. Conduct all modeling practices until reaching a final model, then apply function booster2sql() to the final model, producing the XGBoost query for its in-database scoring.
  4. Modeling in R is done, let’s move to in-database scoring:
    • Execute the one-hot query on the raw table, creating the model-ready table.
    • Execute the XGBoost query on the model-ready table, obtaining the model predictions.
    • Compare the model prediction in R with the values given by the XGBoost query is always recommended.

Now let’s move back to function booster2sql(). Available arguments are:

booster2sql(xgbModel, print_progress=FALSE, unique_id=NULL,
            output_file_name=NULL, input_table_name=NULL, input_onehot_query=NULL)

The model input xgbModel to this function should have a class of xgb.Booster. And print_progress=FALSE controls whether the translating progress should be printed to console. Similarly, unique_id and input_table_name should be given to generate the SQL query. It should be noted that there must be a valid file path for output_file_name to write the query, otherwise the function will not run.

Let’s try to predict the “price” of diamonds using the other features. In order to demonstrate the generated XGBoost query, we will train the model with max.depth=2 and nround=2.

x <- out$model.matrix[,colnames(out$model.matrix)!='price']
y <- out$model.matrix[,colnames(out$model.matrix)=='price']
bst <- xgboost(data = x,
               label = y,
               max.depth = 2,
               eta = .3,
               nround = 2,
               objective = 'reg:linear')
#> [1]  train-rmse:4095.421143 
#> [2]  train-rmse:3074.222656
booster2sql(bst, output_file_name='xgb.txt')
#> query is written to file with row unique id named as ROW_KEY
#> query is written to file with input table named as MODREADY_TABLE
cat(readChar('xgb.txt', file.info('xgb.txt')$size))
#> SELECT  ROW_KEY , 0.5 + SUM(ONETREE) AS XGB_PRED
#> FROM (   
#>  SELECT ROW_KEY ,
#>  (CASE WHEN [carat] < 0.995000005 THEN 
#>  (CASE WHEN [y] < 5.53499985 THEN 317.401001
#>   WHEN  [y] >= 5.53499985 THEN 922.349731
#>   WHEN  [y] IS NULL THEN 317.401001 END)
#>   WHEN  [carat] >= 0.995000005 THEN 
#>  (CASE WHEN [y] < 7.19499969 THEN 1841.06018
#>   WHEN  [y] >= 7.19499969 THEN 3696.24292
#>   WHEN  [y] IS NULL THEN 1841.06018 END)
#>   WHEN  [carat] IS NULL THEN 
#>  (CASE WHEN [y] < 5.53499985 THEN 317.401001
#>   WHEN  [y] >= 5.53499985 THEN 922.349731
#>   WHEN  [y] IS NULL THEN 317.401001 END) END) AS ONETREE FROM  MODREADY_TABLE 
#>  UNION ALL 
#>  
#>  SELECT ROW_KEY ,
#>  (CASE WHEN [y] < 6.69499969 THEN 
#>  (CASE WHEN [carat] < 0.824999988 THEN 289.332123
#>   WHEN  [carat] >= 0.824999988 THEN 1056.4021
#>   WHEN  [carat] IS NULL THEN 289.332123 END)
#>   WHEN  [y] >= 6.69499969 THEN 
#>  (CASE WHEN [y] < 7.65499973 THEN 1814.65881
#>   WHEN  [y] >= 7.65499973 THEN 3217.57129
#>   WHEN  [y] IS NULL THEN 1814.65881 END)
#>   WHEN  [y] IS NULL THEN 
#>  (CASE WHEN [carat] < 0.824999988 THEN 289.332123
#>   WHEN  [carat] >= 0.824999988 THEN 1056.4021
#>   WHEN  [carat] IS NULL THEN 289.332123 END) END) AS ONETREE FROM  MODREADY_TABLE   
#> ) AS TREES_TABLE GROUP BY  ROW_KEY

We can see that each SELECT ... AS ONETREE section inside the XGBoost query is composed of nested case when statement, providing scores along a tree structure. And each of these sections represents one round/iteration of the XGBoost model. Values for the splits and scores within the query are from the xgb.dump() of the model without any rounding:

xgb.dump(bst)
#>  [1] "booster[0]"                             
#>  [2] "0:[f1<0.995000005] yes=1,no=2,missing=1"
#>  [3] "1:[f25<5.53499985] yes=3,no=4,missing=3"
#>  [4] "3:leaf=317.401001"                      
#>  [5] "4:leaf=922.349731"                      
#>  [6] "2:[f25<7.19499969] yes=5,no=6,missing=5"
#>  [7] "5:leaf=1841.06018"                      
#>  [8] "6:leaf=3696.24292"                      
#>  [9] "booster[1]"                             
#> [10] "0:[f25<6.69499969] yes=1,no=2,missing=1"
#> [11] "1:[f1<0.824999988] yes=3,no=4,missing=3"
#> [12] "3:leaf=289.332123"                      
#> [13] "4:leaf=1056.4021"                       
#> [14] "2:[f25<7.65499973] yes=5,no=6,missing=5"
#> [15] "5:leaf=1814.65881"                      
#> [16] "6:leaf=3217.57129"

It should be noted that model prediction calculated by adding up the scores provided by xgb.dump(), is different from that by applying predict() to the model directly. It is a rounding difference thus extremely insignificant. But since the XGBoost query is generated with scores from xgb.dump, this difference will still be there between the in-database scoring and the R’s predict() of the model.

There is one last argument of booster2sql() we haven’t talked about, i.e., input_onehot_query. Here we can input the one-hot query generated by onehot2sql(), which will be used as sub-query replacing “MODREADY_TABLE” within the XGBoost query. In this way, the XGBoost query can be executed on the raw table, producing the model predictions directly.

booster2sql(bst, output_file_name='onehot-xgb.txt', input_onehot_query=out$sql)
#> query is written to file with row unique id named as ROW_KEY
cat(readChar('onehot-xgb.txt', file.info('onehot-xgb.txt')$size))
#> SELECT  ROW_KEY , 0.5 + SUM(ONETREE) AS XGB_PRED
#> FROM (   
#>  SELECT ROW_KEY ,
#>  (CASE WHEN [carat] < 0.995000005 THEN 
#>  (CASE WHEN [y] < 5.53499985 THEN 317.401001
#>   WHEN  [y] >= 5.53499985 THEN 922.349731
#>   WHEN  [y] IS NULL THEN 317.401001 END)
#>   WHEN  [carat] >= 0.995000005 THEN 
#>  (CASE WHEN [y] < 7.19499969 THEN 1841.06018
#>   WHEN  [y] >= 7.19499969 THEN 3696.24292
#>   WHEN  [y] IS NULL THEN 1841.06018 END)
#>   WHEN  [carat] IS NULL THEN 
#>  (CASE WHEN [y] < 5.53499985 THEN 317.401001
#>   WHEN  [y] >= 5.53499985 THEN 922.349731
#>   WHEN  [y] IS NULL THEN 317.401001 END) END) AS ONETREE FROM  ( 
#> SELECT ROW_KEY, [carat], [depth], [table], [price], [x], [y], [z], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Fair' then 1 else 0 end) AS [cut_Fair], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Good' then 1 else 0 end) AS [cut_Good], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Very Good' then 1 else 0 end) AS [cut_VeryGood], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Premium' then 1 else 0 end) AS [cut_Premium], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Ideal' then 1 else 0 end) AS [cut_Ideal], 
#> (case when [color] IS NULL then NULL when [color] = 'D' then 1 else 0 end) AS [color_D], 
#> (case when [color] IS NULL then NULL when [color] = 'E' then 1 else 0 end) AS [color_E], 
#> (case when [color] IS NULL then NULL when [color] = 'F' then 1 else 0 end) AS [color_F], 
#> (case when [color] IS NULL then NULL when [color] = 'G' then 1 else 0 end) AS [color_G], 
#> (case when [color] IS NULL then NULL when [color] = 'H' then 1 else 0 end) AS [color_H], 
#> (case when [color] IS NULL then NULL when [color] = 'I' then 1 else 0 end) AS [color_I], 
#> (case when [color] IS NULL then NULL when [color] = 'J' then 1 else 0 end) AS [color_J], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'I1' then 1 else 0 end) AS [clarity_I1], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI2' then 1 else 0 end) AS [clarity_SI2], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI1' then 1 else 0 end) AS [clarity_SI1], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS2' then 1 else 0 end) AS [clarity_VS2], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS1' then 1 else 0 end) AS [clarity_VS1], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS2' then 1 else 0 end) AS [clarity_VVS2], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS1' then 1 else 0 end) AS [clarity_VVS1], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'IF' then 1 else 0 end) AS [clarity_IF] 
#> FROM INPUT_TABLE 
#> ) AS MODREADY_TABLE  
#>  UNION ALL 
#>  
#>  SELECT ROW_KEY ,
#>  (CASE WHEN [y] < 6.69499969 THEN 
#>  (CASE WHEN [carat] < 0.824999988 THEN 289.332123
#>   WHEN  [carat] >= 0.824999988 THEN 1056.4021
#>   WHEN  [carat] IS NULL THEN 289.332123 END)
#>   WHEN  [y] >= 6.69499969 THEN 
#>  (CASE WHEN [y] < 7.65499973 THEN 1814.65881
#>   WHEN  [y] >= 7.65499973 THEN 3217.57129
#>   WHEN  [y] IS NULL THEN 1814.65881 END)
#>   WHEN  [y] IS NULL THEN 
#>  (CASE WHEN [carat] < 0.824999988 THEN 289.332123
#>   WHEN  [carat] >= 0.824999988 THEN 1056.4021
#>   WHEN  [carat] IS NULL THEN 289.332123 END) END) AS ONETREE FROM  ( 
#> SELECT ROW_KEY, [carat], [depth], [table], [price], [x], [y], [z], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Fair' then 1 else 0 end) AS [cut_Fair], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Good' then 1 else 0 end) AS [cut_Good], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Very Good' then 1 else 0 end) AS [cut_VeryGood], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Premium' then 1 else 0 end) AS [cut_Premium], 
#> (case when [cut] IS NULL then NULL when [cut] = 'Ideal' then 1 else 0 end) AS [cut_Ideal], 
#> (case when [color] IS NULL then NULL when [color] = 'D' then 1 else 0 end) AS [color_D], 
#> (case when [color] IS NULL then NULL when [color] = 'E' then 1 else 0 end) AS [color_E], 
#> (case when [color] IS NULL then NULL when [color] = 'F' then 1 else 0 end) AS [color_F], 
#> (case when [color] IS NULL then NULL when [color] = 'G' then 1 else 0 end) AS [color_G], 
#> (case when [color] IS NULL then NULL when [color] = 'H' then 1 else 0 end) AS [color_H], 
#> (case when [color] IS NULL then NULL when [color] = 'I' then 1 else 0 end) AS [color_I], 
#> (case when [color] IS NULL then NULL when [color] = 'J' then 1 else 0 end) AS [color_J], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'I1' then 1 else 0 end) AS [clarity_I1], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI2' then 1 else 0 end) AS [clarity_SI2], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'SI1' then 1 else 0 end) AS [clarity_SI1], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS2' then 1 else 0 end) AS [clarity_VS2], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VS1' then 1 else 0 end) AS [clarity_VS1], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS2' then 1 else 0 end) AS [clarity_VVS2], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'VVS1' then 1 else 0 end) AS [clarity_VVS1], 
#> (case when [clarity] IS NULL then NULL when [clarity] = 'IF' then 1 else 0 end) AS [clarity_IF] 
#> FROM INPUT_TABLE 
#> ) AS MODREADY_TABLE    
#> ) AS TREES_TABLE GROUP BY  ROW_KEY

As processing time and query size grow exponentially with max.depth, linearly with nround, this approach of combining the one-hot query and the XGBoost query together should be used for only simple models.


  1. Understand your dataset with XGBoost. 

  2. Ensembles of tree-based models: why correlated features do not trip them and why NA matters. 

  3. StackExchange: Does XGBoost handle multicollinearity by itself? 

  4. GitHub Issue: What are the ways of treatng missing values in XGboost?