Shapley Value Regression

Jingyi Liang

The usual way to judge the importance of attributes in a linear regression is to look at the regression coefficients. However, when a model includes many independent variables, we cannot guarantee that they are actually independent of one another. In other words, it is quite likely that several attributes are collinear, that is, highly correlated. In a textbook example, we can simply remove the highly correlated attributes and then fit the regression. In real-world business cases, however, the selected attributes are all important and meaningful, so we cannot arbitrarily drop the ones that are highly correlated. We therefore need a way to calculate the importance of attributes when several of them are collinear.
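To illustrate why coefficients become unreliable under collinearity, here is a minimal simulated sketch. The data and the variable names (x1, x2, fit_both, fit_one) are made up for illustration and are not part of the package. It fits the same response once with two nearly identical predictors and once with only one of them, then compares the standard error of the x1 coefficient.

```r
# Simulated illustration (assumed data, not from the package):
# x2 is almost a copy of x1, so the two predictors are highly correlated.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
y  <- x1 + x2 + rnorm(n)

fit_both <- lm(y ~ x1 + x2)  # both collinear predictors
fit_one  <- lm(y ~ x1)       # x1 alone

r12     <- cor(x1, x2)
se_both <- coef(summary(fit_both))["x1", "Std. Error"]
se_one  <- coef(summary(fit_one))["x1", "Std. Error"]

# The standard error of the x1 coefficient is much larger in the joint model,
# so its estimated value says little about how important x1 is.
print(c(correlation = r12, se_joint = se_both, se_alone = se_one))
```

When the correlation is this high, the individual coefficients in the joint model are poorly determined even though the model as a whole fits well.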

Shapley Value regression is also known as Shapley regression, Shapley Value analysis, Kruskal analysis, dominance analysis, and incremental R-squared analysis. Besides linear regression with moderately to highly correlated independent variables, it can also be used to compute the contribution of each predictor in machine learning models.
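For reference, the standard Shapley decomposition of R-squared can be written as follows. This is the textbook definition, given here on the assumption that R-squared is the quantity being attributed, as the name "incremental R-squared analysis" suggests. The symbols are introduced for illustration only: P is the set of all p predictors, S is a subset that excludes predictor j, and R^2(S) is the R-squared of the regression on the predictors in S, with R^2 of the empty set taken to be 0.

```latex
\phi_j \;=\; \sum_{S \subseteq P \setminus \{j\}}
  \frac{|S|!\,\bigl(p - |S| - 1\bigr)!}{p!}
  \Bigl[ R^2\bigl(S \cup \{j\}\bigr) - R^2(S) \Bigr],
\qquad
\sum_{j \in P} \phi_j \;=\; R^2(P),
\qquad
\tilde{\phi}_j \;=\; \frac{\phi_j}{\sum_{k \in P} \phi_k}.
```

Each predictor's value is its average gain in R-squared over all orders in which predictors could enter the model. The values add up to the R-squared of the full model, and the standardized values add up to 1.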

This package has only one function, shapleyvalue, which you can use to analyze the relative importance of attributes in linear regression.

A simple example

Here, we use the built-in dataset Boston from the MASS package. In this demo, medv is the dependent variable and nox, rm, age, and dis are the four predictors; we want to find out the importance of each predictor.

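library(MASS)          # provides the Boston dataset
library(kableExtra)    # provides kbl(), kable_classic(), and the %>% pipe
library(ShapleyValue)
data <- Boston
head(data) %>%
  kbl() %>%
  kable_classic(full_width = FALSE, html_font = "Cambria")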
crim     zn  indus  chas  nox    rm     age   dis     rad  tax  ptratio  black   lstat  medv
0.00632  18  2.31   0     0.538  6.575  65.2  4.0900  1    296  15.3     396.90  4.98   24.0
0.02731  0   7.07   0     0.469  6.421  78.9  4.9671  2    242  17.8     396.90  9.14   21.6
0.02729  0   7.07   0     0.469  7.185  61.1  4.9671  2    242  17.8     392.83  4.03   34.7
0.03237  0   2.18   0     0.458  6.998  45.8  6.0622  3    222  18.7     394.63  2.94   33.4
0.06905  0   2.18   0     0.458  7.147  54.2  6.0622  3    222  18.7     396.90  5.33   36.2
0.02985  0   2.18   0     0.458  6.430  58.7  6.0622  3    222  18.7     394.12  5.21   28.7
y <- data$medv                    # dependent variable
x <- as.data.frame(data[, 5:8])   # predictors: nox, rm, age, dis
value <- shapleyvalue(y, x)
value %>%
  kbl() %>%
  kable_classic(full_width = FALSE, html_font = "Cambria")
Measure                     nox     rm      age     dis
Shapley Value               0.0836  0.3938  0.0573  0.0272
Standardized Shapley Value  0.1488  0.7009  0.1020  0.0483
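In the table above, the Shapley Value row sums to 0.5619 and the standardized row sums to 1.0000. To make these numbers less of a black box, here is a minimal from-scratch sketch of a Shapley decomposition of R-squared. The helper names r_squared and shapley_r2 are hypothetical and not part of the ShapleyValue package. The sketch assumes that R-squared is the quantity being shared among the predictors, which the package may or may not implement in exactly this way.

```r
# Sketch under stated assumptions: Shapley decomposition of R-squared
# by fitting a linear model on every subset of the predictors.
library(MASS)  # provides the Boston dataset

# R-squared of a linear model of y on the chosen columns of x (0 if none).
r_squared <- function(y, x, cols) {
  if (length(cols) == 0) return(0)
  d <- data.frame(y = y, x[, cols, drop = FALSE])
  summary(lm(y ~ ., data = d))$r.squared
}

# Weighted average of each predictor's marginal gain in R-squared.
shapley_r2 <- function(y, x) {
  p   <- ncol(x)
  phi <- setNames(numeric(p), names(x))
  for (j in seq_len(p)) {
    others <- setdiff(seq_len(p), j)
    for (k in 0:(p - 1)) {
      # All subsets of size k drawn from the other predictors.
      subsets <- if (k == 0) {
        list(integer(0))
      } else {
        combn(length(others), k, FUN = function(i) others[i], simplify = FALSE)
      }
      w <- factorial(k) * factorial(p - k - 1) / factorial(p)
      for (s in subsets) {
        gain   <- r_squared(y, x, c(s, j)) - r_squared(y, x, s)
        phi[j] <- phi[j] + w * gain
      }
    }
  }
  phi
}

x <- Boston[, c("nox", "rm", "age", "dis")]
y <- Boston$medv
phi     <- shapley_r2(y, x)
full_r2 <- summary(lm(medv ~ nox + rm + age + dis, data = Boston))$r.squared

print(round(phi, 4))             # raw contributions
print(round(phi / sum(phi), 4))  # standardized contributions
print(all.equal(sum(phi), full_r2))
```

Because the sketch refits a model for every subset, its cost grows exponentially with the number of predictors, so it is only practical for a handful of them.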