
LLpowershap: logistic loss-based automated Shapley values feature selection method


BMC Medical Research Methodology volume 24, Article number: 247 (2024)


Shapley values have been used extensively in machine learning, not only to explain black box machine learning models, but among other tasks, also to conduct model debugging, sensitivity and fairness analyses and to select important features for robust modelling and for further follow-up analyses. Shapley values satisfy certain axioms that promote fairness in distributing contributions of features toward prediction or reducing error, after accounting for non-linear relationships and interactions when complex machine learning models are employed. Recently, feature selection methods using predictive Shapley values and p-values have been introduced, including powershap.

We present a novel feature selection method, LLpowershap, that takes forward these recent advances by employing loss-based Shapley values to identify informative features with minimal noise among the selected sets of features. We also enhance the calculation of p-values and power to identify informative features and to estimate the number of iterations of model development and testing.

Our simulation results show that LLpowershap not only identifies a higher number of informative features but also outputs fewer noise features compared to other state-of-the-art feature selection methods. Benchmarking results on four real-world datasets demonstrate higher or comparable predictive performance of LLpowershap compared to other Shapley value-based wrapper methods and filter methods. LLpowershap is also ranked the best in mean ranking among the seven feature selection methods tested on the benchmark datasets.

Our results demonstrate that LLpowershap is a viable wrapper feature selection method that can be used for feature selection in large biomedical datasets and other settings.


Dimensionality reduction is an important task in machine learning and data mining, considering the benefits it brings. Feature selection (a.k.a. attribute selection), among other approaches, is an extensively used dimensionality reduction method, as it helps to select relevant features and to remove irrelevant (noise) and redundant features, which generally improves model performance, reduces overfitting and improves the interpretability of models [1]. It also facilitates further follow-up analyses on the important features identified, which require domain expertise and well-established statistical procedures [2, 3].

In this paper, we present an enhanced version of the recently introduced SHAP value-based feature selection method powershap [4] and test its viability in healthcare settings. Firstly, we exploit Shapley values [5] computed on the logistic loss (hereafter referred to as LogisticLossSHAP), which distribute the mismatch between the prediction and the truth, as measured by the logistic loss function, among the input features. Secondly, we make certain modifications to the calculation of p-values to identify important features. Finally, we make further modifications to the statistical power calculations that automate the feature selection method, eliminating the need to specify the number of iterations the method needs to run. We call our method LLpowershap (Logistic Loss-based powershap).

In the following section, we illustrate the connection and benefits of using LLpowershap compared to the previous methods. In “LLpowershap” section, we explain in detail the changes that we make to improve upon powershap [4]. By following the testing strategy described in [4], the performance of LLpowershap is then compared against five other state-of-the-art Shapley value-based methods and two filter methods using both simulation and real-world datasets (“Experiments” and “Results” sections). “Discussion” section discusses the results, and the conclusion is provided in “Conclusion” section.

Post-hoc explanations of black-box machine learning models are becoming increasingly popular, with SHAP [6], an additive feature attribution method, becoming a common method for providing local explanations with good local fidelity. SHAP values are based on Shapley values, a unique solution concept used in game theory that deals with how fairly (mathematically) the payoffs from a cooperative game can be distributed to the game players, satisfying certain axioms, namely, dummy/null player, efficiency/full allocation, symmetry/fairness, and linearity. The Shapley value for a player is defined as the average marginal contribution of the player, considering all possible coalitions that the player can be part of. It has been demonstrated that Shapley value-based methods can provide different attributions to features for the same input even when the values are computed exactly (as opposed to approximated), owing to the different cooperative games formulated by the methods [7]. Marginal (also called interventional or true-to-the-model) Shapley values use marginal distributions to simulate the missingness of features as required in the calculation of Shapley values. Conditional (also called true-to-the-data) Shapley values, on the contrary, use a conditional distribution to simulate missingness of features. Conditional Shapley values tend to spread credit between correlated features, whereas marginal Shapley values produce attributions that describe the functional form of the model [8]. Conditional and marginal approaches are by far the most common feature removal approaches in practice [8].

TreeSHAP [9] is a widely used algorithm to calculate/estimate Shapley values for tree-based models (decision trees and tree ensembles such as random forests and gradient boosting), leveraging the structure of the tree(s) to reduce the time complexity from exponential to polynomial. TreeSHAP is not model agnostic because it utilises internal details of decision trees, such as the values at the leaves and the splitting proportions at internal nodes. It has a conditional version, known as Path-dependent TreeSHAP (which estimates Shapley values, biased but variance free), and a marginal version, Interventional TreeSHAP (which calculates exact Shapley values, unbiased and variance free) [8]. Interventional TreeSHAP calculates exact marginal Shapley values in time linear in the size of the model and the number of samples in the background dataset used for simulating feature removal [9], with the advantage of having no burden of checking convergence and no noise in the estimates [9]. Having no sampling variability when applied to models with many input features is a further advantage. We use Interventional TreeSHAP in LLpowershap by setting feature_perturbation=“interventional” when creating explainer objects, as in the sketch below. Interventional TreeSHAP can be slower than Path-dependent TreeSHAP if the background dataset is large, as the complexity scales with the size of the background dataset. However, model-agnostic explainers such as KernelSHAP [6] and SAGE [10] can be much slower than Interventional TreeSHAP.
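As a minimal sketch (our own illustration with an assumed toy dataset; model settings and names are illustrative, not the authors' code), an interventional explainer can be created as follows:

```python
# A minimal sketch of exact marginal Shapley values via Interventional
# TreeSHAP; the dataset, model settings and names here are illustrative.
import pandas as pd
import shap
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(n_estimators=250).fit(X_train, y_train)

# The background dataset simulates feature removal; runtime scales with its size.
background = X_train.sample(1024, random_state=0)
explainer = shap.TreeExplainer(model, data=background,
                               feature_perturbation="interventional")
shap_values = explainer.shap_values(X_test)  # shape: (n_samples, n_features)
```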

Rank-based feature selection methods and other Shapley value-based methods need global feature importance scores rather than local Shapley values. One common way of finding global feature importance is by averaging the absolute local Shapley values, so that positive and negative values do not cancel out (see the sketch below). However, there are methods that estimate/calculate global feature importance by assessing a feature's effect on the loss [10, 11]. For example, the work in [11] uses linear models with a hinge loss-based characteristic function and relates a feature's contribution to Shapley value-based error apportioning (SVEA) of the total training error, enabling a zero-based threshold to identify an important subset of features.
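Continuing the sketch above, the mean-absolute aggregation is a one-liner:

```python
import numpy as np

# Average absolute local Shapley values per feature so that positive and
# negative contributions do not cancel out.
global_importance = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(global_importance)[::-1]  # indices, most important first
```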

To explain interventional Shapley values for models, assume we have a trained model f to predict Y given an input \(\varvec{X}\), consisting of individual features \((X_1, X_2, \ldots , X_m)\). With the subset of given features as \(\varvec{X}_S \equiv \{X_i | i \in S\}\) and the subset of missing features to be simulated as \(\varvec{X}_{\bar{S}} \equiv \{X_i | i \notin S\}\), the prediction \(f(\varvec{x}_S, \varvec{X}_{\bar{S}})\) in interventional Shapley values can be found by marginalising the missing features over their marginal distribution:

\(v_f(S) = \mathbb {E}_{\varvec{X}_{\bar{S}}}\big [f(\varvec{x}_S, \varvec{X}_{\bar{S}})\big ]\)

Instead of explaining the prediction made by the model f, it is possible to marginally explain the logistic loss \(\ell\). The characteristic function for the cooperative game will be

\(v_\ell (S) = \mathbb {E}_{\varvec{X}_{\bar{S}}}\big [\ell \big (f(\varvec{x}_S, \varvec{X}_{\bar{S}}), y\big )\big ]\)
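In the shap package, this loss game is exposed through the model_output argument of the tree explainer; a sketch, reusing the model and background from the earlier snippet:

```python
# Sketch: explain the per-sample log loss instead of the prediction.
# model_output="log_loss" works with the interventional perturbation and
# requires the true labels to be passed alongside the inputs.
loss_explainer = shap.TreeExplainer(model, data=background,
                                    feature_perturbation="interventional",
                                    model_output="log_loss")
loss_shap_values = loss_explainer.shap_values(X_test, y_test)
```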

Due to the popularity, accessibility and strength of the SHAP algorithm, in addition to explaining models and other tasks, SHAP values have been used in the development of new feature selection methods. Rank-based methods using either a rank cut-off or a Shapley value cut-off to determine an important subset of features are straightforward and commonly used. Other methods using SHAP values include BorutaShap [12], shapicant [13] and the recent powershap [4]. BorutaShap (based on the Boruta algorithm) utilises TreeSHAP (and hence only works for tree-based models) and randomly shuffles every input feature to create shadow features. It works on the idea that a feature is important only if it performs better than the best-performing shadow feature. The algorithm is repeated for a number of iterations to allow statistical interpretation, selecting important features based on a cut-off value for the p-values. Instead of permuting input features, shapicant permutes the labels. First, the true SHAP values are calculated using the actual labels; then, for a number of iterations, the labels are randomly shuffled, models are trained and null SHAP values are calculated. Given the true SHAP values and the sets of null SHAP values, p-values can be estimated non-parametrically, enabling the selection of important features using a p-value cut-off. Shapicant specifically uses the means of both the negative and positive Shapley values.

The powershap algorithm has two parts: the Explain component and the Core component. In the first part, when training a model, a uniform random noise feature is added to the model input and the importance of the features (including the random noise feature) is calculated on the validation set (which is used during the training process for early stopping). This process is repeated for a number of iterations, each time generating the random noise with a different random seed, which is also used for splitting the training samples into training and validation sets. Shapley values are aggregated by calculating the mean absolute Shapley values of individual samples. In the Core part of the algorithm, the p-value for each feature is calculated using a percentile formula, ranking the mean (over all iterations) Shapley value of the feature among the Shapley values of the random noise feature obtained from each iteration. Furthermore, powershap has an automatic mode which uses statistical power calculations, heuristically assuming t-distributions for the noise and feature Shapley values, to determine the number of iterations required to reach the required power level (\(\beta\)) for a given p-value cut-off (\(\alpha\)). This eliminates the need to provide the number of iterations the method needs to run before calculating p-values. It is noteworthy that BorutaShap and powershap also calculate global feature importance by averaging (\(\mu\)) the absolute local Shapley values of each feature, while shapicant uses the means of both the positive and negative Shapley values.

Generally, feature selection approaches are classified into three broad categories, namely, filter, wrapper and embedded methods. On most occasions, the aim of feature selection is to find the optimal set of features with maximum accuracy [14]. Similar to powershap, BorutaShap and shapicant, our method belongs to the category of wrapper methods.

Filter methods have several advantages: (a) no models need to be trained, (b) they are easy to implement and fast to execute, and (c) they can handle a higher number of features. A recent study shows that no one criterion is better than the others and that the choice of criterion depends on the specific problem at hand [15]. These approaches have limitations. In addition to imposing assumptions on the data (for example, the F-test assumes a linear relationship between the feature and the outcome variable), these methods do not consider interactions among features and often require a cut-off point specification or hyper-parameter tuning [16]. Wrapper methods invariably involve training models, which can capture non-linear relationships and complex interactions. Greedy sequential searches such as backward and forward selection and recursive feature elimination are examples of wrapper methods. Wrapper methods can be slow and computationally inefficient; however, they generally return fewer but higher-quality features, accounting for non-linearity and interactions. In embedded feature selection methods, feature selection is part of training the model itself. For example, fitted trees in random forest or gradient boosting models can be used to calculate feature importance (e.g., PredictionValuesChange in CatBoost [17] models), facilitating subsequent feature selection. Although these approaches can capture non-linearity and interactions (for example, via feature importances calculated in tree models), they are highly dependent upon the algorithms used for creating the models. Often, they also lack the axiomatic properties (efficiency, symmetry, dummy player and additivity) of Shapley values. Embedded methods also include penalised linear models for feature selection, such as LASSO regression, which require creating additional features to capture non-linearity and interactions.

Like powershap, LLpowershap builds on the idea that a known random feature should, on average, have a lower impact on predictions than an informative feature, but extends the idea in two ways. Firstly, we utilise loss-based Shapley values (LogisticLossSHAP), as they consider the loss instead of the model prediction, which can matter when a model does not predict the output well. Moreover, we calculate LogisticLossSHAP values on unseen data instead of the training/validation sets. Below we demonstrate how LogisticLossSHAP values calculated on the test set avoid attributions that lean towards the noise present in the training set rather than the true underlying patterns, especially when using highly flexible (low bias) machine learning algorithms such as gradient boosting.

Let \(\varvec{x}_i\) be the feature vector for a specific training example, i, with \(y_i\) as the label. Assume that \(y_i\) is erroneously recorded. Let \(\varvec{x}_k\) be the feature vector for test example k which is identical to \(\varvec{x}_i\) (that is, \(\varvec{x}_i = \varvec{x}_k\)) but has the true label, \(y_k\) (that is, \(y_i \ne y_k\)). Let \(f(\varvec{X})\) be the trained model and assume that it greatly overfits the training data as shown in Fig. 1, illustrating that the generalisation performance of a model is related to its prediction capability on independent test data [18]. Further, assume that \(f(\varvec{X})\) uses noise feature j for overfitting the training example i. We know that the predictive Shapley value for feature j, which measures the contribution of feature j to the model's prediction for the training example i, is

\(\phi _{j, \text {predictive}}(\varvec{x}_i) = \sum _{S \subseteq N \setminus \{j\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \big [f_{i}(S \cup \{j\}) - f_{i}(S)\big ]\)

where N is the set of all features and \(f_{i}(S)\) is \(f(\varvec{X})\) evaluated at example i with only the features in S present. The logistic loss Shapley value that incorporates the label \(y_i\) and measures the contribution of feature j to the logistic loss for the training example i will be

\(\phi _{j, \mathrm{logistic\_loss}}(\varvec{x}_i) = \sum _{S \subseteq N \setminus \{j\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \big [L\big (f_{i}(S \cup \{j\}), y_i\big ) - L\big (f_{i}(S), y_i\big )\big ]\)

where the logistic loss L is defined as \(L(f(\textbf{x}), y) = -[y \log (f(\textbf{x})) + (1 - y) \log (1 - f(\textbf{x}))].\)

For the training example i, \(\phi _{j, \text {predictive}}(\varvec{x_i})\) will be high because feature j has a strong contribution to the model's prediction, and \(\phi _{j, \mathrm{logistic\_loss}}(\varvec{x_i})\) will also be high in magnitude because feature j reduces the logistic loss for the example. For the identical test example k, the predictive Shapley value for feature j is

\(\phi _{j, \text {predictive}}(\varvec{x}_k) = \sum _{S \subseteq N \setminus \{j\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \big [f_{k}(S \cup \{j\}) - f_{k}(S)\big ]\)

Since \({\varvec{x}_k = \varvec{x}_i}\), the predictive Shapley value \(\phi _{j, \text {predictive}}(\varvec{x}_k)\) will be high and equal to \(\phi _{j, \text {predictive}}(\varvec{x}_i)\), irrespective of the true label. Predictive Shapley values suffer from such overfitting scenarios, and hence so do BorutaShap, shapicant and powershap. For the test example k with the true label \(y_k\), the logistic loss Shapley value is

\(\phi _{j, \mathrm{logistic\_loss}}(\varvec{x}_k) = \sum _{S \subseteq N \setminus \{j\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \big [L\big (f_{k}(S \cup \{j\}), y_k\big ) - L\big (f_{k}(S), y_k\big )\big ]\)

Since \(y_k \ne y_i\) and the model overfits the training example i using feature j, the logistic loss for the test example k will reflect this discrepancy. Consequently, \(\phi _{j, \mathrm{logistic\_loss}}(\varvec{x}_k)\) will be low, indicating either no contribution or a contribution in the opposite direction compared to the training example. Our observations show that LogisticLossSHAP values calculated on test sets are lower for noise features than values calculated on the training sets; that is, the strength of the noise features is further attenuated by calculating Shapley values on unseen data. On our simulation data, this attenuation over 100 iterations ranged roughly from 1.5 to 4 times, whereas we did not observe such attenuation for informative features.

Secondly, motivated by the observation that using a single noise feature results in a greater number of noise features in the output when LogisticLossSHAP values are used (for example, from 0 noise features to 15 noise features in the output), a new, more powerful set of LogisticLossSHAP values is created by taking the maximum over noises drawn from different standard distributions in each iteration; a sketch of the noise construction follows below. As expected, statistical analyses on the simulation datasets show that this newly created set of LogisticLossSHAP values has higher influence than the individual noises. We insert five noise features drawn from standard uniform, normal, exponential, logistic and Cauchy distributions. Using this set of values is somewhat similar to the strategy followed in BorutaShap, but in BorutaShap shadow features are created for every input feature by randomly shuffling it, and the best-performing shadow feature is picked. Creating shadow features for all features can cause time and space complexity issues when the training data are very large. In principle, it may also be possible to employ noises of a single distribution type, such as the uniform distribution, in LLpowershap, but the current strategy is more aligned with the shadow feature strategy of BorutaShap, in that real features often exhibit specific distributions. For example, in large datasets, the income of participants often exhibits a log-normal characteristic, with frequencies decreasing as income levels go up, whereas measures such as height, weight and blood pressure usually follow a normal distribution.
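A sketch of the noise construction (column names and the seed handling are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # a fresh seed is used in each iteration
n = 5000                             # number of training samples (illustrative)
noise = pd.DataFrame({
    "noise_uniform": rng.uniform(size=n),
    "noise_normal": rng.standard_normal(n),
    "noise_exponential": rng.exponential(size=n),
    "noise_logistic": rng.logistic(size=n),
    "noise_cauchy": rng.standard_cauchy(n),
})
# The five columns are appended to the training data before model fitting;
# the per-iteration reference is the maximum of the five noise features'
# aggregated LogisticLossSHAP values, not of the raw noise values.
```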

Positive and negative Shapley values of a feature for prediction indicate the direction and magnitude of the feature's impact in moving from a global baseline value to the predicted value. Both large positive and large negative Shapley values are valuable for prediction. Unfortunately, tree-based models sometimes use noise features in constructing decision trees, and tree explainers sometimes assign non-zero Shapley values to noise features [19], making a zero-based threshold on mean absolute local Shapley values less attractive for selecting informative features. In the case of Shapley values calculated for explaining the logistic loss, positive and negative values have different meanings: positive Shapley values increase the loss and negative values reduce it. We negate the Shapley values for easier understanding. Again, due to the issues mentioned above, a zero-based threshold is not ideal. We do not take the absolute value of LogisticLossSHAP values, as that would conflate loss-increasing and loss-reducing contributions.

In Algorithm 1, we show how Shapley values are calculated in LLpowershap, which is a modification of the algorithm proposed by Verhaeghe et al. [4]. The key differences are (a) using LogisticLossSHAP instead of predictive SHAP values, (b) splitting the available training samples in the proportions 0.7 for training, 0.1 for validation and 0.2 for the test set (to calculate LogisticLossSHAP values on unseen data), and (c) employing different noise features and taking their maximum in each iteration. A condensed sketch follows the algorithm below.

Algorithm 1 LLpowershap Explain algorithm
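A condensed sketch of one Explain iteration, reflecting our reading of Algorithm 1 rather than the authors' exact code (the aggregation details and a recent xgboost API are assumptions):

```python
# One Explain iteration: 0.7/0.1/0.2 split, early stopping on the validation
# set, LogisticLossSHAP computed on the unseen test split.
import shap
import xgboost as xgb
from sklearn.model_selection import train_test_split

def explain_iteration(X, y, seed):
    # Split into 70% train, 10% validation, 20% test.
    X_tr, X_rest, y_tr, y_rest = train_test_split(
        X, y, train_size=0.7, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(
        X_rest, y_rest, train_size=1 / 3, random_state=seed)  # 0.1 / 0.2 overall

    # early_stopping_rounds as a constructor argument assumes xgboost >= 1.6.
    model = xgb.XGBClassifier(n_estimators=250, early_stopping_rounds=25)
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)

    background = X_tr.sample(min(1024, len(X_tr)), random_state=seed)
    explainer = shap.TreeExplainer(model, data=background,
                                   feature_perturbation="interventional",
                                   model_output="log_loss")
    # Negated so that larger values mean a larger reduction in loss.
    sv = -explainer.shap_values(X_te, y_te)
    return sv.mean(axis=0)  # one aggregated value per feature this iteration
```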

Powershap uses a percentile formula to calculate p-values empirically, ranking the mean Shapley value of a feature within the Shapley values of the random uniform feature. In contrast, for each iteration we consider only the noise feature that had the highest influence. To calculate a p-value empirically, we treat the distributions of the Shapley values of both the highest noise and the feature as non-parametric and employ the Mann-Whitney U test. The null hypothesis is that there is no difference between the distributions, and the alternative hypothesis is that the Shapley value distribution of the feature is greater than that of the noise. This yields a p-value for each feature; using a user-supplied p-value cut-off \(\alpha\), we can then find the set of informative features. We use scipy.stats.mannwhitneyu [20] to calculate the p-values. Algorithm 2 shows the modifications to calculate p-values, and a sketch follows it below.

Algorithm 2 LLpowershap Core algorithm
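A sketch of the per-feature test (variable names are illustrative):

```python
# One-sided Mann-Whitney U test per feature, comparing its per-iteration
# Shapley values against the per-iteration maxima over the noise features.
from scipy.stats import mannwhitneyu

def feature_pvalue(feature_vals, noise_max_vals):
    # H1: the feature's Shapley values are stochastically greater than
    # those of the strongest noise feature.
    return mannwhitneyu(feature_vals, noise_max_vals,
                        alternative="greater").pvalue

# informative = {f for f, p in pvalues.items() if p <= alpha}  # e.g. alpha = 0.01
```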

We utilise the powerful automatic mode designed in [4] with the following changes. We set the initial number of iterations to 20 instead of 10, which helps to estimate more accurate p-values. The other change is in the calculation of the effect size. We test the equality of variances using Levene's test; if the variances do not differ at the user-provided \(\alpha\) level, we use the pooled standard deviation (Eqs. 7 and 8), and where variances are not equal, we use Glass's delta (Eq. 9, using the standard deviations of the Shapley values of the features) to estimate the effect size. We find that, generally, for LogisticLossSHAP values, the augmented noise features have much smaller means (roughly 50 to 1,000 times smaller) and standard deviations (5 to 100 times smaller) than informative features in our simulation data. We assume t-distributions (our observations on the simulation data show this is a reasonable assumption) for both the noise and informative features in the empirical power calculations. Algorithm 3 shows the power analysis, and a sketch of the effect-size rule follows it below. We do not change the following aspects of the automatic mode of powershap: the default \(\alpha\) is set to 0.01 and the required power to 0.99; if the number of required iterations estimated using the power calculation exceeds 10, an additional 10 iterations are added and the p-values and required iterations are calculated again; by default, this is repeated three times to avoid a possibly infinite calculation. We provide source files that can replace certain existing files in powershap [4] to make use of LLpowershap.

Algorithm 3 LLpowershap analysis function
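A sketch of the effect-size rule under the assumptions stated above (function and variable names are ours):

```python
# Levene's test chooses between the pooled standard deviation and
# Glass's delta when estimating the effect size.
import numpy as np
from scipy.stats import levene

def effect_size(feature_vals, noise_vals, alpha=0.01):
    diff = np.mean(feature_vals) - np.mean(noise_vals)
    if levene(feature_vals, noise_vals).pvalue >= alpha:
        # Variances not significantly different: pooled standard deviation.
        n1, n2 = len(feature_vals), len(noise_vals)
        pooled = np.sqrt(((n1 - 1) * np.var(feature_vals, ddof=1) +
                          (n2 - 1) * np.var(noise_vals, ddof=1)) / (n1 + n2 - 2))
        return diff / pooled
    # Unequal variances: Glass's delta with the feature's standard deviation.
    return diff / np.std(feature_vals, ddof=1)
```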

We repeat the experiments conducted in [4] with some additions. We include LLpowershap and a rank-based feature selection method that heuristically considers the top 3% of features (when there are more than 100 features to select from) as the relevant features, as used in comparing different feature selection methods on synthetic datasets in [21]. To test the applicability of LLpowershap alongside the other methods, we additionally include a very large real-world biomedical dataset consisting of more than 450,000 samples and over 2,800 features. We do not include forward selection due to the time complexity issues noted in [4]. To allow direct comparability, we employ the same simulation and real-world datasets as in [4] and the same random seeds to generate and split the datasets. In the simulation data, we additionally include the case of only 3% of the total number of features being informative, to resemble the situation in many large real-world datasets where only a small percentage of features are informative for the selected outcome, especially if the outcome is a rare event. We also test the methods with 10,000 samples in the simulation data. The two filter methods (providing p-values) selected for comparison are chi-squared and F-test feature selection, available from the sklearn library [22]; a sketch follows below. Input values are shifted to a positive range by adding the absolute value of the minimum, as the chi-squared test works only with positive values. The chi-squared test returns a low p-value when a feature is not independent of the outcome variable. The F-test fits univariate linear regression models and reports an F-score and p-value with the null hypothesis that the feature does not differentiate between the two classes. For all Shapley value-based methods we used an XGBoost [23] gradient boosting tree model with 250 estimators and overfitting detection enabled (early_stopping_rounds = 25). We used an XGBoost model instead of a CatBoost model (as done in [4]) because CatBoost lacked Interventional TreeSHAP for the logistic loss. This minimised algorithm-specific differences (for example, CatBoost has ordered boosting and dynamic selection of the learning rate based on the dataset and the number of estimators specified). XGBoost also has the advantage of choosing only one feature from a set of perfectly correlated features to construct decision trees, rendering Shapley values of zero for the features not used; CatBoost uses all perfectly correlated features, distributing feature importance among them. All Shapley value-based methods are tested in their default mode. Both powershap and LLpowershap support early stopping, and hence a validation set was provided; BorutaShap and shapicant in default mode do not support early stopping. In the experiments we were primarily interested in the returned informative features, the noise features in the output and the performance of models using only the selected features. The code to reproduce the results can be found at https://github.com/madakkmi/LLpowershap.
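A sketch of the two filter baselines on a toy dataset (the shift mirrors the description above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, f_classif

X, y = make_classification(n_samples=5000, n_features=100,
                           n_informative=10, random_state=0)
X_shifted = X + np.abs(X.min(axis=0))   # chi2 accepts only non-negative inputs
chi2_stat, chi2_pvals = chi2(X_shifted, y)
f_stat, f_pvals = f_classif(X, y)
informative = np.where(chi2_pvals < 0.01)[0]  # features deemed dependent on y
```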

All the methods (except the top 3%) are tested on simulation data containing 5,000 and 10,000 samples, created using make_classification of sklearn with the hypercube parameter set to True. When the hypercube parameter is enabled, clusters of samples are placed on the vertices of a hypercube. The simulations are run for 20, 100, 250 and 500 features with the percentage of informative features varying as 3% (one feature if 3% is less than one), 10%, 33%, 50% and 90%, enabling a comprehensive evaluation of the methods under varying conditions. We repeated the experiments five times using different random seeds when invoking the make_classification function, and set the numbers of repeated and redundant (linear combinations of informative features) features to zero, as in the sketch below. The methods tested were LLpowershap, powershap, BorutaShap, shapicant, chi-squared and F-test.
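One cell of the simulation grid might look as follows (a sketch; the seed shown is one of the five used):

```python
from sklearn.datasets import make_classification

X_sim, y_sim = make_classification(
    n_samples=5000,
    n_features=100,
    n_informative=10,   # 10% of 100 features
    n_redundant=0,      # no linear combinations of informative features
    n_repeated=0,       # no duplicated features
    hypercube=True,     # clusters on the vertices of a hypercube
    random_state=0,     # repeated with five different seeds
)
```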

We evaluate the different methods in their default configuration on three publicly available high-dimensional classification datasets, namely Madelon [24], Gina priori [25] and the Scene dataset [26], as well as on UK Biobank [27] data prepared for predicting cancer incidence. Madelon is a multivariate and highly non-linear dataset, whereas Gina priori is a digit recognition dataset and Scene is a scene recognition dataset. The UK Biobank data were prepared to identify risk factors of cancer incidence in over 400,000 UK Biobank participants after their baseline visit, for a follow-up period of 10 years [28]. The dataset contained features pertaining to personal characteristics, lifestyle and environment, physical measurements, self-reported diseases, medications and operations, and blood biomarkers. These baseline measures were captured via touch-screen questionnaires, face-to-face interviews, physical measurements, and blood and urine sampling from UK Biobank participants between 2006 and 2010, during their initial visit to one of the 22 assessment centers across England (89%), Wales (4%) and Scotland (7%) [27]. The dataset is class imbalanced (the cancer incidence rate was around 11%).

The characteristics of these datasets are shown in Table 1. For the Scene dataset, we used the label “Urban” for the experiments. The datasets are split into training and test sets at the ratio of 75:25. All methods are evaluated using 10-fold cross-validation with two other independent state-of-the-art gradient boosting decision tree algorithms, CatBoost and LightGBM [29], in their default mode. Performance was further assessed on the test set using 1,000 bootstrapped datasets. The models were evaluated using the area under the receiver-operating characteristic curve (AUROC) metric. We statistically test the normality and homogeneity of variances of the AUROC scores to decide whether to use the Friedman or ANOVA test for determining overall significance among the feature selection methods, and the Nemenyi or Tukey's honest significant difference (HSD) post-hoc test to compare pairs of feature selection methods. We used the Python package autorank for the statistical tests [30], as sketched below.
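A sketch of the ranking step with autorank, using a synthetic stand-in for the collected 16 × 7 AUROC table (the real scores come from the experiments above):

```python
import numpy as np
import pandas as pd
from autorank import autorank, plot_stats

# Illustrative stand-in: rows are testing scenarios, columns are methods.
rng = np.random.default_rng(0)
methods = ["LLpowershap", "powershap", "BorutaShap", "shapicant",
           "chi-squared", "F-test", "top 3%"]
auroc_scores = pd.DataFrame(rng.uniform(0.7, 0.9, size=(16, 7)), columns=methods)

result = autorank(auroc_scores, alpha=0.01, verbose=False)  # picks the right test
plot_stats(result)  # critical difference diagram, cf. Fig. 5
```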

Table 2 shows the simulation results in terms of the percentage of informative features found for the datasets with sample sizes of 5,000 and 10,000, whereas Table 3 shows the results in terms of the number of selected noise features. For the datasets with 5,000 samples, LLpowershap returns the maximum number of informative features in all cases except when there are 500 features with 50% or 90% informative features. LLpowershap and powershap have similar performance in terms of the percentage of informative features found, but LLpowershap has a lower number of noise features (in terms of mean and standard deviation) in the output. Among the filter methods, the chi-squared method performs better than the F-test, finding a higher number of informative features with no noise in the output. BorutaShap and shapicant have similar performance except for 100 features with 90% informative features. Figure 2 shows the simulation results for each seed, with the first row showing the percentage of informative features found and the second row the number of selected noise features for the dataset with 5,000 samples. The results show higher noise output for powershap compared to LLpowershap. For 10,000 samples, LLpowershap and powershap have comparable performance in terms of the percentage of informative features found, but in most settings LLpowershap outputs a lower number of noise features, returning no or almost no noise features for 20, 100 and 250 features, and fewer noise features than powershap for 500 features. We observe that the filter methods, chi-squared and F-test, do not improve performance with the additional 5,000 samples; a similar trend is noticed for shapicant. However, LLpowershap, powershap and BorutaShap improve their performance with the additional samples. The reduced performance of LLpowershap and powershap when 90% of the 500 features are informative with 5,000 samples compared to 10,000 samples (as shown in Table 2) indicates underfitting of the models due to the limited number of samples available to learn the underlying patterns.

Table 4 shows the sizes of the selected feature sets on the benchmark classification datasets. As expected, the filter methods tend to select more features (except for the F-test on Madelon, where the features have a highly non-linear relationship with the outcome). BorutaShap picks the lowest number of features, probably because selected features must outperform the best shadow feature.

In our experiments, we evaluated the timing of the various feature selection methods on a high-performance Windows 10 Enterprise machine, as shown in Table 5. The system was equipped with 128 GB of RAM and an Intel(R) Xeon(R) W-1270P CPU running at 3.80 GHz with 8 cores and 16 logical processors. Only minimal background processes were running at the time of evaluation. As anticipated, the filter methods were extremely quick as they do not involve training multivariate models. Powershap outpaced all other wrapper methods.

Prediction performance of the selected features using 10-fold cross validation and on the test set is shown in Fig. 3 for a default CatBoost model and in Fig. 4 for a default LightGBM model, measured using AUROC. In most cases, LLpowershap has the highest performance among the datasets and the methods. The top 3% method performs poorly on the Scene and Gina priori datasets but performs better on the Madelon dataset. For the UKB Cancer data, all methods perform well in the cross-validation. Interestingly, we see a very similar trend in performance (between Figs. 3 and 4) when we use a CatBoost or a LightGBM to measure the performance of the selected features. The statistical analysis was conducted for the seven feature selection methods using 16 testing scenarios (four datasets, each assessed using two different machine learning algorithms (CatBoost and LightGBM) and two evaluation methods (10-fold cross-validation and test set)). The family-wise significance level (\(\alpha\)) of the tests was 0.01. We rejected the null hypothesis that the distribution of AUROC scores for a feature selection method is normal for the methods LLpowershap (\(p = 0.001\)), shapicant (\(p = 0.001\)), chi-squared (\(p < 0.001\)), F-test (\(p < 0.001\)) and top 3% (\(p < 0.001\)). Therefore, we assumed that the scores of all feature selection methods did not follow a normal distribution. Hence, we used the non-parametric Friedman test to determine if there were significant differences between the median values of the methods, followed by the post-hoc Nemenyi test to identify specific differences. We report the median, median absolute deviation, and the mean rank among all methods over the samples in Table 6. Differences between methods were significant if the mean rank difference was greater than the critical distance (CD = 2.637) as shown in Fig. 5. The Friedman test rejected the null hypothesis (\(p < 0.001\)), indicating a significant difference in the central tendency of the methods. Based on the Nemenyi test, no significant differences were found within the groups: {BorutaShap, shapicant, top 3%}, {powershap, shapicant, chi-squared, F-test, top 3%} and {LLpowershap, powershap, shapicant, chi-squared, F-test}. All other differences were significant. LLpowershap ranked the best in mean ranking among all the feature selection methods (Fig. 5).

Fig. 1 Model performance on training and test data with respect to model complexity. The model \(f(\varvec{X})\) overfits the training data

Fig. 2 Simulation benchmark results on the datasets created using the make_classification of sklearn for 5,000 samples with five different random seeds. X-axis labels show the percentages of informative features in the datasets with the counts in brackets. Each dot of a particular colour represents the result from one simulation (for example, using random seed 0) for a particular method

Fig. 3 Benchmark performance using a default CatBoost model, with error bars representing the standard deviations

Fig. 4 Benchmark performance using a default LightGBM model, with error bars representing the standard deviations

Fig. 5 Critical difference diagram showing average ranks of the feature selection methods on the benchmark datasets

Our experiments on simulated data as well as on the benchmark datasets demonstrate that there is an advantage in using LLpowershap in terms of selecting a higher number of informative features with minimal or no noise, and in providing higher performance on the selected set of features. The advantage is evident on the simulation data and on the benchmark datasets, including the biomedical dataset, UKB Cancer. Further improvements could be achieved by replacing the default XGBoost models with hyper-parameter-tuned XGBoost models and by optimising the wrapper hyper-parameters such as the \(\alpha\) and \(\beta\) values. As seen in the simulation results for 500 features with 90% informative features for 5,000 and 10,000 samples, resolving underfitting can help extract an even higher number of informative features and further reduce the noise features in the output. The other Shapley value-based feature selection methods (especially BorutaShap and shapicant) may also benefit from using tuned models. Utilising powershap's convergence option, LLpowershap can be used to extract the maximum number of informative features by successively removing already found informative features from the input; outputting a lower number of noise features becomes highly relevant in this scenario. Unselected collinear features can also be recovered using the convergence option. Although filter methods can handle large amounts of data such as healthcare data, their performance does not necessarily improve with additional samples, as shown by the results of the chi-squared and F-test methods on simulated data with an additional 5,000 samples. This may make them less attractive for large biomedical datasets. Benchmark results on the Gina priori dataset show the weakness of filter methods: the chi-squared method selects 80% of the features as important and the F-test selects 52%, whereas LLpowershap selects only 16% yet matches their performance in 10-fold cross-validation using both CatBoost and LightGBM models. Moreover, LLpowershap is ranked the best in mean ranking among all methods across the different testing scenarios on the benchmark datasets, even though it selects far fewer features as important compared to the filter methods.

LLpowershap tends to be slower than powershap because the time complexity of calculating Shapley values using Interventional TreeSHAP scales with the model size and the size of the background dataset. We set the background dataset to the maximum recommended size of 1,024 samples. Path-dependent TreeSHAP (as employed by powershap, BorutaShap and shapicant) does not suffer from this issue; were it not for this issue, LLpowershap and powershap would have similar time complexity. On the Madelon dataset, we find LLpowershap is faster than BorutaShap and shapicant but slower than powershap. On the Gina priori and Scene datasets, LLpowershap is faster than shapicant but slower than BorutaShap. On the UKB Cancer dataset with over 450,000 samples and over 2,800 features, among the wrapper methods, powershap was the fastest, followed by LLpowershap, the top 3% method, BorutaShap and shapicant, confirming the viability of LLpowershap with respect to execution time. We anticipate that the developers of the XGBoost and SHAP packages for Python will incorporate GPU-based calculation of logistic loss-based Shapley values, resulting in significant speed-ups for LLpowershap.

Every study comes with some limitations and this one is no exception. As noted in [4], make_classification in its default mode generates data points from hypercubes as linear classification problems, which are easier to classify. Also, as the no-free-lunch theorems state, no single algorithm performs best on all possible datasets, so there is still a need to tune the hyper-parameters of the models used within a wrapper and to optimise the hyper-parameters of the wrapper method itself for specific datasets. The results of such time-consuming steps could differ from what we have presented here, but these analyses suggest that there are inherent benefits in using loss-based Shapley values on unseen data.

Before opting for five noise features of different distributions, we experimented with the idea of permuting the most important feature (by Shapley values) as the single noise feature, replacing the uniform random noise in powershap, but our experiments showed that this may not be beneficial, at least on the simulation data in this work. We also tried using uniform noise with a very small correlation (<1e-6) with the outcome variable, and our experiments showed this may not be particularly useful, given that such noise features might have higher correlations with other features, thereby becoming useful in the construction of decision trees. We also experimented with loss-based Shapley values on the training and validation sets, and our experiments showed that they are not as effective as Shapley values on the test set.

For rank-based feature selection methods, it is difficult to set a threshold for discarding irrelevant features that works across many datasets. Strategies include defining a threshold by the largest gap between two consecutively ranked features [31] or heuristically considering anything below the top 3% of features (if there are more than a hundred features to select from) as irrelevant, as used in [21]. From a predictive performance perspective, our tests on the benchmark datasets showed this strategy to be effective only for the Madelon dataset and, to some extent, the UKB Cancer dataset, underlining the need to check the performance of models with the selected features before conducting further follow-up analyses. Methods such as LLpowershap that output features based on a given p-value cut-off are more effective.

With the advent of big data, we have the opportunity to deal with large-scale datasets (both in terms of samples and features), making it more important to perform feature selection as an initial screening step in the analysis pipeline. Often these large datasets contain a very high proportion of features that have no relevance to the outcome of interest. An illustrative example is large-scale biomedical datasets containing comprehensive information (including hospital admissions, treatment and medication history) on participants, which can be used for discovering risk factors for relatively rare diseases. Our simulation results, in which only 3% of the total number of features were specified as informative and the number of features was sufficiently large (250 and 500), show a miniature version of this scenario. LLpowershap's better performance in terms of noise output in such a scenario (for example, 3% informative features out of a total of 500 features, bottom right plot in Fig. 2) suggests that our method is well suited to large-scale real-world datasets with a very small proportion of features relevant to the outcome of interest. For example, all 41 features selected by LLpowershap for UKB Cancer were also reported as important features in [28].

We propose LLpowershap (logistic loss-based powershap) as a viable wrapper feature selection method for classification on a wide range of datasets, as our experiments show that it outputs an increased number of informative features with fewer noise features. On the benchmark real-world datasets, including a large biomedical dataset, our method shows consistently higher or on-par performance on the selected features, tested using two independent state-of-the-art gradient boosting decision tree algorithms.

All datasets except UKB Cancer used in this project are publicly available. This research has been conducted using the UK Biobank Resource under Application Number 89630. The UK Biobank is an open access resource, and bona fide researchers can apply to use the UK Biobank dataset by registering and applying at https://www.ukbiobank.ac.uk/enable-your-research/register. The code for LLpowershap can be found at https://github.com/madakkmi/LLpowershap.


Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, et al. Feature selection: a data perspective. ACM Comput Surv. 2017;50(6):1–45.


Madakkatel I, Zhou A, McDonnell MD, Hyppönen E. Combining machine learning and conventional statistical approaches for risk factor discovery in a large cohort study. Sci Rep. 2021;11(1):22997.


Liu X, Morelli D, Littlejohns TJ, Clifton DA, Clifton L. Combining machine learning with Cox models to identify predictors for incident post-menopausal breast cancer in the UK Biobank. Sci Rep. 2023;13(1):9221.


Verhaeghe J, Van Der Donckt J, Ongenae F, Van Hoecke S. Powershap: a power-full shapley feature selection method. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer; 2022. pp. 71–87.

Shapley LS. A value for n-person games, Contributions to the Theory of Games (AM-28), Volume II; 1953. p. 307–18.

Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30:4765–74.


Merrick L, Taly A. The explanation game: Explaining machine learning models using shapley values. In: Machine Learning and Knowledge Extraction: 4th IFIP TC 5, TC 12, WG 8.4, WG 8.9, WG 12.9 International Cross-Domain Conference, CD-MAKE 2020, Dublin, Ireland, August 25–28, 2020, Proceedings 4. Springer; 2020. pp. 17–38.

Chen H, Covert IC, Lundberg SM, Lee SI. Algorithms to estimate Shapley value feature attributions. Nat Mach Intel. 2023;5(6):590–601.


Lundberg SM, Erion G, Chen H, DeGrave A, Prutkin JM, Nair B, et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intel. 2020;2(1):56–67.


Covert I, Lundberg SM, Lee SI. Understanding global feature contributions with additive importance measures. Adv Neural Inf Process Syst. 2020;33:17212–23.


Tripathi S, Hemachandra N, Trivedi P. Interpretable feature subset selection: A Shapley value based approach. In: 2020 IEEE International Conference on Big Data (Big Data). IEEE; 2020. pp. 5463–5472.

Keany E. Ekeany/Boruta-Shap: A Tree based feature selection tool which combines both the Boruta feature selection algorithm with shapley values. 2020. https://github.com/Ekeany/Boruta-Shap. Accessed 2 Jan 2024.

Calzolari M. manuel-calzolari/shapicant: Feature selection package based on SHAP and target permutation, for pandas and Spark. 2020. https://github.com/manuel-calzolari/shapicant. Accessed 2 Jan 2024.

Kohavi R, John GH. Wrappers for feature subset selection. Artif Intell. 1997;97(1–2):273–324.


Bommert A, Sun X, Bischl B, Rahnenführer J, Lang M. Benchmark for filter methods for feature selection in high-dimensional classification data. Comput Stat Data Anal. 2020;143:106839.


Kumari B, Swarnkar T. Filter versus wrapper feature subset selection in large dimensionality micro array: A review. Int J Comput Sci Inf Technol. 2011;2(3):1048–53.


Dorogush AV, Ershov V, Gulin A. CatBoost: gradient boosting with categorical features support. 2018. arXiv preprint arXiv:1810.11363.

Hastie T, Tibshirani R, Friedman JH, Friedman JH. The elements of statistical learning: data mining, inference, and prediction. vol. 2. Springer; 2009.

Linardatos P, Papastefanopoulos V, Kotsiantis S. Explainable ai: A review of machine learning interpretability methods. Entropy. 2020;23(1):18.


Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261–72.

Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A. Feature selection for high-dimensional data. Springer; 2015.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.


Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM; 2016. p. 785–94. https://doi.org/10.1145/2939672.2939785.

Guyon I, Gunn S, Ben-Hur A, Dror G. Result analysis of the NIPS 2003 feature selection challenge. Adv Neural Inf Process Syst. 2004;17. https://www.openml.org/d/1485. Accessed 2 Jan 2024.

Gina Priori. OpenML; 2014. https://www.openml.org/d/1042. Accessed 2 Jan 2024.

Boutell MR, Luo J, Shen X, Brown CM. Scene. OpenML; 2004. https://www.openml.org/d/1042. Accessed 2 Jan 2024.

Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 2015;12(3):e1001779.


Madakkatel I, Lumsden AL, Mulugeta A, Olver I, Hyppönen E. Hypothesis‐free discovery of novel cancer predictors using machine learning. Eur J Clin Investig. 2023;53(10):e14037.

Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. Lightgbm: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54.


Herbold S. Autorank: A Python package for automated ranking of classifiers. J Open Source Softw. 2020;5(48):2173. https://doi.org/10.21105/joss.02173.

Sánchez-Maroño N, Alonso-Betanzos A, Tombilla-Sanromán M. Filter methods for feature selection–a comparative study. In: International Conference on Intelligent Data Engineering and Automated Learning. Springer; 2007. pp. 178–187.


We acknowledge the source code made available by the authors of the powershap package at https://github.com/predict-idlab/powershap, which we selectively modified to create LLpowershap. We are indebted and thankful to all participants of the UK Biobank, who made this work possible.


This project was funded by Medical Research Future Fund (MRFF), Grant/Award Number: 2007431. Elina Hyppönen is funded by NHMRC Investigator Fellowship (GNT 2025349).

Australian Centre for Precision Health, Unit of Clinical and Health Sciences, University of South Australia, Adelaide, 5001, South Australia, Australia

Iqbal Madakkatel & Elina Hyppönen

South Australian Health and Medical Research Institute (SAHMRI), Adelaide, 5001, South Australia, Australia

Iqbal Madakkatel & Elina Hyppönen


EH and IM conceived the idea. IM designed and programmed the methodology and drafted the manuscript. EH managed funding acquisition, supervised, reviewed and edited the manuscript.

Correspondence to Iqbal Madakkatel.

The UK Biobank project was approved by the National Information Governance Board for Health and Social Care and North West Multi-centre Research Ethics Committee (11/NW/0382) and the study was conducted in accordance with the Declaration of Helsinki. Participants provided electronic consent to use their anonymized data and samples for health-related research, to be recontacted for further sub-studies and for the UK Biobank to access their health-related records.


The authors declare no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


Madakkatel, I., Hyppönen, E. LLpowershap: logistic loss-based automated Shapley values feature selection method. BMC Med Res Methodol 24, 247 (2024). https://doi.org/10.1186/s12874-024-02370-8


Received: 21 February 2024

Accepted: 14 October 2024

Published: 24 October 2024

