The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data-generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied to each dataset and the set of chosen features is recorded for each one. The stability of the feature selection is then assessed from these sets of chosen features using stability measures.
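The resampling workflow described above could be sketched in base R as follows. The simulated data, the helper select_top_k, and the choice of 5 subsamples of 80% are illustrative assumptions, not part of the package; any feature selection method that returns names or indices could take the helper's place.

```r
set.seed(1)

# Simulated data (assumption): 100 observations, 10 features,
# of which only the first two influence the response.
n <- 100
p <- 10
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- X[, 1] + X[, 2] + rnorm(n)

# Hypothetical selection method: keep the k features with the
# largest absolute correlation with the response.
select_top_k <- function(X, y, k = 3) {
  scores <- abs(cor(X, y))[, 1]
  sort(order(scores, decreasing = TRUE)[seq_len(k)])
}

# Apply the same method to several subsamples of the single dataset
# and record one set of chosen feature indices per subsample.
feature_sets <- lapply(seq_len(5), function(b) {
  idx <- sample(n, size = floor(0.8 * n))
  select_top_k(X[idx, , drop = FALSE], y[idx])
})
```

The resulting list has one element per subsample and every element indexes the same 10 features, so it has the shape expected by the features argument, with p = ncol(X).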
stabilityPhi(features, p, impute.na = NULL)
features: list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (character) or indices (integerish).
p: numeric(1)
Total number of features in the datasets.
impute.na: numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? NULL means no imputation.
Value: numeric(1)
Stability value.
The stability measure is defined as the average phi coefficient between all pairs of feature sets. It can be rewritten as (see Notation) $$\frac{2}{m (m - 1)} \sum_{i=1}^{m-1} \sum_{j = i+1}^m \frac{|V_i \cap V_j| - \frac{|V_i| \cdot |V_j|}{p}} {\sqrt{|V_i| (1 - \frac{|V_i|}{p}) \cdot |V_j| (1 - \frac{|V_j|}{p})}}.$$
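As a cross-check of the formula, the following base R sketch evaluates it term by term. The function name phi_stability_sketch is hypothetical and this is not the package's implementation; it assumes that every set contains unique features and that no set is empty or contains all p features, since the denominator is zero in those cases.

```r
# Hypothetical helper: average pairwise phi coefficient, written
# directly from the formula above.
phi_stability_sketch <- function(features, p) {
  m <- length(features)
  sizes <- lengths(features)   # |V_i| for each set
  pairs <- combn(m, 2)         # all index pairs i < j, one per column
  phi <- apply(pairs, 2, function(idx) {
    i <- idx[1]
    j <- idx[2]
    overlap <- length(intersect(features[[i]], features[[j]]))
    numerator <- overlap - sizes[i] * sizes[j] / p
    denominator <- sqrt(sizes[i] * (1 - sizes[i] / p) *
                        sizes[j] * (1 - sizes[j] / p))
    numerator / denominator
  })
  # There are m (m - 1) / 2 pairs, so the mean equals the
  # 2 / (m (m - 1)) * sum(...) form of the formula.
  mean(phi)
}

feats <- list(1:3, 1:4, 1:5)
result <- phi_stability_sketch(feats, p = 10)
print(result)
```

For these three sets the pairwise terms are 1.8 / sqrt(5.04), 1.5 / sqrt(5.25) and 2 / sqrt(6), and their mean is approximately 0.7576447.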
For the definition of all stability measures in this package,
the following notation is used:
Let \(V_1, \ldots, V_m\) denote the sets of chosen features for the \(m\) datasets, i.e. the argument features has length \(m\) and \(V_i\) is a set which contains the \(i\)-th entry of features.
Furthermore, let \(h_j\) denote the number of sets that contain feature
\(X_j\) so that \(h_j\) is the absolute frequency with which feature \(X_j\)
is chosen.
Analogously, let \(h_{ij}\) denote the number of sets that include both \(X_i\) and \(X_j\).
Also, let \(q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i|\) and \(V = \bigcup_{i=1}^m V_i\).
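To make the notation concrete, the quantities can be computed in base R for a small example. This is only an illustrative sketch: the variable names are chosen here, and it assumes features are given as integer indices between 1 and p (names would first have to be matched to positions).

```r
feats <- list(1:3, 1:4, 1:5)   # V_1, V_2, V_3
p <- 10
m <- length(feats)

# h[j]: number of sets that contain feature X_j
h <- tabulate(unlist(feats), nbins = p)

# q: total number of selections, computed both ways from the definition
q <- sum(h)
q_from_sizes <- sum(lengths(feats))

# V: union of all sets of chosen features
V <- Reduce(union, feats)

# Z[i, j] = 1 if set V_i contains feature X_j (m x p incidence matrix)
Z <- t(sapply(feats, function(v) as.integer(seq_len(p) %in% v)))

# h_pair[i, j]: number of sets containing both X_i and X_j;
# the diagonal reproduces h
h_pair <- crossprod(Z)
```

Here h is (3, 3, 3, 2, 1, 0, 0, 0, 0, 0), q is 12, V is the set of features 1 to 5, and h_pair[1, 4] is 2 because features 1 and 4 are chosen together in the second and third sets.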
Nogueira S, Brown G (2016). “Measuring the Stability of Feature Selection.” In Machine Learning and Knowledge Discovery in Databases, 442–457. Springer International Publishing. doi:10.1007/978-3-319-46227-1_28.
Bommert A, Rahnenführer J, Lang M (2017). “A Multicriteria Approach to Find Predictive and Sparse Models with Stable Feature Selection for High-Dimensional Data.” Computational and Mathematical Methods in Medicine, 2017, 1–18. doi:10.1155/2017/7907163.
Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.
feats = list(1:3, 1:4, 1:5)
stabilityPhi(features = feats, p = 10)
#> [1] 0.7576447
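The situation that impute.na addresses can be illustrated without the package. The sketch below assumes that imputation means replacing undefined pairwise values before averaging; the package's internal handling may differ. The helper pairwise_phi_sketch and the imputation value 0 are hypothetical choices for illustration.

```r
# Hypothetical helper: phi coefficient for every pair of sets (i < j)
pairwise_phi_sketch <- function(features, p) {
  sizes <- lengths(features)
  apply(combn(length(features), 2), 2, function(idx) {
    a <- sizes[idx[1]]
    b <- sizes[idx[2]]
    overlap <- length(intersect(features[[idx[1]]], features[[idx[2]]]))
    (overlap - a * b / p) / sqrt(a * (1 - a / p) * b * (1 - b / p))
  })
}

# The third feature set is empty, so both pairs involving it give 0 / 0
feats_with_empty <- list(1:3, 1:4, integer(0))
phi <- pairwise_phi_sketch(feats_with_empty, p = 10)

# Without imputation the average is undefined
mean_without_imputation <- mean(phi)

# With imputation (assumed value 0) the undefined pairs count as 0
phi_imputed <- ifelse(is.na(phi), 0, phi)
mean_with_imputation <- mean(phi_imputed)
```

In this sketch only the pair of the first two sets yields a defined value, so the unimputed mean is undefined, while imputing 0 averages one defined value with two zeros.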