Add model-agnostic permutation feature importance function

Description

Permutation feature importance is a model-agnostic way to measure feature importance. All our algorithms (except Stacked Ensemble at the moment) have built-in feature importance, but a model-agnostic method would also cover models that lack one, such as Stacked Ensemble. It makes sense to provide it as a separate function rather than computing it automatically as part of the model-building process. It could also be used as a new method for metalearning (model selection) inside a Stacked Ensemble.

Here is the methodology:

A) Start with a hold-out dataset (or use k-fold cross-validation).
B) Make predictions with the ensemble model and measure AUC or whichever other metric you prefer (these have already been computed for the leaderboard). Let's say this gives an AUC of 0.8.
C) For each column in the data:

  1. Randomly shuffle the column.

  2. Repeat the scoring with that column randomized (and every other column left intact).

  3. Measure AUC again. Let's say the AUC is now 0.7. The difference between the original AUC and this one (where one feature is scrambled) is the importance of that column, here 0.8 - 0.7 = 0.1.

  4. Restore the column to its original values and repeat for the next column.
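
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the proposed implementation: `permutation_importance`, `auc`, the `n_repeats` and `seed` parameters, and the toy `predict` function are all hypothetical names invented for this example. Averaging over several shuffles is an addition to the steps above, meant to reduce noise from any single permutation.

```python
import random


def auc(y_true, scores):
    """AUC as the probability that a positive row outscores a negative one (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_importance(predict, rows, y, metric=auc, n_repeats=5, seed=0):
    """Return (baseline score, per-column importance = baseline - mean permuted score)."""
    rng = random.Random(seed)
    baseline = metric(y, predict(rows))          # step B: score the untouched hold-out data
    importances = []
    for j in range(len(rows[0])):                # step C: one column at a time
        permuted_scores = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)                  # C.1: shuffle only column j
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            permuted_scores.append(metric(y, predict(shuffled)))  # C.2 and C.3
        importances.append(baseline - sum(permuted_scores) / n_repeats)
        # C.4: `rows` was never modified, so there is nothing to restore
    return baseline, importances


# Toy hold-out set: the label depends on column 0 only; column 1 is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(300)]
y = [1 if row[0] + data_rng.gauss(0, 0.1) > 0 else 0 for row in rows]


def predict(batch):
    """Hypothetical stand-in for a fitted model: scores each row by column 0."""
    return [row[0] for row in batch]


baseline, importances = permutation_importance(predict, rows, y)
print(f"baseline AUC: {baseline:.3f}")
for j, value in enumerate(importances):
    print(f"column {j}: importance {value:.3f}")
```

Because the toy `predict` ignores column 1, shuffling that column leaves the score unchanged and its importance comes out as zero, while shuffling column 0 drops the AUC sharply. The sketch builds a shuffled copy for each evaluation instead of shuffling in place, which makes the restore step in C.4 unnecessary at the cost of extra memory.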

References: https://christophm.github.io/interpretable-ml-book/feature-importance.html
The permutation feature importance measurement was introduced by Breiman (2001) for random forests. Based on this idea, Fisher, Rudin, and Dominici (2018) proposed a model-agnostic version of the feature importance and called it model reliance.

Assignee

Ard Kelmendi

Fix versions

None

Reporter

Erin LeDell

Support ticket URL

None

Labels

Affected Spark version

None

Customer Request Type

None

Task progress

None

CustomerVisible

No

Priority

Major