Create dynamic Stacked Ensemble metalearning strategy in AutoML based on data size

Description

There's strong evidence that as datasets grow in size, it's better to switch over to using a holdout blending frame for training the metalearner in Stacked Ensembles instead of 5-fold CV (which is what we currently use by default on all datasets).
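For context, here is a minimal sketch of the two metalearner training modes in the H2O Python API. The file path, column layout, split ratio, and base model are placeholders, and it assumes the blending_frame argument to H2OStackedEnsembleEstimator.train(), which is how blending mode is enabled in current H2O releases:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator, H2OStackedEnsembleEstimator

h2o.init()
df = h2o.import_file("higgs.csv")          # placeholder path
x, y = df.columns[:-1], df.columns[-1]     # placeholder column layout
df[y] = df[y].asfactor()

# Hold out 10% as a blending frame for the blending-mode ensemble
train, blend = df.split_frame(ratios=[0.9], seed=1)

# Base model trained with CV so its holdout predictions can feed a CV metalearner
gbm = H2OGradientBoostingEstimator(nfolds=5, keep_cross_validation_predictions=True, seed=1)
gbm.train(x=x, y=y, training_frame=train)

# Default mode: metalearner is trained on the base models' 5-fold CV holdout predictions
se_cv = H2OStackedEnsembleEstimator(base_models=[gbm])
se_cv.train(x=x, y=y, training_frame=train)

# Blending mode: metalearner is trained on base model predictions over the holdout blend frame
se_blend = H2OStackedEnsembleEstimator(base_models=[gbm])
se_blend.train(x=x, y=y, training_frame=train, blending_frame=blend)
```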

On a benchmark using the HIGGS dataset, we compared blending, 3-fold CV, and 5-fold CV at 1 hour of runtime against 5-fold CV at 4 hours. At 1M rows, we can beat a 1-hour blending run by running the default 5-fold CV for longer (4 hours), though blending is still clearly preferable since it gets there in less time. At 10M rows, 1 hour of blending still gives better results than 4 hours of default 5-fold CV, which means there is really no reason to keep doing CV at that scale. These results use a separate test set for leaderboard scoring (the AUCs shown in the benchmark plot).

We will have to do more benchmarking before committing to this. If we switch to a 10% (or some other fraction) blending frame for datasets above a certain "size" (or "size in relation to compute resources"), we no longer get the CV metrics used for the leaderboard, so we will have to hold out another piece of the data just for leaderboard scoring. That could be acceptable when there's "enough" data, but we need to be careful to do it properly; a sketch of what such a split might look like follows.
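As a rough illustration of the proposed dynamic strategy, the sketch below picks blending or CV based on row count. The 1M-row threshold and the 80/10/10 split are placeholder values pending the benchmarking discussed above, and it assumes H2OAutoML.train() accepts blending_frame and leaderboard_frame, as in recent H2O releases:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
df = h2o.import_file("higgs.csv")          # placeholder path
x, y = df.columns[:-1], df.columns[-1]     # placeholder column layout
df[y] = df[y].asfactor()

# Hypothetical cutoff: the real threshold (and whether it should also account
# for compute resources) is exactly what the benchmarking needs to decide.
BLENDING_ROW_THRESHOLD = 1_000_000

if df.nrows >= BLENDING_ROW_THRESHOLD:
    # Large data: hold out 10% for blending and 10% for leaderboard scoring,
    # since blending mode produces no CV metrics for the leaderboard.
    train, blend, lb = df.split_frame(ratios=[0.8, 0.1], seed=1)
    aml = H2OAutoML(max_runtime_secs=3600, seed=1)
    aml.train(x=x, y=y, training_frame=train, blending_frame=blend, leaderboard_frame=lb)
else:
    # Small data: keep the default 5-fold CV metalearning strategy.
    aml = H2OAutoML(max_runtime_secs=3600, nfolds=5, seed=1)
    aml.train(x=x, y=y, training_frame=df)

print(aml.leaderboard.head())
```

Note that in the blending branch, leaderboard metrics come from the dedicated leaderboard frame rather than CV, which is the extra data cost described above.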

Assignee

Unassigned

Fix versions

None

Reporter

Erin LeDell

Support ticket URL

None

Labels

None

Affected Spark version

None

Customer Request Type

None

Task progress

None

CustomerVisible

No

Components

None

Priority

Major