AutoML: offer the possibility to specify the order in which training steps are executed

Description

After discussing Epsilon's needs regarding case 94685, we decided for now to let AutoML users specify the order in which training steps are executed.
This can be done at a coarse-grained level (order of default algos, default grids):

  • XGB_defaults, GBM_defaults, … XGB_grid, …

Or at a more fine-grained level (order of each hardcoded model):

  • XGB_default_1, XGB_default_2, …, GBM_def_1, …

Proposal

The suggested parameter name for this specification is training_steps.

Here is the suggested JSON representation to specify those steps in an ordered way:

    [
      {"name":"XGBoost", "steps":[{"id":"def_1"}, {"id":"def_2"}, {"id":"def_3"}]},
      {"name":"GLM"},
      {"name":"DRF", "alias":"all"},
      {"name":"GBM", "alias":"defaults"},
      {"name":"XRT"},
      {"name":"XGBoost", "steps":[{"id":"grid_1"}]},
      {"name":"GBM", "alias":"grids"},
      {"name":"StackedEnsemble", "steps":[{"id":"best"}, {"id":"all"}]}
    ]

Unfortunately, JSON doesn’t guarantee that the order of object keys is preserved, so we can’t use a JSON object for this and have to rely on arrays to carry the order.
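As a small illustration of why arrays carry the order, the sketch below parses a shortened version of the specification with Python's standard json module and reads the steps back in sequence. The variable names are illustrative only and are not part of the proposal.

```python
import json

# Shortened version of the specification: the order is carried by JSON arrays only.
spec_json = """
[
  {"name": "XGBoost", "steps": [{"id": "def_1"}, {"id": "def_2"}]},
  {"name": "GLM"},
  {"name": "GBM", "alias": "defaults"}
]
"""

spec = json.loads(spec_json)

# JSON arrays are ordered sequences, so the position in the list is the execution order.
algo_order = [entry["name"] for entry in spec]
xgb_step_ids = [step["id"] for step in spec[0].get("steps", [])]

print(algo_order)
print(xgb_step_ids)
```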

The semantics of the example above are as follows:

  • start with the XGBoost algorithm, but only the hardcoded models with ids def_1, def_2, def_3, in the given order.

  • then train all the GLM models (default models and/or grids), followed by all DRF models (using alias all in the latter case).

  • then train all the default GBM models (using alias defaults to avoid typing all the model ids explicitly).

  • then train all the XRT models.

  • then train XGBoost step with id grid_1 (probably a grid…)

  • then train all the GBM grids (using alias grids to avoid listing them explicitly).

  • then train the StackedEnsemble models with ids best and all in this order.

  • DeepLearning algo hasn’t been mentioned in this example, so it will be skipped.
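The semantics listed above can be sketched as a small expansion routine that turns the specification into a flat, ordered list of (algo, step id) pairs. This is only a sketch: the REGISTRY content (which step ids exist for each algo) and the expand function are hypothetical and do not reflect the actual backend.

```python
# Hypothetical registry of the steps known to the backend, per algo.
# The ids and their grouping are assumptions made for this sketch.
REGISTRY = {
    "XGBoost": {"defaults": ["def_1", "def_2", "def_3"], "grids": ["grid_1"]},
    "GLM": {"defaults": ["def_1"], "grids": []},
    "DRF": {"defaults": ["def_1"], "grids": []},
    "GBM": {"defaults": ["def_1", "def_2"], "grids": ["grid_1"]},
    "XRT": {"defaults": ["def_1"], "grids": []},
    "DeepLearning": {"defaults": ["def_1"], "grids": ["grid_1"]},
    "StackedEnsemble": {"defaults": ["best", "all"], "grids": []},
}

def expand(spec, registry=REGISTRY):
    """Return the ordered list of (algo, step_id) pairs described by spec."""
    plan = []
    for entry in spec:
        name = entry["name"]
        known = registry[name]
        if "steps" in entry:
            # Explicit step ids: keep them in the order given by the user.
            ids = [step["id"] for step in entry["steps"]]
        else:
            # No explicit steps: a missing alias is treated like "all".
            alias = entry.get("alias", "all")
            if alias == "all":
                ids = known["defaults"] + known["grids"]
            else:
                ids = known[alias]  # "defaults" or "grids"
        plan.extend((name, step_id) for step_id in ids)
    return plan

spec = [
    {"name": "XGBoost", "steps": [{"id": "def_1"}, {"id": "def_2"}, {"id": "def_3"}]},
    {"name": "GLM"},
    {"name": "DRF", "alias": "all"},
    {"name": "GBM", "alias": "defaults"},
    {"name": "XRT"},
    {"name": "XGBoost", "steps": [{"id": "grid_1"}]},
    {"name": "GBM", "alias": "grids"},
    {"name": "StackedEnsemble", "steps": [{"id": "best"}, {"id": "all"}]},
]

plan = expand(spec)
for algo, step_id in plan:
    print(algo, step_id)
```

Because DeepLearning never appears in the specification, no DeepLearning step ends up in the resulting plan, even though the hypothetical registry knows about it.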

If an algo or a model id (e.g. def_3) is present in this order specification but the id doesn’t exist anymore in the new AutoML version, then it will be ignored with a warning message.
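One possible way to implement this tolerant behaviour is sketched below, using Python's standard warnings module. The KNOWN_STEPS table, the drop_unknown function and the wording of the messages are assumptions made for illustration, not the actual implementation.

```python
import warnings

# Hypothetical set of step ids still supported by the current AutoML version.
KNOWN_STEPS = {
    "XGBoost": {"def_1", "def_2", "grid_1"},
    "GLM": {"def_1"},
}

def drop_unknown(spec, known=KNOWN_STEPS):
    """Return a copy of spec without unknown algos or step ids, warning for each one."""
    cleaned = []
    for entry in spec:
        name = entry["name"]
        if name not in known:
            warnings.warn(f"Unknown algo '{name}' in training_steps: ignored.")
            continue
        if "steps" in entry:
            kept = []
            for step in entry["steps"]:
                if step["id"] in known[name]:
                    kept.append(step)
                else:
                    warnings.warn(f"Unknown step '{step['id']}' for algo '{name}': ignored.")
            if not kept:
                continue  # nothing left to train for this entry
            entry = {**entry, "steps": kept}
        cleaned.append(entry)
    return cleaned

spec = [
    {"name": "XGBoost", "steps": [{"id": "def_1"}, {"id": "def_3"}]},  # def_3 no longer exists
    {"name": "OldAlgo"},                                               # algo no longer exists
    {"name": "GLM"},
]

# Record the warnings so they can be inspected instead of printed to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = drop_unknown(spec)

print(cleaned)
print([str(w.message) for w in caught])
```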

The representation is also easily extensible: we can add new algos, new default models, new grids, new hyperparameter search methods…

If the user also specifies the exclude_algos parameter, it is applied on top of the order specification: this lets the user keep the specification in one variable without having to change it later. For example, exclude_algos=["XRT"] in combination with training_steps=the_example_above will execute the steps defined in the example except XRT. The same applies when using include_algos instead.
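A minimal sketch of this filtering, assuming the Python list/tuple form of training_steps: the apply_algo_filters helper is hypothetical and only illustrates that the original specification stays untouched while the filters are applied on top of it.

```python
def algo_name(entry):
    """An entry is either a bare algo name or a tuple/list starting with it."""
    return entry if isinstance(entry, str) else entry[0]

def apply_algo_filters(training_steps, exclude_algos=(), include_algos=None):
    """Return the entries that survive the filters, preserving their order."""
    kept = []
    for entry in training_steps:
        name = algo_name(entry)
        if name in exclude_algos:
            continue
        if include_algos is not None and name not in include_algos:
            continue
        kept.append(entry)
    return kept

training_steps = [
    ("XGBoost", ["def_1", "def_2", "def_3"]),
    "GLM",
    ("DRF", "all"),
    ("GBM", "defaults"),
    "XRT",
    ("XGBoost", ["grid_1"]),
    ("GBM", "grids"),
    ("StackedEnsemble", ["best", "all"]),
]

without_xrt = apply_algo_filters(training_steps, exclude_algos=["XRT"])
only_boosting = apply_algo_filters(training_steps, include_algos=["XGBoost", "GBM"])

print([algo_name(e) for e in without_xrt])
print([algo_name(e) for e in only_boosting])
```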

After running AutoML, the detailed training_steps specification (with all step ids) will be available from the automl instance so that the user can save it for later use.

 

Python representation examples (lists or tuples can be used):

    # the JSON example translated to Python:
    training_steps=[
        ('XGBoost', ['def_1', 'def_2', 'def_3']),
        ('GLM'),  # note: parentheses alone do not create a tuple, this is the same as 'GLM'
        ('DRF', 'all'),
        ('GBM', 'defaults'),
        'XRT',
        ('XGBoost', ['grid_1']),
        ('GBM', 'grids'),
        ('StackedEnsemble', ['best', 'all'])
    ]

    # specify only algos ordering: in this case it will always execute
    # all default models first (if any)
    # immediately followed by the algo grids (if any):
    training_steps=['XGBoost', 'GLM', 'DRF', 'GBM', 'DeepLearning']

    # only specify algos order, making the distinction between default models and grids
    # (the order of each individual model is the default one defined by backend):
    training_steps=[
        ('XGBoost', 'defaults'),
        ('GLM', 'grids'),
        ('DRF', 'defaults'),
        ('GBM', 'defaults'),
        ('XGBoost', 'grids'),
        ('GBM', 'grids'),
        ('StackedEnsemble', 'all')
    ]
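To show how the Python form could map onto the JSON representation proposed earlier, here is a hedged sketch of a normalization step. The normalize function is hypothetical; the real client may perform this conversion differently.

```python
import json

def normalize(entry):
    """Convert one Python entry into the JSON-style dict of the proposal."""
    if isinstance(entry, str):
        return {"name": entry}                 # bare algo name
    name, *rest = entry
    if not rest:
        return {"name": name}                  # one-element tuple or list
    detail = rest[0]
    if isinstance(detail, str):
        return {"name": name, "alias": detail} # 'all', 'defaults' or 'grids'
    return {"name": name, "steps": [{"id": step_id} for step_id in detail]}

training_steps = [
    ("XGBoost", ["def_1", "def_2", "def_3"]),
    "GLM",
    ("DRF", "all"),
    ("StackedEnsemble", ["best", "all"]),
]

normalized = [normalize(entry) for entry in training_steps]
print(json.dumps(normalized, indent=2))
```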

And an equivalent representation in R:

    training_steps=list(
        list('XGBoost', c('def_1', 'def_2', 'def_3')),
        list('GLM'),
        list('DRF', 'all'),
        list('GBM', 'defaults'),
        'XRT',
        list('XGBoost', c('grid_1')),
        list('GBM', 'grids'),
        list('StackedEnsemble', c('best', 'all'))
    )

 

Assignee

Sebastien Poirier

Reporter

Sebastien Poirier

CustomerVisible

Yes

Priority

Major