Public H2O 3 / PUBDEV-3973

When fold_assignment for base learners is not Modulo, Stacked Ensemble fails

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.10.5.1
    • Component/s: StackedEnsemble
    • Labels: None
    • CustomerVisible: No

      Description

      There is a check in Stacked Ensemble that forces all base learners to have used `fold_assignment = "Modulo"`, which guarantees that they all trained on identical cross-validation folds. This is overly strict.
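      For context, Modulo assignment is fully deterministic: row i goes to fold i % nfolds, so any two learners using it on the same frame get the same folds regardless of seed. A minimal sketch of that rule (plain Python, not the H2O implementation):

      ```python
      def modulo_folds(n_rows, nfolds):
          # Deterministic fold assignment: row i -> fold i % nfolds.
          return [i % nfolds for i in range(n_rows)]

      # Two learners with Modulo on the same 10-row frame and nfolds = 5
      # always agree, independent of any seed.
      print(modulo_folds(10, 5))  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
      ```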

      There are two other ways to obtain identical folds, though they require additional validation:
      1. If the user selects `fold_assignment = "AUTO"` or `"Random"` (AUTO defaults to Random) and the same seed was used for all the algorithms, the folds produced will be identical. We could add a check that the seeds match.
      2. If the user provides a fold column in the training set and uses it to train all the models, the folds are also guaranteed to be identical. In this case, we need to check that the same column was used in the `training_frame` (or, if `keep_cross_validation_fold_assignment = TRUE`, we could compare the actual fold assignments themselves).
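      The relaxed validation described above could look roughly like this. This is a hypothetical sketch in Python, not the actual `StackedEnsembleModel.checkAndInheritModelProperties` code; the dict-based model representation and field names (`fold_assignment`, `seed`, `fold_column`) are assumptions for illustration:

      ```python
      def folds_compatible(models):
          # models: list of dicts holding each base model's CV parameters
          # (hypothetical representation of the real model params).
          assignments = {m.get("fold_assignment") for m in models}
          if assignments == {"Modulo"}:
              return True  # Modulo is deterministic: folds always match.
          if assignments <= {"AUTO", "Random"}:
              # AUTO defaults to Random, so a shared seed implies identical folds.
              return len({m["seed"] for m in models}) == 1
          if all(m.get("fold_column") for m in models):
              # Same fold column on the same training frame => same folds.
              return len({m["fold_column"] for m in models}) == 1
          return False

      glm = {"fold_assignment": "AUTO", "seed": 1}
      gbm = {"fold_assignment": "AUTO", "seed": 1}
      print(folds_compatible([glm, gbm]))  # True: AUTO/Random with a shared seed
      ```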

      Here is an example where `fold_assignment = "AUTO"`.

      h <- h2o.init()
      train <- h2o.importFile("/Users/nidhimehta/Desktop/data/datasets/adult_data.csv", destination_frame = "train")
      test  <- h2o.importFile("/Users/nidhimehta/Desktop/data/datasets/adult_test.csv", destination_frame = "test")

      y <- "income"
      x <- setdiff(names(train), y)
      family <- "binomial"

      # For binary classification, the response should be a factor
      train[, y] <- as.factor(train[, y])
      test[, y]  <- as.factor(test[, y])

      nfolds <- 5

      glm1 <- h2o.glm(x = x, y = y, family = family, model_id = "glm1", seed = 1,
                      training_frame = train,
                      nfolds = nfolds,
                      fold_assignment = "AUTO",
                      keep_cross_validation_predictions = TRUE)
      # 0.677503
      gbm1 <- h2o.gbm(x = x, y = y, model_id = "gbm1", distribution = "AUTO",
                      training_frame = train,
                      seed = 1,
                      nfolds = nfolds,
                      fold_assignment = "AUTO",
                      keep_cross_validation_predictions = TRUE)
      # 0.7691238
      rf1 <- h2o.randomForest(x = x, y = y, model_id = "rf1",  # distribution not used for RF
                              training_frame = train,
                              seed = 1,
                              nfolds = nfolds,
                              fold_assignment = "AUTO",
                              keep_cross_validation_predictions = TRUE)
      # 0.7517167
      dl1 <- h2o.deeplearning(x = x, y = y, model_id = "dl1", distribution = "AUTO",
                              reproducible = TRUE, seed = 1,
                              training_frame = train,
                              nfolds = nfolds,
                              fold_assignment = "AUTO",
                              keep_cross_validation_predictions = TRUE)
      # 0.692663
      ss1 <- h2o.stackedEnsemble(x, y, train, model_id = "modelc",
                                 selection_strategy = c("choose_all"),
                                 base_models = list(glm1@model_id, gbm1@model_id,
                                                    rf1@model_id, dl1@model_id))
      # 0.9757004
      
      Running the script, all four base models train successfully, but the `h2o.stackedEnsemble` call fails at 0%:
      water.exceptions.H2OIllegalArgumentException: Base model does not use Modulo for cross-validation: 5
      	at hex.StackedEnsembleModel.checkAndInheritModelProperties(StackedEnsembleModel.java:285)
      	at hex.ensemble.StackedEnsemble$StackedEnsembleDriver.computeImpl(StackedEnsemble.java:112)
      	at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:169)
      	at water.H2O$H2OCountedCompleter.compute(H2O.java:1220)
      	at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
      	at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
      	at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
      	at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
      	at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
      
      Error: water.exceptions.H2OIllegalArgumentException: Base model does not use Modulo for cross-validation: 5
      

        People

        • Assignee: rpeck Raymond Peck
        • Reporter: nidhi Nidhi Mehta