Public H2O 3 / PUBDEV-3694

Errors with PCA on wide data for pca_method = GramSVD, which is the default

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.10.3.1
    • Component/s: R
    • Labels:
      None
    • CustomerVisible:
      No
    • Sprint:
      H2O Sprint 35

      Description

      I am running into several errors when running PCA on a small but wide dataset (286 rows by 22,284 columns) on an 8 GB H2O cluster. I would not expect memory to be a problem for data this small, but it may be, or the cause may be something else.

      Three types of errors:

      library(h2o)
      h2o.init(nthreads = -1, max_mem_size = "8G")
      
      
      file <- "http://www.stat.berkeley.edu/~ledell/data/rotterdam.csv.gz"
      df <- h2o.importFile(file)
      dim(df)  # 286 22284
      
      y <- "relapse"
      x <- setdiff(names(df), y)
      df[,y] <- as.factor(df[,y])  #Convert to factor (for binary classification)
      
      
      splits <- h2o.splitFrame(df, seed = 1)
      train <- splits[[1]]
      test <- splits[[2]]
      print(dim(train))
      print(dim(test))
      
      
      # Does not work:
      # Train a default PCA
      h2o_pca <- h2o.prcomp(train, k = 8, x = x)
      
      #Error: java.lang.IllegalArgumentException: Found validation errors: ERRR on field: _train: Gram matrices (one per thread) won't fit in the driver node's memory (59.19 GB > 6.93 GB) - try reducing the number of columns and/or the number of categorical factors.
      # Also kills the H2O cluster!
      
      
      # Try again with Power method instead, but this errors out and kills the cluster!
      # Train a PCA model using 20 principal components.
      h2o_pca20 <- h2o.prcomp(train,
                              x = x, k = 20,
                              transform = "STANDARDIZE",
                              pca_method = "Power",
                              use_all_factor_levels = TRUE,
                              seed = 1)
      
      #ERROR: Unexpected HTTP Status code: 500 Server Error (url = http://localhost:54321/3/Jobs/$03017f00000132d4ffffffff$_813316ba1fc0d17fc26ffd94f7e24d85)
      
      #Error: lexical error: invalid char in json text.
      #<html> <head> <meta http-equiv=
      #  (right here) ------^
      
      
      
      # Try again with the Randomized method, which also fails.
      # Train a PCA model using 20 principal components.
      h2o_pca20 <- h2o.prcomp(train,
                              x = x, k = 20,
                              transform = "STANDARDIZE",
                              pca_method = "Randomized",
                              use_all_factor_levels = TRUE,
                              seed = 1)
      
      # java.lang.IllegalArgumentException: Can not make vectors of different length compatible!
      #
      # java.lang.IllegalArgumentException: Can not make vectors of different length compatible!
      #   at water.fvec.Frame.makeCompatible(Frame.java:1391)
      #   at water.fvec.Frame.makeCompatible(Frame.java:1379)
      #   at water.fvec.Frame.bulkAdd(Frame.java:525)
      #   at water.fvec.Frame.add(Frame.java:510)
      #   at water.fvec.Frame.add(Frame.java:564)
      #   at hex.svd.SVD$SVDDriver.randSubIter(SVD.java:210)
      #   at hex.svd.SVD$SVDDriver.computeImpl(SVD.java:455)
      #   at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:169)
      #   at hex.ModelBuilder.trainModelNested(ModelBuilder.java:225)
      #   at hex.pca.PCA$PCADriver.computeImpl(PCA.java:247)
      #   at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:169)
      #   at water.H2O$H2OCountedCompleter.compute(H2O.java:1206)
      #   at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
      #   at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
      #   at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
      #   at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
      #   at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
      #
      # Error: java.lang.IllegalArgumentException: Can not make vectors of different length compatible!
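
The 59.19 GB figure in the first error message can be roughly reproduced by hand. The sketch below assumes that each Gram matrix is a dense p x p array of 8-byte doubles over the 22,283 predictor columns, that one copy is held per thread, and that sizes are reported in GiB. The thread count of 16 is a guess chosen because it matches the reported number; it is not stated anywhere in this report.

```r
# Back-of-the-envelope estimate of the Gram matrix footprint.
# Assumptions (not confirmed by the report): dense p x p doubles,
# one Gram matrix per thread, 16 threads, sizes reported in GiB.
p <- 22284 - 1            # predictor columns: all columns minus the response
bytes_per_double <- 8
n_threads <- 16           # hypothetical thread count

gram_gib  <- p^2 * bytes_per_double / 2^30
total_gib <- gram_gib * n_threads

cat(sprintf("one Gram matrix: %.2f GiB\n", gram_gib))
cat(sprintf("all threads:     %.2f GiB\n", total_gib))
```

Under these assumptions a single Gram matrix is already about 3.70 GiB, more than half of the 6.93 GB the error message reports as available, so the column count, not the 286 rows, is what drives the memory demand.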
      

      The only method that works here is GLRM:

      h2o_pca20 <- h2o.prcomp(train,
                              x = x, k = 20,
                              transform = "STANDARDIZE",
                              pca_method = "GLRM",
                              use_all_factor_levels = TRUE,
                              seed = 1)
      #  |=======================================================================| 100%
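
For data with far more columns than rows, a common way to avoid forming a p x p Gram matrix is to eigendecompose the n x n matrix X X^T instead and map the eigenvectors back to column space. The base-R sketch below illustrates that idea on random data of a smaller wide shape. It is not H2O code and is not claimed to be what the fix for this issue does; the function name `wide_pca` is hypothetical, and it only centers the columns rather than standardizing them.

```r
# PCA for wide data (n << p) via the n x n outer Gram matrix.
# Hypothetical helper for illustration only; not part of h2o.
wide_pca <- function(X, k) {
  Xc <- scale(X, center = TRUE, scale = FALSE)   # center each column
  n  <- nrow(Xc)
  G  <- tcrossprod(Xc)                           # n x n instead of p x p
  eig <- eigen(G, symmetric = TRUE)
  d <- sqrt(pmax(eig$values[seq_len(k)], 0))     # top-k singular values of Xc
  U <- eig$vectors[, seq_len(k), drop = FALSE]   # left singular vectors
  V <- crossprod(Xc, U) %*% diag(1 / d, k, k)    # p x k loadings
  list(sdev = d / sqrt(n - 1), rotation = V)
}

set.seed(1)
X   <- matrix(rnorm(50 * 2000), nrow = 50)       # 50 rows, 2000 columns
fit <- wide_pca(X, k = 8)
ref <- prcomp(X, center = TRUE, scale. = FALSE)

# Component standard deviations should agree with prcomp up to numerical error.
max(abs(fit$sdev - ref$sdev[1:8]))
```

The loadings may differ from `prcomp` by a sign flip per component, which is why the comparison above uses the standard deviations only.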
      

              People

              • Assignee:
                Wendy
              • Reporter:
                Erin LeDell