AIOOB while training an H2OKMeansEstimator

Description

I am experiencing issues when training an H2OKMeansEstimator. Specifically, an ArrayIndexOutOfBoundsException is thrown while collecting metrics during the clustering process.

I am filing this bug against Sparkling Water since that is the product I am using, but the stack trace suggests the problem is actually triggered in h2o-core.

My code looks like the following:
from pysparkling import *

conf = (H2OConf(sc)
        .use_auto_cluster_start()
        .set_yarn_queue("spark-analytics")
        .set_num_of_external_h2o_nodes(6)
        .set_mapper_xmx("4G"))
context = H2OContext.getOrCreate(sc, conf)

# Generate a Spark DataFrame and convert it to an H2OFrame
datadfjan = spark.sql('''<BUSINESS_RELATED_LOGIC>''').repartition(20)
data_h2o_jan = context.as_h2o_frame(datadfjan)

# Columns to exclude from training; the remaining columns become the feature list
ignored_columns = ['col_to_exclude_1', 'col_to_exclude_2']
x = [c for c in datadfjan.columns if c not in ignored_columns]

import h2o
from h2o.estimators import H2OKMeansEstimator

# Train a k-means model with cross-validation and automatic estimation of k
mjan = H2OKMeansEstimator(model_id='model_jan_v2_estimate_k',
                          estimate_k=True, k=10, standardize=True,
                          init="PlusPlus", nfolds=4, max_iterations=20,
                          ignored_columns=ignored_columns)
mjan.train(x=x, training_frame=data_h2o_jan)
predjan = mjan.predict(data_h2o_jan)
mjan

When executing the train command, it very frequently fails with the following error (not in every session, but very often):
kmeans Model Build progress: |███████████████████████████████████ (failed)
---------------------------------------------------------------------------
EnvironmentError Traceback (most recent call last)
<ipython-input-7-61f099372615> in <module>()
4 print "Start training %d" % iter
5 mjan = H2OKMeansEstimator(model_id='model_jan_v2_estimate_k',estimate_k=True, k=10, standardize=True, init="PlusPlus",nfolds=4,max_iterations=20,ignored_columns=ignored_columns)
----> 6 mjan.train(x=x, training_frame=data_h2o_jan)
7 predjan = mjan.predict(data_h2o_jan)
8 print "Finish training %d" % iter

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/estimators/estimator_base.pyc in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns)
227 return
228
--> 229 model.poll()
230 model_json = h2o.api("GET /%d/Models/%s" % (rest_ver, model.dest_key))["models"][0]
231 self._resolve_model(model.dest_key, model_json)

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/job.pyc in poll(self)
74 if (isinstance(self.job, dict)) and ("stacktrace" in list(self.job)):
75 raise EnvironmentError("Job with key {} failed with an exception: {}\nstacktrace: "
---> 76 "\n{}".format(self.job_key, self.exception, self.job["stacktrace"]))
77 else:
78 raise EnvironmentError("Job with key %s failed with an exception: %s" % (self.job_key, self.exception))

EnvironmentError: Job with key $0300ffffffff$_83c9f95c3e6b2df5d96182cccee4376 failed with an exception: java.lang.ArrayIndexOutOfBoundsException: 3
stacktrace:
java.lang.ArrayIndexOutOfBoundsException: 3
at water.util.ArrayUtils.add(ArrayUtils.java:150)
at hex.ModelMetricsClustering$MetricBuilderClustering.reduce(ModelMetricsClustering.java:130)
at hex.ModelMetricsClustering$MetricBuilderClustering.reduce(ModelMetricsClustering.java:80)
at hex.ModelBuilder.cv_mainModelScores(ModelBuilder.java:479)
at hex.ModelBuilder.computeCrossValidation(ModelBuilder.java:288)
at hex.ModelBuilder$1.compute2(ModelBuilder.java:203)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1217)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)

After some investigation in the h2o source code, both at the specific tag (jenkins-rel-turnbull-2) and at your current master (348f5964201e2176df672769402e9067963d7ce1), it turns out that the code fails when the add method in ArrayUtils receives two differently sized arrays, a and b, where a is longer than b.
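
To make the failure mode concrete, here is a minimal standalone sketch (my own illustration, not the actual water.util.ArrayUtils source): an element-wise add that iterates over the first array's length overruns a shorter second array, which matches the "ArrayIndexOutOfBoundsException: 3" above when the second array has length 3.

    public class ArrayAddSketch {
        // Hypothetical element-wise add, iterating over a's length.
        static long[] add(long[] a, long[] b) {
            for (int i = 0; i < a.length; i++) a[i] += b[i];
            return a;
        }

        public static void main(String[] args) {
            long[] a = new long[4];  // e.g. one metric builder tracking 4 clusters
            long[] b = new long[3];  // another one tracking only 3
            add(a, b);               // throws ArrayIndexOutOfBoundsException: 3 at b[3]
        }
    }

My guess is that the differently sized arrays come from the per-fold metric builders being combined in cv_mainModelScores, since with estimate_k=True each fold may settle on a different number of clusters, but I have not verified this.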

So I'm contacting you to:
1) Propose a simple way to patch this specific issue (something similar to the attached ArrayUtils.patch; a rough sketch of the idea follows this list). Your current master fails with an ArrayIndexOutOfBoundsException on the test named "testAddWithDifferentSizesArrayWhenFirstArrayIsLonger".
2) Ask whether point 1 is actually the right solution for this problem: are you assuming that ArrayUtils.add will always receive equal-sized arrays? If so, should that assumption be enforced in the MetricBuilderClustering object instead? Because in that case I clearly have a snippet of code that breaks the assumption.
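
For reference, the behaviour I have in mind is roughly the following; this is my own sketch under the assumption that adding over the overlapping prefix is acceptable, not the literal contents of ArrayUtils.patch:

    public class ArrayUtilsPatchSketch {
        // Defensive variant: add only over the overlapping prefix, so differently
        // sized inputs no longer throw; trailing elements of the longer array
        // are left untouched.
        static long[] addSafely(long[] a, long[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) a[i] += b[i];
            return a;
        }

        public static void main(String[] args) {
            long[] a = {1, 2, 3, 4};
            long[] b = {10, 20, 30};
            addSafely(a, b);  // a is now {11, 22, 33, 4}; no exception is thrown
        }
    }

An alternative would be to pad the shorter array, or to fail fast with a clearer error message; which of these is correct is really what point 2 above asks.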

For completeness I also attached the h2o logs retrieved from my H2O session using the flow UI.

Looking forward to hearing from you, and of course I am available to provide more insights on my code/setup if needed.

Assignee

Amy Wang

Fix versions

Reporter

Avkash Chauhan

Support ticket URL

None

Labels

None

Affected Spark version

None

Customer Request Type

Support Incident

Task progress

None

ReleaseNotesHidden

None

CustomerVisible

No

Support Assessment

Data Science Issue

AffectedCustomers

Sprint

None

Affects versions

Priority

Major