PySparkling: error copying data to Sparkling Water

Description

---------------------------------------------------------------------------
H2OServerError Traceback (most recent call last)
<ipython-input-4-d2020c574d7d> in <module>()
----> 1 data_h2o_jan = context.as_h2o_frame(datadfjan)
2 data_h2o_apr = context.as_h2o_frame(datadfapr)
3 data_h2o_july = context.as_h2o_frame(datadfjuly)
4 data_h2o_oct = context.as_h2o_frame(datadfoct)

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/pysparkling/context.pyc in as_h2o_frame(self, dataframe, framename)
175 """
176 if isinstance(dataframe, DataFrame):
--> 177 return fc._as_h2o_frame_from_dataframe(self, dataframe, framename)
178 elif isinstance(dataframe, RDD):
179 # First check if the type T in RDD[T] is one of the python "primitive" types

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/pysparkling/conversions.pyc in _as_h2o_frame_from_dataframe(h2oContext, dataframe, frame_name)
44 j_h2o_frame = h2oContext._jhc.asH2OFrame(dataframe._jdf, frame_name)
45 j_h2o_frame_key = j_h2o_frame.key()
---> 46 return H2OFrame.from_java_h2o_frame(j_h2o_frame,j_h2o_frame_key)
47
48 @staticmethod

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/pysparkling/context.pyc in from_java_h2o_frame(h2o_frame, h2o_frame_id)
37 @staticmethod
38 def from_java_h2o_frame(h2o_frame, h2o_frame_id):
---> 39 fr = H2OFrame.get_frame(h2o_frame_id.toString())
40 fr._java_frame = h2o_frame
41 fr._backed_by_java_obj = True

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in get_frame(frame_id)
220 fr._ex._cache._id = frame_id
221 try:
--> 222 fr._ex._cache.fill()
223 except EnvironmentError:
224 return None

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in fill(self, rows)
298 if rows <= len(self):
299 return
--> 300 res = h2o.api("GET /3/Frames/%s" % self._id, data={"row_count": rows})["frames"][0]
301 self._l = rows
302 self._nrows = res["rows"]

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/h2o.pyc in api(endpoint, data, json, filename, save_to)
82 # type checks are performed in H2OConnection class
83 _check_connection()
---> 84 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
85
86

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/backend/connection.pyc in request(self, endpoint, data, json, filename, save_to)
261 auth=self._auth, verify=self._verify_ssl_cert, proxies=self._proxies)
262 self._log_end_transaction(start_time, resp)
--> 263 return self._process_response(resp, save_to)
264
265 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/backend/connection.pyc in _process_response(response, save_to)
586 # Note that it is possible to receive valid H2OErrorV3 object in this case, however it merely means the server
587 # did not provide the correct status code.
--> 588 raise H2OServerError("HTTP %d %s:\n%r" % (status_code, response.reason, data))
589
590

H2OServerError: HTTP 500 Server Error:
Server error java.lang.RuntimeException:
Error: DistributedException from ha255datanode-11.lhr4.prod.booking.com/10.182.165.16:54321, caused by java.lang.IllegalStateException: Unable to clean up RollupStats after an exception (see cause). This could cause a key leakage, key=$05ff14000000feffffff48dd8c688a3190a30b4f0902a72fd2ed$
Request: None

====== CODE =====

For this we used Spark 2.0.2 with dynamic allocation on YARN.
Here are the cloud settings for H2O:
from pysparkling import *
conf = (H2OConf(sc)
        .use_auto_cluster_start()
        .set_yarn_queue("spark-analytics")
        .set_num_of_external_h2o_nodes(2)
        .set_mapper_xmx("4G"))

context = H2OContext.getOrCreate(sc, conf)

Here is the code that produced the error:
datadfjan = spark.sql('''SELECT * FROM jgaskins.datajanimputed WHERE first_opened < "2015-10-01"''').repartition(20)
print datadfjan.cache().count()
data_h2o_jan = context.as_h2o_frame(datadfjan)
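
Since the traceback shows four conversions queued up (jan, apr, july, oct) and the first one failing, it may help to convert the frames one at a time and record which ones fail. The following is only a sketch: `convert_each` is a hypothetical helper, not part of PySparkling, and it assumes the converter is passed in as a callable such as `context.as_h2o_frame`.

```python
def convert_each(convert, named_frames):
    """Convert frames one at a time, recording failures instead of stopping.

    convert      -- callable taking one frame (assumed: context.as_h2o_frame)
    named_frames -- iterable of (name, frame) pairs
    Returns (converted, failures), both dicts keyed by name.
    """
    converted, failures = {}, {}
    for name, frame in named_frames:
        try:
            converted[name] = convert(frame)
        except Exception as exc:  # e.g. the H2OServerError shown above
            failures[name] = "%s: %s" % (type(exc).__name__, exc)
    return converted, failures

# Hypothetical usage against the session described in this ticket:
# converted, failures = convert_each(
#     context.as_h2o_frame,
#     [("jan", datadfjan), ("apr", datadfapr),
#      ("july", datadfjuly), ("oct", datadfoct)])
```

If only some of the frames end up in `failures`, that would suggest the problem depends on the data; if all of them do, the cluster setup looks like the more likely cause.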

I cannot share the data because it is quite sensitive.
However, here are some summary statistics so that you can build similar random data on your own:
Frame:
ID: frame_rdd_52_9bf596e65db76d42e914308347e24f6a
Rows: 778100  Columns: 45  Size: 165MB

label  type    Missing  Zeros   +Inf  -Inf  min          max           mean           sigma           cardinality
a      int     0        0       0     0     10003.0      1525888.0     828359.2361    467523.0640     ·
b      int     0        64      0     0     0            747.0         621.3136       179.5556        ·
c      int     0        1       0     0     -20999229.0  2147483647.0  71684348.1129  242923234.6456  ·
d      string  0        0       0     0     ·            ·             ·              ·               ·
e      string  0        0       0     0     ·            ·             ·              ·               ·
f      int     0        0       0     0     1.0          4.0           1.8224         1.0899          ·
g      int     0        600304  0     0     0            132.0         5.2346         13.8543         ·
h      string  0        0       0     0     ·            ·             ·              ·               ·
j      real    0        67360   0     0     0            1.0           0.4850         0.3136          ·
k      real    0        206274  0     0     0            1.0           0.1114         0.1497          ·
l      real    0        143065  0     0     0            1.0           0.2078         0.2112          ·
m      real    0        324824  0     0     0            1.0           0.1008         0.1718          ·
n      real    0        246400  0     0     0            0.8135        0.2215         0.2227          ·
o      real    0        367702  0     0     0            13.4006       1.1487         1.8578          ·
p      int     0        19      0     0     0            100.0         84.8820        14.1145         ·
q      real    0        100560  0     0     0            59.6466       0.9336         1.5236          ·
r      real    0        117651  0     0     0            1.0           0.6975         0.3548          ·
s      int     0        879     0     0     0            161.0         21.2645        13.9207         ·
t      int     0        1469    0     0     0            154.0         23.5149        12.6826         ·
u      int     0        38      0     0     0            6083.0        26.7258        65.5662         ·
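
To make the "rebuild similar random data" suggestion concrete, here is a rough sketch that generates rows for the 20 columns listed above. It only respects each column's type and min/max. The uniform distribution, the placeholder string values, and the helper names (`random_row`, `random_rows`) are assumptions for illustration, so the means, sigmas and zero counts will not match the summary.

```python
import random

# (min, max) per column, copied from the summary statistics above.
INT_RANGES = {
    "a": (10003, 1525888), "b": (0, 747), "c": (-20999229, 2147483647),
    "f": (1, 4), "g": (0, 132), "p": (0, 100),
    "s": (0, 161), "t": (0, 154), "u": (0, 6083),
}
REAL_RANGES = {
    "j": (0.0, 1.0), "k": (0.0, 1.0), "l": (0.0, 1.0), "m": (0.0, 1.0),
    "n": (0.0, 0.8135), "o": (0.0, 13.4006),
    "q": (0.0, 59.6466), "r": (0.0, 1.0),
}
STRING_COLUMNS = ["d", "e", "h"]


def random_row(rng):
    """Build one row as a dict, drawing each value uniformly from its range."""
    row = {}
    for name, (low, high) in INT_RANGES.items():
        row[name] = rng.randint(low, high)
    for name, (low, high) in REAL_RANGES.items():
        row[name] = rng.uniform(low, high)
    for name in STRING_COLUMNS:
        # The summary says nothing about string contents; placeholder values.
        row[name] = "v%d" % rng.randint(0, 999)
    return row


def random_rows(n, seed=0):
    """Return n random rows; a fixed seed keeps the data reproducible."""
    rng = random.Random(seed)
    return [random_row(rng) for _ in range(n)]
```

The resulting list of dicts could then be turned into a Spark DataFrame (for example with `spark.createDataFrame`) and passed through the same `repartition(20)` and `as_h2o_frame` steps. The original frame has 778100 rows and 45 columns, so a faithful reproduction would need more columns than are described here.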

Assignee

Michal Malohlava

Reporter

Avkash Chauhan

Labels

None

CustomerVisible

No

Priority

Major