
as_factor() 'corrupts' dataframe if it fails

Description

as_factor() can corrupt a dataframe (example below) if it fails. The user has to cast the column to character (or another supported type) first before as_factor() will work.

The suggestions here are:

1) Make as_factor() work regardless of the existing column type
2) If as_factor() fails, leave the dataframe intact


Hi,

I have a df with several features to which I want to apply asfactor(). I'm using this code:

def pimpIt(df):
    for i in all_features[:]:
        print df[i].head(3)
        print df.types[i]
        df[i] = df[i].asfactor()
        print df[i].head(3)
        print df.types[i]
    return df

train_H2O_2 = pimpIt(train_H2O)

When the type of a column is 'real', it fails. E.g. I have a factor hotel_class with possible values 1.0, 2.0, 3.0, 4.0 and 5.0. (Somehow it became a double; I know how to cast it to int before passing it into this function, but I'd like to illustrate how the dataframe becomes unusable after hitting this error.)
---------------------------------------------------------------------------
H2OResponseError Traceback (most recent call last)
<ipython-input-30-5ed645a83a87> in <module>()
8 return df
9
---> 10 train_H2O_2 = pimpIt(train_H2O)

<ipython-input-30-5ed645a83a87> in pimpIt(df)
4 print df.types[i]
5 df[i] = df[i].asfactor()
----> 6 print df[i].head(3)
7 print df.types[i]
8 return df

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in __repr__(self)
410 stk = traceback.extract_stack()
411 if not ("IPython" in stk[-2][0] and "info" == stk[-2][2]):
--> 412 self.show()
413 return ""
414

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in show(self, use_pandas)
422 print("This H2OFrame has been removed.")
423 return
--> 424 if not self._ex._cache.is_valid(): self._frame()._ex._cache.fill()
425 if H2ODisplay._in_ipy():
426 import IPython.display

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in _frame(self, fill_cache)
471
472 def _frame(self, fill_cache=False):
--> 473 self._ex._eager_frame()
474 if fill_cache:
475 self._ex._cache.fill()

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in _eager_frame(self)
84 if not self._cache.is_empty(): return
85 if self._cache._id is not None: return # Data already computed under ID, but not cached locally
---> 86 self._eval_driver(True)
87
88 def _eager_scalar(self): # returns a scalar (or a list of scalars)

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in _eval_driver(self, top)
98 def _eval_driver(self, top):
99 exec_str = self._get_ast_str(top)
--> 100 res = ExprNode.rapids(exec_str)
101 if 'scalar' in res:
102 if isinstance(res['scalar'], list):

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in rapids(expr)
201 :returns: The JSON response (as a python dictionary) of the Rapids execution
202 """
--> 203 return h2o.api("POST /99/Rapids", data={"ast": expr, "session_id": h2o.connection().session_id})
204
205

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/h2o.pyc in api(endpoint, data, json, filename, save_to)
82 # type checks are performed in H2OConnection class
83 _check_connection()
---> 84 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
85
86

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/backend/connection.pyc in request(self, endpoint, data, json, filename, save_to)
261 auth=self._auth, verify=self._verify_ssl_cert, proxies=self._proxies)
262 self._log_end_transaction(start_time, resp)
--> 263 return self._process_response(resp, save_to)
264
265 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:

/opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/backend/connection.pyc in _process_response(response, save_to)
581 # Client errors (400 = "Bad Request", 404 = "Not Found", 412 = "Precondition Failed")
582 if status_code in {400, 404, 412} and isinstance(data, (H2OErrorV3, H2OModelBuilderErrorV3)):
--> 583 raise H2OResponseError(data)
584
585 # Server errors (notably 500 = "Server Error")

H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
Error: Categorical conversion can only currently be applied to integer columns.
Request: POST /99/Rapids
data: {u'session_id': '_sid_bb3c', u'ast': "(tmp= py_136_sid_bb3c (rows (cols_py (tmp= py_135_sid_bb3c (:= py_132_sid_bb3c (as.factor (cols_py py_132_sid_bb3c 'hotel_class')) 337 [])) 'hotel_class') [0:3]))"}
So I'll work around this by casting it to int in Spark, or by doing as.numeric first and then as.factor on the h2o frame.
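A minimal sketch of a defensive version of that loop: check the reported column type first and route real columns through a string cast, so asfactor() never fails mid-loop and the frame reference is never left broken. The ascharacter()/asfactor() method names mirror the h2o Python API, but FakeColumn and to_factor below are hypothetical stand-ins so the control flow can be exercised without a running cluster:

```python
class FakeColumn:
    """Minimal stand-in for a single-column H2OFrame (hypothetical)."""
    def __init__(self, ctype):
        self.ctype = ctype

    def ascharacter(self):
        # Casting to string is always allowed.
        return FakeColumn("string")

    def asfactor(self):
        # Mirror the server-side rule from the error above:
        # real columns cannot be converted to categorical directly.
        if self.ctype == "real":
            raise ValueError("Categorical conversion can only currently "
                             "be applied to integer columns.")
        return FakeColumn("enum")


def to_factor(col, ctype):
    """Convert a column to a factor, casting real columns to string first."""
    if ctype == "real":
        col = col.ascharacter()  # avoids the H2OIllegalArgumentException
    return col.asfactor()


# A real column now converts cleanly through the string cast
print(to_factor(FakeColumn("real"), "real").ctype)  # -> enum
# An int column still converts directly
print(to_factor(FakeColumn("int"), "int").ctype)    # -> enum
```

In the real loop the type would come from df.types[i], e.g. `if df.types[i] == "real": df[i] = df[i].ascharacter()` before calling asfactor().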

The issue though is that now my h2o dataframe is rendered useless. E.g. if I try to inspect the dataframe, or the head of this column, I get:
H2OResponseError: Server error water.exceptions.H2OKeyNotFoundArgumentException:
Error: Object 'py_135_sid_bb3c' not found for argument: key
Request: GET /3/Frames/py_135_sid_bb3c
params: {u'row_count': '10'}
Other dataframe functions still work though. E.g. train_H2O.types gives the dictionary containing the column type information.

In Flow, I see the data is still there. So I guess the Python object has become a corrupt reference or something, and I imagine you could add better error handling here so the data scientist doesn't lose time: restarting the Jupyter Sparkling Water kernel, re-shipping the data, retrying the function, failing again, starting a new kernel, and only on the third attempt finding the cause and applying a root-cause fix (sorry for the exaggeration).

Environment

None

Status

Assignee

Vlad Patryshev

Reporter

Nick Karpov

Labels

None

Release Priority

None

CustomerVisible

No

testcase 1

None

testcase 2

None

testcase 3

None

h2ostream link

None

Affected Spark version

None

AffectedContact

None

AffectedCustomers

AffectedPilots

None

AffectedOpenSource

None

Support Assessment

None

Customer Request Type

None

Support ticket URL

None

End date

None

Baseline start date

None

Baseline end date

None

Task progress

None

Task mode

None

Fix versions

Priority

Major