  SW-334

as_factor() 'corrupts' dataframe if it fails

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.1.3, 2.0.7, 1.6.9
    • Component/s: None
    • Labels:
      None
    • CustomerVisible:
      No
    • Sprint:
    • AffectedCustomers:

      Description

      as_factor() can corrupt a dataframe if it fails (example below). The user has to cast the column to character (or another supported type) first before as_factor() will work.

      Suggestions:

      1) Support as_factor() regardless of the column's existing type
      2) If as_factor() fails, do not corrupt the dataframe
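
      A minimal reproduction sketch (hypothetical values, not taken from the ticket; assumes the h2o-py behavior of the affected versions, where asfactor() on a 'real' column raises H2OIllegalArgumentException):

      import h2o

      h2o.init()

      # Hypothetical frame; force the column to parse as 'real' so asfactor() fails
      df = h2o.H2OFrame({"hotel_class": [1.0, 2.0, 3.0, 4.0, 5.0]},
                        column_types=["real"])

      try:
          # On affected versions this raises H2OIllegalArgumentException:
          # "Categorical conversion can only currently be applied to integer columns."
          df["hotel_class"] = df["hotel_class"].asfactor()
      except Exception as e:
          print("asfactor() failed: %s" % e)

      # After the failure, operations such as df.head() can report
      # H2OKeyNotFoundArgumentException for a temporary key -- the 'corruption' described above.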


      Hi,

      I have a df with several features to which I want to apply asfactor(). I'm using this code:

      def pimpIt(df):
          for i in all_features[:]:
              print df[i].head(3)
              print df.types[i]
              df[i] = df[i].asfactor()
              print df[i].head(3)
              print df.types[i]
          return df

      train_H2O_2 = pimpIt(train_H2O)

      When the type of a column is 'real', it fails. E.g. I have a factor hotel_class with possible values 1.0, 2.0, 3.0, 4.0 and 5.0. (Somehow it became a double; I know how to cast it to int before throwing it into this function, but I'd like to illustrate how the dataframe becomes unusable after running into this issue.)
      ---------------------------------------------------------------------------
      H2OResponseError Traceback (most recent call last)
      <ipython-input-30-5ed645a83a87> in <module>()
      8 return df
      9
      ---> 10 train_H2O_2 = pimpIt(train_H2O)

      <ipython-input-30-5ed645a83a87> in pimpIt(df)
      4 print df.types[i]
      5 df[i] = df[i].asfactor()
      ----> 6 print df[i].head(3)
      7 print df.types[i]
      8 return df

      /opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in __repr__(self)
      410 stk = traceback.extract_stack()
      411 if not ("IPython" in stk[-2][0] and "info" == stk[-2][2]):
      --> 412 self.show()
      413 return ""
      414

      /opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in show(self, use_pandas)
      422 print("This H2OFrame has been removed.")
      423 return
      --> 424 if not self._ex._cache.is_valid(): self._frame()._ex._cache.fill()
      425 if H2ODisplay._in_ipy():
      426 import IPython.display

      /opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/frame.pyc in _frame(self, fill_cache)
      471
      472 def _frame(self, fill_cache=False):
      --> 473 self._ex._eager_frame()
      474 if fill_cache:
      475 self._ex._cache.fill()

      /opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in _eager_frame(self)
      84 if not self._cache.is_empty(): return
      85 if self._cache._id is not None: return # Data already computed under ID, but not cached locally
      ---> 86 self._eval_driver(True)
      87
      88 def _eager_scalar(self): # returns a scalar (or a list of scalars)

      /opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in _eval_driver(self, top)
      98 def _eval_driver(self, top):
      99 exec_str = self._get_ast_str(top)
      --> 100 res = ExprNode.rapids(exec_str)
      101 if 'scalar' in res:
      102 if isinstance(res['scalar'], list):

      /opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/expr.pyc in rapids(expr)
      201 :returns: The JSON response (as a python dictionary) of the Rapids execution
      202 """
      --> 203 return h2o.api("POST /99/Rapids", data={"ast": expr, "session_id": h2o.connection().session_id})
      204
      205

      /opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/h2o.pyc in api(endpoint, data, json, filename, save_to)
      82 # type checks are performed in H2OConnection class
      83 _check_connection()
      ---> 84 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
      85
      86

      /opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/backend/connection.pyc in request(self, endpoint, data, json, filename, save_to)
      261 auth=self._auth, verify=self._verify_ssl_cert, proxies=self._proxies)
      262 self._log_end_transaction(start_time, resp)
      --> 263 return self._process_response(resp, save_to)
      264
      265 except (requests.exceptions.ConnectionError, requests.exceptions.HTTPError) as e:

      /opt/sparkling-water/2.0.3/py/build/dist/h2o_pysparkling_2.0-2.0.3-py2.7.egg/h2o/backend/connection.pyc in _process_response(response, save_to)
      581 # Client errors (400 = "Bad Request", 404 = "Not Found", 412 = "Precondition Failed")
      582 if status_code in {400, 404, 412} and isinstance(data, (H2OErrorV3, H2OModelBuilderErrorV3)):
      --> 583 raise H2OResponseError(data)
      584
      585 # Server errors (notably 500 = "Server Error")

      H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
      Error: Categorical conversion can only currently be applied to integer columns.
      Request: POST /99/Rapids
      data:

      {u'session_id': '_sid_bb3c', u'ast': "(tmp= py_136_sid_bb3c (rows (cols_py (tmp= py_135_sid_bb3c (:= py_132_sid_bb3c (as.factor (cols_py py_132_sid_bb3c 'hotel_class')) 337 [])) 'hotel_class') [0:3]))"}

      So I'll work around this by casting the column to int in Spark, or by doing as.numeric first and then as.factor on the h2o frame.
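
      A sketch of both workarounds (variable names such as spark_df are illustrative; assumes the standard pyspark and h2o-py APIs, and the 'cast to character first' route mentioned in the ticket description):

      # Workaround 1 (Spark side): cast the real-valued column to int
      # before converting the Spark DataFrame to an H2OFrame.
      from pyspark.sql.functions import col

      spark_df = spark_df.withColumn("hotel_class", col("hotel_class").cast("int"))

      # Workaround 2 (H2O side): go through an intermediate type that
      # asfactor() accepts, e.g. character, before converting to a factor.
      train_H2O["hotel_class"] = train_H2O["hotel_class"].ascharacter().asfactor()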

      The issue though is that now my h2o dataframe is rendered useless. E.g. if I try to inspect the dataframe, or the head of this column, I get:
      H2OResponseError: Server error water.exceptions.H2OKeyNotFoundArgumentException:
      Error: Object 'py_135_sid_bb3c' not found for argument: key
      Request: GET /3/Frames/py_135_sid_bb3c
      params:

      {u'row_count': '10'}

      Other dataframe functions still work though. E.g. train_H2O.types gives the dictionary containing the column type information.

      In Flow, I see the data is still there. So I guess the Python object has become a corrupt reference or something, and I imagine better error handling here would save the data scientist the time lost on restarting the Jupyter Sparkling Water kernel, reshipping the data, retrying the function, failing again, starting yet another kernel, and only on the third attempt finding the root cause and applying a proper fix (sorry for the exaggeration).
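
      Since Flow still shows the data, a possible recovery without restarting the kernel is to re-attach a Python handle to the surviving backend frame by its key (a sketch; the key shown is illustrative, taken from the Rapids expression above, and h2o.get_frame() is assumed to behave as in current h2o-py):

      import h2o

      # Illustrative key; in practice, look up the surviving frame id via Flow (getFrames).
      frame_id = "py_132_sid_bb3c"

      # Re-fetch the frame metadata from the cluster and rebuild a fresh Python handle.
      train_H2O = h2o.get_frame(frame_id)
      print(train_H2O.types)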

        Attachments

          Activity

            People

            • Assignee:
              vladpatryshev Vlad Patryshev (Inactive)
            • Reporter:
              nickkarpov Nick Karpov
            • Votes:
              0
            • Watchers:
              5

              Dates

              • Created:
                Updated:
                Resolved: