jumbling rows of data during scoring

Description

Hi,

We are seeing something very disturbing that makes us very uneasy about using H2O. When we compare a dataframe before and after passing it through H2O for predictions, some column values have been switched between rows. A simple example illustrates this:

Input dataframe:

id  date   partialkey  feature1  feature2  feature3  feature4  feature5
-----------------------------------------------------------------------
1   Dec-1  a           0         1         0         0         0
2   Dec-1  b           2         3         1         1         1

Output dataframe:

id  date   partialkey  feature1  feature2  feature3  feature4  feature5  predict  A    R
----------------------------------------------------------------------------------------
1   Dec-1  a           0         1         1         1         1         out1     0.4  0.6
2   Dec-1  b           2         3         0         0         0         out2     0.1  0.9

In other words, two rows have SOME OF THEIR COLUMNS switched (in the example above, feature3, feature4 and feature5).

This could happen if, for example, the columns (which I believe are stored separately in chunks) are not stitched back together properly after scoring on a model.
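One way to pin down exactly which cells change is to compare the rows before and after scoring, matched on id. The following is a minimal Python sketch using the example above. It is not H2O or Sparkling Water code: the function name find_mismatched_cells and the list-of-dicts layout are assumptions made for illustration, and it presumes that id is unique per row.

```python
def find_mismatched_cells(input_rows, output_rows, key="id"):
    """Return (key, column, input_value, output_value) for every changed cell.

    Rows are matched on `key`, not on position, so a reordered output
    is not reported as a mismatch. Columns that exist only in the
    output (predict, A, R) are ignored.
    """
    output_by_key = {row[key]: row for row in output_rows}
    mismatches = []
    for in_row in input_rows:
        out_row = output_by_key.get(in_row[key])
        if out_row is None:
            continue  # row is missing from the output, which is a different problem
        for column, value in in_row.items():
            if column in out_row and out_row[column] != value:
                mismatches.append((in_row[key], column, value, out_row[column]))
    return mismatches


columns = ["id", "date", "partialkey",
           "feature1", "feature2", "feature3", "feature4", "feature5"]

# Input dataframe from the example above.
input_rows = [
    dict(zip(columns, [1, "Dec-1", "a", 0, 1, 0, 0, 0])),
    dict(zip(columns, [2, "Dec-1", "b", 2, 3, 1, 1, 1])),
]

# Output dataframe from the example above, including the prediction columns.
output_columns = columns + ["predict", "A", "R"]
output_rows = [
    dict(zip(output_columns, [1, "Dec-1", "a", 0, 1, 1, 1, 1, "out1", 0.4, 0.6])),
    dict(zip(output_columns, [2, "Dec-1", "b", 2, 3, 0, 0, 0, "out2", 0.1, 0.9])),
]

mismatches = find_mismatched_cells(input_rows, output_rows)
for row_id, column, before, after in mismatches:
    print(f"id={row_id} column={column} input={before} output={after}")
```

On a real dataset the same comparison could be run on rows collected from the Spark DataFrame before and after the H2O round trip, which would show whether the affected columns and rows are the same from run to run.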

We are investigating the issue and trying to come up with a reproducible example (hard to share because the dataset is proprietary), but I'm wondering if you can tell me whether you have seen any similar bugs and what could be causing this. It is quite urgent, so please, I beg you, respond.

Here are more details:
1. We don't see the issue when running on a single node without the ignored columns.
2. All training columns are categorical (strings) and the model is doing binary classification. There are no missing values in the input dataset in the columns used for training.
3. The dataframe in question has a lot of ignored columns. Some of the columns are originally of Decimal type; all Decimal columns are cast to doubles before being passed to H2O. All data types are doubles, strings and timestamps.
4. Running Sparkling Water 1.5.10 on CDH 5.4.

The H2O code is very simple and is only a few lines:

cast Spark dataframe to H2OFrame
load stored H2O model from hdfs location
declare categorical columns
score on the H2OFrame (to get predict, A, R columns) (for GBM, DRF and LR models)
add predictions to original dataframe (from GBM, DRF and LR models)
dataframe update key
convert H2OFrame to Spark DataFrame

INVESTIGATING FURTHER:
Here are more observations when restricting to dataframes without the ignored columns (some 50 columns are ignored, 32 columns are retained):

  • we have two slightly different codebases which produce slightly different derived features on identical inputs

  • the same original input, after transformation, thus produces two slightly different input dataframes for prediction - I will call these Set1 and Set2

  • Set1 uses Model1 for prediction and there is no switching rows in this case - NO PROBLEM

  • Set2 uses Model2 for prediction and switching rows happens in this case - PROBLEM

  • passing Set2 through Model1: switching rows happens in this case - PROBLEM

  • passing only the subset of Set2 where the jumbling of rows happens (166 rows out of 20,000) through Model1: switching does NOT happen in this case - NO PROBLEM

So the same model and the same code, run on two almost identical datasets, behave very differently.

How is it possible that the structure of the dataset causes jumbling of data??? And only in some columns???
Could casting to different datatypes (Decimal to Double or int to string) have anything to do with it???

Thank you so much for any insight you can offer!

https://groups.google.com/forum/#!topic/h2ostream/QsgcbrpyJAs

Environment

None

Status

Assignee

Jakub Hava

Fix versions

None

Reporter

Avkash Chauhan

Support ticket URL

None

Labels

None

Release Priority

None

Affected Spark version

None

Customer Request Type

Forum

Task progress

None

CustomerVisible

No

Support Assessment

Data Science Issue

Priority

Major