H2OFrame in Python adds duplicate rows when converting a pandas DataFrame

Description

When converting a pandas DataFrame to an H2OFrame using the h2o.H2OFrame() function, the resulting frame has the wrong number of rows.

Additional rows are created in the H2OFrame. When I looked into this, the new rows appear to be duplicates of existing rows. The number of duplicate rows added varies with the data size, but is typically around 2-10.

Code:

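import h2o
h2o.init()

# train_df_complete is an existing pandas DataFrame
train_h2o = h2o.H2OFrame(python_obj=train_df_complete)

print(train_df_complete.shape[0])  # row count in pandas
print(train_h2o.nrow)  # row count in H2O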

Output:

3871998
3872000
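
To pin down which rows were added, one option is to count how often each row occurs on both sides and keep only the surplus. The following is a minimal sketch on toy data, not part of the original report: `extra_rows` is a hypothetical helper name, and it assumes rows can be compared as plain tuples.

```python
from collections import Counter


def extra_rows(source_rows, converted_rows):
    """Return a Counter of rows that occur more often in converted_rows
    than in source_rows (hypothetical helper, for illustration only)."""
    source_counts = Counter(tuple(row) for row in source_rows)
    converted_counts = Counter(tuple(row) for row in converted_rows)
    # Counter subtraction keeps only rows with a positive surplus.
    return converted_counts - source_counts


# Toy stand-ins for the pandas rows and the rows read back from H2O.
source = [(1, "a"), (2, "b"), (3, "c")]
converted = [(1, "a"), (2, "b"), (2, "b"), (3, "c"), (3, "c")]

surplus = extra_rows(source, converted)
print(sum(surplus.values()))  # total number of extra rows
for row, count in surplus.items():
    print(row, count)  # each duplicated row and how many extra copies exist
```

With the real frames, the rows could presumably be supplied as train_df_complete.itertuples(index=False, name=None) and train_h2o.as_data_frame().itertuples(index=False, name=None). Type conversions, float formatting, and NaN values may make identical rows compare as unequal after the round trip, so the result should be treated as a starting point for inspection.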

Assignee

New H2O Bugs

Fix versions

None

Reporter

George Carmichael

Support ticket URL

None

Labels

Affected Spark version

None

Customer Request Type

None

Task progress

None

CustomerVisible

Yes

Components

Affects versions

Priority

Critical