Parsing issue and inconsistencies

Description

---------- Forwarded message ----------
From: <geoffrey.sims@gmail.com>
Date: Mon, Jun 1, 2015 at 12:09 AM
Subject: [h2ostream] Parsing issue and inconsistencies
To: h2ostream@googlegroups.com

Hello,

I've been experiencing an issue with the parsing of a large data set (~500M rows, ~200 factor columns).

The issue is that a number of numerical factors are incorrectly detected as categoricals. Our NULL value is "\N" (as output by Hive), and in most cases it is detected correctly. But for a small number of factors (about 8 out of 200) the parser treats it as a category level. It is reproducible in the sense that if I start a new H2O instance and reload the data, the same 8 columns are flagged incorrectly.

Of course, as has been stated elsewhere, using for example the R as.numeric() function gives an error (cannot coerce type 'S4' to vector of type 'double'), so it cannot be forced numeric.

Interestingly, if I create a table consisting solely of the 8 columns which weren't being parsed correctly, and load JUST those columns as a new dataset into H2O, it parses perfectly. I've also confirmed independently that there is only one non-numeric value in each column, which is indeed the NULL character string.
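
An independent check of this kind could be sketched roughly as follows. This is only an illustration, not the procedure actually used for the report: the function name `non_numeric_tokens`, the comma delimiter, and the presence of a header row are all assumptions. It collects, per column, every distinct token that does not parse as a number, so a clean column should report only "\N".

```python
import csv
import io

def non_numeric_tokens(lines, delimiter=","):
    """Return {column name: set of distinct tokens that float() rejects}.

    Hypothetical helper; assumes the first row is a header.
    Note that float() accepts strings such as "nan" and "inf",
    so those would not be reported here.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    bad = {name: set() for name in header}
    for row in reader:
        for name, token in zip(header, row):
            try:
                float(token)
            except ValueError:
                bad[name].add(token)
    return bad

# Tiny in-memory sample standing in for the real export.
sample = io.StringIO("a,b\n1,2.5\n\\N,3\n4,\\N\n")
result = non_numeric_tokens(sample)
print(result)
```

For the real data the same function could be fed an open file handle instead of the in-memory sample, with `delimiter` set to whatever the Hive export uses.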

We are using H2O 2.8.6.2, and while I realise this won't fully be supported going forward, I would appreciate any insight into the above problem, or how we can work around it for now.
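
One possible workaround, sketched under assumptions and not tested against H2O 2.8.6.2, is to rewrite the "\N" token to an empty field before loading, on the assumption that the parser treats empty fields as missing values. The helper name `blank_out_nulls` and the comma delimiter are hypothetical.

```python
import csv
import io

def blank_out_nulls(src, dst, null_token="\\N", delimiter=","):
    """Copy delimited rows from src to dst, replacing fields equal to null_token with ""."""
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    for row in reader:
        writer.writerow(["" if field == null_token else field for field in row])

# Tiny in-memory sample standing in for the real export.
src = io.StringIO("a,b\n1,\\N\n\\N,2\n")
dst = io.StringIO()
blank_out_nulls(src, dst)
cleaned = dst.getvalue()
print(cleaned)
```

Only fields that are exactly the NULL token are changed, so a genuine string value that merely contains "\N" is left alone.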

cheers

Assignee

Brandon Hill

Reporter

SriSatish Ambati

Labels

None

CustomerVisible

No

testcase 1

None

testcase 2

None

testcase 3

None

h2ostream link

None

Affected Spark version

None

AffectedContact

None

AffectedCustomers

None

AffectedPilots

None

AffectedOpenSource

None

Support Assessment

None

Customer Request Type

None

Support ticket URL

None

End date

None

Baseline start date

None

Baseline end date

None

Task progress

None

Task mode

None

Priority

Major