---------- Forwarded message ----------
Date: Mon, Jun 1, 2015 at 12:09 AM
Subject: [h2ostream] Parsing issue and inconsistencies
I've been experiencing an issue with the parsing of a large data set (~500M rows, ~200 factor columns).
The issue is that a number of numeric columns are incorrectly detected as categorical. Our NULL value is "\N" (as output by Hive), and in most cases it is correctly detected as missing. But for a small number of columns (about 8 out of 200) the parser treats it as a category level. The problem is reproducible: if I start a new H2O instance and reload the data, the same 8 columns are flagged incorrectly.
As has been noted elsewhere, calling R's as.numeric() on such a column fails with an error (cannot coerce type 'S4' to vector of type 'double'), so the column cannot be forced to numeric that way.
Interestingly, if I create a table consisting solely of the 8 columns which weren't being parsed correctly, and load JUST those columns as a new dataset into H2O, it parses perfectly. I've also confirmed independently that each of these columns contains only one non-numeric value, which is indeed the NULL string "\N".
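
For anyone wanting to repeat that independent check outside H2O, here is a minimal sketch of one way to do it. It is not the method I actually used, and it makes assumptions: the export is tab-delimited text (Hive's default field delimiter may differ), and the helper names `is_numeric` and `non_numeric_tokens` are made up for illustration. Note that Python's float() is fairly permissive (it accepts "nan" and "inf", for example), so this is only an approximation of what any particular parser would accept.

```python
import csv
import io
from collections import defaultdict

NULL_TOKEN = r"\N"  # the NULL marker described above (backslash followed by N)


def is_numeric(token):
    """Return True if Python can parse the token as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def non_numeric_tokens(lines, delimiter="\t"):
    """Map column index -> set of distinct tokens that do not parse as numbers.

    The tab delimiter is an assumption; change it to match the real export.
    """
    # QUOTE_NONE keeps backslashes and quote characters exactly as written.
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    found = defaultdict(set)
    for row in reader:
        for idx, token in enumerate(row):
            if not is_numeric(token):
                found[idx].add(token)
    return dict(found)


# Tiny in-memory sample standing in for the real file.
sample = io.StringIO("1.5\t\\N\t3\n\\N\t2\t4\n7\t8\t9\n")
result = non_numeric_tokens(sample)

# A column is "clean" if its only non-numeric token is the NULL marker.
suspect = {idx for idx, tokens in result.items() if tokens != {NULL_TOKEN}}
```

On the real data, `sample` would be replaced by an open file handle. An empty `suspect` set means every non-numeric token found is the NULL marker, which matches what I saw for the 8 affected columns.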
We are using H2O 220.127.116.11, and while I realise this version won't be fully supported going forward, I would appreciate any insight into the above problem, or how we can work around it for now.
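
One workaround I am considering is to rewrite the export so that "\N" fields become empty fields before loading. The sketch below shows the idea only. It assumes tab-delimited text with no quoted or embedded delimiters, and it assumes the parser treats empty fields as missing, which I have not verified for this H2O version. The function name `blank_out_nulls` is hypothetical.

```python
import io


def blank_out_nulls(src, dst, null_token=r"\N", delimiter="\t"):
    """Copy src to dst, replacing fields equal to null_token with empty fields.

    Returns the number of fields replaced. Only whole-field matches are
    changed, so a value that merely contains the token is left alone.
    """
    replaced = 0
    for line in src:
        fields = line.rstrip("\n").split(delimiter)
        for i, token in enumerate(fields):
            if token == null_token:
                fields[i] = ""
                replaced += 1
        dst.write(delimiter.join(fields) + "\n")
    return replaced


# Tiny in-memory example; real use would pass open file handles instead.
src = io.StringIO("1.5\t\\N\t3\n\\N\t2\t4\n")
dst = io.StringIO()
count = blank_out_nulls(src, dst)
cleaned = dst.getvalue()
```

Because it streams line by line, this should cope with a file of this size without holding it in memory, though at ~500M rows it would be faster to do the equivalent substitution in Hive or with a command-line tool.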