Here's a file with utf8 chars
It's 10MB, so I left it on mr-0x2.
you can get it with
scp 0xdiag@mr-0x2:/home/0xdiag/syn_enums_10000000x1.csv .
It's got just 1 col with normal linux eols
but I also pick randomly from UTF8, so there might be "extra" eol-type chars, which makes it "longer" than 10e6 (the rowcount I used to generate it),
so this is okay:
$ wc -l *csv
Given what we (tomas/brandon) discussed yesterday, it should be considered to have 10033885 rows?
I think h2o doesn't count "empty" rows. So the wc comparison isn't going to work.
Maybe h2o should count empty rows and fill them with NA if people are going to do that comparison?
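To make the wc comparison concrete, here is a minimal Python sketch that counts both numbers for a file: the newline count (what wc -l reports) and the count of non-empty rows. The helper name count_rows is made up for illustration, and "empty" here is assumed to mean a line with nothing but its terminator; h2o's actual rule for skipping rows may differ.

```python
import io


def count_rows(stream):
    """Return (newline_count, nonempty_count) for a binary stream.

    newline_count matches what `wc -l` reports (number of LF bytes).
    nonempty_count skips lines holding only a line terminator; that
    definition of "empty" is an assumption, not h2o's documented rule.
    """
    newline_count = 0
    nonempty_count = 0
    for line in stream:  # binary iteration splits on b"\n" only
        if line.endswith(b"\n"):
            newline_count += 1
        if line.strip(b"\r\n"):
            nonempty_count += 1
    return newline_count, nonempty_count


# Tiny in-memory sample: 6 newlines, 3 rows with content
sample = io.BytesIO(b"a\n\nb\n\n\nc\n")
print(count_rows(sample))  # (6, 3)
```

For the real file, the same function could be called on open("syn_enums_10000000x1.csv", "rb"). If the second number is close to what h2o reports, skipped empty rows would explain the gap.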
But when I parse it with h2o (multi-node parse, though I guess that shouldn't matter), I get 9304014 rows.
44 (decimal) is comma
I may have some of the eol chars in there.
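As a rough illustration of why "eol type chars" matter, the sketch below compares two ways of counting lines in Python: counting only LF, versus str.splitlines(), which also breaks on characters such as U+2028 and vertical tab. This shows Python's behavior only; which characters h2o's parser treats as row terminators is not something this confirms, and the helper name eol_counts is hypothetical.

```python
def eol_counts(text):
    """Return (lf_only, unicode_aware) line counts for a string.

    lf_only counts "\n" characters, like `wc -l`.
    unicode_aware uses str.splitlines(), which also splits on other
    line-boundary characters (e.g. "\x0b", "\x0c", "\u2028", "\u2029").
    """
    lf_only = text.count("\n")
    unicode_aware = len(text.splitlines())
    return lf_only, unicode_aware


# Two LF-terminated lines, but the first one contains U+2028 and a vertical tab
sample = "a\u2028b\x0bc\nd\n"
print(eol_counts(sample))  # (2, 4)
```

So two tools can disagree on the row count of the same file just by using different definitions of a line break.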
In any case, h2o reports 9304014, which seems wrong? It would mean it's ignoring rows.
Exception: Expect numRows 9304014 >= rowCount 10000000 since we can have extra eols
The dataset looks like this
Some rows are empty. Do we not count them?