Should customers be able to compare wc -l on a dataset against the numRows that h2o reports, and expect them to match?

Description

Here's a file with UTF-8 chars.
It's 10MB, so I left it on mr-0x2.
You can get it with:

scp 0xdiag@mr-0x2:/home/0xdiag/syn_enums_10000000x1.csv .

It's got just 1 col with normal linux eols,
but I also pick randomly from UTF-8, so there might be "extra" eol-type chars. That makes it "longer" than 10e6 lines (10e6 is the rowcount I used to generate it).

so this is okay:
$ wc -l *csv
10033885 syn_enums_10000000x1.csv

Given what we (tomas/brandon) discussed yesterday, should it be considered to have 10033885 rows?
I think h2o doesn't count "empty" rows, so the wc comparison isn't going to work.
Maybe h2o should count empty rows and fill them with NA if people are going to do that comparison?
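To make the two counting rules concrete, here is a minimal Python sketch. It is not h2o code: count_lines is a hypothetical helper, and the "skip empty rows" rule is only my guess at what the parser does. It counts newline-terminated lines the way wc -l does, and separately counts lines that still have content after the eol is stripped.

```python
import io


def count_lines(stream):
    """Return (newline_count, non_empty_count) for a binary stream.

    newline_count mirrors wc -l (lines terminated by a newline byte).
    non_empty_count skips lines that contain only eol bytes, which is an
    ASSUMED model of a parser that does not count empty rows.
    """
    newline_count = 0
    non_empty_count = 0
    for raw in stream:  # binary streams iterate line by line, split on b"\n"
        if raw.endswith(b"\n"):
            newline_count += 1
        if raw.strip(b"\r\n"):  # anything left after removing CR/LF?
            non_empty_count += 1
    return newline_count, non_empty_count


# Tiny stand-in for the real file: three values separated by empty rows.
sample = io.BytesIO("ª\n\nÄ\n\n¬\n".encode("utf-8"))
print(count_lines(sample))  # (5, 3)
```

On the real file you would pass open("syn_enums_10000000x1.csv", "rb") instead of the in-memory sample. If the second number comes out near what h2o reports, the gap is explained by empty rows rather than dropped data.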

But when I parse it with h2o (multi-node parse, though I guess that shouldn't matter), I get 9304014 rows.

/2/PostFile.json?key=syn_enums_1000000x1.csv

/2/Parse2.json?destination_key=cI&source_key=syn_enums_1000000x1.csv&header=0&separator=44

44 (decimal) is comma
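For reference, a small Python sketch of how that query string can be assembled. The parameter names and values are copied from the request above; building it with urllib is just my illustration, not necessarily how the test harness does it.

```python
from urllib.parse import urlencode

# separator is passed as the decimal code of the delimiter character
separator = ord(",")  # 44

params = {
    "destination_key": "cI",
    "source_key": "syn_enums_1000000x1.csv",
    "header": 0,
    "separator": separator,
}
path = "/2/Parse2.json?" + urlencode(params)
print(path)
```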

I may have some of the eol chars in there.
In any case, h2o reports 9304014, which seems wrong. It would mean it's ignoring rows.

Exception: Expect numRows 9304014 >= rowCount 10000000 since we can have extra eols
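The numbers involved, worked through in a short Python sketch. The three counts come from this ticket; check_num_rows is a hypothetical reconstruction of the kind of check that would raise the message above, not the actual test code.

```python
wc_lines = 10033885        # from wc -l
generated_rows = 10000000  # rowcount used to generate the file
h2o_num_rows = 9304014     # numRows reported by h2o

extra_eols = wc_lines - generated_rows                # 33885
missing_vs_generated = generated_rows - h2o_num_rows  # 695986
missing_vs_wc = wc_lines - h2o_num_rows               # 729871


def check_num_rows(num_rows, row_count):
    """Hypothetical check: extra eols can only add rows, never remove them."""
    if num_rows < row_count:
        raise Exception(
            "Expect numRows %d >= rowCount %d since we can have extra eols"
            % (num_rows, row_count)
        )


try:
    check_num_rows(h2o_num_rows, generated_rows)
except Exception as e:
    print(e)
```

So h2o is short by 695986 rows against the generated rowcount and by 729871 against wc -l, while the extra eols only account for 33885 additional lines.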

Some rows are empty. Do we not count them?
The dataset looks like this:

ª

Ä

¬

Â

å

<8e>
=

j
Q

}
<8f>

Assignee

New H2O Bugs

Reporter

Kevin Normoyle

Labels

None

CustomerVisible

No

Priority

Major