h2o.merge returns an incorrect number of rows for a complex join on big data

Description

There seems to be a regression in h2o.merge. On large data with a complex join, it returns an incorrect number of rows.
AFAIR, with the version from November 2016, merge produced the expected number of rows. Unfortunately, I am not able to double-check that now because of an error. That version is still included in the script, though running it may require increasing the Java memory. In the latest stable release, the number of rows is incorrect.

The attached script can be used to reproduce the issue. It assumes that the datasets are available at:
data/J1_1e9_NA_0_0.csv
data/J1_1e9_1e9_0_0.csv

Unfortunately I couldn't reproduce the problem using smaller data.

The merge query is run twice in the script; that is irrelevant to this issue.

Output of the script on the recent version is shown below. The correct number of rows is 900000000.

{{[1] ‘3.30.0.3’ ## h2o version
[1] 1000000000 7 ## dim
[1] 1000000000 7 ## dim
[1] 452684385 13 ## dim(ans); first run
user system elapsed
4.121 1.000 370.810
[1] 452684385 13 ## dim(ans); second run
user system elapsed
4.046 0.222 580.150

## head(ans, 3)
  id3 id1 id2 id4 id5 id6 v1 id1.1 id2.1 id4.1 id5.1 id6.1
1 1 161 491574 id161 id491574 id1 75.13724 161 491574 id161 id491574 id1
2 2 633 992481 id633 id992481 id2 47.93112 633 992481 id633 id992481 id2
3 3 738 974137 id738 id974137 id3 89.55979 738 974137 id738 id974137 id3
  v2
1 0.803628
2 3.294198
3 58.137494

## tail(ans, 3)
  id3 id1 id2 id4 id5 id6 v1 id10 id20 id40
1 1e+09 162 705341 id162 id705341 id999999998 41.827172 162 705341 id162
2 1e+09 219 656057 id219 id656057 id999999999 78.156812 219 656057 id219
3 1e+09 722 146519 id722 id146519 id1000000000 7.737905 722 146519 id722
  id50 id60 v2
1 id705341 id999999998 57.570973
2 id656057 id999999999 8.684918
3 id146519 id1000000000 32.560568}}
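
A row count like the ones reported here can be cross-checked without the merge implementation under test, because the size of an inner join depends only on how often each key value occurs on each side. The following is a minimal sketch in plain Python, not part of the attached script. It assumes an inner join on a single key column; the column name `id3`, the helper names, and the toy data are illustrative assumptions, since the ticket does not state the join type or the join columns.

```python
from collections import Counter


def expected_inner_join_rows(left_keys, right_keys):
    """Row count of an inner join, computed from key multiplicities only."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key contributes (occurrences on the left) * (occurrences on the right).
    # Counter returns 0 for keys that are missing on the right.
    return sum(n * right_counts[key] for key, n in left_counts.items())


def naive_inner_join(left_rows, right_rows, key):
    """Reference nested-loop inner join; only suitable for small inputs."""
    return [
        {**left_row, **right_row}
        for left_row in left_rows
        for right_row in right_rows
        if left_row[key] == right_row[key]
    ]


# Toy data (hypothetical): 10 rows on the left, 9 of which have a match.
left = [{"id3": i, "v1": float(i)} for i in range(1, 11)]
right = [{"id3": i, "v2": float(i) * 2} for i in range(2, 11)]

expected = expected_inner_join_rows(
    (row["id3"] for row in left),
    (row["id3"] for row in right),
)
joined = naive_inner_join(left, right, "id3")
print(expected, len(joined))
```

On data of the size used in this ticket the nested-loop join is not practical, but the key-count formula only needs one pass over the key column of each dataset, so it could serve as an independent reference for the expected number of rows.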

Assignee

Wendy

Fix versions

None

Reporter

Jan Gorecki

Support ticket URL

None

Labels

None

Affected Spark version

None

Customer Request Type

None

Task progress

None

CustomerVisible

No

Priority

Major