Merge: OOM crash during the "gather" stage of merge

Description

Problem

Running merge uses far more memory than expected. The following are the results from running the attached merge_test.R script. Each merge is done against a 2.7-million-row by 100-column dataset, and the ID column is always unique integers.

NumNodes | MapperXmx | Total Cluster Size (after 10-15% overhead)
6        | 10gb      | 53.33gb

NumRows     | NumCols | Size in Mem | 3x Frame Memory       | Successfully Merges
2.7 million | 100     | 1gb         | 3*(1gb + 1gb)  = 6gb  | Yes
2.7 million | 500     | 6gb         | 3*(1gb + 6gb)  = 21gb | Yes
2.7 million | 1000    | 11gb        | 3*(1gb + 11gb) = 36gb | Yes
2.7 million | 1100    | 12gb        | 3*(1gb + 12gb) = 39gb | Yes
2.7 million | 1500    | 17gb        | 3*(1gb + 17gb) = 54gb | No
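The rule of thumb behind the "3x Frame Memory" column (peak merge memory is roughly 3 times the combined in-memory size of the two frames, with the fixed 100-column frame at about 1gb) can be checked with a few lines of base R. The function name below is illustrative, not an H2O API; the sizes come from the table above.

```r
# Rule-of-thumb estimator behind the "3x Frame Memory" column:
# peak merge memory ~ 3 * (size of the fixed ~1gb frame + size of the
# growing frame). merge_mem_gb is an illustrative name, not an H2O API.
merge_mem_gb <- function(left_gb, right_gb = 1) {
  3 * (right_gb + left_gb)
}

cluster_gb <- 53.33                  # 6 nodes * 10gb Xmx, minus overhead
left_gb    <- c(1, 6, 11, 12, 17)    # 100/500/1000/1100/1500 columns
estimate   <- merge_mem_gb(left_gb)  # 6 21 36 39 54
fits       <- estimate <= cluster_gb
# Only the last case (54gb) exceeds the 53.33gb cluster, matching the
# single "No" row in the table.
```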

Looking at the logs, it never takes H2O long to create the left and right indices, even on very large datasets.

However, when stitching the columns together by the row indices, memory usage blows up. The experiment shows that growing the number of columns, and with it the size of the frames, past a certain threshold leads to OOM. This was also tested with sparse data frames, and sparsity did not contribute to the spike in memory.

Temporary Solution

For those with relatively small datasets, we recommend moving the data back into a single-node R session and using data.table. Another trick, which avoids bringing too much data back into R when jobs and resources are hard to schedule, is to sort/arrange only one of the frames you want to merge. The resulting sorted frame can then be brought back into H2O to run a simple cbind, which is not as resource-intensive. Please see the attached Cbind_Join.R script for example code.
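The row-alignment idea behind the cbind trick can be sketched in a few lines of base R. This is a minimal sketch, not the attached Cbind_Join.R: plain data.frames stand in for H2OFrames, and on the H2O side the analogous calls would be h2o.arrange() and h2o.cbind(). It assumes both frames hold exactly the same set of unique IDs, so that once each side is sorted by ID the rows line up and a column bind reproduces the merge.

```r
# Sketch of the sort-then-cbind workaround using plain data.frames.
# Assumption: both frames contain the same set of unique IDs, so
# sorting both by ID makes the rows line up one-to-one.
left  <- data.frame(ID = c(3, 1, 2), x = c("c", "a", "b"))
right <- data.frame(ID = c(2, 3, 1), y = c("B", "C", "A"))

left_sorted  <- left[order(left$ID), ]
right_sorted <- right[order(right$ID), ]

# Rows are now aligned by ID, so binding the non-key columns of the
# right frame onto the left frame gives the same result as a merge.
joined <- cbind(left_sorted, y = right_sorted$y)
```

With H2OFrames, the same shape of solution keeps only the cheap steps in the cluster: sort with h2o.arrange(), then column-bind with h2o.cbind(), skipping the memory-hungry "gather" stage of h2o.merge entirely.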

Assignee

Michal Kurka

Fix versions

Reporter

Amy Wang

Support ticket URL

None

Labels

None

Affected Spark version

None

Customer Request Type

Support Incident

Task progress

None

CustomerVisible

Yes

AffectedCustomers

Components

Affects versions

Priority

Blocker