Running merge uses more memory than expected. The results below are from running the attached _merge_test.R_ script. Each merge is done against a 2.7-million-row by 100-column dataset, and the ID column is always unique integers.
**Total Cluster Size (after 10-15% overhead)**

| Size in Mem | 3x Frame Memory |
|-------------|-----------------|
| 1gb + 1gb   | 3*(1gb + 1gb) = 6gb   |
| 1gb + 6gb   | 3*(1gb + 6gb) = 21gb  |
| 1gb + 11gb  | 3*(1gb + 11gb) = 36gb |
| 1gb + 12gb  | 3*(1gb + 12gb) = 39gb |
| 1gb + 17gb  | 3*(1gb + 17gb) = 54gb |
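For reference, here is a minimal sketch of the kind of test being run. The attached merge_test.R is not reproduced here, so the frame construction, column names, and memory settings below are assumptions; only the 2.7-million-row size and the unique integer ID come from the description above.

```r
library(h2o)
h2o.init(max_mem_size = "8g")  # size to taste; the table above shows how fast requirements grow

n <- 2.7e6  # 2.7 million rows, as in the experiment (shrink for a quick local repro)

# Left frame: a unique integer id plus randomized filler columns.
left <- h2o.cbind(as.h2o(data.frame(id = seq_len(n))),
                  h2o.createFrame(rows = n, cols = 99, seed = 1))
names(left) <- c("id", paste0("L", 1:99))

# Right frame: the same unique ids with different filler columns.
right <- h2o.cbind(as.h2o(data.frame(id = seq_len(n))),
                   h2o.createFrame(rows = n, cols = 99, seed = 2))
names(right) <- c("id", paste0("R", 1:99))

# Index creation on the id key is fast; the column stitch afterwards is
# where the memory climbs.
joined <- h2o.merge(left, right, by = "id")
```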
Looking at the logs, it never takes H2O long to create the left and right indices, even on very large datasets.
However, when stitching the columns back together by the row indices, memory usage blows up. From the experiment you should see that growing the number of columns and the size of the frames past a certain threshold leads to OOM. This was also tested with sparse data frames, and sparsity did not contribute to the spike in memory.
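One way to watch the threshold effect is to widen the right-hand frame step by step and check free memory on the cluster after each merge. A sketch, reusing `n` and `left` from the snippet above; the column widths tried are illustrative, not from the original script:

```r
# Widen the right-hand frame and inspect per-node free memory after each merge.
for (p in c(10, 50, 100, 200)) {
  wide <- h2o.cbind(as.h2o(data.frame(id = seq_len(n))),
                    h2o.createFrame(rows = n, cols = p, seed = p))
  names(wide) <- c("id", paste0("W", seq_len(p)))
  m <- h2o.merge(left, wide, by = "id")
  print(h2o.clusterStatus())  # free memory per node shrinks as the frames widen
  h2o.rm(m)
  h2o.rm(wide)
}
```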
For those with relatively small datasets, we recommend moving some of the data back into a single-node R session and using data.table. Another trick, which avoids bringing too much back into R when jobs and resources are hard to schedule, is to sort/arrange only one of the frames you want to merge; the resulting sorted frame is then brought back into h2o for a simple cbind, which is not as resource-intensive. Please see the attached Cbind_Join.R script for example code.
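Both workarounds in sketch form, reusing `left` and `right` from above. Cbind_Join.R is not reproduced here, so this is only an assumption about its shape; the original note sorts just one frame (presumably because the other is already in id order), while this sketch sorts both for safety. The cbind route is only correct when the two frames contain exactly the same set of unique ids, so that after sorting, row i of one frame lines up with row i of the other.

```r
library(data.table)

# Workaround 1: small data. Pull the frames into a single-node R session
# and let data.table do the join.
lt <- as.data.table(as.data.frame(left))
rt <- as.data.table(as.data.frame(right))
setkey(lt, id)
setkey(rt, id)
joined_dt <- merge(lt, rt, by = "id")

# Workaround 2: sort on the key, then cbind inside h2o. A sort plus a
# cbind avoids the merge's expensive column stitch, but it only lines up
# correctly when both frames hold the same unique ids.
left_sorted  <- h2o.arrange(left, id)
right_sorted <- h2o.arrange(right, id)
joined_h2o   <- h2o.cbind(left_sorted,
                          right_sorted[, setdiff(names(right_sorted), "id")])
```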