Add Rule of Thumb for Data Conversion

Description

We should add the following rule of thumb to the Data Sharing section of the Sparkling Water documentation:
http://docs.h2o.ai/sparkling-water/2.3/latest-stable/doc/design/data_sharing.html

Memory Considerations When Converting Between Data Frame Types

When Using the Sparkling Water External Backend:

If you have allocated the recommended amount of memory to your H2O cluster (4 x the size of your dataset), you don't need to worry about memory constraints when converting between a Spark DataFrame and an H2OFrame. H2O runs in its own cluster in this mode, so the conversion does not compete with Spark storage.

Note: the 4 x figure assumes your dataset is stored as CSV. If your dataset is stored as JSON, XML, or Parquet, the requirements may differ significantly.

When Using the Sparkling Water Internal Backend:

In internal backend mode, H2O-3 shares the JVM with the Spark executors. In this case, allocate enough memory to run Spark transformations on your DataFrame (at minimum, the size of your dataset plus the memory those transformations need), and then allocate an additional 4 x the size of your dataset for H2O.

Note: data is duplicated when you convert from a Spark DataFrame to an H2OFrame (though H2O uses compression tricks to help reduce the memory requirements for this conversion). Data is not duplicated when you convert from an H2OFrame to a Spark DataFrame, because Sparkling Water uses a wrapper around the H2OFrame that exposes the RDD/DataFrame API.
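
The arithmetic behind both rules can be sketched as a small helper. This is only an illustration of the rule of thumb above, not part of Sparkling Water: the function name `estimate_memory_gb`, its parameters, and the `transform_overhead_gb` figure are hypothetical, and the dataset size is assumed to be its size as CSV.

```python
# Rule-of-thumb multiplier from the text, relative to the dataset's CSV size.
H2O_FACTOR = 4


def estimate_memory_gb(backend, csv_size_gb, transform_overhead_gb=0.0):
    """Estimate memory to allocate, in GB, following the rule of thumb.

    backend: "external" or "internal" (hypothetical labels for the two modes).
    csv_size_gb: size of the dataset when represented as CSV.
    transform_overhead_gb: extra memory assumed for Spark transformations;
        only used for the internal backend.
    """
    if csv_size_gb < 0 or transform_overhead_gb < 0:
        raise ValueError("sizes must be non-negative")

    h2o_gb = H2O_FACTOR * csv_size_gb

    if backend == "external":
        # H2O runs in its own cluster, so this is the H2O cluster size only.
        return h2o_gb
    if backend == "internal":
        # H2O shares the executor JVMs: dataset + transformations + 4 x dataset.
        return csv_size_gb + transform_overhead_gb + h2o_gb
    raise ValueError("backend must be 'external' or 'internal'")


if __name__ == "__main__":
    print(estimate_memory_gb("external", 10))
    print(estimate_memory_gb("internal", 10, transform_overhead_gb=5))
```

For a 10 GB CSV dataset, this gives 40 GB for the H2O cluster in external mode. In internal mode, assuming 5 GB of transformation overhead, it gives 55 GB across the Spark executors. Datasets stored as JSON, XML, or Parquet would need a different input size, as noted above.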

Status

Assignee

Jakub Hava

Reporter

Lauren DiPerna

Labels

None

CustomerVisible

No

Priority

Major