Memory leak in H2O (standalone cluster)

Description

UPDATE

I created a reproducible example in R and tested it on a small four-node Linux cluster running H2O 3.8.3.2.

The workflow creates dummy data and then iteratively builds a new model, makes a prediction, calculates a dummy KPI and finally removes the model and the prediction data. It uses the "full blown gc" approach from Tom (https://groups.google.com/d/msg/h2ostream/Dc6l4xzwkaU/n-w2p02mBwAJ).
You can run it with the attached script.
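For illustration, the core of such a loop might look roughly like the sketch below. This is not the attached script: the cluster address, data sizes, model type and the GC helper (which posts to an assumed /3/GarbageCollect REST endpoint, following the idea in the linked thread) are all assumptions.

```r
library(h2o)
library(httr)

# Connect to the already running standalone cluster (address is hypothetical)
h2o.init(ip = "node1.example.com", port = 54321, startH2O = FALSE)

# Hypothetical helper for the "full blown gc" approach: ask the cluster to run a
# garbage collection via the (assumed) /3/GarbageCollect REST endpoint.
trigger_cluster_gc <- function(ip = "node1.example.com", port = 54321) {
  invisible(httr::POST(sprintf("http://%s:%s/3/GarbageCollect", ip, port)))
}

# Dummy data: one numeric response plus 20 random features
raw   <- data.frame(y = rnorm(1e5), matrix(rnorm(1e5 * 20), ncol = 20))
train <- as.h2o(raw, destination_frame = "dummy_train")

for (i in seq_len(500)) {
  fit  <- h2o.glm(x = paste0("X", 1:20), y = "y", training_frame = train)
  pred <- h2o.predict(fit, train)
  kpi  <- mean((as.vector(pred$predict) - raw$y)^2)  # dummy KPI

  # Simple housekeeping: remove the prediction frame and the model
  h2o.rm(pred)
  h2o.rm(fit@model_id)

  # "Full blown gc": trigger several cluster-wide GC runs per iteration
  for (k in 1:3) trigger_cluster_gc()
}
```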

I ran it twice: once with only simple housekeeping (h2o.rm) and a larger dataset, and once with Tom's GC approach and a smaller dataset. In both cases I used a fresh H2O cluster where each of the four nodes was started with the same launch command.

I attached the JVM node logs from each run:

  • only simple housekeeping

  • triggers multiple GC runs

A first analysis indicates that in both cases the heap grows from iteration to iteration, regardless of whether we use just simple housekeeping or multiple garbage collections:

  • only simple housekeeping

  • triggers multiple GC runs (in the picture I hid the GC runs so that the heap consumption is visible)


END OF UPDATE

--------------------------------------------------------------------------------------------

Monitoring memory consumption in H2O shows that there is a memory leak when running repetitive model-creation jobs. Typical ML use cases for this are, for example, hyperparameter tuning, model validation using a resampling approach, feature selection, bootstrapping, etc.
Our example is feature selection: we take a subset of the features, train a model and evaluate it afterwards. After each of these iterations, all newly created data sets (the prediction frame) and the models are removed with h2o.rm().
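For context, one iteration of such a feature-selection loop might look roughly like the following sketch; the frame names, the sampled subset size and the model type are illustrative assumptions, not the actual production code.

```r
library(h2o)

# Hypothetical frames assumed to be already loaded into the cluster
train      <- h2o.getFrame("fs_train")
validation <- h2o.getFrame("fs_validation")

# One iteration: sample a candidate feature subset, train, evaluate
features <- setdiff(colnames(train), "y")
subset_i <- sample(features, size = min(10, length(features)))

fit_i  <- h2o.gbm(x = subset_i, y = "y", training_frame = train)
pred_i <- h2o.predict(fit_i, validation)
score  <- h2o.rmse(h2o.performance(fit_i, newdata = validation))

# Housekeeping after the iteration: remove the prediction frame and the model
h2o.rm(pred_i)
h2o.rm(fit_i@model_id)
```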

The cluster consists of six nodes, each started with the same launch command.

After we encountered the first problem, we started a new run and, in parallel, created a small monitoring script to get a constant update of the H2O cluster statistics. This script runs on the main node.

Additionally, the script counts the number of keys (user objects) using the R API.
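A minimal sketch of such a monitoring loop, assuming the statistics come from h2o.clusterStatus() and the key count from h2o.ls(); the actual script, the polling interval, the output path and the exact column names are assumptions and may differ between H2O versions.

```r
library(h2o)
h2o.init(ip = "localhost", port = 54321, startH2O = FALSE)  # main node (hypothetical address)

log_file <- "h2o_cluster_stats.csv"  # hypothetical output path

repeat {
  status   <- as.data.frame(h2o.clusterStatus())  # per-node cluster statistics
  kv_count <- nrow(h2o.ls())                      # number of user objects (keys)

  snapshot <- data.frame(
    timestamp = Sys.time(),
    node      = status$h2o,        # node address column (name may differ)
    free_mem  = status$free_mem,
    pojo_mem  = status$pojo_mem,
    kv_count  = kv_count
  )

  # Write the header only on the first snapshot, then append
  first_write <- !file.exists(log_file)
  write.table(snapshot, log_file, append = !first_write, sep = ",",
              col.names = first_write, row.names = FALSE)

  Sys.sleep(60)  # sample once per minute
}
```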
The analysis of the monitoring data after approx. 15 hours shows:

  1. We keep our workspace clean, so the number of user objects stays constant (kv_count)

  2. Free memory is decreasing over time (free_mem)

  3. POJO memory is increasing over time, with clearly visible spikes (pojo_mem)

The pojo_mem spikes correspond to log warnings of the following form:
[for node .45.2]

[for node .45.3]

As the number of user objects is constant, the memory increase points to problematic garbage collection or internal housekeeping, and it has a serious impact on the usability of the H2O cluster: node failure. In our first run this manifested as the current job stopping further processing, while most Flow requests became unresponsive. To resolve the problem we had to restart the cluster, meaning a complete loss of data and results.

Assignee

Roberto Rösler

Fix versions

None

Reporter

Roberto Rösler

Support ticket URL

None

Labels

None

Affected Spark version

None

Customer Request Type

None

Task progress

None

CustomerVisible

Yes

Priority

Critical