XGBoost Estimator takes too much memory

Description

I'm trying to use the XGBoost Estimator to train a large dataset (around 1 GB) in distributed mode on a Hadoop cluster. I followed the XGBoost part of the tutorial to launch h2o_hadoop.jar through YARN, adjusted the YARN configuration for my setup, and ran it with the option -extramempercent 100.
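For reference, here is roughly how I launch and connect, sketched in Python (the node address, output path, and exact launch flags are placeholders from my environment, not verified values):

    # Cluster launch on YARN, roughly (placeholder output path):
    #   hadoop jar h2o_hadoop.jar -nodes 3 -mapperXmx 8g \
    #       -extramempercent 100 -output /tmp/h2o_out
    import h2o

    # Attach the Python client to the H2O cluster started by the YARN driver.
    h2o.connect(ip="driver-node", port=54321)  # placeholder address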
However, it consumes a surprisingly large amount of memory on the cluster: 1 GB of data ends up using about 40 GB of cluster memory in total.
If I switch XGBoost to LightGBM emulation (with tree_method="hist", grow_policy="lossguide"), it uses even more memory than plain XGBoost: about 75 GB.
Also, the memory used during training grows with the number of trees: with ntrees = 3000, the job crashed around tree 1500 because the memory limit was exceeded.
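The training step is essentially the sketch below (the HDFS path and response column name are placeholders; leaving tree_method and grow_policy at their defaults gives the plain XGBoost case):

    import h2o
    from h2o.estimators import H2OXGBoostEstimator

    train = h2o.import_file("hdfs:///data/train_1gb.csv")  # placeholder path, ~1 GB

    model = H2OXGBoostEstimator(
        ntrees=3000,              # memory climbs with the tree count until the limit is hit
        tree_method="hist",       # together with grow_policy="lossguide",
        grow_policy="lossguide",  # this is the LightGBM-emulation mode
    )
    model.train(y="target", training_frame=train)  # "target" is a placeholder column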
I hope you can test this with a large dataset and watch the memory usage on the cluster.
I can train the same data in local mode with ntrees = 5000 (memory usage stays stable, i.e. it does not grow during training), but I can't train it in distributed mode.
My cluster environment: 3 machines, each with 8 cores and 32 GB RAM.

Assignee

New H2O Bugs

Fix versions

Reporter

zhongtian xiao

Support ticket URL

None

Labels

Affected Spark version

None

Customer Request Type

None

Task progress

None

CustomerVisible

No

Components

Affects versions

Priority

Major