I'm trying to use the XGBoost Estimator to train on a large dataset (~1 GB) in distributed mode on a Hadoop cluster. I followed the XGBoost part of the tutorial to launch h2o_hadoop.jar via YARN, adjusted the YARN configuration for my setup, and ran the driver with the option -extramempercent 100.
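For reference, the launch and connection look roughly like this (a minimal sketch; the node count, heap size, output path, and driver address are placeholders for my setup, only -extramempercent 100 is the flag I actually mentioned):

```python
# Cluster launched beforehand from the shell, per the H2O-on-Hadoop tutorial
# (all values except -extramempercent 100 are placeholders):
#   hadoop jar h2o_hadoop.jar -nodes 3 -mapperXmx 20g -extramempercent 100 -output /tmp/h2o_out

import h2o

# Connect the Python client to the H2O cluster running on YARN
h2o.connect(ip="cluster-node-1", port=54321)  # hypothetical driver address
```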
But I found something really strange: it uses far too much memory in the cluster (1 GB of data -> ~40 GB total across the cluster).
If I switch XGBoost into LightGBM emulation mode (with tree_method="hist", grow_policy="lossguide"), it uses even more memory than plain XGBoost, around 75 GB. It's really terrible.
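For context, the two configurations I'm comparing look roughly like this (a sketch; the HDFS path and predictor/response columns are placeholders):

```python
from h2o.estimators import H2OXGBoostEstimator

train = h2o.import_file("hdfs://namenode/path/to/train.csv")  # ~1 GB, path hypothetical
x = train.columns[:-1]  # placeholder predictors
y = train.columns[-1]   # placeholder response

# Plain XGBoost: ends up using ~40 GB across the cluster
xgb = H2OXGBoostEstimator(ntrees=3000)
xgb.train(x=x, y=y, training_frame=train)

# LightGBM emulation mode: ends up using ~75 GB across the cluster
lgbm_like = H2OXGBoostEstimator(ntrees=3000,
                                tree_method="hist",
                                grow_policy="lossguide")
lgbm_like.train(x=x, y=y, training_frame=train)
```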
Also, during training the memory usage keeps growing with the number of trees: with ntrees = 3000, the program crashed around tree 1500 because the memory limit was exceeded.
I hope you can run a test with a large dataset and watch the memory usage in the cluster.
I can train the same data in local mode with ntrees = 5000 (memory usage is stable, i.e. it does not grow during training), but I can't train it in distributed mode.
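The stable local run is just the same training code against a single local node (sketch; the heap size is a placeholder):

```python
import h2o

# Single local H2O node instead of the YARN cluster; heap size hypothetical
h2o.init(max_mem_size="24G")

# Same import and H2OXGBoostEstimator code as above, with ntrees=5000:
# here memory usage stays flat for the whole run.
```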
My cluster environment: 3 machines, each with 8 cores and 32 GB RAM.