XGBoost: "NCCL failure :cuda malloc failed" memory allocation crash on munged BNPParibas

Description

h2o-3 crashes with the following stack trace when XGBoost is run on BNPParibas as munged by autodl 0.9.1. This is with h2o-3 built from the mm/xgb_upgrade branch, which updates to the latest XGBoost. Note that this is similar but not identical to PUBDEV-4997. The identical model build failed for me in both ways: a couple of times one way and once the other.

The dataset, logfile and a repro Python script are here: mr-dl2:/home/rpeck/XGBoost-realloc-crash-2017.10.11

[23:21:14] /home/michal/dev/xgboost/dmlc-core/include/dmlc/logging.h:300: [23:21:14] /home/michal/dev/xgboost/src/tree/updater_gpu_hist.cu:286: GPU plugin exception: NCCL failure :cuda malloc failed /home/michal/dev/xgboost/src/tree/updater_gpu_hist.cu(318)

Stack trace returned 7 entries:
[bt] (0) /tmp/libxgboost4j_gpu8406566776870335554.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f5e65ef043c]
[bt] (1) /tmp/libxgboost4j_gpu8406566776870335554.so(_ZN7xgboost4tree12GPUHistMaker6UpdateERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EEPNS_7DMatrixERKS2_IPNS_7RegTreeESaISD_EE+0x15e) [0x7f5e661496ee]
[bt] (2) /tmp/libxgboost4j_gpu8406566776870335554.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EEPNS_7DMatrixEiPS2_ISt10unique_ptrINS_7RegTreeESt14default_deleteISD_EESaISG_EE+0x9ce) [0x7f5e65f1a21e]
[bt] (3) /tmp/libxgboost4j_gpu8406566776870335554.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPSt6vectorINS_6detail18bst_gpair_internalIfEESaIS7_EEPNS_11ObjFunctionE+0xb50) [0x7f5e65f1b640]
[bt] (4) /tmp/libxgboost4j_gpu8406566776870335554.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiPNS_7DMatrixE+0x22b) [0x7f5e6609a20b]
[bt] (5) /tmp/libxgboost4j_gpu8406566776870335554.so(XGBoosterUpdateOneIter+0x27) [0x7f5e6606bde7]
[bt] (6) [0x7f5ece19210d]
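
For triage, the frames above can be reduced to their mangled symbols and offsets, which can then be handed to a demangler such as c++filt. The following is a minimal sketch only: the helper name `parse_bt_frames` and the regular expression are illustrative assumptions based on the frame layout in this log, not part of h2o-3 or XGBoost.

```python
import re

# Assumed frame layout, taken from the log in this ticket:
#   [bt] (5) /tmp/lib.so(XGBoosterUpdateOneIter+0x27) [0x7f5e6606bde7]
# Frames with no library/symbol, e.g. "[bt] (6) [0x7f5ece19210d]",
# are still matched; their lib, symbol and offset are None.
FRAME_RE = re.compile(
    r"^\[bt\] \((?P<index>\d+)\) "
    r"(?:(?P<lib>\S+?)\((?P<symbol>[^+()]+)\+(?P<offset>0x[0-9a-f]+)\) )?"
    r"\[(?P<address>0x[0-9a-f]+)\]$"
)


def parse_bt_frames(text):
    """Return one dict per '[bt]' line found in text (hypothetical helper)."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frame = match.groupdict()
            frame["index"] = int(frame["index"])
            frames.append(frame)
    return frames
```

Printing the `symbol` field of each parsed frame, one per line, and piping the result through c++filt should give readable C++ names for frames 0 to 4; frame 5 is already an unmangled C symbol and frame 6 carries only an address.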

Assignee

Rory Mitchell

Fix versions

None

Reporter

Raymond Peck

Support ticket URL

None

Labels

None

Affected Spark version

None

Customer Request Type

None

Task progress

None

CustomerVisible

Yes

Components

Sprint

None

Priority

Blocker