XGBoost: realloc() memory allocation crash on munged BNPParibas

Description

h2o-3 crashes with the following stack trace when XGBoost is run on BNPParibas as munged by autodl 0.9.1. This is with h2o-3 built from the mm/xgb_upgrade branch, which updates to the latest XGBoost.

I have saved the dataset, logfile and a repro Python script here: mr-dl2:/home/rpeck/XGBoost-realloc-crash-2017.10.11

        *** Error in `java': realloc(): invalid next size: 0x00007fd204280850 ***
        ======= Backtrace: =========
        /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fd49f0207e5]
        /lib/x86_64-linux-gnu/libc.so.6(+0x834aa)[0x7fd49f02c4aa]
        /lib/x86_64-linux-gnu/libc.so.6(realloc+0x179)[0x7fd49f02d839]
        /usr/lib/x86_64-linux-gnu/libpciaccess.so.0(+0x3c96)[0x7fd400ac8c96]
        /usr/lib/x86_64-linux-gnu/libpciaccess.so.0(+0x3ef2)[0x7fd400ac8ef2]
        /usr/lib/x86_64-linux-gnu/libpciaccess.so.0(pci_device_get_device_name+0x54)[0x7fd400ac9094]
        /usr/lib/nvidia-375/libnvidia-ml.so(+0xf7e6c)[0x7fd3e8184e6c]
        /usr/lib/nvidia-375/libnvidia-ml.so(+0xe9106)[0x7fd3e8176106]
        /usr/lib/nvidia-375/libnvidia-ml.so(+0xc4cc)[0x7fd3e80994cc]
        /usr/lib/nvidia-375/libnvidia-ml.so(nvmlDeviceGetCpuAffinity+0x338)[0x7fd3e80ccb88]
        /usr/lib/nvidia-375/libnvidia-ml.so(nvmlDeviceSetCpuAffinity+0x21e)[0x7fd3e80ccfce]
        /tmp/libxgboost4j_gpu4238844599563539840.so(_Z28wrapNvmlDeviceSetCpuAffinityP13nvmlDevice_st+0x14)[0x7fd42a154964]
        /tmp/libxgboost4j_gpu4238844599563539840.so(ncclCommInitAll+0x49a)[0x7fd42a14d9da]
        /tmp/libxgboost4j_gpu4238844599563539840.so(_ZN7xgboost4tree12GPUHistMaker8InitDataERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EERNS_7DMatrixERKNS_7RegTreeE+0xa27)[0x7fd42a1417a7]
        /tmp/libxgboost4j_gpu4238844599563539840.so(_ZN7xgboost4tree12GPUHistMaker10UpdateTreeERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EEPNS_7DMatrixEPNS_7RegTreeE+0x4e)[0x7fd42a14503e]
        /tmp/libxgboost4j_gpu4238844599563539840.so(_ZN7xgboost4tree12GPUHistMaker6UpdateERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EEPNS_7DMatrixERKS2_IPNS_7RegTreeESaISD_EE+0x8a)[0x7fd42a14961a]
        /tmp/libxgboost4j_gpu4238844599563539840.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesERKSt6vectorINS_6detail18bst_gpair_internalIfEESaIS5_EEPNS_7DMatrixEiPS2_ISt10unique_ptrINS_7RegTreeESt14default_deleteISD_EESaISG_EE+0x9ce)[0x7fd429f1a21e]
        /tmp/libxgboost4j_gpu4238844599563539840.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPSt6vectorINS_6detail18bst_gpair_internalIfEESaIS7_EEPNS_11ObjFunctionE+0xb50)[0x7fd429f1b640]
        /tmp/libxgboost4j_gpu4238844599563539840.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiPNS_7DMatrixE+0x22b)[0x7fd42a09a20b]
        /tmp/libxgboost4j_gpu4238844599563539840.so(XGBoosterUpdateOneIter+0x27)[0x7fd42a06bde7]
        [0x7fd489017b94]
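Read from the bottom up, the trace enters native code at XGBoosterUpdateOneIter, passes through GPUHistMaker::InitData, ncclCommInitAll and nvmlDeviceSetCpuAffinity, and aborts in realloc() called from libpciaccess. As a rough aid for triaging traces like this one, the sketch below collapses the frames into the sequence of libraries they cross. The helper names (parse_frames, library_path) and the abbreviated SAMPLE are illustrative assumptions, not part of h2o-3 or the saved repro script.

```python
import re

# One glibc backtrace frame has the shape: /path/to/lib.so(symbol+0xoff)[0xaddr]
FRAME_RE = re.compile(
    r"^\s*(?P<lib>/\S+?)\((?P<sym>[^)]*)\)\[(?P<addr>0x[0-9a-f]+)\]\s*$"
)

# Abbreviated copy of frames from the ticket (innermost frame first).
SAMPLE = """
/lib/x86_64-linux-gnu/libc.so.6(realloc+0x179)[0x7fd49f02d839]
/usr/lib/x86_64-linux-gnu/libpciaccess.so.0(pci_device_get_device_name+0x54)[0x7fd400ac9094]
/usr/lib/nvidia-375/libnvidia-ml.so(nvmlDeviceGetCpuAffinity+0x338)[0x7fd3e80ccb88]
/usr/lib/nvidia-375/libnvidia-ml.so(nvmlDeviceSetCpuAffinity+0x21e)[0x7fd3e80ccfce]
/tmp/libxgboost4j_gpu4238844599563539840.so(ncclCommInitAll+0x49a)[0x7fd42a14d9da]
/tmp/libxgboost4j_gpu4238844599563539840.so(XGBoosterUpdateOneIter+0x27)[0x7fd42a06bde7]
[0x7fd489017b94]
"""


def parse_frames(trace):
    """Return (library file name, symbol) pairs for every frame that names a library."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # skips blank lines and bare-address frames such as [0x...]
        lib = match.group("lib").rsplit("/", 1)[-1]
        sym = match.group("sym").split("+", 1)[0]  # empty when only an offset is given
        frames.append((lib, sym))
    return frames


def library_path(frames):
    """Collapse consecutive frames from the same library into a single entry."""
    path = []
    for lib, _sym in frames:
        if not path or path[-1] != lib:
            path.append(lib)
    return path


if __name__ == "__main__":
    for lib in library_path(parse_frames(SAMPLE)):
        print(lib)
```

The bare-address frame at the end of the trace names no library, so this sketch skips it.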

Assignee

Rory Mitchell

Fix versions

None

Reporter

Raymond Peck

Support ticket URL

None

Labels

None

Affected Spark version

None

Customer Request Type

None

Task progress

None

ReleaseNotesHidden

None

CustomerVisible

Yes

Components

Sprint

None

Priority

Blocker