Bad split on categorical variable in GBM and DRF affecting model quality

Description

Dataset and config to reproduce the bug

Training CSV used:

categorical_column,target
A,True
B,True
C,False
D,False
E,True
F,True
G,False
H,False

Configuration used:

parameter  | value
ntrees     | 1
max_depth  | 1
min_rows   | 1
nbins_cats | 4 (BUG) / 8 (NO BUG)

Column types:

column             | type
categorical_column | enum
target             | enum
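Assuming the dataset and parameters above, the setup can be reproduced with a short H2O Python script. This is a sketch: it needs a running H2O cluster, and the column and parameter names are taken directly from this ticket.

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Build the training frame from the CSV in this ticket.
frame = h2o.H2OFrame({
    "categorical_column": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "target": [True, True, False, False, True, True, False, False],
})
frame["categorical_column"] = frame["categorical_column"].asfactor()
frame["target"] = frame["target"].asfactor()

# nbins_cats=4 reproduces the bad split; nbins_cats=8 does not.
model = H2OGradientBoostingEstimator(
    ntrees=1, max_depth=1, min_rows=1, nbins_cats=4)
model.train(x=["categorical_column"], y="target", training_frame=frame)

# Reported behaviour: training AUC 0.75 with nbins_cats=4, 1.0 with nbins_cats=8.
print(model.auc(train=True))
```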

Explanations

With nbins_cats = 8 (i.e. nbins_cats greater than or equal to the number of unique values of the categorical column), there is no bug: the training AUC is 1 as expected, and the tree is the expected one shown below:

With nbins_cats = 4, the bug appears: a bad (and "numerical") split is made on the categorical column, the training AUC drops to 0.75, and the bad tree shown below confirms it:

Normally, even with nbins_cats = 4, this example should produce the same optimal split as with nbins_cats = 8, and the AUC should therefore be 1.
But you can see that the AUC is only 0.75 instead of 1.
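The gap between the two AUC values can be illustrated with a small stand-alone simulation. This is an illustrative model of the suspected behaviour, not H2O's actual implementation: it assumes that with nbins_cats = 4 the 8 categories are grouped pairwise into 4 bins and the split degrades to a numeric threshold over the bin index, so only contiguous prefixes of bins can go left, whereas with nbins_cats = 8 any subset of categories can be sent left.

```python
from itertools import combinations

# Eight categories with targets matching the training CSV in this ticket.
cats = list("ABCDEFGH")
target = [True, True, False, False, True, True, False, False]

def auc_of_split(left_true):
    """AUC of a single binary split: predict True for rows whose
    category is in `left_true`, False otherwise.  For a hard binary
    classifier this equals (TPR + TNR) / 2."""
    pred = [c in left_true for c in cats]
    tp = sum(p and t for p, t in zip(pred, target))
    tn = sum((not p) and (not t) for p, t in zip(pred, target))
    tpr = tp / sum(target)
    tnr = tn / (len(target) - sum(target))
    return (tpr + tnr) / 2

# nbins_cats = 8: each category is its own bin, so the split may send
# any subset of categories left.  The subset {A, B, E, F} is perfect.
best_subset = max(
    (set(s) for r in range(1, 8) for s in combinations(cats, r)),
    key=auc_of_split,
)
print(auc_of_split(best_subset))     # 1.0

# nbins_cats = 4 (assumed pairwise binning {A,B}, {C,D}, {E,F}, {G,H}):
# a numeric threshold over the bin index only allows contiguous prefixes,
# and the best such split is imperfect.
bins = ["AB", "CD", "EF", "GH"]
best_threshold = max(
    (set("".join(bins[:k])) for k in range(1, 4)),
    key=auc_of_split,
)
print(auc_of_split(best_threshold))  # 0.75
```

Under these assumptions the best threshold split reaches exactly the 0.75 training AUC reported above, while an unrestricted categorical split reaches 1.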

Assignee

Michal Kurka

Fix versions

Reporter

Ismael

Support ticket URL

None

Labels

Affected Spark version

None

Customer Request Type

None

Task progress

None

CustomerVisible

Yes

Components

Affects versions

Priority

Critical