H2O should standardize categorical predictors (dummy/one-hot) for regularized GLM models (lasso, ridge, elastic net)

Description

The h2o.glm function includes a ‘standardize’ parameter that is true by default, and this standardizes continuous predictors. However, if predictors are stored as factors within the input H2OFrame, H2O does not standardize the automatically encoded factor variables (i.e., the resultant dummy or one-hot vectors). I've confirmed this experimentally, but references to this decision also show up in the source code:

For instance, method denormalizeBeta (https://github.com/h2oai/h2o-3/blob/553321ad5c061f4831c2c603c828a45303e63d2e/h2o-algos/src/main/java/hex/DataInfo.java#L359) includes the comment "denormalize only the numeric coefs (categoricals are not normalized)." It also looks like means (variable _normSub) and standard deviations (inverse of variable _normMul) are only calculated for the numerical variables, and not the categorical variables, in the setTransform method (https://github.com/h2oai/h2o-3/blob/553321ad5c061f4831c2c603c828a45303e63d2e/h2o-algos/src/main/java/hex/DataInfo.java#L599).

However, per Tibshirani (1997), dummy or one-hot variables should also be standardized to enable fair regularization penalties:

"The lasso method requires initial standardization of the regressors, so that the penalization scheme is fair to all regressors. For categorical regressors, one codes the regressor with dummy variables and then standardizes the dummy variables" (p. 394).

Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in medicine, 16(4), 385-395. http://statweb.stanford.edu/~tibs/lasso/fulltext.pdf

Statistical Considerations:

For a dummy variable or one-hot vector, the mean is the proportion of TRUE values, and the SD is directly proportional to the mean. The SD reaches its maximum when the proportion of TRUE and FALSE values is equal (i.e., σ = 0.5), and the sample SD (s) approaches 0.5 as n → ∞. Thus, if continuous predictors are standardized to have SD = 1, but dummy variables are left unstandardized, the continuous predictors will have at least twice the SD of the dummy predictors, and more than twice the SD for imbalanced dummy variables.

It seems like this could be a problem for regularization (LASSO, ridge, elastic net), because the scale/variance of predictors is expected to be equal so that the regularization penalty (λ) applies evenly across predictors. If two predictors A and B have the same standardized effect size, but A has a smaller SD than B, A will necessarily have a larger unstandardized coefficient than B. This means that, if left unstandardized, the regularization penalty will erroneously be more severe to A than B. In a regularized regression with a mixture of standardized continuous predictors and unstandardized categorical predictors, it seems like this could lead to systematic over-penalization of categorical predictors.

Assignee

New H2O Bugs

Fix versions

None

Reporter

Elliot

Support ticket URL

None

Labels

Affected Spark version

None

Customer Request Type

None

Task progress

None

CustomerVisible

Yes

Components

Priority

Major
Configure