PCA with standardized transformation not showing zero mean for categoricals

Description

When H2OPrincipalComponentAnalysisEstimator is used to compute principal components with the standardized transformation, the first components do not have zero mean, as they should, when categorical variables are present. If those variables are one-hot encoded up front (one 0/1 column per category level), the result is as expected: all component means are zero.
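For reference, here is a minimal NumPy sketch of why zero means are expected. It is not the reporter's snippet and not H2O's implementation. The data, the column layout, and all variable names are hypothetical. It standardizes a numeric column together with manually one-hot encoded dummies, runs PCA through an SVD, and checks the mean of each component score.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: one numeric column and one categorical column with three levels.
numeric = rng.normal(loc=5.0, scale=2.0, size=n)
category = rng.permutation(np.arange(n) % 3)  # every level is guaranteed to occur

# Manual one-hot encoding: one 0/1 column per category level.
one_hot = np.eye(3)[category]
X = np.column_stack([numeric, one_hot])

# Standardize every column (zero mean, unit variance), dummies included.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the standardized matrix; rows of Vt are the principal directions.
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T

# Each score column is a linear combination of zero-mean columns,
# so its mean should be zero up to floating-point rounding.
score_means = scores.mean(axis=0)
print(np.allclose(score_means, 0.0))
```

Because every standardized column has zero mean, any projection of those columns also has zero mean. Nonzero component means therefore point to the categorical columns not being centered the same way before projection, although the ticket does not confirm the cause.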

This suggests a bug in how PCA handles categorical features. Reproducible code snippet below.

Where the means look like

for categoricals and

 

for manual one-hot encoding the numbers look like:

 

 

Environment

None

Status

Assignee

Wendy

Fix versions

Reporter

Lauren DiPerna

Support ticket URL

Labels

None

Release Priority

None

Affected Spark version

None

Customer Request Type

None

Task progress

None

CustomerVisible

No

Priority

Major