h2o.glm falsely converges when predicted probabilities span most of the [0, 1] range

Description

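Here is a reproduction of the issue. The data are simulated from a logistic model with intercept -2.5 and slope 2.5. R's glm recovers these values, while h2o.glm reports a completed fit with very different coefficients: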

library(h2o)
h2oHandle <- h2o.init(max_mem_size = "2g")

set.seed(123)
simdata <- data.frame(x = rlnorm(1e6, meanlog = 0, sdlog = 1))
simdata$y <- rbinom(1e6, 1, binomial()$linkinv(-2.5 + 2.5 * simdata$x))
SIMDATA <- as.h2o(h2oHandle, simdata, key = "SIMDATA")

system.time(glm.r <- glm(y ~ x, data = simdata, family = binomial()))
system.time(glm.h2o <- h2o.glm(x = "x", y = "y", data = SIMDATA, family = "binomial", lambda = 0, alpha = 0))

coef(glm.r)
coef(glm.h2o@model)

summary(fitted(glm.r, type = "response"))
summary(h2o.predict(glm.h2o, SIMDATA)[,3L])

sessionInfo()

OUTPUT:

> library(h2o)
Loading required package: RCurl
Loading required package: bitops
Loading required package: rjson
Loading required package: statmod
Loading required package: tools

----------------------------------------------------------------------

Your next step is to start H2O and get a connection object (named
'localH2O', for example):
> localH2O = h2o.init()

For H2O package documentation, first call init() and then ask for help:
> localH2O = h2o.init()
> ??h2o

To stop H2O you must explicitly call shutdown (either from R, as shown
here, or from the Web UI):
> h2o.shutdown(localH2O)

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.0xdata.com

----------------------------------------------------------------------

Attaching package: 'h2o'

The following objects are masked from 'package:base':

ifelse, max, min, sum

> h2oHandle <- h2o.init(max_mem_size = "2g")

H2O is not running yet, starting it now...

Note: In case of errors look at the following log files:
C:\Users\patrick\AppData\Local\Temp\RtmpgNTXgP/h2o_patrick_started_from_r.out
C:\Users\patrick\AppData\Local\Temp\RtmpgNTXgP/h2o_patrick_started_from_r.err

java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

Successfully connected to http://127.0.0.1:54321
R is connected to H2O cluster:
H2O cluster uptime: 5 seconds 851 milliseconds
H2O cluster version: 2.6.1.5
H2O cluster name: H2O_started_from_R
H2O cluster total nodes: 1
H2O cluster total memory: 1.78 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE

>
> set.seed(123)
> simdata <- data.frame(x = rlnorm(1e6, meanlog = 0, sdlog = 1))
> simdata$y <- rbinom(1e6, 1, binomial()$linkinv(-2.5 + 2.5 * simdata$x))
> SIMDATA <- as.h2o(h2oHandle, simdata, key = "SIMDATA")

===================================================

100%
>

> system.time(glm.r <- glm(y ~ x, data = simdata, family = binomial()))
user system elapsed
43.09 3.75 51.60
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> system.time(glm.h2o <- h2o.glm(x = "x", y = "y", data = SIMDATA, family = "binomial", lambda = 0, alpha = 0))

===================================================

100%

user system elapsed
2.55 0.15 51.11
>
> coef(glm.r)
(Intercept) x
-2.500699 2.503921
> coef(glm.h2o@model)
x Intercept
0.2893151 -0.2521745
>
> summary(fitted(glm.r, type = "response"))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.07727 0.22710 0.50060 0.55000 0.91750 1.00000
> summary(h2o.predict(glm.h2o, SIMDATA)[,3L])
1
Min. :0.4379
1st Qu.:0.4738
Median :0.5093
Mean :0.5454
3rd Qu.:0.5781
Max. :1.0000
>
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] tools stats graphics grDevices utils datasets methods base

other attached packages:
[1] h2o_2.6.1.5 statmod_1.4.20 rjson_0.2.14 RCurl_1.95-4.3 bitops_1.0-6

Assignee

Tomas Nykodym

Reporter

Patrick Aboyoun

Priority

Major