SW-430

pysparkling: adding a column to a data frame does not work when the original frame is parsed in Spark

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.9, 1.6.11, 2.1.8
    • Component/s: None
    • Labels:
    • CustomerVisible: No
    • Sprint:

      Description

      #90702
      From Kuba: this looks like the issue. The frame is not re-evaluated after the column is added.
      Code to reproduce:

      
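      # import csv file (assumes sqlContext and sc already exist, e.g. in the pyspark shell)
      spark_df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('BostonHousing.csv')
      
      # create h2o context
      from pysparkling import *
      hc = H2OContext.getOrCreate(sc)
      
      # convert the Spark DataFrame to an H2OFrame
      boston = hc.as_h2o_frame(spark_df)
      import h2o
      from h2o.estimators.glm import H2OGeneralizedLinearEstimator
      
      # train a GLM on all columns except the last, with "medv" as the response
      predictors = boston.columns[:-1]
      response = "medv"
      boston_glm2 = H2OGeneralizedLinearEstimator(nfolds=2, Lambda=.01)
      boston_glm2.train(x=predictors, y=response, training_frame=boston)
      
      pred = boston_glm2.predict(boston)
      
      # add the predictions as a new column, then convert back to a Spark DataFrame
      boston["predict"] = pred['predict']
      sp_boston = hc.as_spark_frame(boston)
      
      # inspect the converted frame (expected to include the added "predict" column)
      sp_boston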
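
A small check like the one below could help confirm the symptom, assuming the failure shows up as the added column not reaching the Spark side (the report does not state the exact symptom). The helper name `missing_columns` and the stand-in column lists are hypothetical. In the repro, the inputs would be `boston.columns` and `sp_boston.columns`.

```python
def missing_columns(expected, actual):
    """Return names present in `expected` but absent from `actual`, in order."""
    actual_set = set(actual)
    return [name for name in expected if name not in actual_set]


# Stand-in column lists (hypothetical); in the repro these would be
# boston.columns (H2O side) and sp_boston.columns (Spark side).
h2o_columns = ["crim", "medv", "predict"]
spark_columns = ["crim", "medv"]
print(missing_columns(h2o_columns, spark_columns))
```

An empty result would mean that every column of the H2O frame, including the added one, is present after conversion.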
      

            People

            • Assignee:
              Michal Malohlava
            • Reporter:
              Nidhi Mehta
            • Votes:
              0
            • Watchers:
              2