Parallelisation of data splitting

Description

Hi,

I am trying to partition data on MODEL_ID and assign each subset a unique name. I am using a for loop, but the runtime grows with the data, since the loop subsets one MODEL_ID at a time. Is there a way to parallelise the code below? I tried mclapply, but it errors out because all workers operate on the same dataframe (e.g. mrd_log / baselineDat), so one run locks it and the others fail; lapply takes the same time as the for loop. Is there a better way to solve this issue?

Here is the code :

for (i in seq_along(modIDS)) {
  # split data
  h2o.assign(mrd_log[mrd_log$MODEL_ID == modIDS[i], ],
             key = paste0("allSplitDat", run$run_id, i))

  h2o.assign(mrd_log[mrd_log$MODEL_ID == modIDS[i] & mrd_log$TRAIN_IND == 1, ],
             key = paste0("trainDat", run$run_id, i))

  # split baseline data
  h2o.assign(baselineDat[baselineDat$MODEL_ID == modIDS[i], ],
             key = paste0("baselineSplitDat", run$run_id, i))
}
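One direction worth noting: the loop above scans the full frame once per MODEL_ID, so the cost grows with the number of IDs even before any parallelism. A single-pass partition avoids that. The sketch below illustrates the idea on a plain in-memory data.frame with made-up columns (MODEL_ID, TRAIN_IND, x), using base R's split(); it is not a drop-in replacement for the server-side h2o.assign calls, just a sketch of the single-pass approach.

```r
# Hypothetical stand-in for mrd_log, for illustration only
mrd_log <- data.frame(
  MODEL_ID  = c(1, 1, 2, 2, 3),
  TRAIN_IND = c(1, 0, 1, 1, 0),
  x         = 1:5
)

# Partition the whole frame in one pass: a named list, one element per MODEL_ID
allSplitDat <- split(mrd_log, mrd_log$MODEL_ID)

# Same idea for the training subset: filter once, then split once
train    <- mrd_log[mrd_log$TRAIN_IND == 1, ]
trainDat <- split(train, train$MODEL_ID)
```

Each list element could then be registered under its own key (e.g. via as.h2o / h2o.assign), replacing the per-ID filtering of the original loop; whether that is faster than the loop for H2O frames would need to be measured, since the heavy work happens server-side.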

Assignee

New H2O Bugs

Fix versions

None

Reporter

priyal.doshi

Support ticket URL

None

Labels

None

Affected Spark version

None

Customer Request Type

None

Task progress

None

CustomerVisible

Yes

Components

Priority

Major