Improve handling of unseen categories and junk data in numerical columns

Description

This request is about improving the handling of unseen categories and junk data in numerical columns. Forcing such values to ‘missing’, as you currently do, is one option, but it is not optimal in production: ‘missing’ may never have appeared in the training data, so the prediction can be a shot in the dark in certain cases. What we currently do with our NN solutions is impute unseen categories to the ‘mode’ and numerical junk to the ‘median’. This generally produces a safer result. For some very specific variables we might even override this with a value other than the mode or median.
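To make the mode/median approach concrete, here is a minimal sketch of how such fallback values could be learned from training data. It is not the product's API: `fit_fallbacks`, the column names, and the sample rows are all hypothetical, and the training values are assumed to be clean already.

```python
from statistics import median, mode

def fit_fallbacks(rows, numeric_cols, categorical_cols):
    """Learn per-column fallback values from training rows (a list of dicts).

    Hypothetical helper: numeric columns fall back to the median,
    categorical columns fall back to the mode (most frequent level).
    """
    fallbacks = {}
    for col in numeric_cols:
        fallbacks[col] = median(float(r[col]) for r in rows)
    for col in categorical_cols:
        fallbacks[col] = mode(r[col] for r in rows)
    return fallbacks

# Made-up training rows, for illustration only.
train = [
    {"age": "31", "color": "red"},
    {"age": "45", "color": "blue"},
    {"age": "38", "color": "red"},
]
fallbacks = fit_fallbacks(train, ["age"], ["color"])
```

At scoring time, a junk value in `age` would be replaced by `fallbacks["age"]` and an unseen level in `color` by `fallbacks["color"]`.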

So, in a nutshell: is it possible to add a capability that takes a configuration file (txt, xml, etc.) listing the default value to use for each variable when an unseen category or a junk numeric value is observed?

Something like this:

Variable name | Default value if unseen category or junk numerical
VarMNumeric | NA
VarXNumeric | 0
VarNNumeric | -1
VarYCat | ‘#’
VarZCat | NA
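As an illustration of how a file like this might be consumed at scoring time, here is a rough sketch. Everything in it is an assumption made for the example: the pipe-delimited layout, the function names (`parse_defaults`, `clean_numeric`, `clean_categorical`), and the convention that `NA` means "keep treating the value as missing". It is not an existing product feature.

```python
import math

def parse_defaults(text):
    """Parse 'name | default' lines into a dict; 'NA' maps to None (missing)."""
    defaults = {}
    for line in text.strip().splitlines():
        name, _, value = line.partition("|")
        value = value.strip()
        # Strip straight or curly quotes around categorical defaults.
        defaults[name.strip()] = None if value == "NA" else value.strip("'‘’")
    return defaults

def clean_numeric(raw, default):
    """Return raw as a float, or the configured default if raw is junk."""
    try:
        x = float(raw)
    except (TypeError, ValueError):
        x = math.nan
    if math.isfinite(x):
        return x
    # Unparseable, NaN, or infinite: fall back to the default (None = missing).
    return None if default is None else float(default)

def clean_categorical(raw, seen_levels, default):
    """Return raw if it was seen in training, else the configured default."""
    return raw if raw in seen_levels else default

# A few rows in the layout proposed above (header row omitted).
CONFIG = """
VarMNumeric | NA
VarXNumeric | 0
VarYCat | ‘#’
VarZCat | NA
"""

defaults = parse_defaults(CONFIG)
```

With this configuration, `clean_numeric("abc", defaults["VarXNumeric"])` falls back to `0.0`, while the same junk value in `VarMNumeric` stays missing (`None`). An unseen level in `VarYCat` becomes `"#"`.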

Assignee

Arno Candel

Fix versions

None

Reporter

Avkash Chauhan

Support ticket URL

None

Labels

None

Affected Spark version

None

Customer Request Type

Support Incident

Task progress

None

CustomerVisible

No

Support Assessment

Data Science Issue

AffectedCustomers

Due date

2017/02/22

Priority

Critical