Guys, this would improve the situation with unseen categories or junk data in numerical columns. Forcing such values to ‘missing’, as you currently do, is one option, but it is still not optimal in production: we might never have seen ‘missing’ in the training data, so in some cases the prediction is a shot in the dark. What we currently do in our NN solutions is impute unseen categories to the ‘mode’ and numerical junk to the ‘median’. This generally produces a safer result. For some very specific variables we might even change this to something other than the mode or median.
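To make the mode/median idea concrete, here is a minimal sketch in plain Python. The helper names (fit_defaults, to_number, clean_row) and the columns "color" and "age" are made up for illustration, and it assumes the training rows themselves are clean. It is not how any particular library does it.

```python
import math
from statistics import median, mode


def fit_defaults(train_rows, categorical, numerical):
    """Learn fallback values from training data (hypothetical helper)."""
    defaults, seen = {}, {}
    for col in categorical:
        values = [row[col] for row in train_rows]
        defaults[col] = mode(values)   # most frequent category
        seen[col] = set(values)        # categories known at training time
    for col in numerical:
        # Assumes the training values parse cleanly as numbers.
        defaults[col] = median([float(row[col]) for row in train_rows])
    return defaults, seen


def to_number(value):
    """Return a finite float, or None if the value is junk."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def clean_row(row, defaults, seen):
    """Replace unseen categories with the mode and numeric junk with the median."""
    cleaned = dict(row)
    for col, fallback in defaults.items():
        if col in seen:  # categorical column
            if cleaned.get(col) not in seen[col]:
                cleaned[col] = fallback
        else:            # numerical column
            number = to_number(cleaned.get(col))
            cleaned[col] = fallback if number is None else number
    return cleaned


train = [
    {"color": "red", "age": "30"},
    {"color": "red", "age": "40"},
    {"color": "blue", "age": "50"},
]
defaults, seen = fit_defaults(train, ["color"], ["age"])
print(clean_row({"color": "green", "age": "abc"}, defaults, seen))
# prints {'color': 'red', 'age': 40.0}
```

The point of learning the fallbacks at training time is that the scoring code never has to handle a value the model has not seen.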
So, in a nutshell: is it possible to include a capability that takes a configuration file (txt, xml, etc.) listing the default value to use for each variable when an unseen category or junk numeric value is observed?
Something like this:
Variable name | Default value if unseen category or junk numeric
VarYCat | ‘#’
VarZCat | NA
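
Just to show what I have in mind on the reading side, here is a rough sketch that parses a pipe-delimited file like the one above and looks up the default for a variable. The function names (parse_defaults, apply_default), the shortened header text, and the "seen categories" dictionary are all assumptions on my part, not an existing API.

```python
def parse_defaults(text):
    """Parse 'name | default' lines into a dict, skipping the header row."""
    lines = [line for line in text.splitlines() if line.strip()]
    defaults = {}
    for line in lines[1:]:  # first non-empty line is the header
        name, _, value = line.partition("|")
        # Drop surrounding straight or curly quotes around the default.
        defaults[name.strip()] = value.strip().strip("'\"\u2018\u2019")
    return defaults


def apply_default(name, value, seen, defaults):
    """Keep a value seen in training, otherwise use the configured default."""
    return value if value in seen.get(name, ()) else defaults[name]


CONFIG = """\
Variable name | Default value
VarYCat | '#'
VarZCat | NA
"""

config_defaults = parse_defaults(CONFIG)
# Hypothetical set of categories observed in training for VarYCat.
seen_categories = {"VarYCat": {"a", "b"}}

print(apply_default("VarYCat", "a", seen_categories, config_defaults))    # prints a
print(apply_default("VarYCat", "zzz", seen_categories, config_defaults))  # prints #
```

A variable that is not listed in the file could simply fall back to whatever the current ‘missing’ behaviour is, so the file would only be needed for the variables we care about.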