Feature Request: Time Series Cross-Validation

Description

Would it be possible to implement Time Series K-Fold Cross-Validation?

For example:

Take a Time Series Training/Validation Interval that runs from 2017-01-01 to 2019-12-31:

  • The interval is split into regularly spaced steps of 1 month.

  • In the first iteration, a regular K-Fold Cross-Validation is trained on data between 2017-01-01 and 2017-02-01, and the error within that time frame is minimised (e.g. RMSE). To evaluate the error ourselves, however, we use out-of-sample validation data, which goes from 2017-02-01 to 2017-03-01.

The process is repeated iteratively:

  • A regular Cross-Validation is trained on data between 2017-01-01 and 2017-03-01, and the error within that time frame is minimised (e.g. RMSE). To evaluate the error ourselves, however, we use out-of-sample validation data, which goes from 2017-03-01 to 2017-04-01.

And so on, until the end of the interval is reached.
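The expanding-window scheme described above can be sketched in plain Python. This is only an illustration of the fold boundaries, not an existing H2O API: the function names `add_months` and `expanding_monthly_folds` are hypothetical, the end of the interval is treated as exclusive (2020-01-01 stands for "up to and including 2019-12-31"), and all dates are assumed to fall on the first day of a month.

```python
from datetime import date


def add_months(d, months):
    """Return the month start that lies `months` months after `d`.

    Assumes `d` is itself the first day of a month.
    """
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def expanding_monthly_folds(start, end, step_months=1):
    """Yield (train_start, train_end, valid_end) tuples.

    Training covers [train_start, train_end) and validation covers
    [train_end, valid_end). The training window grows by one step per
    fold, while the validation window always spans exactly one step.
    `end` is exclusive.
    """
    train_end = add_months(start, step_months)
    while add_months(train_end, step_months) <= end:
        yield start, train_end, add_months(train_end, step_months)
        train_end = add_months(train_end, step_months)


folds = list(expanding_monthly_folds(date(2017, 1, 1), date(2020, 1, 1)))
for train_start, train_end, valid_end in folds[:2]:
    print(f"train [{train_start}, {train_end})  validate [{train_end}, {valid_end})")
```

The first two folds match the two iterations described in this request. Inside each training window, the usual K-Fold Cross-Validation would still be run; the out-of-sample window is only used for the final error estimate of that iteration.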

Some theoretical context:
https://robjhyndman.com/hyndsight/tscv/
https://www.sciencedirect.com/science/article/abs/pii/S0167947317302384

Some scikit-learn context:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
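
For comparison, here is a minimal sketch of how scikit-learn's `TimeSeriesSplit` produces a similar expanding training window. The data is a synthetic placeholder (36 rows standing in for 36 monthly observations, ordered oldest to newest), and `n_splits=5` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic placeholder: 36 observations, ordered from oldest to newest.
X = np.arange(36).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    # Every training index precedes every test index, so no future data leaks in.
    print(f"train rows 0..{train_idx[-1]}  test rows {test_idx[0]}..{test_idx[-1]}")
```

Note that `TimeSeriesSplit` works on row positions rather than calendar dates, so it only lines up with monthly boundaries when the rows are evenly spaced in time.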

How would it be used?

  • If you are generating a point estimate, you could predict t+1, use the mean target prediction at t+1 to update the features, then predict the next step in the same way, and so on, for a reasonable time window. This would require calculated variables that are “updated” with the mean target predictions, treating them as already known data.

  • Another use case (which requires Time Series Cross-Validation + Group-Wise Cross-Validation):

    • We would have 2 time axes:

      • Our Date Axis: 2019-01-01, 2019-01-02, 2019-01-03, …, 2019-12-31.

      • Our Snapshot Date Axis (linked to a given Date Axis, that is, grouped by Date Axis): how, on a given snapshot date, we see the sales recorded so far for our target Date. We could call this variable “Number of Days Out” (NDO = Date - Snapshot), and we aim to predict at NDO = 0.

    • Our Target would be Sales at NDO = 0 for each Date.

    • This is a common scenario, as many enterprise database tables are versioned tables.
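
The combination of Time Series and Group-Wise Cross-Validation in the second use case could look roughly like the sketch below. It is an assumption about how such a split might behave, not an existing H2O feature: `grouped_time_splits` is a hypothetical helper, and the rows are a made-up versioned table in which each Date is observed at NDO = 2, 1 and 0.

```python
from datetime import date, timedelta


def grouped_time_splits(rows, n_splits):
    """Split row indices by Date, keeping all snapshots of a Date together.

    rows: list of (target_date, snapshot_date) tuples.
    Yields (train_indices, valid_indices); every validation Date is later
    than every training Date, and the training set expands fold by fold.
    """
    unique_dates = sorted({target for target, _ in rows})
    fold_size = len(unique_dates) // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough distinct dates for the requested n_splits")
    for k in range(1, n_splits + 1):
        train_dates = set(unique_dates[: k * fold_size])
        valid_dates = set(unique_dates[k * fold_size : (k + 1) * fold_size])
        train = [i for i, (t, _) in enumerate(rows) if t in train_dates]
        valid = [i for i, (t, _) in enumerate(rows) if t in valid_dates]
        yield train, valid


# Hypothetical versioned table: 6 Dates, each seen at NDO = 2, 1, 0,
# where Snapshot = Date - NDO.
first_day = date(2019, 1, 1)
rows = [
    (first_day + timedelta(days=d), first_day + timedelta(days=d - ndo))
    for d in range(6)
    for ndo in (2, 1, 0)
]

grouped_folds = list(grouped_time_splits(rows, n_splits=2))
for train, valid in grouped_folds:
    print(f"train rows {train[0]}..{train[-1]}  validate rows {valid[0]}..{valid[-1]}")
```

This sketch splits on the Date Axis only. Whether an additional gap is needed so that training targets are not observed after the snapshot dates of validation rows is left open here.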

In addition, some further regression forecasts could be added to our H2O Tree Model, in order to benefit from its regularisation and to use it as an ensemble for a large number of forecasts:

Currently I apply the strategies above manually in my Time Series projects, but it is always nice to see them automated, so that others can benefit.

Assignee

Unassigned

Reporter

Juan Telleria

CustomerVisible

Yes

Priority

Minor