ML Feature Engineering: from Pandas to SQL (BigQuery) — Lag Features
This article is a section of the article ML Feature Engineering: from Pandas to SQL (BigQuery).
Lag Features
This kind of featuring is used to encode data dynamics through time. By giving such features to a non-time series oriented model (like a Random Forest for instance), it gives the model some information about data dynamics through time. If data is aligned with a constant time step and ordered, we simply write the following code with Pandas to obtain previous day sales:
With BigQuery SQL, it’s straight forward (with the same assumptions):
Time Lag Features
Same problem as rolling windows, if some time steps are missing. Using lag technique without having that in mind, leads to incoherent lags (see figure below).
To handle that, one can use time resampling or time lag functions: the latter is shown below.
Concerning BigQuery SQL, there is no such “time lag” function. It becomes tricky as you need to pass time series into arrays, and take the first cell as the shifting value:
Remark: doing so, it is completely possible to build (time) lagged rolling windows by coupling the two techniques.
For more feature engineering techniques from Pandas to SQL, check others sections from ML Feature Engineering: from Pandas to SQL (BigQuery).