ML Feature Engineering: from Pandas to SQL (BigQuery) — Lag Features

Vincent Levorato
2 min readMay 18, 2022

This article is a section of the article ML Feature Engineering: from Pandas to SQL (BigQuery).

Lag Features

This kind of featuring is used to encode data dynamics through time. By giving such features to a non-time series oriented model (like a Random Forest for instance), it gives the model some information about data dynamics through time. If data is aligned with a constant time step and ordered, we simply write the following code with Pandas to obtain previous day sales:

Lag feature using Pandas

With BigQuery SQL, it’s straight forward (with the same assumptions):

Lag feature using BigQuery

Time Lag Features

Same problem as rolling windows, if some time steps are missing. Using lag technique without having that in mind, leads to incoherent lags (see figure below).

Time lag features inconsistency example

To handle that, one can use time resampling or time lag functions: the latter is shown below.

Time Lag feature using Pandas

Concerning BigQuery SQL, there is no such “time lag” function. It becomes tricky as you need to pass time series into arrays, and take the first cell as the shifting value:

Time Lag feature using BigQuery

Remark: doing so, it is completely possible to build (time) lagged rolling windows by coupling the two techniques.

For more feature engineering techniques from Pandas to SQL, check others sections from ML Feature Engineering: from Pandas to SQL (BigQuery).

--

--

Vincent Levorato

Lead Data Scientist @ Prisma Media. Freelance consultant in data science and AI architectures. Computer science PhD. https://www.linkedin.com/in/vlevorato/