ML Feature Engineering: from Pandas to SQL (BigQuery) — Time Resampling

Vincent Levorato
2 min readMay 18, 2022

This article is a section of the article ML Feature Engineering: from Pandas to SQL (BigQuery).

Time Resampling

When working with time series, resampling time on a regular step is common: it allows data to be aligned for shifting rows usage, doing consistent aggregation operations, graphics, and so on. With Pandas, it is almost straight forward with the resample method:

Time Resampling with Pandas
Time Resampling with Pandas

The code above realigns data each 30 minutes, using a mean aggregation. For instance, for the apple product, between 8:00 and 8:30, there were two sales rows, with respectively values 10 and 5. As these values stand in the same time interval, the average result is 7.5 .

Using BigQuery SQL to achieve that is not so easy (not specific to BQ):

We need first to create a time step referential (ts_steps_ref) which contains each time step (in that case each 30 mins) and each category product.

The final query joins the time step referential with the initial data on the rounded timestamp using integer division (DIVfunction), and applies the AVG function to obtain the sales mean value per time step/category product.

Time Resampling using BigQuery

Note: as I used CROSS JOIN to build the time step referential, I obtain all the possible combinations between time steps and category product: it results in an extra line compared to Pandas.

For more feature engineering techniques from Pandas to SQL, check others sections from ML Feature Engineering: from Pandas to SQL (BigQuery).

--

--

Vincent Levorato

Lead Data Scientist @ Prisma Media. Freelance consultant in data science and AI architectures. Computer science PhD. https://www.linkedin.com/in/vlevorato/