ML Feature Engineering: from Pandas to SQL (BigQuery) — Time Resampling
This article is a section of the article ML Feature Engineering: from Pandas to SQL (BigQuery).
Time Resampling
When working with time series, resampling time on a regular step is common: it allows data to be aligned for shifting rows usage, doing consistent aggregation operations, graphics, and so on. With Pandas, it is almost straight forward with the resample
method:
The code above realigns data each 30 minutes, using a mean aggregation. For instance, for the apple product, between 8:00 and 8:30, there were two sales rows, with respectively values 10 and 5. As these values stand in the same time interval, the average result is 7.5 .
Using BigQuery SQL to achieve that is not so easy (not specific to BQ):
We need first to create a time step referential (ts_steps_ref
) which contains each time step (in that case each 30 mins) and each category product.
The final query joins the time step referential with the initial data on the rounded timestamp using integer division (DIV
function), and applies the AVG
function to obtain the sales mean value per time step/category product.
Note: as I used CROSS JOIN
to build the time step referential, I obtain all the possible combinations between time steps and category product: it results in an extra line compared to Pandas.
For more feature engineering techniques from Pandas to SQL, check others sections from ML Feature Engineering: from Pandas to SQL (BigQuery).