ML Feature Engineering: from Pandas to SQL (BigQuery) — Rolling Windows

Vincent Levorato
3 min readMay 18, 2022

This article is a section of the article ML Feature Engineering: from Pandas to SQL (BigQuery).

Rolling windows

Rolling Windows without time

Feature engineering involves statistics in lot of cases, and rolling aggregations are one of them. A common operation is to build rolling windows over time series, and Pandas has dedicated methods to do so, for instance on 3 rows (in this case, days):

Rolling average using Pandas

As with standard SQL, BigQuery has windowing functions to achieve this calculation:

Rolling average using BigQuery

This kind of calculation can only be done if the time frequency is constant for each item, meaning having a value for each day in that case. If not, two solutions are usually applied: time resampling or time rolling window. The first one is covered in another section, so let’s check how to achieve time rolling window with Pandas and BigQuery SQL.

Rolling Windows with time

After putting some missing values in the data (no information for Orange product on March 4 and Lemon product on March 8), we can achieve a time rolling window of 3 days like this:

Time rolling average using Pandas

The trick appears, for instance, at the line number 9: in the column sales_rolling_mean_3_days, we have the right calculation with 3 Oranges sales on March 5, none on March 4, and 3 again the 3rd of March, which gives an average sales of 3. If classic rolling window had been used, we would have a false result, making the mean with values of March 5, the 3rd of March and also the 2nd of March which is not what we wanted (expected window is 3 days, not 4).

This operation considering time can be done with BigQuery SQL this way:

Time rolling average using BigQuery

For more feature engineering techniques from Pandas to SQL, check others sections from ML Feature Engineering: from Pandas to SQL (BigQuery).

Advanced rolling windows

With Pandas, if we don’t consider possible performance issues due to high amount of data in memory, applying any aggregation function to the rolling function is easy:

df_roll = df.groupby('category_product').rolling(window=3, min_periods=1).agg(my_agg_function)

It can be usual functions (can be called by their name in string, like 'mean' or 'std' ), functions from a specific package or user-defined functions:

Rolling Median Absolute Deviation using Pandas

With BigQuery SQL, only standard aggregation functions are available (AVG, STD, MIN, MAX, COUNT). If someone needs to make a rolling median, it is not straightforward as shown above with Pandas. To calculate this feature, the code below will do the job (see Time Rolling Window section for time rolling version):

Rolling Median Absolute Deviation using BigQuery

The idea is to use the arrays corresponding to the rolling windows as input to any calculation you should want to make (quantiles extraction, skewness, kurtosis, etc. ).

For more feature engineering techniques from Pandas to SQL, check others sections from ML Feature Engineering: from Pandas to SQL (BigQuery).

--

--

Vincent Levorato

Lead Data Scientist @ Prisma Media. Freelance consultant in data science and AI architectures. Computer science PhD. https://www.linkedin.com/in/vlevorato/