6 Seasonal adjustment of high-frequency data

In the previous chapter, we went through all the steps to perform the seasonal adjustment of a time series, using an indicator from Brazilian retail sales to illustrate the technique. However, one detail was left between the lines: the method employed, namely the U.S Census Bureau’s X-13-ARIMA, is well suited for monthly and quarterly data, but was not particularly designed for high-frequency data.

Until recently, this wasn’t much of an issue, but the growing use of alternative data has made it a real limitation. Fortunately, there are tools available to work with this type of data, each with its own pros and cons. For the purposes of this chapter, I will present two methods that I consider the most useful for day-to-day applications, using the credit_card_br dataset, which contains alternative indicators for economic activity in Brazil based on credit card transactions.

Data

The credit_card_br dataset is available in the R4ER2data package under the name credit_card_br.

Let’s start by getting a quick glimpse of our dataset.

library(R4ER2data)
library(tidyverse)

credit_card_br <- R4ER2data::credit_card_br

credit_card_br |> glimpse()

Rows: 2,142
Columns: 12
$ date                    <date> 2018-01-01, 2018-01-02, 2018-01-03, 2018-01-0…
$ beauty                  <dbl> 1.809595, 22.638353, 40.355584, 60.385363, 87.…
$ accommodation           <dbl> 141.02167, 203.34703, 156.29133, 137.05665, 14…
$ recreational_activities <dbl> 23.52548, 147.95883, 97.75418, 103.66431, 126.…
$ clothing                <dbl> 3.877227, 67.051026, 79.968250, 88.295544, 100…
$ vehicles                <dbl> 2.576551, 91.293356, 107.536082, 114.973678, 1…
$ fuels                   <dbl> 59.52721, 98.93869, 95.22126, 93.98672, 108.08…
$ food                    <dbl> 57.64591, 77.87337, 80.45811, 86.29766, 111.79…
$ furniture               <dbl> 9.338676, 87.396817, 92.761066, 98.161058, 209…
$ office                  <dbl> 9.547598, 92.614475, 92.872892, 105.437308, 10…
$ goods                   <dbl> 16.08848, 81.91656, 89.40665, 92.30509, 130.11…
$ services                <dbl> 55.14835, 94.77876, 86.54082, 91.21855, 114.92…

The dataset contains daily series of credit card transactions spanning from 2018 to 2023, divided into two major categories — goods and services — and some of their subcategories. Now, let’s plot a small portion of the sample to get an idea of what the data looks like in terms of seasonality.

credit_card_br |> 
  filter(between(date, as.Date('2023-02-01'), as.Date('2023-04-01'))) |> 
  ggplot() +
  geom_line(aes(x = date, y = goods), lwd = 1) +
  labs(
    title    = 'Credit card transactions for goods',
    subtitle = 'Index (2018 = 100). Current prices (BRL).'
  )

Unlike monthly or quarterly time series, the main challenge with daily time series is that they can exhibit multiple overlapping seasonal patterns — annual, monthly, and weekly. This makes it particularly difficult to visualize these patterns in a single plot. For instance, in the chart above, a weekly seasonal pattern is noticeable. However, if we increase the sample size, it might become harder to identify other patterns.

credit_card_br |> 
  filter(between(date, as.Date('2022-01-01'), as.Date('2022-12-01'))) |> 
  ggplot() +
  geom_line(aes(x = date, y = goods), lwd = 1) +
  labs(
    title    = 'Credit card transactions for goods',
    subtitle = 'Index (2018 = 100). Current prices (BRL).'
  )

For this reason, we seek to employ effective methods to identify and estimate the various seasonal patterns present in the series, allowing us to uncover the underlying trends in the data. This is where the following methods come into play.

6.0.1 Prophet

Prophet is a forecasting model developed by Meta, designed to handle time series of various frequencies. Among other outputs, it returns the fitted and future values for the time series components (trend and seasonality), allowing us to compute a seasonally adjusted version of the data. Importantly, this method also supports the inclusion of special events, such as outliers and custom holidays, which increases the robustness of the results.

Let’s start by preparing the input data. Prophet requires a data frame with two columns: ds, which contains the dates; and y, which contains the target time series. Taking the logarithm of y helps reduce volatility and makes the relationship between its components linear.

goods_df <- credit_card_br |> 
  select(ds = date, y = goods) |> 
  mutate(y = log(y))

Since Black Friday is an important event for retail sales, it makes sense to include it in the model. For this purpose, we need to create an additional data frame with two columns: ds (dates) and holiday (the event’s label). Any other special events can be added to this data frame.

To create the vector with the Black Friday dates for each year in the dataset, I will take advantage of the Thanksgiving dates vector provided by the tis package.

library(lubridate)

special_events <- tibble(
  ds = map_chr(
    .x = 2018:2023,
    .f = function(x) {
      tis::holidays(x)['Thanksgiving'] |> 
        ymd() %m+% 
        days(1) |> 
        as.character()
    }
  ),
  holiday = 'Black Friday'
)

The process of fitting the model is carried out in layers: the first layer defines some preliminary settings, and the model is then estimated using the fit.prophet function.

library(prophet)

m <- prophet(holidays = special_events)
m <- add_country_holidays(m, country_name = 'BR')
m <- fit.prophet(m, goods_df)

We can access all the holidays and special events included in the model through the train.holidays.names object.

m$train.holiday.names

[1] "Black Friday"               "New Year's Day"            
[3] "Tiradentes"                 "Worker's Day"              
[5] "Independence Day"           "Our Lady of the Apparition"
[7] "All Souls' Day"             "Republic Proclamation Day" 
[9] "Christmas"

Many important holidays in Brazil are not included in the built-in holidays vector and could be added to the special_events data frame. However, I will leave that aside for now and move forward. The next step is to create a data frame with future dates for prediction. Even though we are not interested in future values, this step is necessary to obtain the outcome.

future_df <- make_future_dataframe(
  m, 
  periods = 1,
  freq = 'day', 
  include_history = TRUE
)

m_forecast <- predict(m, future_df)

The m_forecast object contains all the components estimated by the model, including the aggregate seasonal effect and the individual effects of each holiday or special event. We can easily visualize them using the prophet_plot_components() function.

prophet_plot_components(m, m_forecast)

The first panel shows the estimated trend, which was severely hit by the COVID-19 episode and returned to the pre-pandemic levels by mid-2021. The next three panels provide information about the series’ seasonal pattern. More specifically, retail sales experience a sharp drop every December 25th and January 1st (panel 2), a justified occurrence since the majority of brick-and-mortar stores are closed on these days. The same reasoning above explains why lower sales are recorded on Sundays (panel 3).

While most holidays show a modest impact, the Black Friday event, as expected, has a significant positive effect on sales (panel 2). Regarding the yearly pattern (panel 4), it’s no surprise that sales exhibit an upward trend starting in the last quarter of the year, reaching a peak in December.

The package also includes a function to plot the fitted values of the model, which can be useful for identifying patterns that were not properly modeled. For example, a better fit could be achieved by including other important holidays such as Carnival and Corpus Christi, or by adding special events for extended holidays.

dyplot.prophet(m, m_forecast)

Assuming this is our ready-to-go version of the model, we could use the estimated aggregated seasonal effect (the column additive_terms) to subtract from the original data (since we applied a log transformation) and obtain seasonally adjusted data. Remember, we need to apply exp to back-transform the values into the original scale.

goods_df_sa <- m_forecast |> 
  select(ds, additive_terms) |> 
  left_join(goods_df, by = 'ds') |> 
  mutate(y_sa = y - additive_terms) |> 
  mutate(across(c(y, y_sa), ~ exp(.x))) |> 
  rename(c('date'= 'ds', 'goods' = y, 'goods_sa' = y_sa))

goods_df_sa |>   
  filter(between(date, as.Date('2023-02-01'), as.Date('2023-04-01'))) |> 
  ggplot() +
  geom_line(aes(x = date, y = goods, color = 'Original'), lwd = 1) +
  geom_line(aes(x = date, y = goods_sa, color = 'Seasonally-Adjusted'), lwd = 1) +
  labs(
    title = 'Credit card transactions for goods',
    subtitle = 'Index (2018 = 100). Current prices (BRL).',
    color = ''
  )

The plot above shows that we were able to eliminate the largest regularly spaced peaks and troughs, although some residual seasonality might still be present in the data. From this point onward, it’s a matter of addressing the remaining holidays and special events that might still affect the model estimates until we feel comfortable with the result.

Additionally, other features — such as trend changepoints — may also play a role. To explore the full range of features available in Prophet and gain a deeper understanding of how to fine-tune its default parameters, I strongly recommend consulting the package’s documentation.

6.0.2 DSA

The Daily Seasonal Adjustment (DSA) is a method proposed by Daniel Ollech for performing seasonal adjustment of high-frequency data (see Ollech (2018)). In general terms, it combines the STL method (see Hyndman and Athanasopoulos (2018)) to extract multiple seasonal components from the time series with ARIMA regression to account for calendar effects. Interested readers can refer to the paper for more details on the methodology.

As for its application, the dsa function offers many arguments to control the resulting series. Readers are also encouraged to consult the package vignette for further guidance. For our purposes, it will suffice to demonstrate how to implement the method to address common tasks.

We’ll use the same time series as before — namely, credit card expenditures on goods. Unlike Prophet, however, the dsa function expects the input time series to be an xts object. So the first step is to create the appropriate object.

goods_xts <- xts::as.xts(x = goods_df$y, order.by = as.Date(goods_df$ds))

The second step is to define the holidays — and, possibly, other relevant dummy variables. These can be provided as a multiple time series object, where each column corresponds to a specific holiday, with values set to 1 on the holiday dates and 0 otherwise. The package includes a set of predefined holidays we can choose from, and we can, of course, add custom ones.

For example, in the code below, I select some common holidays from the holidays dataset and manually add Black Friday and Brazil’s Independence Day.

holidays_names <- c(
  "CarnivalMonday", 
  "ChristmasDay", 
  "CorpusChristi", 
  "EasterPeriod", 
  "NewYearsDay", 
  "NewYearsEve",
  "LabourDay"
)

my_holidays <- dsa::holidays[, holidays_names] |> 
  as.data.frame() |> 
  rownames_to_column(var = 'date') |> 
  mutate(
    BlackFriday     = ifelse(date %in% special_events$ds, 1, 0),
    IndependenceBR  = ifelse(format(as.Date(date), '%m-%d') == '09-07', 1, 0)
  )

my_holidays_xts <- xts::xts(
  my_holidays |>
    arrange(date) |> 
    select(-date),
  order.by = as.Date(
    my_holidays |> 
      arrange(date) |> 
      pull(date)
  )
)

The final input is then constructed using the multi_xts2ts function. Note that we provide two inputs. The first — my_holidays_fit — refers to the sample period of the holiday regressors, which matches the time series of credit card data. The second — my_holidays_fc — corresponds to the one-year-ahead period, used to obtain future values.

my_holidays_fit <- dsa::multi_xts2ts(my_holidays_xts[goods_df$ds])
my_holidays_fc  <- dsa::multi_xts2ts(
  my_holidays_xts[as.Date(max(goods_df$ds):(max(goods_df$ds)+365))],
  short = TRUE
)

We are now ready to perform the seasonal adjustment of the time series. Since the data is already in logarithmic scale, I set Log = FALSE. Note that the procedure may take a little while to complete.

dsa_fit <- dsa::dsa(
  goods_xts, 
  regressor = my_holidays_fit,
  forecast_regressor = my_holidays_fc,
  Log = FALSE,
  progress_bar = FALSE
)

After the procedure is completed, we can extract the seasonally adjusted data using the get_sa() function and inspect it visually using our preferred tool. In this case, I like using the dygraphs package because it produces an interactive plot, allowing us to identify the time period of a given data point simply by hovering the mouse over it — which is particularly helpful in this context.

goods_dsa <- dsa::get_sa(dsa_fit)
goods_dsa |> 
  exp() |> 
  dygraphs::dygraph()

Although the noticeable seasonality has disappeared, the figure still reveals large spikes at several data points, which may be due to a variety of causes — such as omitted holidays, promotions, or even random noise. It is the analyst’s responsibility to investigate these movements and address them when appropriate in order to produce reliable results.

Hyndman, R. J., and G. Athanasopoulos. 2018. Forecasting: Principles and Practice, 2nd Edition. OTexts: Melbourne, Australia.

Ollech, Daniel. 2018. “Seasonal Adjustment of Daily Time Series. Discussion Paper No 41/2018.”