13  Basic workflow

In recent years, forecasting has been widely associated with sophisticated Machine (or Deep) Learning methods such as XGBoost, LSTM, Random Forest, and Neural Networks. In fact, these algorithms have proven to be very effective in improving forecast accuracy across various contexts. However, in most day-to-day applications, simpler statistical methods deliver fairly good results at a very low cost of implementation. They can also serve as a useful benchmark for more complex models.

In this chapter, we’ll cover the basic steps to generate forecasts for a time series, making extensive use of the forecast package to predict the Brazilian CPI excluding regulated prices. I’d like to draw your attention to the fact that while choosing the appropriate method is important, it’s only one part of the process. To be successful, we need to understand and carefully handle each step involved.

For those interested in a more rigorous and detailed approach to this subject, I can’t recommend the exceptional R. J. Hyndman and Athanasopoulos (2018) highly enough.

13.1 Step 1: Observe the Time Series Features

Forecasting is all about extrapolating patterns. When our forecast depends on other variables, we’re assuming that the historical relationship between the target and the explanatory variables will hold in the future. Likewise, when we employ univariate methods, the main assumption is that the time series features remain constant or evolve in a predictable way. Hence, the first step is to investigate the features of the time series.

Let’s import the monthly CPI excluding regulated prices (not seasonally adjusted). Then, we’ll plot the series using the autoplot function from the forecast package. Note that it’s based on ggplot, so we can add ggplot layers to the plot.

Data

Monthly CPI excluding regulated prices (not seasonally adjusted) from the Brazilian Central Bank is available in the R4ER2data package under the name cpi_br.

library(tidyverse)
cpi_br <- R4ER2data::cpi_br

cpi_br_ts <- ts(cpi_br$cpi, start = c(2004, 1), frequency = 12)
library(forecast)
autoplot(cpi_br_ts) + 
  labs(
    title = 'Brazilian CPI ex-regulated prices - %MoM NSA',
    y = '',
    x = ''
  ) +
  scale_y_continuous(labels = function(x) paste0(x, '%'))
Figure 13.1: Plot of Brazilian CPI ex-regulated prices - %MoM NSA

The figure above provides a general overview of the time series. We can visually infer that the average CPI was around 0.5% from 2004 to 2016, then dropped to about half that from mid-2016 to early 2020, before the Covid-19 pandemic hit the economy. This episode marked the beginning of a trend in which inflation rose to values close to 1%.

A more informative plot can be obtained using the mstl function, which decomposes the series into trend, seasonal, and remainder (noise) components. These are precisely the features we’re most interested in understanding in order to make accurate predictions.

cpi_br_ts |>
  mstl() |> 
  autoplot() +
  labs(title = 'Brazilian CPI: time series decomposition')
Figure 13.2: MSTL Decomposition of the Brazilian CPI ex-regulated prices time series

We can extract two important pieces of information from this graph. First, the trend that began in the aftermath of the Covid-19 pandemic is flattening, although there are no clear signs of a reversal. Second, there is a noticeable change in the seasonal pattern starting in 2016, with higher peaks and a different shape. Therefore, when selecting an appropriate forecasting model, we should opt for those that offer flexibility in capturing both trend and seasonality – in the case of univariate models – or choose explanatory variables that can replicate this pattern to some extent.

13.2 Step 2: Split the Sample

We need to split the sample into training and testing sets in order to evaluate our model. There are several splitting schemes, and the appropriate choice depends on the nature of the data and the sample size. For time series, the most robust scheme is block cross-validation, where many contiguous sections of the series are selected at random to train the model, and tests are performed on the adjacent observations. In practice, however, it’s very common to use the leave-one-out approach, where we train the model using observations up to \(t-1\) to predict the target at \(t\). This procedure is iterated over \(t\) to generate a representative set of (pseudo) out-of-sample forecasts that we can use to assess model accuracy.
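To make this concrete, the forecast package provides the tsCV() function, which computes exactly these one-step-ahead pseudo out-of-sample errors. The minimal sketch below uses a naive forecast purely for illustration; we’ll build a more flexible scheme with rsample in a moment.

# For each origin t, the forecast function is fitted on observations 1..t and
# the error of the forecast for t + 1 is stored (NA where no forecast is possible)
naive_cv_errors <- tsCV(cpi_br_ts, forecastfunction = naive, h = 1)
sqrt(mean(naive_cv_errors^2, na.rm = TRUE)) # RMSE of the one-step-ahead errors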

Note that the structural changes shown in Figure 13.2 have important implications for how we split the sample. More specifically, the new seasonal pattern accounts for about 40% of the sample, whereas the post-Covid trend appears in only 15%. Thus, testing our model over the full sample – or over a sample that overrepresents this recent period – could lead us to believe we have a good model for forecasting the future when, in fact, we don’t.

Let’s make a simple, yet very common, choice here: consider the sample starting in 2016 and use the leave-one-out approach beginning in Jan/2019, predicting one month ahead at each step. We’ll use the rsample package, which is part of the tidymodels ecosystem, a collection of R packages designed to support modeling in a tidy workflow. Since it follows the tidy philosophy, we first need to convert our data from ts to a data frame or tibble. This can easily be done with the timetk package.

The rolling_origin function from rsample returns a convenient object where each slice contains two components – the training set and the testing set – which can be accessed through specific functions. The initial argument defines the size of the initial sample used for training; the assess argument defines how many observations are used for testing in each step; and cumulative = TRUE ensures that we do not drop the earliest observations as we expand the sample. In other words, we’ll be using an expanding-window approach.

library(rsample)
library(timetk)
cpi_df <- tk_tbl(
  cpi_br_ts, 
  rename_index = 'date'
) |> 
  filter(date >= 'jan 2016')

cpi_split <- rolling_origin(
  cpi_df, 
  initial = which(cpi_df$date == 'dec 2018'),
  assess = 1,
  cumulative = TRUE
) 

cpi_split
# Rolling origin forecast resampling 
# A tibble: 44 × 2
   splits         id     
   <list>         <chr>  
 1 <split [36/1]> Slice01
 2 <split [37/1]> Slice02
 3 <split [38/1]> Slice03
 4 <split [39/1]> Slice04
 5 <split [40/1]> Slice05
 6 <split [41/1]> Slice06
 7 <split [42/1]> Slice07
 8 <split [43/1]> Slice08
 9 <split [44/1]> Slice09
10 <split [45/1]> Slice10
# ℹ 34 more rows

As you can see, cpi_split contains 44 slices – each slice is represented by a two-part sample in the form [train set/test set]. The first slice includes a training set ranging from Jan/2016 to Dec/2018 (36 observations) and a test set with a single observation in Jan/2019. The second slice adds the Jan/2019 observation to the training set (now 37 observations) and uses Feb/2019 as the test set. The same logic applies to the remaining slices until the end of the sample. With our split scheme complete, we’re ready to move on to the next step.
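If you want to double-check this, each slice can be inspected directly: analysis() returns the training portion and assessment() returns the test portion. A quick, purely illustrative look at the first slice:

first_slice <- cpi_split$splits[[1]]
analysis(first_slice) |> slice_tail(n = 2) # training set ends in Dec 2018
assessment(first_slice)                    # test set: Jan 2019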

13.3 Step 3: Choose the Model

Choosing an appropriate model involves several dimensions. First, we must decide whether or not to use explanatory variables – and if so, which ones to include. In addition, we should weigh the pros and cons of using a machine learning model instead of a simpler method. Considerations such as the processing time and interpretability play a significant role in many applications. Since the purpose here is to develop intuition about each step of the forecasting pipeline, I’ll sidestep these issues by assuming we’re restricted to univariate statistical models only. This is by no means a strong limitation, as in real-life scenarios we are often required to produce reasonable forecasts quickly rather than spending a great deal of time searching for the most accurate numbers.

The forecast package includes a wide range of useful univariate statistical models. ETS is generally my go-to choice when there is no obvious candidate, as it doesn’t impose strong assumptions about the data (e.g., stationarity). In addition, it has been shown to perform very well across a variety of datasets in the M Competition.

I use TBATS as a first approach when forecasting high-frequency data (daily or higher), since it can accommodate multiple seasonal patterns. ARIMA is helpful for stationary data, especially when ACF/PACF plots show a well-defined autocorrelation structure. However, it’s worth noting that in practice, statistical assumptions are often relaxed when the goal is forecasting rather than inference.

In fact, producing accurate forecasts is inevitably a trial-and-error process, and as we gain more experience with the subject, some choices tend to stand out. For instance, we saw that the Brazilian CPI exhibits a changing trend – favoring a more flexible model like ETS. On the other hand, for most of the period, this trend seems to evolve at a constant pace, which also makes ARIMA models good candidates.
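To get a feel for how this plays out on our series, we can fit both models to the post-2016 sample and inspect the specifications they select. This is just a quick sketch and not part of the evaluation exercise below.

cpi_train <- cpi_df |> tk_ts(select = 'value', start = c(2016, 1), frequency = 12)
ets(cpi_train)        # prints the selected ETS specification (error, trend, seasonal)
auto.arima(cpi_train) # prints the selected seasonal ARIMA order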

Moreover, in the presence of a seasonal pattern, models that are more successful at capturing the data’s seasonality have a clear advantage. It’s not uncommon for one model to better capture a specific feature of the data, while another performs better on a different feature – which is why combining the forecasts from different models usually improves accuracy.1 Therefore, to produce a more reliable result, we’ll generate forecasts from three sources: ETS, ARIMA, and their average.

cpi_fc <- cpi_split |> 
  mutate(
    ets_fc = map_dbl(
      .x = splits, 
      .f = function(x) { 
        ets_fc_out <- x |> 
          analysis() |> 
          tk_ts(select = 'value', start = c(2016, 1), frequency = 12) |> 
          ets() |> 
          forecast(h = 1)
        ets_fc_out <- ets_fc_out$mean
      }
    ),
    arima_fc = map_dbl(
      .x = splits, 
      .f = function(x) {
        arima_fc_out <- x |> 
          analysis() |> 
          tk_ts(select = 'value', start = c(2016, 1), frequency = 12) |> 
          auto.arima() |> 
          forecast(h = 1)
        arima_fc_out <- arima_fc_out$mean
      }
    ),
    avg_fc = (arima_fc + ets_fc) / 2,
    date = map_chr(
      .x = splits,
      .f = function(x) {
        x |> 
          assessment() |> 
          pull(date) |> 
          as.character() 
      } 
    ),
    date = date |> zoo::as.yearmon()
  ) |> 
  select(date, contains('fc')) |> 
  right_join(cpi_df, by = 'date')

cpi_fc
# A tibble: 80 × 5
   date      ets_fc arima_fc avg_fc value
   <yearmon>  <dbl>    <dbl>  <dbl> <dbl>
 1 Jan 2019   0.166   0.330  0.248   0.41
 2 Feb 2019   0.179   0.364  0.271   0.48
 3 Mar 2019   0.285   0.414  0.349   0.75
 4 Apr 2019   0.416   0.567  0.492   0.41
 5 May 2019   0.444   0.377  0.411  -0.23
 6 Jun 2019   0.368   0.200  0.284   0.08
 7 Jul 2019   0.236   0.142  0.189   0.12
 8 Aug 2019   0.187   0.132  0.160  -0.06
 9 Sep 2019   0.187   0.0449 0.116  -0.1 
10 Oct 2019   0.126  -0.0230 0.0516  0.18
# ℹ 70 more rows

Now we have a data frame with the predictions from each source plus the actual (realized) value of the CPI for the last 44 months – covering the pre-, during-, and post-Covid periods. In the next section, we’ll see how to get a representative summary of the results so that we can conclude which model is better. You may already have an idea of how to do this, but I can assure you that there are some relevant aspects worth exploring that are hardly found elsewhere. Before moving on, let’s take a look at what these forecasts look like.

cpi_fc |> 
  pivot_longer(-date, names_to = 'model', values_to = 'forecast') |> 
  ggplot(aes(x = date)) +
  geom_line(aes(y = forecast, color = model), lwd = 1) +
  theme_light() +
  scale_color_brewer(type = 'div', palette = 1) +
  theme(legend.position = 'top') +
  labs(
    title = 'Brazilian CPI Forecasts - %MoM NSA',
    y = 'CPI (%MoM NSA)',
    color = ''
  )

13.4 Step 4: Evaluate the Model

Choosing the best forecasting model can be stated in mathematical terms as a problem of minimizing an error metric – such as the mean absolute error (MAE) or the root mean square error (RMSE), which are common choices. These metrics are loss functions, i.e., they express an objective. Consequently, we should be fully aware of what our objective is in order to translate it into an appropriate metric (or function).
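For reference, if \(e_t = y_t - \hat{y}_t\) denotes the forecast error at time \(t\) and \(T\) is the number of test observations, the two metrics are defined as:

\[
\text{MAE} = \frac{1}{T}\sum_{t=1}^{T} |e_t|, \qquad \text{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T} e_t^2}
\]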

For example, MAE and RMSE are symmetric functions and use a simple average to summarize forecasting errors. Using either of them to evaluate a model’s accuracy is equivalent to saying: “I don’t care about the sign of the error – 2 units up or down impact my result equally; also, it doesn’t matter when the largest errors occurred – over the last 3 observations or in the early part of the evaluation period”.

Surely, these conditions don’t apply to all businesses. Someone interested in forecasting the demand for electricity in a large city might prefer to be surprised on the downside rather than the upside. Also, a model with higher accuracy in the last 12 months might be better at capturing the current electricity demand pattern than a model with great performance during the initial periods of the testing sample. In short, many situations require us to define what conditions the model must meet – and this involves designing a specific function. This function should summarize the forecast errors in a way that represents our objective.

To demonstrate this idea, I will propose two alternative accuracy metrics that are slight modifications of the well-known MAE. The first (accuracy_1) assigns double the weight to upside errors (predictions below actual values), whereas the second (accuracy_2) assigns (linearly) decreasing weights to errors occurring further in the past. You should be aware that the results from the two metrics are not directly comparable; the ordering of models is the relevant information here.

accuracy_1 <- function(e) {
  # Positive errors (actual above the prediction) receive double weight
  .abs_weighted_errors <- ifelse(e > 0, 2 * e, abs(e))
  mean(.abs_weighted_errors)
}

accuracy_2 <- function(e) {
  # Weights increase linearly with time, so errors further in the past count less
  .abs_errors <- abs(e)
  .weights    <- seq(from = 1, to = length(.abs_errors), by = 1)
  .weights    <- .weights / sum(.weights)
  weighted.mean(.abs_errors, .weights)
}

Below I plot the accuracy_1 function along with the original MAE function as a more effective way to give you a sense of what’s happening behind the scenes. Basically, for negative errors (realized value below the prediction), the weights are the same as in the original MAE, while for positive errors (realized value above the prediction) the loss is twice as high.

library(ggtext)
acc_demo <- tibble(
  x = seq(from = -2, to = 2, by = 0.01)
) |> 
  mutate(
    t      = 1:n(),
    Loss_1 = ifelse(x > 0, 2 * x, abs(x)),
    mae    = abs(x)
  )

acc_demo |> 
  ggplot(aes(x = x)) + 
  geom_line(aes(y = Loss_1), color = "darkblue", lwd = 1) +
  geom_line(aes(y = mae), color = "red", lwd = 1) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_light() +
  theme(
    plot.title = element_markdown(lineheight = 1.1),
    axis.title = element_text(size = 13),
    legend.position = "none"
  ) +
  labs(
    title = "<span style='color:#002266;'><b>Custom Loss Function</b></span> vs <span style='color:#ff1a1a;'><b>MAE</b></span>",
    x = "Error", 
    y = "Loss"
  )

Now we’re ready to apply our two custom functions plus the MAE to the errors we computed from the three models in order to decide which one is the most accurate. Again, these metrics aren’t comparable to each other. Instead, we’re interested in the ordering within the same metric.

In addition, there’s no such thing as the best metric. As we saw earlier in this section, the appropriate metric is the one that reflects our objective as closely as possible. Moreover, we find several desirable characteristics in conventional metrics such as MAE or RMSE, and I don’t mean to rule them out (see Rob J. Hyndman and Koehler (2006) for more on this). The main message here is that we must be fully aware of what our objective is and how to translate it into an appropriate function. In this regard, the knowledge of functional forms is essential.

cpi_errors <- cpi_fc |> 
  filter(date >= 'Jan 2019') |> 
  mutate(
    across(
      contains('fc'), 
      ~ value - .x, 
      .names = 'error_{.col}'
    )
  ) |> 
  summarise(
    across(
      contains('error'), 
      list(
        'acc1' = function(x) accuracy_1(x), 
        'acc2' = function(x) accuracy_2(x),
        'mae'  = function(x) mean(abs(x))
      ), 
      .names = '{.col}-{.fn}')) |> 
  pivot_longer(everything(), names_to = 'model_metric', values_to = 'value') |> 
  separate('model_metric', c('model', 'metric'), '-') |> 
  pivot_wider(names_from = 'metric', values_from = 'value') |> 
  mutate(model = str_remove_all(model, 'error_|_fc'))

cpi_errors
# A tibble: 3 × 4
  model  acc1  acc2   mae
  <chr> <dbl> <dbl> <dbl>
1 ets   0.488 0.304 0.317
2 arima 0.456 0.286 0.284
3 avg   0.468 0.292 0.298

The results show that the ARIMA model is the most accurate according to the three metrics, outperforming even the average model. At this point, I’d like to conclude with two considerations. The first is that combining models usually improves performance, but not always, as the above exercise makes clear.

That said, although the literature shows that the mean or median of the individual forecasts is very difficult to beat, it’s possible to improve accuracy by optimizing the weights assigned to each model.
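To illustrate that last point, here is a minimal sketch of what optimizing the weights could look like: grid-search the weight assigned to the ETS forecast and pick the value that minimizes the MAE over the test window. Strictly speaking, the weights should be chosen on a separate validation window to avoid overfitting the test set.

grid_w   <- seq(0, 1, by = 0.05)
comb_mae <- sapply(grid_w, function(w) {
  cpi_fc |>
    filter(date >= 'Jan 2019') |>
    mutate(comb_fc = w * ets_fc + (1 - w) * arima_fc) |>
    summarise(mae = mean(abs(value - comb_fc))) |>
    pull(mae)
})
grid_w[which.min(comb_mae)] # weight on the ETS forecast that minimizes the MAE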

Finally, we compared the models based on their point forecasts. Despite being very common, this approach does not take into account the fact that each point forecast is a single realization of a random process, and there is a vast literature suggesting the use of density (or probabilistic) forecasts and distributional accuracy measures.
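To give a flavor of what this means in practice, note that every forecast object from the forecast package already carries prediction intervals, the simplest summary of the forecast distribution. A quick sketch on the full sample:

ets(cpi_br_ts) |>
  forecast(h = 12, level = c(50, 90)) |>
  autoplot() +
  labs(title = 'Brazilian CPI ex-regulated prices: 12-month ETS forecast with 50% and 90% intervals')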

Hyndman, R. J., and G. Athanasopoulos. 2018. Forecasting: Principles and Practice, 2nd Edition. OTexts: Melbourne, Australia.
Hyndman, Rob J., and Anne B. Koehler. 2006. “Another Look at Measures of Forecast Accuracy.” International Journal of Forecasting 22 (4): 679–88.

  1. It’s also possible to fit a separate model for each forecast horizon – an approach known as direct forecasting – or even to use different model types for different horizons. However, we won’t explore these topics here.↩︎