11  Dimensionality reduction

In today’s data-driven world, we increasingly encounter vast datasets that must be distilled into meaningful insights. A common challenge arises when there are dozens, hundreds, or even thousands of potential predictors for a target variable. Many of these predictors are highly correlated, making them redundant, while others exhibit little or no relationship with the outcome of interest, rendering them ineffective. Without the appropriate tools to manage such complexity, it’s all too easy to become overwhelmed or face technical hurdles.

For example, standard OLS regression is well known to struggle in high-dimensional settings. To begin with, the number of predictors must be smaller than the number of available observations. Even when this condition is met, including too many predictors relative to the sample size inflates the variance of the coefficient estimates. A similar problem arises when even a few highly correlated variables are included, sometimes causing coefficient signs to flip and thereby undermining the reliability of the results, as the short toy example below illustrates. This chapter outlines two widely used tools designed to tackle the challenges of handling high-dimensional data and isolating the most informative features.
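
As a quick toy illustration (simulated data, not the chapter’s dataset), fitting OLS with two nearly identical predictors shows how unstable the individual estimates become:

# Toy example: two almost perfectly correlated predictors in an OLS regression
set.seed(1)
n  <- 60
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly a copy of x1
y  <- x1 + x2 + rnorm(n)         # both true coefficients equal 1

# Note the inflated standard errors; the estimated signs may even flip
summary(lm(y ~ x1 + x2))$coefficients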

In this exercise, the focus is on a dataset comprising time series of prices for 33 perfumes collected from websites. The ultimate goal is to select the relevant information that helps explain the variation in the Consumer Price Index (CPI) for perfumes in Brazil. The perfume prices were collected at a daily frequency and then aggregated to match the twice-monthly releases of the Brazilian CPI. The sample covers the period from May 2022 to September 2024, resulting in 58 observations.

Data

The data set containing the variables for this exercise is available in the R4ER2data package under the name perfume_prices_cpi_br.

Let’s start by importing the dataset and visualizing its variables. To facilitate inspection of the series, I randomly assigned each product to one of four blocks.

library(tidyverse)
library(R4ER2data)

perfume_prices_cpi_br <- R4ER2data::perfume_prices_cpi_br
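
The plotting code is omitted here. A minimal sketch of how such a faceted view could be built is shown below; the random block assignment is illustrative rather than the author’s actual one, and it assumes the product columns hold the price variations in percent.

# Illustrative sketch: assign each product to one of four blocks at random
# and plot the raw series by block
set.seed(123)

block_map <- tibble(
  product = paste0('product', 1:33),
  block   = sample(paste('Block', 1:4), size = 33, replace = TRUE)
)

perfume_prices_cpi_br |> 
  pivot_longer(
    cols = starts_with('product'),
    names_to = 'product',
    values_to = 'price_change'
  ) |> 
  left_join(block_map, by = 'product') |> 
  ggplot(aes(x = date, y = price_change, group = product)) +
  geom_line(alpha = 0.6) +
  scale_y_continuous(labels = function(x) paste0(x, '%')) +
  scale_x_date(date_breaks = '6 months', date_labels = '%b/%y') +
  facet_wrap(~ block) +
  labs(title = 'Online perfume prices by block', x = '', y = '')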

Online prices tend to change more frequently than those in brick-and-mortar stores, often featuring promotions exclusive to online purchases. As a result, perfume prices fluctuate dramatically – ranging from –40% to 80% – making even the notoriously volatile Consumer Price Index (CPI) for perfumes appear stable by comparison. To enhance stability and facilitate comparison, I scaled perfume prices to match the CPI’s mean and standard deviation.

perfume_prices_cpi_br_scaled <- perfume_prices_cpi_br |> 
  mutate(
    across(
      starts_with('product'),
      ~ (scale(.x)[, 1] * sd(cpi_perfume)) + mean(cpi_perfume)
    )
  )
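
The comparison described below can be visualized with a sketch along these lines (not the book’s exact plot): the rescaled product series in grey and the CPI for perfumes in black.

# Hedged sketch: rescaled online prices (grey) vs. CPI for perfumes (black)
perfume_prices_cpi_br_scaled |> 
  pivot_longer(
    cols = starts_with('product'),
    names_to = 'product',
    values_to = 'scaled_price'
  ) |> 
  ggplot(aes(x = date)) +
  geom_line(aes(y = scaled_price, group = product), color = 'grey70') +
  geom_line(aes(y = cpi_perfume), color = 'black', lwd = 1) +
  scale_y_continuous(labels = function(x) paste0(x, '%')) +
  scale_x_date(date_breaks = '6 months', date_labels = '%b/%y') +
  labs(
    title = 'CPI for Perfumes (black) vs. rescaled online prices (grey)',
    x = '',
    y = ''
  )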

We can now see that several time series closely follow the target. Our main goal is to identify the most appropriate ones, assign proper weights, and construct a reliable predictor for the target series.

11.1 PCA

To summarize a large data set into a few key factors, Principal Components Analysis (PCA) is often the go-to tool. Essentially, PCA leverages the eigenvalues and eigenvectors of the covariance matrix to extract factors that explain the variability in the data. Its main advantages are its simplicity and low computational cost, which make it both easy to implement and fast to compute.
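
To make the mechanics concrete, the toy example below (not part of the chapter’s analysis) shows that prcomp reproduces the eigen decomposition of the covariance matrix when the data are centered.

# Toy illustration: prcomp vs. the eigen decomposition of the covariance matrix
set.seed(1)
X <- scale(matrix(rnorm(200), ncol = 4), center = TRUE, scale = FALSE)

eig <- eigen(cov(X))
pca <- stats::prcomp(X, center = FALSE, scale. = FALSE)

# The eigenvalues match the squared standard deviations of the components
all.equal(eig$values, pca$sdev^2)

# The scores are the data projected onto the eigenvectors (identical up to sign)
max(abs(abs(X %*% eig$vectors) - abs(pca$x)))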

However, PCA has its limitations. It is not ideally suited for time series data, as it does not account for autocorrelation. In such cases, the Dynamic Factor Model (DFM), discussed in Chapter 19, provides a more appropriate alternative. Additionally, PCA is an unsupervised method, meaning it does not consider the target variable when computing factors. As a result, the computed factors may or may not be relevant to the target variable – something we can explore later using a regression model. This will become clearer soon.

The first step is to convert the data into a matrix. Next, we call the prcomp function from the stats package. Note that we have chosen not to center or scale the data, as these preprocessing steps were already applied earlier.

perfume_prices_matrix <- perfume_prices_cpi_br |> 
  arrange(date) |> 
  select(starts_with('product')) |> 
  as.matrix()

perfume_prices_pca <- stats::prcomp(
  perfume_prices_matrix, 
  center = FALSE, 
  scale. = FALSE
)

summary(perfume_prices_pca)
Importance of components:
                           PC1     PC2     PC3      PC4      PC5      PC6
Standard deviation     30.3430 27.0098 22.9436 20.88568 20.36978 19.08268
Proportion of Variance  0.1581  0.1253  0.0904  0.07491  0.07126  0.06254
Cumulative Proportion   0.1581  0.2834  0.3738  0.44872  0.51998  0.58252
                            PC7      PC8      PC9    PC10     PC11     PC12
Standard deviation     17.36359 16.11335 15.49840 14.9523 13.94684 13.28717
Proportion of Variance  0.05178  0.04459  0.04125  0.0384  0.03341  0.03032
Cumulative Proportion   0.63430  0.67889  0.72014  0.7585  0.79194  0.82226
                          PC13     PC14    PC15    PC16    PC17    PC18    PC19
Standard deviation     13.1062 11.12409 10.8181 9.58923 9.49272 9.21351 8.09039
Proportion of Variance  0.0295  0.02125  0.0201 0.01579 0.01548 0.01458 0.01124
Cumulative Proportion   0.8518  0.87301  0.8931 0.90890 0.92438 0.93896 0.95020
                          PC20   PC21    PC22    PC23    PC24    PC25    PC26
Standard deviation     7.41773 7.0336 6.68826 5.57626 5.19654 4.45811 4.19738
Proportion of Variance 0.00945 0.0085 0.00768 0.00534 0.00464 0.00341 0.00303
Cumulative Proportion  0.95965 0.9681 0.97583 0.98117 0.98581 0.98922 0.99224
                         PC27    PC28   PC29    PC30    PC31    PC32    PC33
Standard deviation     3.5764 3.34859 2.8501 2.32690 1.93847 1.58521 1.16035
Proportion of Variance 0.0022 0.00193 0.0014 0.00093 0.00065 0.00043 0.00023
Cumulative Proportion  0.9944 0.99637 0.9978 0.99869 0.99934 0.99977 1.00000

The summary function returns statistics about the principal components, the most important being the proportion of variance explained by each component. Ideally, we would like the data’s variability to be captured by only a few factors, with minimal additional gains from including more components. However, this is not always the case. In the output above, the first component accounts for only about 16% of the variability, and it takes the first five components together to explain just over half of it. This clearly indicates how heterogeneous our data set is.
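
Although not shown in the original text, a quick way to visualize these shares is to plot the cumulative proportion of variance explained:

# Cumulative proportion of variance explained by the principal components
tibble(
  pc        = seq_along(perfume_prices_pca$sdev),
  var_share = perfume_prices_pca$sdev^2 / sum(perfume_prices_pca$sdev^2)
) |> 
  ggplot(aes(x = pc, y = cumsum(var_share))) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = function(x) paste0(round(100 * x), '%')) +
  labs(
    title = 'Cumulative proportion of variance explained',
    x = 'Principal component',
    y = ''
  )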

Now, let’s select the first four components and assess whether any of them can help explain the CPI series. This choice is somewhat arbitrary – it’s possible that the seventh or tenth factor might be relevant to explain the target CPI. Since our factors were derived solely from online prices, there is no guarantee that they are directly related to the CPI. However, we expect that the strongest co-movements among online prices, captured by the first few factors, are driven by the same underlying forces that influence the prices used in the official indicator.

perfume_prices_pca_series <- predict(
  perfume_prices_pca, 
  perfume_prices_cpi_br_scaled
) |> as.data.frame()

perfume_prices_pca_df <- bind_cols(
  perfume_prices_cpi_br_scaled |> select(date, cpi_perfume),
  perfume_prices_pca_series[, 1:4]
)

perfume_prices_pca_df |> 
  pivot_longer(
    cols = starts_with('PC'),
    names_to = 'pc',
    values_to = 'serie'
  ) |> 
  ggplot(aes(x = date)) +
  geom_line(aes(y = cpi_perfume), color = 'black', lwd = 1) +
  geom_line(aes(y = serie), color = 'red', lwd = 1) +
  scale_y_continuous(labels = function(x) paste0(x, '%')) +
  scale_x_date(date_breaks = '6 months', date_labels = '%b/%y') +
  facet_wrap(~ pc, scales = 'free_y') +
  labs(
    title = 'CPI for Perfumes (black) vs. Principal Components (red)',
    x = '',
    y = ''
  )

At first glance, the plot might seem a bit confusing, but a closer examination reveals important patterns. For instance, PC1 appears to correlate with the target variable, albeit in the opposite direction, and the same seems to hold for PC3. This is not necessarily a concern, as we are now analyzing factors that summarize variability in the data rather than raw prices. To gain a clearer sense of the correlation, we can invert the factors.

perfume_prices_pca_df |> 
  pivot_longer(
    cols = starts_with('PC'),
    names_to = 'pc',
    values_to = 'serie'
  ) |> 
  ggplot(aes(x = date)) +
  geom_line(aes(y = cpi_perfume, color = 'CPI'), lwd = 1) +
  geom_line(aes(y = -1 * serie), lwd = 1) +
  scale_y_continuous(labels = function(x) paste0(x, '%')) +
  scale_x_date(date_breaks = '6 months', date_labels = '%b/%y') +
  facet_wrap(~ pc, scales = 'free_y') +
  labs(
    title = 'CPI for Perfumes vs. Principal Components (inverted)',
    x = '',
    y = ''
  )

Additionally, PC2 – and especially PC4 – exhibit a similar pattern of high volatility in the initial part of the sample, followed by lower volatility toward the end. However, PC2 diverges significantly from the target series toward the end of the sample, raising concerns about its alignment with future values. This warrants caution in its interpretation.

In summary, the factors appear to contain valuable information for explaining the target variable. Let’s proceed by incorporating them into a regression model.

library(forecast)
perfume_prices_ols_fit <- lm(cpi_perfume ~ PC1 + PC2 + PC3 + PC4, data = perfume_prices_pca_df)

perfume_prices_ols_fit |> summary()

Call:
lm(formula = cpi_perfume ~ PC1 + PC2 + PC3 + PC4, data = perfume_prices_pca_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2483 -1.2809 -0.1793  0.8800  6.8498 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.29876    0.35930  -0.832 0.409420    
PC1         -0.23347    0.05710  -4.089 0.000148 ***
PC2          0.05381    0.06405   0.840 0.404658    
PC3         -0.21816    0.07287  -2.994 0.004181 ** 
PC4         -0.27673    0.06617  -4.182 0.000109 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.238 on 53 degrees of freedom
Multiple R-squared:  0.4346,    Adjusted R-squared:  0.3919 
F-statistic: 10.18 on 4 and 53 DF,  p-value: 3.432e-06
perfume_prices_ols_fit |> forecast::checkresiduals()


    Breusch-Godfrey test for serial correlation of order up to 10

data:  Residuals
LM test = 29.22, df = 10, p-value = 0.001148

The initial results confirm our expectations: PC1 and PC3 both have negative signs, as does PC4. As noted earlier, this is not a cause for concern. PC2, however, is not statistically significant, and its divergence from the target variable toward the end of the sample raises further doubts about its reliability. Given this, it would be safer to exclude it from the model.

Additionally, the residuals exhibit autocorrelation, which suggests that incorporating lags of the target variable may improve the model. Let’s update our regression accordingly.

perfume_prices_ols_fit2 <- perfume_prices_ols_fit |> 
  update(. ~ . - PC2 + lag(cpi_perfume, 1) + lag(cpi_perfume, 2))

perfume_prices_ols_fit2 |> summary()

Call:
lm(formula = cpi_perfume ~ PC1 + PC3 + PC4 + lag(cpi_perfume, 
    1) + lag(cpi_perfume, 2), data = perfume_prices_pca_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4904 -1.0762 -0.0182  0.9980  4.6129 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          0.33293    0.28477   1.169  0.24790    
PC1                 -0.14654    0.04683  -3.129  0.00292 ** 
PC3                 -0.16307    0.05913  -2.758  0.00811 ** 
PC4                 -0.16547    0.06014  -2.751  0.00825 ** 
lag(cpi_perfume, 1)  0.20171    0.10021   2.013  0.04952 *  
lag(cpi_perfume, 2) -0.51323    0.09012  -5.695  6.5e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.73 on 50 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.662, Adjusted R-squared:  0.6282 
F-statistic: 19.59 on 5 and 50 DF,  p-value: 9.345e-11
perfume_prices_ols_fit2 |> forecast::checkresiduals()


    Breusch-Godfrey test for serial correlation of order up to 10

data:  Residuals
LM test = 26.693, df = 10, p-value = 0.002912

Adding the lags improved the fit considerably, although the Breusch-Godfrey test still points to some remaining serial correlation in the residuals. Some heteroscedasticity also appears to remain, though it does not pose significant issues for parameter inference. The plot further below illustrates how the model’s fitted values align with the target variable.
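
A quick formal check of the heteroscedasticity claim (not run in the original text; it assumes the lmtest package is installed) could be the Breusch-Pagan test:

# Hedged check: Breusch-Pagan test for heteroscedasticity in the residuals
lmtest::bptest(perfume_prices_ols_fit2)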

library(broom)
perfume_prices_ols_fit2_results <- perfume_prices_ols_fit2 |> 
  broom::augment() |> 
  mutate(
    date = perfume_prices_pca_df$date[as.numeric(.rownames)]
  ) |> 
  relocate(date, cpi_perfume, .fitted)

perfume_prices_ols_fit2_results |> 
  ggplot(aes(x = date)) +
  geom_line(aes(y = cpi_perfume, color = 'CPI'), lwd = 1) + 
  geom_line(aes(y = .fitted, color = 'fitted'), lwd = 1) +
  scale_y_continuous(labels = function(x) paste0(x, '%')) +
  scale_x_date(date_breaks = '6 months', date_labels = '%b/%y') +
  labs(
    title = 'CPI for perfumes vs. fitted values - %MoM NSA',
    x = '',
    y = '%MoM NSA',
    color = ''
  )

The PCA-based approach performed quite well. We started with 33 candidate predictors and distilled them down to just three factors that:

  1. Capture the most relevant information from the data,
  2. Dramatically reduce the risk of overfitting, and
  3. Mitigate instability in both model parameters and predicted values by filtering out noise from individual series—a crucial concern when working with data from websites.

11.2 LASSO

The Least Absolute Shrinkage and Selection Operator (LASSO) is a regression method that extends the standard linear regression by introducing an L1 regularization penalty on the coefficients. This penalty shrinks the coefficients, driving some toward zero, effectively performing variable selection. The objective of LASSO is to solve the following problem:

\[ \hat{\beta}_{LASSO} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - X_i \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \] where \(\lambda\) is the regularization parameter.
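
One practical detail worth keeping in mind for the glmnet package used below: according to its documentation, the package scales the loss term by the number of observations, minimizing

\[ \frac{1}{2n} \sum_{i=1}^{n} (y_i - X_i \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \]

so the \(\lambda\) values it reports are not on the same scale as in the formulation above.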

This approach differs conceptually from PCA in several ways. First, LASSO aims to retain only the most relevant variables, whereas PCA seeks to summarize the entire information set. Second, LASSO is a supervised method, meaning that predictors are deemed relevant based on their contribution to explaining the target variable.

Additionally, LASSO has remained highly popular among econometricians, even after the widespread adoption of other machine learning techniques that also perform regularization. This is due to its computational efficiency, its simplicity – only a single parameter, \(\lambda\), needs to be tuned – and its straightforward interpretability, since it explicitly identifies the relevant variables and their contribution to explaining the target.

As an application, let’s revisit the task from the previous section: fitting a model for the CPI of perfumes using data on 33 products extracted from websites. We’ll use the glmnet function from the package of the same name, the most popular R package for fitting LASSO models.

The first step is to prepare the data before passing it to the function. The glmnet function requires the predictor data to be in matrix form and the target data to be in vector form.

The second step is to determine the value of the regularization parameter, \(\lambda\). The most common approach is to perform a k-fold cross-validation, where the model is fitted on multiple subsets of the sample, and \(\lambda\) is selected based on the value that minimizes the prediction error.

The cv.glmnet function performs the k-fold cross-validation, and the resulting object has a plot method that produces a visual representation of the results.

library(glmnet)
set.seed(1)

target    <- perfume_prices_cpi_br_scaled |> pull(cpi_perfume)
lambda_cv <- glmnet::cv.glmnet(x = perfume_prices_matrix, y = target)

lambda_cv

Call:  glmnet::cv.glmnet(x = perfume_prices_matrix, y = target) 

Measure: Mean-Squared Error 

     Lambda Index Measure     SE Nonzero
min 0.04717    37   3.449 0.6667      26
1se 0.13125    26   4.090 0.8686      19
plot(lambda_cv)

The red dots show the mean squared error (MSE) obtained from the k-fold cross-validation along the \(\lambda\) sequence (the x-axis, on a log scale). The two vertical dashed lines indicate, respectively, the value of \(\lambda\) that minimizes the MSE and the most regularized model whose cross-validated error is within one standard error of that minimum. The axis at the top shows the number of selected (non-zero) predictors at each value of \(\lambda\).
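
It can also be instructive (though not done in the original text) to inspect the full coefficient path stored in the fitted object, which shows the coefficients shrinking toward zero as \(\lambda\) increases:

# Coefficient paths along the lambda sequence
plot(lambda_cv$glmnet.fit, xvar = 'lambda', label = TRUE)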

The resulting object has a coef method that allows us to examine the selected variables and their coefficients for any \(\lambda\) in the sequence. It also accepts the text shortcuts 'lambda.min' and 'lambda.1se' for the two values indicated by the vertical dashed lines.

coef(lambda_cv, s = 'lambda.min')
34 x 1 sparse Matrix of class "dgCMatrix"
                       s1
(Intercept) -3.748582e-01
product1     2.860844e-02
product2     .           
product3     4.284695e-02
product4    -9.690620e-03
product5     1.509501e-02
product6    -2.695724e-03
product7     2.311279e-02
product8     2.467449e-05
product9    -3.094944e-02
product10   -4.206163e-02
product11    9.665337e-02
product12   -1.187188e-02
product13    .           
product14    1.235098e-02
product15    2.704873e-02
product16    1.416197e-02
product17    1.575724e-03
product18    2.486571e-02
product19    3.859810e-02
product20    7.090450e-02
product21    9.734368e-02
product22    .           
product23    .           
product24    .           
product25   -3.640788e-02
product26    7.752064e-02
product27    .           
product28   -1.483790e-01
product29    6.341041e-03
product30   -9.723473e-02
product31    1.066508e-01
product32    .           
product33    5.462544e-02

As we observed in the plot, the \(\lambda\) that minimizes the MSE yields a large model with 26 variables – roughly half the sample size. Moreover, many of these variables have negative coefficients, which is unexpected. This suggests that a slightly stronger regularization may be necessary.

At this point, it is worth noting that introducing some discretion into the process is perfectly acceptable if we believe the resulting model should be more or less restrictive. For instance, we can choose \(\lambda\) to yield a more parsimonious model with six variables. According to the plot, \(\lambda \approx 0.5\) should accomplish this.

coef(lambda_cv, s = 0.5)
34 x 1 sparse Matrix of class "dgCMatrix"
                    s1
(Intercept) 0.23186267
product1    .         
product2    .         
product3    .         
product4    .         
product5    .         
product6    .         
product7    0.02049885
product8    .         
product9    .         
product10   .         
product11   .         
product12   .         
product13   .         
product14   .         
product15   .         
product16   .         
product17   .         
product18   .         
product19   0.02332652
product20   0.03652025
product21   0.09091570
product22   .         
product23   .         
product24   .         
product25   .         
product26   .         
product27   .         
product28   .         
product29   .         
product30   .         
product31   0.02172697
product32   .         
product33   0.02922538

We now have a parsimonious model with all the coefficients positive. Let’s check the fitted values.

perfume_prices_lasso_fit_results <- perfume_prices_cpi_br |> 
  select(date, cpi_perfume) |> 
  mutate(
    # predict returns a one-column matrix; store it as a plain numeric vector
    .fitted_lasso = as.numeric(
      predict(lambda_cv$glmnet.fit, newx = perfume_prices_matrix, s = 0.5)
    )
  )

perfume_prices_lasso_fit_results |> 
  ggplot(aes(x = date)) +
  geom_line(aes(y = cpi_perfume, color = 'CPI'), lwd = 1) + 
  geom_line(aes(y = .fitted_lasso, color = 'LASSO'), lwd = 1) +
  scale_y_continuous(labels = function(x) paste0(x, '%')) +
  scale_x_date(date_breaks = '6 months', date_labels = '%b/%y') +
  labs(
    title = 'CPI for perfumes vs. fitted values - %MoM NSA',
    x = '',
    y = '%MoM NSA',
    color = ''
  )

The six variables produced a reasonable fit to the data, especially in the most recent period. Recall that in the previous section, when estimating the OLS model, we included two lags of the response variable to account for autocorrelation in the residuals. These lags improved the model’s \(R^2\) and could also be incorporated into the selection procedure here to enhance performance.
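
A minimal sketch of how those lags could enter the LASSO, under my own assumptions about the setup rather than the author’s code, is shown below.

# Add two lags of the target to the predictor matrix before cross-validation
perfume_lags_matrix <- cbind(
  perfume_prices_matrix,
  cpi_lag1 = dplyr::lag(target, 1),
  cpi_lag2 = dplyr::lag(target, 2)
)

# Drop the initial rows that are NA because of the lags
keep <- complete.cases(perfume_lags_matrix)

lambda_cv_lags <- glmnet::cv.glmnet(
  x = perfume_lags_matrix[keep, ],
  y = target[keep]
)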

Additionally, we could further analyze the model by adjusting the number of selected variables, either restricting or expanding them. However, the primary goal here was to demonstrate the fundamentals of LASSO estimation. Those interested in more details can find additional resources on the glmnet official page.

Finally, it is important to note that, as discussed earlier in this chapter, website prices tend to be highly volatile and noisy. While the PCA factors helped mitigate these issues, in the LASSO estimation, we used raw prices as inputs. A more effective approach would be to first apply a smoothing filter to the series before performing the estimation.
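
As a rough sketch of that idea, a simple trailing moving average could be applied before refitting; the choice of filter and window length below is an assumption, not the author’s.

# Smooth each product series with a 3-period trailing moving average,
# then refit the cross-validated LASSO on the smoothed predictors
perfume_prices_smooth <- apply(
  perfume_prices_matrix,
  2,
  function(x) stats::filter(x, rep(1 / 3, 3), sides = 1)
)

# The first two rows are NA after the moving average, so drop them
keep_smooth <- complete.cases(perfume_prices_smooth)

lambda_cv_smooth <- glmnet::cv.glmnet(
  x = perfume_prices_smooth[keep_smooth, ],
  y = target[keep_smooth]
)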