library(tidyverse)
library(R4ER2data)
perfume_prices_cpi_br <- R4ER2data::perfume_prices_cpi_br
11 Dimensionality reduction
In today’s data-driven world, we increasingly encounter vast datasets that must be distilled into meaningful insights. A common challenge arises when there are dozens, hundreds, or even thousands of potential predictors for a target variable. Many of these predictors are highly correlated, making them redundant, while others exhibit little or no relationship with the outcome of interest, rendering them ineffective. Without the appropriate tools to manage such complexity, it’s all too easy to become overwhelmed or face technical hurdles.
For example, standard OLS regression is well known to struggle in high-dimensional settings. For one, the number of predictors must be smaller than the number of observations. Even within that limit, including too many predictors relative to the sample size tends to inflate the variance of the coefficient estimates. A similar challenge arises even when only a few highly correlated variables are included, sometimes causing coefficient signs to reverse and thereby undermining the reliability of the results. This chapter outlines two widely used tools designed to tackle the challenges of handling high-dimensional data and isolating the most informative features.
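The sketch below illustrates this point with a toy simulation (simulated data, not the chapter's dataset): when two predictors are nearly collinear, the individual OLS estimates become very imprecise and can even change sign, while dropping one of them stabilizes the result.

# Toy simulation (not the chapter's data): with two nearly collinear predictors,
# OLS coefficient estimates become very imprecise and can even change sign.
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)                 # almost identical to x1
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)

summary(lm(y ~ x1 + x2))$coefficients          # large standard errors on x1 and x2
summary(lm(y ~ x1))$coefficients               # dropping x2 stabilizes the estimate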
In this exercise, the focus is on a dataset comprising time series of prices for 33 perfumes collected from websites. The ultimate goal is to select the relevant information that helps to explain the variation in the Consumer Price Index (CPI) for perfumes in Brazil. The data on perfume prices were collected at a daily frequency and then aggregated to match the twice-monthly releases of the Brazilian CPI. The sample covers the period from May 2022 to September 2024, resulting in 58 observations.
The data set containing the variables for this exercise is available in the R4ER2data package under the name perfume_prices_cpi_br.
Let’s start by importing the dataset and visualizing its variables. To facilitate inspection of the series, I randomly assigned each product to one of four blocks.
Online prices tend to change more frequently than those in brick-and-mortar stores, often featuring promotions exclusive to online purchases. As a result, perfume prices fluctuate dramatically – ranging from –40% to 80% – making even the notoriously volatile Consumer Price Index (CPI) for perfumes appear stable by comparison. To enhance stability and facilitate comparison, I scaled perfume prices to match the CPI’s mean and standard deviation.
perfume_prices_cpi_br_scaled <- perfume_prices_cpi_br |>
  mutate(
    across(
      starts_with('product'),
      ~ (scale(.x)[, 1] * sd(cpi_perfume)) + mean(cpi_perfume)
    )
  )
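The original figure is not reproduced here; the sketch below shows one way the comparison could be plotted. The random block assignment is illustrative and the exact plotting code used in the text is not shown.

# Sketch of the comparison figure (the random block assignment below is
# illustrative; the exact plotting code of the original figure is not shown).
set.seed(123)
product_blocks <- tibble(
  name  = paste0('product', 1:33),
  block = sample(paste('Block', 1:4), size = 33, replace = TRUE)
)

perfume_prices_cpi_br_scaled |>
  pivot_longer(cols = starts_with('product')) |>
  left_join(product_blocks, by = 'name') |>
  ggplot(aes(x = date)) +
  geom_line(aes(y = value, group = name), color = 'grey70') +
  geom_line(aes(y = cpi_perfume), color = 'black', lwd = 1) +
  scale_x_date(date_breaks = '6 months', date_labels = '%b/%y') +
  facet_wrap(~ block) +
  labs(x = '', y = '')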
We can now see that several time series closely follow the target. Our main goal is to identify the most appropriate ones, assign proper weights, and construct a reliable predictor for the target series.
11.1 PCA
To summarize a large data set into a few key factors, Principal Components Analysis (PCA) is often the go-to tool. Essentially, PCA leverages the eigenvalues and eigenvectors of the covariance matrix to extract factors that explain the variability in the data. Its main advantages are its simplicity and low computational cost, which make it both easy to implement and fast to compute.
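As a quick illustration of these mechanics, the toy simulation below (simulated data, not the perfume dataset) shows that the component standard deviations reported by prcomp are the square roots of the eigenvalues of the covariance matrix.

# Toy illustration: prcomp() is equivalent to an eigendecomposition of the
# covariance matrix (for centered, unscaled data).
set.seed(123)
X <- matrix(rnorm(100 * 3), ncol = 3)

eigen(cov(X))$values |> sqrt()                 # component standard deviations
prcomp(X, center = TRUE, scale. = FALSE)$sdev  # same values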
However, PCA has its limitations. It is not ideally suited for time series data, as it does not account for autocorrelation. In such cases, the Dynamic Factor Model (DFM), discussed in Chapter 19, provides a more appropriate alternative. Additionally, PCA is an unsupervised method, meaning it does not consider the target variable when computing factors. As a result, the computed factors may or may not be relevant to the target variable – something we can explore later using a regression model. This will become clearer soon.
The first step is to convert the data into a matrix. Next, we call the prcomp function from the stats package. Note that we have chosen not to center or scale the data, as these preprocessing steps were already applied earlier.
perfume_prices_matrix <- perfume_prices_cpi_br |>
  arrange(date) |>
  select(starts_with('product')) |>
  as.matrix()

perfume_prices_pca <- stats::prcomp(
  perfume_prices_matrix,
  center = FALSE,
  scale. = FALSE
)
summary(perfume_prices_pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 30.3430 27.0098 22.9436 20.88568 20.36978 19.08268
Proportion of Variance 0.1581 0.1253 0.0904 0.07491 0.07126 0.06254
Cumulative Proportion 0.1581 0.2834 0.3738 0.44872 0.51998 0.58252
PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 17.36359 16.11335 15.49840 14.9523 13.94684 13.28717
Proportion of Variance 0.05178 0.04459 0.04125 0.0384 0.03341 0.03032
Cumulative Proportion 0.63430 0.67889 0.72014 0.7585 0.79194 0.82226
PC13 PC14 PC15 PC16 PC17 PC18 PC19
Standard deviation 13.1062 11.12409 10.8181 9.58923 9.49272 9.21351 8.09039
Proportion of Variance 0.0295 0.02125 0.0201 0.01579 0.01548 0.01458 0.01124
Cumulative Proportion 0.8518 0.87301 0.8931 0.90890 0.92438 0.93896 0.95020
PC20 PC21 PC22 PC23 PC24 PC25 PC26
Standard deviation 7.41773 7.0336 6.68826 5.57626 5.19654 4.45811 4.19738
Proportion of Variance 0.00945 0.0085 0.00768 0.00534 0.00464 0.00341 0.00303
Cumulative Proportion 0.95965 0.9681 0.97583 0.98117 0.98581 0.98922 0.99224
PC27 PC28 PC29 PC30 PC31 PC32 PC33
Standard deviation 3.5764 3.34859 2.8501 2.32690 1.93847 1.58521 1.16035
Proportion of Variance 0.0022 0.00193 0.0014 0.00093 0.00065 0.00043 0.00023
Cumulative Proportion 0.9944 0.99637 0.9978 0.99869 0.99934 0.99977 1.00000
The summary function returns statistics about the principal components, the most important being the proportion of variance explained by each component. Ideally, we would like the data's variability to be captured by only a few factors, with minimal additional gains from including more components. However, this is not always the case. In the output above, the first component accounts for only about 16% of the variability, and it takes the first five components together to explain just over 50% of the total. This clearly indicates how heterogeneous our data set is.
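A convenient way to see this is to plot the cumulative share of variance explained. The sketch below is optional and not part of the original text; it only repackages the importance figures from the summary.

# Optional visual aid (not in the original text): cumulative proportion of
# variance explained by the principal components.
tibble(
  pc      = seq_along(perfume_prices_pca$sdev),
  cum_var = cumsum(perfume_prices_pca$sdev^2) / sum(perfume_prices_pca$sdev^2)
) |>
  ggplot(aes(x = pc, y = cum_var)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(x = 'Principal component', y = 'Cumulative proportion of variance')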
Now, let's select the first four components and assess whether any of them help explain the CPI series. This choice is somewhat arbitrary – it is possible that the seventh or the tenth factor is the one relevant for explaining the target CPI. Since our factors were derived solely from online prices, there is no guarantee that they are directly related to the CPI. However, we expect that the strongest co-movements among online prices, captured by the first few factors, are driven by the same underlying forces that influence the prices used in the official indicator.
perfume_prices_pca_series <- predict(
  perfume_prices_pca,
  perfume_prices_cpi_br_scaled |> as.data.frame()
)

perfume_prices_pca_df <- bind_cols(
  perfume_prices_cpi_br_scaled |> select(date, cpi_perfume),
  perfume_prices_pca_series[, 1:4]
)
perfume_prices_pca_df |>
  pivot_longer(
    cols = starts_with('PC'),
    names_to = 'pc',
    values_to = 'serie'
  ) |>
  ggplot(aes(x = date)) +
  geom_line(aes(y = cpi_perfume), color = 'black', lwd = 1) +
  geom_line(aes(y = serie), color = 'red', lwd = 1) +
  scale_y_continuous(labels = function(x) paste0(x, '%')) +
  scale_x_date(date_breaks = '6 months', date_labels = '%b/%y') +
  facet_wrap(~ pc, scales = 'free_y') +
  labs(
    title = 'CPI for Perfumes (black) vs. Principal Components (red)',
    x = '',
    y = ''
  )
At first glance, the plot might seem a bit confusing, but a closer examination reveals important patterns. For instance, PC1 appears to correlate with the target variable, albeit in the opposite direction, and the same seems to hold for PC3. This is not necessarily a concern, as we are now analyzing factors that summarize variability in the data rather than raw prices. To gain a clearer sense of the correlation, we can invert the factors.
perfume_prices_pca_df |>
  pivot_longer(
    cols = starts_with('PC'),
    names_to = 'pc',
    values_to = 'serie'
  ) |>
  ggplot(aes(x = date)) +
  geom_line(aes(y = cpi_perfume, color = 'CPI'), lwd = 1) +
  geom_line(aes(y = -1 * serie), lwd = 1) +
  scale_y_continuous(labels = function(x) paste0(x, '%')) +
  scale_x_date(date_breaks = '6 months', date_labels = '%b/%y') +
  facet_wrap(~ pc, scales = 'free_y') +
  labs(
    title = 'CPI for Perfumes vs. Principal Components (inverted)',
    x = '',
    y = ''
  )
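The visual impression can also be checked numerically. A minimal sketch (not in the original text) computes the linear correlation between the CPI and each of the first four components; negative values correspond to the factors plotted with an inverted sign above.

# Quick numerical check (not in the original text): linear correlation between
# the CPI and each of the first four principal components.
perfume_prices_pca_df |>
  summarise(across(starts_with('PC'), ~ cor(.x, cpi_perfume)))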
Additionally, PC2 – and especially PC4 – exhibit a similar pattern of high volatility in the initial part of the sample, followed by lower volatility toward the end. However, PC2 diverges significantly from the target series toward the end of the sample, raising concerns about its alignment with future values. This warrants caution in its interpretation.
In summary, the factors appear to contain valuable information for explaining the target variable. Let’s proceed by incorporating them into a regression model.
library(forecast)

perfume_prices_ols_fit <- lm(cpi_perfume ~ PC1 + PC2 + PC3 + PC4, perfume_prices_pca_df)

perfume_prices_ols_fit |> summary()
Call:
lm(formula = cpi_perfume ~ PC1 + PC2 + PC3 + PC4, data = perfume_prices_pca_df)
Residuals:
Min 1Q Median 3Q Max
-4.2483 -1.2809 -0.1793 0.8800 6.8498
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.29876 0.35930 -0.832 0.409420
PC1 -0.23347 0.05710 -4.089 0.000148 ***
PC2 0.05381 0.06405 0.840 0.404658
PC3 -0.21816 0.07287 -2.994 0.004181 **
PC4 -0.27673 0.06617 -4.182 0.000109 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.238 on 53 degrees of freedom
Multiple R-squared: 0.4346, Adjusted R-squared: 0.3919
F-statistic: 10.18 on 4 and 53 DF, p-value: 3.432e-06
perfume_prices_ols_fit |> forecast::checkresiduals()
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 29.22, df = 10, p-value = 0.001148
The initial results confirm our expectations: PC1 and PC3 both have negative signs, as does PC4. As noted earlier, this is not a cause for concern. PC2, however, is not statistically significant, and its divergence from the target variable toward the end of the sample raises further doubts about its reliability. Given this, it would be safer to exclude it from the model.
Additionally, the residuals exhibit autocorrelation, which suggests that incorporating lags of the target variable may improve the model. Let’s update our regression accordingly.
perfume_prices_ols_fit2 <- perfume_prices_ols_fit |>
  update(. ~ . - PC2 + lag(cpi_perfume, 1) + lag(cpi_perfume, 2))

perfume_prices_ols_fit2 |> summary()
Call:
lm(formula = cpi_perfume ~ PC1 + PC3 + PC4 + lag(cpi_perfume,
1) + lag(cpi_perfume, 2), data = perfume_prices_pca_df)
Residuals:
Min 1Q Median 3Q Max
-4.4904 -1.0762 -0.0182 0.9980 4.6129
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.33293 0.28477 1.169 0.24790
PC1 -0.14654 0.04683 -3.129 0.00292 **
PC3 -0.16307 0.05913 -2.758 0.00811 **
PC4 -0.16547 0.06014 -2.751 0.00825 **
lag(cpi_perfume, 1) 0.20171 0.10021 2.013 0.04952 *
lag(cpi_perfume, 2) -0.51323 0.09012 -5.695 6.5e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.73 on 50 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.662, Adjusted R-squared: 0.6282
F-statistic: 19.59 on 5 and 50 DF, p-value: 9.345e-11
perfume_prices_ols_fit2 |> forecast::checkresiduals()
Breusch-Godfrey test for serial correlation of order up to 10
data: Residuals
LM test = 26.693, df = 10, p-value = 0.002912
The additional lags reduced the autocorrelation in the residuals, although the Breusch-Godfrey test still detects some serial correlation at conventional significance levels. Some heteroscedasticity also remains, but it does not pose significant issues for parameter inference. The plot below illustrates how the model's fitted values align with the target variable.
library(broom)

perfume_prices_ols_fit2_results <- perfume_prices_ols_fit2 |>
  broom::augment() |>
  mutate(
    date = perfume_prices_pca_df$date[as.numeric(.rownames)]
  ) |>
  relocate(date, cpi_perfume, .fitted)

perfume_prices_ols_fit2_results |>
  ggplot(aes(x = date)) +
  geom_line(aes(y = cpi_perfume, color = 'CPI'), lwd = 1) +
  geom_line(aes(y = .fitted, color = 'fitted'), lwd = 1) +
  scale_y_continuous(labels = function(x) paste0(x, '%')) +
  scale_x_date(date_breaks = '6 months', date_labels = '%b/%y') +
  labs(
    title = 'CPI for perfumes vs. fitted values - %MoM NSA',
    x = '',
    y = '%MoM NSA',
    color = ''
  )
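Since the diagnostics above still flag some serial correlation and heteroscedasticity, a robustness check on the standard errors may be worthwhile. The sketch below uses the sandwich and lmtest packages, which are not part of the original workflow.

# Optional robustness check (not part of the original workflow): HAC standard
# errors for the second model using the sandwich and lmtest packages.
library(sandwich)
library(lmtest)

lmtest::coeftest(
  perfume_prices_ols_fit2,
  vcov. = sandwich::NeweyWest(perfume_prices_ols_fit2)
)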
The PCA-based approach performed quite well. We started with 33 candidate predictors and distilled them down to just three factors that:
- Capture the most relevant information from the data,
- Dramatically reduce the risk of overfitting, and
- Mitigate instability in both model parameters and predicted values by filtering out noise from individual series—a crucial concern when working with data from websites.
11.2 LASSO
The Least Absolute Shrinkage and Selection Operator (LASSO) is a regression method that extends the standard linear regression by introducing an L1 regularization penalty on the coefficients. This penalty shrinks the coefficients, driving some toward zero, effectively performing variable selection. The objective of LASSO is to solve the following problem:
\[ \hat{\beta}_{LASSO} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - X_i \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \] where \(\lambda\) is the regularization parameter.
This approach differs conceptually from PCA in several ways. First, LASSO aims to retain only the most relevant variables, whereas PCA seeks to summarize the entire information set. Second, LASSO is a supervised method, meaning that predictors are deemed relevant based on their contribution to explaining the target variable.
Additionally, LASSO has remained highly popular among econometricians, even after the widespread adoption of other machine learning techniques that also perform regularization, due to its computational efficiency, its simplicity – it requires tuning only a single parameter, \(\lambda\) – and its straightforward interpretability, as it explicitly identifies the relevant variables and their contribution to explaining the target.
As an application, let's revisit the task from the previous section: fitting a model for the CPI of perfumes using data on 33 products extracted from websites. We'll use the glmnet function from the package of the same name, the most popular R package for fitting LASSO models.
The first step is to prepare the data before passing it to the function. The glmnet function requires the predictor data to be in matrix form and the target data to be in vector form.
The second step is to determine the value of the regularization parameter, \(\lambda\). The most common approach is to perform a k-fold cross-validation, where the model is fitted on multiple subsets of the sample, and \(\lambda\) is selected based on the value that minimizes the prediction error.
The cv.glmnet function performs the k-fold cross-validation, and the resulting object has a plot method that produces a visual representation of the results.
library(glmnet)

set.seed(1)

target <- perfume_prices_cpi_br_scaled |> pull(cpi_perfume)
lambda_cv <- glmnet::cv.glmnet(x = perfume_prices_matrix, y = target)

lambda_cv
Call: glmnet::cv.glmnet(x = perfume_prices_matrix, y = target)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.04717 37 3.449 0.6667 26
1se 0.13125 26 4.090 0.8686 19
plot(lambda_cv)
The red dots are the mean squared error (MSE) obtained from the k-fold cross-validation along the \(\lambda\) sequence (the x axis). The two vertical dashed lines indicate the value of \(\lambda\) that minimizes the MSE and the most regularized model such that the cross-validated error is within one standard error of the minimum, respectively. The top horizontal axis shows the number of predictors for each setting.
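For a complementary view, the full set of coefficient paths stored in the fitted object can also be plotted on the same log-lambda axis. This plot is not shown in the original text.

# Complementary view (not shown in the original text): the coefficient paths of
# the underlying glmnet fit as a function of log(lambda).
plot(lambda_cv$glmnet.fit, xvar = 'lambda', label = TRUE)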
The resulting object contains a coef method that allows us to examine the selected variables and their coefficients for any \(\lambda\) in the sequence. Additionally, it provides text shortcuts for the two values indicated by the vertical dashed lines.
coef(lambda_cv, s = 'lambda.min')
34 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -3.748582e-01
product1 2.860844e-02
product2 .
product3 4.284695e-02
product4 -9.690620e-03
product5 1.509501e-02
product6 -2.695724e-03
product7 2.311279e-02
product8 2.467449e-05
product9 -3.094944e-02
product10 -4.206163e-02
product11 9.665337e-02
product12 -1.187188e-02
product13 .
product14 1.235098e-02
product15 2.704873e-02
product16 1.416197e-02
product17 1.575724e-03
product18 2.486571e-02
product19 3.859810e-02
product20 7.090450e-02
product21 9.734368e-02
product22 .
product23 .
product24 .
product25 -3.640788e-02
product26 7.752064e-02
product27 .
product28 -1.483790e-01
product29 6.341041e-03
product30 -9.723473e-02
product31 1.066508e-01
product32 .
product33 5.462544e-02
As we observed in the plot, the \(\lambda\) that minimizes the MSE yields a large model with 26 variables – roughly half the sample size. Moreover, many of these variables have negative coefficients, which is unexpected. This suggests that a slightly stronger regularization may be necessary.
At this point, it is worth noting that introducing some discretion into the process is perfectly acceptable if we believe the resulting model should be more or less restrictive. For instance, we can choose \(\lambda\) to yield a more parsimonious model with six variables. According to the plot, \(\lambda \approx 0.5\) should accomplish this.
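Rather than reading the value off the plot, the \(\lambda\) sequence stored in the cross-validation object can be inspected directly. A minimal sketch (not in the original text):

# Illustrative alternative to eyeballing the plot: inspect the lambda sequence
# stored in the cross-validation object and keep values that select at most
# six predictors.
tibble(
  lambda  = lambda_cv$lambda,
  nonzero = lambda_cv$nzero
) |>
  filter(nonzero <= 6)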
coef(lambda_cv, s = 0.5)
34 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 0.23186267
product1 .
product2 .
product3 .
product4 .
product5 .
product6 .
product7 0.02049885
product8 .
product9 .
product10 .
product11 .
product12 .
product13 .
product14 .
product15 .
product16 .
product17 .
product18 .
product19 0.02332652
product20 0.03652025
product21 0.09091570
product22 .
product23 .
product24 .
product25 .
product26 .
product27 .
product28 .
product29 .
product30 .
product31 0.02172697
product32 .
product33 0.02922538
We now have a parsimonious model with all the coefficients positive. Let’s check the fitted values.
perfume_prices_lasso_fit_results <- perfume_prices_cpi_br |>
  select(date, cpi_perfume) |>
  mutate(
    .fitted_lasso = predict(lambda_cv$glmnet.fit, newx = perfume_prices_matrix, s = 0.5)
  )

perfume_prices_lasso_fit_results |>
  ggplot(aes(x = date)) +
  geom_line(aes(y = cpi_perfume, color = 'CPI'), lwd = 1) +
  geom_line(aes(y = .fitted_lasso, color = 'LASSO'), lwd = 1) +
  scale_y_continuous(labels = function(x) paste0(x, '%')) +
  scale_x_date(date_breaks = '6 months', date_labels = '%b/%y') +
  labs(
    title = 'CPI for perfumes vs. fitted values - %MoM NSA',
    x = '',
    y = '%MoM NSA',
    color = ''
  )
The six variables produced a reasonable fit to the data, especially in the most recent period. Recall that in the previous section, when estimating the OLS model, we included two lags of the response variable to account for autocorrelation in the residuals. These lags improved the model’s \(R^2\) and could also be incorporated into the selection procedure here to enhance performance.
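One way to implement that extension, not done in the original text, is to augment the predictor matrix with lags of the target before running the cross-validated LASSO. A sketch, dropping the first rows where the lags are missing:

# Sketch (an extension not implemented in the text): augment the predictor
# matrix with two lags of the target before the cross-validated LASSO.
predictors_with_lags <- cbind(
  perfume_prices_matrix,
  cpi_lag1 = dplyr::lag(target, 1),
  cpi_lag2 = dplyr::lag(target, 2)
)[-c(1, 2), ]                                   # drop rows with missing lags

set.seed(1)
lambda_cv_lags <- glmnet::cv.glmnet(x = predictors_with_lags, y = target[-c(1, 2)])
lambda_cv_lags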
Additionally, we could further analyze the model by adjusting the number of selected variables, either restricting or expanding them. However, the primary goal here was to demonstrate the fundamentals of LASSO estimation. Those interested in more details can find additional resources on the glmnet official page.
Finally, it is important to note that, as discussed earlier in this chapter, website prices tend to be highly volatile and noisy. While the PCA factors helped mitigate these issues, in the LASSO estimation, we used raw prices as inputs. A more effective approach would be to first apply a smoothing filter to the series before performing the estimation.
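As an illustration of this idea – the choice of filter and window is an assumption, not something prescribed by the text – one option is a short centered rolling mean applied to each product series before re-running the estimation:

# Sketch (the filter choice is an assumption, not prescribed by the text):
# a centered three-period rolling mean applied to each product series.
library(slider)

perfume_prices_smoothed <- perfume_prices_cpi_br |>
  arrange(date) |>
  mutate(
    across(
      starts_with('product'),
      ~ slider::slide_dbl(.x, mean, .before = 1, .after = 1)
    )
  )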