1 Intro to Tidyverse
The purpose of this chapter is to introduce the main functions of the core packages in the tidyverse. It’s not intended to be a thorough description of every function; rather, the idea is to provide the basic tools so that those who are either base R users or aren’t R users at all can follow along with the rest of the book. For a more detailed approach to the tidyverse, the best source is undoubtedly the great Grolemund and Wickham (2017). For those who feel comfortable with the tidyverse, a quick skim through the chapter might be enough.
1.1 What’s tidyverse?
To start, what’s the tidyverse? According to the official website:
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
From my own experience, I can assure you that the tidyverse is all you need to carry out data analyses as neat and flexible as possible. By neat, I mean succinct code without any redundancy – i.e., a single block of code (not a bunch of loose objects) doing all the desired tasks. Flexible, in turn, means the code is general enough. It depends on the nature of the task at hand, so take it more as a philosophy. For example, does the solution still work if the data is updated with new values? Or does it rely on changeable references such as column positions?
Most of the time, writing neat and flexible code takes longer and requires a dose of creativity, but it surely pays off. Moreover, with practice, this will become more and more natural, and the cost will get significantly lower. In the next sections, we’ll review the most used functions to go through all the steps of everyday data analysis using core tidyverse packages designed to Import, Wrangle, Program, and Plot.
1.2 Importing
1.2.1 Reading from flat files
readr
is the tidyverse package used to import data from flat files (basically .csv
and .txt
files). The most generic function is read_delim
, as it has a large set of parameters that allow us to declare the structure of the data we want to import. In practice, we often use read_csv
, which is a special case of the former with some predefined settings suitable for .csv
files — the most obvious being the comma as the column delimiter.
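Just to make the relationship concrete, here is a minimal sketch (the file name is hypothetical): for a comma-delimited file, spelling the delimiter out in read_delim and calling read_csv should give the same result.
library(tidyverse)

# read_delim with the delimiter made explicit...
my_data <- read_delim('data/my_file.csv', delim = ',')

# ...should be equivalent to read_csv, which presets the comma delimiter
my_data <- read_csv('data/my_file.csv')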
If you’re running RStudio IDE, I recommend clicking on Import Dataset in the Environment tab (by default, it’s located in the upper-right panel) to manually define the appropriate settings for your file. I know we should never be encouraged to use windows, click buttons, or rely on any kind of shortcuts provided by the IDE. This is the only exception I think is worth it. First, because it can be tedious to set many specific parameters via trial and error to correctly import your data. Second, because once you have the configuration done, the code is displayed at the bottom of the window, so you can copy it and paste it into your script — and eventually get rid of this shortcut as you learn how things work. Lastly, you probably won’t repeat this action many times when working on a given task, and there’s not much to improve on this process. So yes, it’s a big waste of time trying to understand every single parameter.
Just for the sake of illustration, let’s use readr
to import the COVID dataset from Our World in Data. It’s a standard .csv
file, so we can use read_csv
without any additional parameters. We just need to provide the file path:
library(tidyverse)

data_path <- 'data/owid-covid-data.csv'
covid_data <- read_csv(data_path)

or the URL:
library(tidyverse)
data_url <- 'https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv'
covid_data <- read_csv(data_url)
Tidyverse also provides packages for reading other usual file types. For instance, we can use readxl
to read MS Excel spreadsheets and haven
to read data generated by popular statistical software packages (SPSS, Stata and SAS) in much the same way we did with readr for .csv files (you can check them in the Import Dataset dropdown menu), so we won't cover them here. Rather, let's use the rest of this section to talk about something very important that usually gets less attention in books: reading data from an API.
1.2.2 Reading from API
Although the use of traditional flat files is still widespread, organizations are increasingly switching to APIs to make data available to the public. This is a huge improvement, since we're able to customize our request and import only the data we need for the specific task.
If you don’t know what an API is, think of it as an interface to a database. However, instead of reading all the content from this database, you can specify which columns you want, the desired observations (time range if it’s a time series), and so on. These filters are embedded in a request, and the correct way to specify it is usually available in documentation provided by the API maintainer.
For example, the Federal Reserve Bank of St. Louis maintains a huge repository with hundreds of thousands of economic time series, and we can import them using an API. The information on how to build the request is available in the documentation, where we can see a link with instructions on how to "Get the observations or data values for an economic data series." There, we find an example of a request for the US Real Gross National Product:
https://api.stlouisfed.org/fred/series/observations?series_id=GNPCA&api_key=abcdefghijklmnopqrstuvwxyz123456&file_type=json
We can break this request down to see its parts in more detail:
- The main (static) URL to access the Observations section:
\[ \underbrace{\text{https://api.stlouisfed.org/fred/series/observations?}}_\text{Static: API URL} \]
- The series ID, which identifies the US Real Gross National Product data.
\[ \underbrace{\text{series\_id=GNPCA}}_\text{Parameter: Series ID} \]
- The API Key. We must create an account and request a personal key in order to import data from the API. This one is an example for illustrative purposes only.
\[ \underbrace{\text{api\_key=abcdefghijklmnopqrstuvwxyz123456}}_\text{Parameter: API Key} \]
- The file type for the output. There are several types, but I prefer working with JSON.
\[ \underbrace{\text{file\_type=json}}_\text{Parameter: File Type} \]
Note that all the parameters following the static part are separated by an &. If we want to add any extra parameter, it should be placed right after this character. Also, the order of the parameters is not relevant.
Suppose we want to read monthly data for the US Consumer Price Index (CPI). The series_id
is CPALTT01USM657N. Additionally, we’d like to read data only between January 2010 and December 2022. How can we do so? There are two parameters – observation_start
and observation_end
– which set the range of the observation period (YYYY-MM-DD format).
The code below creates a separate object for each part of the request. Then, we use the glue
function from the homonymous package to merge the pieces into a single string adding the & character between parameters. Note that we could create the full string all at once, but setting separate objects makes it easier to find and edit values when necessary, as well as to transform this task into a custom function if we had to read data from this source often (more on this later).
You’ll need to register in order to be granted access to the API. Since this information is personal, I stored my key as an environment variable named api_fred_key
, which can be accessed like any other object. If you want to create an environment variable, the easiest way is through the function edit_r_environ()
from the usethis
package. This will open your .Renviron
file, where you should place the following line: variable_name=variable_value
. Then, you must save it and restart your R session for changes to take effect. Now, you can get the value from your environment variable using Sys.getenv('variable_name')
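As a minimal sketch of that workflow (the key value below is obviously fake):
library(usethis)

# Opens your .Renviron file for editing
edit_r_environ()

# Inside .Renviron, add a line such as:
# api_fred_key=abcdefghijklmnopqrstuvwxyz123456
# Save the file and restart the R session, then retrieve the value with:
Sys.getenv('api_fred_key')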
library(glue)
api_url       <- 'https://api.stlouisfed.org/fred/series/observations?'
api_fred_key  <- Sys.getenv('api_fred_key')
api_series_id <- 'CPALTT01USM657N'
obs_start     <- '2010-01-01'
obs_end       <- '2022-12-01'
api_filetype  <- 'json'

api_request <- glue('{api_url}series_id={api_series_id}&observation_start={obs_start}&observation_end={obs_end}&api_key={api_fred_key}&file_type={api_filetype}')
Now we use the httr
package to connect to the API, send the request, and get the content. The other steps transform the content from JSON to a standard R object (a list) and then convert it to a tibble (the tidyverse’s improved format for data frames). Notice that the CPI data is stored in the list element named observations
. However, this is specific to this API and if we were to import data from another source, we’d have to check in which element the data is stored.
library(httr)
library(jsonlite)
library(tidyverse)
cpi_request <- GET(url = api_request)
cpi_content <- content(cpi_request, as = 'text')
cpi_list    <- fromJSON(cpi_content, flatten = FALSE)
cpi_tbl     <- cpi_list[['observations']] |> as_tibble()
cpi_tbl
# A tibble: 156 × 4
realtime_start realtime_end date value
<chr> <chr> <chr> <chr>
1 2025-08-29 2025-08-29 2010-01-01 0.34174735701485
2 2025-08-29 2025-08-29 2010-02-01 0.024920738207648
3 2025-08-29 2025-08-29 2010-03-01 0.4106283536571079
4 2025-08-29 2025-08-29 2010-04-01 0.173688491069743
5 2025-08-29 2025-08-29 2010-05-01 0.0775197354237721
6 2025-08-29 2025-08-29 2010-06-01 -0.0976267084673888
7 2025-08-29 2025-08-29 2010-07-01 0.0211043057371509
8 2025-08-29 2025-08-29 2010-08-01 0.138066427840807
9 2025-08-29 2025-08-29 2010-09-01 0.0581736230715569
10 2025-08-29 2025-08-29 2010-10-01 0.12451988884769599
# ℹ 146 more rows
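Note that every column came back as character. Before doing any analysis you'll probably want to convert date and value to proper types, along the lines of the sketch below:
cpi_tbl <- cpi_tbl |> 
  mutate(
    date  = as.Date(date),
    value = as.numeric(value)
  )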
This concludes our section on how to import data. If you’re struggling with other types of data, the tidyverse’s official website provides a comprehensive list of all the supported file formats and the respective packages used to handle them. Also, you can take a look at the rio
package, which makes it easy to import/export data from/to different file extensions.
1.3 Wrangling
This is by far the most important section in this chapter. The main goal here is to provide a general sense of how to get raw data ready to use. For this, we’ll focus on the roles of the main functions from the dplyr
package rather than the idiosyncrasies and generalizations of each one. More sophisticated applications will be shown in the next chapters.
1.3.1 Data manipulation
Let’s use the COVID data we imported in the last section. Starting with the glimpse
function to get a grasp of the data, we can see some useful information such as the number of rows and columns, as well as the columns’ names and classes (whether they’re character, double, etc.).
The COVID-19 data set from Our World in Data is available in the R4ER2data
package under the name owid_covid
.
covid_data <- R4ER2data::owid_covid

covid_data |> glimpse()
Rows: 346,671
Columns: 67
$ iso_code <chr> "AFG", "AFG", "AFG", "AFG",…
$ continent <chr> "Asia", "Asia", "Asia", "As…
$ location <chr> "Afghanistan", "Afghanistan…
$ date <date> 2020-01-03, 2020-01-04, 20…
$ total_cases <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_cases <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ new_cases_smoothed <dbl> NA, NA, NA, NA, NA, 0, 0, 0…
$ total_deaths <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ new_deaths_smoothed <dbl> NA, NA, NA, NA, NA, 0, 0, 0…
$ total_cases_per_million <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_cases_per_million <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ new_cases_smoothed_per_million <dbl> NA, NA, NA, NA, NA, 0, 0, 0…
$ total_deaths_per_million <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_deaths_per_million <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ new_deaths_smoothed_per_million <dbl> NA, NA, NA, NA, NA, 0, 0, 0…
$ reproduction_rate <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ icu_patients <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ icu_patients_per_million <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ hosp_patients <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ hosp_patients_per_million <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ weekly_icu_admissions <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ weekly_icu_admissions_per_million <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ weekly_hosp_admissions <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ weekly_hosp_admissions_per_million <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ total_tests <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_tests <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ total_tests_per_thousand <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_tests_per_thousand <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_tests_smoothed <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_tests_smoothed_per_thousand <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ positive_rate <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ tests_per_case <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ tests_units <chr> NA, NA, NA, NA, NA, NA, NA,…
$ total_vaccinations <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ people_vaccinated <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ people_fully_vaccinated <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ total_boosters <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_vaccinations <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_vaccinations_smoothed <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ total_vaccinations_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ people_vaccinated_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ people_fully_vaccinated_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ total_boosters_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_vaccinations_smoothed_per_million <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_people_vaccinated_smoothed <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ new_people_vaccinated_smoothed_per_hundred <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ stringency_index <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ population_density <dbl> 54.422, 54.422, 54.422, 54.…
$ median_age <dbl> 18.6, 18.6, 18.6, 18.6, 18.…
$ aged_65_older <dbl> 2.581, 2.581, 2.581, 2.581,…
$ aged_70_older <dbl> 1.337, 1.337, 1.337, 1.337,…
$ gdp_per_capita <dbl> 1803.987, 1803.987, 1803.98…
$ extreme_poverty <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ cardiovasc_death_rate <dbl> 597.029, 597.029, 597.029, …
$ diabetes_prevalence <dbl> 9.59, 9.59, 9.59, 9.59, 9.5…
$ female_smokers <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ male_smokers <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ handwashing_facilities <dbl> 37.746, 37.746, 37.746, 37.…
$ hospital_beds_per_thousand <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.…
$ life_expectancy <dbl> 64.83, 64.83, 64.83, 64.83,…
$ human_development_index <dbl> 0.511, 0.511, 0.511, 0.511,…
$ population <dbl> 41128772, 41128772, 4112877…
$ excess_mortality_cumulative_absolute <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ excess_mortality_cumulative <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ excess_mortality <dbl> NA, NA, NA, NA, NA, NA, NA,…
$ excess_mortality_cumulative_per_million <dbl> NA, NA, NA, NA, NA, NA, NA,…
We’re usually not interested in all of these data, so the first couple of tasks we’d like to perform are to filter the relevant categories and select the desired columns. For example, we could be interested in analyzing new COVID cases (column new_cases
) only in Brazil (rows equal to Brazil
in column location
). Furthermore, we’d like to get rid of duplicate rows, if there are any.
One of the great contributions of tidyverse is to assign names (verbs) to functions according to the actions they perform – many are admittedly SQL-inspired. For example, distinct
drops observations (rows) that are not unique, whereas select
picks variables based on their names (or positions). The exception is filter
, which retains the rows that satisfy a given condition (the analogue of WHERE in SQL).
Conditions are expressions that return TRUE
or FALSE
. It’s straightforward to think of conditions using logical operators, such as ==
, >
, <
, etc. Nevertheless, there are a bunch of expressions in R which return TRUE
or FALSE
. Moreover, we can always create our own condition to generate the desired output. We’ll see many examples throughout this book.
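Just to make the idea concrete, here are a few sketches of conditions that could go inside filter, using columns from the COVID dataset (the threshold and country names are arbitrary):
covid_data |> filter(new_cases > 100000)                      # comparison operator
covid_data |> filter(location %in% c('Brazil', 'Argentina'))  # set membership
covid_data |> filter(!is.na(total_deaths))                    # a condition built from a function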
The code below performs the initial steps described above to generate a subset of the data.
covid_data_sub1 <- covid_data |> 
  distinct() |> 
  select(date, continent, location, new_cases) |> 
  filter(location == 'Brazil')
Next, we may need to create additional columns. For instance, suppose that actual new cases in Brazil are much higher than reported because the country doesn’t have enough tests, and a conservative estimate points to a number of, say, twice the official count. So we’d like to add a column that is twice the value of the original one, representing our guess of the real situation. In addition, we’d also like to create a column to indicate the dominant strain at each period of time. We know that the Delta strain took over Gamma by the end of July 2021 and then Omicron took over Delta at the beginning of 2022.
The mutate
verb can be used to create new variables as a function of existing ones. Also, it can be used to modify existing variables, as new variables overwrite those with the same name. case_when
is another SQL-inspired function used inside mutate
to create a new variable based on conditions. It’s worth noting that it returns NA
if no condition is met. In this case, a useful workaround is to define an extra condition as TRUE ~ value
, thus assigning a value to all unmet conditions – think of this as an else statement.
covid_data_sub2 <- covid_data_sub1 |> 
  mutate(
    real_new_cases = 2 * new_cases,
    dominant_strain = case_when(
      date <= '2021-07-31' ~ 'Gamma',
      date > '2021-07-31' & date <= '2021-12-31' ~ 'Delta',
      date > '2021-12-31' & date <= '2022-02-01' ~ 'Omicron',
      TRUE ~ "We don't know"
    )
  )
The between
function is a shortcut for numeric conditions that are bounded both on the left and on the right. It also works with dates if we declare the arguments as date objects. Therefore, we can replace conditions 2 and 3 to have more compact and efficient code (it’s implemented in C++, like many modern R functions).
covid_data_sub2 <- covid_data_sub1 |> 
  mutate(
    real_new_cases = 2 * new_cases,
    dominant_strain = case_when(
      date <= '2021-07-31' ~ 'Gamma',
      between(date, as.Date('2021-07-31'), as.Date('2021-12-31')) ~ 'Delta',
      between(date, as.Date('2021-12-31'), as.Date('2022-02-01')) ~ 'Omicron',
      TRUE ~ "We don't know"
    )
  )
So far we’ve worked on a single group of data: new cases in Brazil. However, we usually have many categories to work on. We might be interested in analyzing new cases in all European countries, for example. In this case, we’ll need the group_by
function, which allows us to perform operations by group. group_by
is often used in conjunction with mutate
or summarise
to create new data for each group. The latter uses aggregate functions (mean
, max
, min
, etc.) to produce a summary of the data.
For example, suppose we want to know which European country recorded the highest number of new COVID cases in a single day by mid-2022. This might be achieved by grouping the data by location and then using summarise
with max
. In addition, we can use arrange
to sort the rows by value (we use desc
to sort in descending order). Don’t forget to ungroup
the data as soon as you no longer need to perform grouped operations.
covid_data_sub3 <- covid_data |> 
  distinct() |> 
  filter(
    continent == 'Europe',
    date <= '2022-06-30'
  ) |> 
  group_by(location) |> 
  summarise(max_new_cases = max(new_cases)) |> 
  ungroup() |> 
  arrange(desc(max_new_cases))

covid_data_sub3
# A tibble: 55 × 2
location max_new_cases
<chr> <dbl>
1 France 2417043
2 Germany 1588891
3 Spain 956506
4 Netherlands 626302
5 Portugal 405306
6 United Kingdom 275647
7 Ukraine 242942
8 Greece 241319
9 Italy 228123
10 Russia 203949
# ℹ 45 more rows
Note that some countries returned NA (they don't appear among the rows printed above because arrange sends NAs to the bottom of the table). This happens because those countries have missing values in new_cases. We could easily ignore this by passing the argument na.rm = TRUE to the max function. However, another problem would arise: countries with no data on new cases at all would return -Inf. To get around both issues, we can filter out all the missing values in the dataset using the is.na() function (the ! before the condition works as a negation). In this case, removing the missing values isn't a problem, but be aware that for some tasks this may influence the outcome.
covid_data_sub4 <- covid_data |> 
  distinct() |> 
  filter(
    continent == 'Europe',
    date <= '2022-06-30',
    !is.na(new_cases)
  ) |> 
  group_by(location) |> 
  summarise(max_new_cases = max(new_cases)) |> 
  ungroup() |> 
  arrange(desc(max_new_cases))

covid_data_sub4
# A tibble: 51 × 2
location max_new_cases
<chr> <dbl>
1 France 2417043
2 Germany 1588891
3 Spain 956506
4 Netherlands 626302
5 Portugal 405306
6 United Kingdom 275647
7 Ukraine 242942
8 Greece 241319
9 Italy 228123
10 Russia 203949
# ℹ 41 more rows
In addition, we might want to know not only what the highest numbers were but also when they occurred (the peak date). Since the summarise
function is designed to return a single value, we must use an expression that returns a single value. If we used date = max(date)
, we’d keep the most recent date for the data in each country. Definitely, that’s not what we want. So, a good way to address this issue is to combine a subset operation with a condition inside. In simpler terms, we’ll subset from the date
column the observation where new cases were at their high. Since we can have multiple dates that satisfy this condition, we’ll keep the most recent one (the max
of them).
covid_data_sub5 <- covid_data |> 
  distinct() |> 
  filter(
    continent == 'Europe',
    date <= '2022-06-30',
    !is.na(new_cases)
  ) |> 
  group_by(location) |> 
  summarise(
    max_new_cases = max(new_cases),
    peak_date = date[which(new_cases == max_new_cases)] |> max()
  ) |> 
  ungroup() |> 
  arrange(desc(max_new_cases), peak_date) |> 
  slice(1:10)

covid_data_sub5
# A tibble: 10 × 3
location max_new_cases peak_date
<chr> <dbl> <date>
1 France 2417043 2022-01-23
2 Germany 1588891 2022-03-27
3 Spain 956506 2022-01-16
4 Netherlands 626302 2022-02-06
5 Portugal 405306 2022-01-30
6 United Kingdom 275647 2022-01-06
7 Ukraine 242942 2022-02-06
8 Greece 241319 2022-01-09
9 Italy 228123 2022-01-19
10 Russia 203949 2022-02-11
In case it went unnoticed, the previous code showed a really nice feature of mutate
/summarise
: you can use a variable you’ve just created in a subsequent task inside the same call. In this example, we used max_new_cases
as an input in peak_date
– all of this inside the same summarise
call as if all steps were performed sequentially.
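The same sequential behaviour applies inside mutate. A tiny sketch with made-up numbers:
tibble(new_cases = c(100, 250)) |> 
  mutate(
    real_new_cases = 2 * new_cases,
    extra_cases    = real_new_cases - new_cases  # uses the column created just above
  )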
In the last two lines, we used arrange
to sort countries first by the maximum number of new cases and then by their peak date, and slice
to subset only the first ten countries out of the fifty-one in our dataset. This ends our approach to single datasets for now.
But before we jump to the next section, there’s something very important to address: merging multiple datasets. There are two families of functions in dplyr
to merge data frames, *_join
and bind_*
. Let’s see how they work.
The *_join
functions are used to merge two data frames horizontally, matching their rows based on specified keys. For example, take the covid_data_sub5
data frame we created above. It contains the top ten European countries with the highest number of new COVID cases in a single day and their peak dates. Suppose we want to add information on the population size for each country, which is available in another data frame named europe_population
, displayed below.
The European population data set is available in the R4ER2data
package under the name europe_population
.
europe_population <- R4ER2data::europe_population

europe_population
# A tibble: 55 × 2
Country Pop_size
<chr> <dbl>
1 Russia 144713312
2 Germany 83369840
3 France 67813000
4 United Kingdom 67508936
5 Italy 59037472
6 England 56550000
7 Spain 47558632
8 Poland 39857144
9 Ukraine 39701744
10 Romania 19659270
# ℹ 45 more rows
What we want here is to add the column Pop_Size from europe_population
into covid_data_sub5
, matching rows based on location
(the column with countries in the main dataset). For this, we can use the left_join
function, which adds the content from the second data frame to the first one. The by
argument is needed because the name of the key column differs across the two datasets.
covid_data_sub5_with_pop <- covid_data_sub5 |> 
  left_join(
    europe_population, by = c('location' = 'Country')
  )

covid_data_sub5_with_pop
# A tibble: 10 × 4
location max_new_cases peak_date Pop_size
<chr> <dbl> <date> <dbl>
1 France 2417043 2022-01-23 67813000
2 Germany 1588891 2022-03-27 83369840
3 Spain 956506 2022-01-16 47558632
4 Netherlands 626302 2022-02-06 17564020
5 Portugal 405306 2022-01-30 10270857
6 United Kingdom 275647 2022-01-06 67508936
7 Ukraine 242942 2022-02-06 39701744
8 Greece 241319 2022-01-09 10384972
9 Italy 228123 2022-01-19 59037472
10 Russia 203949 2022-02-11 144713312
We could have been assigned a slightly different task – for example, adding to europe_population
the information on maximum daily number of new COVID cases and peak dates we have from covid_data_sub5
. This can also be achieved with left_join
by reversing the order of the data frames (and the names of the key column since they’re not the same). An effortless alternative is to replace left_join
with right_join
, which adds to the second data frame (the one on the right) the information from the first data frame (the one on the left). In this case, we don’t need to change the order of the parameters.
pop_with_covid_info <- covid_data_sub5 |> 
  right_join(
    europe_population, by = c('location' = 'Country')
  )

pop_with_covid_info
# A tibble: 55 × 4
location max_new_cases peak_date Pop_size
<chr> <dbl> <date> <dbl>
1 France 2417043 2022-01-23 67813000
2 Germany 1588891 2022-03-27 83369840
3 Spain 956506 2022-01-16 47558632
4 Netherlands 626302 2022-02-06 17564020
5 Portugal 405306 2022-01-30 10270857
6 United Kingdom 275647 2022-01-06 67508936
7 Ukraine 242942 2022-02-06 39701744
8 Greece 241319 2022-01-09 10384972
9 Italy 228123 2022-01-19 59037472
10 Russia 203949 2022-02-11 144713312
# ℹ 45 more rows
At first glance, it seems to produce the same outcome, but notice that the number of rows is different in the two resulting datasets. In the first case, left_join
kept all the rows from covid_data_sub5
and only the corresponding rows from europe_population
, whereas right_join
did the opposite. In summary, left_join
and right_join
keep all the rows from just one of the two data frames.
Sometimes, however, we want to keep all the rows from both data frames. For example, imagine that covid_data_sub5
had data not only on European but also on South American countries as well. In addition, the europe_population
data frame also included the populations of Asian countries. If we merged the data frames using the functions above, we’d end up losing observations on either COVID in South America or populations in Asia.
In order to merge the data frames by their common countries while keeping the remaining observations from both, we should employ the full_join
function. On the other hand, if we’re willing to keep only the common countries in both data frames, this is the case for the inner_join
function.
These four functions comprise what is called mutating joins since they add columns from one data frame to the other. There are also the filtering joins functions, which filter the rows from one data frame based on the presence (semi_join
) or absence (anti_join
) of matches in the other data frame. Given that the latter category is used to a much lesser extent, I won’t go into detail right now. Eventually, they’ll show up in the coming chapters.
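Still, for reference, here is a quick sketch using the toy data frames from above. Filtering joins return rows from the first data frame only, without adding any columns:
semi_join(covid_sample, pop_sample, by = 'location') # France: rows of covid_sample WITH a match in pop_sample
anti_join(covid_sample, pop_sample, by = 'location') # Brazil: rows of covid_sample WITHOUT a match in pop_sample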
We saw that *_join
operations are particularly useful to merge data frames horizontally and by their matching rows according to key variables. More often than not, we need to stack data vertically by their matching columns. For example, suppose the COVID dataset was released as a single file for each country and we needed to perform some comparisons between France and the United Kingdom – the covid_fr
and covid_uk
datasets shown below.
# A tibble: 1,371 × 3
date location new_cases
<date> <chr> <dbl>
1 2020-01-03 United Kingdom 0
2 2020-01-04 United Kingdom 0
3 2020-01-05 United Kingdom 0
4 2020-01-06 United Kingdom 0
5 2020-01-07 United Kingdom 0
6 2020-01-08 United Kingdom 0
7 2020-01-09 United Kingdom 0
8 2020-01-10 United Kingdom 0
9 2020-01-11 United Kingdom 0
10 2020-01-12 United Kingdom 0
# ℹ 1,361 more rows
# A tibble: 1,371 × 4
date location new_cases total_cases
<date> <chr> <dbl> <dbl>
1 2020-01-03 France 0 NA
2 2020-01-04 France 0 NA
3 2020-01-05 France 0 NA
4 2020-01-06 France 0 NA
5 2020-01-07 France 0 NA
6 2020-01-08 France 0 NA
7 2020-01-09 France 0 NA
8 2020-01-10 France 0 NA
9 2020-01-11 France 0 NA
10 2020-01-12 France 0 NA
# ℹ 1,361 more rows
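These two data frames aren't created in the text. If you want to follow along, something like the sketch below should reproduce them from covid_data (the selected columns are an assumption based on the outputs shown above):
covid_uk <- covid_data |> 
  filter(location == 'United Kingdom') |> 
  select(date, location, new_cases)

covid_fr <- covid_data |> 
  filter(location == 'France') |> 
  select(date, location, new_cases, total_cases)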
This can be easily accomplished with bind_rows
, which, unlike the *_join
family, allows us to provide several data frames. What’s cool about bind_rows
is that non-matching columns are kept, with their values filled with NA
for the data frames where the column is absent. In the example above, the covid_fr
dataset has a total cases column that is absent in covid_uk
.
covid_fr_uk <- bind_rows(covid_fr, covid_uk)

covid_fr_uk
# A tibble: 2,742 × 4
date location new_cases total_cases
<date> <chr> <dbl> <dbl>
1 2020-01-03 France 0 NA
2 2020-01-04 France 0 NA
3 2020-01-05 France 0 NA
4 2020-01-06 France 0 NA
5 2020-01-07 France 0 NA
6 2020-01-08 France 0 NA
7 2020-01-09 France 0 NA
8 2020-01-10 France 0 NA
9 2020-01-11 France 0 NA
10 2020-01-12 France 0 NA
# ℹ 2,732 more rows
The bind_cols
function is more restrictive in this regard. It’s used to merge the columns of data frames, but matches by row position rather than a key variable. For this reason, we can’t use data frames of different lengths. In practice, it’s much more common (and safer) to rely on *_join
functions when we need to merge data frames horizontally.
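A toy sketch of bind_cols with made-up values, just to show the position-based matching:
left_part  <- tibble(date = as.Date(c('2020-01-03', '2020-01-04')), new_cases = c(0, 3))
right_part <- tibble(new_deaths = c(0, 1))

bind_cols(left_part, right_part)  # works because both have two rows, but nothing guarantees the rows truly correspond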
1.3.2 Data layout
We’ve walked through the main functions of dplyr
. Now, we turn to the tidyr
package. According to the tidyverse’s website, the goal of tidyr
is to help us create tidy data. This means a dataset where every column is a variable, every row is an observation, and each cell is a single value.
Tidy data is also known as wide format – since it increases the number of columns and decreases the number of rows – as opposed to the long format, where data are stacked, increasing the number of rows and decreasing the number of columns. It took me a while before I could tell whether a dataset was in wide or long format, so don't worry if it's not so clear right now. Perhaps a more direct way of thinking about this distinction is to ask yourself: is all the information contained in a single cell? If so, it's wide format; if not, it's long format.
For example, is the COVID dataset in wide or long format? If we take a single cell from the new_cases column, does it convey all the information for this variable contained in the dataset? No, it doesn't. We know the number of new cases on a given date, but we don't know which country that value refers to – is this from Germany? Nigeria? Chile?
We can use the pivot_wider
function from the tidyr
package to convert from long to wide format. The syntax is very easy to understand: names_from
is the column we want to widen, whereas values_from
is the column containing the observations that will fill each cell. id_cols
is a parameter used to declare the set of columns that uniquely identify each observation. In practice, it drops all the other columns in the dataset. Hence, if we want to keep all the other variables, we can just skip it.
covid_wide_01 <- covid_data |> 
  pivot_wider(
    id_cols = 'date',
    names_from = 'location',
    values_from = 'new_cases'
  )

covid_wide_01
# A tibble: 1,380 × 256
date Afghanistan Africa Albania Algeria `American Samoa` Andorra Angola
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2020-01-03 0 0 0 0 0 0 0
2 2020-01-04 0 0 0 0 0 0 0
3 2020-01-05 0 0 0 0 0 0 0
4 2020-01-06 0 0 0 0 0 0 0
5 2020-01-07 0 0 0 0 0 0 0
6 2020-01-08 0 0 0 0 0 0 0
7 2020-01-09 0 0 0 0 0 0 0
8 2020-01-10 0 0 0 0 0 0 0
9 2020-01-11 0 0 0 0 0 0 0
10 2020-01-12 0 0 0 0 0 0 0
# ℹ 1,370 more rows
# ℹ 248 more variables: Anguilla <dbl>, `Antigua and Barbuda` <dbl>,
# Argentina <dbl>, Armenia <dbl>, Aruba <dbl>, Asia <dbl>, Australia <dbl>,
# Austria <dbl>, Azerbaijan <dbl>, Bahamas <dbl>, Bahrain <dbl>,
# Bangladesh <dbl>, Barbados <dbl>, Belarus <dbl>, Belgium <dbl>,
# Belize <dbl>, Benin <dbl>, Bermuda <dbl>, Bhutan <dbl>, Bolivia <dbl>,
# `Bonaire Sint Eustatius and Saba` <dbl>, `Bosnia and Herzegovina` <dbl>, …
Notice that we now have 256 columns (date plus one column per location) rather than the 67 of the original dataset, with each cell conveying the whole information: new cases on a given date for a specific country. So, what if we wanted to have both new_cases
and new_deaths
columns in wide form? We just need to provide a vector with the desired variables in values_from
. By default, the new columns will be named according to the following pattern: variable_location.
Since the variable names already contain a single underscore, it's a good idea to set a different character as a separator – a double underscore works fine. This is because we might need to reverse the operation later for a given task, and then it's much easier to identify the separator. Otherwise, we'd have to use a regular expression to specify the exact position of the repeated character – for example, whether it's the first or the second underscore.
covid_wide_02 <- covid_data |> 
  pivot_wider(
    id_cols = 'date',
    names_from = 'location',
    values_from = c('new_cases', 'new_deaths'),
    names_sep = '__'
  )

covid_wide_02
# A tibble: 1,380 × 511
date new_cases__Afghanistan new_cases__Africa new_cases__Albania
<date> <dbl> <dbl> <dbl>
1 2020-01-03 0 0 0
2 2020-01-04 0 0 0
3 2020-01-05 0 0 0
4 2020-01-06 0 0 0
5 2020-01-07 0 0 0
6 2020-01-08 0 0 0
7 2020-01-09 0 0 0
8 2020-01-10 0 0 0
9 2020-01-11 0 0 0
10 2020-01-12 0 0 0
# ℹ 1,370 more rows
# ℹ 507 more variables: new_cases__Algeria <dbl>,
# `new_cases__American Samoa` <dbl>, new_cases__Andorra <dbl>,
# new_cases__Angola <dbl>, new_cases__Anguilla <dbl>,
# `new_cases__Antigua and Barbuda` <dbl>, new_cases__Argentina <dbl>,
# new_cases__Armenia <dbl>, new_cases__Aruba <dbl>, new_cases__Asia <dbl>,
# new_cases__Australia <dbl>, new_cases__Austria <dbl>, …
Our new dataset expanded to 511 columns. As we create more and more columns to get a wide dataset, it might become harder to perform some simple tasks. In fact, using filter
is generally easier than using a conditional select
when we want to keep only the relevant data.
In summary, long format may be preferable to wide format when the dataset contains more than one grouping variable or we want to work with more than one variable. Besides, long format datasets are ideal for plotting with the ggplot2
package as we’ll see later.
Therefore, it’s not unusual to convert a dataset from wide to long format. The syntax is very similar to what we saw earlier when converting from long to wide format. The only difference is in the cols
argument, used to declare which columns we want to stack. However, since wide data sets usually have a large number of columns and we’re often interested in putting all of them in long format, it’s much easier to declare which columns we want to leave out using the -
operator.
covid_long_01 <- covid_wide_02 |> 
  pivot_longer(
    cols = -'date',
    names_to = c('variable', 'location'),
    values_to = 'value',
    names_sep = '__'
  )

covid_long_01
# A tibble: 703,800 × 4
date variable location value
<date> <chr> <chr> <dbl>
1 2020-01-03 new_cases Afghanistan 0
2 2020-01-03 new_cases Africa 0
3 2020-01-03 new_cases Albania 0
4 2020-01-03 new_cases Algeria 0
5 2020-01-03 new_cases American Samoa 0
6 2020-01-03 new_cases Andorra 0
7 2020-01-03 new_cases Angola 0
8 2020-01-03 new_cases Anguilla 0
9 2020-01-03 new_cases Antigua and Barbuda 0
10 2020-01-03 new_cases Argentina 0
# ℹ 703,790 more rows
Note that now even the variables (new_cases
and new_deaths
) are stored in a single column (variable
). This is probably an abuse of language, but I call this a complete long format – as opposed to the original format, where the dataset was only partially in long format (the variables were still in wide format). For most applications, I think this is the best way to organize the data.
Converting a dataset between wide and long formats might not completely solve our problem. Sometimes, two pieces of information are merged into a single column. For example, suppose that the location
column had both the continent and country names, instead of only the country as in the original dataset. We’ll call this data set covid_loc_cont
.
# A tibble: 330,156 × 4
location date new_cases new_deaths
<chr> <date> <dbl> <dbl>
1 Afghanistan_Asia 2020-01-03 0 0
2 Afghanistan_Asia 2020-01-04 0 0
3 Afghanistan_Asia 2020-01-05 0 0
4 Afghanistan_Asia 2020-01-06 0 0
5 Afghanistan_Asia 2020-01-07 0 0
6 Afghanistan_Asia 2020-01-08 0 0
7 Afghanistan_Asia 2020-01-09 0 0
8 Afghanistan_Asia 2020-01-10 0 0
9 Afghanistan_Asia 2020-01-11 0 0
10 Afghanistan_Asia 2020-01-12 0 0
# ℹ 330,146 more rows
This is undesirable, since we can no longer use group_by
or filter
on either country or continent names alone. Hence, the best practice is to have a single column for each variable. This can be easily achieved using the separate
function, which has a highly self-explanatory syntax. Again, having a unique separator character in the string makes the job much easier.
covid_separate <- covid_loc_cont |> 
  separate(
    col = 'location',
    into = c('location', 'continent'),
    sep = '_'
  )

covid_separate
# A tibble: 330,156 × 5
location continent date new_cases new_deaths
<chr> <chr> <date> <dbl> <dbl>
1 Afghanistan Asia 2020-01-03 0 0
2 Afghanistan Asia 2020-01-04 0 0
3 Afghanistan Asia 2020-01-05 0 0
4 Afghanistan Asia 2020-01-06 0 0
5 Afghanistan Asia 2020-01-07 0 0
6 Afghanistan Asia 2020-01-08 0 0
7 Afghanistan Asia 2020-01-09 0 0
8 Afghanistan Asia 2020-01-10 0 0
9 Afghanistan Asia 2020-01-11 0 0
10 Afghanistan Asia 2020-01-12 0 0
# ℹ 330,146 more rows
The only caveat to all this simplicity arises when we have non-trivial separators. For example, imagine that we had no underscore to separate the location from the continent.
# A tibble: 330,156 × 4
location date new_cases new_deaths
<chr> <date> <dbl> <dbl>
1 AfghanistanAsia 2020-01-03 0 0
2 AfghanistanAsia 2020-01-04 0 0
3 AfghanistanAsia 2020-01-05 0 0
4 AfghanistanAsia 2020-01-06 0 0
5 AfghanistanAsia 2020-01-07 0 0
6 AfghanistanAsia 2020-01-08 0 0
7 AfghanistanAsia 2020-01-09 0 0
8 AfghanistanAsia 2020-01-10 0 0
9 AfghanistanAsia 2020-01-11 0 0
10 AfghanistanAsia 2020-01-12 0 0
# ℹ 330,146 more rows
How could we manage to separate them? Ideally, we should provide a regular expression (or simply regex) to match the appropriate pattern to split the string into location and continent (the sep
argument works with regex). If you don’t know what regex is, think of it as a code used to match patterns, positions, and all kinds of features in a string.
At first glance, a natural choice would be to split the string at the second uppercase letter. This would work for Afghanistan, France, Netherlands, Chile and all single-word countries. However, this would fail for countries with two or more words: United States, New Zealand, and many others.
Then, you might argue that a more general approach would be to use regex to match the last uppercase letter in the string. Not exactly, because we have a couple of two-word continents: North America and South America. So, for example, CanadaNorth America would be split into CanadaNorth and America instead of Canada and North America.
More often than not, the direct solution is the most difficult to implement (or we simply don’t know how to accomplish it) and so we have to think about alternative ways. Data science is all about this. Since a regex solution alone might be very tough or even unfeasible, we’ll get back to this problem later when covering text manipulation with the stringr
package.
For now, let’s just wrap up with the tidyr
package looking at two other very commonly used functions. The first one is unite
, which is the counterpart of separate
. We can use it to convert the covid_separate
data frame back to the original form as in covid_loc_cont
.
covid_unite <- covid_separate |> 
  unite(
    col = 'location',
    c('location', 'continent'),
    sep = '_'
  )

covid_unite
# A tibble: 330,156 × 4
location date new_cases new_deaths
<chr> <date> <dbl> <dbl>
1 Afghanistan_Asia 2020-01-03 0 0
2 Afghanistan_Asia 2020-01-04 0 0
3 Afghanistan_Asia 2020-01-05 0 0
4 Afghanistan_Asia 2020-01-06 0 0
5 Afghanistan_Asia 2020-01-07 0 0
6 Afghanistan_Asia 2020-01-08 0 0
7 Afghanistan_Asia 2020-01-09 0 0
8 Afghanistan_Asia 2020-01-10 0 0
9 Afghanistan_Asia 2020-01-11 0 0
10 Afghanistan_Asia 2020-01-12 0 0
# ℹ 330,146 more rows
And finally, the nest
function. The nest
function is usually used in conjunction with group_by
to create a new format of data frame in which every cell is now a list rather than a single observation and thus can store any kind of object – data frames, lists, models, plots, and so forth.
The example below shows how to create a nested data frame for the COVID data grouped by country. Note that each cell in the data
column is now a data frame itself, corresponding to the data for each country.
covid_eu_nest <- covid_data |> 
  filter(continent == 'Europe') |> 
  group_by(location) |> 
  nest()

covid_eu_nest
# A tibble: 55 × 2
# Groups: location [55]
location data
<chr> <list>
1 Albania <tibble [1,371 × 66]>
2 Andorra <tibble [1,371 × 66]>
3 Austria <tibble [1,371 × 66]>
4 Belarus <tibble [1,371 × 66]>
5 Belgium <tibble [1,371 × 66]>
6 Bosnia and Herzegovina <tibble [1,371 × 66]>
7 Bulgaria <tibble [1,378 × 66]>
8 Croatia <tibble [1,377 × 66]>
9 Cyprus <tibble [1,371 × 66]>
10 Czechia <tibble [1,378 × 66]>
# ℹ 45 more rows
I find it most useful when we need to perform several tasks over entire datasets. For example, if we had to transform the raw data, create plots, and fit models for each country in the COVID dataset. The base R solution in this case is to create a list of lists, which can be very confusing at times. We’ll come back to nested data frames when we talk about functional programming with the purrr
package.
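Just to hint at the payoff, here is a small sketch that applies a function to every nested data frame (purrr::map_int will be covered properly later):
covid_eu_nest |> 
  mutate(n_days = purrr::map_int(data, nrow))  # number of rows (days) available for each country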
1.3.3 Text manipulation
The need to handle text data has grown substantially in parallel with the popularity of machine learning models and, more specifically, natural language processing (NLP). For these more advanced applications, the challenge is to standardize a large set of (messy) texts in order to extract features that can then feed the models and generate predictions.
Nevertheless, knowing the basics of text manipulation is critical for everyday tasks. This includes subsetting parts of a word, detecting whether specific patterns are present, replacing a sequence of characters with something else, and so forth. The functions from the stringr
package do a terrific job of simplifying all these operations – and go far beyond.
Similarly to what we did in the previous sections, we’ll focus on the most widely used functions. Let’s start with str_detect
, which is conveniently used in conjunction with dplyr::filter
, since it returns TRUE
if the specific pattern is found in the string and FALSE
otherwise.
For example, let’s say we want to analyze the COVID data only for North America and South America. We’ve learned how to do so using dplyr::filter
.
covid_americas <- covid_data |> 
  filter(continent %in% c('North America', 'South America'))
This one is not cumbersome. But let’s pretend there were, say, 50 continents on Earth, with 25 having “America” in their names. Would it still make sense to write all these 25 names in a vector? Absolutely not. Since all of them share a common pattern, we can easily employ str_detect
to do the trick.
covid_americas <- covid_data |> 
  filter(stringr::str_detect(continent, 'America'))
Note that we could have left out several characters by using only, say, ‘Am’ or ‘Ame’. This would have worked fine if there were no other continents with this sequence of characters. Of course, this parsimony makes more sense for lengthy words or long sentences. For short words I recommend writing the full word in order to prevent any undesirable output. In addition, we can also provide multiple patterns to str_detect
by separating them with a |
.
covid_americas <- covid_data |> 
  filter(stringr::str_detect(continent, 'South|North'))
Finally, we may use the negate = TRUE
argument if we’re interested in the opposite of the pattern we provide – pretty much like using !
in conditions. It’s especially useful when we want to keep most categories except for one (or a few). For example, suppose that now we want to analyze every continent except for Asia. Instead of writing them all, we could simply do the following:
covid_exAsia <- covid_data |> 
  filter(str_detect(continent, 'Asia', negate = TRUE))
Another recurrent task when dealing with strings is removing part of them. It’s generally required in order to establish a standard within categories so we can perform further operations. The complexity of this task varies depending on the form of this undesired part.
The simplest case is when we have to remove a sequence of characters that is fixed both in length and in position. For example, suppose we have a data frame covid_numCont
in which the observations in the continent column start with a random number from 0 to 9 – think of it as a typing error from the source, though the same reasoning applies if those random numbers were present only in a few observations.
# A tibble: 10 × 4
continent location date new_cases
<chr> <chr> <date> <dbl>
1 4.Africa Congo 2023-05-28 0
2 8.Africa Madagascar 2023-07-20 0
3 9.North America Bonaire Sint Eustatius and Saba 2023-04-02 38
4 2.Europe Vatican 2022-07-06 0
5 2.Africa Togo 2022-01-14 328
6 6.Europe Estonia 2021-08-24 0
7 2.Europe Romania 2022-10-05 1419
8 1.Africa Kenya 2023-05-08 0
9 9.Europe Norway 2020-03-08 19
10 7.South America French Guiana 2021-01-11 0
To solve this, it’s simply a matter of subsetting the string from position three onwards using str_sub
. The end
argument defaults to the last character, so we don’t need to make it explicit.
covid_numCont |> 
  mutate(continent = str_sub(continent, start = 3))
# A tibble: 330,156 × 4
continent location date new_cases
<chr> <chr> <date> <dbl>
1 Asia Afghanistan 2020-01-03 0
2 Asia Afghanistan 2020-01-04 0
3 Asia Afghanistan 2020-01-05 0
4 Asia Afghanistan 2020-01-06 0
5 Asia Afghanistan 2020-01-07 0
6 Asia Afghanistan 2020-01-08 0
7 Asia Afghanistan 2020-01-09 0
8 Asia Afghanistan 2020-01-10 0
9 Asia Afghanistan 2020-01-11 0
10 Asia Afghanistan 2020-01-12 0
# ℹ 330,146 more rows
Nice. But what if the random numbers ranged from 0 to 10, as in the data frame below?
# A tibble: 10 × 4
continent location date new_cases
<chr> <chr> <date> <dbl>
1 2.North America Costa Rica 2021-04-19 1060
2 8.Africa Benin 2023-09-02 0
3 8.Europe Spain 2021-01-20 0
4 3.Oceania Cook Islands 2020-07-09 0
5 5.Oceania Niue 2021-12-07 0
6 3.Europe Moldova 2021-09-19 1031
7 9.Asia Philippines 2021-04-28 6862
8 3.Africa Guinea-Bissau 2022-05-26 0
9 10.Europe Slovakia 2022-03-19 9386
10 2.Africa Lesotho 2022-11-21 0
With an extra digit, we could no longer resort to the previous solution. I don’t intend to dedicate an exclusive section to regular expressions, but simple examples will eventually appear throughout the book. In this case, we could use a simple regular expression inside str_remove
to get rid of everything before and up to the dot, .*\\.
.
covid_numCont |> 
  mutate(continent = str_remove(continent, ".*\\."))
# A tibble: 330,156 × 4
continent location date new_cases
<chr> <chr> <date> <dbl>
1 Asia Afghanistan 2020-01-03 0
2 Asia Afghanistan 2020-01-04 0
3 Asia Afghanistan 2020-01-05 0
4 Asia Afghanistan 2020-01-06 0
5 Asia Afghanistan 2020-01-07 0
6 Asia Afghanistan 2020-01-08 0
7 Asia Afghanistan 2020-01-09 0
8 Asia Afghanistan 2020-01-10 0
9 Asia Afghanistan 2020-01-11 0
10 Asia Afghanistan 2020-01-12 0
# ℹ 330,146 more rows
Typing errors may arise in different forms along a character column. These cases usually require a more thorough evaluation and, more often than not, the solution is to manually replace the incorrect words with the correct ones. For instance, in the data frame below (named covid_typo
), we can find two different typos for “North America”: there’s a missing h in rows 1 and 5; whereas there’s an extra h in rows 3 and 8.
# A tibble: 56,221 × 4
continent location date new_cases
<chr> <chr> <date> <dbl>
1 Nort America Anguilla 2020-01-03 0
2 North America Anguilla 2020-01-04 0
3 Northh America Anguilla 2020-01-05 0
4 North America Anguilla 2020-01-06 0
5 Nort America Anguilla 2020-01-07 0
6 North America Anguilla 2020-01-08 0
7 North America Anguilla 2020-01-09 0
8 Northh America Anguilla 2020-01-10 0
9 North America Anguilla 2020-01-11 0
10 North America Anguilla 2020-01-12 0
# ℹ 56,211 more rows
Since the data frame contains a large number of observations, the typo may be present in other positions as well. We can use str_replace
to fix it across the entire column.
covid_typo |> 
  mutate(
    continent = str_replace(continent, 'Nort America', 'North America'),
    continent = str_replace(continent, 'Northh America', 'North America')
  )
Or we can simply pass a vector with all the replacements to the str_replace_all
function.
covid_typo |> 
  mutate(
    continent = str_replace_all(
      continent, c('Nort America'   = 'North America',
                   'Northh America' = 'North America')
    )
  )
There's one last kind of typo we can't help but mention: whitespace. Whitespace is particularly troublesome when it's misplaced at the start or end of a string or repeated within it. Because it easily goes unnoticed, the stringr package includes two functions to handle it: str_trim and str_squish. The former removes whitespace from the start and end of a string, whereas the latter also collapses repeated whitespace within a string into a single space.
The data frame below (named covid_ws
) uses the same example as above, but now with a trailing whitespace in observations 1 and 5, and repeated whitespace inside the observations 3 and 8. Note that tibble
automatically adds quotation marks around the strings to highlight the extra whitespace. This is awesome!
# A tibble: 56,221 × 4
continent location date new_cases
<chr> <chr> <date> <dbl>
1 "North America " Anguilla 2020-01-03 0
2 "North America" Anguilla 2020-01-04 0
3 "North America" Anguilla 2020-01-05 0
4 "North America" Anguilla 2020-01-06 0
5 "North America " Anguilla 2020-01-07 0
6 "North America" Anguilla 2020-01-08 0
7 "North America" Anguilla 2020-01-09 0
8 "North America" Anguilla 2020-01-10 0
9 "North America" Anguilla 2020-01-11 0
10 "North America" Anguilla 2020-01-12 0
# ℹ 56,211 more rows
We can easily get rid of the whitespace using these functions. It’s advisable to use them as a preprocessing step whenever we’re working with character columns that should not contain extra whitespace.
covid_ws |> 
  mutate(
    continent = continent |> 
      str_trim() |> 
      str_squish()
  )
To finish up, let’s use the tools we’ve just learned to solve the problem we left unresolved from the previous subsection. To recap, we wanted to separate the location
column into country and continent. The issue was that with no separator character between the two names, we would have to resort to some kind of complicated regular expression to do the trick.
# A tibble: 330,156 × 4
location date new_cases new_deaths
<chr> <date> <dbl> <dbl>
1 AfghanistanAsia 2020-01-03 0 0
2 AfghanistanAsia 2020-01-04 0 0
3 AfghanistanAsia 2020-01-05 0 0
4 AfghanistanAsia 2020-01-06 0 0
5 AfghanistanAsia 2020-01-07 0 0
6 AfghanistanAsia 2020-01-08 0 0
7 AfghanistanAsia 2020-01-09 0 0
8 AfghanistanAsia 2020-01-10 0 0
9 AfghanistanAsia 2020-01-11 0 0
10 AfghanistanAsia 2020-01-12 0 0
# ℹ 330,146 more rows
As I said earlier, data science is all about finding workarounds. Of course, we’re often interested in general approaches, but sometimes we have to settle for a lower-level solution that gets the job done. For example, in this case, we could spend a long time figuring out the best possible solution, whereas a simpler one is at hand.
This simpler solution consists of employing the str_extract
function to extract the names of the continents, in conjunction with str_remove
to remove them from the original column. Remember that multiple patterns should be provided as a single string, with the patterns separated by |
.
continents <- c(
  'Asia',
  'Europe',
  'Africa',
  'South America',
  'North America',
  'Oceania'
) |> 
  paste0(collapse = '|')

covid_loc_cont2 |> 
  mutate(
    continent = str_extract(
      location,
      continents
    ),
    location = str_remove(
      location,
      continents
    )
  )
# A tibble: 330,156 × 5
location date new_cases new_deaths continent
<chr> <date> <dbl> <dbl> <chr>
1 Afghanistan 2020-01-03 0 0 Asia
2 Afghanistan 2020-01-04 0 0 Asia
3 Afghanistan 2020-01-05 0 0 Asia
4 Afghanistan 2020-01-06 0 0 Asia
5 Afghanistan 2020-01-07 0 0 Asia
6 Afghanistan 2020-01-08 0 0 Asia
7 Afghanistan 2020-01-09 0 0 Asia
8 Afghanistan 2020-01-10 0 0 Asia
9 Afghanistan 2020-01-11 0 0 Asia
10 Afghanistan 2020-01-12 0 0 Asia
# ℹ 330,146 more rows
Job done! Next, we turn to dates, which are a special form of string. Handling them properly is essential for analyzing time series data.
1.3.4 Date manipulation
Having knowledge of date manipulation is crucial for performing many tasks when we’re dealing with time series data. Date objects are very convenient, since they allow us to extract features that can be used for many purposes. The subject is so vast that the lubridate
package was created specifically to handle date objects.
The first step when we’re working with dates is to convert the string to a date object. To avoid ambiguity issues, lubridate
contains a set of predefined functions that account for the ordering of year, month, and day in the string.
For instance, if we have a date in the standard YYYY-MM-DD
format, we can use the ymd
function. Note that it works regardless of how these terms are separated: it might be like 2021-12-01
, 2022/12/01
or even 20221201
. Also, month names (full or abbreviated forms) are allowed – 2021-December-01
or 2021-Dec-01
. The same logic applies to the whole family of related functions: mdy
, dmy
, ydm
, dym
, my
, ym
, and so on.
library(lubridate)
ymd('2022/12/01')
[1] "2022-12-01"
mdy('december, 1, 2022')
[1] "2022-12-01"
dmy('01122022')
[1] "2022-12-01"
my('Dec-2021')
[1] "2021-12-01"
In case the string format doesn’t match any of the predefined patterns, we can use the lubridate::as_date
function and declare the unusual format using specific operators: %Y
for year, %m
for month, and %d
for day. There are two other useful ones: %b
for month names (full or abbreviated) and %y
for two-digit year.
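For instance, a format that doesn't match any of the predefined patterns could be handled as in the sketch below (the input string is made up, and parsing the month name with %b assumes an English locale):
as_date('01/Dec/22', format = '%d/%b/%y')
[1] "2022-12-01"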
Now, let’s see how to extract features from date objects and how to use them to perform common operations. Take the following data set (named brl_usd
), which provides daily values of the Brazilian Real (BRL) versus the US Dollar from January 2010 to December 2021, where the column date
is in the DD/MM/YYYY
format.
The BRL/USD data set is available in the R4ER2data
package under the name brl_usd
.
brl_usd <- R4ER2data::brl_usd
Note that the column date is in standard string (or character) format. We’ll first convert it to the appropriate date format using the functions we’ve just learned.
brl_usd_aux <- brl_usd |> mutate(date = dmy(date))

brl_usd_aux
# A tibble: 3,014 × 2
date brl
<date> <dbl>
1 2010-01-04 1.72
2 2010-01-05 1.72
3 2010-01-06 1.73
4 2010-01-07 1.74
5 2010-01-08 1.74
6 2010-01-11 1.73
7 2010-01-12 1.74
8 2010-01-13 1.74
9 2010-01-14 1.76
10 2010-01-15 1.77
# ℹ 3,004 more rows
Now, suppose we want to obtain the BRL monthly average. We have two ways to do this, one that is more logical and one that is more compact (see later when we talk about rounding dates). In the logical approach, all we have to do is create the columns we need to perform a grouping operation: year and month.
brl_usd_monthly <- brl_usd_aux |> 
  mutate(
    year  = year(date),
    month = month(date)
  ) |> 
  group_by(year, month) |> 
  summarise(brl_monthly = mean(brl))

brl_usd_monthly
# A tibble: 144 × 3
# Groups: year [12]
year month brl_monthly
<dbl> <dbl> <dbl>
1 2010 1 1.78
2 2010 2 1.84
3 2010 3 1.79
4 2010 4 1.76
5 2010 5 1.81
6 2010 6 1.81
7 2010 7 1.77
8 2010 8 1.76
9 2010 9 1.72
10 2010 10 1.68
# ℹ 134 more rows
You might be wondering how to recover the YYYY-MM-DD
date format. We can do this by simply using the make_date
function. This function creates a standard date object from user-provided year, month, and day. Since we’ve aggregated daily data into monthly, we have two common choices for the day parameter: we either set it to 1 (the default), or we set it to the last day of the month using days_in_month
.
brl_usd_monthly |>
  mutate(
    date = make_date(
      year = year,
      month = month,
      day = 1),
    date2 = make_date(
      year = year,
      month = month,
      day = days_in_month(date)
    )
  )
# A tibble: 144 × 5
# Groups: year [12]
year month brl_monthly date date2
<dbl> <dbl> <dbl> <date> <date>
1 2010 1 1.78 2010-01-01 2010-01-31
2 2010 2 1.84 2010-02-01 2010-02-28
3 2010 3 1.79 2010-03-01 2010-03-31
4 2010 4 1.76 2010-04-01 2010-04-30
5 2010 5 1.81 2010-05-01 2010-05-31
6 2010 6 1.81 2010-06-01 2010-06-30
7 2010 7 1.77 2010-07-01 2010-07-31
8 2010 8 1.76 2010-08-01 2010-08-31
9 2010 9 1.72 2010-09-01 2010-09-30
10 2010 10 1.68 2010-10-01 2010-10-31
# ℹ 134 more rows
Note that creating a column with the number of days in each month can be particularly useful, for instance when we only have the monthly totals of a variable and need to compute daily averages. And, of course, it handles February correctly, since it takes the year into account.
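A tiny sketch with made-up monthly totals:
tibble(
  date = ymd(c('2021-01-01', '2021-02-01')),
  monthly_total = c(3100, 2800)
) |>
  mutate(daily_avg = monthly_total / days_in_month(date)) # 100 per day in both months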
The same procedure applies if we want to get quarterly means – it’s just a matter of creating a column with quarters. Be aware, however, that the quarter
function has a parameter named with_year
which, when set to TRUE
, eliminates the need to create a separate column for the year.
brl_usd_quarterly <- brl_usd_aux |>
  mutate(quarter = quarter(date, with_year = TRUE)) |>
  group_by(quarter) |>
  summarise(brl_quarterly = mean(brl))

brl_usd_quarterly
# A tibble: 48 × 2
quarter brl_quarterly
<dbl> <dbl>
1 2010. 1.80
2 2010. 1.79
3 2010. 1.75
4 2010. 1.70
5 2011. 1.67
6 2011. 1.60
7 2011. 1.64
8 2011. 1.80
9 2012. 1.77
10 2012. 1.96
# ℹ 38 more rows
Special attention must be paid when we want to work with weeks, because lubridate
has two different functions to extract this feature: week
and isoweek
. The former returns the number of complete seven-day periods that have occurred between the date and January 1st, plus one, while the latter returns the number of the ISO week (running from Monday to Sunday) that the date belongs to.
To get a better sense of the difference between them, suppose we provide the date ‘2022-01-01’. week
will return 1, since it’s in the first seven-day period after January 1st. On the other hand, isoweek
will return 52, because it’s Saturday and thus part of the last week of the previous year. It will only return 1 as of ‘2022-01-03’, since it belongs to a new week.
week('2022-01-01')
[1] 1
isoweek('2022-01-01')
[1] 52
isoweek('2022-01-03')
[1] 1
Therefore, if we want to compute weekly averages – and by weekly we mean a period of seven consecutive days – then we should pick isoweek
instead of week
. Another feature we can extract from dates is the weekday. In addition to temporal aggregation, it’s often used to filter or label data we want to plot later.
brl_usd_aux |>
  mutate(wday = wday(date, label = TRUE))
# A tibble: 3,014 × 3
date brl wday
<date> <dbl> <ord>
1 2010-01-04 1.72 Mon
2 2010-01-05 1.72 Tue
3 2010-01-06 1.73 Wed
4 2010-01-07 1.74 Thu
5 2010-01-08 1.74 Fri
6 2010-01-11 1.73 Mon
7 2010-01-12 1.74 Tue
8 2010-01-13 1.74 Wed
9 2010-01-14 1.76 Thu
10 2010-01-15 1.77 Fri
# ℹ 3,004 more rows
Now we turn to operations with date objects. The lubridate
package contains two special operators, %m+%
and %m-%
, that work nicely with date objects to perform, respectively, addition and subtraction.
d1 <- ymd('2020-02-29')

d1 %m+% years(2)
[1] "2022-02-28"
d1 %m-% months(3)
[1] "2019-11-29"
d1 %m+% days(1)
[1] "2020-03-01"
In addition, there's also the lesser-known add_with_rollback
function, which we can use to have more control over the output. For example, when adding one month to ‘2022-01-31’, we might want either ‘2022-02-28’ (the last day of the next month) or ‘2022-03-01’ (a full one-month period). To get the latter, we set the roll_to_first
parameter to TRUE
.
d2 <- ymd('2022-01-31')

add_with_rollback(d2, months(1), roll_to_first = TRUE)
[1] "2022-03-01"
add_with_rollback(d2, months(1), roll_to_first = FALSE)
[1] "2022-02-28"
I couldn’t help but mention two useful functions used to round dates: floor_date
and ceiling_date
. They take a date object and round it down or up, respectively, to the nearest boundary of the specified time unit.
d3 <- ymd('2021-03-13')

floor_date(d3, unit = 'month')
[1] "2021-03-01"
ceiling_date(d3, unit = 'month')
[1] "2021-04-01"
floor_date(d3, unit = 'quarter')
[1] "2021-01-01"
ceiling_date(d3, unit = 'quarter')
[1] "2021-04-01"
These functions can be helpful in several ways, and you’ll discover their benefits as you go through your own tasks. For example, they’re useful when we want to match (or re-write) dates that refer to the same period but are written differently – say, when the monthly date in dataset A is ‘2021-12-01’ and in dataset B is ‘2021-12-31’.
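A quick sketch of that trick: rounding the end-of-month date down makes the two representations identical.
floor_date(ymd('2021-12-31'), unit = 'month')
[1] "2021-12-01"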
I use them very often as a simpler way to perform temporal aggregation. Remember that earlier in this section we computed monthly averages by creating two grouping columns: year and month. The logic was simply to treat every day in the same year/month as belonging to the same group. We can easily accomplish the same result by rounding dates down – with the great benefit of preserving the date column.
brl_usd_monthly2 <- brl_usd_aux |>
  mutate(date = floor_date(date, unit = 'month')) |>
  group_by(date) |>
  summarise(brl_monthly = mean(brl))

brl_usd_monthly2
# A tibble: 144 × 2
date brl_monthly
<date> <dbl>
1 2010-01-01 1.78
2 2010-02-01 1.84
3 2010-03-01 1.79
4 2010-04-01 1.76
5 2010-05-01 1.81
6 2010-06-01 1.81
7 2010-07-01 1.77
8 2010-08-01 1.76
9 2010-09-01 1.72
10 2010-10-01 1.68
# ℹ 134 more rows
To finish up, let’s have a quick look at a family of functions: wday
, mday
, qday
, and yday
. They return, respectively, the day of the week, month, quarter, and year that a given date falls on.
wday('2021-06-10') # 5th day of that week
[1] 5
qday('2021-06-10') # 71st day of the 2nd quarter of 2021
[1] 71
yday('2021-06-10') # 161st day of 2021
[1] 161
It’s very useful, for example, when you need to compare observations from the same period in different years or to create high-frequency seasonal variables.
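For instance, a quick sketch using the brl_usd_aux data from above – keeping only the observations that fall on the same calendar day (June 10) in every year of the sample:
brl_usd_aux |>
  filter(yday(date) == yday(ymd('2021-06-10')))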
1.4 Looping
Iteration is an indispensable tool in programming, and every language has its own structure. Essentially, loops are used to repeat an action over a set of values, thus saving us from the annoying and risky copy-and-paste approach. Whenever we encounter any kind of redundancy, there’s a good reason to use loops.
The purrr
package provides many interesting tools for working with functions and vectors. For our purposes, we’ll stick with the family of map
functions. The logic is always the same: applying a function – existing or user-defined – over a vector (or list) of arguments.
Let’s start with a very simple example. Suppose we have three numeric vectors and we want to compute their means. Instead of calling mean
on each vector separately, we can put them into a list and then use the map
function in conjunction with the existing mean
function.
v1 <- c(1,4,7,8)
v2 <- c(3,5,9,0)
v3 <- c(12,0,7,1)
v_all <- list(v1, v2, v3)

map(.x = v_all, .f = mean)
[[1]]
[1] 5
[[2]]
[1] 4.25
[[3]]
[1] 5
Note that by default the output will be a list, but we can have other output formats using map_*
: map_dbl
will return the results as a vector, whereas map_dfc
will return them as a data frame, binding the results column-wise. We just need to consider whether the output can be coerced to the desired class.
map_dbl(.x = v_all, .f = mean)
[1] 5.00 4.25 5.00
map_dfc(.x = v_all, .f = mean)
# A tibble: 1 × 3
...1 ...2 ...3
<dbl> <dbl> <dbl>
1 5 4.25 5
Most of the time, I prefer to return the results as a list, because that makes it easier to apply further operations if needed. Data frames are usually a better choice for final results.
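If a data frame is indeed what we're after, one simple option (just a sketch) is to pair the results with an identifier column:
tibble(
  vector = c('v1', 'v2', 'v3'),
  mean = map_dbl(.x = v_all, .f = mean)
)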
Now, let’s introduce some degree of complexity to the exercise by providing our own function. For this, let’s use the example of importing data from an API we saw earlier in the Importing section. Suppose that, in addition to CPI, we also want to get the time series for GDP and the unemployment rate.
Remember (or scroll up if necessary) that we created an object called api_series_id
with the ID of the CPI time series – and that was the only specific parameter. Everything else would be the same for any other series we wanted. Therefore, our first task here is to create a function whose only parameter is the series ID. Then, we create a vector (or list) with the desired IDs and – guess what? – use them inside the map
function.
We will create the get_series
function using the same content we already saw above, but leaving the api_series_id
as a parameter (series_id
). Note that I’ll keep some objects with their original names – starting with cpi_ – just to avoid confusion. This has no practical effect, though.
get_series <- function(series_id) {
  api_url <- 'https://api.stlouisfed.org/fred/series/observations?'
  api_key <- api_fred_key
  api_series_id <- series_id
  obs_start <- '2010-01-01'
  api_filetype <- 'json'
  api_request <- glue::glue('{api_url}series_id={api_series_id}&observation_start={obs_start}&api_key={api_key}&file_type={api_filetype}')
  cpi_request <- httr::GET(url = api_request)
  cpi_content <- httr::content(cpi_request, as = 'text')
  cpi_list <- jsonlite::fromJSON(cpi_content, flatten = FALSE)
  cpi_tbl <- cpi_list[['observations']] |> tibble::as_tibble()
  cpi_tbl
}
We can test our function using the ID for CPI we used before.
get_series(series_id = 'CPALTT01USM657N')
Great, it’s working fine! The next step is to create a vector (or list) with the IDs for each series. Assigning names to the vector (list) elements is a good idea, since these names are carried forward, helping us to identify the elements in the output list.
id_list <- list(
  'CPI' = 'CPALTT01USM657N',
  'GDP' = 'GDPC1',
  'Unemp' = 'UNRATE'
)

fred_data <- purrr::map(.x = id_list, .f = get_series)

fred_data
$CPI
# A tibble: 171 × 4
realtime_start realtime_end date value
<chr> <chr> <chr> <chr>
1 2025-08-21 2025-08-21 2010-01-01 0.34174735701485
2 2025-08-21 2025-08-21 2010-02-01 0.024920738207648
3 2025-08-21 2025-08-21 2010-03-01 0.4106283536571079
4 2025-08-21 2025-08-21 2010-04-01 0.173688491069743
5 2025-08-21 2025-08-21 2010-05-01 0.0775197354237721
6 2025-08-21 2025-08-21 2010-06-01 -0.0976267084673888
7 2025-08-21 2025-08-21 2010-07-01 0.0211043057371509
8 2025-08-21 2025-08-21 2010-08-01 0.138066427840807
9 2025-08-21 2025-08-21 2010-09-01 0.0581736230715569
10 2025-08-21 2025-08-21 2010-10-01 0.12451988884769599
# ℹ 161 more rows
$GDP
# A tibble: 62 × 4
realtime_start realtime_end date value
<chr> <chr> <chr> <chr>
1 2025-08-29 2025-08-29 2010-01-01 16582.71
2 2025-08-29 2025-08-29 2010-04-01 16743.162
3 2025-08-29 2025-08-29 2010-07-01 16872.266
4 2025-08-29 2025-08-29 2010-10-01 16960.864
5 2025-08-29 2025-08-29 2011-01-01 16920.632
6 2025-08-29 2025-08-29 2011-04-01 17035.114
7 2025-08-29 2025-08-29 2011-07-01 17031.313
8 2025-08-29 2025-08-29 2011-10-01 17222.583
9 2025-08-29 2025-08-29 2012-01-01 17367.01
10 2025-08-29 2025-08-29 2012-04-01 17444.525
# ℹ 52 more rows
$Unemp
# A tibble: 187 × 4
realtime_start realtime_end date value
<chr> <chr> <chr> <chr>
1 2025-08-21 2025-08-21 2010-01-01 9.8
2 2025-08-21 2025-08-21 2010-02-01 9.8
3 2025-08-21 2025-08-21 2010-03-01 9.9
4 2025-08-21 2025-08-21 2010-04-01 9.9
5 2025-08-21 2025-08-21 2010-05-01 9.6
6 2025-08-21 2025-08-21 2010-06-01 9.4
7 2025-08-21 2025-08-21 2010-07-01 9.4
8 2025-08-21 2025-08-21 2010-08-01 9.5
9 2025-08-21 2025-08-21 2010-09-01 9.5
10 2025-08-21 2025-08-21 2010-10-01 9.4
# ℹ 177 more rows
We could make our function more general by allowing more parameters to vary. For example, we could have a different time span for each series (the obs_start
object). The procedure would be almost the same: we would add an extra parameter to the function, create two vectors (lists) with the parameter values, and use map2
instead of map
.
get_series_2pars <- function(series_id, series_start) {
  api_url <- 'https://api.stlouisfed.org/fred/series/observations?'
  api_key <- api_fred_key
  api_series_id <- series_id
  obs_start <- series_start
  api_filetype <- 'json'
  api_request <- glue::glue('{api_url}series_id={api_series_id}&observation_start={obs_start}&api_key={api_key}&file_type={api_filetype}')
  cpi_request <- httr::GET(url = api_request)
  cpi_content <- httr::content(cpi_request, as = 'text')
  cpi_list <- jsonlite::fromJSON(cpi_content, flatten = FALSE)
  cpi_tbl <- cpi_list[['observations']] |> tibble::as_tibble()
  cpi_tbl
}
time_list <- list(
  'CPI' = '2010-01-01',
  'GDP' = '2012-04-01',
  'Unemp' = '2014-06-01'
)

fred_data2 <- purrr::map2(.x = id_list, .y = time_list, .f = get_series_2pars)
We must be careful with the example above because it may give the impression that map2
takes variable names into account. It doesn’t! It actually relies on the order of elements in each list. So, if we moved CPI to the second position in time_list
, we would end up with GDP data using the CPI time span.
A safer – and therefore preferable – alternative is to use names rather than positions as indexes. Since we can use list[['element_name']]
to access an element in a list, we can reduce our problem to a single dimension by looping over variable names, which are the same in both lists.
vars_fred <- c('CPI', 'GDP', 'Unemp')

fred_data2_names <- map(
  .x = vars_fred,
  .f = function(x) get_series_2pars(
    series_id = id_list[[x]],
    series_start = time_list[[x]]
  )
) |>
  magrittr::set_names(vars_fred)
Notice how powerful this solution is: you can generalize it to as many parameters as needed without running the risk of using a parameter value from one series in another. The only additional work is to spell out the function's parameters explicitly inside .f.
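For the record, purrr also ships pmap(), which iterates over any number of parallel lists at once. A minimal sketch with the same id_list and time_list as above is shown below; keep in mind that, unlike the names-based loop, pmap() pairs elements by position, so the lists must share the same ordering (the names of the outer list are matched to the function's argument names).
fred_data3 <- purrr::pmap(
  .l = list(series_id = id_list, series_start = time_list),
  .f = get_series_2pars
)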
To finish this topic, let's go back to the end of the subsection on data layout. There, we had an overview of nested data frames, and I mentioned that I find them most useful when we need to perform several tasks over entire datasets. Also, remember that nesting a data frame means, roughly speaking, collapsing all the rows of each group into a single cell that stores a list (here, a tibble) rather than a single observation.
Let’s print again the nested data frame we created earlier, named covid_eu_nest
.
# A tibble: 55 × 2
# Groups: location [55]
location data
<chr> <list>
1 Albania <tibble [1,371 × 66]>
2 Andorra <tibble [1,371 × 66]>
3 Austria <tibble [1,371 × 66]>
4 Belarus <tibble [1,371 × 66]>
5 Belgium <tibble [1,371 × 66]>
6 Bosnia and Herzegovina <tibble [1,371 × 66]>
7 Bulgaria <tibble [1,378 × 66]>
8 Croatia <tibble [1,377 × 66]>
9 Cyprus <tibble [1,371 × 66]>
10 Czechia <tibble [1,378 × 66]>
# ℹ 45 more rows
We can use map
to perform computations for every country at once. Moreover, we can create new columns to store the results, thereby gathering all the information in the same object.
covid_eu_nest |>
  mutate(
    max_total_cases = map_dbl(
      .x = data,
      .f = function(x) {
        x |>
          pull(total_cases) |>
          max(na.rm = TRUE)
      }
    ),
    min_total_cases = map_dbl(
      .x = data,
      .f = function(x) {
        x |>
          pull(total_cases) |>
          min(na.rm = TRUE)
      }
    )
  )
# A tibble: 55 × 4
# Groups: location [55]
location data max_total_cases min_total_cases
<chr> <list> <dbl> <dbl>
1 Albania <tibble [1,371 × 66]> 334090 2
2 Andorra <tibble [1,371 × 66]> 48015 1
3 Austria <tibble [1,371 × 66]> 6081287 1
4 Belarus <tibble [1,371 × 66]> 994037 1
5 Belgium <tibble [1,371 × 66]> 4817196 1
6 Bosnia and Herzegovina <tibble [1,371 × 66]> 403155 2
7 Bulgaria <tibble [1,378 × 66]> 1302188 4
8 Croatia <tibble [1,377 × 66]> 1275337 1
9 Cyprus <tibble [1,371 × 66]> 660854 2
10 Czechia <tibble [1,378 × 66]> 4651538 3
# ℹ 45 more rows
And since each cell is a list rather than a single observation, we’re by no means restricted to numeric elements. We could use the same strategy to create a column with plots, for example. Plotting is, by the way, the subject of the next section.
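Just as a teaser of what's coming – a sketch assuming the nested tibbles carry the date and total_cases columns from the OWID data – we could store one plot per country in a new list-column:
covid_eu_nest |>
  mutate(
    cases_plot = map(
      .x = data,
      .f = function(x) {
        ggplot(x, aes(x = date, y = total_cases)) +
          geom_line()
      }
    )
  )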
1.5 Graphics
We're almost done with our review of the basics of the tidyverse. Summarizing all the previous sections succinctly was challenging, but I believe you've received enough information to start carrying out data analyses on your own. When it comes to making graphics, extrapolating from the basics is somewhat harder, because there are infinite possibilities for customization.
In fact, the grammar used in the ggplot2 package is broad and far from simple at first glance. But as you practice, it becomes increasingly graspable. Still, I encourage you to dive into the amazing Wickham, Navarro, and Pedersen (2019), since becoming fluent in this subject is indispensable.
Let’s think about the process of making a graphic as a set of layers arranged sequentially. The first layer is the data you want to plot – we’ll use the CPI data frame from the Importing section. The second layer is the ggplot
function. So far, there’s nothing to visualize – all these two layers do is set the stage for what’s coming next.
cpi_tbl |>
  ggplot()
The third layer we must provide is the geom
, which defines the geometry used to represent our data. This is the most important layer, as it contains a set of parameters that effectively creates a visualization of the data. For time series data, we generally use a line graph, so geom_line
is the appropriate geometry. In addition, we need to provide the values for the x
and y
axes. In this case, they’re the dates and CPI values, respectively.
Before proceeding, we need to make sure that the values for x
and y
are of the appropriate class. We can use the glimpse
function from dplyr
package to check this.
cpi_tbl |>
  glimpse()
Rows: 156
Columns: 4
$ realtime_start <chr> "2025-08-29", "2025-08-29", "2025-08-29", "2025-08-29",…
$ realtime_end <chr> "2025-08-29", "2025-08-29", "2025-08-29", "2025-08-29",…
$ date <chr> "2010-01-01", "2010-02-01", "2010-03-01", "2010-04-01",…
$ value <chr> "0.34174735701485", "0.024920738207648", "0.41062835365…
We see that both date
and value
(the CPI value) are in character format, and so we need to convert them to date and numeric format, respectively. One last thing to keep in mind is that ggplot
uses +
instead of |>
as the operator to chain actions. Therefore, every layer we add is preceded by a +
. The code below produces the simplest plot from CPI data.
cpi_tbl |>
  mutate(
    date = ymd(date),
    value = as.numeric(value)
  ) |>
  ggplot() +
  geom_line(aes(x = date, y = value))
What should we care about in this plot? Well, every plot should contain a meaningful title and labels for both the x and y axes. We can set them using the labs
layer. Additionally, we might want to use shorter intervals for dates (x-axis) and values (y-axis). There are also specific layers to control axis settings.
cpi_tbl |>
  mutate(
    date = ymd(date),
    value = as.numeric(value)
  ) |>
  ggplot() +
  geom_line(aes(x = date, y = value)) +
  scale_x_date(date_breaks = '1 year', date_labels = '%Y') +
  scale_y_continuous(breaks = seq(-1, 1.5, 0.25), limits = c(-1, 1.5)) +
  labs(
    title = 'US CPI showing an upward trend as of 2021',
    x = 'Date',
    y = 'US Monthly CPI (%)'
  )
Finally, we might be interested in adding a feature to the plot to highlight the upward trend in CPI starting in 2021. Either the 12-month accumulated CPI or the 3-month accumulated seasonally adjusted CPI would be common choices for this task, and we’ll see examples of how to perform this kind of computation later. For now, just to keep things simple, we’ll use the average CPI from 2010 to 2019 (the pre-COVID period) as a measure of “normal” CPI.
cpi_tbl |>
  mutate(
    date = ymd(date),
    value = as.numeric(value)
  ) |>
  mutate(value_avg = mean(value[which(year(date) %in% 2010:2019)])) |>
  ggplot() +
  geom_line(aes(x = date, y = value)) +
  geom_line(aes(x = date, y = value_avg), color = 'red', lwd = 1) +
  scale_x_date(date_breaks = '1 year', date_labels = '%Y') +
  scale_y_continuous(breaks = seq(-1, 1, 0.25)) +
  labs(
    title = 'US CPI showing an upward trend as of 2021',
    subtitle = 'Red line is the 2010-2019 average',
    x = 'Date',
    y = 'US Monthly CPI (%)'
  )
Since the parameter x = date
appears in both geom_line
layers, we can make the code more compact by moving this parameter into the ggplot()
function. All parameters inside ggplot()
are carried forward to every subsequent layer. This saves a lot of effort when adding multiple geom_*
layers to a graphic with fixed features. In this case, x
will always be the date variable, regardless of any layer we add to the plot.
cpi_tbl |>
  mutate(
    date = ymd(date),
    value = as.numeric(value)
  ) |>
  mutate(value_avg = mean(value[which(year(date) %in% 2010:2019)])) |>
  ggplot(aes(x = date)) +
  geom_line(aes(y = value)) +
  geom_line(aes(y = value_avg), color = 'red', lwd = 1) +
  scale_x_date(date_breaks = '1 year', date_labels = '%Y') +
  scale_y_continuous(breaks = seq(-1, 1.5, 0.25), limits = c(-1, 1.5)) +
  labs(
    title = 'US CPI showing an upward trend as of 2021',
    subtitle = 'Red line is the 2010-2019 average',
    x = 'Date',
    y = 'US Monthly CPI (%)'
  )
You may be wondering why the x
and y
parameters go inside the aes
function, whereas color
appears outside of it. First of all, the parameters that define the axes of the graphic always go inside aes
. Other parameters behave differently depending on whether they are placed inside or outside aes
. According to the documentation, aesthetic mappings describe how variables in the data are mapped to visual properties (aesthetics) of geoms. It’s much easier to understand through an example.
In the graphic above, we used color
outside aes
to define the specific color of the plot line. However, if we were to use color
as an attribute to highlight different groups in the data, then color
should go inside aes
. In that case, we should pass a variable to it instead of a color name. This variable might be discrete – in which case we would have one color per group – or continuous, in which case we would have a color gradient according to its magnitude.
In the code below, we create a new variable named covid to separate the CPI data into pre-COVID and COVID periods, and use it as a color attribute.
cpi_tbl |>
  mutate(
    date = ymd(date),
    value = as.numeric(value),
    covid = if_else(between(date, ymd('2020-03-01'), ymd('2022-12-01')), 'Yes', 'No')
  ) |>
  ggplot() +
  geom_line(aes(x = date, y = value, color = covid), lwd = 1) +
  scale_x_date(date_breaks = '1 year', date_labels = '%Y') +
  scale_y_continuous(breaks = seq(-1, 1.5, 0.25), limits = c(-1, 1.5)) +
  labs(
    title = 'US CPI showing an upward trend as of 2021',
    x = 'Date',
    y = 'US Monthly CPI (%)'
  )
Note that ggplot
automatically assigns colors to attributes and adds a legend. So, we might be interested in customizing both the attribute colors and the legend position. Again, this can be achieved by adding two specific layers. The logic remains the same: parameters inside aes
turn a variable into an attribute, which can then be customized using specific layers.
cpi_tbl |>
  mutate(
    date = ymd(date),
    value = as.numeric(value),
    covid = if_else(between(date, ymd('2020-03-01'), ymd('2022-12-01')), 'Yes', 'No')
  ) |>
  ggplot() +
  geom_line(aes(x = date, y = value, color = covid), lwd = 1) +
  scale_x_date(date_breaks = '1 year', date_labels = '%Y') +
  scale_y_continuous(breaks = seq(-1, 1.5, 0.25), limits = c(-1, 1.5)) +
  scale_color_manual(values = c('darkgreen', 'orange')) +
  theme(legend.position = 'top') +
  labs(
    title = 'US CPI showing an upward trend as of 2021',
    x = 'Date',
    y = 'US Monthly CPI (%)'
  )
As a final example, we’ll see how to build a scatter plot and explore some additional features of ggplot
. Scatter plots are typically used to highlight the relationship between two variables at a given point in time. For this task, let’s use the COVID dataset once again. We’ll first filter the data to a specific date – say, ‘2021-07-31’. Next, we’ll plot people_fully_vaccinated_per_hundred
on the x-axis and new_deaths_smoothed_per_million
on the y-axis.3
covid_data |>
  filter(date == '2021-07-31') |>
  ggplot() +
  geom_point(
    aes(
      x = people_fully_vaccinated_per_hundred,
      y = new_deaths_smoothed_per_million
    )
  )
We saw earlier how to manually set the color for each group. It’s worth emphasizing that we’re by no means limited to using color names, as ggplot
also works with other tools such as HEX codes and the excellent ColorBrewer. ColorBrewer is a great resource for color schemes carefully designed to enhance visual communication. You can check it out at https://colorbrewer2.org/.
We’ll stick with ColorBrewer, as it makes it effortless to assign a different color to each of the six continents. After choosing your desired pattern and palette on the website, you’ll find some metadata (type and scheme) that will be used here.
In addition, we’ll use theme_light()
to make our plot cleaner. Essentially, a theme is a set of predefined visual features for a graph. You can explore other built-in themes as well, including theme_dark()
, theme_minimal()
, theme_classic()
, and so on.
covid_data |>
  filter(
    date == '2021-07-31',
    !is.na(continent)
  ) |>
  ggplot() +
  geom_point(
    aes(
      x = people_fully_vaccinated_per_hundred,
      y = new_deaths_smoothed_per_million,
      color = continent
    ),
    size = 3
  ) +
  scale_color_brewer(type = 'qual', palette = 2) +
  theme_light() +
  labs(
    title = 'New deaths in late July were above 5 per million people in countries where \n vaccination falls below 30% of population'
  )
To finish, instead of showing all the continents in a single plot, we could separate them into multiple panels using facet_wrap
. This feature is especially useful when the range of values across groups differs greatly in magnitude.
covid_data |>
  filter(
    date == '2021-07-31',
    !is.na(continent)
  ) |>
  ggplot() +
  geom_point(
    aes(
      x = people_fully_vaccinated_per_hundred,
      y = new_deaths_smoothed_per_million
    ),
    size = 3
  ) +
  theme_light() +
  facet_wrap(~ continent) +
  labs(
    title = 'New deaths in late July were above 5 per million people in countries where \n vaccination falls below 30% of population'
  )
Since we have a separate panel for each continent, it’s no longer necessary to assign a different color to each of them. As you go through the next chapters, you’ll see more plotting features and tips on how to improve data visualization.
1.6 Tables
Graphs are the right tool to show historical data or the relationship between variables. Nonetheless, we often need to send reports with detailed data from the latest release of a certain variable. Tables are much better suited for this type of task, and the gt
package is excellent when it comes to customizing tables. It’s worth taking a look at the package’s website to get a better sense of the syntax.
In the following example, we’ll see how to organize the data properly to convert it into a nice-looking table. The data frame us_gdp
below contains data on the contributions to percent change in real gross domestic product (annual rates) for the U.S. economy, released by the Bureau of Economic Analysis of the U.S. Department of Commerce (BEA).
The data set containing the variables for this exercise is available in the R4ER2data
package under the name gdp_contribution_us
.
# A tibble: 6 × 5
Quarter Goods Services Structures `Motor vehicle output`
<yearqtr> <dbl> <dbl> <dbl> <dbl>
1 2019.25 0.46 2.07 0.86 0.24
2 2019.50 1.82 2.29 0.65 0.22
3 2019.75 0.78 2.25 -0.27 -0.8
4 2020.00 -1.79 -4.33 0.66 -0.7
5 2020.25 -7.37 -18.2 -2.55 -3.85
6 2020.50 19.6 13.7 1.88 5.99
The data frame is in wide format, which is not suitable for a table. So, we’ll start by pivoting the data frame so that the dates go across the columns and the variables (the breakdown of the contributions to U.S. GDP) go across the rows. Then, we’ll convert it into a gt
object.
library(gt)
# load the BEA contributions data from the companion package
us_gdp <- R4ER2data::gdp_contribution_us

us_gdp_gt <- us_gdp |>
  pivot_longer(-Quarter, names_to = 'var', values_to = 'value') |>
  mutate(Quarter = zoo::as.yearqtr(Quarter)) |>
  filter(Quarter >= '2021 Q1') |>
  pivot_wider(names_from = Quarter, values_from = value) |>
  gt(rowname_col = "var")

us_gdp_gt
2021 Q1 | 2021 Q2 | 2021 Q3 | 2021 Q4 | 2022 Q1 | 2022 Q2 | 2022 Q3 | 2022 Q4 | 2023 Q1 | 2023 Q2 | 2023 Q3 | 2023 Q4 | 2024 Q1 | 2024 Q2 | 2024 Q3 | 2024 Q4 | 2025 Q1 | 2025 Q2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Goods | 1.80 | 1.84 | 0.32 | 5.51 | -1.50 | -1.14 | 2.09 | 2.12 | 0.14 | 0.53 | 2.22 | 0.80 | -1.12 | 1.56 | 1.41 | 0.23 | -0.22 | 2.67 |
Services | 3.45 | 5.04 | 3.56 | 2.26 | 0.44 | 1.92 | 1.74 | 2.00 | 2.29 | 1.01 | 1.50 | 1.67 | 1.91 | 1.48 | 1.92 | 1.81 | -0.12 | 0.74 |
Structures | 0.38 | -0.45 | -0.42 | -0.36 | 0.02 | -0.50 | -1.10 | -0.77 | 0.36 | 0.91 | 0.64 | 0.73 | 0.85 | -0.05 | -0.26 | 0.41 | -0.16 | -0.44 |
Motor vehicle output | 0.50 | -0.32 | -1.40 | 0.72 | -0.15 | 0.07 | 0.30 | 0.00 | 0.60 | 0.35 | -0.24 | -0.56 | 0.09 | 0.47 | -0.51 | 0.10 | -0.30 | 0.74 |
Okay, we have a rough draft of a table. Now, we can add some labels, such as a title, subtitle, and source note.
us_gdp_gt <- us_gdp_gt |>
  tab_header(
    title = "Contributions to percent change in real gross domestic product",
    subtitle = "Annual rates"
  ) |>
  tab_source_note(
    source_note = "Source: U.S Bureau of Economic Analysis (BEA)"
  )

us_gdp_gt
Contributions to percent change in real gross domestic product | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Annual rates | ||||||||||||||||||
2021 Q1 | 2021 Q2 | 2021 Q3 | 2021 Q4 | 2022 Q1 | 2022 Q2 | 2022 Q3 | 2022 Q4 | 2023 Q1 | 2023 Q2 | 2023 Q3 | 2023 Q4 | 2024 Q1 | 2024 Q2 | 2024 Q3 | 2024 Q4 | 2025 Q1 | 2025 Q2 | |
Goods | 1.80 | 1.84 | 0.32 | 5.51 | -1.50 | -1.14 | 2.09 | 2.12 | 0.14 | 0.53 | 2.22 | 0.80 | -1.12 | 1.56 | 1.41 | 0.23 | -0.22 | 2.67 |
Services | 3.45 | 5.04 | 3.56 | 2.26 | 0.44 | 1.92 | 1.74 | 2.00 | 2.29 | 1.01 | 1.50 | 1.67 | 1.91 | 1.48 | 1.92 | 1.81 | -0.12 | 0.74 |
Structures | 0.38 | -0.45 | -0.42 | -0.36 | 0.02 | -0.50 | -1.10 | -0.77 | 0.36 | 0.91 | 0.64 | 0.73 | 0.85 | -0.05 | -0.26 | 0.41 | -0.16 | -0.44 |
Motor vehicle output | 0.50 | -0.32 | -1.40 | 0.72 | -0.15 | 0.07 | 0.30 | 0.00 | 0.60 | 0.35 | -0.24 | -0.56 | 0.09 | 0.47 | -0.51 | 0.10 | -0.30 | 0.74 |
Source: U.S Bureau of Economic Analysis (BEA) |
The tab_spanner
feature allows us to group columns under a common label. We can use one of its variations, tab_spanner_delim
, to organize the quarters under their respective year, making the table more compact and visually appealing.
us_gdp_gt <- us_gdp_gt |>
  tab_spanner_delim(
    columns = -var,
    delim = ' '
  )

us_gdp_gt
Contributions to percent change in real gross domestic product | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Annual rates | ||||||||||||||||||
2021 | 2022 | 2023 | 2024 | 2025 | ||||||||||||||
Q1 | Q2 | Q3 | Q4 | Q1 | Q2 | Q3 | Q4 | Q1 | Q2 | Q3 | Q4 | Q1 | Q2 | Q3 | Q4 | Q1 | Q2 | |
Goods | 1.80 | 1.84 | 0.32 | 5.51 | -1.50 | -1.14 | 2.09 | 2.12 | 0.14 | 0.53 | 2.22 | 0.80 | -1.12 | 1.56 | 1.41 | 0.23 | -0.22 | 2.67 |
Services | 3.45 | 5.04 | 3.56 | 2.26 | 0.44 | 1.92 | 1.74 | 2.00 | 2.29 | 1.01 | 1.50 | 1.67 | 1.91 | 1.48 | 1.92 | 1.81 | -0.12 | 0.74 |
Structures | 0.38 | -0.45 | -0.42 | -0.36 | 0.02 | -0.50 | -1.10 | -0.77 | 0.36 | 0.91 | 0.64 | 0.73 | 0.85 | -0.05 | -0.26 | 0.41 | -0.16 | -0.44 |
Motor vehicle output | 0.50 | -0.32 | -1.40 | 0.72 | -0.15 | 0.07 | 0.30 | 0.00 | 0.60 | 0.35 | -0.24 | -0.56 | 0.09 | 0.47 | -0.51 | 0.10 | -0.30 | 0.74 |
Source: U.S Bureau of Economic Analysis (BEA) |
We could keep adding more features – a color scheme, for instance – but we can be content for now. We have a well-formatted and highly functional table that effectively conveys a clear summary of the U.S. GDP release.
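Just as a taste of what that could look like – a sketch using a few gt helpers (fmt_number, tab_style, cell_fill, cells_column_labels), assuming a reasonably recent version of the package – we could round the figures to one decimal and shade the column labels:
us_gdp_gt |>
  fmt_number(columns = where(is.numeric), decimals = 1) |>
  tab_style(
    style = cell_fill(color = '#EDF2F7'),
    locations = cells_column_labels()
  )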
1. Actually, this information is contained in the covid_data object, but let's pretend it's not.
2. Another case for the by argument is when we have more than one column with the same name across the datasets, but we don't want to use them all as keys.
3. Be aware that we're not trying to establish a causal relationship, nor to find a threshold for vaccination above which people are safe. We must be cautious when interpreting this graphic, as omitted variables – such as age and prior exposure to the virus – may influence the outcome.