Week 3
Cleaning and Data Analysis in R II
SOCI 269
Our Syllabus Will Change
I’ll be updating the syllabus over the weekend. Specifically, I will—in all likelihood—extend the deadline for the Coding Assignment by a week.
Launch RStudio and execute the following code:
Then look at the Environment tab. What do you notice?
Here is a list of the objects available in your Global Environment:
vdem <- vdem |> as_tibble()
high_level <- c("v2x_polyarchy", "v2x_libdem", "v2x_partipdem", "v2x_delibdem", "v2x_egaldem")
high_level_names <- c("electoral", "liberal", "participatory", "deliberative", "egalitarian")
can_mex_usa <- vdem |>
  as_tibble() |>
  # Here, we (i) select (and rename) the first column;
  # (ii) select the year variable; then
  # (iii) select all the variables in the high_level vector.
  select(country = 1, year, all_of(high_level)) |>
  # Here, we rename all the high_level variables using the
  # high_level_names vector.
  rename_at(high_level, ~ high_level_names) |>
  # We're performing "per-operation" grouping here---within
  # the mutate() function (by year). We're then relocating the
  # grouped variable we created and ensuring that it appears after
  # "electoral."
  mutate(electoral_global_avg = mean(electoral, na.rm = TRUE),
         .by = year, .after = electoral) |>
  # Isolating Canada, the US and Mexico + years in the
  # 21st century:
  filter(str_detect(country, "Can|United St|Mex"),
         year >= 2000) |>
  # Arranging countries alphabetically:
  arrange(country)

can_mex_usa
# A tibble: 72 × 8
country year electoral electoral_global_avg liberal participatory
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Canada 2000 0.84 0.492 0.765 0.601
2 Canada 2001 0.837 0.495 0.761 0.594
3 Canada 2002 0.831 0.505 0.754 0.587
4 Canada 2003 0.831 0.514 0.754 0.584
5 Canada 2004 0.83 0.514 0.758 0.58
6 Canada 2005 0.828 0.518 0.758 0.579
7 Canada 2006 0.836 0.522 0.765 0.579
8 Canada 2007 0.838 0.519 0.767 0.579
9 Canada 2008 0.835 0.522 0.766 0.578
10 Canada 2009 0.835 0.524 0.765 0.578
# ℹ 62 more rows
# ℹ 2 more variables: deliberative <dbl>, egalitarian <dbl>
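The "per-operation" grouping used in the pipeline above can be compared with persistent grouping on a small example. This is a minimal sketch on a hypothetical toy tibble (`toy` and its values are made up for illustration, not V-Dem data), and it assumes dplyr 1.1.0 or later, where the `.by` argument is available.

```r
library(dplyr)

# Hypothetical toy data: two years, one missing score.
toy <- tibble(year = c(2000, 2000, 2001, 2001),
              electoral = c(0.25, 0.75, 1, NA))

# Per-operation grouping: the grouping applies to this mutate() call only,
# so the result comes back ungrouped.
per_op <- toy |>
  mutate(avg = mean(electoral, na.rm = TRUE), .by = year)

# Persistent grouping: group_by() sticks until we explicitly ungroup().
persistent <- toy |>
  group_by(year) |>
  mutate(avg = mean(electoral, na.rm = TRUE)) |>
  ungroup()

# Both approaches should yield the same yearly averages.
all.equal(per_op, persistent)
```

The `.by` form saves the `ungroup()` step, which is easy to forget when a pipeline keeps going after the grouped calculation.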
dplyr::bind_rows()
We can use bind_rows() to combine observations (rows) from different data frames.
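To make the behaviour concrete, here is a minimal sketch of bind_rows() on toy data. The objects `df_a`, `df_b` and `df_c` are hypothetical (they are not the objects stored in your Environment), and the values are made up.

```r
library(dplyr)

# Hypothetical toy data frames (not V-Dem data).
df_a <- tibble(country = "A", year = c(2000, 2001), score = c(0.1, 0.2))
df_b <- tibble(country = "B", year = c(2000, 2001), score = c(0.3, 0.4))
# df_c has its columns in a different order and no score column.
df_c <- tibble(year = 2000, country = "C")

# bind_rows() stacks the rows, matching columns by name (not position)
# and filling columns that are absent from an input with NA.
stacked <- bind_rows(df_a, df_b, df_c)
stacked
```

Because columns are matched by name, the inputs do not need to share the same column order, but they should use the same names for the same variables.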
Quick Exercise
In the next 5-10 minutes, try to recreate can_mex_usa using (i) the data you have stored in your Environment; and (ii) bind_rows().
Note: If you’re running into issues, fear not: the answer’s on the next slide.
dplyr::bind_rows()
# A tibble: 72 × 8
country year electoral electoral_global_avg liberal participatory
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Canada 2000 0.84 0.492 0.765 0.601
2 Canada 2001 0.837 0.495 0.761 0.594
3 Canada 2002 0.831 0.505 0.754 0.587
4 Canada 2003 0.831 0.514 0.754 0.584
5 Canada 2004 0.83 0.514 0.758 0.58
6 Canada 2005 0.828 0.518 0.758 0.579
7 Canada 2006 0.836 0.522 0.765 0.579
8 Canada 2007 0.838 0.519 0.767 0.579
9 Canada 2008 0.835 0.522 0.766 0.578
10 Canada 2009 0.835 0.524 0.765 0.578
# ℹ 62 more rows
# ℹ 2 more variables: deliberative <dbl>, egalitarian <dbl>
dplyr::bind_cols()
Another Mini-Exercise
Okay, that wasn’t so hard. Let’s try to use bind_cols() to append a new variable (e_regionpol, from truncated_regions) to our data.
dplyr::bind_cols()
# The Solution:
can_mex_usa_1 <- can_mex_usa |>
  # Avoiding duplicate columns by dropping the first two
  # columns of truncated_regions:
  bind_cols(truncated_regions |> select(-c(1:2))) |>
  # Relocating the e_regionpol variable:
  relocate(e_regionpol, .after = country)

can_mex_usa_1
# A tibble: 72 × 9
country e_regionpol year electoral electoral_global_avg liberal
<chr> <hvn_lbll> <dbl> <dbl> <dbl> <dbl>
1 Canada 2 2000 0.84 0.492 0.765
2 Canada 2 2001 0.837 0.495 0.761
3 Canada 2 2002 0.831 0.505 0.754
4 Canada 2 2003 0.831 0.514 0.754
5 Canada 2 2004 0.83 0.514 0.758
6 Canada 2 2005 0.828 0.518 0.758
7 Canada 2 2006 0.836 0.522 0.765
8 Canada 2 2007 0.838 0.519 0.767
9 Canada 2 2008 0.835 0.522 0.766
10 Canada 2 2009 0.835 0.524 0.765
# ℹ 62 more rows
# ℹ 3 more variables: participatory <dbl>, deliberative <dbl>,
# egalitarian <dbl>
dplyr::left_join()
What happens when we try to bind can_mex_usa with all_regions using the bind_cols() function?
The powerful *_join() family of verbs from dplyr is especially useful when we’re stitching together data frames of different dimensions (e.g., different numbers of rows).
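The contrast between positional binding and key-based joining can be sketched on toy data. Everything below is hypothetical: `scores` and `regions` are made-up stand-ins, not can_mex_usa or all_regions, and the join key `country` is an assumption chosen for the illustration.

```r
library(dplyr)

# Hypothetical toy data: four country-years on the left...
scores <- tibble(country = c("A", "A", "B", "B"),
                 year = c(2000, 2001, 2000, 2001),
                 score = c(0.1, 0.2, 0.3, 0.4))
# ...and a lookup table with one row per country, in a different order.
regions <- tibble(country = c("B", "A", "C"),
                  region = c("South", "North", "East"))

# bind_cols() pastes data frames together by row position, so inputs
# with different numbers of rows (4 vs. 3 here) trigger an error.
bind_attempt <- tryCatch(bind_cols(scores, regions), error = function(e) e)
inherits(bind_attempt, "error")

# left_join() matches rows by key instead: every row of scores is kept,
# and region is looked up by country regardless of row order.
joined <- left_join(scores, regions, by = "country")
joined
```

Even when row counts happen to agree, bind_cols() will not check that the rows refer to the same units, so a key-based join is usually the safer choice for attaching lookup variables.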
dplyr::left_join()
Yet Another Mini-Exercise
Try to use left_join() to attach the e_regionpol variable from all_regions to our original can_mex_usa data frame.
Store your new object as can_mex_usa_2 and relocate e_regionpol so that it appears right after country.
dplyr::left_join()
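# The Solution:
can_mex_usa_2 <- can_mex_usa |>
  # With no `by` argument, left_join() performs a natural join on
  # every column name the two data frames share (and prints a
  # message naming those columns):
  left_join(all_regions) |>
  # Relocating the e_regionpol variable:
  relocate(e_regionpol, .after = country)

can_mex_usa_2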
# A tibble: 72 × 9
country e_regionpol year electoral electoral_global_avg liberal
<chr> <hvn_lbll> <dbl> <dbl> <dbl> <dbl>
1 Canada 5 2000 0.84 0.492 0.765
2 Canada 5 2001 0.837 0.495 0.761
3 Canada 5 2002 0.831 0.505 0.754
4 Canada 5 2003 0.831 0.514 0.754
5 Canada 5 2004 0.83 0.514 0.758
6 Canada 5 2005 0.828 0.518 0.758
7 Canada 5 2006 0.836 0.522 0.765
8 Canada 5 2007 0.838 0.519 0.767
9 Canada 5 2008 0.835 0.522 0.766
10 Canada 5 2009 0.835 0.524 0.765
# ℹ 62 more rows
# ℹ 3 more variables: participatory <dbl>, deliberative <dbl>,
# egalitarian <dbl>
dplyr::*_if
Note: Keep clicking (or pressing the space bar on your keyboard) to advance through the slide deck.
dplyr::*_at
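As a rough illustration of the scoped `*_if` and `*_at` verbs, here is a minimal sketch on a hypothetical toy tibble (`toy`, its values, and the new names `elec` and `lib` are made up). The scoped verbs are superseded in current dplyr in favour of across() and rename_with(), so the sketch shows both spellings side by side; it is not necessarily the code used on the slides.

```r
library(dplyr)

# Hypothetical toy data.
toy <- tibble(country = c("A", "B"),
              electoral = c(0.5, 0.25),
              liberal = c(0.75, 0.5))
vars <- c("electoral", "liberal")

# *_if: apply a function to every column that satisfies a predicate.
via_if <- toy |> mutate_if(is.numeric, ~ .x * 100)
# *_at: apply a function to columns selected by name.
via_at <- toy |> mutate_at(vars, ~ .x * 100)

# across() equivalents of the two calls above.
via_across_if <- toy |> mutate(across(where(is.numeric), ~ .x * 100))
via_across_at <- toy |> mutate(across(all_of(vars), ~ .x * 100))

# rename_with() is the current counterpart of rename_at().
renamed <- toy |> rename_with(~ c("elec", "lib"), all_of(vars))
names(renamed)
```

The scoped verbs still run, so older code such as a rename_at() call keeps working; across() simply keeps the column selection inside the ordinary verb.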
Try to produce the following data frame:
# A tibble: 360 × 5
country region year measure score
<chr> <fct> <dbl> <chr> <dbl>
1 Canada North America and Western Europe 2000 electoral 84
2 Canada North America and Western Europe 2000 liberal 76.5
3 Canada North America and Western Europe 2000 participatory 60.1
4 Canada North America and Western Europe 2000 deliberative 76.5
5 Canada North America and Western Europe 2000 egalitarian 72.6
6 Canada North America and Western Europe 2001 electoral 83.7
7 Canada North America and Western Europe 2001 liberal 76.1
8 Canada North America and Western Europe 2001 participatory 59.4
9 Canada North America and Western Europe 2001 deliberative 76.2
10 Canada North America and Western Europe 2001 egalitarian 72.4
# ℹ 350 more rows
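One possible route to a long data frame like the one above is tidyr::pivot_longer(). The sketch below is an assumption about the approach, not the official answer: `wide` is a hypothetical toy tibble with made-up values, and it leaves out the region column entirely.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in "wide" form: one column per measure.
wide <- tibble(country = c("A", "B"),
               year = c(2000, 2000),
               electoral = c(0.5, 0.25),
               liberal = c(0.75, 0.5))

long <- wide |>
  # Stack the measure columns: their names go to `measure`,
  # their values to `score`.
  pivot_longer(cols = c(electoral, liberal),
               names_to = "measure",
               values_to = "score") |>
  # Rescale from the 0-1 range to 0-100.
  mutate(score = score * 100)

long
```

Each input row yields one output row per pivoted column, which is why 72 country-years with five measures would give the 360 rows shown in the target.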
You can access the can_mex_usa_3 data frame via GitHub or by copying and executing the following line:
If you’re done with the first task, check out the data below (cf. Healy 2023).
You can explore the variables here: