Week 3
Cleaning and Data Analysis in R II

SOCI 269

Sakeef M. Karim
Amherst College

AN INTRODUCTION TO QUANTITATIVE SOCIOLOGY—CULTURE AND POWER

Data Wrangling III, February 13th

A Brief Update

Our Syllabus Will Change

I’ll be updating the syllabus over the weekend. Specifically, I will—in all likelihood—extend the deadline for the Coding Assignment by a week.

Are We All on the Same Page?

Launch RStudio and execute the following code:

load(url("https://github.com/sakeefkarim/intro_quantitative_sociology/raw/refs/heads/main/data/week%203/week3.RData"))
A Question

Click your Environment tab. What do you notice?

Are We All on the Same Page?

Here is a list of the objects available in your Global Environment:

ls()
[1] "all_regions"       "canada"            "gss_2010"         
[4] "mexico"            "truncated_regions" "usa"              
Another Question

What does each of these objects correspond to?
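One way to answer this kind of question is to inspect each object directly. The sketch below uses two made-up objects (toy_numbers and toy_frame are hypothetical stand-ins, not the objects loaded above); the same calls should work on anything in your Environment.

```r
# Toy stand-ins for objects in the Global Environment (hypothetical names):
toy_numbers <- c(1, 2, 3)
toy_frame   <- data.frame(country = c("Canada", "Mexico"),
                          year    = c(2000, 2000))

# class() reports what kind of object we have; dim() gives rows and columns.
class(toy_numbers)
class(toy_frame)
dim(toy_frame)

# str() prints a compact overview of a data frame's columns:
str(toy_frame)

# The same check can be run over several objects at once, by name:
object_classes <- sapply(c("toy_numbers", "toy_frame"),
                         function(x) class(get(x))[1])
object_classes
```

Swapping in ls() for the character vector of names would run the check over everything in your Environment.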

A Reminder

We’ll Be Using this Data Once Again

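# Assumes the tidyverse is installed and that vdem is already available in
# your session (e.g., via the vdemdata package; this is an assumption, as the
# slides do not say where vdem comes from).
library(tidyverse)

vdem <- vdem |> as_tibble()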

high_level <- c("v2x_polyarchy", "v2x_libdem", "v2x_partipdem", "v2x_delibdem", "v2x_egaldem")

high_level_names <- c("electoral", "liberal", "participatory", "deliberative", "egalitarian")

can_mex_usa <- vdem |> as_tibble() |> 
                       # Here, we (i) select (and rename) the first column; 
                       # (ii) select the year variable; then 
                       # (iii) select all the variables in the high_level vector.
                       select(country = 1, year, all_of(high_level)) |> 
                       # Here, we rename all the high_level variables using the 
                       # high_level_names vector.
                       rename_at(high_level, ~ high_level_names) |> 
                       # We're performing "per-operation" grouping here---within
                       # the mutate() function (by year). We're then relocating the
                       # grouped variable we created and ensuring that it appears after
                       # "electoral."
                       mutate(electoral_global_avg = mean(electoral, na.rm = TRUE),
                              .by = year, .after = electoral) |> 
                       # Isolating Canada, the US and Mexico + years from
                       # 2000 onward:
                       filter(str_detect(country, "Can|United St|Mex"),
                              year >= 2000) |> 
                       # Arranging countries alphabetically:
                       arrange(country)

can_mex_usa
# A tibble: 72 × 8
   country  year electoral electoral_global_avg liberal participatory
   <chr>   <dbl>     <dbl>                <dbl>   <dbl>         <dbl>
 1 Canada   2000     0.84                 0.492   0.765         0.601
 2 Canada   2001     0.837                0.495   0.761         0.594
 3 Canada   2002     0.831                0.505   0.754         0.587
 4 Canada   2003     0.831                0.514   0.754         0.584
 5 Canada   2004     0.83                 0.514   0.758         0.58 
 6 Canada   2005     0.828                0.518   0.758         0.579
 7 Canada   2006     0.836                0.522   0.765         0.579
 8 Canada   2007     0.838                0.519   0.767         0.579
 9 Canada   2008     0.835                0.522   0.766         0.578
10 Canada   2009     0.835                0.524   0.765         0.578
# ℹ 62 more rows
# ℹ 2 more variables: deliberative <dbl>, egalitarian <dbl>
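The "per-operation" grouping used in the mutate() call above can be easier to see on a tiny example. Here is a minimal sketch with made-up scores (toy and yearly_avg are hypothetical names, and the values are not V-Dem data), comparing .by to the more familiar group_by() and ungroup() pattern.

```r
library(dplyr)

# Toy data (made-up scores, not V-Dem values):
toy <- tibble(country = c("A", "B", "A", "B"),
              year    = c(2000, 2000, 2001, 2001),
              score   = c(0.2, 0.6, 0.4, 0.8))

# Per-operation grouping: the mean is computed within each year, and the
# result is not left grouped afterwards.
with_by <- toy |> 
  mutate(yearly_avg = mean(score, na.rm = TRUE),
         .by = year, .after = score)

# The "persistent" grouping equivalent needs an explicit ungroup():
with_group_by <- toy |> 
  group_by(year) |> 
  mutate(yearly_avg = mean(score, na.rm = TRUE)) |> 
  ungroup()

with_by
```

Both approaches should give the same yearly_avg column; .by simply saves us from having to remember ungroup().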

Introduction to dplyr
Combining Data Frames

dplyr::bind_rows()

We can use bind_rows() to combine observations—or rows—from different data frames.
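Before trying this on the course data, here is a minimal sketch of bind_rows() on two toy data frames. The object names (north, south, extra) and all values are made up for illustration.

```r
library(dplyr)

# Two toy data frames with the same columns (made-up values):
north <- tibble(country = c("Canada", "Canada"),
                year    = c(2000, 2001),
                score   = c(0.8, 0.9))

south <- tibble(country = c("Mexico", "Mexico"),
                year    = c(2000, 2001),
                score   = c(0.5, 0.6))

# Stack the rows of south underneath the rows of north:
stacked <- bind_rows(north, south)
stacked

# Columns are matched by name; a column missing from one input is
# filled with NA in the combined result.
extra <- tibble(country = "USA", year = 2000)
with_gap <- bind_rows(stacked, extra)
with_gap
```

Because columns are matched by name rather than position, the inputs do not need to store their columns in the same order.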

Quick Exercise

In the next 5-10 minutes, try to recreate can_mex_usa using (i) the data you have stored in your Environment; and (ii) bind_rows().

Note: If you’re running into issues, fear not: the answer’s on the next slide.

dplyr::bind_rows()

# The Solution:

can_mex_usa <- bind_rows(canada, mexico, usa)

can_mex_usa
# A tibble: 72 × 8
   country  year electoral electoral_global_avg liberal participatory
   <chr>   <dbl>     <dbl>                <dbl>   <dbl>         <dbl>
 1 Canada   2000     0.84                 0.492   0.765         0.601
 2 Canada   2001     0.837                0.495   0.761         0.594
 3 Canada   2002     0.831                0.505   0.754         0.587
 4 Canada   2003     0.831                0.514   0.754         0.584
 5 Canada   2004     0.83                 0.514   0.758         0.58 
 6 Canada   2005     0.828                0.518   0.758         0.579
 7 Canada   2006     0.836                0.522   0.765         0.579
 8 Canada   2007     0.838                0.519   0.767         0.579
 9 Canada   2008     0.835                0.522   0.766         0.578
10 Canada   2009     0.835                0.524   0.765         0.578
# ℹ 62 more rows
# ℹ 2 more variables: deliberative <dbl>, egalitarian <dbl>

dplyr::bind_cols()

Another Mini-Exercise

Okay, that wasn’t so hard. Let’s try to use bind_cols() to append a new variable (e_regionpol from truncated_regions) to our data.

A Quick Question

Will this work?

can_mex_usa |> bind_cols(truncated_regions)

dplyr::bind_cols()

# The Solution:

can_mex_usa_1 <- can_mex_usa |> 
                 # Avoiding duplicate columns by dropping the first two
                 # columns of truncated_regions:
                 bind_cols(truncated_regions |> select(-c(1:2))) |> 
                 # Relocating the e_regionpol variable:
                 relocate(e_regionpol, .after = country)

can_mex_usa_1
# A tibble: 72 × 9
   country e_regionpol  year electoral electoral_global_avg liberal
   <chr>    <hvn_lbll> <dbl>     <dbl>                <dbl>   <dbl>
 1 Canada            2  2000     0.84                 0.492   0.765
 2 Canada            2  2001     0.837                0.495   0.761
 3 Canada            2  2002     0.831                0.505   0.754
 4 Canada            2  2003     0.831                0.514   0.754
 5 Canada            2  2004     0.83                 0.514   0.758
 6 Canada            2  2005     0.828                0.518   0.758
 7 Canada            2  2006     0.836                0.522   0.765
 8 Canada            2  2007     0.838                0.519   0.767
 9 Canada            2  2008     0.835                0.522   0.766
10 Canada            2  2009     0.835                0.524   0.765
# ℹ 62 more rows
# ℹ 3 more variables: participatory <dbl>, deliberative <dbl>,
#   egalitarian <dbl>
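Two features of bind_cols() are worth keeping in mind: duplicate column names get repaired, and rows are glued together by position rather than by matching values. The sketch below illustrates both with toy data (scores, regions and the "North"/"South" labels are all hypothetical).

```r
library(dplyr)

scores <- tibble(country = c("Canada", "Mexico"),
                 year    = c(2000, 2000),
                 score   = c(0.8, 0.5))

# Same countries and years, but stored in a different row order:
regions <- tibble(country = c("Mexico", "Canada"),
                  year    = c(2000, 2000),
                  region  = c("South", "North"))

# Binding everything "works", but duplicated names are repaired by
# appending each column's position (e.g., country...1 and country...4):
naive <- bind_cols(scores, regions)
names(naive)

# Dropping the duplicated columns avoids the renaming. Rows are still
# matched purely by position, so Canada is paired with "South" here:
by_position <- bind_cols(scores, regions |> select(region))
by_position
```

If the rows of the two inputs are ordered differently, values end up attached to the wrong observations, so it is worth checking row order whenever bind_cols() output looks surprising.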

dplyr::left_join()

What happens when we try to bind can_mex_usa with all_regions using the bind_cols() function?

can_mex_usa |> bind_cols(all_regions)

The powerful *_join() family of verbs from dplyr is especially useful when we’re stitching together data frames of different dimensions (e.g., different numbers of rows).
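To see why joins help here, consider a minimal sketch with toy data (scores, regions and the region labels are made up). The lookup table has more rows than the main data frame and is stored in a different order, yet left_join() lines everything up by matching on key columns.

```r
library(dplyr)

scores <- tibble(country = c("Canada", "Canada", "Mexico"),
                 year    = c(2000, 2001, 2000),
                 score   = c(0.8, 0.9, 0.5))

# A lookup table with more rows than scores, in a different order:
regions <- tibble(country = c("USA", "Mexico", "Canada", "Canada", "Mexico"),
                  year    = c(2000, 2000, 2001, 2000, 2001),
                  region  = c("North", "South", "North", "North", "South"))

# Keep every row of scores; attach region wherever country AND year match:
joined <- scores |> 
  left_join(regions, by = c("country", "year"))

joined
```

Rows in regions with no match in scores (here, the USA row and Mexico in 2001) are simply dropped, and the result keeps the three rows of scores.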

dplyr::left_join()

Yet Another Mini-Exercise

  1. Try to use left_join() to attach the e_regionpol variable from all_regions to our original can_mex_usa data frame.

  2. Store your new object as can_mex_usa_2 and relocate e_regionpol so that it appears right after country.

dplyr::left_join()

# The Solution:

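# With no `by` argument, left_join() joins on every column name the two
# data frames share (and prints a message listing them):
can_mex_usa_2 <- can_mex_usa |> left_join(all_regions) |> 
                                relocate(e_regionpol, .after = country)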

can_mex_usa_2
# A tibble: 72 × 9
   country e_regionpol  year electoral electoral_global_avg liberal
   <chr>    <hvn_lbll> <dbl>     <dbl>                <dbl>   <dbl>
 1 Canada            5  2000     0.84                 0.492   0.765
 2 Canada            5  2001     0.837                0.495   0.761
 3 Canada            5  2002     0.831                0.505   0.754
 4 Canada            5  2003     0.831                0.514   0.754
 5 Canada            5  2004     0.83                 0.514   0.758
 6 Canada            5  2005     0.828                0.518   0.758
 7 Canada            5  2006     0.836                0.522   0.765
 8 Canada            5  2007     0.838                0.519   0.767
 9 Canada            5  2008     0.835                0.522   0.766
10 Canada            5  2009     0.835                0.524   0.765
# ℹ 62 more rows
# ℹ 3 more variables: participatory <dbl>, deliberative <dbl>,
#   egalitarian <dbl>

Introduction to dplyr
Extending Verbs

dplyr::*_if

Note: Keep clicking (or pressing the space bar) to advance through the slide deck.
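As a rough illustration of the *_if idea, here is a minimal sketch using mutate_if() on toy data. The Canada values echo the 2000 scores shown earlier; the Mexico values are made up. In current versions of dplyr the scoped verbs are superseded, so the across() equivalent is shown alongside.

```r
library(dplyr)

toy <- tibble(country   = c("Canada", "Mexico"),
              electoral = c(0.84, 0.65),
              liberal   = c(0.765, 0.45))

# Scoped verb: apply a function to every column that satisfies a
# predicate (here, every numeric column).
scaled_if <- toy |> 
  mutate_if(is.numeric, ~ .x * 100)

# The across() + where() equivalent:
scaled_across <- toy |> 
  mutate(across(where(is.numeric), ~ .x * 100))

scaled_if
```

The character column (country) is left untouched because it fails the is.numeric test.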

dplyr::*_at
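We already met one *_at verb: rename_at() in the code that built can_mex_usa. The sketch below repeats that pattern on a one-row toy data frame (toy, old_names and new_names are hypothetical names) and shows a rename_with() equivalent, since the scoped verbs are superseded in current dplyr.

```r
library(dplyr)

toy <- tibble(country = "Canada", v2x_polyarchy = 0.84, v2x_libdem = 0.765)

old_names <- c("v2x_polyarchy", "v2x_libdem")
new_names <- c("electoral", "liberal")

# Scoped verb: rename exactly the columns listed in old_names.
renamed_at <- toy |> 
  rename_at(old_names, ~ new_names)

# An equivalent using rename_with() and all_of():
renamed_with <- toy |> 
  rename_with(~ new_names, .cols = all_of(old_names))

renamed_at
```

Both versions rely on old_names and new_names being the same length and in the same order.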

Introduction to dplyr
Reshaping Data

Wide to Long and Back Again

Note: Keep clicking (or pressing the space bar) to advance through the slide deck.
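Here is a minimal sketch of going from wide to long and back again with tidyr. The toy data frame is hypothetical: the Canada values echo the 2000 scores shown earlier, and the Mexico values are made up.

```r
library(dplyr)
library(tidyr)

wide <- tibble(country   = c("Canada", "Mexico"),
               year      = c(2000, 2000),
               electoral = c(0.84, 0.60),
               liberal   = c(0.765, 0.40))

# Wide to long: one row per country-year-measure combination.
long <- wide |> 
  pivot_longer(cols = c(electoral, liberal),
               names_to = "measure",
               values_to = "score")

# Long to wide: spread the measures back out into their own columns.
back_to_wide <- long |> 
  pivot_wider(names_from = measure,
              values_from = score)

long
```

The names_to and values_to arguments name the two new columns that pivot_longer() creates; pivot_wider() reverses the operation using those same columns.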

Today’s Exercise

Try to produce the following data frame:

# A tibble: 360 × 5
   country region                            year measure       score
   <chr>   <fct>                            <dbl> <chr>         <dbl>
 1 Canada  North America and Western Europe  2000 electoral      84  
 2 Canada  North America and Western Europe  2000 liberal        76.5
 3 Canada  North America and Western Europe  2000 participatory  60.1
 4 Canada  North America and Western Europe  2000 deliberative   76.5
 5 Canada  North America and Western Europe  2000 egalitarian    72.6
 6 Canada  North America and Western Europe  2001 electoral      83.7
 7 Canada  North America and Western Europe  2001 liberal        76.1
 8 Canada  North America and Western Europe  2001 participatory  59.4
 9 Canada  North America and Western Europe  2001 deliberative   76.2
10 Canada  North America and Western Europe  2001 egalitarian    72.4
# ℹ 350 more rows

You can access the can_mex_usa_3 data frame via GitHub or by copying—and executing—the following line:

can_mex_usa_3 <- readRDS(url("https://github.com/sakeefkarim/intro_quantitative_sociology/raw/refs/heads/main/data/week%203/can_mex_usa_3.rds"))

Optional Exercise

If you’re done with the first task, check out the data below (cf. Healy 2023):

readRDS(url("https://github.com/sakeefkarim/intro_quantitative_sociology/raw/refs/heads/main/data/week%202/gss_2010.rds"))

You can explore the variables here:

Enjoy the Weekend

Reference(s)

Healy, Kieran Joseph. 2023. gssr: General Social Survey Data for Use in R.