Foundations Module: Unit 3

Author

Benjamin Feder, Roy McKenzie, Corey Sparks, Joshua Edelmann

1 Introduction

Welcome to the third unit of the Foundations module, where we will cover descriptive analysis. In this unit, we build on the project scoping carried out through the data discovery and analytic frame creation phases, applying it to our data as we proceed to the analysis and communication portion of the development process.

In our last notebook, we used the insights gained from our data discovery process to construct our analytic frame - our base dataset that is going to serve as the starting point for the rest of our analysis. In this notebook, we will learn how to leverage our analytic frame for a descriptive analysis aimed at answering our research focus - recreating the following visual in the Workforce Overview Report for Kentucky Regions (WORKR), designed and maintained by the Kentucky Center for Statistics:

Descriptive analysis involves constructing a set of descriptive statistics (average, minimum, maximum, sum, etc.) from which we can draw inferences related to our research question. This might consist of evaluating trends over time, comparing averages across different groups, or measuring correlations between certain variables. These statistics often can be presented in a final output table.

The process of descriptive analysis is not just about generating numbers, though. The code to carry out a descriptive analysis may be quite simple, but understanding the output required to answer your research question and developing the narrative that discusses these results can be much more complex. This process requires a deep understanding of the policy context of your analysis and the ability to think creatively about your results. It is linked closely to the process of project scoping, and will require you to continue building on the data literacy skills we have already introduced in this course.

In this notebook, we will put these skills into practice using the same Local Area Unemployment Statistics (LAUS) data we explored in the first two units of the course. We will reexamine our analytic frame and construct an analysis plan to recreate a portion of the WORKR dashboard. To carry out this analysis plan, we will begin working in R, a new tool in our data analytics toolbox. We will implement our fundamental data operations, this time in R, and use these operations to carry out a simple descriptive analysis. This will then prepare us for our final unit, where we will transform the results of our analysis into the exemplary WORKR visual to help communicate the narrative of the research project.

Note

In this notebook, we will continue to use the language of Fundamental Data Operations introduced in the videos and in class. Fundamental operations in this notebook will appear formatted like this. For more information, please review the fundamental data operations cheat sheet.

2 Setup

If you are not interested in the technical details of this unit and are solely focused on building your data literacy capabilities, we recommend you focus on the text in this notebook, thinking about the optimal structure of the resulting output and the fundamental data operations required to transform our base analytic frame into that meaningful output. If you fit this description, we recommend you skip ahead to the next section, Basic Data Discovery in R, as we will now begin setting up R, the new tool we will be using for this part of the module.

Like SQL, R is a programming language - a set of commands that you can use to instruct the computer in interacting with your data. Also like SQL, we will interface with an additional piece of software to write and run our R scripts - RStudio.

R, though, has its advantages over SQL (and vice versa), and unlocks a new suite of analysis and visualization tools that we can use for our analysis. We will not dive too far into the details for now; the high-level takeaway is that for the rest of our Foundations module, we are going to primarily use R through RStudio.

2.1 Creating an R script

Similar to a SQL script, an R script contains a set of R commands that you can store and easily rerun on the same data.

To create a new R script (a .R file):

  1. Open RStudio, located on the ADRF Desktop. The icon looks like this:

  2. In the top menu bar, click File then New File, and select the first option, R Script:

  3. To confirm your script is accessible, try running the following code in your script, which outputs a built-in R dataset in the Console below:

    mtcars

    You can run it by either clicking the run button on the right-hand side of the script, or by pressing CTRL + Enter (or SHIFT + Enter):

    Please reach out to the instructors for this module if you have any issues completing this setup.

2.2 R Packages

R uses packages to store functions for different types of analysis, so we will need to install a variety of packages to do different things. There are currently over 20,000 packages available for R.

In general, you can install a package by typing:

install.packages("packagename")

into RStudio. You will then have to load the package using the library() command:

library(packagename)

which gives you access to the functions in that package. You do not need to install the package again unless you update your R software, but each time you start a new session (i.e. open RStudio), you will have to load the package again.

2.2.1 Setting up basic packages - the ColeridgeInitiative Package

We have written an R package to help you install the fundamental packages that you will need for this portion of the training. The package is called ColeridgeInitiative and can be installed from within RStudio.

To install the package, type (or copy) the following at the Console in RStudio:

install.packages('P:/tr-foundation-module/packages/ColeridgeInitiative_0.1.0.tar.gz', type = 'source', repos = NULL)

The package will install and, when it is finished, you should see:

Next, load the package by running

library(ColeridgeInitiative)

Next, to install the basic packages needed for the course, run the following command in the console:

install_new()

This will take a minute or so, but when it is done, you should see the prompt returned to you

Now you can use the packages as you would normally use them. For example, to load the dplyr package, simply type (at the Console, or in a script)

library(dplyr)

and the functions in that package will be available to you.

Note

Much of the remaining code in this notebook will not work unless you first complete the steps above. Please let us know if you have any issues installing the necessary packages.

2.3 Loading our analytic frame from Redshift

The final step in getting set up for our descriptive analysis is loading the analytic frame we created in the previous unit from Redshift into R.

You might scratch your head a little at that - if we could just load this data into R, why bother learning SQL? Well, as we hinted at earlier, R also has its limitations. In particular, the base tools in R are insufficient for handling many of the very, very large datasets we will use later on in this course.

With that being said, the standard version of R could probably handle our raw LAUS data - it is not that large. But we wanted to introduce you to a common workflow that we will use throughout this course: using SQL to directly interface with the database for data exploration and the creation of an analytic frame, and then reading that analytic frame into R for your descriptive analysis and visualization.

Thankfully, instead of needing to save a flat file (think .csv) from the database to read into R, we can leverage a separate package, RJDBC, to interface with the Redshift server by running SQL code within R.

First, we need to set up a connection to the specific database:

library(RJDBC)

dbusr=Sys.getenv("DBUSER")
dbpswd=Sys.getenv("DBPASSWD")

url <- "jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;loginToRp=urn:amazon:webservices:govcloud;ssl=true;AutoCreate=true;idp_host=adfs.adrf.net;idp_port=443;ssl_insecure=true;plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider"

driver <- JDBC(
  "com.amazon.redshift.jdbc42.Driver",
  classPath = "C:\\drivers\\redshift_withsdk\\redshift-jdbc42-2.1.0.12\\redshift-jdbc42-2.1.0.12.jar",
  identifier.quote="`"
)

con <- dbConnect(driver, url, dbusr, dbpswd)

For now, don’t worry too much about the details of this connection - you can simply copy and paste this code each time you want to connect your R script to the Redshift database. The only important note is that, for this code to work, you need to create a file named .Renviron in your user folder (i.e. U:\John.Doe.P00002) that contains the following:

DBUSER='adrf\John.Doe.T00112'
DBPASSWD='xxxxxxxxxxxx'

where John.Doe.T00112 is replaced with your username and xxxxxxxxxxxx is replaced with your password (both still in quotes!). The setup of this connection is covered in the “Introduction to RStudio” video available in Unit 3’s accompanying videos, so please watch that if you have any questions.
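
To check that your .Renviron file is being read, you can print the username back in a fresh R session (a quick sanity check; the password loads the same way but should not be printed):

# confirm the environment variable is set; returns "" if .Renviron was not found
Sys.getenv("DBUSER")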

Now that we are connected to Redshift, we can use the following snippet to load our analytic frame, developed in the last query in the previous notebook, into our R environment:

qry <- "
SELECT l.area_text,
       x.stwibname,
       l.year, 
       l.period_name,
       l.unemployment_rate,
       l.unemployment,
       l.employment,
       l.labor_force
FROM ds_public_1.laus l
LEFT JOIN tr_foundations_module.xwalk x ON l.area_text = x.ctyname 
WHERE 
  l.year >= 2022 AND 
  period_name != 'Annual Average' AND 
  seasonal_code != 'S' AND 
  (x.stname = 'Kentucky' OR l.area_text = 'Kentucky')
"

analytic_frame <- dbGetQuery(con, qry)

analytic_frame


Again, you do not need to worry too much about the specific syntax here. If you want to reuse this command for your own project, you can copy and paste the main structure of the code, and just replace the SQL command:

SELECT l.area_text,
       x.stwibname,
       l.year, 
       l.period_name,
       l.unemployment_rate,
       l.unemployment,
       l.employment,
       l.labor_force
FROM ds_public_1.laus l
LEFT JOIN tr_foundations_module.xwalk x ON l.area_text = x.ctyname 
WHERE 
  l.year >= 2022 AND 
  period_name != 'Annual Average' AND 
  seasonal_code != 'S' AND 
  (x.stname = 'Kentucky' OR l.area_text = 'Kentucky')

with whatever query you wish. The results of this query will be stored in R as an object named analytic_frame (which could also be changed to whatever you wish).

If you try this in your own script in RStudio, you should see analytic_frame appear in the Environment tab. If you click on it, you should be able to preview the data. Now that analytic_frame is loaded in our R environment, we can use this object in our analysis.

3 Basic Data Discovery in R

Now that we’ve learned how to access our analytic_frame in R, let’s learn some basic commands we can use for data discovery. Since we have already constructed this data frame, the results of these commands should not be too surprising, but we will use them to re-familiarize ourselves with our analytic frame before we proceed with our descriptive analysis.

As per usual, if you aren’t concerned with learning the details of the R code, please just read through the following output to re-familiarize yourself with the structure of the analytic frame, so that you can better follow along with our descriptive analysis later on in this notebook.

We can get a preview of the data by running head, which outputs a limited number of rows:

head in R works very similarly to SQL’s LIMIT, as they both restrict the preview of the data to show the first set of observations. While the row restriction in SQL follows directly after LIMIT, such as LIMIT 5, head contains an additional argument, n, that can be used - for example, head(analytic_frame, n = 5) will restrict the output of analytic_frame to just five rows. Help documentation for any function available in R can be accessed by running the function with a ? preceding it, such as ?head.

Code comments in R are preceded by a #, while in SQL they are usually preceded by --; in both languages, anything following the comment marker on a given line is not executed as a command.

# output limited number of rows of analytic_frame
head(analytic_frame)
  area_text stwibname year period_name unemployment_rate unemployment
1  Kentucky      <NA> 2022     January               4.5        90803
2  Kentucky      <NA> 2022    February               4.3        87208
3  Kentucky      <NA> 2022       March               4.4        90204
4  Kentucky      <NA> 2022       April               3.4        69639
5  Kentucky      <NA> 2022         May               3.8        78429
6  Kentucky      <NA> 2022        June               4.4        91481
  employment labor_force
1    1947438     2038241
2    1956157     2043365
3    1974773     2064977
4    1976715     2046354
5    1975455     2053884
6    1975070     2066551

Instead of just previewing your data, if you want to display some basic information about all of the variables contained in your dataset, then you can use R’s summary command. This command is a shortcut for running several of our aggregation fundamental data operations on all of the variables in our dataset.

For our analytic frame, summary returns the following:

summary only requires one argument: the object you wish to get the “summary” statistics of. The display differs by variable type. Variables of type character simply report the total number of observations (Length), whereas numeric variables return a consistent set of basic descriptive statistics: the minimum, 1st quartile, median, mean, 3rd quartile, and maximum.

summary(analytic_frame)
  area_text          stwibname              year      period_name       
 Length:2178        Length:2178        Min.   :2022   Length:2178       
 Class :character   Class :character   1st Qu.:2022   Class :character  
 Mode  :character   Mode  :character   Median :2022   Mode  :character  
                                       Mean   :2022                     
                                       3rd Qu.:2023                     
                                       Max.   :2023                     
 unemployment_rate  unemployment       employment       labor_force     
 Min.   : 2.400    Min.   :   30.0   Min.   :    755   Min.   :    792  
 1st Qu.: 3.600    1st Qu.:  204.0   1st Qu.:   4420   1st Qu.:   4632  
 Median : 4.300    Median :  340.0   Median :   7390   Median :   7718  
 Mean   : 4.553    Mean   : 1335.7   Mean   :  32537   Mean   :  33872  
 3rd Qu.: 5.200    3rd Qu.:  636.8   3rd Qu.:  14048   3rd Qu.:  14636  
 Max.   :13.800    Max.   :92855.0   Max.   :1987567   Max.   :2068390  

These summary statistics can be helpful in getting a feel for our analytic frame. For example, note that the output above includes the range (min, max) of all of the continuous variables in our dataset. This provides a quick check of whether the values our data takes on are within a sensible range for each variable.

For instance, if we were looking at a variable for wages and we saw a minimum value of -99999, that likely isn’t a real value for that variable. Sometimes datasets use these kinds of nonsensical values to signal an error or missing data.
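
As a sketch, suppose we had such a hypothetical wages column; we could recode the sentinel to a proper missing value with dplyr's na_if before summarizing (this assumes the tidyverse, which we introduce shortly, is loaded):

library(dplyr)

# hypothetical data: recode the -99999 sentinel in wages to NA
wages_data <- tibble(wages = c(52000, 48000, -99999))

wages_data %>%
  mutate(wages = na_if(wages, -99999)) %>%
  summary()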

Checkpoint

Looking at the summary statistics for labor_force, do these make sense to you? Did you expect the range between the minimum and maximum values to be so large? Why might there be such a big difference between the mean and the median?

Using R, identify any observations that may be contributing to this difference.
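
As a hint, one way to begin (a sketch using dplyr functions introduced later in this notebook) is to sort the frame by labor_force and inspect the largest observations:

library(dplyr)

# sort descending by labor force to see which observations dominate the range
analytic_frame %>%
  arrange(desc(labor_force)) %>%
  head()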

4 Descriptive Analysis

Now that we’ve done some light data discovery, we are ready to proceed with our descriptive data analysis. Before diving in, however, it is usually helpful to set up a descriptive analysis plan outlining in plain language the steps we would like to carry out in our code.

4.1 Analysis Plan

In developing our analysis plan, it is helpful to start by envisioning our final table, pre-visualization. Recall the output we are trying to recreate:

What type of table of results might allow us to construct this kind of visual? Since the line plot displays monthly unemployment rates at the state and region levels, it may make sense to aim for the first few rows of the final table to look something like this:

region     month  year  unemployment_rate
state      Jan    2022  ?
region …   Jan    2022  ?
state      Feb    2022  ?
region …   Feb    2022  ?

Our analytic frame is not that far off from this table! The main difference, however, is that the data in this hypothetical table is presented at the region level, while the data in our analytic frame is at the county level. Changing this, using our fundamental data operations, requires:

  • Collapsing, or aggregating the county-level portion of the data to the region
  • Recalculating the unemployment rates at this aggregated granularity

In the following sections, we will translate these fundamental data operations into R to construct this desired output table.

Checkpoint

Does the structure of this table make sense to you? Do you understand how it could be helpful in developing our desired product? If given this table with all values filled in, what might you expect to see?

4.2 Complete Code Snippet

As a preview, the following chunk of code will carry out all these steps and generate our final table. If you are just interested in the results, feel free to look at them here, or circle back to these results at the end of the document for the corresponding checkpoint. If you are interested in the anatomy of how this command is structured in R, read on, because as a mini spoiler alert, we will need to use another one of our fundamental data operations not yet mentioned in this notebook!

analytic_frame %>%
  mutate(
    stwibname = ifelse(is.na(stwibname), "Statewide", stwibname)
  ) %>%
  group_by(
    stwibname, year, period_name
  ) %>%
  summarize(
    total_unemployment = sum(unemployment),
    total_labor_force = sum(labor_force)
  ) %>%
  ungroup() %>%
  mutate(
    unemployment_rate = total_unemployment / total_labor_force
  )
`summarise()` has grouped output by 'stwibname', 'year'. You can override using
the `.groups` argument.
# A tibble: 198 x 6
   stwibname               year period_name total_unemployment total_labor_force
   <chr>                  <int> <chr>                    <int>             <int>
 1 A Purchase/Pennyrile ~  2022 April                     6660            168460
 2 A Purchase/Pennyrile ~  2022 August                    7341            168884
 3 A Purchase/Pennyrile ~  2022 December                  6297            168208
 4 A Purchase/Pennyrile ~  2022 February                  8109            168284
 5 A Purchase/Pennyrile ~  2022 January                   8252            168147
 6 A Purchase/Pennyrile ~  2022 July                      8106            169502
 7 A Purchase/Pennyrile ~  2022 June                      8468            170288
 8 A Purchase/Pennyrile ~  2022 March                     7562            169062
 9 A Purchase/Pennyrile ~  2022 May                       6867            168925
10 A Purchase/Pennyrile ~  2022 November                  6282            167448
# i 188 more rows
# i 1 more variable: unemployment_rate <dbl>

4.2.1 tidyverse

Technical Warning

The upcoming section dives heavily into the mechanics necessary for working with this data in R. If you aren’t interested in this material, feel free to skip ahead to the next section, Aggregation.

If you tried to run the above command in your own R script, you might have gotten an error like this:

Error in mutate(., unemployment_rate = total_unemployment/total_labor_force): could not find function "mutate"

This is because the above command relies on a package called the tidyverse [@tidyverse2019]. From the tidyverse website (tidyverse.org):

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

What that means is that the tidyverse is a set of tools that you can use in R to make your analytic life much easier. R itself was invented in 1993 - quite some time ago! - and it is the tidyverse suite of packages that has helped keep it relevant and increasingly popular in recent years. When looking for R help online, be sure to always include the “tidyverse” keyword in your search, to ensure that you find helpful results which fit inside the paradigm we will learn here. And, to begin any of our scripts for this course, we want to make sure to always run this piece of code:

As we discussed in the Setup section, library loads external packages not available in the base version of R into the environment. Here, by loading the tidyverse, this function brings in a whole suite of other packages we will leverage in our analysis. You may have noticed library used once before, to load the RJDBC package, which allows us to connect to the Redshift database.

Remember, these packages must be installed before they can be loaded with library. For a reminder on how to do so, please see the section R Packages.

library(tidyverse)

In addition to bringing in all of the packages within the tidyverse ecosystem, using the tidyverse also allows us to access a very important programming operation in R. Let’s take another look at the above (shortened) example:

analytic_frame %>%
  mutate(
    stwibname = ifelse(is.na(stwibname), "Statewide", stwibname)
  ) 

If you are wondering what %>% means - great question! While this set of symbols can look rather intimidating, it is not too bad once you get comfortable using this expression. The %>% operator is called a pipe and it takes whatever is on the left side, or in front, of it and “pipes” it into the first argument of whatever function is on the right. In other words:

The pipe takes this %>% f(and puts it in here)
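
For instance, these two lines are equivalent ways of previewing our analytic frame:

# the pipe places analytic_frame into head's first argument...
analytic_frame %>% head(n = 3)

# ...which is the same as calling head directly
head(analytic_frame, n = 3)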

The pipe is primarily about making your code cleaner and more readable. It also makes it easy to deconstruct larger “chains” of code, running each part one at a time to view the progress of the entire manipulation. In our shortened example, we could have just as easily run:

mutate(
  analytic_frame,
  stwibname = ifelse(is.na(stwibname), "Statewide", stwibname)
)

and gotten the same results. But if we look at our full command written this way, it becomes pretty obvious why the pipe is so helpful:

mutate(
  ungroup(
    summarize(
      group_by(
        mutate(
          analytic_frame, 
          stwibname = ifelse(is.na(stwibname), "Statewide", stwibname)
        ),
        stwibname, year, period_name
      ), 
      total_unemployment = sum(unemployment),
      total_labor_force = sum(labor_force)
    )
  ),
  unemployment_rate = total_unemployment / total_labor_force
)

4.3 Aggregation

Now that we’ve discussed the mechanics of using R and the pipe operator for analysis, we can get back to translating our analysis plan into code. The first step, as noted above, is “collapsing, or aggregating the county-level portion of the data to the region.” Linking this back to our fundamental data operations, we want to aggregate our data after grouping by region. Remember that this allows us to get one or more summary statistics for each group in our data - which sounds like exactly what we want here!

To identify the actual variables by which we would like to aggregate, we can take another look at analytic_frame.

head(analytic_frame)
  area_text stwibname year period_name unemployment_rate unemployment
1  Kentucky      <NA> 2022     January               4.5        90803
2  Kentucky      <NA> 2022    February               4.3        87208
3  Kentucky      <NA> 2022       March               4.4        90204
4  Kentucky      <NA> 2022       April               3.4        69639
5  Kentucky      <NA> 2022         May               3.8        78429
6  Kentucky      <NA> 2022        June               4.4        91481
  employment labor_force
1    1947438     2038241
2    1956157     2043365
3    1974773     2064977
4    1976715     2046354
5    1975455     2053884
6    1975070     2066551

We can see that the name of each region is available in stwibname, just as we might have expected from the last notebook.

While we could group by this one variable, it may not be in our best interest. Recall that we also want to look at the values by month. If we grouped and aggregated only on stwibname, we would average across all of the months available, removing the time dimension. We need to group by year and period_name as well as stwibname to avoid this.

Here we see that we are piping our analytic_frame into the function group_by, which carries out the grouping operation in R. According to the documentation for group_by (run ?group_by if you are curious), this function “takes an existing tbl and converts it into a grouped tbl where operations are performed ‘by group’”. In other words, it tells R that it should treat our dataset as a bunch of smaller datasets, separated by a certain subset of variables, that we can analyze independently.

The group_by function requires at least two arguments: the data that we want to group and the variable(s) by which to group. In this case, we use the pipe (%>%) to place the data in the first argument, and then tell group_by to group by stwibname, year, and period_name.

You might think that this function looks very similar to the GROUP BY command in SQL, and you would be right! These are two different ways to carry out the grouping fundamental data operation in two different languages.

There is one important difference to note, however. In SQL, commands have to be in a set order, so you can’t do a GROUP BY before a FROM statement for example. In R we can use commands in many different orders, as long as the logic implied by the order is correct.

In addition, in R, there is an implied SELECT statement, where R automatically includes all variables in the data in the output. We can use an equivalent select() statement in our workflow if we want to specifically limit the input data to contain only a subset of columns.
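
For example, if we only wanted to keep the grouping variables and the counts we plan to aggregate, we could add a select() step before grouping (a sketch; the rest of this notebook keeps all columns):

analytic_frame %>%
  select(stwibname, year, period_name, unemployment, labor_force) %>%
  group_by(stwibname, year, period_name)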

analytic_frame %>%
  group_by(stwibname, year, period_name)
# A tibble: 2,178 x 8
# Groups:   stwibname, year, period_name [198]
   area_text stwibname  year period_name unemployment_rate unemployment
   <chr>     <chr>     <int> <chr>                   <dbl>        <int>
 1 Kentucky  <NA>       2022 January                   4.5        90803
 2 Kentucky  <NA>       2022 February                  4.3        87208
 3 Kentucky  <NA>       2022 March                     4.4        90204
 4 Kentucky  <NA>       2022 April                     3.4        69639
 5 Kentucky  <NA>       2022 May                       3.8        78429
 6 Kentucky  <NA>       2022 June                      4.4        91481
 7 Kentucky  <NA>       2022 July                      4.4        89721
 8 Kentucky  <NA>       2022 August                    3.9        79092
 9 Kentucky  <NA>       2022 September                 3.5        70984
10 Kentucky  <NA>       2022 October                   3.8        76593
# i 2,168 more rows
# i 2 more variables: employment <int>, labor_force <int>

Weird - this looks exactly like our dataset before! That is because we still have not applied any aggregation operation to our groups. Grouping our dataset alone doesn’t change the way that our data looks - it just changes the way that R (or SQL) processes any other aggregation commands behind the scenes. We need to actually use some sort of aggregation operation to see a change in our data.

Which aggregation operations do we want to run? Recall that for our final output we want to calculate the unemployment_rate variable at the month-region level. To do so, we need the total unemployment and labor force counts for each region, so our aggregation should sum all unemployment and labor_force values within each group.

The summarize function (or summarise - the tidyverse/dplyr is flexible!) is the general command we use for aggregation in R. Similar to group_by, summarize takes at least two arguments. The first is our dataset, which we pipe in here, continuing on from the grouping step. The second (and beyond) argument(s) are then one or more aggregation formulas, containing both the aggregation operations we want to use and the new variable names we should use to refer to our aggregates.

As you can see in the output below, summarize produces a new, smaller data frame, with one row for each unique combination of the variables that you specified in the group_by command. The other columns in this aggregated dataset are those specified by the formulas in the summarize command.

For example, in the code we have total_unemployment = sum(unemployment). This will then create a variable called total_unemployment in our aggregated dataset, which sums the values of unemployment for all of the observations in each unique combination of period_name, year, and stwibname. In other words, this will give us our total count of unemployment by group, just as we desired.

Similar to aggregation in SQL, summarize can work without a group_by - in that case, it will return a single row summarizing all rows in the data.

One question that you might have here is how summarize differs from mutate. Note that mutate creates a new variable for the existing set of observations, while summarize collapses these observations according to their group and creates a new variable using aggregation.
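
As a quick illustration of both points (a sketch):

# without group_by, summarize collapses the entire frame to a single row
analytic_frame %>%
  summarize(total_unemployment = sum(unemployment))

# mutate, by contrast, keeps every row, attaching the same overall total to each
analytic_frame %>%
  mutate(total_unemployment = sum(unemployment)) %>%
  head()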

analytic_frame %>%
  group_by(
    stwibname, year,  period_name
  ) %>%
  summarize(
    total_unemployment = sum(unemployment),
    total_labor_force = sum(labor_force)
  )
`summarise()` has grouped output by 'stwibname', 'year'. You can override using
the `.groups` argument.
# A tibble: 198 x 5
# Groups:   stwibname, year [22]
   stwibname               year period_name total_unemployment total_labor_force
   <chr>                  <int> <chr>                    <int>             <int>
 1 A Purchase/Pennyrile ~  2022 April                     6660            168460
 2 A Purchase/Pennyrile ~  2022 August                    7341            168884
 3 A Purchase/Pennyrile ~  2022 December                  6297            168208
 4 A Purchase/Pennyrile ~  2022 February                  8109            168284
 5 A Purchase/Pennyrile ~  2022 January                   8252            168147
 6 A Purchase/Pennyrile ~  2022 July                      8106            169502
 7 A Purchase/Pennyrile ~  2022 June                      8468            170288
 8 A Purchase/Pennyrile ~  2022 March                     7562            169062
 9 A Purchase/Pennyrile ~  2022 May                       6867            168925
10 A Purchase/Pennyrile ~  2022 November                  6282            167448
# i 188 more rows

Now we have a properly aggregated data frame, and all that’s left is to re-calculate our unemployment rates at the workforce board level.

Before we do so, however, there’s one last thing we need to take care of. The code above gets the aggregated unemployment and labor force counts for each region, but it assigns the statewide counts to a missing value of stwibname - because, of course, the entire state does not fit in a single workforce board. This is fine if you remember why those values are missing, but it is a little cleaner to fix this first, adding a simple command to modify the stwibname variable and give the statewide observations their own unique value of stwibname.

The code below uses a fundamental data operation that we haven’t used much so far - variable creation. In this case, however, we aren’t creating a fully new variable, but instead creating an updated version of our existing stwibname variable.

To do so, we use the mutate statement, which, similar to group_by, takes the dataset we are modifying as its first argument. The second (or beyond) argument(s) are then one or more formulas of the format newvariable = formula, where newvariable is the name of the new variable to be created (or overwritten) and formula is the expression used to compute its values.

For example, if we ran mutate(data, variable2 = 2), mutate would create a variable named variable2 which would take on a value of 2 for every observation.

As we said above, however, in this case we are not using mutate to create a new variable, but instead to create a new version of our pre-existing variable stwibname, where we replace all the missing values of stwibname with the name “Statewide”. To do this, the formula uses an ifelse statement, which checks if stwibname is missing - using the handy is.na function - and then replaces it with “Statewide” if it is missing, or keeps the workforce board name it already had if it is not.
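
On a toy vector, the same pattern looks like this:

# replace missing values with "Statewide"; keep non-missing values as they are
x <- c("Region A", NA, "Region B")
ifelse(is.na(x), "Statewide", x)
# [1] "Region A"  "Statewide" "Region B"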

analytic_frame %>%
  mutate(
    stwibname = ifelse(is.na(stwibname), "Statewide", stwibname)
  ) %>%
  group_by(
    stwibname, year,  period_name
  ) %>%
  summarize(
    total_unemployment = sum(unemployment),
    total_labor_force = sum(labor_force)
  )
`summarise()` has grouped output by 'stwibname', 'year'. You can override using
the `.groups` argument.
# A tibble: 198 x 5
# Groups:   stwibname, year [22]
   stwibname               year period_name total_unemployment total_labor_force
   <chr>                  <int> <chr>                    <int>             <int>
 1 A Purchase/Pennyrile ~  2022 April                     6660            168460
 2 A Purchase/Pennyrile ~  2022 August                    7341            168884
 3 A Purchase/Pennyrile ~  2022 December                  6297            168208
 4 A Purchase/Pennyrile ~  2022 February                  8109            168284
 5 A Purchase/Pennyrile ~  2022 January                   8252            168147
 6 A Purchase/Pennyrile ~  2022 July                      8106            169502
 7 A Purchase/Pennyrile ~  2022 June                      8468            170288
 8 A Purchase/Pennyrile ~  2022 March                     7562            169062
 9 A Purchase/Pennyrile ~  2022 May                       6867            168925
10 A Purchase/Pennyrile ~  2022 November                  6282            167448
# i 188 more rows

Now our statewide observations all map to the “Statewide” workforce development board.

4.4 Recalculating the unemployment rates

Looking at our output above, you might be concerned that the unemployment_rate variable from our analytic_frame is no longer present in our aggregated data frame. This isn’t much of a setback, though, because unemployment_rate was only available by county, not by region. Unlike the unemployment and labor_force counts, unemployment_rate is hard to aggregate up from the county level to the region level: it doesn’t make much sense, for example, to simply average the unemployment_rate values for all of the counties within a region, as this would not properly weight the influence of counties of different sizes.
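
A small numeric sketch (with two hypothetical counties, and assuming the tidyverse is loaded) shows why: the simple average of two county rates differs from the rate computed on their pooled totals:

# hypothetical counties: one large, one small
counties <- tibble(
  unemployment = c(10000, 100),
  labor_force = c(200000, 1000)
)

# unweighted average of county rates: (0.05 + 0.10) / 2 = 0.075
mean(counties$unemployment / counties$labor_force)

# pooled rate, weighting each county by its labor force: 10100 / 201000 = 0.0502
sum(counties$unemployment) / sum(counties$labor_force)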

Instead, having already aggregated the unemployment and labor_force values to the region, we can leverage variable creation to create a new variable holding our recalculated unemployment rates. In this case, the unemployment rate for each of our month-region combinations can be easily calculated by dividing total_unemployment by total_labor_force, so it’s the result of that operation that we want our new variable to store:

There are two new additions to our code here. The first is the ungroup statement. Since, after our aggregation, each row already represents a unique combination of month and region, we no longer need to work with a grouped data frame. It is therefore best practice to remove grouping when it is no longer necessary - usually immediately after the summarize step - by running ungroup. This is a great way to avoid some very confusing errors that can arise from continuing to calculate things group by group.
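
To see why this matters, here is a sketch: computing each region's share of total unemployment gives different answers on a grouped versus ungrouped frame:

region_totals <- analytic_frame %>%
  mutate(stwibname = ifelse(is.na(stwibname), "Statewide", stwibname)) %>%
  group_by(stwibname, year, period_name) %>%
  summarize(total_unemployment = sum(unemployment))

# still grouped (by stwibname and year), so sum() works within each group
region_totals %>%
  mutate(share = total_unemployment / sum(total_unemployment))

# after ungroup(), the same formula divides by the grand total across all rows
region_totals %>%
  ungroup() %>%
  mutate(share = total_unemployment / sum(total_unemployment))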

Now that we have removed all groups from our data, we can proceed with our variable creation process; we can accomplish this in R using mutate as before. Our formula for actually creating the new unemployment_rate variable is fairly straightforward division.

Note that mutate does not disrupt any of the previously-existing variables in the data unless they are specifically overwritten as a new variable with the same name as a prior one.

analytic_frame %>%
  mutate(
    stwibname = ifelse(is.na(stwibname), "Statewide", stwibname)
  ) %>%
  group_by(
    stwibname, year, period_name
  ) %>%
  summarize(
    total_unemployment = sum(unemployment),
    total_labor_force = sum(labor_force)
  ) %>%
  ungroup() %>%
  mutate(
    unemployment_rate = total_unemployment / total_labor_force  
  )
`summarise()` has grouped output by 'stwibname', 'year'. You can override using
the `.groups` argument.
# A tibble: 198 x 6
   stwibname               year period_name total_unemployment total_labor_force
   <chr>                  <int> <chr>                    <int>             <int>
 1 A Purchase/Pennyrile ~  2022 April                     6660            168460
 2 A Purchase/Pennyrile ~  2022 August                    7341            168884
 3 A Purchase/Pennyrile ~  2022 December                  6297            168208
 4 A Purchase/Pennyrile ~  2022 February                  8109            168284
 5 A Purchase/Pennyrile ~  2022 January                   8252            168147
 6 A Purchase/Pennyrile ~  2022 July                      8106            169502
 7 A Purchase/Pennyrile ~  2022 June                      8468            170288
 8 A Purchase/Pennyrile ~  2022 March                     7562            169062
 9 A Purchase/Pennyrile ~  2022 May                       6867            168925
10 A Purchase/Pennyrile ~  2022 November                  6282            167448
# i 188 more rows
# i 1 more variable: unemployment_rate <dbl>

With all of that, we now have our table in our desired structure, and have satisfied the requirements of our analysis plan!

Checkpoint

This is our final table that we will use for our visualization in the next unit. Looking at it now, what patterns do you see? Do they surprise you, or is it what you’d have expected? How would you visualize these results, and what story would you tell?

4.5 Saving Final Table

This table will provide the data that will underlie our visualization in the next unit. But we don’t want to have to load the Redshift data and run all these commands again - we want to save this data so that we can directly read it for use in our visualization, which we will create in R. We can do so by leveraging our pipe, as well as a new function, write_csv, which will save this table as a .csv:

The code is exactly the same as in the previous snippet, except it pipes the output one step further into the write_csv function, which saves the results as a .csv file. write_csv requires two arguments - the table to be written out, and the resulting file name. It also has other optional arguments for handling various edge cases that may arise in converting data in R to a .csv file.

We can modify the default path for write_csv by explicitly providing the path we want. For instance, if you wanted to save the output in a folder on the C: drive called My_output, you could pipe the table into write_csv("C:/My_output/final_table.csv").

analytic_frame %>%
  mutate(
    stwibname = ifelse(is.na(stwibname), "Statewide", stwibname)
  ) %>%
  group_by(
    stwibname, year, period_name
  ) %>%
  summarize(
    total_unemployment = sum(unemployment),
    total_labor_force = sum(labor_force)
  ) %>%
  ungroup() %>%
  mutate(
    unemployment_rate = total_unemployment / total_labor_force
  ) %>%
  write_csv("U:/User.Name.T00112/final_table.csv")

If you want to run the code above, please replace User.Name with your ADRF user name as it appears in your U: drive path, in order to save the final table to your U: drive.
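
In the next unit, we can then read this file straight back into R with read_csv from the readr package (part of the tidyverse), avoiding the need to rerun the Redshift query:

library(readr)

# reload the saved table for the visualization work in the next unit
final_table <- read_csv("U:/User.Name.T00112/final_table.csv")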

5 Conclusion

This notebook introduced a framework for developing and executing a descriptive analysis, both from a data literacy and a technical standpoint. It relied on a new coding language, R, to apply the analysis plan, drawing largely on our fundamental data operations in doing so. We will build on this product in our final unit, where we will discuss practices and mediums for conveying findings. Before then, come to class prepared to discuss the results of our analysis, as well as any other questions that may have come up as you worked through this notebook. See you there!

6 References

Applied Data Analytics Training Program - Foundations Module (LODES data), Unit 3 notebook (citation to be added)

Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686