Foundations Module: Unit 4
1 Introduction
Welcome to our final workbook of the foundations module, Unit 4: Data Visualization. This unit is the culmination of all the work we have done so far with the LAUS data aimed at answering our research focus - recreating the following visual in the Workforce Overview Report for Kentucky Regions (WORKR), designed and maintained by the Kentucky Center for Statistics:
In the previous workbook, we used our analytic frame as the basis for some descriptive analysis aimed at calculating unemployment rates in workforce regions for a single state. The idea of this was to create the final “table” that we would need to support the above visualization - a table that examines unemployment rates across workforce regions within the selected state. Now, in this workbook, we want to learn how to display the values from that final table in our data visualization.
It’s important to remember that the end goal of this visualization is, of course, sharing these results with your audience - whomever that may be. In many ways, we already have all of the information that we would want out of this graphic in the final table we constructed in the third notebook. But this information is dense and hard to parse - especially for a more policy-focused audience. Building an effective data visualization can help you efficiently communicate important analyses in a compelling way - even to an audience with a less technical background.
Even before communicating your results, creating your visualization can be an important step in helping build your own understanding of your results. This stage of the process is the time for you to hone your project narrative and the story behind your data. Building off all of the data literacy skills we’ve already learned in this module, creating and describing your data visualization requires you to act as the data expert and translate the picture into a meaningful and policy-relevant story.
More than anything else in this module, visualization is an art, not a science. Creating an effective data visualization is an iterative process involving many formulations and reformulations, and you have to trust your own knowledge and data literacy experience on how to most effectively convey your final results. The process we will work through in this workbook is just one (simplified) example of how you might approach this task, but, as always, the truth is much messier!
So with that in mind, let’s get started.
2 Technical details
There are a couple of technical considerations we want to note here before diving into the actual creation of our visualization. We’re going to learn about the package we are going to use to make our visualizations, and we’re going to read our final table data into R and prep it slightly for use with this package. If this sounds somewhat technical to you, or if your primary focus is on data literacy, then you can skip this section and proceed straight to the Data Visualization section.
2.1 ggplot2
For making our visualizations in R, we will use the package ggplot2, included as part of the tidyverse suite of packages, which we discussed in the previous unit.
The code below will not work unless you have completed the package installation setup detailed in the unit 3 notebook. If you have not done so, please complete this package installation now following the instructions in that notebook.
Let’s go ahead and load the tidyverse here:
library(tidyverse)
As defined on the tidyverse page (ggplot2.tidyverse.org):
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what geometries to use, and it takes care of the details.
That definition is a little dense, and we’ll get a clearer example of what each piece means throughout the rest of this workbook. The main idea, however, is this: you provide ggplot with your data, and then use a consistent set of commands to transform aspects of that data into visual attributes of your graph. The “Grammar of Graphics” provides a strong theoretical underpinning for the structure of these translations, which makes it very easy to apply the principles learned in creating one ggplot graph to any other.
Since it is part of the tidyverse suite, the syntax used in ggplot2 should also feel very familiar to you moving forward. As with the other tidyverse commands we have learned, a normal ggplot call works by stacking commands (or, in this case, ggplot layers) on top of each other until we get the final result we’re looking for. We’ll see this in practice soon, but it goes to show how, by working within the tidyverse, you begin to learn common programming principles that can strengthen your work in a variety of unexpected areas.
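To make the layering idea concrete before we touch our own data, here is a minimal sketch using the economics dataset that ships with ggplot2 - the dataset choice here is purely for illustration:
library(tidyverse)
# the data goes in first, aes() maps variables to visual attributes
# (x and y positions here), and each geom_*() layer is stacked on with +
economics %>%
  ggplot(aes(x = date, y = unemploy)) +
  geom_line()
Every ggplot graph we build in this workbook follows this same pattern: data, then aesthetic mappings, then layers.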
For your reference, there is a ggplot cheat sheet accessible on the ADA website on the Foundations page that contains many of the most common commands we will use.
2.2 Data setup
Before we begin making our visualization, we have to make sure we have the data behind our visualization available to us! If you saved the final table to your U: drive last time, then you can use the read_csv function to simply read it into R now. But, in case you are catching up, we will quickly remake it here, using the code we have already seen to first load the analytic frame from Redshift into R and then generate our final table from this analytic frame.
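For reference, reading a saved copy back in would look something like the following - the folder and file name here are hypothetical, so point read_csv at wherever you saved yours:
library(tidyverse)
# hypothetical path - replace with the location you used in notebook 3
final_table <- read_csv("U:/YOUR.FOLDER/final_table.csv")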
2.2.1 Loading our analytic frame from Redshift
The first step of our visualization analysis involves loading the analytic frame into R. We covered the construction of this analytic frame in the second notebook, so we are going to jump directly into connecting to Redshift from R and running the same query to pull our analytic frame into R. We followed identical steps at the beginning of notebook 3, so feel free to refer back to that for more details.
First, we need to set up a connection to the specific database:
library(RJDBC)
dbusr=Sys.getenv("DBUSER")
dbpswd=Sys.getenv("DBPASSWD")
url <- "jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;loginToRp=urn:amazon:webservices:govcloud;ssl=true;AutoCreate=true;idp_host=adfs.adrf.net;idp_port=443;ssl_insecure=true;plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider"
driver <- JDBC(
"com.amazon.redshift.jdbc42.Driver",
classPath = "C:\\drivers\\redshift_withsdk\\redshift-jdbc42-2.1.0.12\\redshift-jdbc42-2.1.0.12.jar",
identifier.quote="`"
)
con <- dbConnect(driver, url, dbusr, dbpswd)
For now, don’t worry too much about the details of this connection - you can simply copy and paste this code each time you want to connect your R script to the Redshift database. The only important thing to remember is that, if you haven’t already, you need to create a file named .Renviron in your user folder (i.e. U:\John.Doe.P00002) that contains the following:
DBUSER='adrf\John.Doe.P00002'
DBPASSWD='xxxxxxxxxxxx'
where John.Doe.P00002 is replaced with your username and xxxxxxxxxx is replaced with your password (both still in quotes!). The setup of this code and connection is covered in the “Introduction to RStudio” video available in Unit 3’s accompanying videos, so please watch that if you have any questions.
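One quick, optional check that your .Renviron file is being picked up (after restarting your R session) is to confirm that the username value loads - we avoid printing the password, since scripts may be shared:
# should print your adrf username; an empty string means .Renviron was not found
Sys.getenv("DBUSER")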
Now that we are connected to Redshift, we have options in terms of querying our data: either using an SQL query, or the dbplyr package. Explore each option below:
We have already covered using an SQL query to load our analytic frame. The query itself was constructed at the end of our second notebook, and we used it at the beginning of our third notebook to pull our analytic frame into our R environment:
qry <- paste0("
SELECT l.area_text,
x.stwibname,
l.year,
l.period_name,
l.unemployment_rate,
l.unemployment,
l.employment,
l.labor_force
FROM ds_public_1.laus l
LEFT JOIN tr_foundations_module.xwalk x ON l.area_text = x.ctyname
WHERE
l.year >= 2022 AND
period_name != 'Annual Average' AND
seasonal_code != 'S' AND
(x.stname ='", "Kentucky","' OR l.area_text = '", "Kentucky", "')")
analytic_frame <- dbGetQuery(con, qry)
We could also use the tidyverse to connect to the database directly and use dplyr syntax to perform our queries. This approach relies on a package called dbplyr to connect to the database, but uses dplyr syntax, which is often more compact than SQL.
library(dbplyr)
tb_l <- con %>%
tbl(in_schema(
schema = "ds_public_1",
table = "laus"
))
tb_xwalk <- con %>%
tbl(in_schema(
schema = "tr_foundations_module",
table = "xwalk"
))
analytic_frame <- tb_l %>%
left_join(
tb_xwalk,
by = c("area_text" = "ctyname")
) %>%
filter(
year >= 2022,
period_name != "Annual Average",
seasonal_code != "S",
(stname == !!"Kentucky" | area_text == !!"Kentucky")
) %>%
select(
area_text, stwibname, year, period_name,
unemployment_rate, unemployment,
employment, labor_force
) %>%
arrange(area_text, year, period_name) %>%
collect()
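Either way, we can confirm that the pull worked by previewing the first few rows:
head(analytic_frame)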
Here we see the data in our analytic frame, consisting of county-level monthly reported unemployment figures for the state of Kentucky.
2.2.2 Recreating our final table
Now, we want to use our analytic frame to again recreate our final table from the prior notebook. Again, if you saved this table at the end of the third notebook, you can use read_csv to read it here. But, if you would like to recreate a fresh copy, we can use the same code we used in that notebook:
final_table <- analytic_frame %>%
mutate(
stwibname = ifelse(is.na(stwibname), "Statewide", stwibname)
) %>%
group_by(
stwibname, year, period_name
) %>%
summarize(
total_unemployment = sum(unemployment),
total_labor_force = sum(labor_force)
) %>%
ungroup() %>%
mutate(
unemployment_rate = total_unemployment / total_labor_force
)
Again, this is identical to what we saw in the prior notebook, and we encourage you to review it there if you have further questions. The only difference here is that we store the final table in an R data frame called final_table. We can preview that table now to ensure everything was loaded into R correctly:
head(final_table)
# A tibble: 6 x 6
stwibname year period_name total_unemployment total_labor_force
<chr> <int> <chr> <int> <int>
1 A Purchase/Pennyrile W~ 2022 April 6660 168460
2 A Purchase/Pennyrile W~ 2022 August 7341 168884
3 A Purchase/Pennyrile W~ 2022 December 6297 168208
4 A Purchase/Pennyrile W~ 2022 February 8109 168284
5 A Purchase/Pennyrile W~ 2022 January 8252 168147
6 A Purchase/Pennyrile W~ 2022 July 8106 169502
# i 1 more variable: unemployment_rate <dbl>
Just as we saw last time, our final table has the unemployment rate for each of our workforce boards by month - just the data we need to recreate our target visualization! We’re now ready to dive into developing a visualization to show this data in a more intuitive way.
3 Intro to data visualization
Now that we have our final table re-loaded into R, we will walk through the process of creating a data visualization to display the results of our final descriptive analysis of our analytic frame. Remember: All of the information that we want to show is already present in our final table - it is the result of our descriptive analysis! We just want to show this information in a more easily interpretable format.
As a reminder, here’s our final table again:
                    stwibname year period_name total_unemployment total_labor_force unemployment_rate  n
  1: A Purchase/Pennyrile WIB 2022       April               6660            168460        0.03953461 10
  2: A Purchase/Pennyrile WIB 2022      August               7341            168884        0.04346771 10
  3: A Purchase/Pennyrile WIB 2022    December               6297            168208        0.03743579 10
  4: A Purchase/Pennyrile WIB 2022    February               8109            168284        0.04818640 10
  5: A Purchase/Pennyrile WIB 2022     January               8252            168147        0.04907611 10
 ---
194:                Statewide 2023    February              92855           2039691        0.04552405 10
195:                Statewide 2023     January              85014           2039026        0.04169344 10
196:                Statewide 2023        June              92313           2068390        0.04463036 10
197:                Statewide 2023       March              78219           2045271        0.03824383 10
198:                Statewide 2023         May              79733           2063028        0.03864853 10
Now, remember that the ultimate goal for this notebook is to transform this final table into a visualization similar to the following graph from the KYSTATS WORKR dashboard, which shows time series plots of unemployment rates by region within the state of Kentucky:
In the following sections, we’ve outlined three steps for creating a visual:
Step 1: Visualization foundation: Developing the base image.
Step 2: Cleaning up: Strengthening our visual.
Step 3: Final housekeeping: Final touches and interpreting results.
Look again at our final table and at the visualization we are hoping to mimic. Do you see how the numbers and categories displayed in the final table map onto the data points in the visualization? What differences are there?
3.1 Step 1: Visualization foundation
There are two key questions that lie at the foundation of the visualization process:
What variables do we want to display in our graph?
What type of chart will we use to display our data?
Let’s think about the first question. It’s not too hard to answer - we know that we want to show the overall rate of unemployment in both the state as a whole and for each region within the state.
But what about the second question - what type of visualization do we want to use to display these variables? This is a very open-ended question, and oftentimes the best way to start is just to look at how other people are visualizing data. In our case, we already know what our final output should look like - since we are working on recreating the WORKR dashboard - but let’s forget about that for a moment and try to work through the process of designing our visualization ourselves.
A great resource when starting to design a new visualization is the R Graph Gallery (r-graph-gallery.com), which has a fantastic set of over 400 data visualizations that might serve as inspiration when you are working through your own visualizations.
Remember, you cannot access external web resources like this inside the ADRF, but we do encourage you to pull it up in another window in your browser!
This site categorizes visualization types based on the type of underlying data they represent. For example:
- Distribution data: Histogram, density plot, or box plot
- Correlation data: scatter plot, heat map
- Parts of a whole: Grouped/stacked bar plot, pie chart
- Trends over time: Line chart, area chart
Thinking about the structure of the data you are trying to present in these terms can be very effective in helping you narrow down the type of chart you want to use.
Let’s apply this idea to our example. We are trying to compare the regions within the state to one another, and to the state as a whole. At first blush, this might sound like parts of a whole - after all, each region makes up a “part” of the “whole” state. But, importantly, we don’t want to make this comparison just once - we want to show how these differences evolve and change over time - which (perhaps unsurprisingly) puts us directly in the trends over time category.
You might not be convinced by that argument, and for good reason. If we were comparing only two months, or years, for example, it might make more sense to use two bar charts, or something similar, to compare our regions with the state. The presence of date data alone doesn’t mean we have to think about it as “trends over time”. But, in this case, we 1) have a lot more than just two periods to compare and 2) are really more concerned with how unemployment rates have evolved over time, with the comparison across regions being a (slightly) secondary concern. For that reason, thinking about this data in a “trends over time” manner probably makes more sense.
When thinking about “trends over time” data, it is almost always best to just keep it simple and use a line plot - it is a classic for a reason! But does that make sense for our use case of plotting unemployment rates over time? Well, often a great first step for creating visualizations like this is just to do a quick sketch, not worrying about the specifics of the data, but just to see if the structure of the visualization makes sense with the structure of the data. Here, our initial back-of-the-napkin sketch could look like this:
This fundamentally does a pretty good job showing exactly what we have in our table: unemployment rates (on the y-axis) over time (on the x-axis). Even though we’re still glossing over some of the details (for now), this gives a good starting place for us to build on. Referring back to our discussion from above, let’s compare this to using a bar plot to show the same trend over time:
At its core, this chart is conveying the same information from our final table: unemployment rates over time. But, even from our simple sketch, we can tell that it’s quite a bit more cluttered than our line plot - and that’s even before we start thinking about having separate bars for each of our regions!
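If you would like to experiment with this comparison yourself before we build the real thing, here is a quick sketch contrasting the two chart types on the same toy trend - the numbers below are invented purely for illustration:
library(tidyverse)
# a made-up monthly series, for illustration only
toy_trend <- tibble(
  month = 1:12,
  rate = c(5.1, 5.0, 4.8, 4.9, 4.7, 4.6, 4.8, 4.7, 4.5, 4.6, 4.4, 4.3)
)
# line plot: the eye follows a single continuous path
toy_trend %>%
  ggplot(aes(x = month, y = rate)) +
  geom_line()
# bar plot: the same information, but one bar per period adds visual clutter
toy_trend %>%
  ggplot(aes(x = month, y = rate)) +
  geom_col()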
Speaking of, let’s turn back to that question - how can we show data for each of our different regions in our plot? Well, a natural way to do this is to allow the color of the lines to vary with region.
This is now an example of a grouped line chart, where three regions are represented by different colors, so that we can make direct comparisons between the unemployment rate in each region.
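The mechanics behind this are a single extra aesthetic mapping. As a sketch with invented data (regions "A" and "B" are placeholders), mapping a grouping variable to color is all it takes to get one line per group:
library(tidyverse)
# made-up data for illustration only
toy_regions <- tibble(
  month = rep(1:6, times = 2),
  rate = c(5.0, 4.8, 4.7, 4.8, 4.6, 4.5, 6.1, 5.9, 5.8, 5.9, 5.6, 5.4),
  region = rep(c("A", "B"), each = 6)
)
# color = region splits the data into one colored line per region
toy_regions %>%
  ggplot(aes(x = month, y = rate, color = region)) +
  geom_line()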
Perhaps unsurprisingly, this type of graph is effectively what is shown in the WORKR dashboard. But hopefully, through this discussion, you’ve gained some appreciation for why the WORKR dashboard displays the data from our descriptive analysis in the way that it does, and what other options might have been available.
Either way, now that we’ve got a mental idea of what we might want our final visualization to look like, let’s actually take a first pass at putting it together using our data. For this first pass, it’s essential to decide what our x-axis should be, what our y-axis should be, and what groups we should use (if any).
Luckily, with our sketch, we’ve already answered these questions for our plot: we want to show the unemployment rates from our final table (y-axis) over time (x-axis), grouped by region. But, these are intuitive answers - we still need to decide what variables these actually correspond to in our final_table. To answer that, let’s take a look at our final table again:
head(final_table)
# A tibble: 6 x 6
stwibname year period_name total_unemployment total_labor_force
<chr> <int> <chr> <int> <int>
1 A Purchase/Pennyrile W~ 2022 April 6660 168460
2 A Purchase/Pennyrile W~ 2022 August 7341 168884
3 A Purchase/Pennyrile W~ 2022 December 6297 168208
4 A Purchase/Pennyrile W~ 2022 February 8109 168284
5 A Purchase/Pennyrile W~ 2022 January 8252 168147
6 A Purchase/Pennyrile W~ 2022 July 8106 169502
# i 1 more variable: unemployment_rate <dbl>
From this, it’s pretty clear that we want our unemployment rate variable to (in one form or another) go on the y-axis. Similarly, we know that our region variable should be stwibname - we went through all that work just to generate this variable! But - what variable do we want to put on our x-axis? Our date variable is currently split across two separate columns - year and period_name. We need to combine these into one column, like so:
plot_data <- final_table %>%
mutate(
combined_date = lubridate::ymd(paste(year, period_name, "01", sep = "-"))
)
table(plot_data$combined_date)
2022-01-01 2022-02-01 2022-03-01 2022-04-01 2022-05-01 2022-06-01 2022-07-01
11 11 11 11 11 11 11
2022-08-01 2022-09-01 2022-10-01 2022-11-01 2022-12-01 2023-01-01 2023-02-01
11 11 11 11 11 11 11
2023-03-01 2023-04-01 2023-05-01 2023-06-01
11 11 11 11
Note that we now save our table as plot_data - this helps us avoid overwriting our final_table, and gives us a place to build in case we need to make any other modifications.
Now that we have variables for both of our axes and our group, let’s try throwing that data onto the first pass at a chart:
plot_data %>%
ggplot(aes(x = combined_date, y = unemployment_rate, color = stwibname)) +
geom_line()
Look at the “rough draft” of our visualization above. What do you think about the use of a grouped line chart to display this data? What other choice could you have made? Why? What would you like to add to or change about this “rough draft”?
3.1.1 Dealing with many WIBs
Before we dive into our next iteration of our visualization, there is one more change we need to make to our base graph. In our output above, we see that there are 11 separate lines - one for each WIB, plus the statewide measure. This many lines on the same graph is very overwhelming, and means that we are doing a bad job conveying the unemployment rate information for any particular geography. Unfortunately, this is one of the cons of working with Workforce Innovation Boards - in some states (including Kentucky) there are a lot of them!
To address this issue, we are going to preserve our data for the statewide measure and the top 3 WIBs (in terms of the total number of individuals in the labor force), and then collapse the remaining WIBs into an “Other” category. Depending on your policy interest, there could be a better way to address this question, but for pedagogical clarity, this is the route we’ll pursue for today.
four_largest <- plot_data %>%
group_by(stwibname) %>%
summarize(combined_labor_force = sum(total_labor_force)) %>%
arrange(desc(combined_labor_force)) %>%
head(n = 4)
plot_data <- plot_data %>%
left_join(four_largest) %>%
mutate(
stwibname = ifelse(!is.na(combined_labor_force), stwibname, "Other")
) %>%
select(-combined_labor_force) %>%
group_by(stwibname, year, period_name, combined_date) %>%
summarize(
total_unemployment = sum(total_unemployment),
total_labor_force = sum(total_labor_force)
) %>%
ungroup() %>%
mutate(
unemployment_rate = total_unemployment / total_labor_force
)
Our new base visualization after these steps then looks like this:
plot_data %>%
ggplot(aes(x = combined_date, y = unemployment_rate, color = stwibname)) +
geom_line()
This is a lot more manageable to work with as a starting place for our visualization!
3.2 Step 2: Cleaning up
At the most basic level, the plot we created above does show all of the data from our final table. But, generally, sharing this “rough draft” data visualization won’t get us very far. We aren’t yet achieving our broader goal of effectively communicating the story behind this information.
There are a couple of clear issues in the case of our example:
- The labels on the x-axis, y-axis, and color key are just our variable names, which are not very clear from an audience perspective.
- The unemployment rate on the y-axis is shown as a decimal, where a percent might be clearer (see the formatting sketch after this list). Similarly, the dates on the x-axis are shown as 2022-01. The WORKR dashboard formatted these as “Jan. 2022”, which might also be clearer.
- The default ggplot colors aren’t very nice to look at, and certainly aren’t very colorblind friendly.
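Before we fix these in the plot itself, here is a quick look at the percent formatter we will lean on for the y-axis. The scales package is installed alongside the tidyverse, and the accuracy value below is just an illustrative choice:
# converts a proportion into a percent label suitable for axis text
scales::percent(0.0482, accuracy = 0.1)   # returns "4.8%"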
These particular issues are all special cases of a broader concern: we want to make the information in our visualization clear and accessible. So let’s dive into that now.
Increasing clarity might mean reordering discrete variables, adding labels to our axes, or providing other information and context to highlight information from our chart.
Increasing accessibility means ensuring that our visualization can be interpreted by a wide range of audiences. The most common example of this is ensuring that colors used in your visualization are color-blind friendly, but it could also look like providing an alt-text write-up of the main results or adding white space to make it easier to distinguish between elements.
And of course, there is a large amount of overlap between both clarity and accessibility.
In general, to address these concerns, it might be helpful to imagine interacting with your visualization from the perspective of someone with no familiarity with the context of your data or the process of your analysis. Adding sufficient context and clarity to the presentation of your results to make it clear even from this perspective can help ensure that the story you are trying to tell with your visualization is the one that is coming across.
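On the accessibility point specifically, one option - an alternative to the brewer palette we will use below - is ggplot2’s built-in viridis scales, which are designed to remain distinguishable under most forms of color blindness. As a sketch:
# scale_color_viridis_d() applies a colorblind-friendly discrete palette
plot_data %>%
  ggplot(aes(x = combined_date, y = unemployment_rate, color = stwibname)) +
  geom_line() +
  scale_color_viridis_d(name = "Region")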
Let’s look at a cleaned up version of our “rough draft” visualization:
plot_data %>%
ggplot(aes(x = combined_date, y = unemployment_rate, color = stwibname)) +
geom_line() +
scale_x_date(name = "Month", date_labels = "%b.\n'%y", date_breaks = "1 month") +
scale_y_continuous(name = "Unemployment Rate", labels = scales::percent, limits = c(0, .08)) +
scale_color_brewer(name = "Region", palette = "Set1")What changes did we make from the “rough draft” version of our visualization? Why do you think we might have made these changes? Did we address our concerns? What other changes would you suggest to make this visualization more clear?
3.3 Step 3: Final housekeeping
With that, our plot is starting to look close to the final product we were aiming for - we now have a working version of our visualization, which does everything it needs to do. There’s still more we can do, however, in getting the visualization to have the specific look and feel that we want. In this stage, it’s time to apply the final polish before sharing your visualization. It’s also a great time to begin receiving feedback on your presentation of the results and iterating based on the input you receive.
Importantly, here we also add a title to our visualization to explain what’s going on, along with a subtitle and a caption containing our data citation. These aspects add context to our graphic, and are very important for the overall effectiveness of the visualization. Nevertheless, they are often best saved for the end of your development process, so that you have a better idea of the story you are going to tell.
An effective title for your visualization is very context dependent. Usually, in this class, we recommend that you give your figures titles which give a short description of the story that’s in that figure. This could be something like (for example) “Unemployment rates skyrocketed in 2020” - the conclusion of the visualization is stated plainly in this title. In contrast, the original KYSTATS visualization we are trying to replicate here comes from a dashboard, where the data are regularly updated and where individuals would likely interact with the visualization to find their own story. In this case, a more generic title like “Unemployment rates (not seasonally adjusted) over time” might make more sense. Again, this title helps the audience understand how to interpret the results presented in the graph in front of them, but from a broader perspective than might be necessary for a figure in a paper or report. Either way, it is important to always keep the context of your audience in mind when naming your figures!
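As a sketch of the difference, here are the two title styles applied to the same base plot - both title strings are just the illustrative examples from the discussion above, not claims about our data:
base_plot <- plot_data %>%
  ggplot(aes(x = combined_date, y = unemployment_rate, color = stwibname)) +
  geom_line()
# "story" title: states the conclusion, suited to a static figure in a report
base_plot + labs(title = "Unemployment rates skyrocketed in 2020")
# "generic" title: describes the contents, suited to a regularly updated dashboard
base_plot + labs(title = "Unemployment rates (not seasonally adjusted) over time")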
With all that said, let’s see the final iteration of our plot:
plot_data %>%
ggplot(aes(x = combined_date, y = unemployment_rate, color = stwibname)) +
geom_line(linewidth = 1) +
geom_point(size = 2, fill = "white", color = "black", pch = 21) +
scale_x_date(name = "Month", date_labels = "%b.\n'%y", date_breaks = "1 month") +
scale_y_continuous(name = "Unemployment Rate", labels = scales::percent, limits = c(0, .08)) +
scale_color_brewer(name = "Region", palette = "Set1") +
labs(
title = "Unemployment Rates (Not Seasonally Adjusted) Over Time",
subtitle = "Statewide and for Workforce Innovation Boards",
caption = "Data from BLS LAUS Estimates"
) +
theme_classic()
As a challenge for our last serious code chunk of the foundations module, we encourage you here to try and figure out the changes we’ve made to our visualization yourself. The structure of each of the calls follows very closely with much of what we’ve already done. Can you tell which new commands correspond to which changes in the graph? Try commenting out individual lines of our call and see what happens!
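For instance, here is one version of that experiment as a sketch: theme_classic() has been removed from the end of the chain, so running this reverts the plot to ggplot’s default grey background and shows exactly what that layer was contributing:
# same plot as above, but with the theme layer removed - compare the
# default grey background with the themed version above
plot_data %>%
  ggplot(aes(x = combined_date, y = unemployment_rate, color = stwibname)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2, fill = "white", color = "black", pch = 21) +
  scale_x_date(name = "Month", date_labels = "%b.\n'%y", date_breaks = "1 month") +
  scale_y_continuous(name = "Unemployment Rate", labels = scales::percent, limits = c(0, .08)) +
  scale_color_brewer(name = "Region", palette = "Set1") +
  labs(
    title = "Unemployment Rates (Not Seasonally Adjusted) Over Time",
    subtitle = "Statewide and for Workforce Innovation Boards",
    caption = "Data from BLS LAUS Estimates"
  )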
Finally, we will save the plot to an image on our hard drive. This is an essential and surprisingly tricky step when you make it to the point of exporting your graphics from your project. Here, we will use the ggsave() function. This function allows us to specify the dimensions and resolution of the image produced - sometimes getting these parameters right can take some trial and error! Here we save our plot output as plot_final, and then use this to create a .png image, a lightweight image format. If you want to run this code, remember to insert your U: drive folder where it says YOUR.FOLDER:
plot_final <- plot_data %>%
ggplot(aes(x = combined_date, y = unemployment_rate, color = stwibname)) +
geom_line(linewidth = 1) +
geom_point(size = 2, fill = "white", color = "black", pch = 21) +
scale_x_date(name = "Month", date_labels = "%b.\n'%y", date_breaks = "1 month") +
scale_y_continuous(name = "Unemployment Rate", labels = scales::percent, limits = c(0, .08)) +
scale_color_brewer(name = "Region", palette = "Set1") +
labs(
title = "Unemployment Rates (Not Seasonally Adjusted) Over Time",
subtitle = "Statewide and for Workforce Innovation Boards",
caption = "Data from BLS LAUS Estimates"
) +
theme_classic()
ggsave(plot_final,
filename = "U:/YOUR.FOLDER/unit4_unemployment.png",
width = 10, height = 6, dpi = "print")Here is our final plot after saving:
Take another look at our final plot. What changes did we make for this iteration? What changes might we still want to make? Why? How does it compare to the WORKR graph we’ve been trying to recreate? What differences are there? Why might we have kept those differences?
4 Conclusion
With the creation of our final plot, we have now worked through a micro-version of the project scoping process. Taking the BLS LAUS data, we have moved from exploring raw data all the way to creating a publication-quality visualization that could be used to interpret and share an answer to our fundamental research question. Along the way, we hope that you have gathered not only the technical skills to carry out each part of this project scoping process, but also an appreciation for the large amount of policy knowledge and data literacy that is required to carry out these steps in an informed, responsible, and meaningful manner.
Even at this stage, creating the visualization itself is only half the battle. We still haven’t built a very strong narrative around this visualization, or interpreted the results for our state of interest. We’ll begin on this process in our final class section, where we will talk about what these results look like for several states, compare and contrast what we see, and try to work together to develop a clearer story around our research focus. We’ll also provide a space to answer any remaining questions you have around this module or the project scoping process. From there, we will prepare to dive into the real, micro-level data that will underlie the second part of this course. Get ready - now the fun part begins!
5 Citation
Grolemund, G., & Wickham, H. (2017). R for Data Science. O’Reilly Media.
Foundations Module - May 2023 update - Notebook 4