Foundations Module: Unit 2
1 Introduction
Welcome to Unit 2 of the Foundations module! Remember that in the last notebook, we worked through the process of data discovery on the Local Area Unemployment Statistics (LAUS) data, which helped us familiarize ourselves with the structure of the dataset and the information it contained. In this unit, we will direct that knowledge toward our second step along the project development roadmap: creation of the analytic frame.
Our analytic frame is the base dataset that we will use as the starting place for our main analysis, or for the creation of our data product. This step of the project development roadmap is where data literacy and project scoping begin to matter a lot. The data discovery that we did in the last notebook was very general; the steps we took to understand the raw structure of the LAUS data would look about the same for a wide variety of research questions. But now, for creating our analytic frame, we need to get more specific to our research question of interest.
Remember that, for this module, our primary goal is to recreate a portion of the Kentucky Center for Statistics Workforce Overview Report for Kentucky Regions, or WORKR, dashboard. In this notebook, we will home in on the specific attributes and observations of the LAUS data that our version of the WORKR dashboard will require. As mentioned in the pre-notebook video (watch now if you haven't!), this will require both our data cleaning and our data evaluation skills. In applying these concepts, we will walk through identifying the desired structure of our analytic frame, and then work step-by-step through the data operations necessary to create this dataset.
In this notebook, we will continue to use the language of Fundamental Data Operations introduced in the videos and in class. Fundamental operations in this notebook will appear formatted like this. For more information, please review the fundamental data operations cheat sheet.
2 Setup
As you are working through this notebook, keep in mind that the process of creating an analytic frame requires both technical coding skills and content-specific data literacy skills.
If you are interested in learning the technical details of joins and dataset creation, you are encouraged to work on understanding and reproducing the code from this workbook in a separate .sql file.
On the other hand, if you are looking to build your data literacy capabilities, we recommend you focus on the narrative of this workbook, thinking about how the overall process of this module informs our final dataset and its relation to our overarching product goal.
As always, these aspects of this workbook are not mutually exclusive - you are welcome to work on both!
A description of how to create a new .sql file, or SQL script, is available in the first notebook.
3 Creating our analytic frame
When you're about to begin creating your analytic frame, it is always a good idea to revisit your project scoping and think about what the end goal of your analysis looks like. Your analytic frame does not need to work for every possible research question; it should be specific to the analysis that you are trying to accomplish. By thinking about your end goal, the choices you need to make for your data cleaning and evaluation should become clear. Then, the rest is just putting those choices into practice.
With that in mind, let's think about the end goal of our analysis in the Foundations module: recreating a portion of the KYStats WORKR dashboard for your state of interest. There's a lot going on in this dashboard, so let's narrow it even further and think about the data that we would need to create an output similar to the following graph:
Let's think about what this graph shows. From the title, we see that this graph is intended to show "Unemployment Rates (not seasonally adjusted) Over Time". On the horizontal axis, we see dates ranging from March 2022 to March 2023 - this seems to be the "over time" component of the graph. The vertical axis is labelled "Unemployment Rates", which from the title we can assume are not seasonally adjusted. Then, in the key, we can see that each of the colored lines corresponds to a different region of Kentucky, along with one aggregated line for the full state.
That tells us what information is in the graph, but let's take it down a level and think about what this tells us about the data that underlies the graph:
- Because the graph displays unemployment rates by month, we want our analytic frame to contain data by month
- Because the graph displays non-seasonally adjusted unemployment rates, our analytic frame should contain the same
- Because the graph displays the data aggregated to subregions (here workforce planning regions), we need our analytic frame to be linked to subregions
- Because the graph displays the aggregated, state-level line, we also want our analytic frame to include the data aggregated at the state level
- Because the graph displays data only for a single state, we also want to limit our analytic frame to a single state of interest (for this version of the notebook, Kentucky)
3.1 Evaluating our data
From all of these points above, we start to get some idea of what the underlying data for our version of the WORKR dashboard should look like. Now that we have this target outlined, we want to think about how well our LAUS data, as it currently stands, matches these targets - this is the process of data evaluation.
Two of the points above we get almost for free:
- We want our analytic frame to contain data by month: Recall that the LAUS data is already available by month - we just need to remember to filter out the extra "Annual Average" category that we observed in the previous notebook.
- Our analytic frame should contain non-seasonally adjusted unemployment rates: We know that the LAUS data does have these rates - they are marked by seasonal_code = 'U'.
Because we know that our base data contains the information we need, the actual data cleaning required to accomplish these goals just means using our filtering operation to select the monthly, non-adjusted rates from our overall dataset and drop any extraneous observations outside these categories.
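As a preview, those two filters might look something like the sketch below; later in this notebook, we will fold them into our final query.
SELECT *
FROM ds_public_1.laus
WHERE period_name != 'Annual Average' -- drop the annual average rows, keeping monthly data
  AND seasonal_code = 'U' -- keep only the non-seasonally adjusted rates
LIMIT 10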
Unfortunately, accomplishing the remaining three points from above is a little more complicated. From our data discovery, while we do know that the LAUS data presents unemployment rates at the statewide level, we also know that it does not contain any information about sub-regions of our state - other than statewide data, unemployment rates are only available at the county level. Indeed, we do not even have a separate variable for state, meaning that it would be difficult to limit our county-level data to a specific state in the first place.
3.1.1 A solution - geographic crosswalk
To remedy this lack of subregion information in the LAUS data, we are going to need a geographic crosswalk to link our county-level information to larger subregions within the state. Luckily, we have one available on Redshift already, derived from the U.S. Census LEHD Origin-Destination Employment Statistics (LODES) data. “LEHD” is short for Longitudinal Employer-Household Dynamics, which contains public information about the workplace and residence locations of workers in most U.S. states, as well as a useful crosswalk to reference across many geographical areas.
SELECT *
FROM tr_foundations_module.xwalk
LIMIT 10
Looking at the above, we can see that the tr_foundations_module schema contains the table xwalk, which links counties (represented by cty and ctyname) to both states (st and stname) and state workforce innovation boards (stwib and stwibname).
There are a few caveats we need to acknowledge before we begin working with this crosswalk:
First, you might note that this table links counties to state Workforce Innovation Boards (WIBs) - also known as Workforce Development Areas (WDAs) or Workforce Development Boards (WDBs) - whereas the original dashboard is aggregated to Workforce Planning Regions (WPRs), a higher level of aggregation. Unfortunately, due to differences in state implementation of the Workforce Innovation and Opportunity Act (WIOA), there is no nationally consistent level of geography corresponding to the Workforce Planning Regions in Kentucky. What is available in the LODES crosswalk is Workforce Innovation Boards, which is what we'll use here.
Second, if you look at the data dictionary for the raw LODES data, you will see that the full LODES crosswalk links from census blocks to both counties and Workforce Innovation Boards. For pedagogical efficiency, we have pre-aggregated this crosswalk for this notebook so that it links directly from counties to WIBs, as you have already seen. There are, however, some counties in the full LODES crosswalk that are actually served by multiple WIBs, splitting up the census blocks within that county. For example, individuals in Pulaski County, Arkansas (where Little Rock is located) are served by either the Central Arkansas WIB or the Little Rock WIB, depending on which part of the county they live in.
For our purposes here, we did not want to deal with a single county mapping to multiple WIBs, so in our reduced crosswalk we linked each county only to the geographically largest WIB that served any census blocks within that county. So for the Arkansas example, we would ignore the Little Rock WIB, which serves only the city of Little Rock, and instead treat the entirety of Pulaski County as served by the Central Arkansas WIB, which also covers the entire central region of the state.
How might this limiting decision affect our outputs? What alternatives might there be? What issue might this decision cause?
To better understand why we made this decision, write a SQL query against the base LODES crosswalk ds_public_1.all_states_lodes_xwalk to determine how many counties in your state map to multiple WIBs.
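One possible approach is sketched below. It assumes the full crosswalk uses the same column names as our reduced version (cty, ctyname, stname, and stwib); substitute your own state of interest.
SELECT cty, ctyname, COUNT(DISTINCT stwib) AS n_wibs -- count the distinct WIBs serving each county
FROM ds_public_1.all_states_lodes_xwalk
WHERE stname = 'Kentucky' -- substitute your state of interest
GROUP BY cty, ctyname
HAVING COUNT(DISTINCT stwib) > 1 -- keep only counties served by multiple WIBs
ORDER BY n_wibs DESC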
3.2 Expanding our data: joining on the crosswalk
With those caveats aside, now that we have access to a crosswalk providing the geographic linkages we need, how can we link this data to our LAUS table?
For this task, we need to use a new fundamental data operation: joining. Joining allows us to combine two or more datasets that contain information about the same units of observation.
The first step in conducting a join is identifying the information we need from each dataset. We just did this, starting with the LAUS table containing the unemployment information for each county, and then taking into account the geographical crosswalk. We will need information from both of these tables to create our analytic frame.
The next step in our join is pinpointing the variable(s) that provide a common link between the two datasets - in other words, the variables that we are joining on. In carrying out a join, we have to specify how the datasets should be combined by developing rule(s) for how the observations from the first table should be associated with specific observations from the second table. The usual way to do this is by saying, "if this variable, or set of variables, in table one is equal to this variable, or set of variables, in table two, then we should combine the information for that observation across the two tables".
In this case, our common variable across the two datasets is the county name, as we want to link observations across the two datasets that have the same value of county name. Let's do this now:
SELECT l.*, x.stname, x.stwibname
FROM ds_public_1.laus l
LEFT JOIN tr_foundations_module.xwalk x ON l.area_text = x.ctyname
LIMIT 5
The one wrinkle with the query above is that we use a specific type of join, called a left join. As described in the pre-lecture video for this unit, a left join keeps all observations which appear in our "left" dataset - in this case, our LAUS data. This means we are not losing any information from our LAUS dataset; instead, we are just adding on WIB information for those counties which do appear in our crosswalk.
We can verify this by calling upon our aggregation data operation to count the number of rows before and after the join. First before:
SELECT COUNT(*)
FROM ds_public_1.laus l
Then after:
SELECT COUNT(*)
FROM ds_public_1.laus l
LEFT JOIN tr_foundations_module.xwalk x ON l.area_text = x.ctyname
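We can also count directly how many LAUS rows failed to find a match in the crosswalk - a quick sketch, relying on the fact that unmatched rows from a left join carry NULL values in the crosswalk columns:
SELECT COUNT(*) -- number of LAUS rows with no matching county in the crosswalk
FROM ds_public_1.laus l
LEFT JOIN tr_foundations_module.xwalk x ON l.area_text = x.ctyname
WHERE x.ctyname IS NULL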
It's usually a good idea to run these kinds of checks after any join, because it is easy for observations to get dropped, either by accident or on purpose. For example, if we had instead used an inner join, all of the statewide-level unemployment rates would have been dropped, because the statewide area names don't match any county names in our crosswalk. Instead, because we used a left join, the statewide observations remain in our output dataset, with missing values for the st and stwib variables. Observe:
SELECT area_text, st, stname, stwib, stwibname
FROM ds_public_1.laus l
LEFT JOIN tr_foundations_module.xwalk x ON l.area_text = x.ctyname
WHERE areatype_text = 'Statewide'
LIMIT 5
3.3 Limiting decisions: getting what we want
Now that we've got our join working, we have a ton of information available to us. Let's recall our evaluation of the necessary data elements to create the WORKR line graph. We now have the unemployment rates linked to state workforce innovation boards, which should allow us to (eventually) roll things up to subregions. We have also kept our statewide observations in the dataset. So many of our geographic concerns have been managed!
But, with all this information, we actually have a little too much. If you'll recall, we wanted to limit our observations to just those from Kentucky. We also had several other filters that we wanted to put in place. These are things we still need to tackle!
Thus, we will turn now to limiting decisions on our dataset. Remember that an important part of creating our analytic frame is ensuring our dataset has everything we need for our main analysis and nothing more. Extraneous variables or observations can make later analysis much more complicated and challenging, and may result in less efficient procedures.
Recall that the WORKR dashboard:
- Only covers a single state (Kentucky)
- Only has data available at the monthly level
- Only has data going back a little over a year
- Only has non-adjusted rates
But, as we saw in the previous notebook, our data covers every state, contains data both at the monthly level and as annual averages, goes back to the 1970s (for state-level data), and contains adjusted and non-adjusted rates (for state-level data). We can verify that this is still true even after our join like so, using the DISTINCT version of our filtering fundamental data operation:
SELECT DISTINCT l.areatype_text, x.stname, l.year, l.seasonal_code
FROM ds_public_1.laus l
LEFT JOIN tr_foundations_module.xwalk x ON l.area_text = x.ctyname
Because adjusted rates appear only in the statewide observations, our county-level rows are already non-adjusted, but there are several more limiting decisions we need to make, which we can carry out using our filtering fundamental data operations. Specifically, we need to:
- Filter for only county and state observations from Kentucky
- Filter out “Annual Average” observations
- Filter for only non-adjusted rates
- Filter for only observations since January 2022 (we are adding a few months for completeness)
Let’s do that now:
SELECT l.*, x.stname, x.stwibname
FROM ds_public_1.laus l
LEFT JOIN tr_foundations_module.xwalk x ON l.area_text = x.ctyname
WHERE
l.year >= 2022 AND
period_name != 'Annual Average' AND
seasonal_code != 'S' AND
(x.stname = 'Kentucky' OR l.area_text = 'Kentucky')
This is really starting to look like something we can use for our analysis! The final step is selecting just the variables we want from the table above; this should give us the analytic frame we can take forward and use for the remainder of this module.
SELECT l.area_text,
x.stwibname,
l.year,
l.period_name,
l.unemployment_rate,
l.unemployment,
l.employment,
l.labor_force
FROM ds_public_1.laus l
LEFT JOIN tr_foundations_module.xwalk x ON l.area_text = x.ctyname
WHERE
l.year >= 2022 AND
period_name != 'Annual Average' AND
seasonal_code != 'S' AND
(x.stname = 'Kentucky' OR l.area_text = 'Kentucky')
Can you tell from the query which variables we decided to drop? Do you think they should be dropped? Why or why not?
Use a DISTINCT query to verify that we aren't creating duplicate observations by selecting only these variables.
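A sketch of one way to structure this check: count the rows returned by a DISTINCT version of the query over the columns that should uniquely identify an observation, and compare against the count without DISTINCT. If the two counts match, selecting only these variables did not create duplicates.
SELECT COUNT(*) -- compare against the same count computed without DISTINCT
FROM (
    SELECT DISTINCT l.area_text, l.year, l.period_name
    FROM ds_public_1.laus l
    LEFT JOIN tr_foundations_module.xwalk x ON l.area_text = x.ctyname
    WHERE
        l.year >= 2022 AND
        period_name != 'Annual Average' AND
        seasonal_code != 'S' AND
        (x.stname = 'Kentucky' OR l.area_text = 'Kentucky')
) sub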
4 Conclusion
After working through this notebook, you should have a better understanding of the process of data evaluation: how we decide on the data we need for our descriptive analysis, and how we go about assembling that data into a single, streamlined analytic frame.
This process is scalable across different data sources. In the next module, it is likely that you will have access to a large mass of dense administrative data, and it can be overwhelming to start joining and filtering this data without a plan in mind. Always try to imagine the dataset that you ideally would have to answer your research question.
Of course, this process is iterative, and the structure of that dataset might change, but having some idea of what you are working toward will ensure that you construct an analytic frame that can meet the needs of your project. Ideally, if your analytic frame is well planned, the actual process of descriptive analysis should be relatively straightforward by comparison!
You will see how this works in the next unit of this module. There, we will use the analytic frame that we have created here to carry out a descriptive analysis, reading this table into R and developing a cohesive analytic plan to obtain an appropriate output for recreating a portion of the WORKR dashboard.
How else might we have structured our analytic frame for our recreation of the WORKR dashboard? Would it include more information or less? What benefits or drawbacks might there be? For example: imagine if our analytic frame was “wide” instead of “long” - that is, if it had a separate column for each quarter of our output, instead of a separate row? How might that help or hurt our ability to carry out our analysis in the next notebook?
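To make the "wide" idea concrete, here is a minimal sketch of how such a frame might be built with conditional aggregation; the specific period_name values are assumed for illustration and may not match the real data exactly.
SELECT l.area_text,
    l.year,
    -- one column per period instead of one row per period
    MAX(CASE WHEN l.period_name = 'January' THEN l.unemployment_rate END) AS ur_jan,
    MAX(CASE WHEN l.period_name = 'February' THEN l.unemployment_rate END) AS ur_feb
    -- ...and so on for each remaining period
FROM ds_public_1.laus l
GROUP BY l.area_text, l.year
Notice that in the wide version, every new period requires adding a column, whereas the long version simply gains rows - one reason long frames are usually easier to extend and to plot over time.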