Exploratory Data Analysis (EDA)

Module 2: Notebook 1

Author

Corey Sparks, Joshua Edelmann, Benjamin Feder

2 Introduction

Welcome to the first workbook for Module 2 of this course, covering Exploratory Data Analysis, or EDA.

In the Foundations Module, we introduced EDA within the larger process of data discovery. EDA helps you ensure that the provided data suits your desired analysis and that you have the necessary understanding to make informed analytic decisions as you work through your project. In this workbook, we’re going to be discovering the primary employment outcome-focused datasets we will use in this course: the Arkansas Unemployment Insurance (UI) wage records and the Arkansas Quarterly Census of Employment and Wages (QCEW) data. Both datasets are available in standard formats across states, so any code introduced in this notebook should be portable to your own state data. As you know, there are many other data sources available for you as part of this training - we will walk through the primary ones in optional supplemental sections. For each dataset, we will work to answer the following key data discovery questions:

  • What is the structure of the dataset? What is the layout of the data? What variables are available? What are their types? How do we interpret the meaning of a single row?

  • What is the coverage of the dataset? What years, geographic regions, and other subgroups are covered by the data?

  • What is the distribution of the variables in the dataset? Are there missing observations, outliers, and other aspects we need to adjust for later in analysis? Are there any interesting patterns worth further investigation?

The purpose of these workbooks

You will now have the opportunity to apply the skills covered in both modules thus far to restricted use Arkansas data. With your team, you will carry out a detailed analysis of this data and prepare a final project showcasing your results.

These workbooks are here to help you along in this project development by showcasing how to apply the techniques discussed in class to the Arkansas microdata. Part of each workbook will be technical - providing basic code snippets your team can modify to begin parsing through the data. But, as always, there is also an applied data literacy component to these workbooks, which should help you develop a better understanding of the structure and use of the underlying data even if you never write a line of code.

The timeline for completing these workbooks will be provided on the training website and communicated to you in class.

3 Technical setup

This workbook will leverage both SQL and R coding concepts, so we need to set up our environment to connect to the proper database and run R code only accessible in packages external to the basic R environment. Typically, throughout these workbooks, we use SQL for the majority of data exploration and creation of the analytic frame, and then read that analytic frame into R for the descriptive analysis and visualization.

Note: If you would like to view the material to establish your own environment for running the code displayed in this notebook, you can expand the following “Environment Setup” section by clicking on its heading.

Installing packages

Because we have left the world of the Foundations Module and entered a new workspace, we need to re-install the Coleridge Initiative package (which will then help us install several of the other packages we will use in these notebooks).

To install the package, type (or copy) the following at the Console in RStudio:

install.packages('P:/tr-enrollment-to-employment/packages/ColeridgeInitiative_0.1.0.tar.gz', type = 'source', repos = NULL)

The package will install, and when it is finished, you should see the console prompt returned to you.

Next, load the package by running

library(ColeridgeInitiative)

Next, to install the basic packages needed for the course, run the following command in the console:

install_new()

This will take a minute or so, but when it is done, you should see the prompt returned to you.

Now you can use the packages as you would normally use them. For example, to load the dplyr package, simply type (at the console, or in a script)

library(dplyr)

and the functions in that package will be available to you.

Load libraries

Just like we did in the Foundations Module, in running SQL and R code together through R, we need to load the RJDBC package. In addition, we will load the tidyverse suite of packages, as they will help us implement some of our fundamental data operations while maintaining a consistent syntax. Lastly, to enable an option discussed in the coming sections, we will load a new package for working with databases in R, dbplyr.

Every time you create a new R file, you should copy and run the following code snippet. You can easily copy the entire snippet by hovering over the cell and clicking the clipboard icon on the top-right edge.

options(scipen = 999) # avoid scientific notation
library(RJDBC)
library(tidyverse)
library(dbplyr)
library(zoo) # time/date manipulations

Establish Database Connection

To load data from the Redshift server into R, we need to first set up a connection to the database. The following set of commands accomplish this:

dbusr=Sys.getenv("DBUSER")
dbpswd=Sys.getenv("DBPASSWD")

url <- "jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;loginToRp=urn:amazon:webservices:govcloud;ssl=true;AutoCreate=true;idp_host=adfs.adrf.net;idp_port=443;ssl_insecure=true;plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider"

driver <- JDBC(
  "com.amazon.redshift.jdbc42.Driver",
  classPath = "C:\\drivers\\redshift_withsdk\\redshift-jdbc42-2.1.0.12\\redshift-jdbc42-2.1.0.12.jar",
  identifier.quote="`"
)

con <- dbConnect(driver, url, dbusr, dbpswd)

As a reminder, don’t worry too much about the details of this connection - you can simply copy and paste this code each time you want to connect your R script to the Redshift database.

New .Renviron

For this code to work, you need to create a new .Renviron file in your user folder (e.g. U:\\John.Doe.P00002) that contains the following:

DBUSER='adrf\John.Doe.P00002'
DBPASSWD='xxxxxxxxxxxx'

where John.Doe.P00002 is replaced with your username and xxxxxxxxxxxx is replaced with your password (both still in quotes!). The setup of this code is nearly identical to that required in the Foundations Module workspace - however, DBUSER should now end with .T00113 instead of .T00112.
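After saving the .Renviron file and restarting your R session, you can run a quick check to confirm the values were picked up. This is a minimal sketch that avoids printing your password:

Sys.getenv("DBUSER") # should show your adrf username
nchar(Sys.getenv("DBPASSWD")) > 0 # should return TRUE without revealing the password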

A detailed video from the Foundations Module, “Introduction to RStudio,” demonstrating how to create an .Renviron file is available on the Resources page of the class website in the subsection “Quick Links.”

Note

Much of the remaining code in this notebook will not work unless you first complete the steps above. Please let us know if you have any issues installing the necessary packages.

4 Arkansas UI wage data: ds_ar_dws.ui_wages_lehd

The Arkansas UI wage data, stored on Redshift as ds_ar_dws.ui_wages_lehd, provides information on all UI-covered employment in Arkansas. This file is supplied to the U.S. Census Bureau for use in processing the Quarterly Workforce Indicators and the Longitudinal Employer-Household Dynamics (LEHD) program, and represents approximately 95% of private non-farm wage and salary employment for establishments located in the state.

New package dbplyr

In the examples that follow, a new package for interacting with databases without writing SQL code is also shown. The dbplyr package interfaces with a database using standard dplyr and tidyverse workflows, and complementary code is shown for all SQL query examples. Just like the SQL code, this should be run and executed from R using the connection we established in the collapsed “Environment Setup” section above.
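If you are ever curious how a dbplyr pipeline relates to the SQL option, you can append show_query() to the end of the pipeline to print the SQL that dbplyr generates without pulling any data. For example, the following sketch previews the translation for a simple query:

con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "ui_wages_lehd")) %>%
  head(n = 5) %>%
  show_query() # prints the generated SQL rather than returning rows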

Structure

Let’s begin answering our key data discovery questions. First - what is the structure of the UI wage data? There are two dimensions to this question:

  • What variables are available in the UI Wage data?
  • What does each row in the UI Wage data represent?

To start, just like in the Foundations Module, let’s glance at a few rows of the UI Wage data:

Note: Because the output is easier to parse through on an HTML document, we have the SQL option written so that you can directly copy and paste it into DBeaver. If you would like to run the SQL in RStudio, rather than use the dbplyr code, you can do so with the workflow we introduced in the Foundations Module: first run qry <- "INSERT SQL CODE" and then dbGetQuery(con, qry).

SELECT * 
FROM ds_ar_dws.ui_wages_lehd 
LIMIT 5
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "ui_wages_lehd")) %>%
  head(n=5)

We can see that there are 16 columns available in the UI Wage data. The UI Wage data dictionary (Right-click on link to open) on the P: drive contains detailed descriptions of each variable.

Just like with the LAUS data, the data dictionary descriptions, while helpful, do not provide the entire context for these variables. We also need to have a clear definition of what each observation - or row - in the UI wage records represents.

To know what we are aiming for, we can find the total number of rows in the UI Wage data:

SELECT COUNT(*)
FROM ds_ar_dws.ui_wages_lehd
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "ui_wages_lehd")) %>%
  tally()

Let’s think about the three categories of variables that might appear in our dataset to uniquely define our rows.

Unit of observation

The first category to consider is variables that describe the unit of observation. Recall that the unit of observation refers to the type of entity or object about which data is collected. This could be a person, an organization, a state, or any other entity that is the focus of data collection.

In the case of the UI Wage data, our unit of observation is the person - the individual receiving wages in the dataset. Besides leveraging background information on the nature of the data, we can identify this because employee_ssn is a person-level variable in our dataset representing the smallest single unit identified in the data.

Importantly, recall that the unit of observation alone does not necessarily define a row for our dataset. We can test this by counting the number of unique hashed social security numbers in our data relative to the total number of rows:

SELECT COUNT(*), COUNT(DISTINCT(employee_ssn))
FROM ds_ar_dws.ui_wages_lehd
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "ui_wages_lehd")) %>%
  summarise(
    n_rows = n(),
    n_people =  n_distinct(employee_ssn)
  )

We see that there are far fewer unique employee_ssn values than total rows in the table. This indicates that employee_ssn alone does not define a row of our data - some individuals must appear in the data more than once. Why might there be multiple observations for the same individual? One of the common reasons is what we will discuss next: time.
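As a quick sanity check on this reasoning, the following dbplyr sketch counts how many individuals appear in more than one row; the same logic could be written in SQL with a GROUP BY and HAVING clause:

con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "ui_wages_lehd")) %>%
  count(employee_ssn) %>% # number of rows per hashed SSN
  filter(n > 1) %>% # keep individuals appearing more than once
  tally()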

Period of observation

One of the most common reasons why this might be the case is that data about each unit of observation is observed at multiple points in time, with information stored in separate rows corresponding to each distinct point in time. Recall that we can refer to the variables describing the period of observation as the time dimension(s). They represent how often the data is captured or updated. However, note that not every dataset will have variables representing the period of observation.

In the UI Wage data, the period of observation is represented by the variables reporting_period_year and reporting_period_quarter, which indicate the calendar year and quarter associated with the employment observation, respectively. Note that in other instances, date variables often encode information about the period of observation.

Let’s look at the number of unique combinations of employee_ssn, reporting_period_year, and reporting_period_quarter combined:

WITH dis as (
  SELECT DISTINCT reporting_period_year, reporting_period_quarter, employee_ssn
  FROM ds_ar_dws.ui_wages_lehd
)
SELECT COUNT(*)
FROM dis
# tally finds the number of rows
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "ui_wages_lehd")) %>%
  distinct(reporting_period_year, reporting_period_quarter, employee_ssn) %>% 
  tally()

We still see a difference here between the number of unique person-quarters and the total number of rows from our first query. Perhaps there are some other variables that help constitute a row in the data, or there may be duplicate observations.

Attributes

What about the other 13 variables described in the data dictionary?

These remaining variables represent the information that the wage records contains about our person-quarter observations. There are several different types of attributes that we can see in our data dictionary:

  • Individual attributes: nhr_ui_wage_id, employee_first_name (hashed), employee_middle_name (hashed), and employee_last_name (hashed) describe the individual associated with the employment observation
  • Employer attributes: ui_account_number, reporting_unit_number, federal_ein, and state_ein describe the employer associated with the employment observation
  • Employment attributes: employee_wage_amount, hours, and weeks describe information about the employment. hours and weeks are both not populated.
  • Miscellaneous: reference_state, reporting_period (concatenated version of reporting_period_year and reporting_period_quarter)

Here, with the UI wage records, since an individual could have multiple employers within our period of observation, a unique observation should also include the employer. We can verify that claim with the code below:

WITH dis as (
  SELECT DISTINCT reporting_period_year, reporting_period_quarter, employee_ssn, federal_ein 
  FROM ds_ar_dws.ui_wages_lehd
)
SELECT COUNT(*)
FROM dis
con %>%
  tbl(in_schema(
    schema = "ds_ar_dws",
    table = "ui_wages_lehd"
  )) %>%
  distinct(reporting_period_year, reporting_period_quarter, employee_ssn, federal_ein) %>%
  tally()

Any duplication at this granularity should be further investigated.
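For example, a hedged sketch like the one below counts how many person-quarter-employer combinations appear more than once; any non-zero result points to rows you may need to de-duplicate or aggregate before analysis:

con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "ui_wages_lehd")) %>%
  count(reporting_period_year, reporting_period_quarter, employee_ssn, federal_ein) %>%
  filter(n > 1) %>% # combinations appearing in multiple rows
  tally()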

Coverage

The next step of our exploratory data analysis is to determine the data coverage: what time span, region, and other important subgroups are captured in the data? Luckily, this is pretty straightforward for the UI Wage data: we know that the geographical area is the entire state of Arkansas, and that the group covered is those employed by UI-covered employers. We can confirm this by finding unique values of reference_state in the data:


SELECT DISTINCT(reference_state)
FROM ds_ar_dws.ui_wages_lehd
con %>%
  tbl(in_schema(
    schema = "ds_ar_dws",
    table = "ui_wages_lehd"
  )) %>%
  distinct(reference_state)

We can follow a similar approach for time periods - that being said, because we know the data is tracked over time, it might be helpful to find the number of rows corresponding to each time frame. To do so, we can group our data by quarter, this time using reporting_period, which combines reporting_period_year and reporting_period_quarter into a single string, and aggregate the number of rows.

Note: On the SQL tab, since we are planning on reading this output into R, we have used the qry <- "INSERT SQL CODE" and dbGetQuery(con, qry) workflow.

qry <- "
SELECT reporting_period, COUNT(*) as n
FROM ds_ar_dws.ui_wages_lehd 
GROUP BY reporting_period 
ORDER BY reporting_period
"

counts_by_qtr <- dbGetQuery(con, qry)

counts_by_qtr
counts_by_qtr <- con %>%
  tbl(in_schema(
    schema = "ds_ar_dws",
    table = "ui_wages_lehd"
  )) %>%
  select(reporting_period) %>%
  group_by(reporting_period) %>%
  tally() %>%
  ungroup() %>%
  arrange(reporting_period) %>%
  collect() # extract data into memory

counts_by_qtr

We can read through the table above and begin to get a feeling for the number of individuals covered in each quarter, but we can also use this same summary dataset to quickly calculate our coverage:

counts_by_qtr %>%
  summarize(
    min_qtr = min(reporting_period),
    max_qtr = max(reporting_period)
  )

From this, we see that the UI Wage data begins in Q1 2011 and ends in Q4 2022.

We can also find the quarter with the highest number of observations:

counts_by_qtr %>%
  filter(
    n == max(n)
  )

Select Variable Distribution

The final part of data discovery is investigating the distribution of key variables in our dataset, and documenting any irregularities such as missingness, outliers, invalid values, or other issues that need to be addressed in an analysis leveraging these data components. We’ll look at a few of the variables from the UI Wage data now, but this type of exploration is more of an art than a science, and the continued exploration of these variables will be an essential component of your project in this program. As a reminder, you are encouraged to use this code as inspiration in your own exploration.

reporting_period

Let’s begin by taking the same data pull we just used to examine the data coverage, and plot the data using a simple line plot:

counts_by_qtr %>%
  ggplot(aes(
    x = reporting_period,
    y = as.numeric(n),
    group = 1
  )) +
  geom_line() +
  labs(x = "Quarter", y = "Count", title = "Count of UI Wage Recipients by Quarter") +
  # rotate x-axis text to avoid quarter overlap
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Immediately, we can see a seasonal pattern with higher values in Q2 & Q3 annually, and a decline in records in 2020, which should make sense given Arkansas’s economic situation at the time.

Does anything else catch your eye? Take some time to examine this graph.

employee_wage_amount

We can also examine the patterns in the average wage received per employee during each period. Here is a query to perform this:

qry <-  "
SELECT  AVG(employee_wage_amount) as avg_wage, reporting_period
FROM ds_ar_dws.ui_wages_lehd 
GROUP BY reporting_period 
ORDER BY reporting_period
"

avg_by_qtr <- dbGetQuery(con, qry)

avg_by_qtr
avg_by_qtr <- con %>%
  tbl(
    in_schema(schema = "ds_ar_dws", table = "ui_wages_lehd")
  ) %>%
  select(employee_wage_amount, reporting_period) %>%
  group_by(reporting_period) %>%
  summarize(
    avg_wage = mean(employee_wage_amount)
  ) %>%
  ungroup() %>%
  arrange(reporting_period) %>%
  collect()

avg_by_qtr

And we can visualize the data using a simple line plot:

avg_by_qtr %>%
  ggplot(aes(
    x = reporting_period,
    y = as.numeric(avg_wage),
    group = 1
  )) +
  geom_line() +
  labs(x = "Quarter", y = "Average Wage", title = "Average of UI Wages Received by Quarter") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

We also see similar seasonality in average wages received, with Q4 and Q1 within each year showing higher average wages, but unlike the total number of recipients in the previous plot, we do not see any decline in 2020.

5 Arkansas QCEW data: ds_ar_dws.qcew

The QCEW, or Quarterly Census of Employment and Wages, is a dataset containing information on UI-covered employers in Arkansas, and often works in tandem with the UI wage data. You can find the QCEW data dictionary on the P drive.

Structure

First, just as we did in the previous section, we will explore the basic structure of the data by examining the first few rows of the table:

SELECT *
FROM ds_ar_dws.qcew
LIMIT 5
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "qcew")) %>% 
  head(5)

We see there are 113 columns in the table. Consult the data dictionary for descriptions of these columns.

Similar to the one for the UI wage records, the QCEW data dictionary does not specifically define a row. As a reference point, we can find the total number of rows in the data:

SELECT COUNT(*)
FROM ds_ar_dws.qcew
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "qcew")) %>%
  tally()

Unit of observation

QCEW data is reported by employer, so the unit of observation is the employer. The employer, though, is referenced by a handful of variables, including the federal Employer Identification Number (EIN), the UI account number (or state EIN), and the reporting unit number, and may be defined differently depending on the situation. Here, we will define an employer by their federal EIN (ein in the data), as this corresponds to the employer paying into the UI trust fund and also allows for cross-state linkages. That being said, multi-unit employers may appear as separate rows with the same federal EIN but different reporting unit numbers.

Let’s find the number of unique employers in the data and compare that to the total number of rows:

SELECT COUNT(*), COUNT(DISTINCT(ein))
FROM ds_ar_dws.qcew
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "qcew")) %>%
  summarise(
    n_rows = n(),
    n_employers = n_distinct(ein)
  )

We can go a step further and include the reporting unit number to account for multi-unit establishments:

WITH dis AS (
  SELECT DISTINCT ein, reporting_unit_number
  FROM ds_ar_dws.qcew
)
SELECT COUNT(*)
FROM dis
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "qcew")) %>%
  distinct(ein, reporting_unit_number) %>%
  tally()

Beyond this, based on our exploration of the UI wage records, it makes sense that a row is not solely defined by an employer or an employer unit - recall that this information is tracked over time.

Period of observation

Whereas the period of observation in the UI Wage data is represented by reporting_period_year and reporting_period_quarter, the QCEW contains reference_year and reference_quarter variables.

We can find the number of unique combinations of ein, reporting_unit_number, reference_year, and reference_quarter combined:

WITH dis AS (
  SELECT DISTINCT ein, reporting_unit_number, reference_year, reference_quarter
  FROM ds_ar_dws.qcew
)
SELECT COUNT(*)
FROM dis
con %>%
  tbl(in_schema(
    schema = "ds_ar_dws",
    table = "qcew"
  )) %>%
  distinct(ein, reporting_unit_number, reference_year, reference_quarter) %>%
  tally()

As you can see, we don’t have complete de-duplication at this level. Upon further investigation, you may find some federal EINs that correspond to multiple state EINs, or vice versa, or some other examples of potential duplication. That is all to say that if you plan to use the QCEW data, watch out for any unexpected duplication!
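As a starting point for that investigation, the sketch below counts employer-unit-quarter combinations that appear in more than one row; you can adapt it to compare federal and state EINs as well:

con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "qcew")) %>%
  count(ein, reporting_unit_number, reference_year, reference_quarter) %>%
  filter(n > 1) %>% # duplicated employer-unit-quarter combinations
  tally()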

Attributes

As we saw earlier, the QCEW has a large number of fields, and you should consult the data dictionary for the definitions of these fields. To see the column titles, we can use the queries below.

SHOW COLUMNS FROM TABLE projects.ds_ar_dws.qcew
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "qcew")) %>%
  names()

Between looking at the list of column names above and the data dictionary, you can see that these other variables contain a wealth of employer-related information, such as industry, location, and employment statistics.

Coverage

We will proceed by assessing the coverage of the QCEW data, first to confirm that all employers are located in Arkansas, which is tracked with the state_fips variable:


SELECT DISTINCT(state_fips)
FROM ds_ar_dws.qcew
con %>%
  tbl(in_schema(
    schema = "ds_ar_dws",
    table = "qcew"
  )) %>%
  distinct(state_fips)

You may have noticed some other variables in the QCEW data that characterize an employer’s location, beyond the state. In fact, one of those variables is county_code, which as the name implies, contains the county of the corresponding employer. We can see if we have data on employers in each county in Arkansas by counting the number of unique counties:


SELECT COUNT(DISTINCT(county_code))
FROM ds_ar_dws.qcew
con %>%
  tbl(in_schema(
    schema = "ds_ar_dws",
    table = "qcew"
  )) %>%
  summarise(
    n_counties = n_distinct(county_code)
  )

For reference, there are 75 counties in Arkansas, and none of the county codes have changed at least since the 1970s. But there seem to be more county codes here, so we can examine these specific codes:

SELECT DISTINCT(county_code)
FROM ds_ar_dws.qcew
GROUP BY county_code
ORDER BY county_code
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "qcew")) %>%
  distinct(county_code) %>% 
  arrange(county_code)

The county_code values of 000 and 900-999 are not real county FIPS codes, and employers with these codes need to be further investigated if you are using employer location information in your analysis.
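To gauge how much of the data is affected, a quick sketch (assuming county_code is stored as a three-character text code, as the leading zeros suggest) counts the rows carrying these non-county values:

con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "qcew")) %>%
  # assumption: county_code is a 3-character text column (e.g. "000", "999")
  filter(county_code == "000" | county_code >= "900") %>%
  tally()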

We can also look at coverage from a time perspective, counting the number of observations in the QCEW file by reference_year and reference_quarter:

qry <- "
SELECT count(*) as n,
q.reference_year, q.reference_quarter
FROM ds_ar_dws.qcew q
GROUP BY q.reference_year, q.reference_quarter
ORDER BY q.reference_year, q.reference_quarter
"

cnt_by_quarter <- dbGetQuery(con, qry)

cnt_by_quarter
cnt_by_quarter <- con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "qcew")) %>%
  group_by(reference_year, reference_quarter) %>% 
  summarise(n=n()) %>% 
  ungroup() %>% 
  arrange(reference_year, reference_quarter) %>%
  collect()

cnt_by_quarter
cnt_by_quarter %>% 
  # combine year and quarter into one variable separated by "-"
  mutate(period = paste(reference_year, reference_quarter, sep = "-")) %>% 
  ggplot(aes(x = period, y = as.numeric(n), group = 1))+
  geom_line()+
  theme(axis.text.x = element_text(angle =45, hjust=1))

In this plot, we see continuous growth into 2020, when we see a large spike in the number of observations - this might indicate an underlying data quality issue in 2020.

Select Variable Distribution

naics_code

One of the commonly used variables from the QCEW table is naics_code, which contains the North American Industry Classification System (NAICS) codes that correspond to specific industries. The codes traditionally exist at the six-digit level, and can be truncated into broader codes (ex. 624120 converted to 62).

At the six-digit level, let’s see how many unique NAICS codes exist in the QCEW table:

SELECT COUNT(DISTINCT(naics_code))
FROM ds_ar_dws.qcew
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "qcew")) %>%
  summarize(num_codes = n_distinct(naics_code))

That’s a lot of codes! In contrast, let’s see if working with the 2-digit codes might be a bit more feasible:

SELECT COUNT(DISTINCT(SUBSTRING(naics_code, 1, 2)))
FROM ds_ar_dws.qcew
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "qcew")) %>%
  distinct(naics_code) %>%
  collect() %>% #substring doesn't work when translating to postgresql so need to pull into R first
  summarize(n_distinct(substring(naics_code, 1, 2)))

This is a bit more workable. Let’s find the most common two-digit NAICS codes across all rows:

SELECT SUBSTRING(naics_code, 1, 2), COUNT(*)
FROM ds_ar_dws.qcew
GROUP BY SUBSTRING(naics_code, 1, 2)
ORDER BY 2 DESC --2 references the second column in the SELECT statement
LIMIT 5
con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "qcew")) %>%
  group_by(naics_code) %>%
  summarize(n = n()) %>%
  ungroup() %>%
  collect() %>%
  #substring doesn't work when translating to postgresql so need to pull into R first
  # also convert n to numeric since it pulls in as a character variable
  mutate(
    two_digit = substring(naics_code, 1, 2),
    n = as.numeric(n)
  ) %>%
  group_by(two_digit) %>%
  summarize(tot = sum(n)) %>%
  arrange(desc(tot)) %>%
  head(5)

There are a handful of NAICS-related lookup tables available for you in the ds_ar_dws schema. At the two-digit level, the most relevant table will be naics_sector, but just keep in mind that the current table has ranges of codes in some places (ex. 44-45), which make it a bit more complicated to join. For reference, the two-digit NAICS code of 62 corresponds to Health Care and Social Assistance.
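If you plan to use the lookup table, it may help to preview it first, just as we previewed the main tables; its exact column names may differ from what you expect, so treat any join code you write as something to verify against this output:

con %>% 
  tbl(in_schema(schema = "ds_ar_dws",
                table = "naics_sector")) %>%
  head(n = 5)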

6 Linking UI Wages and QCEW

By joining the QCEW data to the UI wage records, we can look at wage patterns while taking into account the characteristics of the employers, such as their industry. In this section, we will do just that: merge the ds_ar_dws.ui_wages_lehd table with the QCEW table ds_ar_dws.qcew.

To join the two tables, we will use the federal_ein in the UI wages and the ein in the QCEW, which both correspond to the federal EIN. We also need to join on the reference period, or the year and quarter combination to ensure we are getting the wages for the correct period. Before doing so, though, to ensure we are not joining multiple NAICS codes per employer, we will master the QCEW table to find the most common industry code for each EIN in a given quarter (and year). This approach is not perfect, but does still join for cases where there are missing reporting unit number values in the UI wage records but not in the QCEW.

And remember that, as we saw earlier, there may still be some duplication in the UI wage records unaccounted for in this approach, which could be mitigated through additional mastering.

We can use the first two digits of the NAICS codes to obtain the industrial sector for each business, then average the wages across these sectors for each quarter in 2019.

Note: Due to the complexity of the join, we have not provided dbplyr code in this example.

qry <- "
with qcew_rank as (
    select ein, reference_year, reference_quarter, substring(naics_code, 1, 2) as naics_sub, 
        ROW_NUMBER ()  OVER (PARTITION BY  ein, reference_year, reference_quarter order by count(*) desc) as rank 
    from ds_ar_dws.qcew 
    group by ein, reference_year, reference_quarter, substring(naics_code, 1, 2)
),
qcew_mastered as (
    select * 
    from qcew_rank 
    where rank = 1
)
select avg(uwl.employee_wage_amount) as avg_wage,
 count(distinct uwl.employee_ssn) as n_employee,
 count(distinct uwl.federal_ein) as n_employers,
 q.reference_year,
 q.reference_quarter,
 naics_sub
from ds_ar_dws.ui_wages_lehd uwl
left join qcew_mastered q
  on (uwl.federal_ein=q.ein 
    and uwl.reporting_period_year = q.reference_year 
    and uwl.reporting_period_quarter = q.reference_quarter)
where q.reference_year = '2019' and naics_sub !='00'
group by naics_sub, q.reference_year, q.reference_quarter 
order by naics_sub, q.reference_year, q.reference_quarter
"
avg_by_naics <- dbGetQuery(con, qry)

head(avg_by_naics)

For the sake of this example, we will identify the specific sectors by two-digit NAICS codes:

  • Agriculture, forestry, fishing and hunting: 11
  • Retail Trade: 44-45
  • Professional, scientific and technical: 54
  • Healthcare and social assistance: 62

avg_by_naics %>%
  filter(naics_sub %in% c("11", "62", "44", "45", "54")) %>%
  mutate(sector = case_when(
    naics_sub == "11" ~ "Agriculture",
    naics_sub == "44" ~ "Retail - 44",
    naics_sub == "45" ~ "Retail - 45",
    naics_sub == "54" ~ "Technical & Professional",
    naics_sub == "62" ~ "Healthcare"
  ), 
  reporting_period = paste(reference_year, reference_quarter, sep = '-')) %>%
  ggplot(aes(
    x = reporting_period,
    y = as.numeric(avg_wage),
    group = sector,
    color = sector
  )) +
  geom_line() +
  labs(
    x = "Quarter",
    y = "Average Quarterly Wage",
    title = "Average Wages Received by Quarter",
    subtitle = "By NAICS Sector in 2019"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

We see that some industries show marked seasonal variation in wages in 2019, while others do not. We can also see that workers in one of the Retail sectors have the lowest average wages.

Checkpoint

Use the UI Wage and QCEW data dictionaries to identify one or more further variables that might be relevant to your group’s analysis. Think through what these variables should look like, as well as what issues might arise. Working individually or with your group, examine the distribution of these variables. Document any EDA-related concerns and findings in your project template. Brainstorm as to what the cause of these issues might be, and how they could impact your analysis.

7 Next Steps: Applying the workbook to your project

The workbook provides a structure for you to start your EDA process on the data within the scope of your project. The data coverage and row definitions for two of the primary datasets in this training are now documented, allowing you to focus on evaluating the distribution of variables potentially relevant to your analysis. Below, in the Additional Data Sources section, we apply the same approach to the other key tables we have access to in this workspace. You do not need to work through each section, but you should review the content at least for the tables you plan to use in your project. You can expand a specific section by clicking on its header below.

As you evaluate variable distributions, you can start by repurposing the code in these sections. There are code snippets for distributions of numeric, time-based, and categorical variables that may be appropriate depending on the type of column you are interested in exploring.

In doing so, as recommended in the checkpoint, note your findings in your team’s project template. As your project progresses, it will be helpful to look back at these notes, especially in thinking through how to most accurately and best communicate your team’s final product to an external audience. Ultimately, the EDA process is an essential step in the project development lifecycle, as it provides helpful contextual information on the variables you may choose to use (or not use) in your analysis.

For all of these steps, remember not to take notes or discuss exact results outside the ADRF. Instead, create notes or output inside the ADRF, and store them either in your U: drive or in your team’s folder on the P: drive. When discussing results with your team, remember to speak broadly, and instead direct them to look at specific findings within the ADRF. And, as always, feel free to reach out to the Coleridge team if you have any questions as you get used to this workflow!

8 Additional Data Sources

NOTE: This section is subject to minor changes, as we are in contact with the data provider. We will let you know if we make any modifications to the code.

The Arkansas TANF data has been provided by the Arkansas Department of Human Services and includes all records of TANF receipt in the state. These records have been transformed into the Chapin Hall data model layout, separated into two files - (1) ds_ar_dhs.tanf_case, which contains information by TANF case and (2) ds_ar_dhs.tanf_member, which records the TANF data for each person on a case. The data dictionary for these two files is available on the P drive.

Structure

tanf_case

Let’s take a look at the first few rows of the tanf_case table to start:

SELECT * 
FROM ds_ar_dhs.tanf_case
LIMIT 5;
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_case")) %>%
  head(5)

We can see that there are a lot of columns, and we know that not all may be necessary in defining a row. Based on background information, we know that the unit of observation in this table should be the case_id - we can see if a case_id may be present in multiple rows by comparing the total number of observations to the number of unique cases present in the data:

SELECT COUNT(*), COUNT(DISTINCT(case_id))
FROM ds_ar_dhs.tanf_case;
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_case")) %>%
  summarise(
    n_rows = n(),
    n_cases = n_distinct(case_id)
  )

Like with the QCEW and UI wage data, which is tracked longitudinally, the tanf_case file is as well. You may have spotted the second column in the data, reporting_month, which tracks cases over time. Let’s see if the combination of case_id and reporting_month defines a row in tanf_case:

WITH dis AS (
  SELECT DISTINCT case_id, reporting_month
  FROM ds_ar_dhs.tanf_case
)
SELECT COUNT(*)
FROM dis
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_case")) %>%
  distinct(case_id, reporting_month) %>%
  tally()

Perfect! There’s no duplication in the file on the case/month level. Note that for each case/month combination, there are 70 other columns with information at this granularity.

tanf_member

Let’s repeat the same approach for the tanf_member table, first looking at the first few rows:

SELECT * 
FROM ds_ar_dhs.tanf_member
LIMIT 5;
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_member")) %>%
  head(5)

The unit of observation in this table should be the social_security_number, a hashed value pertaining to a unique individual. We can see that this table also has a reporting_month variable, and we can check if a row can be uniquely defined by combinations at the social_security_number/reporting_month level:

WITH dis AS (
  SELECT DISTINCT social_security_number, reporting_month
  FROM ds_ar_dhs.tanf_member
)
SELECT COUNT(*)
FROM dis
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_member")) %>%
  distinct(social_security_number, reporting_month) %>%
  tally()

There appears to be some duplication at this level - in fact, with a little additional code, you can see that there are some instances where an individual may appear to be on multiple cases in a given month (perhaps a data quality issue) and others where the row seems to be exactly duplicated besides one or two columns not pertaining to the case.
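As a minimal sketch of that additional code, the query below counts person/month combinations that appear in more than one row; separating true multi-case situations from near-duplicates would still require inspecting the underlying rows:

con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_member")) %>%
  count(social_security_number, reporting_month) %>%
  filter(n > 1) %>% # person/month combinations with multiple rows
  tally()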

Additionally, by comparing the total number of rows in the tanf_member and tanf_case tables, we can see that tanf_member has more records. This should intuitively make sense, as multiple individuals can be on the same case - if you want to examine individuals, rather than households, enrolled in the program, keep in mind that you should be primarily using the tanf_member file.

Coverage

In examining the QCEW and UI wage tables, we started by looking at the geographic coverage. You may have identified the variable state_fips_code existing in both tanf_case and tanf_member - we currently have concerns about its data quality and are working with the data provider to rectify these issues. In the meantime, we will not explore these columns, and will instead focus on the time-based coverage.

tanf_case

We can start by finding the minimum and maximum values of reporting_month:

SELECT MIN(reporting_month), MAX(reporting_month)
FROM ds_ar_dhs.tanf_case
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_case")) %>%
  summarize(
    min_month = min(reporting_month),
    max_month = max(reporting_month)
  )

Recall that we can separately review the time coverage of a dataset by counting the number of observations within each time frame. We can take this a step further in the tanf_case data by counting the number of cases receiving TANF (tanf = TRUE) by reporting_month:

Note: Every row in tanf_case has tanf = TRUE!

qry <- "
SELECT reporting_month, COUNT(*) as n_cases
FROM ds_ar_dhs.tanf_case 
WHERE tanf = 'TRUE'
GROUP BY reporting_month
ORDER BY reporting_month"

cases_by_month <- dbGetQuery(con, qry)

cases_by_month
cases_by_month <- con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_case")) %>%
  filter(tanf == 'TRUE') %>%
  group_by(reporting_month) %>%
  summarize(
    n_cases = n()
  ) %>%
  ungroup() %>%
  arrange(reporting_month) %>%
  collect()

cases_by_month

We can then graph this in a line chart:

cases_by_month %>%
  mutate(reformat_date = as.yearmon(as.character(reporting_month), "%Y%m")) %>%
  ggplot(aes(x = reformat_date, y = as.numeric(n_cases) )) +
  geom_line()

There do not appear to be any gaps in coverage within this time, but it is interesting to see the number of TANF cases declining in the data from 2018 onward.

tanf_member

We can apply the same approach to the tanf_member table:

SELECT MIN(reporting_month), MAX(reporting_month)
FROM ds_ar_dhs.tanf_member
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_member")) %>%
  summarize(
    min_month = min(reporting_month),
    max_month = max(reporting_month)
  )

As expected, tanf_member has the same reporting_month range as tanf_case.

qry <- "
SELECT reporting_month, COUNT(*) as n
FROM ds_ar_dhs.tanf_member
WHERE tanf = 'TRUE'
GROUP BY reporting_month
ORDER BY reporting_month"

rows_by_month <- dbGetQuery(con, qry)

rows_by_month
rows_by_month <- con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_member")) %>%
  filter(tanf == 'TRUE') %>%
  group_by(reporting_month) %>%
  summarize(
    n = n()
  ) %>%
  ungroup() %>%
  arrange(reporting_month) %>%
  collect()

rows_by_month

We can then graph this in a line chart:

rows_by_month %>%
  mutate(reformat_date = as.yearmon(as.character(reporting_month), "%Y%m")) %>%
  ggplot(aes(x = reformat_date, y = as.numeric(n) )) +
  geom_line()

We can see that the same relative pattern in the number of observations over time is present in the tanf_member data as well.

Select Variable Distribution

tanf_case - child_only_case

One variable in the tanf_case file that might be of particular interest is child_only_case, which is an indicator variable tracking if a case is child-only. Given the focus of the class on enrollment to employment, in analyzing those receiving TANF benefits, it may be useful to filter out these cases.

We can look at the distribution of observations by this variable:

SELECT child_only_case, COUNT(*) 
FROM ds_ar_dhs.tanf_case
GROUP BY child_only_case 
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_case")) %>%
  group_by(child_only_case) %>%
  summarize(n = n())

There are cases, though, that might be child-only at certain points but not at others. Keep this in mind if you plan to use the TANF data, as this variable, perhaps leveraged with some in the tanf_member table mentioned in the following section, could be useful in narrowing in on a population of interest.
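One hedged way to see how often this happens is to count the cases associated with more than one child_only_case value over the life of the data:

con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_case")) %>%
  distinct(case_id, child_only_case) %>%
  count(case_id) %>%
  filter(n > 1) %>% # cases whose child-only status changes over time
  tally()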

tanf_member - work_eligible_individual

The tanf_member table allows for a more individualized view of those on the TANF cases. Building off of the previous section, a separate variable that might be of use, this time in tanf_member is work_eligible_individual. This variable is categorical, with different numerical values corresponding to separate categories describing the individual’s work eligibility status.

Let’s look at the possible values within work_eligible_individual across all observations in the table:

SELECT work_eligible_individual, COUNT(*)  as n
FROM ds_ar_dhs.tanf_member 
GROUP BY work_eligible_individual 
ORDER BY COUNT(*) DESC
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "tanf_member")) %>%
  group_by(work_eligible_individual) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

By referencing the data dictionary (WELIG in the file), we can see that the most common value corresponds to work-eligible and either an adult or minor head of household receiving assistance. Observations with some of these other values may be worth filtering out, though.

There are additional variables tracking related characteristics, such as family_affiliation, relationship_to_head_of_household, and date_of_birth that could be useful to consider as well when honing in on an analytic frame.

NOTE: This section is subject to minor changes, as we are in contact with the data provider. We will let you know if we make any modifications to the code.

The Arkansas SNAP data has been provided by the Arkansas Department of Human Services and includes all records of SNAP receipt in the state. These records have been transformed into two files similar to the Chapin Hall data model layout for TANF data - (1) ds_ar_dhs.snap_case, which contains information by SNAP case and (2) ds_ar_dhs.snap_individual, which breaks the SNAP data down by person within a case. The data dictionary for the two files is available on the P drive, starting with the case-level data on p.13.

Structure

snap_case

Let’s take a look at the first few rows of the snap_case file to start:

SELECT * 
FROM ds_ar_dhs.snap_case
LIMIT 5;
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_case")) %>%
  head(5)

We can see that there are a lot of columns, and we know that not all may be necessary in defining a row. Based on background information, we know that the unit of observation in this table should be the case_unit_id - we can see if a case_unit_id may be present in multiple rows by comparing the total number of observations to the number of unique cases present in the data:

SELECT COUNT(*), COUNT(DISTINCT(case_unit_id))
FROM ds_ar_dhs.snap_case;
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_case")) %>%
  summarise(
    n_rows = n(),
    n_cases = n_distinct(case_unit_id)
  )

Like with the QCEW and UI wage data, which is tracked longitudinally, the snap_case file is as well. You may have spotted the second column in the data, file_month, which tracks cases over time. Let’s see if the combination of case_unit_id and file_month defines a row in snap_case:

WITH dis as (
  SELECT DISTINCT case_unit_id, file_month
  FROM ds_ar_dhs.snap_case
)
SELECT COUNT(*)
FROM dis
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_case")) %>%
  distinct(case_unit_id, file_month) %>%
  tally()

There appears to be a minor amount of duplication at this level when comparing this output to that of the total number of rows. If you plan on using the SNAP caseload file, just be aware that you might need to devise an approach to handle multiple observations at the case/month level. Note that these duplicates may have contrasting values for the other 27 variables in the table.
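A short sketch like the following can quantify that duplication by counting case/month combinations that appear in more than one row; how you resolve them (for example, keeping one row per combination or aggregating) is an analytic choice for your team:

con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_case")) %>%
  count(case_unit_id, file_month) %>%
  filter(n > 1) %>% # duplicated case/month combinations
  tally()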

snap_individual

Let’s repeat the same approach for the snap_individual table, first looking at the first few rows:

SELECT * 
FROM ds_ar_dhs.snap_individual
LIMIT 5;
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_individual")) %>%
  head(5)

The unit of observation in this table should be the individual. However, parsing through the output, you might recognize a few variables that could be referencing an individual - in particular, individual_id and ssn. Because we eventually seek to link SNAP recipients to data from other sources, all of which contain the same hashed social security number, we will opt to work with ssn. We can see that this table also has a file_month variable, as well as a case_unit_id variable, and can check if a row is uniquely defined by combinations at the ssn/case_unit_id/file_month level:

Note: There are actually more distinct combinations of individual_id/case_unit_id/file_month than ssn/case_unit_id/file_month. There may also be some cases where the file_month is outside the bounds of the cert_start_date and cert_end_date variables.

WITH dis as (
  SELECT DISTINCT ssn, case_unit_id, file_month
  FROM ds_ar_dhs.snap_individual
)
SELECT COUNT(*)
FROM dis
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_individual")) %>%
  distinct(ssn, case_unit_id, file_month) %>%
  tally()

For reference, we can compare this count to the total number of observations in snap_individual:

SELECT COUNT(*)
FROM ds_ar_dhs.snap_individual
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_individual")) %>%
  tally()

Like with snap_case, there appears to be some duplication, which is probably expected if there is known duplication at the case_unit_id/file_month granularity. This does not render the table unusable; just proceed with caution.

Additionally, by comparing the total number of rows in the snap_individual and snap_case tables, we can see that snap_individual has more records. This should intuitively make sense, as multiple individuals can be on the same case - if you want to examine individuals, rather than households, enrolled in the program, keep in mind that you should be primarily using the snap_individual file.

You might be wondering why the observations should be unique at the individual/case/month level, rather than at the individual/month level. Stay tuned, as we will be covering one of the reasons in the next subsection!

Coverage

In examining the QCEW and UI wage tables, we started by looking at the geographic coverage. We can do the same here before moving on to time coverage.

snap_case

There are a handful of variables pertaining to location in the snap_case table. We can start at the broadest level, looking at the distribution of values for address_state:

SELECT address_state, COUNT(*)
FROM ds_ar_dhs.snap_case
GROUP BY address_state
ORDER BY COUNT(*) DESC
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_case")) %>%
  group_by(address_state) %>%
  summarize(
    n = n()
  ) %>%
  arrange(desc(n))

Interesting! There appears to be a small number of cases with an address_state outside of Arkansas. We are investigating this further with SNAP data experts in Arkansas. We could proceed to look at a more granular variable such as address_zip_code, but without further information on the address_state distribution, we will not do so here.

Moving on to time coverage, we can start by finding the minimum and maximum values of file_month:

SELECT MIN(file_month), MAX(file_month)
FROM ds_ar_dhs.snap_case
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_case")) %>%
  summarize(
    min_month = min(file_month),
    max_month = max(file_month)
  )

Recall that we can separately review the time coverage of a dataset by counting the number of observations within each time frame. We can take this a step further in the snap_case data by counting the number of cases receiving SNAP (active_participant = '1') by file_month:

As you might have guessed, one of the reasons an individual can be present on multiple cases in the same month in the snap_individual table is that all cases may not be active at the same time.

qry <- "
--NEED TO USE COUNT(DISTINCT(...)) BECAUSE WE KNOW THAT A ROW IS NOT ALWAYS 
--UNIQUE AT CASE/MONTH
SELECT file_month, COUNT(DISTINCT(case_unit_id)) as n_cases
FROM ds_ar_dhs.snap_case
WHERE active_participant = '1'
GROUP BY file_month
ORDER BY file_month
"

cases_by_month_snap <- dbGetQuery(con, qry)

cases_by_month_snap
# need to use n_distinct(case_unit_id) because a row is not always unique at
# the case/month level
cases_by_month_snap <- con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_case")) %>%
  filter(active_participant == '1') %>%
  group_by(file_month) %>%
  summarize(
    n_cases = n_distinct(case_unit_id)
  ) %>%
  ungroup() %>%
  arrange(file_month) %>%
  collect()

cases_by_month_snap

We can then graph this in a line chart:

Note: In the table above, we can see that there are active cases corresponding to our earliest and latest file_month values in the data.

cases_by_month_snap %>%
  ggplot(aes(x=file_month, y =as.numeric( n_cases) )) +
  geom_line()

While there do not appear to be any gaps in coverage, we can see a massive dip in unique cases in the latter half of 2019. In the next subsection exploring coverage in snap_individual, we will see if we uncover a similar pattern at the person level. We are in contact with the data providers about this potential issue to confirm it is not a data quality one.

snap_individual

Like in snap_case, a similar variable exists in snap_individual: state. Let’s explore its possible values:

SELECT state, COUNT(*)
FROM ds_ar_dhs.snap_individual
GROUP BY state
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_individual")) %>%
  group_by(state) %>%
  summarize(
    n = n()
  )

As opposed to the caseload file, which tracks the case unit’s current address, the individual table’s variable pertains to the state extract of the data. As we can see, in snap_individual, this value is always “AR.” Unfortunately, there are not any other variables tracking an individual’s location in snap_individual, so we cannot look at the geographical coverage at a more granular level.

Turning to the time coverage, we can apply the same approach we did to the snap_case table, first finding the earliest and latest file_month values:

SELECT MIN(file_month), MAX(file_month)
FROM ds_ar_dhs.snap_individual;
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_individual")) %>%
  summarise(
    min_month = min(file_month),
    max_month = max(file_month)
  )

Oddly enough, the earliest record in the individualized table is a month later than that of the caseload table. Their latest records align, though. We are investigating this with the data provider.

In looking at the snap_individual table in isolation, we cannot identify those on inactive cases. That said, we can isolate records for those on active SNAP cases by joining to the snap_case table, and then aggregating at the file_month level.

Note: There is a small subset of observations in snap_individual that do not exist in snap_case at the case/month level. We will be missing those observations here, because even if we used a LEFT JOIN, we still would not be able to confirm that these records, which exist in snap_individual but not snap_case, correspond to active cases.

qry <- "
SELECT si.file_month, COUNT(DISTINCT(si.ssn)) as n_people
FROM ds_ar_dhs.snap_individual si 
JOIN ds_ar_dhs.snap_case sc on (si.case_unit_id = sc.case_unit_id and si.file_month = sc.file_month and sc.active_participant = '1')
GROUP BY si.file_month
ORDER BY si.file_month
"

people_by_month_snap <- dbGetQuery(con, qry)

people_by_month_snap
person_snap <- con %>%
  tbl(in_schema(schema = "ds_ar_dhs",
      table = "snap_individual"))

case_snap <- con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_case")) %>%
  filter(active_participant == '1') 

# can join in R using ..._join functions in tidyverse
people_by_month_snap <- person_snap %>%
  inner_join(case_snap, by = c("case_unit_id", "file_month")) %>%
  group_by(file_month) %>%
  summarize(
    n_people = n_distinct(ssn)
  ) %>%
  ungroup() %>%
  arrange(file_month) %>%
  collect()

people_by_month_snap

As before, it might be a bit easier to get a sense of trends in a plot:

people_by_month_snap %>%
  ggplot(aes(x = file_month, y = as.numeric(n_people) )) +
  geom_line()

We can see a similar trend, albeit on a different magnitude, with a massive decline in late 2019. Ignoring the massive drop followed by a spike in 2020, there appears to be a decline in the number of people receiving SNAP benefits over the lifetime of the data. Interestingly enough, even though we see a similar trend in the number of cases as well, the number of cases appears to be decreasing at a slower rate relative to the number of people.

Select Variable Distribution

snap_case - benefit_amount_issued_mo

The benefit_amount_issued_mo variable in the snap_case table may be of interest to those planning on working with the SNAP data, as it contains the monthly sum of all SNAP benefits issued to a case unit. It is possible to then compare this column to benefit_amount_redeemed, either within a month or over the lifetime of a case to determine if a case (likely a household) used their entire allotment.
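As a hedged sketch of that within-month comparison, the code below flags whether a case redeemed at least as much as it was issued in a given month among active cases; note that it does not account for missing values or the implied decimal places discussed below:

con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_case")) %>%
  filter(active_participant == '1') %>%
  mutate(
    redeemed_full = ifelse(benefit_amount_redeemed >= benefit_amount_issued_mo, 1, 0)
  ) %>%
  group_by(redeemed_full) %>%
  tally()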

In isolation, though, we can still learn more about the distribution of this variable. Although it might be tempting to read the entire table into R, doing so may crash our R session by exhausting our workspace memory allocation. We can still get a rough idea of the distribution of the data by finding the minimum and maximum values, as well as values for regular percentiles:

Note: The kth percentile is the value at or below which k percent of the observations fall. For example, the median, or 50th percentile, corresponds to the point at which half of the observations fall at or below that value.

--THE FUNCTION TO CALCULATE PERCENTILES IN REDSHIFT IS APPROXIMATE PERCENTILE_DISC
SELECT MIN(benefit_amount_issued_mo) AS min_val, 
    APPROXIMATE PERCENTILE_DISC(.25) WITHIN GROUP (ORDER BY benefit_amount_issued_mo) AS percentile_25,
    APPROXIMATE PERCENTILE_DISC(.5) WITHIN GROUP (ORDER BY benefit_amount_issued_mo) AS median,
    AVG(benefit_amount_issued_mo) AS average,
    APPROXIMATE PERCENTILE_DISC(.75) WITHIN GROUP (ORDER BY benefit_amount_issued_mo) AS percentile_75,
    MAX(benefit_amount_issued_mo) AS max_val
FROM ds_ar_dhs.snap_case sc 
WHERE active_participant = '1'
# quantile will find percentiles in R
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_case")) %>%
  filter(active_participant == '1') %>%
  summarise(
    min_val = min(benefit_amount_issued_mo),
    percentile_25 = quantile(benefit_amount_issued_mo, .25),
    median = median(benefit_amount_issued_mo),
    average = mean(benefit_amount_issued_mo),
    percentile_75 = quantile(benefit_amount_issued_mo, .75),
    max_val = max(benefit_amount_issued_mo)
  )

Be careful in interpreting this distribution. According to the data dictionary, the values imply two decimals, meaning that 500 = $5.00.
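If you plan to carry this variable into a descriptive analysis, it may be worth converting the implied-decimal values into dollars explicitly. A minimal sketch, pulling a small arbitrary extract into R before dividing by 100:

# convert implied two-decimal values to dollars (e.g., 500 -> 5.00)
# the 1,000-row limit is arbitrary and only keeps the example extract small
snap_case_extract <- con %>%
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_case")) %>%
  filter(active_participant == '1') %>%
  select(case_unit_id, file_month, benefit_amount_issued_mo) %>%
  head(1000) %>%
  collect() %>%
  mutate(benefit_issued_dollars = as.numeric(benefit_amount_issued_mo) / 100)

head(snap_case_extract)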

snap_individual - age

Given the class’s focus on enrollment to employment, when analyzing those receiving SNAP benefits it might be useful to filter out individuals below a certain age. Let’s compare the number of records where an individual may be considered working age (for this example, 16 and older):

SELECT COUNT(*), CASE 
  WHEN age >= 16 THEN 'yes'
  WHEN age < 16 THEN 'no'
  ELSE 'missing' END AS working_age_ind
FROM ds_ar_dhs.snap_individual 
GROUP BY CASE WHEN age >= 16 THEN 'yes' WHEN age < 16 THEN 'no' ELSE 'missing' END
con %>% 
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_individual")) %>%
  mutate(
    working_age_ind = case_when(
      age >= 16 ~ "yes",
      age < 16 ~ "no",
      TRUE ~ "missing"
    )
  ) %>%
  group_by(working_age_ind) %>%
  summarize(
    n = n()
  )

We can see that while the majority of records belong to individuals who are at least 16, many do not. With this column in particular, keep an eye out for unrealistic birth dates, such as those in the 1800s.
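One quick way to screen for implausible values is to count records at extreme ages; in the sketch below, the cutoff of 110 is arbitrary and only meant as an illustration.

# count records with implausibly high ages (the 110 cutoff is an arbitrary example)
con %>%
  tbl(in_schema(schema = "ds_ar_dhs",
                table = "snap_individual")) %>%
  filter(age > 110) %>%
  summarize(n = n())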

The Arkansas higher education data has been provided by the Arkansas Department of Higher Education and includes records of students enrolled in public and private colleges and universities in the state of Arkansas. There are many tables within this schema, the majority of them reference tables. The two tables we will focus on in this notebook are (1) ds_ar_dhe.student_enrollment_table, which contains information on students enrolled in public and private for-credit postsecondary programs that report to the Arkansas Division of Higher Education, and (2) ds_ar_dhe.graduated_student_table, which includes data on all graduates of public and private for-credit programs. The data documentation for these two files, as well as the other files available in the schema, is available on the P drive.

Structure

student_enrollment_table

Let’s take a look at the first few rows of the student_enrollment_table table to start:

SELECT * 
FROM ds_ar_dhe.student_enrollment_table
LIMIT 5;
con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  head(5)

We can see that there are a lot of columns, and we know that not all of them may be necessary to define a row. Based on background information, we know that the unit of observation in this table should be the ssn_id - a hashed value pertaining to a unique individual. We can see if an ssn_id may be present in multiple rows by comparing the total number of observations to the number of unique ssn_id values present in the data:

SELECT COUNT(*), COUNT(DISTINCT(ssn_id))
FROM ds_ar_dhe.student_enrollment_table;
con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  summarise(
    n_rows = n(),
    n_people = n_distinct(ssn_id)
  )

Like the QCEW and UI wage data, the student_enrollment_table file is tracked longitudinally. You may have spotted the second and third columns in the data, academic_year and term, which track enrollment by academic year and semester over time. Let’s see if the combination of ssn_id, academic_year, and term defines a row in student_enrollment_table:

WITH dis AS (
  SELECT DISTINCT ssn_id, academic_year, term
  FROM ds_ar_dhe.student_enrollment_table
)
SELECT COUNT(*)
FROM dis
con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  distinct(ssn_id, academic_year, term) %>%
  tally()

We’re so close! There is actually one more variable we need to include to define a row. You may have noticed the first column in our data, fice_code, which is a code used to identify higher education institutions. Adding fice_code should capture individuals enrolled at multiple institutions in the same academic term. Let’s see if the combination of ssn_id, academic_year, term, and fice_code defines a row in student_enrollment_table:

WITH dis AS (
  SELECT DISTINCT ssn_id, academic_year, term, fice_code
  FROM ds_ar_dhe.student_enrollment_table
)
SELECT COUNT(*)
FROM dis
con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  distinct(ssn_id, academic_year, term, fice_code) %>%
  tally()

Perfect! There’s no duplication in the file at the person/time/institution level. Note that for each person/time/institution combination, there are 29 other columns with information at this granularity.
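If you would like to see what those other columns are without reading any rows into memory, one option is to ask the lazy table for its column names:

# list all column names in the enrollment table without pulling rows into R
con %>%
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  colnames()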

graduated_student_table

Let’s repeat the same approach for the graduated_student_table table, first looking at the first few rows:

SELECT * 
FROM ds_ar_dhe.graduated_student_table
LIMIT 5;
con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "graduated_student_table")) %>%
  head(5)

The unit of observation in this table should be the ssn_id as well. We can see that this table also has a fice_code variable, but instead of having both academic_year and term, the table just has academic_year. We can check if a row can be uniquely defined by combinations at the ssn_id/academic_year/fice_code level:

WITH dis AS (
  SELECT DISTINCT ssn_id, academic_year, fice_code
  FROM ds_ar_dhe.graduated_student_table
)
SELECT COUNT(*) as unique_count
FROM dis
con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "graduated_student_table")) %>%
  distinct(ssn_id, academic_year, fice_code) %>%
  tally()

Since this table contains graduation data, perhaps we need to include another variable that differentiates between credentials. Let’s see if adding the column degree_1, which is the degree code for the student’s first degree earned from the institution during the reporting year, might satisfy our row definition:

WITH dis AS (
  SELECT DISTINCT ssn_id, academic_year, fice_code, degree_1
  FROM ds_ar_dhe.graduated_student_table
)
SELECT COUNT(*)
FROM dis
con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "graduated_student_table")) %>%
  distinct(ssn_id, academic_year, fice_code, degree_1) %>%
  tally()

We have a perfect match! There’s no duplication in the file at the person/time/institution/degree level. Note that for each person/time/institution/degree combination, there are 25 other columns with information at this granularity.

Coverage

In examining the QCEW and UI wage tables, we started by looking at the geographic coverage. You may have identified a few of the geographic variables, such as resident_state, geo_state, and geo_county, in student_enrollment_table. We know that the student_enrollment_table and the graduated_student_table consist of students who are either enrolled in an Arkansas higher education institution or graduated from one, so we will not explore these columns and will instead focus on the time-based coverage.
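That said, if your team would like to confirm the geographic makeup before moving on, a quick tabulation of one of those columns is straightforward. A minimal sketch using geo_state:

# optional check: distribution of enrollment records by geo_state
con %>%
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  group_by(geo_state) %>%
  summarize(n = n()) %>%
  arrange(desc(n))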

student_enrollment_table

We can start by finding the minimum and maximum values of academic_year:

Note: Academic years start in July of the previous year and end in June. For example, the 2011 academic year spans from 7/1/2010 to 6/30/2011.

SELECT MIN(academic_year), MAX(academic_year)
FROM ds_ar_dhe.student_enrollment_table
con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  summarize(
    min_year = min(academic_year),
    max_year = max(academic_year)
  )

Since the enrollment file provides additional time elements besides academic_year, let’s see if all term values are covered in the earliest and most recent academic years in the data:

-- START WITH SAME CODE AS PREVIOUS CELL
-- RESTRICT ENROLLMENT DATA TO THESE YEARS
WITH min_max AS (
  SELECT MIN(academic_year) as min_year, MAX(academic_year) as max_year
  FROM ds_ar_dhe.student_enrollment_table
)
SELECT DISTINCT academic_year, term
FROM ds_ar_dhe.student_enrollment_table set2 
WHERE academic_year = (SELECT min_year FROM min_max) OR 
  academic_year = (SELECT max_year FROM min_max)
ORDER BY academic_year, term 
# same code as previous cell, just assigning to object
min_max <- con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  summarize(
    min_year = min(academic_year),
    max_year = max(academic_year)
  )

# take min value
min_year <- min_max %>% pull(min_year)

# take max value
max_year <- min_max %>% pull(max_year)

con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  filter(academic_year == min_year | academic_year == max_year) %>%
  distinct(academic_year, term) %>%
  arrange(academic_year, term)

Interesting - in the first year, only 4 of the possible 8 terms appear. The 4 term values correspond to the Spring and Summer I sessions, and then in the final year, the data is missing off-schedule Summer I enrollees. We have separately confirmed that all other academic years have records with the complete suite of academic terms in the data.
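If you would like to reproduce that confirmation yourself, one option is to count the number of distinct term values appearing within each academic year:

# number of distinct terms appearing in each academic year
con %>%
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  group_by(academic_year) %>%
  summarize(n_terms = n_distinct(term)) %>%
  arrange(academic_year)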

Armed with this information, recall that we can separately review the time coverage of a dataset by counting the number of observations within each time period.

qry <- "
SELECT academic_year, COUNT(*)
FROM ds_ar_dhe.student_enrollment_table
GROUP BY academic_year
ORDER BY academic_year;"

enrollment_by_year <- dbGetQuery(con, qry)

enrollment_by_year
enrollment_by_year <- con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  group_by(academic_year) %>%
  summarize(
    n = n()
  ) %>%
  ungroup() %>%
  arrange(academic_year) %>%
  collect()

enrollment_by_year

We can then graph this in a line chart:

enrollment_by_year %>%
  ggplot(aes(x = academic_year, y = n, group = 1)) +
  geom_line()

Outside of the sharp declines in the first and last years of the data, which can potentially be explained by the lack of records for certain academic terms, there do not appear to be any additional irregularities over this period. It is interesting to see the number of enrollments steadily decline over time, though.

graduated_student_table

We can apply the same approach to the graduated_student_table table:

SELECT MIN(academic_year), MAX(academic_year)
FROM ds_ar_dhe.graduated_student_table
con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "graduated_student_table")) %>%
  summarize(
    min_year = min(academic_year),
    max_year = max(academic_year)
  )

As expected, graduated_student_table has the same academic_year range as student_enrollment_table. As an optional technical challenge, you can try to find the coverage within academic_year by using the graduation_date variable.

HINT: You can convert graduation_date to an actual date using the to_date() function in Redshift. Prepare to see some funky values!
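If you want a starting point for the challenge, the sketch below converts graduation_date to a date and tabulates graduation months within each academic year. It assumes graduation_date is stored as a YYYYMMDD string - confirm the actual storage format in the data dictionary and adjust the format mask passed to to_date() accordingly.

# a starting point only: assumes graduation_date is a YYYYMMDD string
qry <- "
SELECT academic_year,
    DATE_PART(month, TO_DATE(graduation_date, 'YYYYMMDD')) AS graduation_month,
    COUNT(*)
FROM ds_ar_dhe.graduated_student_table
GROUP BY 1, 2
ORDER BY 1, 2;"

dbGetQuery(con, qry)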

qry <- "
SELECT academic_year, COUNT(*)
FROM ds_ar_dhe.graduated_student_table gst 
GROUP BY academic_year
ORDER BY academic_year;"

graduation_by_year <- dbGetQuery(con, qry)

graduation_by_year
graduation_by_year <- con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "graduated_student_table")) %>%
  group_by(academic_year) %>%
  summarize(
    n = n()
  ) %>%
  ungroup() %>%
  arrange(academic_year) %>%
  collect()

graduation_by_year

We can then graph this in a line chart:

graduation_by_year %>%
  ggplot(aes(x = academic_year, y = n, group = 1)) +
  geom_line()

While we see a roughly inverse pattern in the number of observations over time in the graduated_student_table compared to the student_enrollment_table, there might be more to the story. In fact, what you should see from the optional technical challenge above is that, beyond the partial coverage for the first and last academic years in the data, there are also some graduation dates that do not fully align with the proper academic year.

Select Variable Distribution

student_enrollment_table - student_level

One variable in the student_enrollment_table file that might be of particular interest is student_level. According to the data dictionary, student_level is a code describing the level of total requirements the student has finished toward the completion of the certificate or degree program in which the student is enrolled.

We can look at the distribution of observations by this variable:

SELECT student_level, COUNT(*) 
FROM ds_ar_dhe.student_enrollment_table
GROUP BY student_level 
ORDER BY student_level
con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  group_by(student_level) %>%
  summarize(n = n()) %>%
  arrange(student_level)

These student level codes do not tell us much. Let’s join to the refstudentlevel reference table to bring in the code descriptions.

--LEFT JOIN FOR NULL OR BLANK STUDENT_LEVEL VALUES
SELECT b.descr, b.abbr, COUNT(*) 
FROM ds_ar_dhe.student_enrollment_table a
LEFT JOIN ds_ar_dhe.refstudentlevel b 
ON a.student_level = b.studentlevelid 
GROUP BY a.student_level, b.descr, b.abbr
ORDER by 3 desc;
enroll <- con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  group_by(student_level) %>%
  summarize(n = n()) %>%
  collect()

ref <- con %>%
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "refstudentlevel")) %>%
  select(studentlevelid, descr, abbr) %>%
  collect() %>%
  # studentlevelid values come back with extra whitespace, so trim them before joining
  mutate(studentlevelid = str_trim(studentlevelid))

# left join to account for blank and NULL student_level values
enroll %>%
  left_join(ref, by = c("student_level" = "studentlevelid")) %>%
  arrange(desc(n))

We can see that freshman is the most common student level in the student_enrollment_table.

graduated_student_table - degree_level

Similar to the student_enrollment_table, a variable in the graduated_student_table file that might be of particular interest is degree_level, which refers to the type of degree a student received. Ideally, this should align with an individual’s student_level in the enrollment table when accounting for institution and time.

SELECT degree_level, COUNT(*) 
FROM ds_ar_dhe.graduated_student_table
GROUP BY degree_level 
ORDER BY degree_level
con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "graduated_student_table")) %>%
  group_by(degree_level) %>%
  summarize(n = n()) %>%
  arrange(degree_level)

These degree level codes seem to differ from the student_level codes. Let’s join to a separate reference table, refdegreetype, to bring in the degree type descriptions:

SELECT COUNT(*), r.degreetypedescr, r.rptgdegtype
FROM ds_ar_dhe.graduated_student_table a 
LEFT JOIN ds_ar_dhe.refdegreetype r 
ON a.degree_level = r.degreetypeid
GROUP BY r.degreetypedescr, r.rptgdegtype
ORDER BY 1 desc;
grad <- con %>% 
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "graduated_student_table")) %>%
  group_by(degree_level) %>%
  summarize(n = n()) %>%
  collect()

ref_deg <- con %>%
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "refdegreetype")) %>%
  select(degreetypeid, degreetypedescr, rptgdegtype) %>%
  collect()

# left join to account for degree_level values that don't occur in reference table
grad %>%
  left_join(ref_deg, by = c("degree_level" = "degreetypeid")) %>%
  arrange(desc(n))

Perhaps not surprisingly given the enrollment data exploration, a baccalaureate degree is the most common degree type earned in our data.
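As a follow-up, recall that degree_level should ideally align with a graduate’s student_level in the enrollment table for the same person, institution, and academic year. One rough consistency check - a sketch only, since enrollment spans multiple terms per year - is to join the two tables on those keys and cross-tabulate the two codes:

# rough consistency check: cross-tabulate student_level (enrollment) with
# degree_level (graduation) for the same person, institution, and academic year
enroll_levels <- con %>%
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "student_enrollment_table")) %>%
  select(ssn_id, academic_year, fice_code, student_level) %>%
  distinct()

con %>%
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "graduated_student_table")) %>%
  select(ssn_id, academic_year, fice_code, degree_level) %>%
  inner_join(enroll_levels, by = c("ssn_id", "academic_year", "fice_code")) %>%
  count(student_level, degree_level) %>%
  arrange(desc(n)) %>%
  head(10)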

graduated_student_table - degree_1

Beyond degree_level, we can also evaluate a student’s primary earned degree by looking at degree_1. The degree_1 column contains degree codes rather than descriptive names. In the future, we will be able to use the degree_fice_year_table reference table to match these codes to specific degree names, but we have identified data quality concerns with that reference table in the meantime. We will update this section once the reference table is updated.
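Even without the reference table, you can still get a sense of how concentrated the degree codes are. A minimal sketch tabulating the most common degree_1 values:

# most common degree_1 codes (descriptions unavailable until the reference table is updated)
con %>%
  tbl(in_schema(schema = "ds_ar_dhe",
                table = "graduated_student_table")) %>%
  group_by(degree_1) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) %>%
  head(10)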

9 Citations

Arkansas WDQI ADA Training - EDA Notebook (Official citation to come)